patterncsharpMinor
Searching through various PDF files
Problem
I'm just looking for advice on how I can get my code to run faster. It's pretty quick right now when searching through 30 three-page PDFs, but I imagine that once there are thousands of files to search it will take longer than I'd like. I can change SearchOption.AllDirectories to TopDirectoryOnly. I've done some testing, though, and it seems that what takes the longest is searching inside the files, not enumerating the directory.
```
public string ReadPdfFile(string fileName, String searchText)
{
List<int> pages = new List<int>();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page "+fileNameOnly+"";
sb.AppendLine(pdfHyperlink);
sb.AppendLine("");
}
Regex regex = new Regex(txtBoxSearchString.Text, RegexOptions.IgnoreCase);
string domainURLfileName = Regex.Replace(f.File, @"C:\\schools\\syllabus", @"https://mywebsite.com/search/syllabus/");
string finalSyllabusURLfileName = Regex.Replace(domainURLfileName, " ", "%20");
string fileNameOnly2 = Regex.Replace(domainURLfileName, @"https://mywebsite.com/search/syllabus/", "");
string pdfHyperlinkMappedDrive = @"" + fileNameOnly2 + "";
if ((regex.IsMatch(fileNameOnly2)) && (fileNameOnly != fileNameOnly2))
{
sb.AppendLine(pdfHyperlinkMappedDrive);
sb.AppendLine("");
}
else
{
//moving on
}
}
Panel1.Controls.Clear();
if (sb.ToString() != "")
{
Panel1.Attributes["style"] = "height: 222px;";
Pane
```
Solution
The major bottleneck is most likely in the `ReadPdfFile` method, as we are dealing with a PDF file.

In your `ReadPdfFile` method, a `PdfReader` is created to read through every page of the document to find the `searchText`, and the page numbers on which the `searchText` is found are stored in a List named `pages`.

Once the reader has run through every page, the method returns null or the file name, depending on whether the number of entries in `pages` is 0.

What you could do is return as soon as you have found the text, so that you don't have to look through the rest of the document for nothing.
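To make the early-exit idea concrete without depending on any PDF library, here is a minimal sketch. `ExtractPages` is a hypothetical stand-in for per-page text extraction (it only records which pages were requested), and the case-insensitive comparison is an assumption, not something the original code is known to do. The point it illustrates is that pages after the first match are never extracted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EarlyExitSketch
{
    // Hypothetical stand-in for PDF text extraction: yields one page of text
    // at a time and records which pages were actually "extracted".
    public static IEnumerable<string> ExtractPages(IList<string> pages, IList<int> extracted)
    {
        for (int i = 0; i < pages.Count; i++)
        {
            extracted.Add(i + 1); // 1-based page number, as in a PDF
            yield return pages[i];
        }
    }

    // Any() stops pulling pages as soon as one of them contains the search text.
    public static bool ContainsText(IEnumerable<string> pageTexts, string searchText)
    {
        return pageTexts.Any(text =>
            text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    public static void Main()
    {
        var pages = new List<string> { "Course overview", "Grading policy", "Reading list" };
        var extracted = new List<int>();

        bool found = ContainsText(ExtractPages(pages, extracted), "grading");

        Console.WriteLine(found);                        // True
        Console.WriteLine(string.Join(",", extracted));  // 1,2
    }
}
```

Because the pages are produced lazily, the third page is never touched once the second one matches. With real PDFs, where extracting each page is the expensive part, that skipped work is where the savings would come from.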
The method has been renamed to better reflect what it actually does, and the return type has been changed to `bool`, since we only need to know whether the file contains the search text.
```
// Assuming iTextSharp 5.x:
// using System.IO;
// using iTextSharp.text.pdf;
// using iTextSharp.text.pdf.parser;
public bool SearchPdfFile(string fileName, String searchText)
{
    /* Technically speaking this should not happen, since "you" are calling it,
       therefore this should be handled critically.
       if (!File.Exists(fileName)) return false; //original workflow
    */
    if (!File.Exists(fileName))
        throw new FileNotFoundException("File not found", fileName);

    using (PdfReader pdfReader = new PdfReader(fileName))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page: SimpleTextExtractionStrategy keeps
            // accumulating text, so a reused instance returns earlier pages again.
            var strategy = new SimpleTextExtractionStrategy();
            var currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentPageText.Contains(searchText))
                return true; // stop at the first page that contains the text
        }
    }
    return false;
}
```
Code Snippets
```
// Assuming iTextSharp 5.x:
// using System.IO;
// using iTextSharp.text.pdf;
// using iTextSharp.text.pdf.parser;
public bool SearchPdfFile(string fileName, String searchText)
{
    /* Technically speaking this should not happen, since "you" are calling it,
       therefore this should be handled critically.
       if (!File.Exists(fileName)) return false; //original workflow
    */
    if (!File.Exists(fileName))
        throw new FileNotFoundException("File not found", fileName);

    using (PdfReader pdfReader = new PdfReader(fileName))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page: SimpleTextExtractionStrategy keeps
            // accumulating text, so a reused instance returns earlier pages again.
            var strategy = new SimpleTextExtractionStrategy();
            var currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentPageText.Contains(searchText))
                return true; // stop at the first page that contains the text
        }
    }
    return false;
}
```
Context
StackExchange Code Review Q#57018, answer score: 7