
Searching through various PDF files

Submitted by: @import:stackexchange-codereview

Problem

I'm looking for advice on how to make my code run faster. It's quick enough right now, searching through 30 three-page PDFs, but I imagine that once there are thousands of files to search it will take longer than I'd like. I could change SearchOption.AllDirectories to TopDirectoryOnly, but from my testing, what takes the longest is searching inside the files, not enumerating the directory.

```
public string ReadPdfFile(string fileName, String searchText)
{
List<int> pages = new List<int>();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
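for (int page = 1; page <= pdfReader.NumberOfPages; page++)
// [... lost in extraction: the rest of ReadPdfFile and the start of the calling code ...]
// [... surviving tail of the pdfHyperlink assignment: "+fileNameOnly+""; ...]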
sb.AppendLine(pdfHyperlink);
sb.AppendLine("");
}

Regex regex = new Regex(txtBoxSearchString.Text, RegexOptions.IgnoreCase);
string domainURLfileName = Regex.Replace(f.File, @"C:\\schools\\syllabus", @"https://mywebsite.com/search/syllabus/");
string finalSyllabusURLfileName = Regex.Replace(domainURLfileName, " ", "%20");
string fileNameOnly2 = Regex.Replace(domainURLfileName, @"https://mywebsite.com/search/syllabus/", "");
string pdfHyperlinkMappedDrive = @"" + fileNameOnly2 + "";

if ((regex.IsMatch(fileNameOnly2)) && (fileNameOnly != fileNameOnly2))
{
sb.AppendLine(pdfHyperlinkMappedDrive);
sb.AppendLine("");
}
else
{
//moving on
}
}

Panel1.Controls.Clear();
if (sb.ToString() != "")
{
Panel1.Attributes["style"] = "height: 222px;";
Pane
// [... listing truncated in the source ...]
```

Solution

The major bottleneck is most likely the ReadPdfFile method, since that is where each PDF is opened and its text is extracted.

In your ReadPdfFile method, a PdfReader is created to read through every page of the document looking for searchText, and the numbers of the pages on which searchText is found are stored in a List named pages.

Once the reader has run through every page, the method returns either null or the file name, depending on whether the number of matching pages is 0.

Instead, you could return as soon as the text is found, so that you don't read the rest of the document for nothing.

The method has been renamed to better reflect what it actually does, and the return type has been changed to bool, since we only need to know whether the file contains the search text.

// assumes iTextSharp 5.x; needs: using System.IO; using iTextSharp.text.pdf;
// using iTextSharp.text.pdf.parser;
public bool SearchPdfFile(string fileName, string searchText)
{
    /* technically speaking this should not happen, since "you" are calling it,
       therefore this should be handled critically
        if (!File.Exists(fileName)) return false; //original workflow
    */
    if (!File.Exists(fileName))
        throw new FileNotFoundException("File not found", fileName);

    using (PdfReader pdfReader = new PdfReader(fileName))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // a fresh strategy per page: a reused SimpleTextExtractionStrategy
            // keeps appending to the text it has already extracted
            var strategy = new SimpleTextExtractionStrategy();
            var currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            // stop at the first hit instead of scanning the remaining pages
            if (currentPageText.Contains(searchText))
                return true;
        }
    }

    return false;
}
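
To apply this early-exit check across a whole directory tree, something along the following lines could work. It is only a sketch: `PdfSearch`, `FindMatchingFiles` and the `containsText` parameter are hypothetical names that do not appear in the original code, and the predicate stands in for a call such as `path => SearchPdfFile(path, searchText)`. Skipping unreadable files is also an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfSearch
{
    // Hypothetical helper: lazily yields every file under root that matches
    // pattern and for which containsText returns true.
    public static IEnumerable<string> FindMatchingFiles(
        string root, string pattern, Func<string, bool> containsText)
    {
        // EnumerateFiles streams paths as they are discovered, whereas
        // GetFiles builds the complete array before returning.
        foreach (string path in Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories))
        {
            bool isMatch;
            try
            {
                isMatch = containsText(path);
            }
            catch (IOException)
            {
                // Assumption: a file that cannot be read is skipped
                // instead of aborting the whole search.
                continue;
            }

            if (isMatch)
                yield return path;
        }
    }
}
```

A caller might write `PdfSearch.FindMatchingFiles(root, "*.pdf", path => SearchPdfFile(path, searchText))`, where `root` is the folder to search. Because the sequence is lazy, each hit can be rendered as soon as it is found. Whether running the predicate on several files in parallel helps further would need to be measured.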


Context

StackExchange Code Review Q#57018, answer score: 7
