HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Indexing data with Lucene.net

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
indexingwithlucenenetdata

Problem

I'm using the following function to index ebook data with Lucene. I'm looking to improve the structure and organization of this function.

```
public string BuildIndex(int pageNum, int pageSize)
{
var watch = new System.Diagnostics.Stopwatch();
watch.Start();
var version = Lucene.Net.Util.Version.LUCENE_30;

Analyzer analyzer = new StandardAnalyzer(version);
DirectoryInfo path = new DirectoryInfo("C:\\LuceneIndex");
Lucene.Net.Store.Directory directory = new MMapDirectory(path);
using (IndexWriter iwriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
//Initialize the StringBuilder with 4 MB to accommodate our data
StringBuilder bodyBuilder = new StringBuilder(410241024);
var files = ReaderDAL.FetchBooklistSearchData(pageNum, pageSize);
foreach (var file in files) {
bodyBuilder.Clear();
DirectoryInfo filePath = new DirectoryInfo(file.IndexFilePath);

if (filePath.Exists)
{
foreach (var pageFile in filePath.EnumerateFiles())
{
using (StreamReader reader = pageFile.OpenText()) {
String line = reader.ReadToEnd();
bodyBuilder.AppendLine(line);
}
}
}
//file.IndexFilePath
Document doc = new Document();
doc.Add(new Field("id", file.BookId.ToString(), Field.Store.YES, Field.Index.NO));
doc.Add(new Field("author", file.Author, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("annotation", file.Annotation, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("publisher", file.Publisher, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("title", file.Title, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("body", bodyBuilder.ToString(), Field.Store.YES, Fiel

Solution

The function is called BuildIndex, but it could be seen as doing the following:

  • Measuring its execution time.



  • Creating an Analyzer.



  • Creating a Lucene.Net.Store.Directory.



  • Using the analyzer and the directory to create an IndexWriter.



  • Iterating files



  • Loading document body from physical file if it exists at specified location.



  • Creating a new Document for each file, adding 6 Fields.



  • Returning execution time.



The first thing that strikes me is that there seems to have a performance concern that shadows the code's intent: is the goal of the method to build an index, or to calculate how long it takes to build an index? The first thing I'd do is return void and remove the first thing in that list - focused code does one thing, it has only a single responsibility in mind.

The next 3 points are closely related and could easily fit inside a method whose job is to return a new IndexWriter:

private IndexWriter CreateWriter(Analyzer analyzer, string path)
{
    var directory = new DirectoryInfo(path);
    var mMapDirectory = new MMapDirectory(directory);

    return new IndexWriter(mMapDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED)
}


The reason this method is taking an Analyzer parameter is because it leaves you the liberty of calling it and passing in a StandardAnalyzer today, and a CustomAnalyzer tomorrow, without changing anything in the method, only the way the analyzer gets instantiated.

In your original code, the version, analyzer path and directory locals all only exist to create the IndexWriter. Extracting the instantiation of this writer into its own method further reduces noise and increases signal. Consider this:

public void BuildIndex(Analyzer analyzer, int pageNum, int pageSize)
{
    using(var writer = CreateWriter(analyzer, @"C:\LuceneIndex"))
    {
        //...
    }
}


Now BuildIndex doesn't care anymore about what Lucene version the analyzer is configured with; not anymore does it care about the actual type of that analyzer either.

The body of the using block declares a bodyBuilder variable that I would have simply called builder. In fact, the StreamReader, is part of implementation details of another method, just like the creation of a new document:

using(var writer = CreateWriter(analyzer, @"C:\LuceneIndex"))
{
    var files = ReaderDAL.FetchBooklistSearchData(pageNum, pageSize);
    foreach (var file in files)
    {
        //Initialize the StringBuilder with 4 MB to accommodate our data:
        var builder = new StringBuilder(4*1024*1024);

        var directory = new DirectoryInfo(file.IndexFilePath);
        if (directory.Exists)
        {
            ReadAllPageFilesInFolder(directory, builder);
        }

        writer.AddDocument(CreateDocument(file, builder.ToString()));
    }

    writer.Optimize();
    writer.Flush(true, true, true);
}


The advantage of a StreamReader is that it allows you to load a large file's content in small chunks.. but using ReadToEnd() is essentially the same as calling File.ReadAllText(), so if the files are easily digestible it would be more easily readable to go that way:

private void ReadAllPageFilesInFolder(DirectoryInfo directory, StringBuilder builder)
{
    foreach (var file in directory.EnumerateFiles())
    {
        // is this file really a PageFile?

        builder.AppendLine(File.ReadAllText(file.FullName));
    }        
}


Moving the document creation into its own method makes it easier to later change how a document gets created (new field(s), new settings, etc.):

private Document CreateDocument(WhateverThatTypeIs file, string body)
{
    var result = new Document();

    result.Add(new Field("id", file.BookId.ToString(), Field.Store.YES, Field.Index.NO));
    result.Add(new Field("author", file.Author, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("annotation", file.Annotation, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("publisher", file.Publisher, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("title", file.Title, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));

    return result;
}


That's all nice, but now you want BuildIndex to measure its execution time and return it as a string, right? Wrong.

Let's say the BuildIndex method is located inside an IndexBuilder class; I'd make the method virtual...

public class IndexBuilder
{
    public virtual void BuildIndex(Analyzer analyzer, int pageNum, int pageSize)
    {
        // the above code
    }
}


... and then I'd create a class whose role is to decorate the existing functionality with another functionality, that I can substitute for an IndexBuilder wherever I need one:

```
public IndexBuilderStopWatchDecorator : IndexBuilder
{
private readonly IndexBuilder _builder;

public IndexBuilderStopW

Code Snippets

private IndexWriter CreateWriter(Analyzer analyzer, string path)
{
    var directory = new DirectoryInfo(path);
    var mMapDirectory = new MMapDirectory(directory);

    return new IndexWriter(mMapDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED)
}
public void BuildIndex(Analyzer analyzer, int pageNum, int pageSize)
{
    using(var writer = CreateWriter(analyzer, @"C:\LuceneIndex"))
    {
        //...
    }
}
using(var writer = CreateWriter(analyzer, @"C:\LuceneIndex"))
{
    var files = ReaderDAL.FetchBooklistSearchData(pageNum, pageSize);
    foreach (var file in files)
    {
        //Initialize the StringBuilder with 4 MB to accommodate our data:
        var builder = new StringBuilder(4*1024*1024);

        var directory = new DirectoryInfo(file.IndexFilePath);
        if (directory.Exists)
        {
            ReadAllPageFilesInFolder(directory, builder);
        }

        writer.AddDocument(CreateDocument(file, builder.ToString()));
    }

    writer.Optimize();
    writer.Flush(true, true, true);
}
private void ReadAllPageFilesInFolder(DirectoryInfo directory, StringBuilder builder)
{
    foreach (var file in directory.EnumerateFiles())
    {
        // is this file really a PageFile?

        builder.AppendLine(File.ReadAllText(file.FullName));
    }        
}
private Document CreateDocument(WhateverThatTypeIs file, string body)
{
    var result = new Document();

    result.Add(new Field("id", file.BookId.ToString(), Field.Store.YES, Field.Index.NO));
    result.Add(new Field("author", file.Author, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("annotation", file.Annotation, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("publisher", file.Publisher, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("title", file.Title, Field.Store.YES, Field.Index.ANALYZED));
    result.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));

    return result;
}

Context

StackExchange Code Review Q#50957, answer score: 8

Revisions (0)

No revisions yet.