HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Merging SharePoint Word documents into a single PDF with iTextSharp

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
sharepointitextsharpmergingwithintowordsingledocumentspdf

Problem

I created a program which collects all the word files from a folder which belongs to our SharePoint library of documents. The program will only grab the documents if they were submitted/are a part of the HR team.

These documents are saved in a temporary folder, and then merged together into one large pdf.

I thought this seemed a bit superfluous, and figured there must be a way of achieving this without having to save the individual pdf files first. Ideally I would also like a way of dynamically creating a table of contents based on the files which are already present.

The files need to be in the correct order, so each document on SharePoint has been assigned an order number.

This list is retrieved so that I have something to order the documents against (as saving them in the file results in them being in their default, alphabetical order, appending the order number to the beginning of the file name also doesn't work, as this orders them as 1, 10, 11, 12 .. etc). I also feel that this can be improved upon. If I can work out how to convert the word files into PDFs without having to save them first, I could sort the FileInfo data before converting.

Does anyone have any ideas or suggestions on how I can improve/implement the above, or any other comments? Any advice/guidance at all is much appreciated.

Program

```
using System.Collections.Generic;
using System.IO;

namespace PdfConverter
{
class Program
{
#region fields
public static List orderedList;
#endregion fields

static void Main(string[] args)
{
// Get the word files FileInfo
FileInfo[] wordFileInfo = DataGrabber.GetWordFileInfo();

// Instantiate a new 'Convert' object and convert our files to pdf, saving them in a temporary folder
Converter convert = new Converter();
convert.ToPdf(wordFileInfo);

// Get the pdf filenames
var newPdfFiles = DataGrabber.GetPdfFileNames()

Solution

orderedList: you should not have public fields in your class. Is it OK if someone else assigns an entirely new value to that field? If not then you should make a property with private setter. Also name isn't so helpful: orderer list of what? Even better: remove it. Static fields, properties and methods are terrible to test and in this case you don't even need it because it can be a function return value.

newFileName is built manually with an hard-coded constant (move it to a const private field) and a manually parsed file name. To obtain file name (it doesn't matter how trivial it seems) you should use Path methods, in this case Path.GetFileNameWithoutExtension(). It's both clear (to readers) and more reliable.

In my opinion your Main() method does little bit too many things. Why don't you extract few self-descriptive methods like ConvertPdf(), MergePdf() and DeletePdf()? It's a short method but we have too read (and understand) too much code just to have a coarse overview or program logic.

DataGrabber class isn't derived. Why don't you also mark it as sealed?

Many static strings are constants, why don't you declare them as private const string?

sortBy is an escaped XML fragment and you build XML document with string concatenation. Assuming sortBy won't change (make it clear making it const string) it's too error-prone and not human-friendly (IMO). You'd better move your XML file into resources, reading it with XmlDocument.Load() from ResourceManager.GetStream(). You can change attribute value using a simple XPath query. It's slower that raw manual approach but easier to read (also because you don't pollute your C# code with XML fragments) and less error-prone (XmlDocument will escape things for you).

ClientContext class implements IDisposable then I'd wrap its usage with using (to correctly dispose it also in case of errors).

Sort() code is too long and convoluted, in my opinion it needs to be more self-descriptive. What it is doing? What all those fields are?

I'd also change how you manage exceptions, for example:

try
{
    title = litem.FieldValues["FileLeafRef"].ToString();
}
catch (Exception)
{
    title = ""; 
}


Why that code may fail? Check instead of catch exceptions (C# is not Python) and for sure never ever catch base Exception. If there may be an error you can't prevent then catch exactly that exception otherwise it will hide possible bugs. On the same style: do not ever use catch { }. Never, no exceptions to this.

In your GetPdfFileNames() the length local variable (and its associated comment) are pretty useless. What's its purpose? Why don't you simply use pdfFiles.Length? As it is now it doesn't exclude anything, is it a bug or an outdated comment?

Also consider to use some LINQ, it may make your code slightly more readable. Note that you do that for just to concatenate FileInfo.Name with localPath however (because FileInfo are retrieved from a DirectoryInfo created from localPath) you already have that value in FileInfo.FullName property. One step more: to get full path you don't even need FileInfo:

var pdfFileNames = Directory.GetFiles(localPath, "*.pdf");


In Converter I'd change more or less same things I already suggested for other modules. Do not build path by hand, to change extension you can use Path.ChangeExtension(). Also always (especially when working with Word instances!) use using statement when applicable (also read How to release COM handle in .NET.)

Next (optional!) step would be to make your utility more testable and extensible. The fact you're converting using iTextSharp and that documents come from SharePoint are merely implementation details. If you hide this in abstract base classes (or interfaces) you may plug-in (for example) different sources and also test all your logic mocking-up class currently not under-test. As very final note I'd add some sort of retry-pattern. Things may go wrong with networks but a temporary error may be recovered if you wait few seconds (and you won't need to restart whole process from beginning), see also this post.

Code Snippets

try
{
    title = litem.FieldValues["FileLeafRef"].ToString();
}
catch (Exception)
{
    title = ""; 
}
var pdfFileNames = Directory.GetFiles(localPath, "*.pdf");

Context

StackExchange Code Review Q#117154, answer score: 2

Revisions (0)

No revisions yet.