Back to Insights

Leveraging Aspose for PDF to Excel migration

Do you have data living in PDF forms that you'd like to analyze and report on?

We solved this problem for a client by leveraging the Aspose.Pdf and Aspose.Cells libraries to pull the data out of PDF and into Excel. Using the Aspose libraries to do the heavy lifting makes this a fairly trivial task. The data that we needed to import into Excel was contained within PDF forms with each form comprised of named fields. The resulting Excel file was created to have one row per PDF form, and one column per field. For this to work, it is important that all  PDF fields are named meaningfully and that only similar PDF forms are imported.

Step one - Licensing

First we headed over to the Aspose website and download the trial version of Aspose.Cells and Aspose.Pdf.

Note: the trial version of Aspose.Pdf restricts you to reading only the first four fields from a PDF form whilst the trial version of Aspose.Cells restricts you to 100 files and also inserts an additional worksheet with a trial message. In order to prevent these limitations you'll need to either purchase a license, or request a 30-day temporary license here

We went for the temporary license option in order to test extracting more than four PDF fields at a time. To register this license at runtime we completed the following.

var cellsLicense = new Aspose.Cells.License();

var pdfLicense = new Aspose.Pdf.License();

Step two - Workbook

Next, we decide whether to create a new Excel workbook or append to an existing one. In this example, we  appended to the same file if it exists.

Workbook workbook;
var outputFilePath = "Outfile.xls"

if (File.Exists(outputFilePath))
	workbook = new Workbook(outputFilePath);
	workbook = new Workbook();

Step three - Worksheet

Then we get to create our data worksheet if it doesn't already exist. We've chosen to store all our programmatically imported data in a worksheet named "Data". This way, users are free to create additional worksheets, to manipulate, and report on the data as required.

var worksheetName = "Data";
var worksheet = workbook.Worksheets[worksheetName];
if (worksheet == null)
	worksheet = workbook.Worksheets.Add(worksheetName);

Next we added a header column to store the field names. We also stored the PDF filename in the first column so we know which file each row comes from.

var headerRow = worksheet.Cells.Rows[0];
headerRow[0].Value = "File name";

Step four - Import data

From here, we are able to add rows to the worksheet quite easily by looping through the PDF fields and filling the appropriate cell. In this example, we first check if a column already exists for a given named field and if it doesn't we add a new column.

var directory = new DirectoryInfo("C:\PDFFiles");
var pdfFiles = directory.EnumerateFiles("*.pdf", SearchOption.AllDirectories);

foreach (var pdfFile in pdfFiles)
	// get the first empty row
	var dataRow = worksheet.Cells.Rows[worksheet.Cells.MaxDataRow + 1];
	// Add the PDF filename to the first column
	dataRow[0].Value = pdfFile.FullName;

	Document pdfDocument = new Document(pdfFile.OpenRead());

	var fields = pdfDocument.Form.Fields;

	// Loop through each of the PDF form fields 
	foreach (var field in fields)
		var fieldName = field.PartialName;
		var fieldValue = field.Value;

		if (!String.IsNullOrEmpty(fieldName))
			// Try to find a column header that matches the field name
			var headerCell = headerRow.Cast<Aspose.Cells.Cell>()
			.Where(c => c.StringValueWithoutFormat.Equals(fieldName, StringComparison.InvariantCultureIgnoreCase))

			if (headerCell == null)
				// No existing column with the field name so let's create one
				headerCell = headerRow[headerRow.LastDataCell.Column + 1];
				headerCell.Value = fieldName;

			var dataCell = dataRow[headerCell.Column];
			dataCell.Value = fieldValue;


And there we have it. By leveraging the Aspose libraries we can build out powerful tools quite easily.