Tutorial: Parsing Office Documents


I'd like to be able to read the content of Office documents (for a custom crawler).

The Office versions that need to be readable range from 2000 to 2007. I mainly want to crawl Word, Excel, and PowerPoint documents.

I don't want to retrieve the formatting, only the text.

The crawler is based on Lucene.NET, if that helps, and is written in C#.

I have already used iTextSharp for parsing PDFs.


Here's a nice little post on c-sharpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word primary interop assemblies.

Basically, you call "WholeStory" on the Word document's selection to select everything, copy it to the clipboard, then pull it from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.
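
For reference, here is a rough sketch of those steps. It is not the code from the linked post; the class and method names (WordTextExtractor, ExtractText) are made up for illustration. It assumes C# 4 or later (so the optional COM arguments can be omitted), references to the Word interop assembly and System.Windows.Forms, and a calling thread marked [STAThread], since the clipboard needs one.

```csharp
using System;
using System.Windows.Forms;                    // Clipboard, DataFormats, IDataObject
using Word = Microsoft.Office.Interop.Word;    // Word primary interop assembly

// Hypothetical helper, named for illustration only.
static class WordTextExtractor
{
    // Must be called from an STA thread because it uses the clipboard.
    public static string ExtractText(string path)
    {
        Word.Application app = new Word.Application();
        Word.Document doc = null;
        try
        {
            // C# 4+ lets COM calls skip the optional "ref object" arguments.
            doc = app.Documents.Open(path, ReadOnly: true);

            // Select the whole story and copy it to the clipboard.
            Word.Selection selection = doc.ActiveWindow.Selection;
            selection.WholeStory();
            selection.Copy();

            // Ask the clipboard for the plain-text rendering only.
            IDataObject data = Clipboard.GetDataObject();
            if (data != null && data.GetDataPresent(DataFormats.Text))
            {
                return (string)data.GetData(DataFormats.Text);
            }
            return string.Empty;
        }
        finally
        {
            // The casts avoid the method/event name clash on Close and Quit.
            if (doc != null) ((Word._Document)doc).Close(SaveChanges: false);
            ((Word._Application)app).Quit(SaveChanges: false);
        }
    }
}
```

Keep in mind that this overwrites whatever the user had on the clipboard. If that is a problem for a crawler, reading doc.Content.Text also returns the unformatted text of the main document body without touching the clipboard.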

For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
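
The slide and shape loop described above might look something like this. Again, this is only a sketch: PowerPointTextExtractor and ExtractText are illustrative names, and it assumes references to the PowerPoint interop assembly and the shared Office assembly (for MsoTriState).

```csharp
using System.Text;
using Microsoft.Office.Core;                            // MsoTriState
using PowerPoint = Microsoft.Office.Interop.PowerPoint; // PowerPoint interop assembly

// Hypothetical helper, named for illustration only.
static class PowerPointTextExtractor
{
    public static string ExtractText(string path)
    {
        StringBuilder text = new StringBuilder();
        PowerPoint.Application app = new PowerPoint.Application();
        PowerPoint.Presentation presentation = null;
        try
        {
            // Arguments: file name, ReadOnly, Untitled, WithWindow.
            presentation = app.Presentations.Open(
                path,
                MsoTriState.msoTrue,
                MsoTriState.msoFalse,
                MsoTriState.msoFalse);

            foreach (PowerPoint.Slide slide in presentation.Slides)
            {
                foreach (PowerPoint.Shape shape in slide.Shapes)
                {
                    // Pictures, lines and similar shapes carry no text; skip them.
                    if (shape.HasTextFrame == MsoTriState.msoTrue &&
                        shape.TextFrame.HasText == MsoTriState.msoTrue)
                    {
                        text.AppendLine(shape.TextFrame.TextRange.Text);
                    }
                }
            }
        }
        finally
        {
            if (presentation != null) presentation.Close();
            app.Quit();
        }
        return text.ToString();
    }
}
```

This only looks at top-level shapes on each slide. Text inside grouped shapes, tables, or speaker notes would need extra handling.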

For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.
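
As a rough idea of what the ADO.NET route involves (this is not the code from the linked post), the sketch below opens the workbook through OleDb, lists its worksheets, and concatenates every non-empty cell. The names ExcelTextExtractor, BuildConnectionString and ExtractText are made up. It assumes the Jet provider is available for .xls files and the ACE provider is installed for .xlsx files; Jet only works in 32-bit processes.

```csharp
using System;
using System.Data;
using System.Data.OleDb;
using System.Text;

// Hypothetical helper, named for illustration only.
static class ExcelTextExtractor
{
    public static string BuildConnectionString(string path)
    {
        // Assumption: Jet for .xls (Excel 97-2003), ACE for .xlsx (Excel 2007).
        bool isOpenXml = path.EndsWith(".xlsx", StringComparison.OrdinalIgnoreCase);
        string provider = isOpenXml ? "Microsoft.ACE.OLEDB.12.0" : "Microsoft.Jet.OLEDB.4.0";
        string format = isOpenXml ? "Excel 12.0 Xml" : "Excel 8.0";

        // HDR=No treats the first row as data; IMEX=1 reads mixed columns as text.
        return "Provider=" + provider + ";Data Source=" + path +
               ";Extended Properties=\"" + format + ";HDR=No;IMEX=1\"";
    }

    public static string ExtractText(string path)
    {
        StringBuilder text = new StringBuilder();
        using (OleDbConnection connection = new OleDbConnection(BuildConnectionString(path)))
        {
            connection.Open();

            // Each worksheet shows up as a table whose name ends in '$'.
            DataTable tables = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
            foreach (DataRow table in tables.Rows)
            {
                string name = (string)table["TABLE_NAME"];
                if (!name.TrimEnd('\'').EndsWith("$"))
                {
                    continue; // skip named ranges, which would duplicate cell content
                }

                string query = "SELECT * FROM [" + name + "]";
                using (OleDbCommand command = new OleDbCommand(query, connection))
                using (OleDbDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        for (int i = 0; i < reader.FieldCount; i++)
                        {
                            if (!reader.IsDBNull(i))
                            {
                                text.Append(reader.GetValue(i)).Append(' ');
                            }
                        }
                        text.AppendLine();
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

The nice part of this approach is that Excel itself does not need to be installed or automated on the crawling machine, only the OleDb provider.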


If you're already using Lucene.NET, you might just want to take advantage of the various IFilters already available for doing this. Take a look at the open source SeekAFile project. It will show you how to use an IFilter to open and extract this information from any file type where an IFilter is available. There are IFilters for Word, Excel, PowerPoint, PDF, and most of the other common document types.


There is an excellent open source project, POI. Its only drawback is that it is written for Java. The .NET port is still very much in beta.


Here is a good list of various tools for converting Word documents to plain text, which you can then process however you like.


You might also consider checking out DtSearch (www.DtSearch.com). Although it is primarily a searching tool, it does a great job of extracting text from a large number of file types and is considerably cheaper than other options like the Oracle/Stellent OutsideIn technology or the equivalent from Autonomy.

I've been using DtSearch for years and find it indispensable for this type of task.
