Tutorial: What is the best way to programmatically convert a Word document with a table structure to XML?


So, I have this Word document that has a whole bunch of tables, some of which are pretty long and span many pages. I need to programmatically convert it to XML.

I was initially told we could just copy and paste into Excel and save it as a CSV, which I could then convert fairly easily. However, because of the formatting of some of the fields, the spreadsheet would need a lot of extra manipulation after the copy to look right and to produce a correct CSV.

I should note that this is an add-on for an old app written in VB.NET on the .NET 1.1 framework (cue frowny face) :(. However, I'm debating just writing a separate command-line tool in C# on .NET 3.5 if that'll make it easier. It seems like C# has some Word interop features that I doubt were in the 1.1 framework, but I haven't investigated that too far.

So, I'm just looking for the best/quickest way to achieve this. It doesn't matter much how it's done, as long as the conversion itself is programmatic. Some of the steps could be done manually if they aren't too tough. For example, if getting the document into some other format first would save a bunch of coding and isn't too difficult, that would be fine.

Has anyone done anything like this before? Any ideas?

Update: OK, so here is an example of exactly what I need to do.

I have a Word doc that looks something like this...

PROTOCOL: BIRDS

Field Name      Data Type  Required  Length  Total Digits  Fraction Digits  ValidValues/Comparison  Description
OBSERVATION_ID  Text       Yes       16      n/a           n/a                                      Unique observation identification.  Primary key.

So, there's the table with its vendor and name (PROTOCOL and BIRDS in this case). As an example it has just one field. The ValidValues/Comparison column can hold multiple items separated by commas, and each item would be enclosed in its own value tag in the XML.

Now what I basically need to do is get that to convert to this XML...

<?xml version="1.0" encoding="utf-8"?>
<Formats xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Formats.xsd">
  <VendorFormats Vendor="PROTOCOL" LastModified="2005-9-13">
    <Format Name="BIRDS" Version="3" VersionDate="2005-9-10">
      <BaseTable>BIRDS</BaseTable>
      <StageTable>STAGE_BIRDS</StageTable>
      <Fields>
        <Text Name="OBSERVATION_ID" Required="Y">
          <NullValue />
          <Description>Unique observation identification.  Primary key.</Description>
          <Length>16</Length>
        </Text>
      </Fields>
    </Format>
  </VendorFormats>
</Formats>

There will always be a base table and a stage table. The base table has the same name as whatever follows the colon at the beginning of the table (PROTOCOL: BIRDS, so it would be BIRDS), and the stage table is always STAGE_ followed by that same name. You'll also notice the version, the last-modified date and the version date in the XML. Those can be worried about later and perhaps added manually.
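The mapping described above can be sketched in a few lines once the table rows are available as plain text. This is a minimal sketch in Python rather than VB.NET or C#, using only the standard library. The function name `build_formats` is made up for illustration. It assumes each row is a dict keyed by the column headers from the Word table and that the Data Type cell is a valid XML element name (as "Text" is). Writing Required="N" for non-required fields and wrapping the comma-separated values in a `ValidValues` container are guesses, since the example only shows "Y" and an empty values column. Version, VersionDate and LastModified are left out, as they can be added later.

```python
import xml.etree.ElementTree as ET


def build_formats(title, rows):
    """Build a <Formats> tree from a table title such as 'PROTOCOL: BIRDS'
    and a list of row dicts keyed by the Word table's column headers."""
    # "PROTOCOL: BIRDS" -> vendor "PROTOCOL", format name "BIRDS"
    vendor, name = [part.strip() for part in title.split(":", 1)]

    formats = ET.Element("Formats", {"xmlns": "Formats.xsd"})
    vendor_el = ET.SubElement(formats, "VendorFormats", {"Vendor": vendor})
    fmt = ET.SubElement(vendor_el, "Format", {"Name": name})
    ET.SubElement(fmt, "BaseTable").text = name
    ET.SubElement(fmt, "StageTable").text = "STAGE_" + name
    fields = ET.SubElement(fmt, "Fields")

    for row in rows:
        # Assumption: anything other than "Yes" is written as "N".
        required = "Y" if row["Required"].strip().lower() == "yes" else "N"
        # The data type becomes the element name, e.g. <Text ...>.
        field = ET.SubElement(
            fields,
            row["Data Type"].strip(),
            {"Name": row["Field Name"].strip(), "Required": required},
        )
        ET.SubElement(field, "NullValue")
        ET.SubElement(field, "Description").text = row["Description"].strip()

        length = row.get("Length", "").strip()
        if length and length.lower() != "n/a":
            ET.SubElement(field, "Length").text = length

        # Comma-separated valid values, one <Value> tag each.
        raw_values = row.get("ValidValues/Comparison", "")
        values = [v.strip() for v in raw_values.split(",") if v.strip()]
        if values:
            # Hypothetical container name; the target schema is not shown.
            container = ET.SubElement(field, "ValidValues")
            for value in values:
                ET.SubElement(container, "Value").text = value

    return formats
```

Calling `ET.tostring(build_formats("PROTOCOL: BIRDS", rows), encoding="unicode")` then gives the XML as a string; the XML declaration and the xsd/xsi namespace attributes would still need to be added.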


You should realize that there is no such thing as "the" MS Word document format. There are numerous formats, and some of the early ones do not deserve the name; they are better described as memory dumps of hacky compressed text. You're not really in need of XML yet; that is a later concern. First you have to take control of the data in the document. Unless the file is in one of the newest, somewhat documented formats, you have only one option: hack it out. Write a program that manipulates the document until you get what you want. The only one who really knows the MS Word formats is MS Word herself. So if you can convince her to dump the content to a more-or-less defined format like RTF, you have a better starting point.
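As one illustration of the "newer, documented format" route: if Word can re-save the file as .docx (an assumption; the question does not say which format the document is in), the file is a ZIP archive whose main part, word/document.xml, stores tables as `w:tbl` / `w:tr` / `w:tc` elements. The sketch below, in Python with only the standard library, pulls the cell text out of every table. The function names are made up for illustration, and nested tables and other edge cases are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def tables_from_document_xml(xml_bytes):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_bytes)
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # A cell holds paragraphs; a paragraph's text may be split
                # across several runs, so join the runs first.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.iter(W + "p")
                ]
                cells.append(" ".join(p for p in paragraphs if p).strip())
            rows.append(cells)
        tables.append(rows)
    return tables


def tables_from_docx(path):
    """Read word/document.xml out of a .docx file and extract its tables."""
    with zipfile.ZipFile(path) as archive:
        return tables_from_document_xml(archive.read("word/document.xml"))
```

With the rows in hand as plain lists of strings, the first row of each table can serve as the column headers, and the XML question really does become the later concern.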
