Ubuntu: How can I extract the data from a corrupted .docx file?


My girlfriend's .docx file got corrupted and I'm trying to recover the text from it. I learned that a .docx is essentially a zip file that contains folders and a bunch of .xml files (one of which contains the document text). I ran the following command on Ubuntu 10.10 to unzip the archive:

unzip portfolio.docx -d file-dir  

The result I got is:

  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of portfolio.docx or
        portfolio.docx.zip, and cannot find portfolio.docx.ZIP, period.

On Windows 8.1 I tried WinZip, 7-Zip, WinRAR and Zip2Fix, but without any luck.

The file weighs nearly 20 KB, so I know there is some content inside. Is there any way to force the unzip?


Run this:

cp portfolio.docx portfolio.zip  

Alternatively, just rename portfolio.docx to portfolio.zip. You should then be able to open the resulting portfolio.zip with Archive Manager and extract its contents.

Edit: I just ran a quick check. In the extracted folder, the document text is most likely in word/document.xml; docProps/core.xml holds metadata such as the title and author.
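
Once word/document.xml is available, the text still has to be pulled out of the XML markup. Here is a minimal Python sketch of one way to do that, assuming the XML itself is intact and uses the standard WordprocessingML namespace. The function names docx_xml_to_text and read_docx_text are made up for this example.

```python
import zipfile
from xml.etree import ElementTree

# Namespace used by the paragraph (w:p) and text (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_xml_to_text(xml_bytes):
    """Return the plain text of a word/document.xml payload, one paragraph per line."""
    root = ElementTree.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of every run.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def read_docx_text(path):
    """Read word/document.xml from a readable .docx/.zip file and return its text."""
    with zipfile.ZipFile(path) as archive:
        return docx_xml_to_text(archive.read("word/document.xml"))
```

Calling read_docx_text("portfolio.zip") only works if the archive opens at all. For a file that was already extracted, pass the bytes of document.xml to docx_xml_to_text instead. Formatting, images and tables are ignored; only the running text is kept.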

Another Edit: If the resultant zip file is corrupted, look here.
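
If no archive tool will open the file, one thing that may still be worth trying is to ignore the missing central directory and read the entries directly. Each file in a zip is preceded by a local header starting with the bytes "PK\x03\x04", so a script can scan for those and inflate whatever follows. The Python sketch below is only an illustration of that idea, not a full repair tool: the names inflate_as_much_as_possible, salvage_zip_entries and salvage_to_dir are invented here, it handles only stored and deflated entries, and it cannot help if the damaged part of the file is the compressed text itself.

```python
import os
import struct
import zlib

# Fixed 30-byte part of a zip local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
SIGNATURE = b"PK\x03\x04"


def inflate_as_much_as_possible(raw):
    """Inflate a raw deflate stream, keeping whatever decodes before any damage."""
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # negative: no zlib header
    pieces = []
    for start in range(0, len(raw), 1024):
        try:
            pieces.append(inflater.decompress(raw[start:start + 1024]))
        except zlib.error:
            break  # damaged from here on; keep what was decoded so far
        if inflater.eof:
            break  # clean end of this entry's stream
    return b"".join(pieces)


def salvage_zip_entries(data):
    """Yield (name, content) for each local file entry found in the raw bytes."""
    pos = data.find(SIGNATURE)
    while pos != -1:
        header = data[pos:pos + LOCAL_HEADER.size]
        if len(header) < LOCAL_HEADER.size:
            break  # file ends in the middle of a header
        (_sig, _version, _flags, method, _mtime, _mdate,
         _crc, comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack(header)
        name_start = pos + LOCAL_HEADER.size
        body_start = name_start + name_len + extra_len
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        if method == 8:    # deflated: the stream marks its own end
            yield name, inflate_as_much_as_possible(data[body_start:])
        elif method == 0:  # stored: rely on the size recorded in the header
            yield name, data[body_start:body_start + comp_size]
        pos = data.find(SIGNATURE, pos + len(SIGNATURE))


def salvage_to_dir(path, out_dir):
    """Write every salvaged, non-empty entry of the file at path into out_dir."""
    with open(path, "rb") as handle:
        data = handle.read()
    os.makedirs(out_dir, exist_ok=True)
    for name, content in salvage_zip_entries(data):
        if not content:
            continue
        # Flatten the entry name so nothing is written outside out_dir.
        safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
        with open(os.path.join(out_dir, safe), "wb") as out:
            out.write(content)
```

With these definitions, salvage_to_dir("portfolio.docx", "file-dir") would write files such as word_document.xml into file-dir. No checksums are verified, so the recovered XML may be truncated or partly garbled and should be inspected by hand.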

