Wednesday, June 08, 2005

Word document format

I've started to learn the Word document format.

First, a .doc file is a container holding several files, a bit like a zip file, but uncompressed and laid out to be read in blocks of 512 bytes (sometimes 4096).
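To make the block size concrete, here is a minimal Python sketch that reads it from the container header. It assumes the usual OLE2 compound file layout: an 8-byte signature at offset 0 and a little-endian 16-bit "sector shift" at offset 30. The function names and the synthetic header are made up for illustration and do not come from a real document.

```python
import struct

# Signature found at the start of an OLE2 compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sector_size(header: bytes) -> int:
    """Return the block (sector) size declared in a compound file header."""
    if len(header) < 32 or header[:8] != OLE_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Sector shift: 9 means 512-byte blocks, 12 means 4096-byte blocks.
    (shift,) = struct.unpack_from("<H", header, 30)
    return 1 << shift


def sector_offset(sector_index: int, size: int) -> int:
    """File offset of a block; the header fills the first block-sized slot."""
    return (sector_index + 1) * size


# Synthetic header for demonstration only.
demo = bytearray(512)
demo[:8] = OLE_MAGIC
struct.pack_into("<H", demo, 30, 9)

size = sector_size(bytes(demo))
print(size, sector_offset(0, size))
```

Extracting a stream is then a matter of following its chain of block numbers through the allocation table and concatenating the blocks found at these offsets.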

The container spec is easily understandable, so a few hours after reading it I had a little C program that was able to extract the different "sub-files" (streams from now on).

All the meat of a Word document is in the "WordDocument" stream. Besides this stream you will find at least the "SummaryInformation" and "DocumentSummaryInformation" streams, which store things like the Title, Subject, Author, etc. (go to File -> Properties in Word to see them).

The "SummaryInformation" and "DocumentSummaryInformation" streams are described in the OLE 2 Programmer's Reference (fortunately we have a copy at work).

The "WordDocument" stream's format is much harder to decode, since this is where Word stores all the formatting information, the text, etc. It seems that Microsoft published the Word spec on MSDN, and people have copied it to Wotsit, correcting some bits in the process.

So, with this spec in hand, I read the header of the WordDocument stream and tried to extract the text of my mini Word test file. First surprise: the doc says that the text is stored between FIB.fcMin and FIB.fcMac (the FIB is the header), but fcMin is 0x200 bytes before the beginning of my "Hello". If I read the bytes between FIB.fcMin and FIB.fcMac, I get:

00 00 00 ... repeat 0x200 times ... H e l l o
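For reference, a minimal Python sketch of that naive read. It assumes the Word 97 FIB layout in which the magic number 0xA5EC sits at offset 0, fcMin at offset 0x18 and fcMac at offset 0x1C, all little-endian. The stream below is synthetic, with made-up offsets chosen only to reproduce the shape of the dump above; it is not my test file.

```python
import struct


def read_text_range(word_document: bytes):
    """Return (fcMin, fcMac) as stored in the FIB of a WordDocument stream."""
    (w_ident,) = struct.unpack_from("<H", word_document, 0)
    if w_ident != 0xA5EC:  # assumed Word 97+ magic number
        raise ValueError("unexpected FIB magic")
    # fcMin and fcMac are two consecutive 32-bit offsets into the stream.
    return struct.unpack_from("<II", word_document, 0x18)


# Synthetic stream: fcMin points 0x200 bytes before the actual text.
stream = bytearray(0x405)
struct.pack_into("<H", stream, 0, 0xA5EC)
struct.pack_into("<II", stream, 0x18, 0x200, 0x405)
stream[0x400:0x405] = b"Hello"

fc_min, fc_mac = read_text_range(bytes(stream))
naive_text = bytes(stream[fc_min:fc_mac])
print(hex(fc_min), hex(fc_mac), naive_text[-5:])
```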

At first I thought I had a bug in my OLE decoder, as 0x200 is exactly the size of one big block in the OLE container. No luck: my code does not seem to be at fault, because I also extracted the WordDocument stream using laola and got the same result.

Weird.

After looking at libwv, it seems that it completely ignores fcMin: since Word 97 you need to read the piece table stored in the Word file, even if the file was saved without the "fast-save" option, to know the encoding of each piece of text you read.
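A rough Python sketch of what reading the piece table involves, under several assumptions taken from the published Word 97 layout rather than from a working decoder: the table lives in a "CLX" structure, optional blocks tagged 1 come first, the piece table itself is tagged 2 and followed by a 32-bit length, and it holds n + 1 character positions followed by n 8-byte piece descriptors. Bit 30 of a descriptor's offset field is assumed to mark 8-bit text stored at half the encoded offset. The function names and the one-piece CLX are invented for the example.

```python
import struct


def parse_piece_table(clx: bytes):
    """Return a list of (cp_start, cp_end, file_offset, is_8bit) tuples."""
    pos = 0
    # Skip leading blocks tagged 1; each carries a 16-bit length.
    while pos < len(clx) and clx[pos] == 1:
        (cb,) = struct.unpack_from("<H", clx, pos + 1)
        pos += 3 + cb
    if pos >= len(clx) or clx[pos] != 2:
        raise ValueError("piece table marker not found")
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    pos += 5
    n = (lcb - 4) // 12  # (n + 1) * 4 bytes of CPs plus n * 8 bytes of descriptors
    cps = struct.unpack_from(f"<{n + 1}I", clx, pos)
    pcd_base = pos + 4 * (n + 1)
    pieces = []
    for i in range(n):
        # Each descriptor is: 2 bytes of flags, 4 bytes of offset, 2 bytes of prm.
        (raw,) = struct.unpack_from("<I", clx, pcd_base + 8 * i + 2)
        is_8bit = bool(raw & 0x40000000)
        fc = raw & 0x3FFFFFFF
        pieces.append((cps[i], cps[i + 1], fc // 2 if is_8bit else fc, is_8bit))
    return pieces


def piece_text(stream: bytes, piece) -> str:
    """Decode one piece from the WordDocument stream."""
    cp_start, cp_end, offset, is_8bit = piece
    count = cp_end - cp_start
    if is_8bit:
        return stream[offset:offset + count].decode("cp1252")
    return stream[offset:offset + 2 * count].decode("utf-16-le")


# Synthetic CLX: one 8-bit piece covering characters 0..5 at file offset 0x400.
clx = (b"\x02" + struct.pack("<I", 16) + struct.pack("<II", 0, 5)
       + struct.pack("<HIH", 0, 0x40000000 | (0x400 * 2), 0))
stream = bytes(0x400) + b"Hello"

pieces = parse_piece_table(clx)
print(pieces, piece_text(stream, pieces[0]))
```

The per-piece flag is the point here: the same document can mix 8-bit and 16-bit runs, so a single fcMin/fcMac range cannot say how to decode the bytes.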

So I headed to where the piece table is supposed to be, to extract at least the first piece of text, but so far I haven't been able to make any sense of the piece table in my test document :-(

To be continued...

4 comments:

Anonymous said...

It could be very nice to see your blog updated more often Cuenqui :)

Lil said...

Any conclusion on the topic? I have encountered the same issue and want to find an answer.

Joaquin Cuenca Abela said...

Hi Lil,

sorry, I got sidetracked working on Panoramio and never found the bug here. Maybe you will be luckier than I was; try taking a look at the AbiWord importer or at one of the doc -> txt converters.

Cheers,

Michal said...

Hi, there is a lot of information about file formats and file extensions at File-extensions.org.