Wednesday, June 08, 2005

Word document format

I've started to learn the Word document format.

First, a .doc file is a container for several files, like a zip file, except that it is not compressed and it is laid out to be read in blocks of 512 bytes (or sometimes 4096).

The container spec is easily understandable, so a few hours after reading it I had a little C program that was able to extract the different "sub-files" (streams from now on).
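To give an idea of the first step, here is a minimal sketch of how the block size can be read from the container header. It is not my actual extractor: the function names are made up for illustration, and the offsets (8-byte signature at offset 0, block-size shift at offset 0x1E) are taken from my reading of the compound file header layout, so treat them as assumptions to check against the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature expected in the first 8 bytes of an OLE2 container. */
static const uint8_t CFB_MAGIC[8] = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
};

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: return the big-block size in bytes, or 0 if the
 * buffer does not look like a container header. The shift at offset
 * 0x1E is assumed to be 9 (512 bytes) or 12 (4096 bytes). */
static uint32_t cfb_sector_size(const uint8_t *header, size_t len)
{
    assert(header != NULL);
    if (len < 0x20 || memcmp(header, CFB_MAGIC, sizeof CFB_MAGIC) != 0)
        return 0;
    uint16_t shift = read_u16_le(header + 0x1E);
    if (shift != 9 && shift != 12)
        return 0;
    return (uint32_t)1 << shift;
}

/* Hypothetical helper: file offset of big block number `sector`.
 * The header occupies the first block, hence the "+ 1". */
static uint64_t cfb_sector_offset(uint32_t sector, uint32_t sector_size)
{
    return ((uint64_t)sector + 1) * sector_size;
}
```

With 512-byte blocks, block 0 starts at file offset 512, block 1 at 1024, and so on. Following the chains of blocks that make up each stream is the rest of the work.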

All the meat of a Word document is in the "WordDocument" stream. Besides this stream you will find at least the "SummaryInformation" and "DocumentSummaryInformation" streams, which store things like the Title, Subject, Author, etc. (go to File -> Properties in Word to see them).

The "SummaryInformation" and "DocumentSummaryInformation" streams are described in the OLE2 Programmer's Reference (fortunately we have a copy at work).

The "WordDocument" stream's format is much harder to decode, since this is where Word stores all the formatting information, the text, etc. It seems that Microsoft published the Word spec on MSDN, and people have copied it to Wotsit, correcting some bits in the process.

So, with this spec, I read the header of the WordDocument stream, trying to extract the text of my mini Word test file. First surprise: the spec says that the text is stored between FIB.fcMin and FIB.fcMac (the FIB is the header), but fcMin is 0x200 bytes before the beginning of my "Hello". If I read the text between FIB.fcMin and FIB.fcMac I get:

00 00 00 ... repeat 0x200 times ... H e l l o
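For reference, this is roughly how the two fields can be pulled out of the header. It is only a sketch under assumptions: the offsets (fcMin at 0x18, fcMac at 0x1C) and the 0xA5EC identifier are the ones I understand the Word 97 layout to use, and the struct and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed Word 97 FIB layout; verify against the spec you are using. */
#define FIB_OFF_WIDENT 0x00
#define FIB_OFF_FCMIN  0x18
#define FIB_OFF_FCMAC  0x1C
#define FIB_WIDENT_W97 0xA5EC

/* Hypothetical container for the two offsets. */
struct text_range {
    uint32_t fc_min; /* offset of the first text byte in the stream */
    uint32_t fc_mac; /* offset just past the last text byte */
};

static uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fill `out` from the start of the WordDocument stream.
 * Returns 0 on success, -1 if the header looks wrong. */
static int fib_text_range(const uint8_t *stream, size_t len,
                          struct text_range *out)
{
    assert(stream != NULL && out != NULL);
    if (len < 0x20)
        return -1;
    if (read_u16_le(stream + FIB_OFF_WIDENT) != FIB_WIDENT_W97)
        return -1;
    out->fc_min = read_u32_le(stream + FIB_OFF_FCMIN);
    out->fc_mac = read_u32_le(stream + FIB_OFF_FCMAC);
    /* Sanity check: the range must be ordered and inside the stream. */
    if (out->fc_min > out->fc_mac || out->fc_mac > len)
        return -1;
    return 0;
}
```

The bytes in `stream[fc_min .. fc_mac)` are then what the spec calls the text, which in my test file is where the run of zeros shows up.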

At first I thought that I had a bug in my OLE decoder, as 0x200 is exactly the size of one big block in the OLE container. No luck: my code does not seem to be the culprit, because when I extracted the WordDocument stream using laola I got the same result.

Weird.

After looking at libwv, it seems that it completely ignores fcMin. Since Word 97 you need to read the piece table stored in the Word file, even if the file has been saved without the "fast-save" option, because the piece table is what tells you the encoding of each piece of text you read.
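As far as I understand it, each entry in the piece table carries a 32-bit file offset whose second most significant bit says whether the piece is stored as 8-bit characters, in which case the real offset is the remaining value divided by two. The sketch below shows that decoding only. The 8-byte descriptor layout (offset field at byte 2), the 12n + 4 size formula, and all the names are assumptions made for illustration, not something I have verified against my test document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of one piece's location. */
struct piece {
    uint32_t offset; /* byte offset of the piece in the WordDocument stream */
    int is_8bit;     /* 1: one byte per character, 0: 16-bit characters */
};

static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Raw offset field of an 8-byte piece descriptor: assumed to sit
 * after a 2-byte flags field. */
static uint32_t pcd_raw_fc(const uint8_t *pcd)
{
    assert(pcd != NULL);
    return read_u32_le(pcd + 2);
}

/* Split the raw field into a usable offset and an encoding flag. */
static struct piece decode_piece_fc(uint32_t raw_fc)
{
    struct piece p;
    uint32_t fc = raw_fc & 0x3FFFFFFFu; /* drop the two top bits */
    p.is_8bit = (raw_fc & 0x40000000u) != 0;
    p.offset = p.is_8bit ? fc / 2 : fc;
    return p;
}

/* Number of pieces in a table of `lcb` bytes, assuming n + 1 four-byte
 * character positions followed by n eight-byte descriptors.
 * Returns -1 if the size does not fit that shape. */
static long piece_count(uint32_t lcb)
{
    if (lcb < 4 || (lcb - 4) % 12 != 0)
        return -1;
    return (long)((lcb - 4) / 12);
}
```

For example, a raw value of 0x40000400 would decode to an 8-bit piece starting at offset 0x200 in the stream.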

So I headed to where the piece table is supposed to be, to extract at least the first piece of text, but so far I have not been able to make any sense of the piece table in my test document :-(

To be continued...

And then...

it crashed.

After 2 hours of working on an icon, GIMP decided I should take a rest, and it crashed.

Of course, it would not be funny if I had saved it, and sure enough, I didn't save it. Not a single time.

I can hear you...
It's your fault! You should always save, no matter what software you are using.


But sorry, I'm not that stupid. It's GIMP's fault, 100%. Given that it's not the first time it has decided to ruin several hours of my work, I wonder if I should start learning Photoshop. It can't be as hard as it seems...

And for all the artists out there, a wonderful link from simplebits: firewheel design has the best portfolio I've ever seen. The colors in all their creations are extremely hot and vivid. A pleasure for the eye.