Tuesday, February 27, 2007

Generating Dynamic OpenXML Docx Files

Recently, I needed to integrate one of my applications with MS Office Word 2007 by generating dynamic *.docx reports. I didn't want to just work out the steps once; I wanted to build a reusable library that I could use independently in any future project.





Introducing Docx File Format

Office 2007 introduces new file formats such as docx (for MS Word) and xlsx (for MS Excel). These formats are based on the ECMA OpenXML standard, which lets you define your data and styles according to XML format specifications (such as WordprocessingML for Word and SpreadsheetML for Excel). So, in the end, it's just XML, and you can control the style, data and configuration of your Office documents by modifying some XML documents.

Here, I will focus on the new MS Word file format, docx. The format is actually a ZIP archive, which means you can open it with any ZIP extractor. The *.docx file itself is called a package. When you unzip a *.docx file, you get a collection of folders and XML files. Each file is called a part. These parts contain all the styles, layout, fonts and configuration of your Word document, and the relationships between the parts are also defined in XML. This is an important point: because the parts are plain XML files, developers can change the document's style or even its entire content using any programming language. It's not tied to Microsoft technologies. OpenXML is now an ECMA standard, and the format is accessible from any programming language.
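To make the package/part idea concrete, here is a minimal sketch that lists the parts of a docx file without unzipping it by hand. The class and method names (PackageInspector, ListParts) are made up for illustration and are not part of my library; the sketch assumes the System.IO.Packaging types from the WindowsBase assembly (on newer .NET versions, the System.IO.Packaging package).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

// Hypothetical helper, for illustration only.
public static class PackageInspector
{
    // Returns one line per part: the part URI followed by its content type.
    public static List<string> ListParts(string docxPath)
    {
        List<string> lines = new List<string>();
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                lines.Add(part.Uri + "  (" + part.ContentType + ")");
            }
        }
        return lines;
    }
}
```

Running this against a document saved from Word should show the main document part under the "word" folder alongside the style, font and settings parts.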

The XML part of interest in docx files is "document.xml". This part contains all the content written in the Word document. To see its format, create a Word document, write some text in it, unzip it, and open document.xml in the "word" folder to see how your Word document is expressed as an XML file.


GenericWordDocument Class

In this section, I will explain the design of the library I built to generate my dynamic documents. The main class in the library is GenericWordDocument. Besides this class, I created a base class called TemplateFile, which represents the docx template file from which my document inherits its styles, fonts and main characteristics. This lets me do all the static visual design by hand (just by opening Office 2007 and setting the colors, document header, footers and all the lovely static stuff) and then reuse that design in my generated document by taking a copy of the template and modifying it dynamically within my program.

I also have some additional classes for generating dynamic data. The first is the Iterator class, and the other is TemplateRepeatedItem, which inherits from TemplateItem. Together they let you generate repeated data in the Word document. You simply set the XML style of the iterator; when a new item is created in the iterator, it inherits that style and reformulates itself with the supplied data.



So, suppose I want to generate a simple Word report for some products in my store. I make a docx template file and add some keywords inside it, such as "#STORE_NAME#", "#COMPANY_NAME#" and so on. These keywords will be replaced with my data within my program. The following snippet shows how I use GenericWordDocument in this simple case:
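The keyword replacement idea on its own can be sketched as below. This is not the GenericWordDocument API; KeywordFiller and Fill are hypothetical names, and the sketch assumes the content of document.xml has already been loaded into a string. The values are XML-escaped so that data such as "Tom & Sons" cannot break the markup.

```csharp
using System;
using System.Collections.Generic;
using System.Security;

// Hypothetical helper, for illustration only.
public static class KeywordFiller
{
    // Replaces every "#KEY#" keyword in the XML text with its XML-escaped value.
    public static string Fill(string documentXml, IDictionary<string, string> values)
    {
        string result = documentXml;
        foreach (KeyValuePair<string, string> pair in values)
        {
            string keyword = "#" + pair.Key + "#";
            // Escape &, <, > and quotes before the value goes into the markup.
            string safeValue = SecurityElement.Escape(pair.Value);
            result = result.Replace(keyword, safeValue);
        }
        return result;
    }
}
```

One caveat with plain string replacement: Word sometimes splits typed text across several runs in document.xml, in which case a keyword does not appear as one contiguous string and will not be found. Typing each keyword in one go in the template usually avoids this.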



Suppose I want to generate iterated rows for some products in my company store. I write something like this:
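The Iterator and TemplateRepeatedItem classes themselves are not reproduced here, but the underlying idea can be sketched roughly as follows: take the XML of one template row and emit a filled-in copy per product. RowRepeater and Repeat are hypothetical names, and the row template is assumed to be a string containing "#KEY#" keywords.

```csharp
using System;
using System.Collections.Generic;
using System.Security;
using System.Text;

// Hypothetical helper, for illustration only.
public static class RowRepeater
{
    // Emits one copy of rowTemplateXml per item, replacing "#KEY#" keywords
    // with that item's XML-escaped values, and concatenates the copies.
    public static string Repeat(string rowTemplateXml,
                                IEnumerable<IDictionary<string, string>> items)
    {
        StringBuilder output = new StringBuilder();
        foreach (IDictionary<string, string> item in items)
        {
            string row = rowTemplateXml;
            foreach (KeyValuePair<string, string> pair in item)
            {
                row = row.Replace("#" + pair.Key + "#", SecurityElement.Escape(pair.Value));
            }
            output.Append(row);
        }
        return output.ToString();
    }
}
```

The generated rows would then replace the single template row inside the table in document.xml.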



You can also make nested iterations, as each iterator can aggregate another iterator inside it. So, you can generate a product list and, for each product, list its accessories, for example.

Reading/Editing XML Parts

.NET Framework 3.0 gives you the ability to read *.docx files and their XML parts using the WindowsBase assembly (the relevant types live in the System.IO.Packaging namespace). You just add a reference to the WindowsBase assembly in your project, and you can access the inner hierarchy and parts of the docx format without extracting it.

The following is a method which reads the document.xml part from the docx file:
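A minimal sketch of such a method, assuming the System.IO.Packaging API, could look like this. DocumentReader and ReadMainDocument are hypothetical names. Note that this sketch locates the part through the package relationship (the approach suggested in the comments below) rather than through a fixed "/word/document.xml" path.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class DocumentReader
{
    private const string OfficeDocumentRelationship =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Loads the main document part (normally /word/document.xml) into an XmlDocument.
    public static XmlDocument ReadMainDocument(string docxPath)
    {
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            foreach (PackageRelationship relationship in
                     package.GetRelationshipsByType(OfficeDocumentRelationship))
            {
                Uri partUri = PackUriHelper.ResolvePartUri(
                    new Uri("/", UriKind.Relative), relationship.TargetUri);
                PackagePart part = package.GetPart(partUri);
                using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
                {
                    XmlDocument xml = new XmlDocument();
                    xml.Load(stream);
                    return xml; // fully loaded, so the package can be closed
                }
            }
        }
        throw new InvalidDataException("No main document relationship found in " + docxPath);
    }
}
```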




And the following snippet is the part which generates the Word document and writes the modified XML to document.xml:
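As a rough sketch of this step, assuming the System.IO.Packaging API again: copy the template file, then overwrite the document.xml part of the copy with the modified XML. DocumentWriter and Generate are hypothetical names, and the part path "/word/document.xml" is the assumption discussed in the comments below.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class DocumentWriter
{
    // Copies the template to outputPath and replaces its /word/document.xml
    // part with the given XML. The template file itself is left untouched.
    public static void Generate(string templatePath, string outputPath, XmlDocument modifiedXml)
    {
        File.Copy(templatePath, outputPath, true);

        using (Package package = Package.Open(outputPath, FileMode.Open, FileAccess.ReadWrite))
        {
            Uri partUri = PackUriHelper.CreatePartUri(
                new Uri("/word/document.xml", UriKind.Relative));
            PackagePart part = package.GetPart(partUri);

            // FileMode.Create discards the old content before the new XML is written.
            using (Stream stream = part.GetStream(FileMode.Create, FileAccess.Write))
            {
                modifiedXml.Save(stream);
            }
        }
    }
}
```

Working on a copy keeps the hand-made template reusable for the next report.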


More about OpenXML?

I recommend these links for learning more about Office Open XML (OOXML) and the new MS Office 2007 file formats:


Conclusion

In this post, I made use of the OpenXML format of the docx file to create dynamic Word documents. I think the new OpenXML formats of Office 2007 are a worthy addition to the interoperability of Microsoft products, and they will increase developers' ability to create more usable and productive projects.


15 comments:

KevinBoske said...

Cool stuff, Nour, but I caution you not to rely on the URI of the part to find it. Remember that the URIs are relative and can change. It doesn't matter what the URI or part name is, only how the part is related in the context of the package. Take a look at this code; this is one method you can use to find the part you're looking for (wordML). Notice that we iterate through the relationships to find the one of the type we are looking for, then resolve the part's URI:

// Requires: using System; using System.IO; using System.IO.Packaging;
const string documentRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
using (Package wdPackage = Package.Open(docName, FileMode.Open, FileAccess.ReadWrite))
{
    PackagePart documentPart = null;
    Uri documentUri = null;

    // Get the main document part (document.xml).
    foreach (System.IO.Packaging.PackageRelationship relationship in wdPackage.GetRelationshipsByType(documentRelationshipType))
    {
        documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
        documentPart = wdPackage.GetPart(documentUri);
        // There is only one document.
        break;
    }
}

Mohammad Nour said...

Yeah, but actually I was interested in "document.xml", as I do all my logic on this XML document. And I don't think that this part's path will differ from one file to another.

Saurav said...

Can you post the solution you made?

Saurav said...

Can you also send the link to your solution if it's available to the public?

Mohammad Nour said...

I am sorry, Saurav, but the library is not for public use. I just wanted to share the design so that you can get an overview of the concept, or perhaps create something more efficient. Thanks

Saurav said...

I can buy it.
If you want to sell and support it.

Paul said...

Do you know of any good DOCX parsers? I need to parse a docx into chunks of HTML based on header breaks. Any ideas on how to save all the formatting and embedded objects?

Mohammad Nour said...

Hi Paul

There is an open source project called OpenXML Package Explorer. You may want to have a look at its code. Maybe you can use the piece that gets the parts and their XML hierarchy.


Open XML Package Explorer


Thanks

Philip said...

Not sure if this is exactly what you need, but SketchPath may help parse DOCX files. It generates XPath queries for the parts of the XML you're interested in.

Social Network said...

hello mohammed,

I have added the reference to WindowsBase to my project, but why can't I use the Package class in the project? I am creating a Word add-in using Visual C# 2008. I have also included the System.IO.Packaging namespace.

Thank you.

regards,
Dinesh.

Koray said...

Hello Nour,

nice article. I have to implement a project for logging in OpenXML format as well and would like to know if you will publish the source code of your library. It could be a great inspiration for me at this point.

thanks,
Koray

Mohammed Nour said...

I am sorry but the project is not an open source one.

Alex said...

For working with docx files, use recovery docx. I like this tool, and it is free as far as I remember. The software can help you recover damaged files in Microsoft Word format and shows the recovered text in a preview window. Recovering corrupt .docx documents this way saves many hours of manual work on a damaged docx file.

Vitalii said...

Hi

nice work and nice description.

I faced a similar problem a few months ago and created a similar lib for my own needs. Since then I have used it in a few projects and made it simple and functional enough for me.

Now I've decided to publish it and let others use it. It is free (I can guarantee that at least the current version will be free forever) for any kind of use (except reselling :) )

It supports tables and .NET formats (for money and/or dates), is extensible for your own data types, and can even accept LINQ queries as parameters for tables.

To learn more visit http://invoke.co.nz/products/docx.aspx. I'll appreciate any feedback.
