Opening up Outlook’s data format

In Q4 last year, Microsoft announced through its Interoperability @ Microsoft blog that it was planning to open up its proprietary PST email format used by Outlook.

The data in .pst files has been accessible through the Messaging API (MAPI) and Outlook Object Model (two things of which my understanding is minimal at best), but only if the user has Outlook installed:

In order to facilitate interoperability and enable customers and vendors to access the data in .pst files on a variety of platforms, we will be releasing documentation for the .pst file format. This will allow developers to read, create, and interoperate with the data in .pst files in server and client scenarios using the programming language and platform of their choice. The technical documentation will detail how the data is stored, along with guidance for accessing that data from other software applications. It also will highlight the structure of the .pst file, provide details like how to navigate the folder hierarchy, and explain how to access the individual data objects and properties.

The documentation will be released under Microsoft’s Open Specification Promise, which means that Microsoft promises not to assert its patents against anyone implementing the specification. Other Microsoft Office formats, such as the XML-based .docx and .xlsx, and the older binary formats .doc and .xls, are already covered by the same promise.

This seems like a big win for users of Microsoft Outlook. Together with CodePlex, which hosts open source projects, it suggests that Microsoft is slowly opening things up and making life easier for its customers. It certainly has the potential to make it easier for customers to leave the Outlook platform. From GigaOM:

In the past, if someone was moving from Outlook/Exchange to Gmail or any other platform, there was a pretty tedious process of exporting pieces of data from Outlook into various formats before moving over to the new platform. Basically, once you didn’t have Outlook, that .pst was a useless brick of data. Now in that case you’ll be able to take that .pst file with you and if other apps/platforms build readers, they will be able to access that data. So migration to other platforms is a valid use case where there’s some benefit.

A day after the announcement, ZDNet floated some more ideas about why Microsoft is making this change:

[Rob Helm, an analyst with Directions on Microsoft,] added that he believed Microsoft is trying to wean large customers from storing mail in .PST files or file systems “because doing that makes it hard for organizations to back up all their e-mail, enforce e-mail retention policies, and locate relevant e-mails during legal discovery.”

Not just retention, but perhaps also helping organizations mine their email data for knowledge that is all too frequently lost forever when an employee leaves the company? Here’s an idea: How about a tool that will gather information from emails dating back years and populate a wiki automatically for new employees?

[Rob Sanfilippo, another Directions on Microsoft analyst] added that .PSTs “are used most frequently for archiving purposes and Exchange Server 2010 includes a new server-based Personal Archive feature that gives users a separate mailbox to use for archiving on the server instead of using a PST.” He said this gives weight to the aforementioned idea that Microsoft is trying to help organizations get users off PSTs and onto server storage.

Then, in February of this year, the promised documentation was released on the MSDN website. Finally, about a month ago, two open source tools that make use of the documentation were released on CodePlex:

  • The PST Data Structure View Tool is a graphical tool that allows developers to browse the internal data structures of a PST file. The primary goal of this tool is to assist people who are learning the .pst format and help them to better understand the documentation.
  • The PST File Format SDK is a cross-platform C++ library for reading .pst files that can be incorporated into solutions that run on top of the .pst file format. The capability to write data to .pst files is part of the roadmap and will be added to the SDK.
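To give a flavour of what the published documentation makes possible without Outlook installed, here is a minimal Python sketch that looks at the first few bytes of a .pst file and reports whether it is the older ANSI variant or the newer Unicode one. It does not use the SDK above. The field layout (a 4-byte "!BDN" magic, a 4-byte CRC, a 2-byte "SM" client magic, then a 2-byte version) is my reading of the header described in the specification, and the function name is my own invention, so please check the offsets against the documentation before relying on it.

```python
import struct

# Assumed header layout, per my reading of the published .pst specification:
#   bytes 0-3   dwMagic       "!BDN"
#   bytes 4-7   dwCRCPartial  (ignored here)
#   bytes 8-9   wMagicClient  "SM" for .pst files
#   bytes 10-11 wVer          14 or 15 = ANSI, 23 or higher = Unicode
PST_MAGIC = b"!BDN"
CLIENT_MAGIC = b"SM"


def describe_pst_header(header: bytes) -> str:
    """Return 'ANSI', 'Unicode' or 'unknown' for the given header bytes.

    Hypothetical helper for illustration only; raises ValueError if the
    bytes do not look like the start of a .pst file.
    """
    if len(header) < 12:
        raise ValueError("need at least the first 12 bytes of the file")

    # "<" means little-endian with no padding: 4 + 4 + 2 + 2 = 12 bytes.
    magic, _crc, client_magic, version = struct.unpack_from("<4sI2sH", header, 0)

    if magic != PST_MAGIC or client_magic != CLIENT_MAGIC:
        raise ValueError("this does not look like a .pst file")

    if version in (14, 15):
        return "ANSI"
    if version >= 23:
        return "Unicode"
    return "unknown"
```

To try it on a real file, read the first 12 bytes in binary mode, for example `describe_pst_header(open("archive.pst", "rb").read(12))`, where `archive.pst` is a placeholder for your own file. Everything interesting (folders, messages, properties) lives much deeper in the file, which is exactly what the SDK is for.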

The project has seen some exciting progress, which is good news for organizations that use Outlook. And as you might know, data visualization used to enhance understanding is a favourite topic of mine!

What risk do these developments address for organizations that rely on Outlook? Knowledge and information management is critical to so many companies. The use, retention and (hopefully) reuse of knowledge developed by employees and stored in email conversations within Outlook will be enhanced through this openness.

Has your organization taken these developments into account in your audits of knowledge/information management and strategy?