Josh Payne on content analytics, enterprise content and information management

Archive for January 2010

Eyeballing Your Content Analytics Results

One of the basic tools of carpentry is a level. Without it, a carpenter (or any weekend Mr. Fixit) is left eyeballing whether or not a particular surface is perfectly horizontal or perfectly vertical. Sure, anyone can ‘eyeball’ the work and say “Yeah, it looks straight to me”, but more often than not, that results in a slanted bookshelf.

Through experience, we learn in carpentry that eyeballing it just isn’t an accurate method — our perception of reality can be misleading.

The same rigor needs to carry through to how we judge the success of content analytics and specifically content classification. Content analytics, like any innovation, is the target of skepticism and misperception. This has been one of the challenges we’ve faced as we push for adoption of our content classification product at IBM, and other products that leverage content analytics.

I bring this up having read an interesting post by Lexalytics, a content analytics vendor, on their investigations of their sentiment analysis accuracy. One of the things that caught my eye in the post was a quote from Forrester analyst Suresh Vittal, who said “in talking to clients who have deployed some form of sentiment analysis, accuracy rests at about 50 percent.”

It’s a pretty casual quote that reflects how too many people approach assessing the accuracy of content analytics. There’s no reference to rigorous studies. There’s no reference to hard data about the success or failure of content analytics. It is “Oh, I’ve spoken to a bunch of customers and they perceive it to be doing an iffy job.”

This is the wrong way to judge content analytics.

Last year, as our customers were going about their buying decisions on content classification, many would go through limited tests of our content classification capability. There were two types of these content classification “proof of technologies” that went on.

The first were the rigorous tests, run by customers who followed best practices. They created a reasonably large corpus of pre-categorized documents, and segregated a large portion of this pre-categorized content as a test set. The remaining content was used to train the system. To assess the accuracy of the system, the segregated, pre-categorized test corpus was used. A human was not judging the system document by document as it categorized. Rather, a large, statistically valid sample was run through as a formal control set.
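To make the rigorous approach concrete, here is a minimal sketch in Python. It is not how the Classification Module or the Classification Workbench does it; the function names, the 30% split and the keyword-based toy_classify stand-in are all assumptions made up for illustration. The point is only the shape of the test: hold back pre-categorized documents, then score the top category against the known answer.

```python
import random


def holdout_split(labeled_docs, test_fraction=0.3, seed=0):
    """Shuffle pre-categorized (text, category) pairs and set a portion aside."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n_test = round(len(docs) * test_fraction)
    # The first n_test shuffled documents are held out; the rest are for training.
    return docs[n_test:], docs[:n_test]


def top_category_accuracy(classify, test_set):
    """Share of held-out documents whose top predicted category matches the known one."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for text, category in test_set if classify(text) == category)
    return correct / len(test_set)


def toy_classify(text):
    # Hypothetical stand-in for a trained classifier: a single keyword rule.
    return "claims" if "claim" in text.lower() else "billing"


# A tiny, made-up held-out set with known categories.
held_out = [
    ("Claim form for water damage", "claims"),
    ("Invoice for March services", "billing"),
    ("Claim denied, payment due", "billing"),
    ("Payment reminder", "billing"),
]
print(top_category_accuracy(toy_classify, held_out))  # 3 of 4 correct: 0.75
```

No human is looking at individual results here. The only output is a single number computed over the whole held-out set, which is exactly what keeps memorable misfires from coloring the verdict.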

The other type of customer did the opposite. They trained the system and then pulled uncategorized content and asked a human and the Classification Module to categorize side by side.

The results, in terms of accuracy, were the same for both types of customers. For all customers, the top category response was correct about 80% of the time.
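How much weight a figure like 80% can bear depends on how many documents sit behind it. As a rough illustration (my own back-of-the-envelope arithmetic, not something taken from the customer tests), the sketch below uses the standard normal approximation for a proportion to estimate a 95% margin of error. The function name and the sample sizes of 50 and 1,000 are assumptions chosen for the example.

```python
import math


def accuracy_margin(accuracy, sample_size, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on a sample.

    Uses the normal approximation for a proportion: z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / sample_size)


for n in (50, 1000):
    print(n, round(accuracy_margin(0.80, n), 3))
```

Under these assumptions, 80% measured on 50 documents is good to roughly plus or minus 11 points, while the same figure measured on 1,000 documents is good to about 2.5 points.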

The perception of the accuracy for the different types of customers was starkly different.

Those who followed a rigorous approach perceived the automatic classification process to be a success. Those who followed the ‘judge by hand’ approach perceived the system to be unreliable. Why? The human judges have a tendency to latch onto the failures: the misfires are far more memorable in the eyes of the judge than the successes. The misfires are just numerous enough (10-20%) that they seem pervasive. In reality, the vast majority of the results are good.

This is why the Classification Module and its Classification Workbench tool have explicit workflows built in for executing rigorous testing of your training set and potential categorization process. Because eyeballing it leads to the misperception of results, and to crooked bookshelves.


Written by Josh Payne

January 26, 2010 at 10:18 am

Structure: the key to unstructured information

This is part three in a three-part series exploring master content. Part one explored the business need. Part two explored the definition of master content itself.

An article by Chris Dixon on the Silicon Alley Insider caught my eye recently. When discussing “knowledge systems”, Dixon stated:

It has been widely noted that the amount of information in the world and in digital form has been growing exponentially. One way to make sense of all this information is to try to structure it after it is created.

This quote resonated with me as I’m wrapping up my mini-series on ‘master content’ and I’m using it as an opportunity to tie master content back to the name of this blog — (un)structured. Chris succinctly summarizes this point for me.

I named this blog (un)structured as a tip of the hat to the core of the challenge I see in the enterprise content world and specifically in my corner of that world: the content analytics field. The information we’re working with in ECM is, at its core, assumed to be completely unstructured. We’re working with long form texts that by default require a human to comprehend, analyze and handle.

Yet any process that aims to automate the handling of, extract an understanding of or derive more value from that unstructured information immediately goes about the process of adding more structure to that information. We add more data about our ‘unstructured’ data (thus the term metadata). Structured data about our unstructured information becomes the primary language through which the surrounding software ecosystem interacts with the unstructured information. Structured data is the lingua franca of software information management. (un)structured, as a name, acknowledges that the content has deep and true value, but lending structure is a critically necessary task in order to exploit that content.

The idea of master content and incorporating trusted enterprise content into a master data managed single view is emblematic of that need. The unstructured information provides much of the motivation for synchronizing information from ECM to a single view of a customer. A scanned image of a driver’s license or a long-text description of a customer complaint is valuable in painting the full, in-depth picture of a particular individual and their relationship with a company, but we can’t relate that image back to the profile of the customer without some sort of structured information facilitating the linkage.
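A small sketch shows what that linkage amounts to in practice. Everything in it is hypothetical: the customer_id metadata field, the URIs and the function name are assumptions for illustration, not the schema of any ECM or master data product. The scanned image itself is opaque; the structured metadata attached to it is what lets software tie it to a profile.

```python
def link_content_to_profiles(profiles, content_items):
    """Attach content to customer profiles using a structured metadata key.

    profiles: dict mapping customer_id -> profile dict
    content_items: list of dicts with a "uri" and optional "metadata"
    Returns (linked profiles, list of URIs that could not be linked).
    """
    linked = {cid: {**profile, "content": []} for cid, profile in profiles.items()}
    unlinked = []
    for item in content_items:
        customer_id = item.get("metadata", {}).get("customer_id")
        if customer_id in linked:
            linked[customer_id]["content"].append(item["uri"])
        else:
            # No structured key, or an unknown one: the content stays orphaned.
            unlinked.append(item["uri"])
    return linked, unlinked


profiles = {"C-1001": {"name": "A. Customer"}}
content_items = [
    {"uri": "ecm://scans/license-17.tif", "metadata": {"customer_id": "C-1001"}},
    {"uri": "ecm://scans/unknown-42.tif", "metadata": {}},
]
linked, unlinked = link_content_to_profiles(profiles, content_items)
print(linked["C-1001"]["content"])
print(unlinked)
```

The second scan may be every bit as valuable as the first, but with no structured key it cannot be related to anyone.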

(un)structured is about the importance of structured information to delivering new business value from unstructured content.

Content classification lends more structure. By doing so, individual content decisions can be automated and executed consistently.

Content analytics lends more structure. By doing so, content can be explored and new insight can be uncovered.

Master content is identified by the structure associated with content. By doing so, content can be linked to trusted single views.

In the coming weeks, I’ll focus more on classification and content analytics and how they can better empower decision making on content decommissioning. Stay tuned.

Written by Josh Payne

January 19, 2010 at 2:53 pm

What is Master Content?

Earlier in Google’s life as the key access point for public information, I’d pop open the now familiar search page, run a query and click on the first result to find a page with the information for which I was looking. But frequently, I’d be forced to take pause. More often than not, I’d say to myself “well, the information is here, but who is this person that put it up? I’ve never heard of them. Why should I trust this seemingly random webpage?”

As both the internet and Google matured, this problem of trust has been increasingly addressed. In part, I’ve developed a better sense of what to trust, but more critically Google has improved their search result delivery to take into account the trustworthiness and quality of the source. It’s no coincidence that wikipedia is almost always near the top of Google’s search results for a variety of queries. Google trusts wikipedia.

The same idea applies within the walls of your enterprise. In large organizations, different departments and agencies have varying levels of thoroughness and quality associated with their business processes and by extension their content. I experience this every day: I work at IBM, a company of 300,000 prolific content creators.

Let’s say you work at a bank and a customer has made an inquiry about their account. They want to execute a major transaction with you and you need to make a decision on whether to approve the transaction. If you’re looking for information about that particular customer, you need to know that you can trust the information you find. Is it timely or out of date? Who created it? Were the right controls in place when it was created? Did the right employees review it? Has it been processed completely? Has the lifecycle of this document been managed properly? Should it have been disposed of already?

When you are trying to make a decision about a particular customer, you need to focus on the information you can trust: ideally the master data and the associated trusted content. This trusted content that informs your view of your customer or other entity is “master content.” Master content is not only a collection of federated content sources from different departments, but also the high quality, authoritative, timely content from across your organization. It is more than a comprehensive catalog because some level of perspective has been brought to ensure that master content has been filtered out from the vast volumes of content your organization has created, likely stored in many, many different enterprise content repositories in many, many different departments.

Master content is the information your organization should be staking your daily decisions upon — without worrying about the lineage and reliability of the originating author.  The decision makers are more efficient — and better informed — when master content is accessible and delivered, ideally as part of a complete single view.

Written by Josh Payne

January 14, 2010 at 12:47 pm

Posted in Master Content

You want a copy of my driver’s license? . . . Again?

From nytimes.com, via New York State Department of Motor Vehicles

“We’ll need a copy of your driver’s license, birth certificate and wedding certificate.”

I need to dig this information out. Again. Sometimes I have it in my wallet. Sometimes I have to dig around my files. Sometimes I have to go to the safety deposit box. The same organization is going to scan another copy of my supporting documentation. My brain is saying “Don’t you already have this? I gave this to you 2 months ago!” but my mouth says “Sure, will do, I’ll fax it right over.”

The same organizations — whether it be insurers, banks, health care providers, government agencies — are constantly asking for the same information from us. The more we interact with the same bureaucracy, the more that bureaucracy asks us for the same information, over and over again. What’s your social security number? Date of birth? Your address? Insurance ID number? And it goes beyond the hard data in forms. Photocopies of this. Faxes of that. Copied and scanned in.


Though we perceive each of these organizations as unified and whole, oftentimes behind the marketed facade of brand uniformity is a patchwork of different departments and lines of business. And each line of business frequently gathers and stores this information for its own needs. At a bank, the mortgage group gathers the information that they need. And the retail banking group gathers what they need. And of course the small business group gathers what they need.

Thus ConglomerateBankofYourCountry keeps asking you for the same information. And they keep storing it. You and I think that they’re sharing all this information. But in reality, for each transaction, we’re dealing with different people in different organizations who probably come from different acquired companies through merger and acquisition. They all have their own infrastructure with their own business applications on their own servers with their own databases. And each of them stores copies of all this hard data about you as a customer.

An entire discipline in information management has sprung up to solve these kinds of problems: master data management. Master data is high-value core information about customers (and other entities) that an organization stores. So the master data about me as a customer entails hard data like a clean version of my name, my up-to-date, accurate address, my email and other high-quality data that the organization generates about me, like the types of services I do purchase from them . . . and what services I have yet to purchase from them.

Rather than forcing customers to repeatedly deliver the same information to an organization, master data management facilitates the creation of the golden version of data about me as a customer — and then shares this single view of the customer across the organization to different departments.
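As a rough sketch of the idea, imagine each department holding its own record of the same customer. The merge rule below (the most recently updated non-empty value wins for each field) is an assumption I picked for illustration, as are the field names and sample data. Real master data management products apply far richer matching and survivorship rules.

```python
def build_golden_record(department_records):
    """Merge per-department records into one view of the customer.

    Each record is a dict with an ISO-format "updated" date and a "fields" dict.
    For every field, the most recently updated non-empty value wins.
    """
    golden = {}
    # Oldest first, so newer values overwrite older ones.
    for record in sorted(department_records, key=lambda r: r["updated"]):
        for field, value in record["fields"].items():
            if value:  # an empty value never overwrites a known one
                golden[field] = value
    return golden


department_records = [
    {"source": "mortgage", "updated": "2009-03-01",
     "fields": {"name": "A. Customer", "address": "1 Old Road", "email": ""}},
    {"source": "retail banking", "updated": "2009-11-15",
     "fields": {"name": "Alex Customer", "address": "", "email": "alex@example.com"}},
]
print(build_golden_record(department_records))
```

Once a record like this exists and is shared, the small business group has no reason to ask for an address the mortgage group already holds.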

Now, when you’re engaging me in a new transaction, or a new customer service interaction, instead of forcing me as a customer to provide all this information, you as an organization can pull up the master data instead. (I’m only touching on the tip of the iceberg here with respect to the benefits of master data management).

Let’s bring this back to ECM (because this post is too long already)

As the name implies, traditionally, master data management has been focused on structured data as opposed to unstructured content. But as we in the ECM industry know, unstructured information (content) makes up 80% of the information in an organization. Unstructured information provides more context than simple data. So as organizations work to solve problems with master data management, it’s only appropriate that they begin to incorporate the information in enterprise content management repositories, their trusted content, to fill out their single view of the customer.

Rather than simply incorporating the date of birth in a master data record, organizations can incorporate a scanned image of the birth certificate.

Just as there is a discipline inside organizations to form their structured data into master data, this discipline needs to be expanded to incorporate enterprise content — this “master content” belongs alongside master data to provide greater depth and context to a view of a customer.  As case management applications and records management practices improve the quality and trustworthiness of content, this content — master content — can be delivered to master data solutions to deliver a more complete view of the customer.  More context, readily available means better customer outcomes.

The net effect? ConglomerateBankofYourCountry won’t be asking you and me for our driver’s license . . . yet again.

Written by Josh Payne

January 11, 2010 at 1:45 pm

Posted in Master Content

Redbook for InfoSphere Classification Module

One of the great assets at our disposal at IBM is the IBM Redbooks program.  Small teams from around the world gather for short-term residencies to investigate a specific topic, product or solution. The output of these residencies is a technical document — the redbook.

Recently a team completed a redbook on the topic of InfoSphere Classification Module. It’s got lots of great information on installing and getting off the ground with the Classification Module. I’m particularly enamored with chapter 4, which covers best-practices guidance and typical challenges when it comes to training the Classification Module with examples.

Check it out. Many thanks to the great team who put in the weeks of effort to make the redbook on Classification Module a reality.

Written by Josh Payne

January 4, 2010 at 10:55 am