Josh Payne on content analytics, enterprise content and information management

Archive for the ‘Master Content’ Category

Google’s Improvements Extend into Information Governance


I just read with great interest Steven Levy’s article in Wired on Google’s search algorithm and how Google works to improve it. A couple of things leaped out at me as concepts I’ve discussed here in the past (or on my old blog), as the concepts extend into the enterprise. Just as Google uses them to improve their consumer search experience, you can leverage them within the context of better information governance.

1) Google uses document context in much the same way that I have described advanced content classification: as a “context-based” method of classifying information. Levy writes:

Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”

Google uses the context of the content it indexes to better understand the purpose and intent of a particular document and, in turn, the purpose and intent of your particular search query. Advanced content classification methods deliver better categorization results in a similar way — they use the full context of the training documents provided to them to produce better results.
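To make the context idea concrete, here is a minimal sketch in Python. It is not Google’s algorithm, nor the method of any particular classification product; the category names, the training snippets and the `classify` function are all made up for illustration. It simply scores a piece of text by how often its words appear in each category’s training documents.

```python
from collections import Counter

# Hypothetical training documents, grouped by category (illustrative only).
TRAINING = {
    "food": ["hot dog with mustard on bread at the baseball game",
             "grill the hot dog and toast the bun"],
    "pets": ["the puppy is a young dog that needs water and walks",
             "keep your dog cool in hot weather"],
}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def build_profiles(training):
    """Count how often each word appears in each category's training documents."""
    return {cat: Counter(w for doc in docs for w in tokenize(doc))
            for cat, docs in training.items()}

def classify(text, profiles):
    """Pick the category whose training vocabulary best matches the text's words."""
    words = tokenize(text)
    scores = {cat: sum(profile[w] for w in words)
              for cat, profile in profiles.items()}
    return max(scores, key=scores.get), scores

profiles = build_profiles(TRAINING)
label, scores = classify("hot dog with mustard", profiles)
print(label, scores)
```

Both categories contain “hot” and “dog”, so those two words alone cannot settle the question. It is the surrounding words (“with”, “mustard”) that tip the score toward the food category, which is the point of classifying by context rather than by isolated keywords.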

2) When discussing ‘trusted content’, I used the example of how Google trusts some sources over others. At the time, I didn’t have a source for this assertion. Levy describes this in some detail in the article:

That same year, an engineer named Krishna Bharat, figuring that links from recognized authorities should carry more weight, devised a powerful signal that confers extra credibility to references from experts’ sites. (It would become Google’s first patent.) The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals.

Do read the entire article if you’re interested in these topics — given our universal reliance on Google as consumers, it’s certainly beneficial to be an educated consumer. And these concepts can extend into better proactive management of your enterprise content.


Written by Josh Payne

February 23, 2010 at 10:10 am

Structure: the key to unstructured information


This is part three in a three-part series exploring master content. Part one explored the business need. Part two explored the definition of master content itself.

A recent article by Chris Dixon on the Silicon Alley Insider caught my eye. When discussing “knowledge systems”, Dixon stated:

It has been widely noted that the amount of information in the world and in digital form has been growing exponentially. One way to make sense of all this information is to try to structure it after it is created.

This quote resonated with me as I’m wrapping up my mini-series on ‘master content’ and I’m using it as an opportunity to tie master content back to the name of this blog — (un)structured. Chris succinctly summarizes this point for me.

I named this blog (un)structured as a tip of the hat to the core of the challenge I see in the enterprise content world and specifically in my corner of that world — the content analytics field. The information we’re working with in ECM is, at its core, assumed to be completely unstructured. We’re working with long form texts that by default require a human to comprehend, analyze and handle.

Yet any process that aims to automate the handling of, extract an understanding of, or derive more value from that unstructured information immediately goes about the process of adding more structure to that information. We add more data about our ‘unstructured’ data (thus the term metadata).  Structured data about our unstructured information becomes the primary language through which the surrounding software ecosystem interacts with the unstructured information. Structured data is the lingua franca of software information management.  (un)structured, as a name, acknowledges that the content has deep and true value, but lending structure is a critically necessary task in order to exploit that content.

The idea of master content and incorporating trusted enterprise content into a master data managed single view is emblematic of that need.  The unstructured information provides much of the motivation for synchronizing information from ECM to a single view of a customer. A scanned image of a driver’s license or a long-text description of a customer complaint is valuable in painting the full, in-depth picture of a particular individual and their relationship with a company — but we can’t relate that image back to the profile of the customer without some sort of structured information facilitating the linkage.
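As a rough illustration of that linkage, here is a small Python sketch. Everything in it is hypothetical: the `CustomerRecord` and `ContentItem` classes, the `customer_id` metadata key and the `ecm://` addresses are invented for the example and do not come from any real ECM or master data product. The point is only that a scanned image can be tied to a customer profile when, and only when, some structured key travels with it.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerRecord:
    """Hypothetical master data record for one customer."""
    customer_id: str
    name: str
    content: list = field(default_factory=list)

@dataclass
class ContentItem:
    """Hypothetical piece of unstructured content plus its structured metadata."""
    uri: str
    doc_type: str
    metadata: dict

def link_content(customers, items):
    """Attach each item to the customer named in its metadata; return the rest."""
    by_id = {c.customer_id: c for c in customers}
    unlinked = []
    for item in items:
        customer = by_id.get(item.metadata.get("customer_id"))
        if customer is None:
            unlinked.append(item)  # no structured key, so nothing to link on
        else:
            customer.content.append(item)
    return unlinked

customers = [CustomerRecord("C-1001", "Pat Example")]
items = [
    ContentItem("ecm://scans/license-0042.tif", "drivers_license",
                {"customer_id": "C-1001"}),
    ContentItem("ecm://scans/unknown-0043.tif", "complaint_letter", {}),
]
unlinked = link_content(customers, items)
print(len(customers[0].content), len(unlinked))
```

The second item is just as valuable as the first, but with no metadata it ends up in the `unlinked` pile and never becomes part of anyone’s single view.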

(un)structured is about the importance of structured information to delivering new business value from unstructured content.

Content classification lends more structure. By doing so, individual content decisions can be automated and executed consistently.

Content analytics lends more structure. By doing so, content can be explored and new insight can be uncovered.

Master content is identified by the structure associated with content. By doing so, content can be linked to trusted single views.

In the coming weeks, I’ll focus more on classification and content analytics and how they can better empower decision making on content decommissioning. Stay tuned.

Written by Josh Payne

January 19, 2010 at 2:53 pm

What is Master Content?

with 2 comments

Earlier in Google’s life as the key access point for public information, I’d pop open the now familiar search page, run a query and click on the first result to find a page with the information for which I was looking. But frequently, I’d be forced to take pause. More often than not, I’d say to myself “well, the information is here, but who is this person that put it up? I’ve never heard of them. Why should I trust this seemingly random webpage?”

As both the internet and Google matured, this problem of trust has been increasingly addressed. In part, I’ve developed a better sense of what to trust, but more critically Google has improved their search result delivery to take into account the trustworthiness and quality of the source. It’s no coincidence that Wikipedia is almost always near the top of Google’s search results for a variety of queries. Google trusts Wikipedia.

The same idea applies within the walls of your enterprise. In large organizations, different departments and agencies have varying levels of thoroughness and quality associated with their business processes and by extension their content. I experience this every day – I work at IBM, a company of 300,000 prolific content creators.

Let’s say you work at a bank and a customer has made an inquiry about their account — they want to execute a major transaction with you and you need to make a decision on whether to approve the transaction. If you’re looking for information about that particular customer, you need to know that you can trust the information you find. Is it timely or out of date? Who created it?  Were the right controls in place when it was created?  Did the right employees review it? Has it been processed completely?  Has the lifecycle of this document been managed properly? Should it have been disposed of already?

When you are trying to make a decision about a particular customer — you need to focus on the information you can trust — ideally the master data and the associated trusted content. This trusted content that informs your view of your customer or other entity is “master content.”  Master content is not only a collection of federated content sources from different departments, but also the high-quality, authoritative, timely content from across your organization.  It is more than a comprehensive catalog because some level of perspective has been brought to ensure that master content has been filtered out from the vast volumes of content your organization has created, likely stored in many, many different enterprise content repositories in many, many different departments.

Master content is the information your organization should be staking its daily decisions upon — without worrying about the lineage and reliability of the originating author.  The decision makers are more efficient — and better informed — when master content is accessible and delivered, ideally as part of a complete single view.
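The trust questions above can be turned into a simple filter. The Python sketch below is only an illustration: the field names (`reviewed`, `processing_complete`, `last_updated`, `dispose_after`) and the one-year freshness threshold are assumptions of mine, not part of any product or standard, and a real organization would define its own criteria.

```python
from datetime import date, timedelta

def is_master_content(doc, today, max_age_days=365):
    """Return True if a document passes a few illustrative trust checks."""
    return (
        doc["reviewed"]                       # did the right employees review it?
        and doc["processing_complete"]        # has it been processed completely?
        and today - doc["last_updated"] <= timedelta(days=max_age_days)  # timely?
        and (doc["dispose_after"] is None or today <= doc["dispose_after"])
    )

today = date(2010, 1, 14)
docs = [
    {"title": "Signed account agreement", "reviewed": True,
     "processing_complete": True, "last_updated": date(2009, 10, 1),
     "dispose_after": None},
    {"title": "Draft credit memo", "reviewed": False,
     "processing_complete": True, "last_updated": date(2009, 12, 20),
     "dispose_after": None},
    {"title": "Old address form", "reviewed": True,
     "processing_complete": True, "last_updated": date(2006, 5, 2),
     "dispose_after": date(2009, 5, 2)},
]
trusted = [d["title"] for d in docs if is_master_content(d, today)]
print(trusted)
```

Only the signed agreement survives: the draft memo was never reviewed, and the old address form is both stale and past its disposal date.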

Written by Josh Payne

January 14, 2010 at 12:47 pm

Posted in Master Content

You want a copy of my driver’s license? . . . Again?

with 2 comments

From nytimes.com, via New York State Department of Motor Vehicles

“We’ll need a copy of your driver’s license, birth certificate and wedding certificate.”

I need to dig this information out. Again. Sometimes I have it in my wallet. Sometimes I have to dig around my files. Sometimes I have to go to the safety deposit box. The same organization is going to scan another copy of my supporting documentation. My brain is saying “Don’t you already have this? I gave this to you 2 months ago!” but my mouth says “Sure, will do, I’ll fax it right over.”

The same organizations — whether it be insurers, banks, health care providers, government agencies — are constantly asking for the same information from us. The more we interact with the same bureaucracy, the more that bureaucracy asks us for the same information, over and over again. What’s your social security number? Date of birth? Your address? Insurance ID number? And it goes beyond the hard data in forms. Photocopies of this. Faxes of that. Copied and scanned in.


Though we perceive each of these organizations as unified and whole, oftentimes behind the marketed facade of brand uniformity is a patchwork of different departments and lines of business. And each line of business frequently gathers and stores this information for its own needs. At a bank, the mortgage group gathers the information that it needs. And the retail banking group gathers what it needs. And of course the small business group gathers what it needs.

Thus ConglomerateBankofYourCountry keeps asking you for the same information. And they keep storing it.  You and I think that they’re sharing all this information. But in reality, for each transaction, we’re dealing with different people in different organizations who probably come from different acquired companies through merger and acquisition. They all have their own infrastructure with their own business applications on their own servers with their own databases. And each of them stores copies of all this hard data about you as a customer.

An entire discipline in information management has sprung up to solve these kinds of problems — master data management. Master data is high-value core information about customers (and other entities) that an organization stores. So the master data about me as a customer entails hard data like a clean version of my name, my up-to-date, accurate address, my email and other high-quality data that the organization generates about me — like the types of services I do purchase from them . . . and what services I have yet to purchase from them.

Rather than forcing customers to repeatedly deliver the same information to an organization, master data management facilitates the creation of the golden version of data about me as a customer — and then shares this single view of the customer across the organization to different departments.
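Here is a deliberately simplified Python sketch of how a golden record might be assembled from departmental copies. The merge rule used here (“the newest non-empty value wins”) is just one possible rule that I am assuming for illustration; real master data management products apply much richer matching and survivorship logic. The record layout and the `golden_record` function are likewise hypothetical.

```python
from datetime import date

def golden_record(records):
    """Merge departmental records: for each field, keep the newest non-empty value."""
    merged = {}
    # Process the oldest record first so newer values overwrite older ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        for key, value in rec["fields"].items():
            if value:  # skip empty strings and None
                merged[key] = value
    return merged

records = [
    {"source": "mortgage", "updated": date(2009, 3, 1),
     "fields": {"name": "P. Example", "address": "1 Old Road", "email": ""}},
    {"source": "retail_banking", "updated": date(2009, 11, 15),
     "fields": {"name": "Pat Example", "address": "2 New Street", "email": None}},
    {"source": "small_business", "updated": date(2008, 6, 30),
     "fields": {"name": "", "address": "", "email": "pat@example.com"}},
]
customer = golden_record(records)
print(customer)
```

The name and address come from the most recently updated department, while the email address survives from the oldest record because no newer record has one. That merged result is the single view every department would then share.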

Now, when you’re engaging me in a new transaction, or a new customer service interaction, instead of forcing me as a customer to provide all this information, you as an organization can pull up the master data instead. (I’m only touching on the tip of the iceberg here with respect to the benefits of master data management).

Let’s bring this back to ECM (because this post is too long already)

As the name implies, traditionally, master data management has been focused on structured data as opposed to unstructured content. But as we in the ECM industry know, unstructured information — content — is commonly estimated to make up roughly 80% of the information in an organization. Unstructured information provides more context than simple data. So as organizations work to solve problems with master data management, it’s only appropriate that they begin to incorporate the information in enterprise content management repositories — their trusted content — to fill out their single view of the customer.

Rather than simply incorporating the date of birth in a master data record, organizations can incorporate a scanned image of the birth certificate.

Just as there is a discipline inside organizations to form their structured data into master data, this discipline needs to be expanded to incorporate enterprise content — this “master content” belongs alongside master data to provide greater depth and context to a view of a customer.  As case management applications and records management practices improve the quality and trustworthiness of content, this content — master content — can be delivered to master data solutions to deliver a more complete view of the customer.  More context, readily available, means better customer outcomes.

The net effect? ConglomerateBankofYourCountry won’t be asking you and me for our driver’s license . . . yet again.

Written by Josh Payne

January 11, 2010 at 1:45 pm

Posted in Master Content