Archive for the ‘Content Classification’ Category
As my colleagues inside IBM have known for over a week, I’ve decided to leave IBM to pursue other professional opportunities. Before I put out some blog posts on my future, I wanted to use this opportunity to look back at how far the ECM business has come with respect to discovery and content analytics over the last 5 years.
I came to IBM as part of the acquisition of a small enterprise search vendor — iPhrase Technologies. We joined up with a product team inside IBM building a product called “Information Integrator OmniFind Edition” to attack the enterprise search market. Though we were grouped inside the Content Management organization, we really went about our business independently relative to our ECM brethren, focusing on search solutions and leveraging content analytics technologies for ‘concept searching’.
A year later, FileNet joined IBM and we began to try to apply our search and discovery technologies to ECM-centric business scenarios. As we began to collaborate, one of the first things that struck me about ECM was the treatment of documents. In enterprise search, documents were something to be cracked open by definition — how else to search them?
Yet the ECM world had a tendency to treat a document as an ‘object’ — objects to be handled and managed. It struck me as digital paper shuffling, where the expectation was that ECM existed to ready the document for someone with two eyes to read it and use it (and don’t get me wrong, it was challenging paper shuffling — billions of objects, large-scale scanning — tough, tough problems).
Within this context we set down a path of applying analytics technologies to ECM. Our first step was to weave IBM’s content classification product within the ECM architecture, applying it to compelling scenarios in email archiving and records management. Next, we brought to market an eDiscovery solution built with analytics at its core. These first two steps were exciting but focused attempts at bringing about a better solution to specific ECM problems with content analytics, especially in the information governance market.
Then last year, IBM made our Content Analytics platform generally available. This third step is especially gratifying. Content analytics technologies have moved from being an isolated technology, separate from ECM, to delivering insight about businesses by leveraging the text inside of documents — the insides of these objects.
The embrace and adoption of content analytics is especially gratifying for me personally. Though I had but a small role, the change inside IBM ECM and externally amongst customers, analysts and others is stark relative to when I joined IBM. Content is no longer simply an ‘object’ to be managed — it’s an asset to be leveraged, and this is a striking difference. I am confident that in the coming months and years this will increasingly become the accepted attitude and approach in ECM.
On that note, I want to thank folks for reading this blog on the topics of content classification and content analytics. For folks who are interested in more writing on information lifecycle governance, Craig Rhinehart continues to write on this topic at his blog.
Since my professional life will take me away from content analytics in the near term, I expect that this blog will start to reflect the new paths I’ll be following on my professional, post-IBM journey.
I hope you’ll continue to read as my journey takes these exciting new steps.
“What is the accuracy of your product?”
I’ve probably been asked that question in every presentation on content classification I’ve given since I first started working on IBM’s classification product over three years ago.
I know two things when I’m asked the question: that the inquisitor wants a short answer and that the answer isn’t as simple as the inquisitor expects.
The way the question is framed – the simple straightforward request for accuracy results – implies an underlying assumption that the proper categorization of content in a business scenario absolutely and definitively exists. I was reminded of this as I read a nice study on the accuracy of document categorization, written by the eDiscovery Institute and published this year. It stated:
Ultimately, measurement of accuracy implies that we have some reliable ground truth or gold standard against which to compare the classifier, but such a standard is generally lacking for measure of information retrieval in general and for legal discovery in particular.
The paper, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, explores the accuracy of automated document classification, specifically in comparison to human based alternatives. In the paper, the authors explore whether automated classification is a reasonable alternative to use when categorizing documents in a legal discovery review. The authors worked with a corpus of documents from a real regulatory inquiry.
The original lawyers involved in the case had categorized the documents. This is a ready-made training set from which the computer-based classifiers could learn and is exactly what the authors did. In turn, these well-trained classifiers categorized other content gathered for the case.
Yet, to assess the quality of their automated classification methods, the authors didn’t compare the automated results against the results of the original reviewers. Rather, they tasked an entirely new set of human reviewers (“re-reviewers”) with classifying documents from the corpus.
To derive their conclusions, the authors compared the results of these re-reviewers with those of the automated classifiers. I think of this as a fair fight – comparing the computers’ results with those of humans performing the same task.
The human re-reviewers agreed with the original reviewers approximately 79.8% of the time.
Not exactly the kind of consistent accuracy we expect out of our reliable employees, is it?
Based on this level of disagreement, the authors have illustrated their assertion that there really can’t be a reliable ‘gold standard’ of truth in categorization of documents. The ‘right’ answer is not so easily identified in every case – in most cases, in fact.
By comparison, automated methods agreed with the original reviewers over 80% of the time.
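The agreement numbers above are straightforward to compute. Here is a minimal sketch of the measure in Python, using made-up review labels (the “R”/“N” tags and the label lists are my own hypothetical example, not data from the study):

```python
def agreement(labels_a, labels_b):
    """Fraction of documents on which two reviewers assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical labels: R = responsive, N = non-responsive
original    = ["R", "N", "R", "R", "N", "N", "R", "N", "N", "R"]
re_reviewer = ["R", "N", "N", "R", "N", "N", "R", "N", "R", "R"]
print(agreement(original, re_reviewer))  # 0.8
```

On a real corpus you would report this over thousands of documents, and likely break it out into precision and recall as the paper does, but the core comparison is just this fraction.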
So what did I learn from this paper?
1) The human reviewers aren’t perfect. The human re-reviewers aren’t perfect. And of course the automated replacements for the human analysis aren’t perfect. I tend to give human classifiers too much credit, in fact. No method is perfect.
But . . .
2) The fact that automated classification can do just as well, if not slightly better than the human re-reviewers leads the authors to conclude that “employing a system like one of the two systems employed in this task will yield results that are comparable to the traditional practice in discovery and would therefore appear to be reasonable.”
And that is the key – the software isn’t perfect. But neither are the motivated, knowledgeable humans. And the automated methods, though a bit more mysterious, give comparable results – at a fraction of the cost.
I just read with great interest Steven Levy’s article in Wired on Google’s search algorithm and how Google works to improve it. A couple of things leaped out at me as concepts I’ve discussed here in the past (or on my old blog), as the concepts extend into the enterprise. Just as Google uses them to improve their consumer search experience, you can leverage them within the context of better information governance.
1) Google uses document context similar to how I have described advanced content classification as a “context-based” method of classifying information. Levy writes:
Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
Google uses the context of the content it indexes to better understand the purpose and intent of a particular document, and in turn the purpose and intent of your particular search query. Advanced content classification methods deliver better categorization results in a similar way — they use the full context of the training documents provided to them to produce better results.
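To make the “hot dog” anecdote concrete, here is a toy sketch of the co-occurrence idea (this is my own illustration, not Google’s actual system): count which words appear alongside a phrase across documents, and let that context signal what the phrase means.

```python
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "hot dog with mustard on bread at the baseball game",
    "hot dog stand sold mustard and relish near the stadium",
    "the dog barked at the puppy in the hot sun",
]

def context_counts(phrase, documents):
    """Count words co-occurring with `phrase` in documents that contain it."""
    counts = Counter()
    for doc in documents:
        if phrase in doc:
            counts.update(w for w in doc.split() if w not in phrase.split())
    return counts

ctx = context_counts("hot dog", docs)
# "mustard" co-occurs twice with "hot dog": the context signals food,
# not an overheated animal.
print(ctx["mustard"])  # 2
```

At Google’s scale this runs over billions of documents with far more sophisticated statistics, but the principle is the same: surrounding words define the term.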
2) When discussing ‘trusted content’, I used the example of how Google trusts some sources over others. At the time, I didn’t have a source for this assertion. Levy describes this in some detail in the article:
That same year, an engineer namedKrishna Bharat, figuring that links from recognized authorities should carry more weight, devised a powerful signal that confers extra credibility to references from experts’ sites. (It would become Google’s first patent.) The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals.
Do read the entire article if you’re interested in these topics — given our universal reliance on Google as consumers, it’s certainly beneficial to be an educated consumer. And these concepts can extend into better proactive management of your enterprise content.
When I’m explaining advanced content classification to audiences and why these methods are more powerful and accurate, I frequently tell this story. It illustrates the value of advanced methods of classification over more rudimentary, rule-based approaches. I think audiences have found it helpful.
I emphasize the fact that advanced methods gain their greater accuracy from taking the ‘full context’ of the long-form text into account. Advanced methods aren’t just taking one or two words into account — they’re taking into account hundreds and thousands of words, weighing each word’s significance and coming up with a cumulative, holistic assessment of the similarity of the text to each category. Hundreds and thousands of factors are being weighed. By comparison, a simple keyword-based rule for categorization isn’t taking innumerable factors into account. It’s just taking one.
The analogy I draw is the difference between an adult’s ability to categorize and my daughter’s. Let me explain.
My daughter is 18 months old. One of her very first words was “duck.” [I believe the order of new words was “Mama,” “Duck,” 15 other words, then “Dada.” I digress.]
In that first set of words, duck was pretty much alone relative to other animals. There might have been a “dog” in there, but that was about it when it came to naming things in the animal kingdom. Certainly it was the only bird she knew in the bird class of the animal taxonomy.
So when she saw a duck, she enthusiastically blurted out “DUCK!”
And when she saw a pigeon, she exclaimed “DUCK!”
And when she saw a hawk, she of course shouted out “DUCK!”
If it had wings and a beak, she named it a duck.
Why? Because her relatively immature mind was only taking a few factors into account. We as adults can say, “yes, it’s got a beak, but its beak is pretty sharp, and its feet aren’t webbed and . . . well, therefore it’s a hawk, not a duck.”
My daughter was acting like a simple rules-based classifier. She took one or two key factors into account and made her decision.
We worked with her and now she can distinguish between birds and ducks. She’s making progress. Her brain is constantly in ‘upgrade’ mode. You should look into upgrading your classification methods too if you’re only focused on rules-based approaches.
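The duck story can be sketched in a few lines of code. This is purely my analogy made literal (not any product’s algorithm): a one-factor rule versus a classifier that weighs several features, the way an adult does.

```python
def toddler_rule(animal):
    # One factor: wings and a beak means "duck".
    return "duck" if animal["has_beak"] and animal["has_wings"] else "not a bird"

def adult_classifier(animal):
    # Many weighted factors combine into a holistic score.
    # The weights here are arbitrary, chosen only for illustration.
    duck_score = (
        2 * animal["webbed_feet"]
        + 1 * animal["flat_bill"]
        - 2 * animal["sharp_beak"]
        - 2 * animal["talons"]
    )
    return "duck" if duck_score > 0 else "some other bird"

hawk = {"has_beak": True, "has_wings": True,
        "webbed_feet": False, "flat_bill": False,
        "sharp_beak": True, "talons": True}
print(toddler_rule(hawk))      # duck
print(adult_classifier(hawk))  # some other bird
```

Advanced content classifiers work the same way, except the “features” are the hundreds and thousands of words in the text, each with a learned weight.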
In a previous post, I emphasized the importance of rigorous, controlled testing when assessing the potential of content analytics. This is especially important for content classification when it is being used to replace human decision-making. My broader point in that post was that when adopting new technology, you can’t rely on the qualitative perception of the skeptical observer.
A similar topic, that of adoption of technology in the legal profession, came up at the keynote to LegalTech last week. Law.com recounts Dr. Lisa Sanders’ response, which was far more eloquent than my post so I wanted to pass it along here:
During the question-and-answer session, Kelley Drye & Warren Practice Development Manager Jennifer Topper asked the panel how to convince litigators to use tools like technology and decision trees, repeatable processes that can help make handling similar cases more efficient.
It’s a long process of changing attitudes within a corporate culture, Dr. Sanders said. “Mistakes made by a computer or guideline live forever in the minds of people watching them. Mistakes made by people are forgivable.”
I guess that’s why she writes for the New York Times . . .
(the LegalTech keynote has been quite the blogging gift this week. Maybe I should volunteer to staff the IBM booth next year and get the scoop first hand)
I read up on some of the goings-on at Legaltech in New York city last week. A couple of things caught my eye from the write-up on legalcurrent.com.
1) I found it interesting, as I tweeted earlier in the day, that David Craig of Thomson-Reuters used the term “Tsunami of information”. We currently host a whitepaper from Cohasset Associates entitled “Meet the Content Tsunami Head On: Leveraging Classification for Compliant Information Management.” It will be interesting to see if that descriptor gains traction in the marketplace.
2) Malcolm Gladwell is hitting the information management circuit, isn’t he? First IOD last fall, now Legaltech. (I hope he continues it; he was hands down the most interesting keynote speaker I’ve seen at a tradeshow. Effective in tying his storytelling back to the themes of the show itself).
3) Lastly, Gladwell, as recounted in the writeup, referenced a story about the chess master Garry Kasparov:
Gladwell pointed to a Kasparov chess challenge in which both opponents used a computer throughout the match. Kasparov saw that the computer’s quick analysis of every possible move enabled these grandmasters to let their experience, creativity and knowledge come through.
That’s a nice summary of the core of my argument for content classification specifically, and content analytics more broadly, within the context of information governance.
One way to read that quote is that content classification frees up the mind of our knowledge workers such that they can focus on the truly complex matters and truly human endeavors that require our most valuable skills. Leave the mundane grunt work to the computers, automated, in the background.
Seen differently, when computers automatically and intelligently provide the top, best choices for humans — assisting in the classification of information without completely automating the task — humans are left to focus their brain power on the finer points of the decision-making process, and as such come to better conclusions.
Either way, I thought it an interesting view on the role of automated analysis in relation to typically human-based decision making. Dehumanizing the analysis led to better, more humane results.
I’m quite enamored with Twitter. It’s my main source of information and news, especially on the weekends, as I’m rarely in front of a computer and it delivers interesting tidbits to the BlackBerry in my pocket. And it’s certainly the best way I’ve found to keep my finger on the pulse of goings-on in the niche relevant to my professional interests: ECM, information governance and records management.
This tweet, from @MimiDionne a couple of weekends ago caught my eye:
My initial response: Mimi, this is no joke! My 18 month old wasn’t that interested in hearing about cost savings and information governance at the time, so I returned to our conversation about ducks and birds and swallowed my observations until now.
It’s always perilous to read too much into 140-character observations, but my instant reaction to the tone of the tweet was that she was embarking on something that the general records management community would view as quixotic. Mimi is probably with me on the potential for cost savings, but the rest of the community is probably not, as reflected in her joking tone.
Information governance initiatives, and that very much includes records management, can indeed be better for your budget. My friends at IBM who are focused exclusively on our records management product have been helping our customers calculate the ROI with the “No Paper Weight” initiative for the past few years. But when I view information governance through the lens of content analytics, I see even greater possibilities for easing your budget.
One of the key values of content analytics technology is that it is a substitute for human analysis of documents. And because those documents have become so numerous in our organizations, the cost to analyze them has become very high. The inability to execute analysis of your documents becomes an implicit assumption in how you plan your records management and information governance projects.
By adopting content analytics as a core element of your information lifecycle governance strategy, you can shatter this assumption and doing so reshapes your budgeting in two critical ways, cost prevention and cost reduction:
1) Document by document decision making for cost prevention. Technologies like automated content classification can augment, improve efficiency of, or outright replace document by document decision making. By using content classification and other analytics approaches to better automate your governance decisions, you’ll improve your organization’s productivity by letting the general population of users focus on their ‘real’ work and leave records management decision making out of their lives. More productive workers means a positive impact on your budget.
2) Content Decommissioning for cost reduction. More compelling is the savings possible by gaining control and governing the lifecycle of your important information, and decommissioning the rest. Too frequently, we hear from customers that they keep their information active in its original store because they don’t know what’s important and it would be too difficult (i.e. involve a costly analysis of the documents) to figure out what to preserve. But the cost implications of retiring the systems that store this content are compelling. If we can reduce the barriers that prevent organizations from sifting through the information and picking out the relevant, valuable content (the content that a records manager would say is a record), then we can unlock the budget-friendly implications of better information lifecycle governance. Content analytics does that.
Once you’ve assessed legacy content stores and picked out and preserved the valuable content, you have opened up new worlds of cost saving:
– File system storage, which assumes high availability and rapid access, can be cut. Content is decommissioned and your disk purchasing budget for the next year can go down. You simply don’t need as much as before.
– Administration costs. Less storage means less cost to administer that disk: lower power costs, cooling costs and of course human costs. Most organizations model a fully burdened cost for storage. Decommission your content and you can decommission much of your fully burdened, ongoing cost.
– Further, the information you do keep can be sent off to lower cost storage as typically that information will be determined to require less frequent access than actively generated content.
– Application maintenance is also an implication of legacy, uncontrolled content. Content is frequently stored not in totally uncontrolled ways (file systems) but rather in partially controlled places as part of software business applications (like a CRM system or a knowledge management application). And those business applications come with ongoing maintenance costs. Sometimes these costs come in the form of explicit maintenance fees from software vendors. Sometimes they are billings from IT organizations for the human cost of maintaining the application. But if you’re able to identify the important information inside those applications, preserve it, apply lifecycle governance to it and decommission the complete, originating application, you’re certain to cut costs.
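The shape of the savings calculation is simple back-of-envelope arithmetic. All the numbers below are entirely made up for illustration; plug in your own storage footprint and burdened rates:

```python
# Hypothetical inputs -- replace with your organization's actual figures
legacy_tb = 100              # assumed terabytes sitting in legacy stores
keep_fraction = 0.15         # assumed share worth preserving as records
burdened_cost_per_tb = 3000  # assumed fully burdened annual $ per TB (tier 1)
archive_cost_per_tb = 800    # assumed annual $ per TB on lower-cost storage

before = legacy_tb * burdened_cost_per_tb                 # keep everything, tier 1
after = legacy_tb * keep_fraction * archive_cost_per_tb   # keep records, archive tier
print(before - after)  # annual savings under these assumptions: 288000
```

The point isn’t the specific numbers; it’s that once content analytics tells you which 15% (or whatever the real fraction is) to preserve, the rest of the line items — disk, power, administration, application maintenance — fall out of the budget.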
Yes, lifecycle governance can be expensive. But there are savings to be had once you break down the implicit assumption that understanding your information is prohibitively costly; with content analytics, it can be done cost effectively.