As my colleagues inside IBM have known for over a week, I’ve decided to leave IBM to pursue other professional opportunities. Before I put out some blog posts on my future, I wanted to use this opportunity to look back at how far the ECM business has come with respect to discovery and content analytics over the last five years.
I came to IBM as part of the acquisition of a small enterprise search vendor, iPhrase Technologies. We joined up with a product team inside IBM building a product called “Information Integrator OmniFind Edition” to attack the enterprise search market. Though we were grouped inside the Content Management organization, we really went about our business independently of our ECM brethren, focusing on search solutions and leveraging content analytics technologies for ‘concept searching’.
A year later, FileNet joined IBM and we began to apply our search and discovery technologies to ECM-centric business scenarios. As we began to collaborate, one of the first things that struck me about ECM was its treatment of documents. In enterprise search, documents were, by definition, something to be cracked open: how else could you search them?
Yet the ECM world had a tendency to treat a document as an ‘object’: something to be handled and managed. It struck me as digital paper shuffling, with the expectation that ECM existed to ready the document for someone with two eyes to read it and use it. (And don’t get me wrong, it was challenging paper shuffling: billions of objects, large-scale scanning. Tough, tough problems.)
Within this context we set down a path of applying analytics technologies to ECM. Our first step was to weave IBM’s content classification product within the ECM architecture, applying it to compelling scenarios in email archiving and records management. Next, we brought to market an eDiscovery solution built with analytics at its core. These first two steps were exciting but focused attempts at bringing about a better solution to specific ECM problems with content analytics, especially in the information governance market.
Then last year, IBM made our Content Analytics platform generally available. This third step is especially gratifying. Content analytics technologies have moved from being an isolated technology, separate from ECM, to delivering insight about businesses by leveraging the text inside documents: the insides of these objects.
The embrace and adoption of content analytics is especially gratifying for me personally. Though I had but a small role, the change inside IBM ECM and externally amongst customers, analysts and others is stark relative to when I joined IBM. Content is no longer simply an ‘object’ to be managed; it’s an asset to be leveraged, and that is a striking difference. I am confident that in the coming months and years this will increasingly become the accepted attitude and approach in ECM.
On that note, I want to thank folks for reading this blog on the topics of content classification and content analytics. For folks who are interested in more writing on information lifecycle governance, Craig Rhinehart continues to write on this topic at his blog.
Since my professional life will take me away from content analytics in the near term, I expect that this blog will start to reflect the new paths I’ll be following on my professional, post-IBM journey.
I hope you’ll continue to read as my journey takes these exciting new steps.
Last week I gave 8 talks on the topic of content analytics over the course of 2 regional marketing events in Washington DC and Atlanta. Having given that many talks on related topics in such a short time period, I found myself locking in on a few key statistics and facts, and I was reminded of that as I read Craig Rhinehart’s most recent missive on his blog. In my talks last week I similarly made the point that the “save everything” ethos described by Craig is losing steam. Why? The cost of storage isn’t dropping as quickly as information is being generated. Organizations are coming to the realization that it’s simply not cost effective to ‘throw storage’ at the problem. The statistic I found myself using repeatedly last week was cited in a recent Forrester blog posting:
It’s no surprise that Forrester clients report their storage capacity requirements are growing 20% to 40% each year. Storage costs have grown to 17% of the IT hardware budget, up from 10% in 2007.
That jump from 10% to 17% is what I found myself repeating last week. Cost per GB is going down every year. But organizations keep on spending more and more of their budget on keeping stuff. Throwing more storage at the problem (and avoiding the cause) has simply led to increased costs across the board. Not the hallmark of an effective, long-term solution.
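The arithmetic behind this trend is easy to sketch. In the toy calculation below, the price and growth figures are hypothetical, chosen only to mirror the ranges cited above: even as the per-GB price falls every year, total spend still climbs, because the volume of data grows faster than the unit price drops.

```python
# Hypothetical figures for illustration: price per GB falls 25% a year,
# while stored volume grows 40% a year (the top of the range cited above).
cost_per_gb = 1.00   # arbitrary starting price per GB
stored_gb = 100.0    # arbitrary starting volume

for year in range(1, 4):
    cost_per_gb *= 0.75               # per-GB price keeps dropping...
    stored_gb *= 1.40                 # ...but volume grows even faster
    spend = cost_per_gb * stored_gb
    print(f"Year {year}: total spend = {spend:.1f}")
```

Since 0.75 × 1.40 = 1.05, total spend compounds upward at 5% a year even while unit prices fall. That is the ‘throw storage at it’ trap in miniature.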
I went on vacation last week. (Side note: though I’ve embraced Twitter, Foursquare and other modern public media platforms, I’ve yet to embrace the idea of broadcasting to the world at large that my house was completely empty and I was 1,000 miles away. Call me old fashioned if you must.)
I mention it not to gloat about how much fun I had with my kids, but to bring up what I did the day before I departed. Again, call me old fashioned, but I typically get my books not from amazon, a bookstore or via an iPad, but from a more cost effective source: the public library. Quaint, I know.
When I go to the library, I can’t go without a plan. I can’t simply browse the stacks to find a good book. Yes, the library is well organized (good classifications!). And each book has good information on the cover describing the contents (standard metadata!) like author and title. But that information exterior to the contents just is not effective in helping me quickly determine the value of a book relative to my needs. I prepare in advance by reading reviews from others – other people who’ve read the books and analyzed their value. Otherwise, finding a good couple of books for my vacation is an overwhelming and frustrating task.
The same idea – expending effort to analyze the long-form text inside content – applies to the content inside your organization. In previous postings I’ve discussed the value of content assessment to your organization. And to execute content assessment you need to execute content analytics. Historic approaches to tackling the content assessment problem have focused on metadata exterior to a document – the title, the author, the dates. This is much like trying to find a library book just by browsing the stacks. Determining what content is necessary to your organization – what content is valuable, requires governance, is legally relevant – is virtually impossible simply by examining data exterior to your content.
Content analytics provides your organization the ability to determine the value of your content by interrogating the interior of those documents. Metadata on the outside of a document is only part of the story. What concepts are covered in the document? Does this document concern itself with a customer? A business partner? Does this document concern itself with a particular business activity?
All of these questions are difficult to answer without examining the text in a document but given the volume of information in your organization, it’s difficult to actually make these assessments on a large scale basis. In my next posting I’ll cover how content analytics can help to answer these valuation questions.
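To make the idea concrete, here is a deliberately tiny sketch of what ‘interrogating the interior’ of a document can mean: scanning the text itself for mentions of customers and business activities that no exterior metadata field would reveal. This is not IBM’s product, and the customer and activity terms are made up; real content analytics relies on linguistic analysis rather than simple substring matching.

```python
# Toy illustration only: the customer and activity term lists are
# hypothetical, and real systems use far richer linguistic analysis.
CUSTOMERS = {"acme corp", "globex"}
ACTIVITIES = {"invoice", "contract", "audit"}

def assess(text: str) -> dict:
    """Look inside the document text, not just its metadata."""
    t = text.lower()
    return {
        "mentions_customer": any(name in t for name in CUSTOMERS),
        "business_activities": sorted(a for a in ACTIVITIES if a in t),
    }

doc = "Please review the attached invoice for Acme Corp before the audit."
print(assess(doc))
# → {'mentions_customer': True, 'business_activities': ['audit', 'invoice']}
```

Even this crude interior inspection answers questions (which customer? which activity?) that title, author, and date fields never could.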
In my previous post, the first in my series on content assessment, I described the information landscape with respect to content. Organizations are facing ever increasing volume, velocity and variety of information. Understanding growing piles of uncontrolled content through content analytics has clear benefits to organizations of every size. Each organization – and the range of stakeholders in those organizations – will benefit from engaging in content assessment. How? In three main ways:
1) There is value to all stakeholders in simply understanding content better through analytics. Dynamically analyzing silos of unmanaged, uncontrolled content via content analytics provides new insight about this information that stakeholders previously did not have. Before, stakeholders simply knew the ‘speeds and feeds’ of a content repository: the number of documents, the size of those documents, etc. Content analytics now delivers insight about the content itself, and that insight leads to better, more informed decision making. Which areas represent the most risk? Where should we start our governance efforts? Where should our priorities lie? What is the projected ROI of better information lifecycle governance?
Today, organizations make these kinds of decisions about their unstructured content repositories with limited data. More likely, they avoid making decisions because they lack this kind of insight. No longer. Improved understanding and insight about your unstructured information leads to better decisions about how to take action.
2) One such action to take is to decommission content, the systems that support that content and the systems that rely upon that content. Decommissioning is primarily an IT concern. IT manages the costs of the information infrastructure. By default, most organizations have been doing nothing with their content, and as a result their infrastructure costs have continued to rise. With an understanding of the content, you can take on these once-avoided decisions with more confidence. By understanding the content in a particular system, you can take action to shut those systems down and save costs.
3) There is a flip side to decommissioning old content and the systems that support it: by understanding content, you will be empowered to preserve the content that is necessary. Preserving the necessary content is what enables the decommissioning you want to execute.
Content assessment provides you the ability to identify content that is valuable. This makes line-of-business users happy; they resist decommissioning because they don’t want you to throw away ‘something they’ll need’ in the future.
Content assessment provides you the tools to identify content that requires lifecycle governance. The compliance officers and records managers will be happy because your organization’s obligations will be met in a documented process. You will be taking steps to enforce your content policies on disposition of content while still working to control your costs.
Content assessment provides you the tools to identify content that is legally relevant. The lawyers will be happy because they can use it to find the information relevant to legal cases where it resides in uncontrolled environments, and exert the kind of control the eDiscovery process demands.
Three main ways content assessment delivers value to your organization: via understanding of your content on its own; via decommissioning and the consequent reduction of IT cost; via preservation and governance that fulfill the needs of line-of-business stakeholders and compliance-minded stakeholders alike.
Next in the content assessment series . . . what content is ‘necessary’ to your organization and how does content analytics help to make this determination?
First in a series of posts on content assessment.
It has been quiet around this here blog. One reason was that the month of March saw two “once in 50 year” rain storms in the Boston area. I got to learn some valuable skills in flood prevention as a result – unfortunately, those lessons came at the cost of activities like blogging and tweeting . . . but I’m back and ready to roll with a series of posts on a topic I’ve been thinking and working on over the past 3 months – content assessment.
I introduced this topic after our original announcement of our content assessment offering. And I’ve spent the last few months talking to IBM customers, analysts and other enterprise content professionals inside IBM. It’s an exciting application of content analytics technology to solve a class of problems that our customers have traditionally ignored . . . and hoped would go away — kind of like my laundry in college. Back then I kept on wearing my clothes day after day, hoping my laundry would magically wash itself. Not surprisingly, the clothes kept piling up. Finally, a random Sunday afternoon would arrive; I’d wake up, bite the bullet and wash my clothes. Ah . . . to be 19 again . . . I digress.
Much as I continuously generated dirty clothes, organizations continue to generate content. And similar to the haphazard piles of laundry in my dorm room, these chaotic, uncontrolled piles of content aren’t cleaning themselves up. And these piles of content are growing at a much faster pace.
In college, I’d wait until I couldn’t stand it anymore. And then I’d take action to take control of my clothing situation. With the velocity, volume and variety of content growth, organizations are hitting a similar stage. They can’t maintain the same ‘do nothing, save everything’ practices about the content. The day has arrived to tackle those piles.
To IT, costs continue to rise (17% of IT budgets are devoted to storage alone, up from 10% just a few years ago). Records managers increasingly realize they can’t rely on users to identify and control business records. Legal needs to find the documents they need for eDiscovery proceedings – and fast. Line of business users need better access to and control of trusted content to better execute their business activities.
These information stakeholders need better control over the information necessary for their business. But to take action to exert that control, they need a better understanding of their content landscape. They see the mounds of content stretching as far as their virtual eye can see. Years of bad content habits have created an intimidating problem that leaves them paralyzed as to how to solve it.
Content assessment solutions – powered by innovations in content analytics – are now ready to meet this challenge. Content assessment solutions deliver the kind of understanding organizations need to make decisions about their content. Empowered with insight about their content via content analytics, organizations can now take action. They can take action by decommissioning the content they no longer need. They can take action by decommissioning the systems and infrastructure that support their unnecessary content. And they will be willing to take these cost-cutting actions because they’ve identified and preserved the content that is necessary to their organization.
In the coming days and weeks, I’ll post more in this series on content assessment – covering in more detail who benefits from content assessment, what those benefits are, and the key elements of a content assessment solution. It’s an exciting new solution area.
You can’t avoid grappling with the piles of content . . . just as I couldn’t avoid doing laundry. If your content governance practices are analogous to my college laundry habits, content assessment is an idea you need to learn more about.
“What is the accuracy of your product?”
I’ve probably been asked that question in every presentation on content classification I’ve given since I first started working on IBM’s classification product over three years ago.
I know two things when I’m asked the question: that the inquisitor wants a short answer and that the answer isn’t as simple as the inquisitor expects.
The way the question is framed – the simple straightforward request for accuracy results – implies an underlying assumption that the proper categorization of content in a business scenario absolutely and definitively exists. I was reminded of this as I read a nice study on the accuracy of document categorization, written by the eDiscovery Institute and published this year. It stated:
Ultimately, measurement of accuracy implies that we have some reliable ground truth or gold standard against which to compare the classifier, but such a standard is generally lacking for measure of information retrieval in general and for legal discovery in particular.
The paper, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, explores the accuracy of automated document classification in comparison to human-based alternatives. Specifically, the authors explore whether automated classification is a reasonable substitute when categorizing documents in a legal discovery review. The authors worked with a corpus of documents from a real regulatory inquiry.
The original lawyers involved in the case had already categorized the documents, providing a ready-made training set from which computer-based classifiers could learn, and that is exactly how the authors used it. In turn, these trained classifiers categorized other content gathered for the case.
Yet, to assess the quality of their automated classification methods, the authors didn’t compare the automated results against the results of the original reviewers. Rather, they tasked an entirely new set of human reviewers (“re-reviewers”) with classifying documents from the corpus.
To derive their conclusions, the authors compared the results of these re-reviewers with those of the automated classifiers. I think of this as a fair fight: comparing the results of the computers against humans executing the same task.
The human re-reviewers agreed with the original reviewers approximately 79.8% of the time.
Not exactly the kind of consistent accuracy we expect out of our reliable employees, is it?
Based on this level of disagreement, the authors have illustrated their assertion that there really can’t be a reliable ‘gold standard’ of truth in categorization of documents. The ‘right’ answer is not so easily identified in every case – in most cases, in fact.
By comparison, automated methods agreed with the original reviewers over 80% of the time.
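The agreement figures quoted above are straightforward to compute: count the documents on which two reviewers assigned the same label, and divide by the total. A minimal sketch, with hypothetical responsiveness calls on five documents:

```python
# Hypothetical labels from two reviewers on the same five documents.
original_review = ["responsive", "not", "responsive", "not", "responsive"]
re_review       = ["responsive", "not", "not",        "not", "responsive"]

matches = sum(a == b for a, b in zip(original_review, re_review))
agreement = matches / len(original_review)
print(f"{agreement:.0%}")  # → 80%
```

Published studies typically also report chance-corrected measures such as Cohen’s kappa, since raw agreement overstates consistency when one label dominates the corpus.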
So what did I learn from this paper?
1) The human reviewers aren’t perfect. The human re-reviewers aren’t perfect. And of course the automated replacements for the human analysis aren’t perfect. I tend to give human classifiers too much credit, in fact. No method is perfect.
But . . .
2) The fact that automated classification can do just as well as, if not slightly better than, the human re-reviewers leads the authors to conclude that “employing a system like one of the two systems employed in this task will yield results that are comparable to the traditional practice in discovery and would therefore appear to be reasonable.”
And that is the key – the software isn’t perfect. But neither are the motivated, knowledgeable humans. And the automated methods, though a bit more mysterious, give comparable results – at a fraction of the cost.