I just read with great interest Steven Levy’s article in Wired on Google’s search algorithm and how Google works to improve it. A couple of things leaped out at me as concepts I’ve discussed here in the past (or on my old blog), as the concepts extend into the enterprise. Just as Google uses them to improve their consumer search experience, you can leverage them within the context of better information governance.
1) Google uses document context, similar to how I have described advanced content classification as a “context-based” method of classifying information. Levy writes:
Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
Google uses the context of the content it indexes to better understand the purpose and intent of a particular document and, in turn, the purpose and intent of your particular search query. Advanced content classification methods deliver better categorization results in a similar way: they use the full context of the training documents provided to them to produce better results.
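To make the "words are defined by their neighbors" idea concrete, here is a minimal sketch. It is not Google's method or any product's implementation; the tiny corpus, the stopword list, and the `context_words` helper are all made up for illustration.

```python
from collections import Counter

# Tiny made-up corpus; illustrative only, not taken from the article.
DOCUMENTS = [
    "grilled hot dog with mustard on bread at the baseball game",
    "hot dog stand sells mustard and bread near the stadium",
    "the puppy drank water after the dog park",
    "boiling water is hot so keep the puppy away",
]

# Hypothetical stopword list, just enough for this corpus.
STOPWORDS = frozenset(
    {"the", "a", "at", "on", "and", "with", "is", "so", "near", "after"}
)

def context_words(phrase, documents):
    """Count the words that share a document with `phrase`.

    Uses a plain substring match to find the phrase, which is
    good enough for a sketch but too naive for real text.
    """
    counts = Counter()
    phrase_tokens = set(phrase.split())
    for doc in documents:
        if phrase in doc:
            for word in doc.split():
                if word not in phrase_tokens and word not in STOPWORDS:
                    counts[word] += 1
    return counts

hot_dog_context = context_words("hot dog", DOCUMENTS)
print(hot_dog_context.most_common(2))
```

In this toy corpus "hot dog" keeps company with "mustard" and "bread" and never with "puppy", which is the kind of signal the quoted passage describes.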
2) When discussing ‘trusted content’, I used the example of how Google trusts some sources over others. At the time, I didn’t have a source for this assertion. Levy describes this in some detail in the article:
That same year, an engineer named Krishna Bharat, figuring that links from recognized authorities should carry more weight, devised a powerful signal that confers extra credibility to references from experts’ sites. (It would become Google’s first patent.) The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals.
Do read the entire article if you’re interested in these topics. Given our universal reliance on Google as consumers, it’s certainly beneficial to be an educated consumer. And these concepts can extend into better proactive management of your enterprise content.
When I’m explaining advanced content classification to audiences and why these methods are more powerful and accurate, I frequently tell this story. It illustrates the value of advanced methods of classification over more rudimentary, rule-based approaches. I think audiences have found it helpful.
I emphasize the fact that advanced methods gain their greater accuracy from taking the ‘full context’ of the long-form text into account. Advanced methods aren’t just taking one word, or two words, into account; they’re taking into account hundreds and thousands of words, weighing each word’s significance and coming up with a cumulative, holistic assessment of the similarity of the text to each category. Innumerable factors are being taken into account. Hundreds and thousands of factors are being weighed. By comparison, a simple keyword-based rule for categorization isn’t taking innumerable factors into account. It’s just taking one.
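The contrast between one keyword and a cumulative weighing of every word can be sketched in a few lines. The weighted scorer here is a naive-Bayes-style sum of smoothed log-frequencies, chosen only because it is short; it is an assumption for illustration, not a description of how any particular classification product works. The training text and category names are made up.

```python
import math
from collections import Counter

# Hypothetical training text per category (made up for illustration).
TRAINING = {
    "contracts": [
        "the parties agree to the terms of this agreement and invoice schedule",
        "this agreement is governed by the terms signed by both parties",
    ],
    "billing": [
        "please pay the attached invoice by the due date",
        "your invoice total and payment due date are listed below",
    ],
}

def keyword_rule(text):
    """Rule-based: a single keyword decides the category."""
    return "billing" if "invoice" in text.split() else "contracts"

def train(training):
    """Build per-category word counts from the training text."""
    return {
        category: Counter(word for doc in docs for word in doc.split())
        for category, docs in training.items()
    }

def classify(text, model):
    """Sum a smoothed log-frequency for every word, per category."""
    vocabulary = {word for counts in model.values() for word in counts}
    scores = {}
    for category, counts in model.items():
        total = sum(counts.values())
        scores[category] = sum(
            # Add-one smoothing so unseen words do not zero out a category.
            math.log((counts[word] + 1) / (total + len(vocabulary)))
            for word in text.split()
        )
    return max(scores, key=scores.get)

model = train(TRAINING)
document = "the parties agree that each invoice follows the terms of this agreement"

print(keyword_rule(document))     # decided by the single word "invoice"
print(classify(document, model))  # decided by all twelve words together
```

The keyword rule files this sentence under billing because it mentions an invoice, while the cumulative score, which also weighs "parties", "terms" and "agreement", lands on contracts.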
The analogy I draw is the difference between an adult’s ability to categorize and my daughter’s. Let me explain.
My daughter is 18 months old. One of her very first words was “duck.” [I believe the order of new words was “Mama”, “Duck”, 15 other words, then “Dada.” I digress.]
In that first set of words, duck was pretty much alone relative to other animals. There might have been a “dog” in there, but that was about it when it came to naming things in the animal kingdom. Certainly it was the only bird she knew in the bird class of the animal taxonomy.
So when she saw a duck, she enthusiastically blurted out “DUCK!”
And when she saw a pigeon, she exclaimed “DUCK!”
And when she saw a hawk, she of course shouted out “DUCK!”
If it had wings and a beak, she named it a duck.
Why? Because her relatively immature mind was only taking a few factors into account. We as adults can say, “yes, it’s got a beak, but its beak is pretty sharp, and its feet aren’t webbed and . . . well therefore it’s a hawk, not a duck.”
My daughter was acting like a simple rules-based classifier. She took one or two key factors into account and made her decision.
We worked on her and now she can distinguish ducks from other birds. She’s making progress. Her brain is constantly in ‘upgrade’ mode. You should look into upgrading your classification methods too if you’re only focused on rules-based approaches.
In a previous post, I emphasized the importance of rigorous, controlled testing when assessing the potential of content analytics. This is especially important for content classification when it is being used to replace human decision-making. My broader point in that post was that when adopting new technology, you can’t rely on the qualitative perception of the skeptical observer.
A similar topic, that of adoption of technology in the legal profession, came up at the keynote to LegalTech last week. Law.com recounts Dr. Lisa Sanders’ response, which was far more eloquent than my post so I wanted to pass it along here:
During the question-and-answer session, Kelley Drye & Warren Practice Development Manager Jennifer Topper asked the panel how to convince litigators to use tools like technology and decision trees, repeatable processes that can help make handling similar cases more efficient.
It’s a long process of changing attitudes within a corporate culture, Dr. Sanders said. “Mistakes made by a computer or guideline live forever in the minds of people watching them. Mistakes made by people are forgivable.”
I guess that’s why she writes for the New York Times . . .
(the LegalTech keynote has been quite the blogging gift this week. Maybe I should volunteer to staff the IBM booth next year and get the scoop first hand)
Next, an outline of the elements of the Smart Archive Offerings. The content analytics capabilities I discuss here on this blog are a part of the offerings mentioned (InfoSphere Content Assessment, InfoSphere Classification Module):
(Really, this is just an excuse to try out embedding a YouTube video on this blog for the first time).
I read up on some of the goings-on at Legaltech in New York City last week. A couple of things caught my eye from the write-up on legalcurrent.com.
1) I found it interesting, as I tweeted earlier in the day, that David Craig of Thomson-Reuters used the term “Tsunami of information”. We currently host a whitepaper from Cohasset Associates entitled “Meet the Content Tsunami Head On: Leveraging Classification for Compliant Information Management.” It will be interesting to see if that descriptor gains traction in the marketplace.
2) Malcolm Gladwell is hitting the information management circuit, isn’t he? First IOD last fall, now Legaltech. (I hope he continues it; he was hands down the most interesting keynote speaker I’ve seen at a tradeshow. Effective in tying his storytelling back to the themes of the show itself).
3) Lastly, Gladwell, as recounted in the writeup, referenced a story about the chess master Garry Kasparov:
Gladwell pointed to a Kasparov chess challenge in which both opponents used a computer throughout the match. Kasparov saw that the computer’s quick analysis of every possible move enabled these grandmasters to let their experience, creativity and knowledge come through.
That’s a nice summary of the core of my argument for content classification specifically, and content analytics more broadly, within the context of information governance.
One way to read that quote is that content classification frees up the mind of our knowledge workers such that they can focus on the truly complex matters and truly human endeavors that require our most valuable skills. Leave the mundane grunt work to the computers, automated, in the background.
Seen differently, when computers automatically and intelligently provide the top choices for humans (assisting in the classification of information without completely automating the task), humans are left to focus their brain power on the finer points of the decision-making process, and as such come to better conclusions.
Either way, I thought it an interesting view on the role of automated analysis in relation to typically human-based decision making. Dehumanizing the analysis leads to better, more humane results.
I’m quite enamored with Twitter. It’s my main source of information and news, especially on the weekends, as I’m rarely in front of a computer and it delivers interesting tidbits to the BlackBerry in my pocket. And it’s certainly the best way I’ve found to keep my finger on the pulse of goings-on in the niche relevant to my professional interests: ECM, information governance and records management.
This tweet, from @MimiDionne a couple of weekends ago caught my eye:
My initial response: Mimi, this is no joke! My 18 month old wasn’t that interested in hearing about cost savings and information governance at the time, so I returned to our conversation about ducks and birds and swallowed my observations until now.
It’s always perilous to read too much into 140-character-long observations, but my instant reaction to the tone of the tweet was that she was embarking on something that the general records management community would view as quixotic. Mimi is probably with me on the potential for cost savings, but the rest of the community is probably not, as reflected in her joking tone.
Information governance initiatives, and that very much includes records management, can indeed be better for your budget. My friends at IBM who are focused exclusively on our records management product have been helping our customers calculate the ROI with the “No Paper Weight” initiative for the past few years. But when I view information governance through the lens of content analytics, I see even greater possibilities for easing your budget.
One of the key values of content analytics technology is that it is a substitute for human analysis of documents. And because those documents have become so numerous in our organizations, the cost to analyze them has become very high. The inability to execute analysis of your documents becomes an implicit assumption in how you plan your records management and information governance projects.
By adopting content analytics as a core element of your information lifecycle governance strategy, you can shatter this assumption. Doing so reshapes your budgeting in two critical ways: cost prevention and cost reduction.
1) Document by document decision making for cost prevention. Technologies like automated content classification can augment, improve the efficiency of, or outright replace document by document decision making. By using content classification and other analytics approaches to better automate your governance decisions, you’ll improve your organization’s productivity by letting the general population of users focus on their ‘real’ work and leave records management decision making out of their lives. More productive workers mean a positive impact on your budget.
2) Content Decommissioning for cost reduction. More compelling are the savings possible by gaining control of and governing the lifecycle of your important information, and decommissioning the rest. Too frequently, we hear from customers that they keep their information active in its original store because they don’t know what’s important and it would be too difficult (i.e. involve a costly analysis of the documents) to figure out what to preserve. But the cost implications of retiring the systems that store this content are compelling. If we can reduce the barriers that prevent organizations from sifting through the information and picking out the relevant, valuable content (the content that a records manager would say is a record), then we can unlock the budget-friendly implications of better information lifecycle governance. Content analytics does that.
Once you’ve assessed legacy content stores and picked out and preserved the valuable content, you have opened up new worlds of cost saving:
– File system storage, which assumes high availability and rapid access, can be cut. Content is decommissioned and your disk purchasing budget for the next year can go down. You simply don’t need as much as before.
– Administration costs. Less storage means less cost to administer that disk. Lower power costs, cooling costs and of course human costs. Most organizations model a fully burdened cost for storage. Decommission your content and you can decommission much of your fully burdened, ongoing cost.
– Further, the information you do keep can be sent off to lower cost storage as typically that information will be determined to require less frequent access than actively generated content.
– Application maintenance is also an implication of legacy, uncontrolled content. Content is frequently not stored in totally uncontrolled ways (file systems) but rather in partially controlled places as part of software business applications (like a CRM system or a knowledge management application). And those business applications come with ongoing maintenance costs. Sometimes these costs come in the form of explicit maintenance fees from software vendors. Sometimes they are billings from IT organizations for the human cost of maintaining the application. But if you’re able to identify the important information inside those applications, preserve it, apply lifecycle governance to it and decommission the complete, originating application, you’re certain to cut costs.
Yes, lifecycle governance can be expensive. But there are savings to be had once you break down the implicit hurdle and recognize that understanding your information can be done cost-effectively with content analytics.
One of the basic tools of carpentry is a level. Without it, a carpenter (or any weekend Mr. Fixit) is left eyeballing whether or not a particular surface is perfectly horizontal or perfectly vertical. Sure, anyone can ‘eyeball’ the work and say “Yeah, it looks straight to me”, but more often than not, that results in a slanted bookshelf.
Through experience, we learn in carpentry that eyeballing it just isn’t an accurate method — our perception of reality can be misleading.
The same rigor needs to carry through to how we judge the success of content analytics and specifically content classification. Content analytics, like any innovation, is the target of skepticism and misperception. This has been one of the challenges we’ve faced as we push for adoption of our content classification product at IBM, and other products that leverage content analytics.
I bring this up having read an interesting post by Lexalytics, a content analytics vendor, on their investigations of their sentiment analysis accuracy. One of the things that caught my eye in the post was a quote from Forrester analyst Suresh Vittal, who said “in talking to clients who have deployed some form of sentiment analysis, accuracy rests at about 50 percent.”
It’s a pretty casual quote that reflects how too many approach their assessment of the accuracy of content analytics. There’s no reference to rigorous studies. There’s no reference to hard data about the success or failure of content analytics. It is “Oh, I’ve spoken to a bunch of customers and they perceive it to be doing an iffy job.”
This is the wrong way to judge content analytics.
Last year, as our customers were going about their buying decisions on content classification, many would go through limited tests of our content classification capability. There were two types of these content classification “proof of technologies” that went on.
The first were the rigorous tests, run by the customers who followed best practices. They created a reasonably large corpus of pre-categorized documents, and segregated a large portion of this pre-categorized content as a test set. The remaining content was used to train the system. To assess the accuracy of the system, the segregated, pre-categorized test corpus was used. A human was not judging the system document by document as it categorized. Rather, a large, statistically valid sample was run through as a formal control set.
The other type of customer did the opposite. They trained the system and then pulled uncategorized content and asked a human and the Classification Module to categorize side by side.
The results, in terms of accuracy, were the same for both types of customers. For all customers, the top category response was correct about 80% of the time.
The perception of the accuracy for the different types of customers was starkly different.
Those who followed a rigorous approach perceived the automatic classification process to be a success. Those who followed the ‘judge by hand’ approach perceived the system to be unreliable. Why? The human judges have a tendency to latch onto the failures: the misfires are far more memorable in the eyes of the judge than the successes. The misfires are just numerous enough (10-20%) that they seem pervasive. In reality, the vast majority of the results are good.
This is why the Classification Module and its Classification Workbench tool have explicit workflows built in for executing rigorous testing of your training set and potential categorization process. Because eyeballing it leads to the misperception of results, and crooked bookshelves.
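For readers who want to see what the rigorous approach looks like in practice, here is a minimal sketch of a held-out test: set aside part of a pre-categorized corpus, then measure how often the top category matches the known label. It is not the Classification Workbench workflow; the function names, the 30% split, the made-up corpus and the stand-in classifier are all assumptions for illustration.

```python
import random

def split_corpus(labeled_docs, test_fraction=0.3, seed=0):
    """Shuffle pre-categorized (text, category) pairs and hold out a test set."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]  # (training set, held-out test set)

def top1_accuracy(classify, test_set):
    """Fraction of held-out documents whose top category matches the label."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)

# Hypothetical pre-categorized corpus, made up for illustration.
corpus = (
    [(f"invoice {i}", "billing") for i in range(10)]
    + [(f"agreement {i}", "contracts") for i in range(10)]
)

train_set, test_set = split_corpus(corpus)

def toy_classifier(text):
    # Stand-in for a trained model; a real one would be fit on train_set only.
    return "billing" if text.startswith("invoice") else "contracts"

accuracy = top1_accuracy(toy_classifier, test_set)
print(f"{len(test_set)} held-out documents, top-1 accuracy {accuracy:.0%}")
```

The point of the design is that the number comes from a fixed control set scored all at once, so no single memorable misfire gets to color the verdict.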