The Question I Hear Every Day – What is your Accuracy?
“What is the accuracy of your product?”
I’ve probably been asked that question in every presentation on content classification I’ve given since I first started working on IBM’s classification product over three years ago.
I know two things when I’m asked the question: that the inquisitor wants a short answer and that the answer isn’t as simple as the inquisitor expects.
The way the question is framed – the simple straightforward request for accuracy results – implies an underlying assumption that the proper categorization of content in a business scenario absolutely and definitively exists. I was reminded of this as I read a nice study on the accuracy of document categorization, written by the eDiscovery Institute and published this year. It stated:
Ultimately, measurement of accuracy implies that we have some reliable ground truth or gold standard against which to compare the classifier, but such a standard is generally lacking for measure of information retrieval in general and for legal discovery in particular.
The paper, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, examines the accuracy of automated document classification in comparison to human-based alternatives. Specifically, the authors ask whether automated classification is a reasonable alternative for categorizing documents in a legal discovery review. They worked with a corpus of documents from a real regulatory inquiry.
The original lawyers involved in the case had categorized the documents. That gave the authors a ready-made training set from which the computer-based classifiers could learn, and that is exactly how they used it. In turn, these well-trained classifiers categorized other content gathered for the case.
Yet, to assess the quality of their automated classification methods, the authors didn’t compare the automated results against the results of the original reviewers. Rather, they tasked an entirely new set of human reviewers (“re-reviewers”) to classify documents from the corpus.
To derive their conclusions, the authors compared the results of these re-reviewers with those of the automated classifiers. I think of this as a fair fight – comparing the computers’ results with the same task as executed by humans.
The human re-reviewers agreed with the original reviewers approximately 79.8% of the time.
Not exactly the kind of consistent accuracy we expect out of our reliable employees, is it?
Based on this level of disagreement, the authors have illustrated their assertion that there really can’t be a reliable ‘gold standard’ of truth in categorization of documents. The ‘right’ answer is not so easily identified in every case – in most cases, in fact.
By comparison, automated methods agreed with the original reviewers over 80% of the time.
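To make these percentages concrete, here is a minimal sketch of how an agreement rate like the ones above could be computed. The function name and the five sample labels are my own invention for illustration; they are not data or code from the study, and the paper’s actual methodology may differ.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of documents on which two reviews assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both reviews must cover the same documents")
    if not labels_a:
        raise ValueError("need at least one document")
    # Count the documents where the two reviews chose the same category.
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for five documents (made up, not from the study).
original = ["responsive", "responsive", "not", "not", "responsive"]
re_review = ["responsive", "not", "not", "not", "responsive"]
classifier = ["responsive", "responsive", "not", "responsive", "responsive"]

print(agreement_rate(original, re_review))   # humans vs. original reviewers
print(agreement_rate(original, classifier))  # software vs. original reviewers
```

Note that in this toy example both comparisons disagree with the original review on exactly one document, but not on the same one. Agreement with a reference review says how often two reviews match, not which of them is “right.”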
So what did I learn from this paper?
1) The human reviewers aren’t perfect. The human re-reviewers aren’t perfect. And of course the automated replacements for the human analysis aren’t perfect. I tend to give human classifiers too much credit, in fact. No method is perfect.
But . . .
2) The fact that automated classification can do just as well as, if not slightly better than, the human re-reviewers leads the authors to conclude that “employing a system like one of the two systems employed in this task will yield results that are comparable to the traditional practice in discovery and would therefore appear to be reasonable.”
And that is the key – the software isn’t perfect. But neither are the motivated, knowledgeable humans. And the automated methods, though a bit more mysterious, give comparable results – at a fraction of the cost.