(un)structured

Josh Payne on content analytics, enterprise content and information management

Better Content Classification Than My One Year Old

[Image: “DUCK!” (courtesy freefoto.com)]

When I explain advanced content classification to audiences, and why these methods are more powerful and accurate than simpler ones, I frequently tell this story. It illustrates the value of advanced classification methods over more rudimentary, rule-based approaches. I think audiences have found it helpful.

I emphasize the fact that advanced methods gain their greater accuracy from taking the ‘full context’ of the long-form text into account. Advanced methods aren’t just taking one or two words into account; they’re taking into account hundreds or thousands of words, weighing each word’s significance and coming up with a cumulative, holistic assessment of the similarity of the text to each category. By comparison, a simple keyword-based categorization rule isn’t weighing hundreds or thousands of factors. It’s just taking one.
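To make the contrast concrete, here is a minimal sketch in Python. It is not the method of any particular product: the training sentences, category names, and function names are all made up for illustration, and the "advanced" side is a bare-bones naive Bayes style scorer (word counts with add-one smoothing, equal priors) standing in for whatever a real system would use. The keyword rule looks at one word; the scorer lets every known word in the text contribute to a cumulative score per category.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for this illustration.
TRAINING = {
    "finance": [
        "the bank approved the loan and set the interest rate",
        "quarterly revenue and interest payments rose",
    ],
    "nature": [
        "the river bank was covered in reeds and ducks",
        "a hawk circled over the river at dawn",
    ],
}


def keyword_rule(text):
    """Rule-based: a single factor (the word 'bank') decides the category."""
    return "finance" if "bank" in text.lower().split() else "nature"


def train(training):
    """Count how often each word appears in each category."""
    counts = {
        category: Counter(word for doc in docs for word in doc.lower().split())
        for category, docs in training.items()
    }
    vocabulary = set().union(*counts.values())
    return counts, vocabulary


def classify(text, counts, vocabulary):
    """Sum a log-weight for every known word, then pick the best category."""
    scores = {}
    for category, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[category] = sum(
            # Add-one smoothing so an unseen word never zeroes out a category.
            math.log((word_counts[word] + 1) / (total + len(vocabulary)))
            for word in text.lower().split()
            if word in vocabulary
        )
    return max(scores, key=scores.get)


counts, vocabulary = train(TRAINING)
sentence = "ducks swam near the river bank"
print(keyword_rule(sentence))                  # finance (fooled by "bank")
print(classify(sentence, counts, vocabulary))  # nature
```

The keyword rule sees “bank” and stops. The scorer also notices “ducks” and “river,” and the combined evidence outweighs the one misleading word.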

The analogy I draw is the difference between an adult’s ability to categorize and my daughter’s. Let me explain.

My daughter is 18 months old. One of her very first words was “duck.” [I believe the order of new words was “Mama”, “Duck”, 15 other words, then “Dada.” I digress.]

In that first set of words, “duck” was pretty much alone among animal names. There might have been a “dog” in there, but that was about it when it came to naming things in the animal kingdom. Certainly it was the only bird she knew in the bird class of the animal taxonomy.

So when she saw a duck, she enthusiastically blurted out “DUCK!”

And when she saw a pigeon, she exclaimed “DUCK!”

And when she saw a hawk, she of course shouted out “DUCK!”

If it had wings and a beak, she named it a duck.

Why? Because her relatively immature mind was only taking a few factors into account. We as adults can say, “yes, it’s got a beak, but its beak is pretty sharp, and its feet aren’t webbed and . . . well, therefore it’s a hawk, not a duck.”

My daughter was acting like a simple rules-based classifier. She took one or two key factors into account and made her decision.
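Here is the duck story as a small Python sketch. The features come from the story above (wings, beak, webbed feet, sharp beak); the weights and the threshold are numbers I made up purely for illustration, not values learned from data, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bird:
    has_wings: bool
    has_beak: bool
    webbed_feet: bool
    sharp_beak: bool


def toddler_rule(bird):
    """Two factors only: wings and a beak means 'duck'."""
    return "duck" if bird.has_wings and bird.has_beak else "not a duck"


# Made-up weights for illustration: evidence for (+) or against (-) "duck".
DUCK_WEIGHTS = {
    "has_wings": 0.5,
    "has_beak": 0.5,
    "webbed_feet": 2.0,
    "sharp_beak": -3.0,
}
THRESHOLD = 2.0  # also an assumption


def adult_classifier(bird):
    """Weigh every feature, then decide on the cumulative score."""
    score = sum(
        weight
        for feature, weight in DUCK_WEIGHTS.items()
        if getattr(bird, feature)
    )
    return "duck" if score >= THRESHOLD else "not a duck"


duck = Bird(has_wings=True, has_beak=True, webbed_feet=True, sharp_beak=False)
pigeon = Bird(has_wings=True, has_beak=True, webbed_feet=False, sharp_beak=False)
hawk = Bird(has_wings=True, has_beak=True, webbed_feet=False, sharp_beak=True)

for name, bird in [("duck", duck), ("pigeon", pigeon), ("hawk", hawk)]:
    print(name, "->", toddler_rule(bird), "/", adult_classifier(bird))
```

The toddler rule calls all three birds a duck. The weighted version only says “duck” for the one with webbed feet and no sharp beak.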

We worked on her and now she can distinguish between ducks and other birds. She’s making progress. Her brain is constantly in ‘upgrade’ mode. You should look into upgrading your classification methods too if you’re only focused on rules-based approaches.

Written by Josh Payne

February 18, 2010 at 9:40 pm

3 Responses

  1. So where would you put user-generated tags in this context, Josh? Are they akin to several 1yr olds, one of whom knows birds as ducks, another as chickens, and a third by their sound? It seems to me that tag clouds work to get a consensus classification that helps you find some things like the ones you are looking for, but not all.

    Martin Sumner-Smith

    February 19, 2010 at 9:35 am

  2. I’m going to go with highly motivated high-school students. We’re not quite sure if we trust them because we don’t know their background and their trustworthiness (how authoritative are they?), but they’re certainly motivated and enthusiastic, so that gives them a good measure of credibility. And the wisdom of the crowd helps. If they’ve gone out of their way to tag something then that’s certainly a plus for them.
    To stretch the analogy way too far, the corporate employee who is forced to classify information manually is probably the unenthusiastic, grumpy, moody teenager.

    Josh Payne

    February 19, 2010 at 12:57 pm

  3. I did a little “study” a few years ago. I gave 7 people 7 short documents and asked them to name a category for each one. There was NO overlap in the categories they named. There were a total of 49 categories assigned to these 7 documents.

    How does one understand this? Unlike the toddler, these adults looked at a very broad range of features–one could argue too broad a range of features–to support their categorization. That bird before you could actually be categorized in a near infinite number of ways. The specific categories applied will often depend on the specific use or function or context the person has in mind.

    Herbert Roitblat

    February 27, 2010 at 8:55 am

