Tutorial :Categorizing Words and Category Values


We were set an algorithm problem in class today, as a "if you figure out a solution you don't have to do this subject". SO of course, we all thought we will give it a go.

Basically, we were provided a DB of 100 words and 10 categories. There is no match between either the words or the categories. So its basically a list of 100 words, and 10 categories.

We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorthmically.

i.e. one of the words is "fishing" the category "sport" --> so this would go into this category. There is some overlap between words and categories such that some words could go into more than one category.

If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.

Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?

Even a keyword DB or something might be helpful ? Anyone know of any free ones?


First of all you need sample text to analyze, to get the relationship of words. A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.

A different approach would be naive bayes text categorization. Sample text with the assigned category are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category, see bayes spam filtering. I don't know how well that works with single words.


Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...

For example (results first):

weather: fish  sport: ball  weather: hat  fashion: trousers  weather: snowball  weather: tornado  

With code (TODO: add threading ;-p):

static void Main() {      string[] words = { "fish", "ball", "hat", "trousers", "snowball","tornado" };      string[] categories = { "sport", "fashion", "weather" };        using(WebClient client = new WebClient()){          foreach(string word in words) {              var bestCategory = categories.OrderByDescending(                  cat => Rank(client, word, cat)).First();              Console.WriteLine("{0}: {1}", bestCategory, word);          }      }  }    static int Rank(WebClient client, string word, string category) {      string s = client.DownloadString("http://www.google.com/search?q=%2B" +          Uri.EscapeDataString(word) + "+%2B" +          Uri.EscapeDataString(category));      var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");      int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;      Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));      return rank;  }  


Maybe you are all making this too hard.

Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.

Or not...depends on the teacher.


So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.

In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.

Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:

  1. Start by more or less randomly assigning name-value pairs.
  2. Output the results.
  3. Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
  4. Input the results of the AMT evaluation back into the system along with the random assignments.
  5. If everything was approved, then we are done.
  6. Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
  7. Return to step 3.

Granted this would entail some financial outlay, but it might also be one of the simplest and accurate versions of the data you are going get on a fairly easy basis.


You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.

Create a set of categorization rules like the one above and see how high an accuracy you get.


Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.


This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.

Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.

I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.

The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?


I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.

Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?


Fire this teacher.

The only solution to this problem is to already have the solution to the problem. Ie. you need a table of keywords and categories to build your code that puts keywords into categories.

Unless, as you suggest, you add a system which "understands" english. This is the person sitting in front of the computer, or an expert system.

If you're building an expert system and doesn't even know it, the teacher is not good at giving problems.


Google is forbidden, but they have almost a perfect solution - Google Sets.

Because you need to unterstand the semantics of the words you need external datasources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or linked pages.


Yeah I'd go for the wordnet approach. Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem. Hope this helps,



Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)

Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,

# S: (n) **fishing**, sportfishing (the act of someone who fishes as a diversion)  * direct hypernym / inherited hypernym / sister term        o S: (n) **outdoor sport, field sport** (a sport that is played outdoors)        + direct hypernym / inherited hypernym / sister term              # S: (n) **sport**, athletics               (an active diversion requiring physical exertion and competition)   

What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.

Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" to "sport" - say, based on the linear distance of the word-chain, or number of occurrences, et al. (should be trivial to find resources on how to construct similarity measures using wordnet. when the prof says "not to use google", I assume he means programatically and not as a means to get information to read up on!)

As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/


My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)

That might be considered cheating, though.


Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong... I would do a few passes. First query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with. 5 is a good start I think) and scan the text. In particular I"ll search for the 10 categories. If a category appears more than x time (x again being some threshold you need to experiment with) its a match. Based on that x threshold (ie how many times a category appears in the text) and how may of the top y pages it appears in you can assign a weigh to a word-category pair. for better accuracy you can then do another pass with those non-google search engines with the word-category pair (with a AND relationship) and apply the number of resulting pages to the weight of that pair. Them simply assume the word-category pair with highest weight is the right one (assuming you'll even have more than one option). You can also multi assign a word to a multiple category if the weights are close enough (z threshold maybe). Based on that you can introduce any number of words and any number of categories. And You'll win your challenge. I also think this method is good to evaluate the weight of potential adwords in advertising. but that's another topic....

Good luck



Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.


Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse then to start reading existing research and benchmarks.

Appart from Google there exist other 'encyclopedic" datasets you can build of, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English language Wikipedia.

Be creative. There is other data out there besides Google.


My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.


My naive approach:

  1. Create a huge text file like this (read the article for inspiration)
  2. For every word, scan the text and whenever you match that word, count the 'categories' that appear in N (maximum, aka radio) positions left and right of it.
  3. The word is likely to belong in the category with the greatest counter.


Scrape delicious.com and search for each word, looking at collective tag counts, etc.

Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.

The knowledge is in the tags.


As you don't need to attend the subject when you solve this 'riddle' it's not supposed to be easy I think. Nevertheless I would do something like this (told in a very simplistic way)

Build up a Neuronal Network which you give some input (a (e)book, some (e)books) => no google needed

this network classifies words (Neural networks are great for 'unsure' classification). I think you may simply know which word belongs to which category because of the occurences in the text. ('fishing' is likely to be mentioned near 'sports'). After some training of the neural network it should "link" you the words to the categories.


You might be able to put use the WordNet database, create some metric to determine how closely linked two words (the word and the category) are and then choose the best category to put the word in.


You could implement a learning algorithm to do this using a monte carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match." If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations since it knows it doesn't belong there. This will get very accurate results.

This will work for the 100 word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The google method would probably work the best, since google's already done a ton of work on it, but barring that you could, for example, pull data from your facebook account using the facebook apis and try to figure out which words are statistically more likely to appear with previously categorized words.

Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ascii text for the name :P

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Next Post »