Term Extraction API and TagCloud.com

One of the most inspiring backend pieces of the Event Browser for me was the innovative use of the Content Analysis Term Extraction API, which Ed describes in his post about the Event Browser:

One of the problems we had were that there were no images in our event feed. We knew we wanted to get images from either Flickr or Yahoo Image Search but it wasn’t immediately obvious how we would get an image from a phrase like “Highlights of the Textiles Permanent Collection at the MH De Young Memorial Museum”. It was another Yahoo, Toby Elliott, who suggested that we use the Content Analysis API and then Image Search. Honestly, I didn’t even know Yahoo had term extraction as a public API. To get the images you see in the demo I concatenate the title and venue to get the most important terms extracted and then use that as the image query. The first thumbnail gets used as the photo for each event. It’s really simple code on my end and all the real work is done by the term extraction service. My favorite example for images is the De Young Museum’s list of events.

I’m a total news and politics junkie (when I worked at CNN.com several years ago, it was like giving an alcoholic a job in a liquor store), so last week I started thinking about how using a tag cloud to represent breaking news from a set of RSS feeds that I choose. Last week being a big news week, I envisioned a dynamic tag cloud where words like “Scooter,” “Miers,” and “indictment” would get huge in breaking news situations to tip me off that something big was going on at that moment.

I started digging around and was getting ready to sit down and write the code, but then I found TagCloud.com, which uses (you guessed it) the Yahoo! Term Extraction API to produce a tag cloud built from RSS feeds you specify in an interface that is nicer than anything I could build quickly myself. No need to re-invent the wheel, so I signed up and had my tag cloud in a few minutes, using these feeds: CNN.com Top News, Fox News (to be fair and balanced), MSNBC, NY Times home page, Washington Post top news, and Yahoo! Top News.

Here is the resulting tag cloud from those sources. TagCloud.com offers a nice “stop words” feature so you can remove common (but useless) words from the tag cloud display. I specified “full story” and “story” as stop words, for example, but I also specified “white” and “supreme” because they were generally represented in the phrases “white house” and “supreme court” (terms which were preserved in the tag cloud even though I specified words within those phrases as stop words). (And check out their implementation guide for simple instructions on how to put your TagClouds on your site in badge form).

Displaying this tag cloud more dynamically in a Konfabulator widget would be cool. . . . TagCloud.com has already done the difficult work using the Term Extraction API, perhaps the most underappreciated API in the Yahoo! API arsenal. Sounds like a fun project.

4 thoughts on “Term Extraction API and TagCloud.com

  1. I myself am a big news and politics junkie, but I fail to understand how I could use tag cloud to help me. Could you explain why it is so useful and what I could do with it in layman’s terms? I am kind of confused here. 😦

  2. Hi everybody!
    TermExtractor, my master thesis, is online at the
    address http://lcl2.di.uniroma1.it.

    TermExtractor is a software package for Terminology Extraction. The software helps a web community to extract and validate relevant domain terms in their interest domain, by submitting an archive of domain-related documents in any format.

    TermExtractor extracts terminology consensually
    referred in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.).
    Documents parsing assigns a greater importance
    to terms with text layouts (title, bold, italic,
    underlined, etc.). Two entropy-based measures, called
    Domain Relevance and Domain Consensus, are then used.
    Domain Consensus is used to select only the terms
    which are consensually referred throughout the corpus
    documents. Domain Relevance to select only the terms
    which are relevant to the domain of interest, Domain
    Relevance is computed with reference to a set of
    contrastive terminologies from different domains.
    Finally, extracted terms are further filtered using
    Lexical Cohesion, that measures the degree of
    association of all the words in a terminological
    string. Accept files formats are: txt, pdf, ps, dvi,
    tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and
    also zip archives.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s