Wikipedia's search is fairly standard: full-text keyword matching, really. To improve on this, some researchers in Germany have extracted various types of structured information from Wikipedia and used RDF to build semantic models of it, organized for more intelligent retrieval. The result is a semi-Semantic-Web construct called dbpedia.

The benefit is that we can now run semantic searches. One of the sample queries generates a list of notable scientists and their doctoral advisors. Another finds notable films lasting longer than 5 hours.
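
To give a flavour of what such a query looks like, here is a minimal Python sketch that builds a SPARQL query for long films, plus the URL that would submit it. Nothing here is taken from the dbpedia samples themselves: the endpoint address, the `format` parameter, the `dbo:Film` class and the `dbo:runtime` property (assumed to be in seconds) are all assumptions for illustration, so check the current dbpedia documentation before relying on them. The sketch only builds strings and makes no network call.

```python
from urllib.parse import urlencode

# Assumed public endpoint; not taken from the post.
ENDPOINT = "https://dbpedia.org/sparql"


def long_films_query(min_hours: float, limit: int = 20) -> str:
    """Build a SPARQL query for films running longer than min_hours.

    Assumes films are typed dbo:Film and carry dbo:runtime in seconds.
    """
    min_seconds = int(min_hours * 3600)
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?film ?runtime WHERE {\n"
        "  ?film a dbo:Film ;\n"
        "        dbo:runtime ?runtime .\n"
        f"  FILTER(?runtime > {min_seconds})\n"
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Return a GET URL that would submit the query and ask for JSON results."""
    # The "format" parameter is an assumption about what the endpoint accepts.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{ENDPOINT}?{urlencode(params)}"


query = long_films_query(5)
url = request_url(query)
print(query)
```

The point is less the syntax than the shape of the question: the filter is applied to a typed property of each film, which a plain keyword search cannot do.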


The model has the potential to be extended with domain knowledge, meaning that it can be made aware (for some definition of aware) of meanings of words, via ontologies. The query engine would then be able to understand simple relationships.
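
As a rough illustration of what "understanding simple relationships" could mean, the sketch below keeps a handful of triples in memory and answers a query for all scientists by following subclass links. The class names, predicate names and data are made up for the example, and this is not how dbpedia's query engine works internally; it only shows the idea of an ontology widening a query.

```python
# Toy triple store of (subject, predicate, object). All names are hypothetical.
TRIPLES = {
    ("Physicist", "subClassOf", "Scientist"),
    ("Chemist", "subClassOf", "Scientist"),
    ("Scientist", "subClassOf", "Person"),
    ("Albert_Einstein", "type", "Physicist"),
    ("Marie_Curie", "type", "Chemist"),
    ("Metropolis", "type", "Film"),
}


def subclasses_of(cls: str) -> set[str]:
    """Return cls plus every class reachable through subClassOf links."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for s, p, o in TRIPLES:
            if p == "subClassOf" and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(cls: str) -> set[str]:
    """Return everything typed as cls or as one of its subclasses."""
    classes = subclasses_of(cls)
    return {s for s, p, o in TRIPLES if p == "type" and o in classes}


print(sorted(instances_of("Scientist")))
```

Nobody tagged Einstein as a "Scientist" directly; the query finds him because the ontology says a physicist is a kind of scientist.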

These are toy queries of course. The query language is slightly arcane, and I’m not seeing much domain knowledge in there. However, I can’t help but think that perhaps this would be the way to the Semantic Web, or at least a significant subset of semantically searchable content. Not to force everyone to write semantic markup, but to have some means of generating semantic content from data sources. Above all, you can see a slight hint of the great power and promise of queries against a semantic database.

On a less dreamy note, as mentioned on New Media Hack, given its free license and large data set, Wikipedia seems to be an interesting corpus for various natural language and semantic processing projects, especially since you can download snapshots of the Wikipedia database to your local machine in SQL and XML formats. A dump of the current versions of English pages typically clocks in at a couple of gigabytes, which is very manageable for the processing power available to plain old PCs. Encyclopedias are generally pretty structured, formally written text, and there are categories to distinguish various types of content. It might be an interesting resource to use, especially for NLP or information extraction tasks.
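
For anyone tempted to try this, here is a minimal sketch of reading pages out of an XML dump one at a time, so the whole file never has to sit in memory. The tiny embedded sample stands in for a real dump; its element layout (`page`, `title`, `revision`, `text`) and the namespace URI are assumptions about the export format, so compare them against an actual dump before using this.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file; layout and namespace URI are assumptions.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alan Turing</title>
    <revision><text>Alan Turing was a mathematician.</text></revision>
  </page>
  <page>
    <title>Category:Mathematicians</title>
    <revision><text>Pages about mathematicians.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs one page at a time from a file-like object."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's children to save memory


pages = list(iter_pages(io.BytesIO(SAMPLE)))
# Keep ordinary articles, skipping category pages by their title prefix.
articles = [title for title, _ in pages if not title.startswith("Category:")]
print(articles)
```

With a real dump you would pass an opened file instead of the `io.BytesIO` wrapper and feed each page's text to whatever NLP or extraction step you have in mind.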


4 Responses to “dbpedia - semantic search for Wikipedia”  

  1. Kesava Mallela

    If somebody can reformat questions on Yahoo! answers to this query structure and then let users compare the responses to the ones on Yahoo! answers, we may have a working QA system for wikipedia. I am trying to think if that would work as a positive or negative feedback loop, though!

  2. Ken-ichi

    I will not be satisfied until I can ask the question, “Show me pictures of cartoon villains without necks” and get a picture of Skeletor.

    Speaking of NLP on Wikipedia, there was an interesting /. article a while back on using Wikipedia to detect neologisms by running page titles against WordNet. The relevant paper by Tony Veale at University College Dublin seems to be down, but Google cache to the rescue.

  3. Kesava Mallela

    QA = Quality Assurance (Disambiguating QA in my prev comment)

  1. PageTurner.info
