Monthly Archives: December 2014

Parsing Wikipedia Page Hierarchy

Wikipedia is a set of pages that has relationships with one another. For example, the Anarchism page has a set of categories (located at the bottom of the page). Categories group related pages together, so the categories of Anarchism page are like its parents. One of Anarchism’s categories is Political Culture. It too has (parent) categories [Culture, Political Science, Politics] and (children) subcategories like [Anarchism, Socialism].

I needed to build this graph of pages and here’s how I did it.


First, we need to understand what wikipedia provides and how to use it.

Wikipedia open sources and releases snapshots of its site here [http://dumps.wikimedia.org/enwiki/latest/]

Page

Download enwiki-latest-page.sql.gz (~1.2GB)

Information about this table can be found here.

You should have ~34M rows in this table that represent all the english wikipedia pages. Of course, there are different types of pages.

Articles are ones that have encyclopedic content. These are identified by rows where page_namespace=0.

Category pages are the ones that represent parent-child relationship between pages. These are identified by rows where page_namespace=14.

You can find out about other page types and their namespace here.

Categorylinks

Download enwiki-latest-categorylinks.sql.gz (~1.5GB)

Information about this table can be found here.

You should have ~65M rows in this table that represent the link between articles and categories, as well as categories and subcategories.

The cl_from column corresponds to page.page_id.

The cl_to column corresponds to page.page_title.


Now let’s see the relationship of categorylinks and page tables in action.

I wanted to build the tree of wiki pages from the leaf up and I was only interested in article pages.

Here’s how you would get all the articles:

select * from page where page_namespace = 0;

But let’s start with the example we began this post with: Anarchism.
To find that article specifically

select * from page where page_namespace = 0 and page_title='Anarchism'

We get back a row with page_id=12
Screen Shot 2014-12-03 at 1.18.56 PM

Now, perform this query on categorylinks

select * from categorylinks where cl_from = 12

We get back 19 rows that represent the Anarchism page’s categories
Screen Shot 2014-12-03 at 1.24.31 PM

Notice one of them is “Political_culture”. Next, we query the page table again for this category

select * from page where page_namespace = 14 and page_title='Political_culture'

Notice that the page_namespace is set to 14, which means we only want back category pages. This should return a row with page_id = 21722732
Screen Shot 2014-12-03 at 1.28.44 PM

You should do this for each and every one of 19 Anarchism page categories.

Now you repeat this process. So let’s say we want to find the categories of Category page “Political_culture”

select * from categorylinks where cl_from = 21722732

That gives us 4 categories
Screen Shot 2014-12-03 at 1.31.10 PM

Again, for each of these 4 categories, query the page and continue until we reach all the nodes. The top of this tree will be the Contents category.

Special thanks to svick for helping me understand some of this.

Tagged