Web-scraping and cleaning data with R

In my first post I introduced a simple shiny app that I’ve been putting together in my free time over the last few weeks. This is the first of the promised posts about my app - covering how I went about getting and cleaning the data for the app. The full code is on my GitHub page so this post will just be a few highlights.

I’ve broken this down into two sections. The first is a brief overview of what TfL provide in their weekly CSV files, and the second is a (more interesting) look at how I solved some problems with web-scraping in R and the fantastic rvest package. Hopefully I can highlight some of the benefits of rvest, and this might serve as a simple example of web-scraping in R.

TfL Oyster card data wrangling in R

The data provided by the TfL Oyster system are quite messy:

  • only the journey start date is given (meaning some simple manipulation is required to get the date for journeys ending after midnight),
  • both/all stations in a journey are held in one text field and must be parsed out,
  • dates and times are given as text rather than as proper date/time fields, and
  • bus journeys are simply listed as “bus journey on route [bus route]”.

You can see the full extent of the various manipulations I had to make in the code file on GitHub here. Let’s just say that dplyr, stringr and lubridate were all fantastic tools for getting everything sorted into a nice, clean and tidy table.
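To give a flavour of what that involved, here is a minimal sketch of the kind of cleaning dplyr, stringr and lubridate make easy. The column names (date, start_time, end_time, journey) and the date format are assumptions for illustration, not the exact headers TfL provides:

library(dplyr)
library(stringr)
library(lubridate)

CleanJourneys <- function(raw) {
  # raw is assumed to hold date, start_time, end_time and journey columns
  raw %>%
    mutate(
      start_dt = dmy_hm(paste(date, start_time)),
      end_dt   = dmy_hm(paste(date, end_time)),
      # journeys that finish after midnight actually end on the next day
      end_dt   = if_else(end_dt < start_dt, end_dt + days(1), end_dt),
      # split "Station A to Station B" into its two parts
      from     = str_trim(str_extract(journey, "^.*(?= to )")),
      to       = str_trim(str_extract(journey, "(?<= to ).*$")),
      # flag the "bus journey on route ..." records
      is_bus   = str_detect(journey, regex("^bus journey", ignore_case = TRUE))
    )
}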

Looking at the data as provided by TfL, I knew that I could do some simple analysis of journeys, time spent on transport and so on (more on that in a later post). But I wanted to do something cool with the leaflet package and put my data on a map. So, as the raw data don’t come with geographic locations (beyond station names), I’d have to get that data myself.

Web-scraping with R

Wikipedia already contains a clean table of London railway station co-ordinates, so I wrote a quick function to look that table up and grab it. However, it doesn’t cover the underground stations - just mainline rail services - so I had to do some more digging.
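As a rough sketch, that look-up only needs rvest’s html_table(). The URL and the position of the table on the page are assumptions here and may well have changed since:

library(rvest)

GetRailStations <- function(url = "https://en.wikipedia.org/wiki/List_of_London_railway_stations") {
  # pull the first wikitable on the page into a data frame
  url %>%
    html() %>%                        # read_html() in current rvest versions
    html_node("table.wikitable") %>%
    html_table()
}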

Station locations

Looking at the page source for a handful of underground stations I spotted that they all contained the co-ordinates as part of the page. A colleague suggested trying some web-scraping to get at them straight from Wikipedia. I hadn’t done this with R before but thought it sounded fun, so I turned to the then-newly-released rvest package.

(If I’d done some more googling I would have found the geocode() function in the ggmap package and saved myself the trouble. This was an interesting problem to solve, though, and gave me some good experience with web-scraping in R.)

My first step was to pick a starting station page and grab all the links from it (I went with Baker Street). The function to do this is very simple. Given an input URL it:

  • scans the web page,
  • finds all the links,
  • grabs the link url (held in the html ‘href’ attribute), and
  • returns a character vector of all the links.

As I was only looking for links to Wikipedia pages for London Underground stations, I used a couple of simple grep() steps to narrow down my search a bit.
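Here is a sketch of that step. The function name, the Baker Street URL and the exact grep() patterns are reconstructions for illustration rather than the precise code I used:

library(rvest)

ExtractLinks <- function(url) {
  # scan the page, find every <a> tag and return its href attribute
  url %>%
    html() %>%          # read_html() in current rvest versions
    html_nodes("a") %>%
    html_attr("href")
}

links <- ExtractLinks("https://en.wikipedia.org/wiki/Baker_Street_tube_station")
links <- grep("^/wiki/", links, value = TRUE)   # internal Wikipedia pages only
links <- grep("station", links, value = TRUE)   # pages that look like stations
links <- paste0("https://en.wikipedia.org", unique(links))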

Scraping data from Wikipedia pages

library(rvest)  # also provides the %>% pipe via magrittr

ExtractGeo <- function(url) {
  # takes a URL and uses the CSS selector provided by the SelectorGadget
  # widget to extract the dirty coordinates data from it
  url %>%
    html() %>%  # renamed read_html() in later rvest releases
    html_nodes("span span span span span span.geo") %>%
    html_text()
}

However I also needed the station name to link back to my TfL data, so I put together another function to get that from the wiki page, too:

ExtractTitle <- function(url) {
  # takes a URL and uses the CSS selector provided by the SelectorGadget
  # widget to extract the first header, which in this case is the Wikipedia page title
  url %>%
    html() %>%
    html_nodes("div h1") %>%
    html_text()
}

Once I’d created these functions I proceeded to:

  1. Run the function to get all the links from a station page - this gave me the Wikipedia pages for all of the London Underground stations.
  2. Loop over all of these links to get each station’s coordinates and name using the two functions above (sketched just below).
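Carrying on with the hypothetical links vector from the earlier sketch, the loop itself can be as simple as:

geo_raw  <- sapply(links, ExtractGeo)    # dirty "lat; long" strings, one (or none) per page
name_raw <- sapply(links, ExtractTitle)  # page titles, i.e. station names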

So far, so straightforward. However, the scraped data were still very messy and not fit for my purposes. The full code is here but, to put it simply, I had to:

  • use a lot of sapply() and functions from the stringr package (mainly str_extract(); see the sketch after this list),
  • clean up some guff I’d scraped as part of my wide-shot approach (mainly page errors and elements I didn’t need),
  • clean up some individual station names that I knew wouldn’t match to the data from TfL, and
  • run my scraping functions one-by-one for a few stations that I’d missed initially.
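As an example of that str_extract() work, the .geo spans come back as single strings like “51.5226; -0.1571”, so pulling out numeric coordinates looks something like this (ParseGeo is a name made up for illustration):

library(stringr)

ParseGeo <- function(geo) {
  # the .geo span holds "latitude; longitude" in one string
  data.frame(
    latitude  = as.numeric(str_extract(geo, "^-?[0-9.]+")),
    longitude = as.numeric(str_extract(geo, "(?<=; )-?[0-9.]+"))
  )
}

ParseGeo("51.5226; -0.1571")
#   latitude longitude
# 1  51.5226   -0.1571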

Wrapping up and thoughts

By the end of all this I had my hands on a clean and tidy data set containing station name, longitude and latitude - perfect for plotting my data onto a map. A quick use of dplyr’s left_join() and I had everything I needed to get going.
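For completeness, that join is a one-liner - journeys and stations here are placeholder names for my two cleaned data frames:

library(dplyr)

# placeholder data frame and column names - the real ones differ slightly
journeys_mapped <- journeys %>%
  left_join(stations, by = c("from" = "name"))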

rvest was incredibly easy to use once I’d figured out the various functions, and it gave me exactly what I was looking for. I can definitely see myself turning to it again in future when I’m dealing with web-based data. It’s certainly easier to navigate HTML/XML with rvest than with the (older but still useful) XML package. dplyr, lubridate and stringr (yet) again proved their fantastic-ness. Hadley Wickham (and anyone else who contributes but gets less recognition) deserves a medal.

In my next post I’ll talk about how I got on with ggplot2 and putting my data on the map before I pen a third on the app itself.