Scraping Wikipedia Tables
Want to know how to get data from a Wikipedia table into your R script? Look no further.
I was recently attempting to scrape a table from Wikipedia while responding to a group chat: I wanted to make a quick histogram (something I love about R and ggplot). In my search, I came across this very helpful tweet from Julia Silge:
library(rvest)

h <- read_html("https://t.co/gloY1eErBn")

reps <- h %>%
  html_node("#mw-content-text > div > table:nth-child(18)") %>%
  html_table()

reps <- reps[,c(1:2,4:9)]

— Julia Silge (@juliasilge) January 12, 2018
This is a great piece of quick code for scraping the data out of the table. However, the tweet is specific to that one example, and I want to generalize the code so it works for any Wikipedia table.
I’m going to show you how to scrape data from any Wikipedia table using the Firefox browser and R (of course), although I’m sure the approach could be adapted to other browsers.
I’ll show how to adapt the code above to scrape the table I was interested in: the attendance figures for Trump’s rallies. Specifically, I’d like to scrape the Primary rallies (June 2015–June 2016) table.
The first thing to change is the read_html() line: all we need to do here is update the hyperlink by copying the web address of the Wikipedia page.
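For example, step one amounts to swapping in the new address. The article URL below is a placeholder, not the real page; copy the actual address from your browser’s address bar:

```r
library(rvest)

# Placeholder URL -- replace with the address copied from your browser:
h <- read_html("https://en.wikipedia.org/wiki/Your_Article_Here")
```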
The next thing to change is the html_node() line. For this you’ll need to know the tag of the table in the HTML code, which means inspecting the HTML of the Wikipedia page you’re looking at.
Inspect the table you’re interested in and you’ll see a line starting with

<table class= ...

Right-click on that line and select Copy > XPath, then update the html_node() line to use the xpath argument instead of the CSS selector:

html_node(xpath = '/html/body/div/div/div/div/table')
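Putting the pieces together, here’s a self-contained sketch of the whole pipeline. To keep it runnable without a network connection, it parses an inline HTML snippet; for a live page, swap the string for the Wikipedia URL and paste in the XPath you copied from Firefox. The table contents and the XPath below are made up for illustration:

```r
library(rvest)

# Stand-in for a real Wikipedia page; replace this string with
# read_html("<your page address>") to scrape a live table.
html <- '<html><body><div><table class="wikitable">
  <tr><th>City</th><th>Attendance</th></tr>
  <tr><td>Dallas</td><td>20000</td></tr>
  <tr><td>Phoenix</td><td>4200</td></tr>
</table></div></body></html>'

page <- read_html(html)

# Select the table by XPath (paste in the one copied from Firefox),
# then convert it to a data frame with html_table():
tbl <- page %>%
  html_node(xpath = "/html/body/div/table") %>%
  html_table()

print(tbl)
```

From here the data frame is ready for ggplot, e.g. a quick histogram of the attendance column.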