In real-world situations, jumping right into data analysis almost never happens, or at least it shouldn't. First, we need to get the data ready: clean it of outliers, wrong data types, and the like. But even before that, we may need to collect the data ourselves. Apart from internal databases and all kinds of surveys, websites have become an important source of data.
In this tutorial, you'll work through a practical example of web scraping with R, following the steps to extract data from the website of a university’s sports program. Our goal is to collect data on each course’s name, level, time, period, and prices.
We’ll cover some HTML basics, the concept of parsing, and XPath expressions, while working on our practical example towards automating the search-and-extract process. In a second tutorial, you’ll see how we can replicate the same process and arrive at the same results using modern packages from the tidyverse.
Our starting point is the webpage of the Humboldt University Sports program. It is an index page with the list of sports courses offered for the current semester. The list is ordered alphabetically, and each entry links to a course page with further information. We would like to collect data on level, time schedule, and prices for all sports courses.
Behind the page in your browser, there is text written in HTML, the language for web content. What makes HTML so practical is that it uses a set of tags to give hierarchical order to pieces of text and to indicate how they should be displayed. This is a distinctive feature of markup languages like HTML (HyperText Markup Language). Thanks to how HTML is structured, we can work with a document as a tree representation of its elements, which makes it easier to query.
To see the HTML document itself in your browser, right-click anywhere on the index page and select View Page Source. Some pages display their HTML code in a compact form, with comments and formatting removed, so that many lines of code end up arranged horizontally instead of vertically. This process is known as minification. Although it brings us the enjoyment of speedy websites, it can make your job as an online data collector a bit harder. Fortunately, we can use an HTML beautifier to make the code readable again.
We can also use the element inspector. From the same index page, highlight one of the courses from the list, right-click and select Inspect Element. We can then navigate through the elements and track down the information we want. This is how it looks if you’re using Firefox:
An element in HTML is the combination of start tag <tagname>
, closing tag </tagname>
, and in-between content: <div>I'm a section in some document</div>
. Elements have attributes, which give them meaning and context. Look at the image above: what elements contain the sports names? It’s easy to see they are surrounded by <a>
tags. These are anchor tags that associate text, for example ‘Aikido’, with a hyperlink to another page. The link is specified using the href
attribute, here href="_Aikido.html"
. Note that the <a>
tags are found inside <dd>
tags, which are inside the <dl>
tag, and so on. Nesting elements inside one another is what establishes the hierarchy, hence the tree structure we mentioned before.
If you’d like to read more about HTML, I would recommend checking w3schools’ tutorials. For an easy reference to HTML tags go to http://www.w3schools.com/tags.
Let’s come back to our practical example and start by getting the list of sports courses from the index page. We’ll use getURL()
from the RCurl package to access the website and bring the HTML document into R. In most cases, all we need to do is set the url
argument to the web address, which here is the address of the index page stored in husports_url
. In our case, we also need to add the argument ssl.verifypeer = FALSE
to the function. This is because the website we are connecting to uses HTTPS, an encrypted connection. The encryption relies on an SSL mechanism that first verifies the server’s certificate and then proceeds with the connection. Unfortunately, RCurl sometimes fails to validate a certificate (even when it is valid) and throws an error. If we trust the server, we can set ssl.verifypeer to FALSE and skip this layer of security. Although this approach is safe here, be cautious when applying it in other cases.
The imported html document is saved in husports_html
. At this point, the document is just text and our object husports_html
is of class character:
library(RCurl)
husports_url <- "https://zeh2.zeh.hu-berlin.de/angebote/aktueller_zeitraum/"
husports_html <- getURL(husports_url, ssl.verifypeer = FALSE)
str(husports_html)
## chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-trans"| __truncated__
Even though we have the HTML code in our husports_html
object, we still cannot make use of its HTML structure to extract parts of the document. R needs to represent the tree structure in an object that reflects the hierarchy of the HTML file. This transformation from merely reading HTML content to interpreting it is known as parsing, and to do the job we need, well, a parser.
Let’s parse husports_html
and store it in a new object called husports
using htmlParse()
from the XML package:
library(XML)
husports <- htmlParse(file = husports_html)
str(husports)
## Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
Now we have a data object we can query, husports
. In this object, html elements are represented as hierarchical nodes that R handles as lists.
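To get a feel for this tree, we can peek at the root node and its immediate children using generic accessors from the XML package. This is just a quick sketch; the exact child nodes depend on the live page.
# A quick peek at the parsed tree (a sketch; output depends on the live page)
root <- xmlRoot(husports)          # the top-level <html> element
xmlName(root)                      # name of the root node, i.e. "html"
sapply(xmlChildren(root), xmlName) # names of its child nodes, e.g. "head", "body"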
Note on htmlParse()
: We are passing husports_html
as file
argument to the function. This is the HTML document we downloaded in the previous step. If you read the help page ?htmlParse
, you’ll notice that the argument file
can also be a URL, and you might be tempted to run the code htmlParse(husports_url)
. Unfortunately, it will not work in our case. The reason has to do with communication protocols. Under the hood, htmlParse(file = url)
makes the connection to the HTTP server and gets us the HTML file. Since HTTP pages are very common, this works in most web scraping cases. But our web page is delivered over HTTPS, a protocol not supported by htmlParse(). The RCurl package supports many more protocols, which is why we use getURL()
as a first step to get the html file we want.
# Protocols supported by the RCurl package
curlVersion()$protocols
## [1] "dict" "file" "ftp" "ftps" "gopher" "http" "https"
## [8] "imap" "imaps" "ldap" "pop3" "pop3s" "rtmp" "rtsp"
## [15] "scp" "sftp" "smtp" "smtps" "telnet" "tftp"
We’d like to know what courses are offered as part of the university’s sports program. From inspecting the source code, we see that this information is located in the <a>
nodes. To navigate through the nodes and extract their content we’ll use xpathSApply()
from the XML package.
xpathSApply()
works on a parsed document: husports
in our case. With the function’s second argument, path
, we’ll define how to search for the node in our document, using what is called an XPath expression.
XPath (short for XML Path) is a query language for processing and extracting parts from HTML/XML documents. The basic structure of an XPath expression looks very similar to a file location path. But instead of folders and subfolders in a directory, we have a hierarchical sequence of nodes. And just like in a file system, we can work with absolute and relative paths. With an absolute path, we need to describe the exact sequence of nodes starting at the root node, separating the elements by /
. A relative path is less demanding: it doesn’t have to start at the root node and it doesn’t require the exact sequence of nodes. We can construct a relative path from any node to our target node by skipping nodes between them, indicating it with //
.
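Before applying this to our page, here is a tiny self-contained sketch (the toy HTML string below is made up purely for illustration) showing that an absolute and a relative path can select the same node:
# Toy example: absolute vs. relative XPath paths on a made-up document
library(XML)
toy <- htmlParse("<html><body><div><dl><dd><a href='x.html'>Aikido</a></dd></dl></div></body></html>",
                 asText = TRUE)
xpathSApply(toy, "/html/body/div/dl/dd/a", xmlValue)  # absolute path from the root
## [1] "Aikido"
xpathSApply(toy, "//dl/dd/a", xmlValue)               # relative path, jumping ahead with //
## [1] "Aikido"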
For example, to get to our <a>
nodes we start with a relative path to the <dl>
node ("//dl"
). Compared with the absolute path "/html/body/div/div[7]/div[2]/div[2]/div/div/dl", this saves us from typing eight nodes. From there on, spelling out the exact sequence takes no effort, so we give the rest of the path explicitly: "/dd/a"
.
Notice all the <div>
elements involved in the (absolute) path to our target node. It’s not uncommon to have many instances of the same element in a webpage, but they can be distinguished by their attributes. Thus, <div class="navigation">
and <div class="content">
are both <div>
elements defining two different sections of the document via the class attribute. The reason we care about this is that we can write an XPath expression of the form //div[@class='content']
to condition our selection on the class attribute. The conditioning statement inside square brackets is called the predicate.
In our case, //dl[@class='bs_menu']
means “find the <dl>
node that has an attribute class with value bs_menu”.
# Relative path
xpathSApply(husports, path = "//dl[@class='bs_menu']/dd/a")[3]
## [[1]]
## <a href="_Aikido.html">Aikido</a>
# Absolute path
xpathSApply(husports, path = "/html/body/div/div[7]/div[2]/div[2]/div/div/dl/dd/a")[3]
## [[1]]
## <a href="_Aikido.html">Aikido</a>
xpathSApply()
returns the complete node set that matches the XPath expression. By indexing with [3]
we are accessing the third node in the set as an example.
At this point, our goal is to extract only the names of the courses. Luckily, there’s a handy third argument in xpathSApply()
that we can use to extract any part of a node: its value, an attribute, or an attribute’s value. We simply pick one of the extractor functions available in the XML package. Here, setting fun = xmlValue
will extract the text from our target nodes.
Let’s finally get to it and create a character vector, sports_program
, that holds all the sports listed on the index page:
sports_program <- xpathSApply(
husports,
"//dl[@class='bs_menu']/dd/a",
fun = xmlValue
)
sports_program[1:5]
## [1] "Afrikanischer Tanz"
## [2] "Afrikanischer Tanz Kompaktkurs"
## [3] "Aikido"
## [4] "Aikido (mit angewandter Selbstverteidigung)"
## [5] "Alexandertechnik Kompaktkurs"
If you prefer to see the results as an organized column, you can view sports_program
as a data frame:
head(as.data.frame(sports_program))
length(sports_program)
## [1] 233
Our scraping strategy may not always be as efficient and error-free as we would like it to be, so let’s do a quick check here. It looks like we collected 233 sports from the webpage, but take a look at the last element in sports_program
:
tail(sports_program, n = 1L)
## [1] "RESTPLÄTZE - alle freien Kursplätze dieses Zeitraums"
This last element is not a sport but just a link to the remaining unbooked courses. It gets picked up because it belongs to the <a>
nodes we searched for with the XPath query "//dl[@class='bs_menu']/dd/a"
. Given the structure of the webpage, there’s no way to distinguish the last node from the rest, at least not via an attribute. What we can do is use XPath functions in the predicate to set the condition: “Bring all <dd>
nodes except the last one”.
The XPath function last()
returns the position of the last item in the processed node set (i.e., its total number of items). This helps if we don’t know in advance how many nodes are included in the set. For example, we would use /dd[233]
if we knew the last index position, but /dd[last()]
if we didn’t.
Let’s see how this predicate brings the <dd>
node that appears in the last position:
xpathSApply(husports, "//dl[@class='bs_menu']/dd[last()]/a", fun = xmlValue)
## [1] "RESTPLÄTZE - alle freien Kursplätze dieses Zeitraums"
For the condition we intend to form, we’ll need another XPath function, position()
, which gives the index position of the node. Now we can add a predicate that reads: Select elements in the <dd>
node set whose position()
is not equal to the last()
position.
Here’s the solution that extracts all the sports listed on the index page, excluding the last element (which is not a sport):
sports_program <- xpathSApply(
husports,
"//dl[@class='bs_menu']/dd[position()!=last()]/a",
xmlValue
)
We should have now collected 232 sports names, and the last element should be “Zumba”:
length(sports_program)
## [1] 232
tail(sports_program, n = 1L)
## [1] "Zumba"
We have extracted the names of the courses from our parsed document husports
. From the same document, we can also gather the links to each of these courses. We will need the links later on to access the pages, import the HTML files, and extract information about each course.
The main purpose of <a>
elements is to define links from one page to another. With this in mind, we can easily extract the href
attribute from the selected nodes using xmlGetAttr
:
hrefs <- xpathSApply(
husports,
"//dl[@class='bs_menu']/dd[position()!=last()]/a",
xmlGetAttr, "href"
)
head(as.data.frame(hrefs))
Let’s check if we have the same number of elements in hrefs
as we have in sports_program
, and if the last link refers to “Zumba”:
length(hrefs) == length(sports_program)
## [1] TRUE
tail(hrefs, n = 1L)
## [1] "_Zumba.html"
You probably noticed that our “links” don’t quite look like true links. Typing “_Aikido.html” in a browser won’t get us anywhere. But they are the last (variable) part of each page’s full URL. This means we can follow a pattern to form the links. We’ll take care of that in the next section.
Our goal here is to write a function that automates the tasks of downloading and parsing HTML files from various pages. We create a function, getCoursesHtml
, that will do the following:
1. Constructing URLs for every course page. If you take a look at the webpage, you’ll see that the URL of each course page is formed by attaching the names we stored in hrefs
to our base URL husports_url
. We can use str_c()
from the stringr package to join the two character vectors into one:
str_c(husports_url, hrefs[3])
## [1] "https://zeh2.zeh.hu-berlin.de/angebote/aktueller_zeitraum/_Aikido.html"
2. Importing the HTML documents of all courses into R. We achieve this by applying the function getURL()
to the list of URLs from the previous step.
3. Parsing the HTML documents and saving them in a list to be returned by the function.
getCoursesHtml <- function(pagename, baseurl) {
  require(stringr)
  require(RCurl)
  require(XML)
  urls <- str_c(baseurl, pagename)                 # construct the direct links
  htmldocs <- getURL(urls, ssl.verifypeer = FALSE) # import the html files
  courses_parsed <- lapply(htmldocs, htmlParse)    # parse them
  return(courses_parsed)
}
The arguments to the function are the vector of page names we collected in hrefs
, and the URL of our index page in husports_url
.
Applying the function will get us a list of 232 parsed html documents:
husports_htmls <- getCoursesHtml(hrefs, husports_url)
typeof(husports_htmls)
## [1] "list"
length(husports_htmls)
## [1] 232
Elements in our list are data objects we can query with xpathSApply()
, just like we did with husports
. For example, let’s check the third element. This should refer to the Aikido course:
str(husports_htmls[3])
## List of 1
## $ https://zeh2.zeh.hu-berlin.de/angebote/aktueller_zeitraum/_Aikido.html:Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
To make a simple query, we can extract the name of the course, which appears as a header on the course’s page. If you inspect the page, _Aikido.html, you’ll identify the node we need to select:
xpathSApply(
husports_htmls[[3]],
"//div[@class='bs_head']",
xmlValue
)
## [1] "Aikido"
For a finishing touch to our list, we could give its elements clearer and more compact names:
names(husports_htmls) <- sports_program
str(husports_htmls[1:3])
## List of 3
## $ Afrikanischer Tanz :Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
## $ Afrikanischer Tanz Kompaktkurs:Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
## $ Aikido :Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
Not all courses listed on the sports program have price information; some of them are free of charge. Since we ultimately want to gather a dataset where price information is available, it makes sense to remove unnecessary documents from husports_htmls
before moving to the data collection. It will also make our scraping strategy easier to implement since we can standardize the extraction tasks.
One way to identify which courses are free of charge is to test for the existence of a specific node. In courses with no cost, the HTML code reveals that there’s a <span>
node with attribute title="free of charge"
.
So far, we have employed xpathSApply()
to return nodes and their elements. But we can also use XPath expressions and square brackets []
for indexing nodes directly. For example, the following code would select the first span
element from the node set in the document “Windsurfen | Freies Surfen”:
husports_htmls$`Windsurfen | Freies Surfen`["//span[@title='free of charge']"][1]
## [[1]]
## <span title="free of charge">entgeltfrei</span>
If the node does not exist, NULL is returned:
husports_htmls$Aikido["//span[@title='free of charge']"][1]
## [[1]]
## NULL
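As a quick aside, an alternative way to test for the node (just a sanity check, not the approach we follow below) is to count how many nodes match the XPath expression with xpathSApply(); zero matches means the course is not free:
# Alternative existence check: count the matching nodes (0 means no such node)
length(xpathSApply(husports_htmls$Aikido, "//span[@title='free of charge']"))
## [1] 0
We’ll stick with the square-bracket indexing pattern from above.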
Using this pattern to select nodes, we employ sapply()
to apply an anonymous function, function(x) x["//span[@title='free of charge']"][1],
to all our documents in husports_htmls
. As a result, we get a list with elements that either have the selected node or are NULL (because the node was not found).
# List of nodes
spanNodes <- sapply(
husports_htmls,
function(x) x["//span[@title='free of charge']"][1]
)
We then build a logical vector by testing if an element in spanNodes
is not null, !is.null(x)
. Elements that are not null are those with the “free of charge” node, which are the ones we want to exclude from the list. In our logical vector, these elements will be assigned TRUE.
# Logical vector, True if "free of charge"
spanNodes_free <- sapply(spanNodes, function(x) !is.null(x))
How many courses are free of charge? To get the answer, simply count how many TRUE values are in our logical vector:
length(which(spanNodes_free))
## [1] 16
Finally, we use our logical vector spanNodes_free
to subset our list and exclude free courses:
husports_htmls <- husports_htmls[!spanNodes_free]
length(husports_htmls)
## [1] 216
Now that we have 216 parsed documents ready to work with, it’s time to put all our searching and extracting techniques into a function and get the information we need. The mechanics of our function getData()
work as follows:
1. The sport’s name, level, day, time, and period are easy to extract. They are all found in nodes with a specific class. For example, time is part of a <td>
node with a class named bs_szeit
. This obviously calls for xpathSApply()
and the extractor function xmlValue
.
2. Prices are part of a <div>
node with class bs_tip
. There are four prices, one for each group segment (students, employees, alumni, external). The returned node’s value would look something like “24/ 36/ 36/ 56 €24 EURfür Studierende36 ..”. If we extract the node’s value as it is, we’ll get a chunk of text that we don’t need. But if we do a bit of preprocessing along with the extraction, we can restrict the returned value to the characters before the currency symbol. To implement this, we create our own customized function and pass it to the fun
argument in xpathSApply()
. This way, we let our function cleanPrice()
take care of text manipulation. Using the stringr
package, text characters are extracted with str_sub(string, start, end)
. Here, start = 1
, and end
should equal the position of the last character we want to keep. Since we don’t want to count character positions manually, we’ll use str_locate(string, pattern)
to track down the position of “€”, and then subtract one from the result to avoid extracting the symbol. Note that str_locate()
returns both start and end positions as a matrix, so str_locate(xmlValue(x), "€")[,1]
selects only the start positions (for a one-character pattern like “€”, start and end coincide). A short standalone sketch of this string manipulation follows right after this list.
3. Putting together a data frame of six variables is the last step in our function, which is straightforward using data.frame()
. The only thing to watch for is that vectors must have equal length. Our object sport_name
is a vector of length one, since there’s only one sport name per course page. But on each page there could be several courses if different levels and schedules are offered. Let’s say there are 10 Ballet courses. Then our vectors level
, day
, time
, and so on, are ten elements long. As the data frame is formed, each vector’s element becomes a row, so we need sport_name
to have the same number of elements as the other vectors. Adding Course = rep(sport_name, length(level))
does the trick.
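Before looking at the full function, here is the promised standalone sketch of the string manipulation inside cleanPrice(), applied to a sample string modeled on the bs_tip text shown above (the sample is shortened and made up for illustration):
# Standalone sketch of the price cleanup on a made-up sample string
library(stringr)
raw <- "24/ 36/ 36/ 56 €24 EURfür Studierende36 EUR..."
end <- str_locate(raw, "€")[, 1] - 1                  # position just before the euro sign
str_trim(str_sub(raw, start = 1, end = end), side = "right")
## [1] "24/ 36/ 36/ 56"
With that in mind, here is the full getData() function: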
getData <- function(html_doc){
sport_name <- xpathSApply(html_doc, "//div[@class='bs_head']", xmlValue)
level <- xpathSApply(html_doc, "//td[@class='bs_sdet']", xmlValue)
day <- xpathSApply(html_doc, "//td[@class='bs_stag']", xmlValue)
time <- xpathSApply(html_doc, "//td[@class='bs_szeit']", xmlValue)
period <- xpathSApply(html_doc, "//td[@class='bs_szr']/a", xmlValue)
# Auxiliary function for preprocessing prices
cleanPrice <- function(x){
require(stringr)
# Locate index number before euro symbol
ends <- str_locate(xmlValue(x), "€")[,1] - 1
# Extract based on start and end position
prices <- str_sub(xmlValue(x), start = 1, end = ends)
# Remove whitespace
prices <- str_trim(prices, side = "right")
return(prices)
}
prices <- xpathSApply(html_doc,
"//div[@class='bs_tip']",
cleanPrice)
# Data Frame
sports_df <- data.frame(Course = rep(sport_name, length(level)),
Level = level,
Day = day,
Time = time,
Period = period,
Prices = prices,
stringsAsFactors = FALSE)
return(sports_df)
}
Let’s use our function to extract data from Ballett courses:
getData(husports_htmls$Ballett)
As expected, the function returns a data frame with six variables and, in this case, 10 rows. Now we want to gather the data from all courses. Since we already have a list of course pages, we use lapply()
to run getData()
on all pages and extract the information:
husports_data_ls <- lapply(husports_htmls, getData)
As a result, we should have 216 data frames stored in our list object:
husports_data_ls[1:3]
## $`Afrikanischer Tanz`
## Course Level Day Time Period
## 1 Afrikanischer Tanz A1 (ausgen. 20.6.) Do 20:00-21:30 23.4.-14.07.
## Prices
## 1 12/ 18/ 18/ 28
##
## $`Afrikanischer Tanz Kompaktkurs`
## Course Level Day Time Period
## 1 Afrikanischer Tanz Kompaktkurs Sa 16:00-20:00 11.05.
## Prices
## 1 5/ 10/ 10/ 15
##
## $Aikido
## Course Level Day Time Period Prices
## 1 Aikido A1 Mi 20:30-22:00 23.04.-14.07. 12/ 18/ 18/ 28
## 2 Aikido A1 Fr 16:00-17:30 23.04.-14.07. 12/ 18/ 18/ 28
Having a bunch of data frames inside a list object is not the optimal way to organize our data. Let's put all our data into one big data frame. We can stack all the data frames using bind_rows()
from the dplyr package:
library(dplyr)
husports_data <- bind_rows(husports_data_ls)
glimpse(husports_data)
## Observations: 960
## Variables: 6
## $ Course <chr> "Afrikanischer Tanz", "Afrikanischer Tanz Kompaktkurs",...
## $ Level <chr> "A1 (ausgen. 20.6.)", "", "A1", "A1", "A/F", "Einführun...
## $ Day <chr> "Do", "Sa", "Mi", "Fr", "Do", "Sa", "Sa", "Sa", "Sa", "...
## $ Time <chr> "20:00-21:30", "16:00-20:00", "20:30-22:00", "16:00-17:...
## $ Period <chr> "23.4.-14.07.", "11.05.", "23.04.-14.07.", "23.04.-14.0...
## $ Prices <chr> "12/ 18/ 18/ 28", "5/ 10/ 10/ 15", "12/ 18/ 18/ 28", "1...
tail(husports_data)
You can save the data as an RData file or write it to a CSV file:
save(husports_data, file = "husports.RData")
write.csv(husports_data, file = "husports.csv", row.names = FALSE)
Our data collection is complete, and we have a wonderful data frame of 960 observations and 6 variables! Well, maybe it’s not that wonderful: there are still variables that shouldn’t be of character type, prices that should be split into four columns, and mixed content in the Level variable.
So... there’s plenty of data cleaning to do! But we'll take care of that another time.
For now, go check out the next tutorial on Web Scraping à la Tidyverse.
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. 1st ed. Wiley.
https://stringr.tidyverse.org/articles/regular-expressions.html