When we worked with the XML package in the previous tutorial, importing and parsing required two steps: accessing the website to import the HTML document with getURL(), and parsing it with htmlParse(). In the tidyverse, both steps can be accomplished at once with read_html() from the xml2 package (also re-exported by rvest). We only need to provide the URL of our target website (already stored in husports_url) and the function takes care of converting the URL into a connection, importing, and parsing the document.
library(xml2)
husports <- read_html(husports_url)
str(husports)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
We want to search for and extract the sports names from our parsed document husports. We already know this information resides in <a> nodes. To find them, we employ html_nodes() and the XPath expression we used before. Then we need to extract the values from the nodes. We can do this with the pipe operator %>% from magrittr, passing the node set as argument to the function html_text().
library(rvest)
library(magrittr)
sports_program <- husports %>%
  # Locate nodes
  html_nodes(xpath = "//dl[@class='bs_menu']/dd[position()!=last()]/a") %>%
  # Extract text
  html_text()
head(as.data.frame(sports_program))
Extracting the links works very similarly to how we obtained the sports names. But instead of html_text(), we need html_attr("href") to extract the values of the href attributes in the <a> nodes.
hrefs <- husports %>%
  # Locate nodes
  html_nodes(xpath = "//dl[@class='bs_menu']/dd[position()!=last()]/a") %>%
  # Extract attribute
  html_attr("href")
head(as.data.frame(hrefs))
To gather the HTML documents of all courses, we set up our function in the same way described in Import and parse several pages with our own function. The only difference is that with tidyverse tools we can write a more concise function. After appending each page name to our base URL to form the full links, we use map() from the purrr package to apply read_html() to each link, storing the parsed documents in courses_parsed.
getCoursesHtmlv2 <- function(pagename, baseurl) {
  require(purrr)
  require(stringr)
  # Construct direct links
  courses_parsed <- str_c(baseurl, pagename) %>%
    # Import and parse each link
    map(read_html)
  return(courses_parsed)
}
Let’s apply our function on the list of page names hrefs. As a result, we get a list of parsed documents.
husports_htmls <- getCoursesHtmlv2(hrefs, husports_url)
names(husports_htmls) <- sports_program
length(husports_htmls)
## [1] 232
Filtering out free courses
From the 232 courses, we want to exclude those that are free of charge. We can identify courses with no cost because their pages contain <span> nodes with the attribute title="free of charge".
We know we can extract nodes with html_nodes(). So we’ll use the map() function to extract the “free of charge” nodes from each of the documents in our husports_htmls list. Free courses will have at least one of these nodes, while the rest will have zero. Based on this fact, we get the length of each node set and create a logical vector to test whether the length is different from zero. The lengths are stored in freeNodes_length:
freeNodes_length <- map(
  husports_htmls,
  html_nodes,
  xpath = "//span[@title='free of charge']"
) %>%
  map(length)
# A free course has length different from 0
freeNodes_length[40]
## $`Business Yoga Beschäftigte/ Gruppenanmeldung`
## [1] 25
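As a side note, the same lengths could also be computed in one step with map_int() from purrr, which returns a plain integer vector. This is only a sketch of an alternative, not what we use below:
# Sketch: count the "free of charge" nodes per document in a single map_int() call
map_int(husports_htmls,
        ~ length(html_nodes(.x, xpath = "//span[@title='free of charge']")))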
The condition freeNodes_length != 0 goes through each element in the vector and evaluates to TRUE if a value is different from zero, and to FALSE if it is zero. As a result, we have a logical vector freecourses to flag the free courses.
# Logical vector, TRUE if "free of charge"
freecourses <- freeNodes_length != 0
# A free course should have value TRUE
freecourses[40]
## Business Yoga Beschäftigte/ Gruppenanmeldung
## TRUE
So how do we exclude those free courses? We just subset our husports_htmls with the logical index [!freecourses], which reads “not a free course”:
husports_htmls <- husports_htmls[!freecourses]
length(husports_htmls)
## [1] 216
We have now 216 courses in our list.
Creating our own function getData2 will help us keep the data extraction automatic. The function gathers the sport’s name, level, day, time, period, and prices from each of the courses in our list.
The extra steps in making the object prices are intended to process the contents of the node set and avoid extracting irrelevant text. Recall that the four prices we need are contained in the characters before the currency symbol. To extract them, we use str_extract() from the stringr package. Then str_trim() will get rid of the white space at the end of the string.
To describe the "24/ 36/ 36/ 56 €" string pattern, we can use the regular expression .*€, which matches any character appearing zero or more times followed by a euro symbol. However, we only want to extract the part that comes before the euro symbol. What we need is to combine our regular expression with a look-around assertion, more precisely a positive look-ahead assertion, which matches something followed by something else (e.g. the euro symbol) without including the latter in the match. For example, p(?=€) matches a p that is followed by €, without including the € in the result.
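To see the pattern in isolation, here is a minimal sketch on the example string from above (the object price_string is just for illustration and not part of our scraping pipeline); the expected results are shown as comments:
library(stringr)
price_string <- "24/ 36/ 36/ 56 €"
# Keep everything before the euro symbol (positive look-ahead)
str_extract(price_string, ".*(?=€)")
## [1] "24/ 36/ 36/ 56 "
# Remove the trailing whitespace
str_trim(str_extract(price_string, ".*(?=€)"), side = "right")
## [1] "24/ 36/ 36/ 56"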
The last part of our function simply puts together a data frame just as we did in Create your function to extract the data, except that we now create a tibble, which is a modern version of the data frame. Among other things, tibble() does not convert strings into factors (so we don’t need the argument stringsAsFactors = FALSE), and it recycles inputs of length 1 (helpful for our object sport_name).
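As a quick illustration of the recycling behaviour, here is a toy example (the values are made up and not part of our scraped data):
library(tibble)
# The length-1 Course value is recycled to match the three-element columns
tibble(Course = "Ballett",
       Level = c("A1", "A2", "F"),
       Day = c("Mo", "Mi", "Fr"))
# returns a tibble with 3 rows, with "Ballett" repeated in the Course column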
getData2 <- function(html_doc){
  sport_name <- html_doc %>%
    html_nodes(xpath = "//div[@class='bs_head']") %>%
    html_text()
  level <- html_doc %>%
    html_nodes(xpath = "//td[@class='bs_sdet']") %>%
    html_text()
  day <- html_doc %>%
    html_nodes(xpath = "//td[@class='bs_stag']") %>%
    html_text()
  time <- html_doc %>%
    html_nodes(xpath = "//td[@class='bs_szeit']") %>%
    html_text()
  period <- html_doc %>%
    html_nodes(xpath = "//td[@class='bs_szr']/a") %>%
    html_text()
  prices <- html_doc %>%
    html_nodes(xpath = "//div[@class='bs_tip']") %>%
    html_text() %>%
    # Extract matching pattern (everything before €)
    str_extract(".*(?=€)") %>%
    # Remove trailing whitespace
    str_trim(side = "right")
  # Data frame (tibble)
  sports_df <- tibble(Course = sport_name,
                      Level = level,
                      Day = day,
                      Time = time,
                      Period = period,
                      Prices = prices)
  return(sports_df)
}
Let’s test our function on one of the courses:
getData2(husports_htmls$Ballett)
Applying our function to each of the documents in husports_htmls using map() would return a list of data frames. Instead, by using map_dfr() we get a single data frame created by row-binding the individual data frames. This is a shorter alternative to using lapply() and bind_rows() as we did in Stack the data frames.
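For comparison, the longer form would look roughly like this (a sketch; the object name husports_data_alt is only for illustration):
# Equivalent longer form (sketch): build a list of data frames with lapply(),
# then stack them by row with dplyr::bind_rows()
husports_data_alt <- bind_rows(lapply(husports_htmls, getData2))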
library(dplyr)
husports_data <- husports_htmls %>%
  map_dfr(getData2)
glimpse(husports_data)
## Observations: 960
## Variables: 6
## $ Course <chr> "Afrikanischer Tanz", "Afrikanischer Tanz Kompaktkurs",...
## $ Level <chr> "A1 (ausgen. 20.6.)", "", "A1", "A1", "A/F", "Einführun...
## $ Day <chr> "Do", "Sa", "Mi", "Fr", "Do", "Sa", "Sa", "Sa", "Sa", "...
## $ Time <chr> "20:00-21:30", "16:00-20:00", "20:30-22:00", "16:00-17:...
## $ Period <chr> "23.4.-14.07.", "11.05.", "23.04.-14.07.", "23.04.-14.0...
## $ Prices <chr> "12/ 18/ 18/ 28", "5/ 10/ 10/ 15", "12/ 18/ 18/ 28", "1...
tail(husports_data)
Our data frame has 960 observations and 6 variables. And in terms of data collection our job is done.
Now try to practice on your own!
Munzert, S. et al. (2014). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. 1st ed. Wiley Publishing.
https://stringr.tidyverse.org/articles/regular-expressions.html