Friday, January 24, 2014

Scraping data from web pages in R with XML package

In the last years a lot of data has been released publicly in different formats, but sometimes the data we're interested in are still inside the HTML of a web page: let's see how to get those data.

One of the existing packages for doing this job is the XML package. This package allows us to read and create XML and HTML documents; among the many features, there's a function called readHTMLTable() that analyze the parsed HTML and returns the tables present in the page. The details of the package are available in the official documentation of the package.

Let's start.
Suppose we're interested in the italian demographic info present in this page http://sdw.ecb.europa.eu/browse.do?node=2120803 from the EU website. We start loading and parsing the page:
page <- "http://sdw.ecb.europa.eu/browse.do?node=2120803"
parsed <- htmlParse(page)
Now that we have parsed HTML, we can use the readHTMLTable() function to return a list with all the tables present in the page; we'll call the function with these parameters:
  • parsed: the parsed HTML
  • skip.rows: the rows we want to skip (at the beginning of this table there are a couple of rows that don't contain data but just formatting elements)
  • colClasses: the datatype of the different columns of the table (in our case all the columns have integer values); the rep() function is used to replicate 31 times the "integer" value
table <- readHTMLTable(parsed, skip.rows=c(1,3,4,5), colClasses = c(rep("integer", 31)))
As we can see from the page source code, this web page contains six HTML tables; the one that contains the data we're interested in is the fifth, so we extract that one from the list of tables, as a data frame:
values <- as.data.frame(table[5])
Just for convenience, we rename the columns with the period and italian data:
# renames the columns for the period and Italy
colnames(values)[1] <- 'Period'
colnames(values)[19] <- 'Italy'
The italian data lasts from 1990 to 2014, so we have to subset only those rows and, of course, only the two columns of period and italian data:
# subsets the data: we are interested only in the first and the 19th column (period and italian info)
ids <- values[c(1,19)]

# Italy has only 25 years of info, so we cut away the other rows
ids <- as.data.frame(ids[1:25,])
Now we can plot these data calling the plot function with these parameters:
  • ids: the data to plot
  • xlab: the label of the X axis
  • ylab: the label of the Y axis
  • main: the title of the plot
  • pch: the symbol to draw for evey point (19 is a solid circle: look here for an overview)
  • cex: the size of the symbol
plot(ids, xlab="Year", ylab="Population in thousands", main="Population 1990-2014", pch=19, cex=0.5)
and here is the result:

Here's the full code, also available on my github:
library(XML)

# sets the URL
url <- "http://sdw.ecb.europa.eu/browse.do?node=2120803"

# let the XML library parse the HTMl of the page
parsed <- htmlParse(url)

# reads the HTML table present inside the page, paying attention
# to the data types contained in the HTML table
table <- readHTMLTable(parsed, skip.rows=c(1,3,4,5), colClasses = c(rep("integer", 31) ))

# this web page contains seven HTML pages, but the one that contains the data
# is the fifth
values <- as.data.frame(table[5])

# renames the columns for the period and Italy
colnames(values)[1] <- 'Period'
colnames(values)[19] <- 'Italy'

# now subsets the data: we are interested only in the first and 
# the 19th column (period and Italy info)
ids <- values[c(1,19)]

# Italy has only 25 year of info, so we cut away the others
ids <- as.data.frame(ids[1:25,])

# plots the data
plot(ids, xlab="Year", ylab="Population in thousands", main="Population 1990-2014", pch=19, cex=0.5)

Wednesday, January 15, 2014

Creating a choropleth map with R

A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map (see the full article on Wikipedia).
In this post we'll see how to create a choropleth map with R programming language. We'll draw a map of Italy and the data about unemployment based on its administrative regions. First of all, we need the administrative borders of the country: the italian national statistics institute (ISTAT) has made available these data on the page: http://www3.istat.it/dati/catalogo/20061102_00/; the data we're interested in are in form of a zipped file that contains the shapefiles . The shapefile format is a geospatial data format that can describe polygons, and our administrative regions are polygons. We can use the readShapeSpatial() function from the maptools package:
regions <- readShapeSpatial("data/prov2011_g.shp")
This function creates an object that contains several information, included the polygons we need to draw.
We now need the unemployment data; again from ISTAT we can download some unemployment data based on administrative regions: is a zipped file containing a CSV. This file contains the number on unemployed people for each region from 2004 to 2012): we'll use the 2012 data. Let's load the file and remove the data we don't need:
# reads the unemployment data
unemployment <- read.csv(file="data/unemployment.csv")

# selects only the data we are interested (the name and the year 2012)
unemployment <- subset(unemployment, select=c(1,10))
To create a choropleth we have to plot the areas corresponding to the different administrative regions in different colors, and to do that, we need to add to the shapefile the values of unemployment.
Since is not possible to merge data into a shapefile, we need to add a whole column containing the unemployment values to it. Be careful: the regions in the shapefile have an order, and the value's column to add must refer to the regions in the same order! To do that, we've to add to the shapefile object the unemployment values ordered as the regions are:
# adds the unemployment values to the shapefile matching on regions' names (the column name is NOME_PRO)
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])
We're almost done: we have the borders of every region and the value associated to it, so we can draw the choropleth. We'll use the spplot() function of sp package; this function can be used to plot a spatial area using data from a shapefile, and we can tell this function to use a different color for every region, according to the magnitude of the unemployment data; the parameters are:
  • regions: the shapefile
  • "Year.2012": the column name of the unemployment data (used by spplot for choosing a color)
  • col.regions: here we tell spplot the palette to use for coloring the regions
  • main: the title of the plot
In this plot we'll use a gradient of gray color that go from 0.9 to 0.2 (from very dark grey to very light gray):
# plots it
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")
We then add a legend:
grid.text("Number of unemployed people in thousands", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))
and here is the result (click on the image to enlarge it):
Looking at the choropleth map, besides a couple of outliers located in the major cities, the values of unemployment seem quite the same overall the whole country; this visual effect is due to the outliers which "kick down" the other values to a smaller range of colors and thus giving the impression of almost equality.
Let's try to remove the outliers (setting them to 0), and see how it changes:
# removes the outliers
unemployment$Year.2012[which(unemployment$NOME_PRO == 'NAPOLI' | unemployment$NOME_PRO == 'TORINO' | unemployment$NOME_PRO == 'ROMA'| unemployment$NOME_PRO == 'MILANO')] <- 0

# reloads the shapefile
regions <- readShapeSpatial("data/prov2011_g.shp")

# adds the new unemployment data (without outliers)
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])

# plots it, legend included
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")
grid.text("Number of unemployed people in thousands (excluded Milan,Turin,Rome,Naples)", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))
This is the result:
Now that there are no more outliers, the max value has passed from more than 200.000 to less than 100.000; the result is that we can see better the distribution of unemployed people in the other regions. But we're still not satisfied with these results, because we realized that we'd prefer to know about unemployment rate in Italy rather than visualizing absolute numbers, as it gives a better description of the phenomenon.
As we did before, we have to find some data. To compute the unemployment rate (having the absolute numbers) we just need the population for every region for the same year we have unemployment data for; then we can divide the unemployment number by the population to have the unemployment rate.

First we need to find the italian population per region; once again, the ISTAT has the data we need. The zipped file is here: http://demo.istat.it/pop2012/dati/province.zip; this archive contains a CSV with the population of every sub-region of any region, with males and females data separated: this means that we have to sum all the values belonging to the same region to have its population. Let's see how.
First of all, we load and clean the data:
# loads data from CSV
population <- read.csv(file="data/population_per_region.csv")

# leaves only the three columns we're interested in (name, total_males, total_females)
population <- subset(population, select=c(2,8,13))

# adds a "Total" column to the dataset with the sum of total_males and total_females
population$Total <- population$Totale_Maschi + population$Totale_Femmine

# capitalize the names of the regions
population$Descrizione_Provincia <- toupper(population$Descrizione_Provincia)
Now, we have to aggregate all the rows according to the name of region, sum all the values in them, and assign the result to a data frame:
# now we group by "Descrizione_Provincia" (the region name) and sum the "Total" column
pop <- as.data.frame(aggregate(population$Total, by=list(population$Descrizione_Provincia), sum))

# renames the columns for clarity
colnames(pop)[1] <- "Descrizione_Provincia"
colnames(pop)[2] <- "Total"
We're now ready to compute the rate. The original unemployment numbers are in thousands, while actual population is not, so we have to multiply it by 1000 to have the correct result, and then the compute the rate dividing the two numbers (we multiply by 100 to have a percentage instead of a [0,1] value):
# computes the number of unemployed 
pop$unemployed <- unemployment$Year.2012[match(pop$Descrizione_Provincia, unemployment$NOME_PRO)]*1000

# computes the percentage of unemployed people related to total population
pop$rate <- pop$unemployed / pop$Total * 100
We have now the unemployment rate, we just need to update the shapefile with these new data:
# updates the shapefile with these new data
regions@data = data.frame(regions@data, pop[match(regions@data$NOME_PRO, pop$Descrizione_Provincia),])
and plot again:
# plots it
cols <- gray.colors(32, 0.9, 0.2)
spplot(regions, "rate", col.regions=cols, main="Unemployment in 2012")
grid.text("% of unemployed population", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 14))
Here is the final result:
Finally, this choropleth gives us an idea of unemployment in Italy.


Here's the full code, also available on my github:
# you can also download all the datafiles from my github: 
https://github.com/andreaiacono/playingwithdata/tree/master/data

library(maptools) # for reading shapefiles
library(grid) # for adding a title to the legend of the plot

### loads data ###

# sets current directory
setwd("/home/andrea/Programming/code/R/PlayingWithData/")

# reads the shapefile of administrative regions
regions <- readShapeSpatial("data/prov2011_g.shp")

# reads the unemployment data
unemployment <- read.csv(file="data/unemployment.csv")

# selects only the data we are interested (the name and the year 2012)
unemployment <- subset(unemployment, select=c(1,10))

# adds the unemployment value to the shapefile
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])

# plots it
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")

# adds a title to the legend
grid.text("Number of unemployed people in thousands", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))

# removes the outliers
unemployment$Year.2012[which(unemployment$NOME_PRO == 'NAPOLI' | unemployment$NOME_PRO == 'TORINO' | unemployment$NOME_PRO == 'ROMA'| unemployment$NOME_PRO == 'MILANO')] <- 0

# reloads the shapefile
regions <- readShapeSpatial("data/prov2011_g.shp")

# adds the new unemployment data (without outliers)
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])

# plots it, legend included
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")
grid.text("Number of unemployed people in thousands (excluded Milan,Turin,Rome,Naples)", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))

# now we want the rate of unemployment instead of absolute number, so we load the population number for every region for having a rate

# loads data from CSV
population <- read.csv(file="data/population_per_region.csv")

# leaves only the three columns we're interested in (name, total_males, total_females)
population <- subset(population, select=c(2,8,13))

# adds a "Total" column to the dataset with the sum of total_males and total_females
population$Total <- population$Totale_Maschi + population$Totale_Femmine

# capitalize the names of the regions
population$Descrizione_Provincia <- toupper(population$Descrizione_Provincia)

# now we group by "Descrizione_Provincia" and sum the "Total" column
pop <- as.data.frame(aggregate(population$Total, by=list(population$Descrizione_Provincia), sum))

# rename the columns
colnames(pop)[1] <- "Descrizione_Provincia"
colnames(pop)[2] <- "Total"

# computes the number of unemployed 
pop$unemployed <- unemployment$Year.2012[match(pop$Descrizione_Provincia, unemployment$NOME_PRO)]*1000

# computes the rate of unemployed people related to total population
pop$rate <- pop$unemployed / pop$Total * 100

# updates the shapefile with these new data
regions@data = data.frame(regions@data, pop[match(regions@data$NOME_PRO, pop$Descrizione_Provincia),])

# and plots it
cols <- gray.colors(32, 0.9, 0.2)
spplot(regions, "rate", col.regions=cols, main="Unemployment in 2012")
grid.text("% of unemployed population", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 14))

Thursday, January 9, 2014

How to draw data on a map with R

In this post we'll see how to use R language to plot some data on a map; the full code is available on my GitHub.
I've used data freely available from the italian ministry of education about the rate of school abandonment in Italy. It is available for download (in form of zipped files) from this page: http://archivio.pubblica.istruzione.it/scuola_in_chiaro/open_data/index.html.

First we need the data of the schools (including their geolocation). Once downloaded and unzipped on a PC, we can load it with R:
schools <- read.csv(file="data/schools.csv")
This dataset contains for each school a unique code, its complete address, the lat/lon and other info; since we want to display this data on a map, we remove rows that don't have lat/lon info:
schools <- schools[which(schools$LATITUDINE != "NA"),]
For the same reason we can forget all the columns except the unique code and the lat/info info; we create a new dataframe with only the columns we're interested in:
schools_geo <- data.frame(schools$codice_scuola, schools$LATITUDINE, schools$LONGITUDINE)
We now have a clean dataset for the italian schools.

The abandonment archive contains several files: the one we're interested in is "abbandoni.csv"; this dataset contains the unique code of the school, the abandonment rate of students in that school (in percentage) and other info. Let's load the dataset:
abandonments <- read.csv(file="data/abandonments.csv")
Looking at the data we see that some schools don't have the abandonment rate (the value is 'NA' instead of a number), so we'll exclude those schools from our dataset:
abandonments <- abandonments[which(abandonments$scuola != "NA"),]
Since the column names of the unique code of the schools are different in the two datastes, we'll rename one of the two to match the other (the name of the column is 'cod_scuola'):
colnames(schools_geo)[1] <- "cod_scuola"
Now we merge the two dataset so that we have on the same row the abandonment rate and the lat/lon info:
data <- merge(abandonments, schools_geo, by="cod_scuola")
If we draw on the map all the schools, even the ones that have an abandonment rate of 0, we'll have too many data; to make the visualization clearer, let's get rid the schools where the abandonment rate is equal to 0:
data <- data[which(data$scuola>0),]
The data is now ok. Let's setup the map. We'll use an R package called RgoogleMaps; we'll first define a rectangle of lat/lon coordinates for getting an image of Italy from Google Maps.
# setup of latitude and longitude centered on Italy
lat_c<-42.1
lon_c<-12.5

# gets a rectangle of coordinates
rectangle<-qbbox(lat = c(lat_c[1]+5, lat_c[1]-5), lon = c(lon_c[1]+5, lon_c[1]-5))

# retrieves a map from googlemaps
map<-GetMap.bbox(rectangle$lonR, rectangle$latR)
Now that we have the map, we can plot on it the abandonment rate; we will show the abandonment rate as a circle centered of the lat/lon coordinates of the school. The PlotOnStaticMap function of the RGoogleMaps package gets these parameter:
  • a map: the one we just have created
  • lat/lon: the coordinates where to plot
  • pch: what to plot; 1 means an empty circle
  • cex: the radius of the circle; we used the value of abandonment rate so that the bigger the rate, the bigger the circle
  • col: the color of plot
PlotOnStaticMap(MyMap=map, lat=data$schools.LATITUDINE, lon=data$schools.LONGITUDINE, TrueProj=TRUE, cex=data$scuola/15, pch=1, col="blue")
We also want a legend that describes the circles' values:
  • x,y: the top-left corner of the box
  • x.intersp: the horizontal distance between the circle and its description
  • y.intersp: the vertical distance between one row and the next
  • legend: the descriptions of the legend's items
  • col: the color of the circles
  • pch: the symbol to use (a circle)
  • cex: the font size of the text
  • pt.cex: the size of the circles
legend(x=120, y=300, title="Abandonment rate", x.intersp=2, y.intersp=c(1.1,1.1,1.1,1.1,1.15,1.2),legend=c("< 1%", "1% - 5%", "5% - 10%", "10% - 20%", "20% - 50%", "> 50%"), col="blue",pch=1,cex=0.8, pt.cex=c(1/15, 5/15, 10/15, 20/15, 50/15, 70/15)) 
This is the image returned (click on the image to enlarge it):
From this image, we can quickly see that the phenomenon of abandonment is more frequent in southern Italy and in the big cities (Milan, Turin, Genoa, Naples, Palermo) of the whole country.