Thursday, January 9, 2014

How to draw data on a map with R

In this post we'll see how to use R language to plot some data on a map; the full code is available on my GitHub.
I've used data freely available from the italian ministry of education about the rate of school abandonment in Italy. It is available for download (in form of zipped files) from this page: http://archivio.pubblica.istruzione.it/scuola_in_chiaro/open_data/index.html.

First we need the data of the schools (including their geolocation). Once downloaded and unzipped on a PC, we can load it with R:
schools <- read.csv(file="data/schools.csv")
This dataset contains for each school a unique code, its complete address, the lat/lon and other info; since we want to display this data on a map, we remove rows that don't have lat/lon info:
schools <- schools[which(schools$LATITUDINE != "NA"),]
For the same reason we can forget all the columns except the unique code and the lat/info info; we create a new dataframe with only the columns we're interested in:
schools_geo <- data.frame(schools$codice_scuola, schools$LATITUDINE, schools$LONGITUDINE)
We now have a clean dataset for the italian schools.

The abandonment archive contains several files: the one we're interested in is "abbandoni.csv"; this dataset contains the unique code of the school, the abandonment rate of students in that school (in percentage) and other info. Let's load the dataset:
abandonments <- read.csv(file="data/abandonments.csv")
Looking at the data we see that some schools don't have the abandonment rate (the value is 'NA' instead of a number), so we'll exclude those schools from our dataset:
abandonments <- abandonments[which(abandonments$scuola != "NA"),]
Since the column names of the unique code of the schools are different in the two datastes, we'll rename one of the two to match the other (the name of the column is 'cod_scuola'):
colnames(schools_geo)[1] <- "cod_scuola"
Now we merge the two dataset so that we have on the same row the abandonment rate and the lat/lon info:
data <- merge(abandonments, schools_geo, by="cod_scuola")
If we draw on the map all the schools, even the ones that have an abandonment rate of 0, we'll have too many data; to make the visualization clearer, let's get rid the schools where the abandonment rate is equal to 0:
data <- data[which(data$scuola>0),]
The data is now ok. Let's setup the map. We'll use an R package called RgoogleMaps; we'll first define a rectangle of lat/lon coordinates for getting an image of Italy from Google Maps.
# setup of latitude and longitude centered on Italy
lat_c<-42.1
lon_c<-12.5

# gets a rectangle of coordinates
rectangle<-qbbox(lat = c(lat_c[1]+5, lat_c[1]-5), lon = c(lon_c[1]+5, lon_c[1]-5))

# retrieves a map from googlemaps
map<-GetMap.bbox(rectangle$lonR, rectangle$latR)
Now that we have the map, we can plot on it the abandonment rate; we will show the abandonment rate as a circle centered of the lat/lon coordinates of the school. The PlotOnStaticMap function of the RGoogleMaps package gets these parameter:
  • a map: the one we just have created
  • lat/lon: the coordinates where to plot
  • pch: what to plot; 1 means an empty circle
  • cex: the radius of the circle; we used the value of abandonment rate so that the bigger the rate, the bigger the circle
  • col: the color of plot
PlotOnStaticMap(MyMap=map, lat=data$schools.LATITUDINE, lon=data$schools.LONGITUDINE, TrueProj=TRUE, cex=data$scuola/15, pch=1, col="blue")
We also want a legend that describes the circles' values:
  • x,y: the top-left corner of the box
  • x.intersp: the horizontal distance between the circle and its description
  • y.intersp: the vertical distance between one row and the next
  • legend: the descriptions of the legend's items
  • col: the color of the circles
  • pch: the symbol to use (a circle)
  • cex: the font size of the text
  • pt.cex: the size of the circles
legend(x=120, y=300, title="Abandonment rate", x.intersp=2, y.intersp=c(1.1,1.1,1.1,1.1,1.15,1.2),legend=c("< 1%", "1% - 5%", "5% - 10%", "10% - 20%", "20% - 50%", "> 50%"), col="blue",pch=1,cex=0.8, pt.cex=c(1/15, 5/15, 10/15, 20/15, 50/15, 70/15)) 
This is the image returned (click on the image to enlarge it):
From this image, we can quickly see that the phenomenon of abandonment is more frequent in southern Italy and in the big cities (Milan, Turin, Genoa, Naples, Palermo) of the whole country.