Wednesday, January 15, 2014

Creating a choropleth map with R

A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map (see the full article on Wikipedia).
In this post we'll see how to create a choropleth map with R programming language. We'll draw a map of Italy and the data about unemployment based on its administrative regions. First of all, we need the administrative borders of the country: the italian national statistics institute (ISTAT) has made available these data on the page: http://www3.istat.it/dati/catalogo/20061102_00/; the data we're interested in are in form of a zipped file that contains the shapefiles . The shapefile format is a geospatial data format that can describe polygons, and our administrative regions are polygons. We can use the readShapeSpatial() function from the maptools package:
regions <- readShapeSpatial("data/prov2011_g.shp")
This function creates an object that contains several information, included the polygons we need to draw.
We now need the unemployment data; again from ISTAT we can download some unemployment data based on administrative regions: is a zipped file containing a CSV. This file contains the number on unemployed people for each region from 2004 to 2012): we'll use the 2012 data. Let's load the file and remove the data we don't need:
# reads the unemployment data
unemployment <- read.csv(file="data/unemployment.csv")

# selects only the data we are interested (the name and the year 2012)
unemployment <- subset(unemployment, select=c(1,10))
To create a choropleth we have to plot the areas corresponding to the different administrative regions in different colors, and to do that, we need to add to the shapefile the values of unemployment.
Since is not possible to merge data into a shapefile, we need to add a whole column containing the unemployment values to it. Be careful: the regions in the shapefile have an order, and the value's column to add must refer to the regions in the same order! To do that, we've to add to the shapefile object the unemployment values ordered as the regions are:
# adds the unemployment values to the shapefile matching on regions' names (the column name is NOME_PRO)
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])
We're almost done: we have the borders of every region and the value associated to it, so we can draw the choropleth. We'll use the spplot() function of sp package; this function can be used to plot a spatial area using data from a shapefile, and we can tell this function to use a different color for every region, according to the magnitude of the unemployment data; the parameters are:
  • regions: the shapefile
  • "Year.2012": the column name of the unemployment data (used by spplot for choosing a color)
  • col.regions: here we tell spplot the palette to use for coloring the regions
  • main: the title of the plot
In this plot we'll use a gradient of gray color that go from 0.9 to 0.2 (from very dark grey to very light gray):
# plots it
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")
We then add a legend:
grid.text("Number of unemployed people in thousands", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))
and here is the result (click on the image to enlarge it):
Looking at the choropleth map, besides a couple of outliers located in the major cities, the values of unemployment seem quite the same overall the whole country; this visual effect is due to the outliers which "kick down" the other values to a smaller range of colors and thus giving the impression of almost equality.
Let's try to remove the outliers (setting them to 0), and see how it changes:
# removes the outliers
unemployment$Year.2012[which(unemployment$NOME_PRO == 'NAPOLI' | unemployment$NOME_PRO == 'TORINO' | unemployment$NOME_PRO == 'ROMA'| unemployment$NOME_PRO == 'MILANO')] <- 0

# reloads the shapefile
regions <- readShapeSpatial("data/prov2011_g.shp")

# adds the new unemployment data (without outliers)
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])

# plots it, legend included
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")
grid.text("Number of unemployed people in thousands (excluded Milan,Turin,Rome,Naples)", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))
This is the result:
Now that there are no more outliers, the max value has passed from more than 200.000 to less than 100.000; the result is that we can see better the distribution of unemployed people in the other regions. But we're still not satisfied with these results, because we realized that we'd prefer to know about unemployment rate in Italy rather than visualizing absolute numbers, as it gives a better description of the phenomenon.
As we did before, we have to find some data. To compute the unemployment rate (having the absolute numbers) we just need the population for every region for the same year we have unemployment data for; then we can divide the unemployment number by the population to have the unemployment rate.

First we need to find the italian population per region; once again, the ISTAT has the data we need. The zipped file is here: http://demo.istat.it/pop2012/dati/province.zip; this archive contains a CSV with the population of every sub-region of any region, with males and females data separated: this means that we have to sum all the values belonging to the same region to have its population. Let's see how.
First of all, we load and clean the data:
# loads data from CSV
population <- read.csv(file="data/population_per_region.csv")

# leaves only the three columns we're interested in (name, total_males, total_females)
population <- subset(population, select=c(2,8,13))

# adds a "Total" column to the dataset with the sum of total_males and total_females
population$Total <- population$Totale_Maschi + population$Totale_Femmine

# capitalize the names of the regions
population$Descrizione_Provincia <- toupper(population$Descrizione_Provincia)
Now, we have to aggregate all the rows according to the name of region, sum all the values in them, and assign the result to a data frame:
# now we group by "Descrizione_Provincia" (the region name) and sum the "Total" column
pop <- as.data.frame(aggregate(population$Total, by=list(population$Descrizione_Provincia), sum))

# renames the columns for clarity
colnames(pop)[1] <- "Descrizione_Provincia"
colnames(pop)[2] <- "Total"
We're now ready to compute the rate. The original unemployment numbers are in thousands, while actual population is not, so we have to multiply it by 1000 to have the correct result, and then the compute the rate dividing the two numbers (we multiply by 100 to have a percentage instead of a [0,1] value):
# computes the number of unemployed 
pop$unemployed <- unemployment$Year.2012[match(pop$Descrizione_Provincia, unemployment$NOME_PRO)]*1000

# computes the percentage of unemployed people related to total population
pop$rate <- pop$unemployed / pop$Total * 100
We have now the unemployment rate, we just need to update the shapefile with these new data:
# updates the shapefile with these new data
regions@data = data.frame(regions@data, pop[match(regions@data$NOME_PRO, pop$Descrizione_Provincia),])
and plot again:
# plots it
cols <- gray.colors(32, 0.9, 0.2)
spplot(regions, "rate", col.regions=cols, main="Unemployment in 2012")
grid.text("% of unemployed population", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 14))
Here is the final result:
Finally, this choropleth gives us an idea of unemployment in Italy.


Here's the full code, also available on my github:
# you can also download all the datafiles from my github: 
https://github.com/andreaiacono/playingwithdata/tree/master/data

library(maptools) # for reading shapefiles
library(grid) # for adding a title to the legend of the plot

### loads data ###

# sets current directory
setwd("/home/andrea/Programming/code/R/PlayingWithData/")

# reads the shapefile of administrative regions
regions <- readShapeSpatial("data/prov2011_g.shp")

# reads the unemployment data
unemployment <- read.csv(file="data/unemployment.csv")

# selects only the data we are interested (the name and the year 2012)
unemployment <- subset(unemployment, select=c(1,10))

# adds the unemployment value to the shapefile
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])

# plots it
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")

# adds a title to the legend
grid.text("Number of unemployed people in thousands", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))

# removes the outliers
unemployment$Year.2012[which(unemployment$NOME_PRO == 'NAPOLI' | unemployment$NOME_PRO == 'TORINO' | unemployment$NOME_PRO == 'ROMA'| unemployment$NOME_PRO == 'MILANO')] <- 0

# reloads the shapefile
regions <- readShapeSpatial("data/prov2011_g.shp")

# adds the new unemployment data (without outliers)
regions@data = data.frame(regions@data, unemployment[match(regions@data$NOME_PRO, unemployment$NOME_PRO),])

# plots it, legend included
spplot(regions, "Year.2012", col.regions=gray.colors(32, 0.9, 0.2), main="Unemployment in 2012")
grid.text("Number of unemployed people in thousands (excluded Milan,Turin,Rome,Naples)", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 12))

# now we want the rate of unemployment instead of absolute number, so we load the population number for every region for having a rate

# loads data from CSV
population <- read.csv(file="data/population_per_region.csv")

# leaves only the three columns we're interested in (name, total_males, total_females)
population <- subset(population, select=c(2,8,13))

# adds a "Total" column to the dataset with the sum of total_males and total_females
population$Total <- population$Totale_Maschi + population$Totale_Femmine

# capitalize the names of the regions
population$Descrizione_Provincia <- toupper(population$Descrizione_Provincia)

# now we group by "Descrizione_Provincia" and sum the "Total" column
pop <- as.data.frame(aggregate(population$Total, by=list(population$Descrizione_Provincia), sum))

# rename the columns
colnames(pop)[1] <- "Descrizione_Provincia"
colnames(pop)[2] <- "Total"

# computes the number of unemployed 
pop$unemployed <- unemployment$Year.2012[match(pop$Descrizione_Provincia, unemployment$NOME_PRO)]*1000

# computes the rate of unemployed people related to total population
pop$rate <- pop$unemployed / pop$Total * 100

# updates the shapefile with these new data
regions@data = data.frame(regions@data, pop[match(regions@data$NOME_PRO, pop$Descrizione_Provincia),])

# and plots it
cols <- gray.colors(32, 0.9, 0.2)
spplot(regions, "rate", col.regions=cols, main="Unemployment in 2012")
grid.text("% of unemployed population", x=unit(0.95, "npc"), y=unit(0.50, "npc"), rot=90,  gp = gpar(fontsize = 14))