MEPs tools part 1 – installing R, ggplot2 and sample data

To ease the burden of supporting our benchmarking project “managing excellent planning services” (aka MEPS) we have been experimenting with some stats toys. While this started out as a “howto” in excel, the idea of being welded to a mouse for the next few weeks did not appeal. This is the technical set-up, where I describe in some detail the tools and code used to make the graphs that benchmark performance. Following on from this I go through the five sets of visualisations that support MEPS. If you want to know how we made it, or if you are living in a local authority trying to find ways of tracking performance, this might just be useful.

Just in case I can help people over the initial learning curve I’ll set out in some detail the tools and code I used to create the models for the planning application benchmarking work. It’s nothing too rocket sciency, but by posting it here I can remind myself if I need this stuff in the future. You may find it useful as a kind of recipe for some fairly basic transformations. It also helps me organise and download this part of my brain before the next project takes off.

Overview

I worked out quite quickly that excel just wouldn’t do what I wanted. It’s fine for quick, ad hoc graphs but when you need to compare two datasets, you could either spend ages trying to get the graphs to align (not just the borders, but the axes – and the scales of the axes – and the colours of the components) or just give up and use a “proper” data visualisation tool. I gave up quite quickly, as I’d had my eye on some new toys for a while and this seemed like a good opportunity to try them out. It is a statistical programming environment called ‘R‘ and a graphical add-on called ggplot2. This is my unbeautiful, hacky first go at getting these elegant and powerful tools to do something useful for me. This is (one of) the great things about open source software – you can just try things to see if you like them.

I’m going to explain how I used these things to explore the application data and then visualise it. Because I may have to do several (many ?) of these over the next few months, it was worth me putting in some investment of time upfront to reduce hassle later. If you have never done any coding, never done any SQL you may struggle. My hope is that people with a rough grounding in the basics (you know that “#” comments out a line) get enough from this to be able to get a feel for the approach so you can begin customising it for you. We may run a course or two (for people supporting LPAs), depending on interest. It’s exhausting writing this type of blog, so ask questions if I haven’t explained well enough.

application jitterplot

application jitterplot

Garbage in, garbage out
The first job is to ensure that your data is aligned with some kind of standard. We’ve put one together in the CoP over here, along with some sample data. If you are going to give this a go, you might want to get a copy so you can see what it looks and feels like. It’s nothing too rocket sciency – where you might have the word “granted” someone else might use “permitted”, and someone else again might record it is “Granted”. There are some set types of things, and you’ll need to translate from your native dataset to an aligned one. For the first few times, just do a pivot table, work out your mapping and then a vlookup. If you do it more often, it may be worth investing some time in a database table. For now, you’ll need to do it to:-

  • planning decisions
  • planning application types
  • decision authority
  • application status

It feels like a bit of a pain, but once you have adopted standard vocabulary you are able to talk the same language as your peers, and that’s where it gets really interesting.

Platform
[ These brief notes written beginning of Oct 2009. As always, they will age fast]
As usual, I did this work on one of my home machines running ubuntu 9.04 “jaunty” (nothing fancy – a really old work laptop that I rescued from the recyclers). From synaptic I installed R version 2.8.1 (2008-12-22) [the ‘official’ ubuntu version, not the CRAN].
I recommend using an IDE, eventually I settled on RKWard. I’m using 0.5.1 on KDE 4.2.2. Note that this doesn’t install using synaptic, I followed the straightforward instructions on their project page.
Once R is in, there are a couple of packages you’ll need that get embedded in R itself – there are two of them in the file posted below. Note that the code references my filepath – you’ll have to change it to suit your system (unless your login is also ‘richardc’).

If you’ve never used linux before, I think all these tools are available for windows.

Getting the data into R, and doing some calculations

I suggest you get the example data from the CoP for your first runthrough. It’s not “real”, in that it is a randomised sample and doesn’t reflect a real place. It is, however, “realistic” in both what it contains and how grotty it is. If you need to, register for the CoP, download the file, unzip it and put it somewhere you can remember and update the paths as below.

This reads the data in from a CSV file, and puts it into an object called “applist”.

> applist <- read.table("/home/richardc/Desktop/MEPS/pilotdata2.csv",header=T,sep=",",quote="\"")
> setwd("/home/richardc/Desktop/MEPS") # and set the working directory (this is where you'll find the outputs)

Check what we’ve read in.

str(applist)
'data.frame':    16510 obs. of  10 variables:
 $ Place   : Factor w/ 3 levels "E1","E2","E3": 1 1 1 1 1 1 1 1 1 1 ...
 $ rref    : Factor w/ 16510 levels "app00001","app00002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ rtype   : Factor w/ 10 levels "Advert","Certs",..: 7 3 5 5 7 5 3 7 3 5 ...
 $ fee     : num  0 0 110 110 110 110 0 440 0 110 ...
 $ rstatus : Factor w/ 8 levels "0","Appeal","Determined",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ date_rec: Factor w/ 1354 levels "01/01/07","01/01/08",..: 351 351 469 587 587 704 704 1057 1057 1163 ...
 $ date_val: Factor w/ 1289 levels "","01/02/05",..: 88 1020 436 545 57 2 262 79 1004 61 ...
 $ date_dec: Factor w/ 1174 levels "","01/02/06",..: 258 414 150 195 1015 283 927 258 1115 240 ...
 $ rroute  : Factor w/ 4 levels "Call-in","Committee",..: 2 3 2 3 3 3 3 2 1 3 ...
 $ rdec    : Factor w/ 6 levels "0","Grant","#N/A",..: 2 5 2 5 2 5 2 2 2 5 ...

(This says that “applist” is a dataframe of 16,510 planning applications (now you can see why it’s so big !). All the types of data (except ‘Fee’ which is a number) are declared as ‘Factors’.  We’ll need to tell R that they are dates, and then it can unlock all the standard libraries and their date-wrangling). You’ll see from the examples below, that we refer to each part of the dataframe using the ‘$’ sign – to reference the fee variables we use ‘applist$fee’. Yes it has to be spelt correctly, case observed.

So, let’s change the three date fields into proper dates.  Taking “date_rec” (date the application was first received) as an example, the code says replace ‘date_rec’ with the result of casting date_rec as a date using a format of MM/DD/YY. Note that this operation is a bit like =DATE(text) that you may put into an excel cell, and then copy down. Here, the fact that the operation is done to the whole column is taken as read.

applist$date_rec <- as.Date(applist$date_rec, "%m/%d/%y")

Now we can calculate some fields we’ll use later. In English, the time taken to validate an application (we’ll call it “vdays”) is the date it’s valid (date_val) less the date it was received (date_rec). Similarly, the time taken to determine an application (we’ll call it “ddays”) is the date it was determined (date_dec) less the date it was valid (date_val).

applist$vdays <- applist$date_val - applist$date_rec
applist$ddays <- applist$date_dec - applist$date_val

Now we need to set up some grouping variables. We will, for example, want to count how many applications re receive in a month. We might want to graph something over a number of years. Let’s calculate them:

applist$month <- as.numeric(format(applist$date_rec, "%m")) + (as.numeric(format(applist$date_rec,"%Y"))-2005)*12
applist$year <- as.numeric(format(applist$date_rec,"%Y"))

Important note for planners. These two use the date received as the ‘marker’ of where the application belongs. Traditional PS2 uses the date determined. You’ll have to see the sister articles to see why I think it is worth changing this around.

Note also that the “month” is not a calendar month, it is an elapsed month from the beginning of the series. If I’d written code more elegantly, I would have had the code work out where to start from. As I specc’d the data, I know that the months should begin in January 2005. Let’s check the data so far :-

>table(applist$year, applist$rtype)
       Advert Certs Cons Details House Major Minor Misc #N/A  Use
  2005    185   104  433     122  1091    55  1165  103  185  206
  2006    194   105  459     148  1111    67  1178   96   54  197
  2007    186   130  477     227  1292    48  1247  107   54  171
  2008    155   127  469     207  1057    37  1051  118   78  126
  2009     98    57  281      79   581    20   551  105   50   66

You can see the command “table” creates something a bit like a pivot, and because we didn’t specify anything different it has created a “count” of each application type per year. Remember I said the data is grotty ? The #N/A records are technically deleted, but still kept in the database for (presumably) some kind of audit purpose. Let’s kill them, or there’ll get in the way:

> applist <- applist[applist$rtype != '#N/A',]

[this is matching the text string “#N/A”, as that is how our read in of the CSV file has coded it. There is also a genuine NA in R, and we would remove them in a different way – applist <- applist[!is.na(applist$rtype),]. Note the general form of this is newthing <- oldthing[test,] – note that ! means ‘not’ and that trailing comma is needed !]

We’re ready for our first plot. Let’s compare the number of applications received and their status in each of the three LPAs:-

qplot(month, data=applist, xlab="months starting Jan 2005", ylab="work", geom="histogram", binwidth=1, facets= Place ~ .,fill=rstatus)

firstplot
This is good. See how the plot has taken care of aligning the axes and scales, making it easy to compare what’s going on in these three places. However, we’re not used to a decimal approach to a year -they have 12 months and could be broken as quarters. Also, there are several years here and it would be good to mark them out.

quarters <- seq(3,54,3) # from 3 to 54 in intervals of 3
dec <- c(12,24,36,48) # marking dec, not january
qplot(month, data=applist, xlab="months starting Jan 2005", ylab="work", geom="histogram", binwidth=1, facets= Place ~ .,fill=rstatus) +scale_x_continuous(breaks = quarters)+geom_vline(xintercept = dec)

plot1 - the factory gate
Great. We’re up and running. Let’s take a discussion on what this visualisation actually shows (and how you might use it) onto the next issue of this geeky series.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s