Open Data companion

Last updated: 5 May 2018

The Open Data companion shows the steps R users can follow to retrieve and tidy open data from various government Application Programming Interfaces (APIs). Additional open data sources will be added in the future.


Nomis provide labour market, benefit and census data for the United Kingdom.

You can download data from Nomis in CSV format using their API. To do this, we construct an API URL by appending a dataset id and search parameters to the following web root:

NB Those who prefer to create the API URL automatically can use Nomis’ query data interface.
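To make the mechanics concrete, here is a sketch of how such a URL is put together. The root below is a placeholder, not the real Nomis web root (substitute the actual API web root), and NM_162_1 is the dataset id we identify later in this section:

```r
# Placeholder root -- substitute the actual Nomis API web root
api_root <- "https://example-api-root/"

# The dataset id (plus a format suffix) forms the path...
path <- paste0(api_root, "NM_162_1", ".data.csv")

# ...and search parameters follow a '?' as key=value pairs
url <- paste0(path, "?date=latest&gender=0")
url
```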

First, let’s load the necessary R packages.

library(tidyverse) ; library(jsonlite) ; library(purrr) ; library(stringr) ; library(knitr)

Then we need to pull the metadata for all of the Nomis datasets that are available to query via the API.

metadata <- fromJSON("", flatten = TRUE) %>%
  map("keyfamilies") %>%
  map_df(bind_rows) %>% 
  unnest(components.dimension) %>% 
  select(id, name = name.value, description = description.value, parameter = conceptref)

We’d like to obtain the id for the Claimant count by sex and age dataset so let’s list all datasets with the keywords ‘claimant’ and ‘age’ using a regular expression.

filter(metadata, str_detect(name, regex('^.*?\\b(claimant(s)*)\\b.*?\\bage\\b.*?$', ignore_case = T))) %>%
  distinct(id, name) %>% 
  kable(caption = "Nomis datasets with 'claimant' and 'age' keywords")
Nomis datasets with ‘claimant’ and ‘age’ keywords
id name
NM_2_1 claimant count - age and duration
NM_7_1 claimant count - occupation, age and duration
NM_68_1 benefit claimants 5% data - working age client group
NM_69_1 benefit claimants 5% data - working age families
NM_70_1 benefit claimants 5% data - children of working age clients
NM_71_1 benefit claimants 5% data - pensionable age
NM_103_1 benefit claimants - working age clients for small areas
NM_105_1 benefit claimants - working age client group
NM_162_1 Claimant count by sex and age
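The same keyword test can be sketched with base R's grepl, which behaves like str_detect with ignore_case = TRUE:

```r
# Whole-word 'claimant' (optionally plural) followed somewhere later by whole-word 'age'
pattern <- "^.*?\\b(claimant(s)*)\\b.*?\\bage\\b.*?$"

grepl(pattern, "Claimant count by sex and age", ignore.case = TRUE, perl = TRUE)  # TRUE
grepl(pattern, "annual population survey", ignore.case = TRUE, perl = TRUE)       # FALSE
```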

We are interested in ‘Claimant count by sex and age’ which has the id of ‘NM_162_1’. We can therefore update the API URL with the dataset id:

Next we need to identify the search parameters (or dimensions) available for NM_162_1. Each dataset can have a number of different parameters such as ‘date’, ‘geography’, ‘gender’, and ‘measures’. The date parameter can have values like ‘latest’, ‘prevyear’ etc. All the other available parameters and their corresponding values need to be identified from the dataset’s metadata.

filter(metadata, id == "NM_162_1") %>% 
  pull(parameter) %>% 
  kable(col.names = "Parameter", caption = "Available parameters for NM_162_1")
Available parameters for NM_162_1

To find which genders are available for the ‘Claimant count by sex and age’ dataset we need to add gender and def.sdmx.json to the API URL, which extracts the codelist for that dimension:

fromJSON("", flatten = TRUE) %>% 
  unnest() %>%
  select(description = description.value, value) %>% 
  kable(caption = "Available categories of gender")
Available categories of gender
description value
Total 0
Male 1
Female 2

We want ‘Total’ which has a value of 0. To append this to the API URL we would add &gender=0. The ampersand is used to separate each parameter in the query string of the API URL.

Finding which geographic areas the claimant count data are available for requires a slight tweak to the API URL for the dataset codelist:

fromJSON("", flatten = TRUE) %>% 
  unnest() %>%
  select(description = description.value, value) %>% 
  head() %>% 
  kable(caption = "First six available geographic areas")
First six available geographic areas
description value
jobcentre plus group as of April 2018 TYPE83
jobcentre plus district as of April 2018 TYPE84
live jobcentre plus offices TYPE86
major towns and cities TYPE225
2011 scottish datazones TYPE231
2011 scottish intermediate zones TYPE232

We’d like claimant count data at the ‘2011 super output areas - lower layer’ level so we’d append &geography=TYPE298 to the API URL.

All the parameters we require and the corresponding values of interest are listed below:

  • date=latest (latest available period)
  • geography=TYPE298 (2011 super output areas - lower layer)
  • gender=0 (Total)
  • age=0 (All categories: Age 16+)
  • measure=1 (Claimant count)
  • measures=20100 (value)
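Assembling those key–value pairs into a query string is a one-liner; a minimal base-R sketch (only the parameter names and values come from the list above):

```r
# Parameters of interest for NM_162_1
params <- c(date = "latest", geography = "TYPE298", gender = "0",
            age = "0", measure = "1", measures = "20100")

# Join key=value pairs with '&', the query-string separator
query <- paste(names(params), params, sep = "=", collapse = "&")
query  # "date=latest&geography=TYPE298&gender=0&age=0&measure=1&measures=20100"
```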

We can now append all the necessary parameters to the API URL. Note that we have added .data.csv to the dataset id, which is required to pull CSV data from the API.

df <- read_csv("") 

kable(select(df, 1:6) %>% slice(1:6), caption = "Raw claimant count data")
Raw claimant count data
2018-04 April 2018 2018-04 date 0 0
2018-04 April 2018 2018-04 date 0 0
2018-04 April 2018 2018-04 date 0 0
2018-04 April 2018 2018-04 date 0 0
2018-04 April 2018 2018-04 date 0 0
2018-04 April 2018 2018-04 date 0 0

Then we can tidy the results up a little by filtering by Greater Manchester boroughs, creating new variables and renaming others.

df %>% 
  filter(grepl('Bolton|Bury|Manchester|Oldham|Rochdale|Salford|Stockport|Tameside|Trafford|Wigan', GEOGRAPHY_NAME)) %>% 
  mutate(date = as.Date(paste('01', DATE_NAME), format = '%d %B %Y'),
         lad17nm = str_sub(GEOGRAPHY_NAME, 1, str_length(GEOGRAPHY_NAME)-5),
         measure = "Residents claiming JSA or Universal Credit",
         lsoa11nm = GEOGRAPHY_NAME,
         lsoa11cd = GEOGRAPHY_CODE, 
         value = OBS_VALUE) %>% 
  select(date, lsoa11nm, lsoa11cd, lad17nm, measure, value) %>% 
  head() %>% 
  kable(caption = "Tidy claimant count data")
Tidy claimant count data
date lsoa11nm lsoa11cd lad17nm measure value
2018-04-01 Bolton 001A E01004768 Bolton Residents claiming JSA or Universal Credit 5
2018-04-01 Bolton 001B E01004803 Bolton Residents claiming JSA or Universal Credit 5
2018-04-01 Bolton 001C E01004804 Bolton Residents claiming JSA or Universal Credit 5
2018-04-01 Bolton 001D E01004807 Bolton Residents claiming JSA or Universal Credit 10
2018-04-01 Bolton 001E E01004808 Bolton Residents claiming JSA or Universal Credit 5
2018-04-01 Bolton 002A E01004788 Bolton Residents claiming JSA or Universal Credit 5
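Two steps in that pipeline are worth checking on a single value: prepending a day so the period label parses as a date, and trimming the trailing space-plus-code from the LSOA name (the date parsing assumes an English locale):

```r
# 'April 2018' has no day of month, so prepend '01' before parsing
d <- as.Date(paste("01", "April 2018"), format = "%d %B %Y")
d  # "2018-04-01" in an English locale

# LSOA names end in a space plus a 4-character code, e.g. 'Bolton 001A';
# dropping the last five characters leaves the borough name
name <- "Bolton 001A"
substr(name, 1, nchar(name) - 5)  # "Bolton"
```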

Useful resources

The data.police.uk website provides incidents of police recorded crime and anti-social behaviour (ASB) in England, Wales and Northern Ireland.

The API can be used to request incident data for any geographic area of interest. In this example, we are going to extract crime and ASB data for the borough of Trafford.

First, let’s load the necessary R packages.

library(tidyverse) ; library(httr) ; library(jsonlite) ; library(sf)

Now we need to download a vector boundary layer. The code below downloads a generalised vector boundary layer of Local Authority Districts from the Open Geography Portal and filters the borough of Trafford.

bdy <- st_read("") %>% 
  filter(lad17nm == "Trafford")

Next, we extract the coordinates from the boundary layer and format for inclusion in the API URL.

coords <- bdy %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  select(X, Y) %>% 
  unite(coords, Y, X, sep = ',') %>% 
  mutate(coords = sub("$", ":", coords)) %>% 
  .[["coords"]] %>% 
  paste(collapse = "") %>% 
  str_sub(., 1, str_length(.)-1)
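The effect of that pipeline is easier to see on a toy vertex matrix: each row becomes ‘lat,long’ and rows are joined with ‘:’, which is how the API expects a polygon to be encoded. A base-R sketch:

```r
# Toy vertices in the shape returned by st_coordinates(): X = longitude, Y = latitude
m <- matrix(c(-2.35, -2.28, -2.30,
              53.41, 53.45, 53.40),
            ncol = 2, dimnames = list(NULL, c("X", "Y")))

# 'lat,long' per vertex, ':' between vertices
coords <- paste(m[, "Y"], m[, "X"], sep = ",", collapse = ":")
coords  # "53.41,-2.35:53.45,-2.28:53.4,-2.3"
```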

We are now ready to construct the API request. This has two parts: the polygon coordinates and a month. The code below requests incidents within the boundary of Trafford recorded in April 2018.

request <- POST(url = "",
            query = list(poly = coords, date = "2018-04"))

Next, we process the content of the API request …

content <- content(request, as = "text", encoding = "UTF-8")

and parse the JSON response.

results <- fromJSON(txt = content) 

The crime and ASB categories listed in the API response are hyphenated to make them machine readable, so we need to pull more human-readable categories from the API.

categories <- GET("")
categories <- fromJSON(content(categories, as = "text", encoding = "UTF-8"))
head(categories) %>% kable()
url name
all-crime All crime
anti-social-behaviour Anti-social behaviour
bicycle-theft Bicycle theft
burglary Burglary
criminal-damage-arson Criminal damage and arson
drugs Drugs

Next we’ll create a new data frame using the content that we’ve parsed from the API request.

results_df <- data.frame(
  month = results$month,
  url = results$category,
  location = results$location$street$name,
  long = as.numeric(as.character(results$location$longitude)),
  lat = as.numeric(as.character(results$location$latitude)),
  stringsAsFactors = FALSE
)

Then update it with the readable crime and ASB categories.

df <- left_join(results_df, categories, by = "url") %>% select(category = name, 
    long, lat, location, month)
head(df) %>% kable(caption = "Crime and ASB data for Trafford, April 2018")
Crime and ASB data for Trafford, April 2018
category long lat location month
Anti-social behaviour -2.359042 53.40966 On or near Little Brook Road 2018-04
Anti-social behaviour -2.286686 53.45244 On or near Parking Area 2018-04
Anti-social behaviour -2.281867 53.45239 On or near Trentham Road 2018-04
Anti-social behaviour -2.387101 53.44150 On or near Carrington Road 2018-04
Anti-social behaviour -2.259607 53.46156 On or near Eton Close 2018-04
Anti-social behaviour -2.345055 53.44909 On or near Lorne Grove 2018-04

Finally, we’ll convert the data frame to a spatial object so that we can check the results by mapping it.

sf <- df %>% 
  st_as_sf(coords = c("long", "lat"), crs = 4326) # coordinates are WGS84 long/lat

plot(st_geometry(bdy))
plot(st_geometry(sf), add = TRUE)