Making big data work for economics

Arthur Turrell, Bradley Speigner, James Thurgood, Jyldyz Djumalieva, and David Copple

‘Big Data’ present big opportunities for understanding the economy.  They can be cheaper and more detailed than traditional data sources, and on scales undreamt of by survey designers.  But they can be challenging to use because they rarely adhere to the nice neat classifications used in surveys.  We faced just this challenge when trying to understand how the efficiency with which job vacancies are filled relates to output and productivity growth in the UK.  In this post, we describe how we analysed text from 15 million job adverts to glean insights into the UK labour market.

First, a health warning: the research which underlies this blog post, and therefore the post itself, is quite technical, combining advanced statistical processing techniques with economic theory. Some technical details, including of the algorithm we developed for the text analysis, are included for those interested but are not needed to understand the article – they can be safely skipped.

Our job vacancy data come from an online recruitment website, Reed.co.uk.  These ads have been posted online each day over a number of years.  They contain rich information, including the job description, title, and sector.

This information can help us understand the supply of and demand for labour in different occupations in the UK.  But in order to make the most of it, we needed to combine it with other labour market data.  That meant classifying the jobs advertised using the Office for National Statistics’ (ONS) standard occupational classification (SOC) numbers (also known as SOC codes).  These codes classify every job in the UK economy into pre-determined categories.  Assigning SOC numbers to each advert proved to be much easier said than done.

Putting SOCs on

SOC numbers are just a shorthand way of describing a job. So our task was to devise an algorithm to read a job advert and classify it with a SOC number. The difficulty is that job descriptions also contain a lot of information which is not specific to the occupation being advertised, and the algorithm needed to discard this information while retaining the salient parts of the job description.

The need to take text data and match it to official categories is sure to arise frequently for economists and statisticians.  We have therefore released this algorithm as the first repository on the Bank of England’s new Github account so that others can use and adapt it.  You can find details of how to download and use the code at the end of the blog post.  More technical details of the algorithm now follow – so feel free to skip ahead to the next section if these don’t interest you.

Our algorithm relies on materials from the ONS which describe SOC codes in great detail.  From these, we extracted all phrases up to three words long associated with each SOC code.  To get these phrases into a quantitative form, we used term frequency–inverse document frequency (tf-idf) weighting, which represents our phrases as a matrix in which each SOC code $d$ is a row and each phrase a column.  Its dimensions are therefore given by the number of SOC codes, $D$, and the number of unique terms, $T$.
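
To make the idea concrete, here is a minimal, hand-rolled sketch of how such a term-document matrix can be built. The SOC descriptions and single-word "phrases" below are invented for illustration; the real algorithm uses the full ONS materials and phrases of up to three words.

```python
import math
from collections import Counter

# Toy "documents": one string of descriptive terms per SOC code (illustrative only).
soc_descriptions = {
    "2111": "physicist research experiments physical sciences",
    "2421": "economist economic analysis forecasting statistics",
    "3543": "marketing associate campaigns advertising analysis",
}

# Tokenise each SOC description (here single words stand in for phrases).
docs = {soc: text.split() for soc, text in soc_descriptions.items()}
vocab = sorted({term for words in docs.values() for term in words})
n_docs = len(docs)

# Document frequency: in how many SOC descriptions does each term appear?
df = {t: sum(t in words for words in docs.values()) for t in vocab}

def tfidf_vector(words):
    """Term frequency x inverse document frequency for one document."""
    counts = Counter(words)
    total = len(words)
    return [
        (counts[t] / total) * math.log(n_docs / df[t]) if t in counts else 0.0
        for t in vocab
    ]

# Rows of the D x T matrix: one tf-idf vector per SOC code.
matrix = {soc: tfidf_vector(words) for soc, words in docs.items()}
```

Terms that appear in many SOC descriptions get a low inverse-document-frequency weight, so the distinctive vocabulary of each occupation dominates its row of the matrix.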

The neat part of this is that we can then use the same vocabulary to express job vacancies, $i$, as vectors $\mathbf{v}_i$ in the same $T$-dimensional vector space.  Because the job vacancy vectors are created using only the phrases drawn from the official titles and descriptions produced by the ONS, much of the extraneous information in the vacancies (the requirement to have a driver’s licence and the like) falls away at this stage. With vacancies expressed as vectors, we found the top SOC code for each job vacancy $i$ by solving:

$\arg \max_{d}\left\{\mathbf{v}_i \cdot \mathbf{v}_d\right\}$

This finds the SOC code whose vector is closest to that of the job ad in question. Because several official jobs may be similar to the vacancy, we found the top five SOC codes using this method and then chose between them based on which had the closest job title to the vacancy’s title, using fuzzy matching.  Fuzzy matching, in this case, counts the number of single-character changes (the edit distance) needed to get from one word to another.  For instance, getting from ‘ekonomist’ to ‘economist’ takes just one change.
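
A minimal sketch of this two-stage match follows: rank SOC codes by dot product with the vacancy vector, then break ties among the top candidates by edit distance on the titles. The vectors and titles are toy inputs, not the production data.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def levenshtein(s, t):
    """Number of single-character edits needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def best_soc(vacancy_vec, vacancy_title, soc_vectors, soc_titles, top_n=5):
    """Top-n SOC codes by dot product, tie-broken by title edit distance."""
    ranked = sorted(soc_vectors,
                    key=lambda d: dot(vacancy_vec, soc_vectors[d]),
                    reverse=True)[:top_n]
    return min(ranked, key=lambda d: levenshtein(vacancy_title, soc_titles[d]))

# Toy usage: a misspelt 'ekonomist' still matches the economist code,
# because 'ekonomist' is a single edit away from 'economist'.
soc_vectors = {"2421": [1.0, 0.0], "2425": [0.9, 0.1]}
soc_titles = {"2421": "economist", "2425": "statistician"}
chosen = best_soc([1.0, 0.0], "ekonomist", soc_vectors, soc_titles, top_n=2)
```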

A deeper look at the labour market

Classifying the job adverts, as described above, allows for a much richer view of developments in job vacancies.  Understanding vacancies is crucial for understanding many aspects of the labour market and the economy as a whole.  Some important open questions about the current state of the economy relate directly to it.

A crucial outstanding question is why productivity growth has been so slow in the UK over the past decade.  A factor which may shed light on this is how long it takes the unemployed to find new jobs, what kind of jobs they find, and whether the picture varies across different regions of the country.  To help understand this we combined the data described above with other labour market information using a popular model of the labour market to help give structure to the analysis.  Again, the details of the model matter but are fairly technical.  Feel free to skip the next paragraph if you’re not interested in the model of the labour market we use.

We use the Diamond-Mortensen-Pissarides (DMP) matching model of equilibrium unemployment, in which vacancies play a key role. The cornerstone of the DMP model is the idea that it takes time for a worker to find a job, and for employers to fill vacancies. This is represented by a ‘matching function’, $M(u,v)$, which takes as its inputs the stock of job seekers (unemployed people) and job vacancies, and returns the number of newly created jobs in each time period.
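
The matching function is often given a Cobb-Douglas form, $M(u,v) = A u^{\alpha} v^{1-\alpha}$, where $A$ is matching efficiency and $\alpha$ the elasticity with respect to unemployment. A toy illustration (all parameter values and stocks below are made up, not estimates from the paper):

```python
def matches(u, v, efficiency=0.5, alpha=0.5):
    """Cobb-Douglas matching function: new hires per period.

    u and v are the stocks of unemployed workers and vacancies;
    'efficiency' is the matching-efficiency parameter A and alpha the
    elasticity with respect to unemployment. Illustrative values only.
    """
    return efficiency * (u ** alpha) * (v ** (1 - alpha))

def tightness(u, v):
    """Labour market tightness: the vacancy-unemployment ratio."""
    return v / u

# With (made-up) stocks of 1.6m unemployed and 0.8m vacancies:
hires = matches(1_600_000, 800_000)
theta = tightness(1_600_000, 800_000)
```

Holding the stocks fixed, a higher $A$ means more matches per period, which is why the matching-efficiency estimates discussed below translate directly into how quickly vacancies are filled.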

Once estimated, the model we use can generate so-called ‘Beveridge curves’, which show the relationship between vacancies and unemployment. Usually, they show the aggregate relationship between vacancies and unemployment, as in Figure 1. The circles show quarterly unemployment-vacancy rate data, and the green line the relationship suggested by the model.  But the aggregate picture may conceal very different underlying situations.  Using the data produced by our algorithm, we are able to produce Beveridge curves at the level of occupations, shown in Figure 2.

Figure 1: Beveridge curve showing theoretical relationship between vacancies and unemployment alongside points representing quarterly data. Arrows indicate the flow of time. Source: ONS, Reed, author calculations.

Figure 2: Beveridge curves for different occupations. SOC numbers are shown in brackets. Source: ONS, Reed, author calculations.
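
In the DMP framework, the theoretical curve in these figures comes from balancing flows into unemployment, $s(1-u)$ for separation rate $s$, against flows out, $M(u,v)$. A sketch of how such a steady-state Beveridge curve can be traced out numerically, using the Cobb-Douglas matching function and entirely illustrative parameter values:

```python
def steady_state_u(v, s=0.03, A=0.6, alpha=0.5):
    """Unemployment rate balancing flows in and out of unemployment.

    Solves s * (1 - u) = A * u**alpha * v**(1 - alpha) for u by
    bisection. The separation rate s, efficiency A, and elasticity
    alpha are made-up values for illustration only.
    """
    lo, hi = 1e-9, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        inflow = s * (1 - mid)                     # falls as u rises
        outflow = A * mid ** alpha * v ** (1 - alpha)  # rises with u
        if outflow < inflow:
            lo = mid   # too few matches: steady-state u must be higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Tracing u over a grid of vacancy rates gives a downward-sloping curve.
curve = [(v, steady_state_u(v)) for v in (0.02, 0.05, 0.10)]
```

The downward slope is the Beveridge curve: more vacancies mean faster outflows from unemployment, so the steady-state unemployment rate is lower.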

There are significant differences between occupations which are hidden by the aggregate Beveridge curve, particularly in how ‘tight’ the markets for different types of labour are.  Tightness is just the ratio of vacancies to unemployment; a ‘tight’ labour market means that there are many vacancies relative to the number of unemployed so it is typically more difficult for firms to recruit workers.

We can use the data and model to investigate further.  Figure 3 shows that matching efficiency – the speed with which job vacancies are filled – differs substantially across occupations. The lowest matching efficiency occurs in the most productive occupation.  An important thing to bear in mind with this analysis is that the speed of match is not the only thing that matters.  Identifying the right person for a job is also important, and it may be that employers take longer to fill higher-productivity roles because they are more productive, rather than despite it.  Nonetheless, the longer a vacancy takes to be filled, the larger the amount of lost output.

Figure 3: Estimates of productivity (left-hand y-axis) and of the matching efficiency (right-hand y-axis). Standard errors are shown for the estimates of the matching efficiency. Source: ONS, Reed, author calculations.

The picture shown in Figure 3 has two important implications. First, because matching efficiencies are heterogeneous, the speed with which jobseekers find employment will depend on the composition of vacancies.

The second implication is that a shock to the demand for labour will have different effects on output depending on how it falls across occupations.  For example, all else equal, if the demand for managers and professionals increased, the short-term output loss relative to potential while the vacancies were filled would be relatively large, because of the combination of high productivity and low matching efficiency.

In this blog post, we have barely scratched the surface of the insights such data sources can provide.  In the paper associated with this post, we also look at what our data can tell us about regional mismatch and its effects.  All this has only been possible because we have combined our ‘Big Data’ with the outputs of statistical agencies.  The full benefits of such ‘naturally occurring’ data come from using them as a complement to, rather than as a replacement for, existing survey data.

Using and applying the occupational coding algorithm

The Python package we created to apply SOC codes to job descriptions is called occupationcoder. There are instructions on how to install this package in the ‘README’ file on the Github repository. Once installed, occupationcoder can be used with the following Python code:

[code language="python"]
import pandas as pd
from occupationcoder.coder import coder
myCoder = coder.Coder()
[/code]

To run the code on a single job, use the

[code language="python"]codejobrow(job_title, job_description, job_sector)[/code]

method. For example,

[code language="python"]
myCoder.codejobrow('Physicist',
                   'Make calculations about the universe, do research, perform experiments and understand the physical environment.',
                   'Professional scientific')
[/code]

will return

job_title: Physicist
job_description: Make calculations about the universe, do research, perform experiments and understand the physical environment.
job_sector: Professional, scientific & technical activities
SOC_code: 211

Arthur Turrell works in the Bank’s Advanced Analytics Division, Bradley Speigner works in the Bank’s Structural Economic Analysis Division, James Thurgood works in the Bank’s Technology Division, Jyldyz Djumalieva formerly worked in the Bank’s Technology Division, and David Copple works in the Bank’s External Monetary Policy Committee Unit Division.