Using machine learning to understand the mix of jobs in the economy in real-time

Arthur Turrell, Bradley Speigner, James Thurgood, Jyldyz Djumalieva and David Copple

Recently, economists have been discussing, on the one hand, how artificial intelligence (AI) powered by machine learning might increase unemployment, and, on the other, how AI might create new jobs. Either way, the future of work is set to change. We show in recent research how unsupervised machine learning, driven by data, can capture changes in the type of work demanded.

Many economic models make the convenient assumption that labour markets are homogeneous, such as in the basic version of the Diamond-Mortensen-Pissarides model of job creation and destruction. When economists and statisticians assess the labour market in more detail, looking at the demand for workers of specific types, they use classification schema such as occupation, region, or sector. These classifications help to make sense of the millions of jobs in the economy by putting them into similar buckets. They are carefully designed by statisticians. Because these schema are fixed for many years at a time, they allow for analysis with a time dimension.

However, this stability has a cost: a fixed schema ignores changes in the ‘true’ set of jobs in the economy. New jobs may emerge, and old ones may become less important. So, alongside the fixed schema, it is useful to have real-time, data-driven methods to classify jobs. Real-time job classifications could be used to understand structural changes in the labour market and also to inform the design of future classification schema.

As an example of the latter point, in economic models where there is heterogeneity, it is often assumed that workers do not switch between different groupings within a classification schema, as in the Sahin model of mismatch unemployment. But Figure 1 shows that workers do transition across occupational classifications. Although most job-to-job transitions are within-occupation (shown on the diagonal), a significant number involve a change to another occupation (off-diagonals). Data-driven, real-time approaches such as ours may also be able to create groupings which are more closely aligned to the career paths that workers follow.

Our method also chooses the number of ‘buckets’ to create based on how varied the job descriptions are in the data. This is particularly useful as economic theories of the labour market do not typically specify how many distinct classes of job there should be and results may be sensitive to this.

Figure 1: Quarterly job-to-job probability transition matrix from occupation (rows) to occupation (columns) averaged over 2007Q1 to 2017Q1. The entries are probabilities so that each row sums to one. For example, reading across the third row gives the probability of a worker transitioning from Caring and Leisure occupations to any other occupation. Data from the ONS Labour Force Survey.
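To make the construction of Figure 1 concrete, here is a minimal sketch of turning observed job-to-job moves into a transition matrix whose rows sum to one. The moves listed are made up for illustration, and the function name `transition_matrix` is our own; the sketch also skips the averaging over quarters that the figure applies.

```python
from collections import Counter

# Hypothetical job-to-job moves as (origin occupation, destination occupation).
moves = [
    ("Caring and Leisure", "Caring and Leisure"),
    ("Caring and Leisure", "Caring and Leisure"),
    ("Caring and Leisure", "Sales"),
    ("Caring and Leisure", "Professional"),
    ("Sales", "Sales"),
    ("Sales", "Sales"),
    ("Sales", "Sales"),
    ("Sales", "Caring and Leisure"),
    ("Professional", "Professional"),
    ("Professional", "Sales"),
]


def transition_matrix(moves):
    """Return {origin: {destination: probability}}, each row summing to one."""
    counts = Counter(moves)
    occupations = sorted({occ for pair in moves for occ in pair})
    matrix = {}
    for origin in occupations:
        # Total number of moves that start in this occupation.
        row_total = sum(counts[(origin, dest)] for dest in occupations)
        matrix[origin] = {
            dest: counts[(origin, dest)] / row_total if row_total else 0.0
            for dest in occupations
        }
    return matrix


matrix = transition_matrix(moves)
for origin, row in matrix.items():
    print(origin, {dest: round(p, 2) for dest, p in row.items()})
```

The diagonal entries (for example `matrix["Sales"]["Sales"]`) are the within-occupation moves; everything else in a row is a change of occupation.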

The data on job vacancies

Our data are the text of advertised job descriptions: nearly 15 million job adverts posted online by firms between 2008 and 2016. Figure 2 shows how such text can be useful. It displays the results of a search for text relating to specific job titles. The data imply that the average number of vacancies for data scientists open at any time has increased substantially in recent years, albeit from a low base, as the comparison with the other titles shown demonstrates. Part of the increase may be due to the rebranding of similar jobs which were previously advertised under a different name. In the official scheme for occupations (defined by the 2010 Standard Occupational Classification codes, or SOC codes), ‘Data Scientist’ doesn’t exist. But the text of job vacancy postings shows not only that this job does exist, but that demand for it has increased substantially.

Figure 2: Occurrences of terms in job titles can be indicative of the stock of vacancies of certain types. ‘data sci’, which is associated with the job ‘Data Scientist’, has increased substantially since 2011, albeit from a low base. The mean stock of vacancies is the average number of advertised jobs with a particular title in a given year.
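A title search of this kind can be sketched in a few lines. The example below is illustrative only: the adverts are invented, the function name `mean_stock_by_year` is hypothetical, and taking the mean over quarterly snapshots is our assumption about how an average stock could be computed, not a description of the exact calculation behind Figure 2.

```python
from collections import defaultdict

# Hypothetical snapshots: (year, quarter, job title) for each advert
# that was open in that quarter.
open_adverts = [
    (2011, 1, "Data Scientist"),
    (2011, 1, "Statistician"),
    (2011, 2, "Senior Statistician"),
    (2015, 1, "Data Scientist"),
    (2015, 1, "Lead Data Scientist"),
    (2015, 1, "Statistician"),
    (2015, 2, "Junior data scientist"),
    (2015, 2, "Data Science Manager"),
    (2015, 2, "Data Scientist"),
    (2015, 2, "Data Scientist"),
]


def mean_stock_by_year(open_adverts, term):
    """Average, over the quarters seen in each year, of titles containing `term`."""
    per_quarter = defaultdict(int)
    for year, quarter, title in open_adverts:
        # Adding a bool registers every observed quarter, so quarters
        # with no matching title still count as zero in the average.
        per_quarter[(year, quarter)] += term.lower() in title.lower()
    by_year = defaultdict(list)
    for (year, _quarter), count in per_quarter.items():
        by_year[year].append(count)
    return {year: sum(c) / len(c) for year, c in sorted(by_year.items())}


stock = mean_stock_by_year(open_adverts, "data sci")
print(stock)
```

Matching on a stem such as ‘data sci’ rather than the full title picks up variants like ‘Data Science Manager’ as well as ‘Data Scientist’.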

The methodology for turning job descriptions into groups within the labour market

Machine learning is a type of computer-driven statistical modelling that has proven to be powerful in many applications. One of its strengths is that it can be applied to data in the form of text or images, in addition to quantitative data. We now explain how our algorithm does this; feel free to skip these details if you’re only interested in the results.

We use an unsupervised machine learning method called ‘Latent Dirichlet Allocation’ (LDA) to create groupings of jobs which are driven directly by the cleaned text describing each job vacancy. LDA identifies groups of words that are typically found in the same advert. For example, ‘teach’, ‘class’, and ‘GCSE’ might commonly occur together. These groups of words are called topics, and each word is assigned a weight within the topic.

The total number of topics is an input into the LDA algorithm. We choose it using a metric known as ‘weighted saliency’, which gives a higher weight to rare words that are associated with a single topic. Each job has a distribution over topics, but we want each job to belong to exactly one group in our data-driven classification (by analogy with official schema), so we take a further step and group jobs together in the (vector) space defined by all topics. We use the popular K-means algorithm to do this clustering, with the number of clusters determined by the silhouette score: a metric that is higher when clusters are both well-separated and tightly packed, so we pick the number of clusters that maximises it.
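The clustering step can be sketched as follows, again using scikit-learn as an assumed library. The topic distributions here are simulated rather than taken from real adverts, and the range of candidate cluster counts is arbitrary; the point is only to show a silhouette score being used to pick the number of K-means clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical topic distributions for 90 adverts over 3 topics:
# three groups, each concentrated on a different topic.
alphas = ([80, 10, 10], [10, 80, 10], [10, 10, 80])
points = np.vstack([rng.dirichlet(alpha, size=30) for alpha in alphas])

# Fit K-means for several candidate cluster counts and score each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Keep the number of clusters with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Because K-means assigns every point a single label, each advert ends up in exactly one cluster even though its topic distribution is spread over several topics.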

Machine-learning-based clusters can highlight structural change in the labour market

The resulting clusters of jobs include the groups you might expect, such as teachers and nurses. But these clusters also contain new groups which cut across categories such as regions, occupations, and sectors in existing schema.

Figure 3 compares two of the clusters, one which clearly captures a traditional role, ‘teaching’ as a distinct type of job, and one which captures a newer type of career, ‘Project manager’, and which cuts across existing classifications. For each cluster in Figure 3, we show the most common words which appear in the job description in the word clouds in the top panel row, and the breakdown of the clusters by ONS sectoral classification in the second panel row.

Figure 3: Representations of two of the clusters created by our algorithm. The first is split across many sectors and is most associated with the job title ‘project manager’, while the second closely aligns with the Education sector. The top panel shows the most common words in the cluster’s job descriptions, while the bottom panel shows the counts of vacancies in each cluster by ONS Sectoral classification.

What is impressive about this method is that no information about teaching was given to the machine learning algorithms used, and yet the sector panel clearly shows that most of the jobs in the teaching cluster are in ‘Education’. The ‘project manager’ cluster, by contrast, shows a role which is split across sectors and occupations (the latter is not shown in the figure).

We want to know whether the clusters capture the demand for labour quantitatively as well as qualitatively. One way to check this is to compare the time series of vacancies in each cluster against the most similar time series based on official schema. If the groupings are similar, their time series will be strongly correlated with one another.

In Table 1, we present the correlation of the time series of these easily identified types of cluster (e.g. teachers) with their closest known Sector (e.g. Education) and SOC code (e.g. Teaching and educational professionals). These correlations are calculated from the quarterly time series over the entire, fully labelled dataset of vacancies broken down by the relevant cluster group, Sector code, and SOC code. We draw on our related research on UK job vacancies to obtain time series of vacancies by occupation. The cluster description column features the three most common words associated with that cluster. These vacancy cluster time series are very strongly correlated with the relevant vacancy time series using the official classifications. The table shows that the ‘bottom-up’ clusters designed by the machine learning algorithms can recreate the groupings observed in labour markets over many years by statisticians.

| Cluster description | Compared to: Standard Occupational Classification code | Compared to: Sector | Correlation with SOC | Correlation with SIC |
|---|---|---|---|---|
| School, Teacher, Education | Teaching and Educational Professionals (231) | Education | 0.940 | 0.873 |
| Chef, Restaurant, Food | Food Preparation and Hospitality Trades (543) | Accommodation and food service activities | 0.965 | 0.970 |
| Nurse, Home, Nursing | Nursing and Midwifery Professionals (223) | Human health and social work activities | 0.954 | 0.975 |

Table 1: The correlation between selected time series based on the groupings of jobs as designed by the machine learning algorithm, and those based on official classifications.
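The kind of comparison reported in Table 1 can be sketched as a Pearson correlation between two quarterly series. The numbers below are invented, the variable names are hypothetical, and we assume here that a plain Pearson coefficient is the relevant measure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical quarterly vacancy counts over two years.
cluster_series = [120, 135, 150, 160, 140, 155, 170, 180]   # a data-driven cluster
official_series = [300, 330, 380, 390, 350, 370, 420, 450]  # closest official grouping

r = pearson(cluster_series, official_series)
print(round(r, 3))
```

The two series can differ in level, as they do here, and still be strongly correlated; the coefficient only reflects how closely they move together over time.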

Using machine learning on the text of job descriptions is a promising way to better understand different types of demand for labour. We have shown that this methodology can complement the usual top-down classifications by informing their design, and can affirm them, both quantitatively and qualitatively. It is a powerful way to keep up with, and to understand, the structural changes which are occurring within the labour market.

Arthur Turrell works in the Bank’s Advanced Analytics Division, Bradley Speigner works in the Bank’s Structural Economic Analysis Division, James Thurgood works in the Bank’s Technology Division, Jyldyz Djumalieva formerly worked in the Bank’s Technology Division and David Copple works in the Bank’s External Monetary Policy Committee Unit Division.

If you want to get in touch, please leave a comment below.

Comments will only appear once approved by a moderator, and are only published where a full name is supplied.

Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.