Angelina Carvalho, Chiranjit Chakraborty and Georgia Latsi.
Policy makers have access to more and more detailed datasets. These can be joined together to give an unprecedentedly rich description of the economy. But the data are often noisy and individual entries are not uniquely identifiable. This leads to a trade-off: very strict matching criteria may result in a limited and biased sample; making them too loose risks inaccurate data. The problem gets worse when joining large datasets as the potential number of matches increases exponentially. Even with today’s astonishing computer power, we need efficient techniques. In this post we describe a bipartite matching algorithm on such big data to deal with these issues. Similar algorithms are often used in online dating, closely modelled as the stable marriage problem.
The home-mover problem
The housing market matters and affects almost everything that central banks care about. We want to know why, when and how people move home. And a lot do move: one in nine UK households in 2013/4 according to the Office for National Statistics (ONS). Fortunately, it is also a market that we have an increasing amount of information about. We are going to illustrate the use of the matching algorithm in the context of identifying the characteristics of these movers and the mortgages that many of them took out.
A Potential Solution
The FCA’s Product Sales Data (PSD) on owner-occupied mortgage lending contains loan level product, borrower and property characteristics for all loans originated in the UK since Q2 2005. This dataset captures the attributes of each loan at the point of origination but does not follow the borrowers afterwards. Hence, it does not meaningfully capture if a loan was transferred to another property or closed for certain reason. Also, there is no unique borrower identifier and that is why we cannot easily monitor if a borrower repaid their old mortgage and got a new one against another property.
However, the dataset identify whether a borrower is a first time buyer or a home-mover, together with other information. Even though we do not have information before 2005, we can still try to use this dataset to identify some of the owners’ moving patterns. We try to find from where a home-mover may have moved (origination point) and who moved in to his/her vacant property. If we can successfully track the movers, it will also help us to remove corresponding old mortgages to calculate the stock of mortgages from our flow data. A previous Bank Underground post showed how probabilistic record linkage techniques can be used to join related datasets that do not have unique common identifiers. We have used bipartite graph matching techniques here to extend those ideas.
Bipartite graph matching
The approach of modelling this as a graph problem is described below. We have a set of home-movers for whom we know the dates of birth. For each of them, we can find a list of potential previous borrowers or candidates with the same date of birth. We expect one of them is the same person who is now moving home. Each mover can have different preferences; some may want to move to another property in the same area, others may want to move to a larger house and they all face a budget constraint. Keeping this in mind, we add a probability/score corresponding to each of the candidates. Hence, for each home-mover, there are many such candidates with different probabilities, and these sets of candidates are obviously overlapping. So finding the best possible one-to-one match for all the home-movers is a very interesting computational puzzle.
A bipartite graph is a graphical depiction of linking two datasets. The graph represents the two datasets as two separate sets of vertices, say Y and Z. When a vertex in set Y is related to the vertex in set Z, they are linked with a line called an edge. In our case, suppose Y is the set of possible houses and Z is the set of home-movers. The linking of vertices from Y to Z uses very high level criteria so a vertex in Y can be linked with more than one vertex in Z. Finding the best possible one-to-one matching of vertices from Y to Z is known as Maximum Bipartite Matching. In our case, each edge is associated with some probability (or weight/score), so we aim to find the best quality weighted bipartite matching. There are several algorithms to solve this for both un-weighted and weighted cases. However, we have re-designed the algorithms to take care of large volumes of data in our dataset(s).
High-level description of the algorithm to track the movers
The steps of the algorithm are shown in the animations. We have first calculated the list of potential candidates for each home-mover matched by dates-of-birth. In the first animation, we represent all these potential candidate pairs by edges in a bipartite graph. Left hand side vertices (Y) are the old mortgage entries, where the right hand vertices are home-movers (Z) marked in our data. For each edge, we have allocated a score signifying how closely a corresponding candidate matches to the home-mover. For example, we expect the majority of people to find a newer home closer to their old home and not in a drastically different price compared to the old one. A score function is used to capture these practical issues including a constraint on the length of tenure as nobody buys new houses every day or month. A higher score means they are more probable to be the right candidate. In practice, we know that one person cannot move into two houses, and two persons with different mortgages cannot move into one house. Given that constrain, to find the best list of candidates, we need to find a maximum weighted bipartite matching as mentioned before. We have used a push-relabel algorithm to find this matching.
However, we may want to do some pre-filtering of the candidates in the left hand side column of the animation (Y). In this specific case, we may assume that the old house left by the home-mover is occupied by some other home buyer within a few days of the property becoming vacant. If we find some candidates (X) to move into those empty houses very soon after it becomes vacant, we are more confident that they have exchanged hands as we have found a valid moving chain. This idea is shown in our second animation. Vertices in the further left (X) are matched by post code to the Y vertices. Similarly, edges come with scores capturing matching quality. For every X vertex, we have then picked the edge with the highest score to pre-select candidates from Y vertices. Those Y vertices not selected by any of the X vertices, are deleted before we try to find our bipartite matching mentioned previously. One can obviously use different filtering criteria or use another bipartite matching (i.e. tri-partite matching with three sets). In the end, we get a list of mover chains with the highest overall score.
We have calculated what percentage of movers we can track successfully and presented them in a time series below. The percentage is higher in recent years, primarily because we have more historical information for recent movers. However, we see a sharp fall and then a big jump around 2008-09. It suggests that, during that period, a comparatively smaller percentage of home-movers moved house after staying in their old property for less than three years. This is not surprising as they bought houses in the pre-crisis period and hence were possibly in negative equity during the crisis. As soon as Bank rate fell to a historically low rate helping mortgage borrowers and house prices increased slightly, we see a sharp increase in home movements of those recent buyers. This behaviour is not unexpected but we now have quantitative evidence about how home-movers were affected during crisis periods, distributed by their demographics, types of loan and choices for their new houses. For example, we have found that the first time buyers and Londoners are affected most.
Good algorithms to solve a mathematical puzzle do not just clarify the nature of the relationship; they can help regulators make economies more understandable and stable as well.
Angelina Carvalho and Chiranjit Chakraborty work in the Bank’s Advanced Analytics Division. This post was written whilst Georgia Latsi worked in the Advanced Analytics Division.
If you want to get in touch, please email us at firstname.lastname@example.org. You are also welcome to leave a comment below. Comments are moderated and will not appear until they have been approved.
Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.