*Sinem Hacioglu Hoke and Kerem Tuzcuoglu*

We economists want to have our cake and eat it. We have far more data series at our disposal now than ever before. But using all of them in regressions would lead to wild “over-fitting” – finding random correlations in the data rather than explaining the true underlying relationships. Researchers using large data sets have historically experienced this dilemma – you can either throw away some of the information and retain clean, interpretable models; or keep most of the information but lose interpretability. This trade-off is particularly frustrating in a policy environment where understanding the identified relationships is crucial. However, in a recent working paper we show how to sidestep this trade-off by estimating a factor model with intuitive results.

How can we use the information our large data sets contain without falling into this trade-off? Two common approaches are to throw away series that provide little information (“shrinkage methods”) or to summarise the information in a much smaller number of series that can safely be used in estimation (“factor models”). Factor models are an attractive way to reduce dimensionality, but they lead to another problem: the factors are mixtures of all the different series, which can make them very difficult to interpret economically. In our working paper, we propose a way to analyse the information content of factors by using a threshold factor-augmented vector autoregression (FAVAR) model. When an estimated coefficient (“factor loading”) falls below a threshold, suggesting that the factor adds little explanatory information for a particular series at a given time, we set (“shrink”) that loading to zero. (Other papers have acknowledged the interpretability problem and proposed other solutions, such as Belviso and Milani (2006) and Ludvigson and Ng (2009).)

We extend Nakajima and West (2013)’s threshold factor model, designed for sparsity modelling, by combining it with Bernanke, Boivin and Eliasz (2005)’s FAVAR model. Bernanke, Boivin and Eliasz (2005) combined factor models with vector autoregressions in order both to use large information sets and to explain the effects of monetary shocks on various macroeconomic indicators. By using a FAVAR model, we keep the number of parameters in the estimation small while still being able to trace macroeconomic shocks back to individual series. Combined with Nakajima and West (2013)’s threshold factor model, our model also shrinks the loadings on irrelevant factors to zero.
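For readers who prefer symbols, the two building blocks can be sketched in stylised notation. The notation here is ours for illustration and the working paper's exact specification may differ. Let $X_t$ be the vector of observed series, $F_t$ the latent factors and $Y_t$ any observed variables that enter the VAR directly. A FAVAR in the spirit of Bernanke, Boivin and Eliasz (2005), with time-varying factor loadings, is then

$$
X_t = \Lambda^{f}_t F_t + \Lambda^{y} Y_t + e_t, \qquad
\begin{bmatrix} F_t \\ Y_t \end{bmatrix} = \Phi(L) \begin{bmatrix} F_{t-1} \\ Y_{t-1} \end{bmatrix} + v_t .
$$

A latent threshold rule in the spirit of Nakajima and West (2013) sets the loading of series $i$ on factor $k$ to zero whenever an underlying latent coefficient $\beta_{ik,t}$ is small in absolute value:

$$
\lambda_{ik,t} = \beta_{ik,t}\,\mathbb{1}\!\left(|\beta_{ik,t}| \ge d_{ik}\right),
$$

where $d_{ik}$ is the estimated threshold. That the threshold is specific to each series-factor pair and constant over time is an assumption of this sketch.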

Our data consist of 158 quarterly US macroeconomic aggregates from 1964 to 2013. If we wanted to use all these series in a simple regression, we would face a serious over-fitting problem, i.e. the number of parameters to be estimated would exceed the number of observations. Looking closely at the data set reveals its subcategories: production (20 variables), employment (27 variables), housing (13 variables), interest rate (16 variables), inflation (29 variables), finance (13 variables), money (22 variables), expectations (7 variables), and credit (11 variables). In this environment, Belviso and Milani (2006)’s method would be to extract one factor from each of these subcategories. We instead estimate the model on the whole data set, without splitting up the information it contains. Our ultimate aim is to find a factor (or set of factors) that is related to each subcategory. Our results show that we can link the factors to particular economic activities, such as real activity, unemployment, etc., without any prior structure imposed on the data set, such as splitting the data into groups.

We estimate our threshold FAVAR model with Bayesian methods. Across the simulation draws, we record how often each factor loading is set to zero because it falls below the estimated threshold at a given time. We call this frequency the *shut-down rate*. We can think of it as tracking the time-varying importance of the factors. If a factor's loadings fall below the estimated threshold, that factor becomes irrelevant for the corresponding macroeconomic series at that time, and those loadings are shut down to zero for the corresponding time periods and variables. Otherwise, they survive. The shut-down rate therefore measures how relevant the factors are for particular series in our data set.
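As a rough illustration of how a shut-down rate could be computed from simulation output, here is a minimal Python sketch. It is not the code behind the working paper: the function name `shut_down_rate`, the array shapes and the random "draws" are all hypothetical, and we assume one threshold per series-factor pair in each draw.

```python
import numpy as np

def shut_down_rate(beta_draws, thresholds):
    """Share of draws in which each loading is set to zero.

    beta_draws : (n_draws, T, N, K) latent loadings per draw, period,
                 series and factor (hypothetical layout).
    thresholds : (n_draws, N, K) threshold per draw, series and factor.
    Returns an array of shape (T, N, K) with values between 0 and 1.
    """
    beta_draws = np.asarray(beta_draws, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # A loading is shut down when its absolute value is below the threshold;
    # the thresholds are broadcast over the time axis.
    shut = np.abs(beta_draws) < thresholds[:, None, :, :]
    # Averaging the 0/1 indicators over draws gives the shut-down frequency.
    return shut.mean(axis=0)

# Toy example with made-up numbers standing in for posterior draws.
rng = np.random.default_rng(0)
n_draws, T, N, K = 500, 8, 5, 2
beta = rng.normal(0.0, 1.0, size=(n_draws, T, N, K))
d = np.full((n_draws, N, K), 0.5)
rate = shut_down_rate(beta, d)
print(rate.shape)
```

Each entry of `rate` is the fraction of draws in which a given loading was zeroed out in a given period, so one minus that entry is the corresponding survival rate.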

We infer factors’ economic relevance by averaging their *survival rates* (one minus the shut-down rates) over time and simulations. We combine the results for the variables in the same data subgroup. To keep the illustration simple, we normalise them by the maximum survival rate over the factors. This normalisation ensures that each group is associated with at least one of the factors but not vice versa. We also impose a threshold survival rate of 75% to discard the small survival rates. (Interested readers can refer to the working paper for the table showing all the survival rates and for the justification of the number of factors in our analysis.) After these steps, we construct **Chart 1**.
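The aggregation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `group_survival`, the toy numbers and the group labels are hypothetical, and we assume that the 75% cut-off is applied after normalising by each group's maximum. The working paper may order or define these steps differently.

```python
import numpy as np

def group_survival(shut_rate, groups, cutoff=0.75):
    """Normalised group-level survival rates per factor.

    shut_rate : (T, N, K) shut-down rates per period, series and factor.
    groups    : length-N sequence of subgroup labels, one per series.
    cutoff    : normalised survival rates below this value are set to zero.
    Returns (labels, table) where table has shape (n_groups, K).
    """
    shut_rate = np.asarray(shut_rate, dtype=float)
    groups = np.asarray(groups)
    # Survival rate per series and factor, averaged over time.
    survival = 1.0 - shut_rate.mean(axis=0)
    labels = sorted(set(groups.tolist()))
    # Average the survival rates of the series within each subgroup.
    table = np.vstack([survival[groups == g].mean(axis=0) for g in labels])
    # Normalise each group by its largest survival rate across factors,
    # guarding against a group whose loadings never survive.
    row_max = table.max(axis=1, keepdims=True)
    normalised = np.divide(table, row_max, out=np.zeros_like(table), where=row_max > 0)
    # Discard weak links between groups and factors.
    normalised[normalised < cutoff] = 0.0
    return labels, normalised

# Toy example: three series, two factors, two periods with identical rates.
one_period = np.array([[0.1, 0.9], [0.3, 0.7], [0.8, 0.2]])
toy_shut_rate = np.stack([one_period, one_period])
toy_groups = ["employment", "employment", "inflation"]
labels, links = group_survival(toy_shut_rate, toy_groups)
print(labels)
print(links)
```

In the toy example the employment series load mainly on the first factor and the inflation series on the second, so only those two links remain once the cut-off is applied.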

**Chart 1**: Survival rates of the factor loadings for the subcategories of the data set

The upper half of the circular plot shows all seven factors in our analysis. The bottom half shows the data subgroups. The chords from the variables to the factors indicate the data subcategories that the factors are associated with. Note that the width of the variable bins shows how many variables are in each subgroup, in line with the data subcategories mentioned above. The width of the factor bins on top, however, carries no interpretation; it is simply proportional to the width of the chords.

Each data category feeds into one or more factors. But there are some cases where a factor predominantly carries information on a particular set of variables. For instance, the first factor is highly associated with the employment variables. The second factor is mostly associated with inflation variables alongside financial variables. Similarly, the sixth factor carries information only on money-related variables. For the other factors, the distinction isn’t that clear but we can still follow a similar interpretation. We can associate the fourth factor mainly with real activity; the fifth with expectations and employment; and the seventh with interest rates, money and inflation. We cannot relate the third factor to any of the data categories, nor the credit variables to any of the factors.

Our approach provides us with a high-level interpretation of what type of information the factors carry. Basically, it gives us an idea about which factors we need to pay attention to. If your specific analysis is about the drivers of inflation, this approach tells you to look only at factors 2 and 7. In a way, it is a shrinkage method that enables you to work with a smaller and more relevant set of factors. Especially when you are looking for informative factors among dozens, this approach can be your ‘go-to’ way of working only with the relevant ones.

*Kerem Tuzcuoglu is a PhD student at Columbia University. Sinem Hacioglu Hoke works in the Bank’s Stress Testing Strategy Division.*

*If you want to get in touch, please email us at **bankunderground@bankofengland.co.uk** or leave a comment below.*

*Comments will only appear once approved by a moderator, and are only published where a full name is supplied.*

*Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.*