Build-your-own fancharts in R

Andrew Blake

Central banks the world over calculate and plot forecast fancharts as a way of illustrating uncertainty. Explaining the details of how this is done in a single blog post is a big ask, but leveraging free software tools means showing how to go about it isn’t. Each necessary step (getting data, building a model, forecasting with it, creating a fanchart) is shown as R code. In this post, a simple data-coherent model (a vector auto-regression or VAR) is used to forecast US GDP growth and inflation and the resulting fanchart plotted, all in a few self-contained chunks of code.

Why use R? Built for statisticians, it’s becoming increasingly popular with economists. R has a number of obvious advantages: it is free to use, it is well supported by a highly engaged user community, can produce stellar graphics, and allows you to disseminate your work through easy-to-build interactive apps which can be publicised through your own blog. However, R is famously hard to get into and for many – me included – making good use of the tools that comprise the tidyverse can make using it a lot easier. This set of highly-integrated packages is designed as a seamless way to manipulate and visualise data. Let’s build some fancharts.

The code

First load some libraries:

library(tidyverse)
library(quantmod)   # For data access

 

Data

US data is easy to get. Below, the getSymbols command from the library quantmod is used to download quarterly values of year-on-year GDP growth and the CPI index (as of 5 November, 2019) from the US Fed’s data repository FRED.

series = c('A191RO1Q156NBEA', 'CPALTT01USQ661S') # FRED codes for US GDP growth and CPI
Growth = getSymbols(series[1], src='FRED', auto.assign=FALSE)
CPI    = getSymbols(series[2], src='FRED', auto.assign=FALSE)

The next bit of code stores the data series in Data along with the date, calculates the annual inflation rate and then keeps only what’s necessary.

Data = inner_join(tibble(Date=time(Growth), Growth=coredata(Growth)), 
                   tibble(Date=time(CPI), CPI=coredata(CPI)), by=c("Date")) %>% 
  mutate(Inflation=100*(CPI/lag(CPI,4)-1)) %>%
  select(Date, Growth, Inflation) %>% 
  drop_na() # Drop missing obs to balance dataset

All good modellers inspect their data, so plot the series using ggplot2, a key part of the tidyverse.

centre_colour = c("seagreen","tomato") # Colours for time series/centre of fancharts
tail_colour   = "gray95"               # Colour for the tails, used later but defined here 
pivot_longer(Data, cols=-Date, names_to="Variables", values_to="Values") %>% 
  ggplot() + 
  geom_line(aes(x=Date, y=Values, group=Variables, colour=Variables), size=1.1, show.legend=TRUE) +
  scale_colour_manual(values=centre_colour) +
  theme_minimal() + 
  theme(legend.title = element_blank()) +
  labs(title="US GDP growth and CPI inflation", x="", y="",
       caption=paste0("Source: FRED series ", paste(series, collapse=", ")))

Chart 1: US GDP growth and CPI inflation

These look fine. Let’s build a model.

Model

Excellent packages exist to estimate VARs (such as vars) but the point is to do it from scratch. Algebraically a VAR with m lags is:

Y_t = \beta_0 +\sum_{i=1}^{m} \beta_i Y_{t-i} + \varepsilon_t

where Y_t  is a vector of growth and inflation in each period. Clever use of the tidyverse creates (and names) the required lags of each variable in a similar fashion to Answer 3 to this question on stackoverflow and a constant.

m     = 4  # maximum lag in VAR
Datal = Data %>%
   pivot_longer(cols=-Date, names_to="Names", values_to="Values") %>%
   mutate(lag_value=list(0:m)) %>%
   unnest(cols=lag_value) %>%
   group_by(Names, lag_value) %>%
   mutate(Values=lag(Values, unique(lag_value))) %>%
   ungroup() %>%
   mutate(Names = if_else(lag_value==0, Names,              # No suffix at lag 0
   paste0(Names, "_", str_pad(lag_value, 2, pad="0")))) %>% # All other lags 
   select(-lag_value) %>%      # Drop the redundant lag index
   pivot_wider(names_from=Names, values_from=Values) %>%
   slice(-c(1:m)) %>%          # Remove missing lagged initial values
   mutate(constant = 1)           # Add column of ones at end

Now select the lagged values (those with a suffix) and constant as explanatory variables and the rest (except for the date) as dependent ones using a regular expression match. These are put in the matrices X and Y respectively.

s = paste(paste0(str_pad(1:m, 2, pad="0"), "$"), collapse="|")
X = data.matrix(select(Datal, matches(paste0(s,"|constant"))))
Y = data.matrix(select(Datal, -matches(paste0(s,"|constant|Date"))))

The VAR is easy to estimate by solving for the unknown \beta’s using:

(bhat = solve(crossprod(X), crossprod(X,Y)))
##               Growth      Inflation
## Growth_01     1.11860225  0.113957757
## Growth_02    -0.17202036 -0.094868237
## Growth_03    -0.07633554  0.006027886
## Growth_04    -0.06501444  0.052454614
## Inflation_01 -0.22304645  1.273950043
## Inflation_02  0.17785249 -0.327281242
## Inflation_03 -0.08725688  0.122644763
## Inflation_04  0.09025376 -0.110628806
## constant      0.74673184 -0.077816987

A nice feature of calculating bhat this way is that it automatically labels the output for ready interpretation. An econometrician would spend some time evaluating the statistical model, but let’s just press ahead.

Forecast

Simulating the model to calculate the forecasts and the forecast error variances is done in a loop. A first-order representation of the VAR works best, with the small complication that the parameters need to be re-ordered.

nv    = ncol(Y) # Number of variables
nf    = 12      # Periods to forecast
nb    = 16      # Periods of back data to plot, used later

v     = crossprod(Y - X %*% bhat)/(nrow(Y)-m*nv-1)            # Calculate error variance
bhat2 = bhat[rep(seq(1,m*nv,m),m) + rep(seq(0,m-1), each=nv),] # Reorder for simulation
A     = rbind(t(bhat2), diag(1,nv*(m-1), nv*m))                # First order form - A 
B     = diag(1,nv*m,nv)                                        # First order form - B
cnst  = c(t(tail(bhat,1)), rep(0,nv*(m-1)))                    # First order constants
# Simulation loop
Yf     = matrix(0,nv*m,nf+1)                # Stores forecasts
Yf[,1] = c(t(tail(Y,m)[m:1,]))              # Lagged data
Pf     = matrix(0,nv,nf+1)                  # Stores variances
P      = matrix(0,nv*m,nv*m)                # First period state covariance

for (k in 1:nf) { 
  Yf[,k+1] = cnst + A %*% Yf[,k]
  P        = A %*% P %*% t(A) + B %*% v %*% t(B)
  Pf[,k+1] = diag(P)[1:nv]
}

At the end Yf contains the forecast levels of each variable and Pf the forecast standard errors.

Fancharts

There are packages to plot fancharts too. The fanplot package actually has Bank of England fancharts built in but not in the tidyverse, although for the tidy-minded there is ggfan. But it isn’t hard to do it from scratch.

Each forecast fanchart is built up of five shaded areas, with the darkest shade representing an area expected to contain the outcome 30% of the time. Two lighter adjacent areas are the next 15% probability bands below and above the central area, and then another 15% probability bands outside these are shaded lighter still. The edges of these bands are forecast quantiles, evaluated using the values below. Starting at the bottom, each selected forecast quantile is a lower edge of a polygon and next higher quantile the upper edge. The upper coordinates need to be reversed so the perimeter lines join to make the right side of the polygon. Creating this series for each polygon and each variable is done in the code segment below in the curly-bracketed bit {bind_rows(…)}. And as everything in a single data frame is convenient, a last step binds in the historical data.

qu     = c(.05,.2,.35,.65,.8,.95)  # Chosen quantiles ensures 30% of the distribution each colour
nq     = length(qu)
fdates = seq.Date(tail(Data$Date,1), by="quarter", length.out=nf+1) # Forecast dates

forecast_data = tibble(Date     = rep(fdates, 2), 
                       Variable = rep(colnames(Data)[-1], each=(nf+1)), 
                       Forecast = c(t(Yf[1:nv,])),
                       Variance = c(t(sqrt(Pf)))) %>% 
  bind_cols(map(qu, qnorm, .$Forecast, .$Variance)) %>%         # Calculate quantiles
  select(-c("Forecast", "Variance")) %>% 
  {bind_rows(select(., -(nq+2)),                                # Drop last quantile 
             select(., -3) %>%                                  # Drop first quantile
               arrange(Variable, desc(Date)) %>%                # Reverse order
               rename_at(-(1:2), ~paste0("V",1:(nq-1))) )} %>%  # Shift names of reversed ones 
  pivot_longer(cols=-c(Date, Variable), names_to="Area", values_to="Coordinates") %>% 
  unite(VarArea, Variable, Area, remove=FALSE) %>%              # Create variable to index polygons
  bind_rows(pivot_longer(tail(Data,nb), cols = -Date, names_to="Variable", values_to="Backdata"), .)

That’s pretty much it. Shaded rectangles made using geom_rect indicate the forecast region, the filled polygons plotted using geom_polygon define the different bands and historical data is added using geom_line. A bit of formatting, apply facet_wrap() and we’re done.

# Band colours 'ramp' from the centre to the tail colour 
band_colours = colorRampPalette(c(rbind(tail_colour, centre_colour), tail_colour), 
                                space="Lab")(nv*nq+1)[-seq(1, nv*nq+1, nq)]

ggplot(forecast_data) + 
  geom_rect(aes(xmin=Date[nv*nb], xmax=max(Date), ymin=-Inf, ymax=Inf), fill=tail_colour, alpha=.2) +  
  geom_polygon(aes(x=Date, y=Coordinates, group=VarArea, fill=VarArea)) +
  scale_fill_manual(values=band_colours) +
  geom_line(aes(x=Date, y=Backdata, group=Variable, colour=Variable)) +
  scale_colour_manual(values=centre_colour) +
  scale_x_date(expand=c(0,0)) +
  theme_minimal() +
  theme(legend.position="none") +
  facet_wrap(~ Variable, ncol=1) +
  labs(title="Forecasts of US GDP growth and CPI inflation", 
       subtitle=paste("Quarterly data, annual rates of change, VAR with", m, "lags"), 
       caption=paste("Source: FRED series", paste(series, collapse=", ")), x="", y="")

Chart 2: Forecasts of US GDP growth and CPI inflation

Extensions

These look great but customisation is easy – why not see how BBC style graphics look? (Answer: they look better without the shaded rectangles.) Try experimenting with the model too – see what effect changing the maximum lag has or maybe add another variable. And use different data. The ONS makes UK data available for download in JSON format. A bit more code is needed than with FRED – this excellent blog post will get you up and running in no time. This blog post will do the same using the DBnomics data portal which gives access to a ton of data form all over the world.

Central bank forecasts are based on considerably more sophisticated models than the one used here, with much more data and above all incorporating expert judgment. But if this post has whetted your appetite, the code chunks should be enough to get you started. So get copying, pasting, running! (Or just download the code here.) Oh, and this post was completely written in R Markdown. Because you can.

Andrew Blake works in the Centre for Central Banking Studies.

If you want to get in touch, please email us at bankunderground@bankofengland.co.uk or leave a comment below.

Comments will only appear once approved by a moderator, and are only published where a full name is supplied. Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.

5 thoughts on “Build-your-own fancharts in R

  1. This is great. Thanks for sharing the code and for the links which are very helpful. I happen to believe R is the next big thing in quantitative economics (if it isn’t already there). Showing what is possible with examples like this only helps to “democratise” quantitative analysis by encouraging the switch to free software platforms.

  2. Great blog post. I also concur that there is a bot of code missing before using the dplyr code to sort the data.

  3. Again let me thank Andrew for an excellent post. Just a few suggestions given some problems I ran into in executing the code. Firstly there are conflicts with dplyr functions, MASS and plyr packages which cause the code to halt and throw errors in some places. To get around these problems just reference dplyr package when these errors come up. Specifically these errors are related to select and mutate which conflicts with some other package functions.

  4. Thanks a lot for this contribution. In my case I have been able to do the same nevertheless I have chosen Australia.

Comments are closed.