Change Point Detection (Part 1): Benchmarking Piketty’s Inflation Chart

Many of the time series charts that we see have equal time buckets. For example, a daily stock chart shows one data point per day: the stock’s closing price. More generally, an annual chart covering 2000 through 2010 would show 11 observations, each one year apart. You have seen hundreds of these types of charts. In Chapter 2 of Thomas Piketty’s book Capital in the Twenty-First Century, the researcher tries something different. He creates discrete, unequal time buckets to support a historical narrative.

This minor adaptation made it easier to convey a story, in turn helping readers to alter their perspective. It is easy for readers to identify, say, Germany’s hyperinflationary spiral after WWI or the Great Inflation of the 1970s. Ultimately, inflation plays a supporting role in the book’s wealth inequality narrative, so portraying inflation trends in these buckets helps the main points stick.

But I am uncomfortable defining these time buckets based on self-identified historical eras (for instance, the “WW era” or “the era central banks started targeting interest rates instead of the money supply”). It feels too subjective. I appreciate the art aspect of analytics and data science, but I appreciate the balance between the art and the math even more. In Piketty’s chart, what if the buckets are wrong? What if there should be more of them? Would it paint a different picture? Let’s see what a change point detection algorithm would define as the buckets…

# Change Point Detection

From Wikipedia: in statistical analysis, change point detection tries to identify times when the probability distribution of a time series changes. For this exercise, I used the R package called ‘changepoint’, specifically its cpt.meanvar() function, to identify changes in both the mean and the variance of the inflation data used by Piketty in his analysis.
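
To make "changes in both mean and variance" concrete, here is a minimal base-R sketch on simulated data. It is not how the changepoint package is implemented; it simply scores every candidate split with a Gaussian likelihood cost and keeps the best one. The toy series, the `seg_cost` helper, and the minimum segment length of 5 are all assumptions made for illustration.

```r
# Toy series: 60 points around 0 (sd 1), then 40 points around 3 (sd 2).
set.seed(42)
x <- c(rnorm(60, mean = 0, sd = 1), rnorm(40, mean = 3, sd = 2))

# Cost of one segment: twice the negative Gaussian log-likelihood,
# evaluated at the segment's own fitted mean and variance.
seg_cost <- function(v) {
  s2 <- max(mean((v - mean(v))^2), 1e-8)  # guard against zero variance
  length(v) * (log(2 * pi * s2) + 1)
}

# Try every split that leaves at least 5 points on each side.
splits <- 5:(length(x) - 5)
total_cost <- sapply(splits, function(k) {
  seg_cost(x[1:k]) + seg_cost(x[(k + 1):length(x)])
})

# The best split is the last index of the first segment.
best_split <- splits[which.min(total_cost)]

# How much the split improves on treating the series as one segment.
improvement <- seg_cost(x) - min(total_cost)
```

A change point is only declared when an improvement like this exceeds a penalty, which is where the parameter tuning discussed later comes in.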

# Precursor: The Data

I downloaded the data from Piketty’s book website. Since the chart was based on four countries, I focused on the UK, USA, Germany, and France’s data. Note that each of those spreadsheets is basically a book in itself. I aggregated the historical inflation data for each of those countries here. Since we are creating buckets for the “whole world” rather than a single country, I averaged the annual data across the four countries.

Outliers. There are always outliers. Germany’s inflation between 1913 and 1924 was out of control. I don’t know how Piketty treated these data points, but for those years I ignored Germany and took only the average inflation for the UK, USA, and France.

As for data prep, I also excluded years before 1760. The only data in the spreadsheets for those years was for the UK, and it was all 0%. I think we do know that monetary inflation was effectively 0% prior to the industrial revolution. This raises the question: why didn’t Piketty start his chart in 1760 rather than 1700? I don’t know, but I have a feeling his averages excluded all those 0’s in the calculations…
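
The prep steps described above could be sketched roughly as follows. The numbers are made up (they are not Piketty's data) and the country column names are hypothetical; only YEAR and AVERAGE match the names used in the code later in this post. The sketch also assumes that a year is averaged over whichever countries have a value.

```r
# Hypothetical toy table: one row per year, one inflation column per country (in %).
raw <- data.frame(
  YEAR    = c(1750, 1759, 1760, 1912, 1913, 1920, 1924, 1925),
  UK      = c(0, 0, 1.0, 2.0, 3.0, 15.0, 1.0, 0.5),
  USA     = c(NA, NA, NA, 2.5, 2.0, 16.0, 0.0, 2.5),
  GERMANY = c(NA, NA, NA, 3.0, 40.0, 1000.0, 5000.0, 8.0),
  FRANCE  = c(NA, NA, 2.0, 1.5, 4.0, 38.0, 14.0, 7.0)
)

# Drop the years before 1760.
prep <- raw[raw$YEAR >= 1760, ]

# Blank out Germany for 1913-1924 so those years use the other three countries.
mask <- prep$YEAR >= 1913 & prep$YEAR <= 1924
prep$GERMANY[mask] <- NA

# Average across the countries that have a value in each year.
prep$AVERAGE <- rowMeans(prep[, c("UK", "USA", "GERMANY", "FRANCE")], na.rm = TRUE)
inf_data <- prep[, c("YEAR", "AVERAGE")]
```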

# A Few Lines of Code

Implementing different change point detection algorithms is super simple with this package in R, so it’s worth sharing a few lines of code to showcase that simplicity. But, a fair warning: the challenge is in the art of tweaking the algorithm’s parameters to get change points that make sense in the context of the problem. For those very interested, you can review the code for this article here. For those uninterested in the code, you can simply skip down to the next section to see the results.

First, one of my favorite tricks: copy and paste into R. You can copy the selection from Google Sheets, run this line of code (on a Mac, where pbpaste is available) and… presto: data.

inf_data <- read.delim(pipe("pbpaste"))
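
Since pbpaste ships with macOS, the same trick needs a different clipboard source elsewhere. Below is a small sketch of one way to pick it. The helper name `clipboard_command` is made up for illustration, and the Linux branch assumes the xclip utility is installed.

```r
# Return the shell command that prints the clipboard on this system.
# Hypothetical helper: Darwin is macOS; the Linux branch assumes xclip.
clipboard_command <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
    Darwin = "pbpaste",
    Linux  = "xclip -o -selection clipboard",
    stop("no clipboard command sketched for ", sysname)
  )
}

# Usage (left commented out because it needs a populated clipboard):
# inf_data <- read.delim(pipe(clipboard_command()))
# On Windows, R can read the clipboard directly:
# inf_data <- read.delim("clipboard")
```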

Next important component, identifying the change points:

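library(changepoint)  # provides cpt.meanvar()

temp_cps <- cpt.meanvar(inf_data$AVERAGE, penalty="Manual",
                        pen.value="3*log(n)",  # manual penalty formula; n is the length of the data
                        method="BinSeg",       # binary segmentation search
                        Q=6, class=FALSE)      # search for at most 6 change points; skip the S4 cpt object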

This will output a vector of change points. The one issue - and the algorithm will tell you this when you run it - is that there are likely more than 6 change points in the data, because Q caps how many the search will return… but I wanted to stay consistent with what Piketty had. We are trying to balance narrative with detail.
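
To show how the penalty and Q interact, here is a simplified binary segmentation sketch in base R. It is an illustration of the general idea under a Gaussian mean-and-variance cost, not the changepoint package's actual implementation. The simulated series, the helper names, and the minimum segment length of 5 are assumptions.

```r
# Segment cost: twice the negative Gaussian log-likelihood at the fitted mean and variance.
seg_cost <- function(v) {
  s2 <- max(mean((v - mean(v))^2), 1e-8)  # guard against zero variance
  length(v) * (log(2 * pi * s2) + 1)
}

# Best single split of x[from:to], keeping at least `minseg` points on each side.
best_split <- function(x, from, to, minseg) {
  if (to - from + 1 < 2 * minseg) return(NULL)
  ks <- (from + minseg - 1):(to - minseg)
  whole <- seg_cost(x[from:to])
  gain <- sapply(ks, function(k) {
    whole - seg_cost(x[from:k]) - seg_cost(x[(k + 1):to])
  })
  list(k = ks[which.max(gain)], gain = max(gain))
}

# Binary segmentation: keep taking the split with the largest gain across all
# current segments; stop when no gain beats the penalty or Q splits are found.
binseg <- function(x, pen, Q, minseg = 5) {
  bounds <- c(0, length(x))  # segment i covers (bounds[i] + 1):bounds[i + 1]
  while (length(bounds) - 2 < Q) {
    cands <- lapply(seq_len(length(bounds) - 1), function(i) {
      best_split(x, bounds[i] + 1, bounds[i + 1], minseg)
    })
    cands <- Filter(Negate(is.null), cands)
    if (length(cands) == 0) break
    top <- cands[[which.max(sapply(cands, function(s) s$gain))]]
    if (top$gain <= pen) break
    bounds <- sort(c(bounds, top$k))
  }
  bounds[-c(1, length(bounds))]  # drop the two outer boundaries
}

# Simulated series with true changes after positions 50 and 100.
set.seed(1)
x <- c(rnorm(50, 0, 1), rnorm(50, 5, 1), rnorm(50, 0, 3))
pen <- 3 * log(length(x))

cps_q6   <- binseg(x, pen, Q = 6)        # room for up to 6 change points
cps_q1   <- binseg(x, pen, Q = 1)        # capped at 1: the search stops early
cps_none <- binseg(x, pen = 1e6, Q = 6)  # a huge penalty rejects every split
```

With Q = 1 the search stops after the first split even though a second strong change is still present, which is the same situation the package warns about when the number of change points found equals Q.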

The only other piece of code worth sharing here is the loop to turn those change points into years:

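new_cps <- c(1761)  # first entry: 1761, ahead of any detected change point
for (i in temp_cps) {
  hold <- as.numeric(inf_data$YEAR[1]) + as.numeric(i)  # first year in the data plus the change point's position
  new_cps <- c(new_cps, hold)
}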
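
The same arithmetic can be done in one vectorized step. The toy values below are hypothetical, and the sketch assumes the years are consecutive and that each index marks the last observation of a segment, so that adding it to the first year gives the year the next bucket starts.

```r
# Hypothetical toy values: ten consecutive years and two change-point indices.
years <- 1761:1770
temp_cps <- c(3, 7)  # positions in the series, not years

# Same result as the loop: first year, then first year plus each index.
new_cps <- c(years[1], years[1] + temp_cps)

# Under the assumption above, the year at each index closes a bucket
# and the following year opens the next one.
last_year_of_bucket  <- years[temp_cps]
first_year_of_bucket <- years[temp_cps + 1]
```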

The rest of it is manipulating output to create visualizations. While the above seems overly simple, I did a lot of iterating with different data wrangling and model parameters to get what seemed reasonable in the context of the problem. Perhaps this is too subjective, but data science isn’t all math… there’s still that element of art.

# Results

So, how close does the change point detection algorithm get to identifying Piketty’s buckets representing eras in our inflation data? Pretty darn close. 

Tweaking the parameters of the changepoint algorithm, we can come very close to matching Piketty's discrete-time buckets.

The only major difference is the 2nd changepoint: 1810 vs. 1793. Otherwise, it’s pretty close. We can visualize it another way - by looking at the average inflation rate in each bucket (I know, I’m using averages of averages… but I’m going to roll with it for now).
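
A per-bucket average like this can be computed in a few lines. The sketch below uses a made-up ten-year series and made-up bucket start years, and it assumes each entry of new_cps is the first year of a bucket.

```r
# Hypothetical toy input: a short yearly series and three bucket start years.
inf_data <- data.frame(YEAR = 1901:1910,
                       AVERAGE = c(1, 1, 1, 5, 5, 5, 5, 2, 2, 2))
new_cps <- c(1901, 1904, 1908)

# Close the last bucket one year past the end of the data.
breaks <- c(new_cps, max(inf_data$YEAR) + 1)

# Label every year with its bucket (start year included, end year excluded),
# then average the inflation values within each bucket.
bucket <- cut(inf_data$YEAR, breaks = breaks, right = FALSE, dig.lab = 4)
bucket_means <- tapply(inf_data$AVERAGE, bucket, mean)
```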

# Conclusion

To an extent, this is still too subjective. Tweaking the penalty parameter in the changepoint algorithm gives you a lot of liberty to create more or fewer changepoints. In fact, the documentation for the changepoint package alludes to this causality dilemma: “The choice of appropriate penalty is still an open question and typically depends on many factors including the size of the changes and the length of segments, both of which are unknown prior to analysis.” To me, this reinforces that you need to go into the analysis with a strong baseline understanding of the problem and the data you are working with. In a second post on multiple change point analysis, I'll demo how that context applies in an operations analytics example.

As mentioned above, try this out yourself with the gist here.