Break Detection on Borussia Dortmund (BVB) stock prices - Part II: The optimal h-parameter of BFAST
In this blog post I use stock prices of the Borussia Dortmund GmbH & Co. KGaA sports club to detect and characterize abrupt changes within the trend component of the time series. The main objective is to find the optimal segmentation parameter that best characterizes the timing and magnitude of abrupt changes.
Furthermore, I point to possible reasons behind the abruptly changing prices by means of game results of the Ballspielverein Borussia 09 e.V. Dortmund, commonly known as Borussia Dortmund or BVB, player transfers, or the outcomes of FIFA World Cups.
The blog post is quite long, so I decided to divide it thematically (You are reading Part II).
BFAST and the search for the optimal h-parameter
Well, we know almost every BVB game result and the most important transfers of the last 18 years, so we can now focus on the search for abrupt changes in the time series of stock prices. For this I use the BFAST R package with the function of the same name, bfast().
Usage
### R
library(bfast)
bfast(Yt, h = ??, season = c("dummy", "harmonic", "none"),
max.iter = NULL, breaks = NULL, hpc = "none",
level = 0.05, type = "OLS-MOSUM")
Short Description
BFAST performs an iterative break detection in the seasonal and trend components of a time series. It combines the iterative decomposition of a time series into trend, seasonal and remainder components with significant break detection in the decomposed components of the time series (source: R Documentation).
Arguments
I apply the above function and change two of its arguments.
- h - Minimal segment size between potentially detected breaks in the trend model, given as a fraction relative to the sample size (e.g. 0.5).
- season - The seasonal model used to fit the seasonal component and detect seasonal breaks (i.e. significant phenological change). There are three options: dummy, harmonic, or none. If there is no seasonal cycle, none can be selected to avoid fitting a seasonal model.
The Workflow
I will go through the workflow step by step.
### R
library(zoo)
library(xts)
library(lubridate)
library(dplyr)
library(tibble)
library(bfast)
library(patchwork)
library(ggplot2)
library(ggiraph)
library(kableExtra)
Create Daily Time Series
Take the daily BVB stock price data.frame and turn it into a daily time series ts.
### R
load("finance_all_data.Rda")
finance_all_data <- finance_all_data %>%
dplyr::select(date, PX_LAST, weekdays) %>% # select important columns
mutate(year = lubridate::year(date)) # add column: year
ts.bvb <- ts(data = finance_all_data$PX_LAST,
start = c(min(finance_all_data$year), 60), # year 2001, day 60
end = c(max(finance_all_data$year), 138), # year 2018, day 138
frequency = 365)
# Replace NA by interpolation from the weekends and public holidays
ts.bvb <- zoo::na.approx(ts.bvb)
tibble::glimpse(ts.bvb)
## Time-Series [1:6284] from 2001 to 2018: 8.16 8.2 8.2 8.2 8.2 ...
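A quick sanity check on the interpolated series cannot hurt. The following optional lines (my addition, plain base R) confirm the frequency and that the interpolation removed all missing values:
### R
frequency(ts.bvb)  # 365 observations per year
sum(is.na(ts.bvb)) # 0 -> interpolation filled all weekend/holiday gaps
range(ts.bvb)      # lowest and highest stock price (Euro)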
Convert Daily To Weekly
Take the daily BVB stock price time series and turn it into a weekly time series (BFAST on daily data takes up a lot of processing time, so I am going to work here with weekly data).
At this point I can use the aggregate.daily.to.weekly function from my last blog post. The result is a time series with a length of 896 weeks.
### R
aggregate.daily.to.weekly <- function(ts.daily) {
library(xts)
library(lubridate)
dates <- as.Date(date_decimal(as.numeric(time(ts.daily))))
xts.daily <- xts(ts.daily, order.by = dates)
xts.weekly <- round(xts::apply.weekly(xts.daily, mean),2) # xts
ts.weekly <- ts(data = xts.weekly,
# define the start and end (Year, Week)
start = c(as.numeric(format(start(xts.weekly), "%Y")),
as.numeric(format(start(xts.weekly), "%W"))),
end = c(as.numeric(format(end(xts.weekly), "%Y")),
as.numeric(format(end(xts.weekly), "%W"))),
frequency = 52)
return(ts.weekly)
}
ts.bvb.w <- aggregate.daily.to.weekly(ts.bvb)
length(ts.bvb.w) # 896 weeks
## [1] 896
plot(ts.bvb.w, main = "Borussia Dortmund weekly aggregated stock prices",
ylab = "Stock Price (Euro)")
Question: Which is the most ideal h-parameter?
We now have a valid weekly time series and would be ready to fire up the bfast() function. At this point we have to decide which h-parameter is the most ideal for our dataset. Let's start by figuring out the biggest and the smallest h-parameter possible for our dataset, and how many options there actually are.
The biggest / smallest h-parameter
The biggest possible h-parameter is 0.5: the minimal segment size around the one potentially detected break is then 448 weeks (448 weeks / 896 weeks = 0.5). We potentially get 2 trend segments and 1 breakpoint.
The smallest possible h-parameter is 0.001116071: the minimal segment size between the up to 895 potentially detected breaks is then 1 week (1 week / 896 weeks = 0.001116071).
The most ideal h-parameter is somewhere between 0.5 and 0.001116071.
In total there are 448 different h-parameter value options for our BVB weekly time series, from 448/896 = 0.5 down to 1/896 = 0.001116071.
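For illustration, the full set of candidate h values can be enumerated in one line (a small sketch, nothing more):
### R
h.options <- (448:1) / 896 # minimal segment sizes from 448 weeks down to 1 week
length(h.options)          # 448 possible h-parameter values
range(h.options)           # 0.001116071 ... 0.5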
Call bfast() with wrong arguments
What happens when we call bfast() with incorrect arguments? Two examples.
First, I use a wrong h-parameter (too big). We know the maximum is 0.5; nevertheless I try 0.9 and get the following error:
bfast(ts.bvb.w, h = 0.9, season = "none", max.iter = 1)
## [1] "No seasonal model will be fitted!"
## Error in breakpoints.formula(Vt ~ ti, h = h, breaks = breaks, hpc = hpc):
## minimum segment size must be smaller than half the number of observations
What about a different seasonal model?
bfast(ts.bvb.w, h = 0.5, season = "dummy", max.iter = 1)
## Error in stl(Yt, "periodic"): only univariate series are allowed
bfast(ts.bvb.w, h = 0.5, season = "harmonic", max.iter = 1)
## Error in stl(Yt, "periodic"): only univariate series are allowed
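A likely cause of this error - my assumption, not something the bfast documentation states: the weekly series was built from an xts object and therefore still carries a matrix dimension, which stl() rejects as non-univariate. A sketch of a possible workaround, in case a seasonal model is actually wanted:
### R
# Drop the matrix dimension so stl() treats the series as univariate
ts.bvb.w.uni <- ts(as.numeric(ts.bvb.w),
                   start = start(ts.bvb.w),
                   frequency = 52)
# This call should now get past the stl() error (not run here, slow):
# bfast(ts.bvb.w.uni, h = 0.5, season = "harmonic", max.iter = 1)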
Loop over: Apply bfast() with different h-parameters
I run bfast() 45 times, each time with a different h value. I start with 448/896 = 0.5 (448 weeks is the minimal number of observations allowed in each segment, 896 weeks is the total length of the time series).
The second h value I use is 440/896 = 0.4910714. From there I decrease the minimal segment size in steps of 10 weeks, i.e. 430/896 = 0.4799107, and so on, until I reach 10 weeks (10/896 = 0.01116071).
The result is a list of the class "bfast". From this list I extract
a) the time of the abrupt changes (breakpoints),
b) the break magnitude and
c) the processing time.
### R
df <- data.frame()
x <- seq(440, 10, by = -10)
y <- x/length(ts.bvb.w) # 896 weeks
y <- append(y, 0.5, 0) # add 448 / 896 = 0.5
for (i in y) {
start_time <- Sys.time()
fit <- bfast::bfast(ts.bvb.w, h = i, season = "none", max.iter = 1)
end_time <- Sys.time()
brk_pts <- fit$output[[1]]$ci.Vt$confint # time of breakpoint
brk_pts.df <- as.data.frame(brk_pts)
# magnitude of the biggest change detected in the trend component
brk_pts.df$max_mag <- fit$Magnitude
brk_pts.df$mag <- NA # here goes the magnitude of each breakpoint
# the fitted trend component
Tt <- fit$output[[1]]$Tt
for (xx in 1:nrow(brk_pts.df)) {
break_week <- brk_pts[xx, 2]
# read the magnitude from the fitted trend component
brk_pts.df[xx, 5] <- Tt[[(break_week + 1)]] - Tt[[break_week]]
}
brk_pts.df$h <- i
brk_pts.df$speed <- end_time - start_time # get the processing time
df <- rbind(df, brk_pts.df)
}
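One caveat about the recorded speed: end_time - start_time returns a difftime whose units are chosen automatically, so very slow runs could end up reported in minutes while fast runs are reported in seconds. A safer variant (a sketch, with Sys.sleep() standing in for the bfast() call) fixes the units explicitly:
### R
start_time <- Sys.time()
Sys.sleep(1) # stand-in for the bfast() call
end_time <- Sys.time()
difftime(end_time, start_time, units = "secs") # always reported in seconds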
Plots with sample results for bfast with h = 0.5, 0.402, 0.301, 0.201, 0.1, 0.056 and 0.011 are shown below.
### R
tibble::glimpse(df)
## Observations: 203
## Variables: 7
## $ `2.5 %`     <dbl> 447, 439, 429, 419, 409, 406, 406, 406, 406, 406, ...
## $ breakpoints <dbl> 448, 440, 430, 420, 410, 407, 407, 407, 407, 407, ...
## $ `97.5 %`    <dbl> 449, 441, 431, 421, 411, 408, 408, 408, 408, 408, ...
## $ max_mag     <dbl> 0.661649797, 0.532089150, 0.353913696, 0.192414896...
## $ mag         <dbl> 0.661649797, 0.532089150, 0.353913696, 0.192414896...
## $ h           <dbl> 0.5000000, 0.4910714, 0.4799107, 0.4687500, 0.4575...
## $ speed       <time> 10.80762 secs, 10.77663 secs, 10.61063 secs, 10.7...
We get the following results from the for-loop.
- 2.5 % - confidence interval of abrupt change (week)
- breakpoints - abrupt change (week)
- 97.5 % - confidence interval of abrupt change (week)
- max_mag - largest jump in the trend component (Euro)
- mag - jump in the trend component (Euro)
- h - used h parameter (fraction relative to the sample size)
- speed - processing time (seconds)
Add extra information for plots
### R
df <- df %>%
  dplyr::group_by(h) %>%
  mutate(n = n()) %>%                        # number of observations per h-group
  mutate(id = row_number()) %>%              # add row number per h-group
  mutate(mag = round(mag, 5)) %>%            # round the magnitude
  mutate(absolute_mag = abs(mag)) %>%        # absolute values of the magnitude
  mutate(max_mag = round(max_mag, 5)) %>%    # round the maximum magnitude
  mutate(speed = round(speed, 4)) %>%        # round the processing time
  mutate(mag_mean = mean(absolute_mag)) %>%  # mean absolute magnitude of all breakpoints per h-group
  mutate(mag_median = median(absolute_mag))  # median absolute magnitude of all breakpoints per h-group
# check at which breakpoint the biggest magnitude change occurred
df$is_max_mag <- df$max_mag == df$mag # TRUE / FALSE (max_mag vs. mag)
glimpse(df)
## Observations: 203
## Variables: 13
## $ `2.5 %`      <dbl> 447, 439, 429, 419, 409, 406, 406, 406, 406, 406,...
## $ breakpoints  <dbl> 448, 440, 430, 420, 410, 407, 407, 407, 407, 407,...
## $ `97.5 %`     <dbl> 449, 441, 431, 421, 411, 408, 408, 408, 408, 408,...
## $ max_mag      <dbl> 0.66165, 0.53209, 0.35391, 0.19241, 0.04382, 0.00...
## $ mag          <dbl> 0.66165, 0.53209, 0.35391, 0.19241, 0.04382, 0.00...
## $ h            <dbl> 0.5000000, 0.4910714, 0.4799107, 0.4687500, 0.457...
## $ speed        <time> 10.8076 secs, 10.7766 secs, 10.6106 secs, 10.774...
## $ n            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2...
## $ id           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ absolute_mag <dbl> 0.66165, 0.53209, 0.35391, 0.19241, 0.04382, 0.00...
## $ mag_mean     <dbl> 0.661650, 0.532090, 0.353910, 0.192410, 0.043820,...
## $ mag_median   <dbl> 0.661650, 0.532090, 0.353910, 0.192410, 0.043820,...
## $ is_max_mag   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
The Results
The following plot shows all breakpoints, their occurrence in time, and the corresponding h-parameters. The highlighted points (red) mark the h-parameters from the previous detailed plots.
The plot resembles a pyramid: a few breakpoints at high h-values and many breakpoints at low h-values. The h-value 0.01116071 (10/896) has 31 breakpoints. Interestingly, it takes 23 weeks before the first breakpoint appears. The last breakpoint was registered in week 876, exactly 20 weeks before the end of the time series.
The following plot shows the same points, but the magnitude of the change detected is now highlighted in spectral colours. Black-edged points indicate the point in time when the magnitude of the biggest change was detected in the trend component.
Find all segments with the minimum size
The bfast plots way above show the trend components in blue. These trend components have different lengths, according to their connected breakpoints. They also have a theoretical minimal length. Here is an example of what I mean.
bfast with an h-parameter of 0.5 can theoretically create two segments with a length of 448 weeks each (896 weeks * 0.5 = 448 weeks, or 896 weeks / 2 segments = 448 weeks per segment). bfast with an h-parameter of 0.3013393 (270/896) can theoretically create segments with a minimum segment size of 270 weeks (0.3013393 * 896 = 270 weeks). This means there can be at most 2 breakpoints and 3 trend components (segments). A segment can always be longer than the minimum, but never shorter. Applied to the BVB time series, the h-parameter of 0.3013393 produced 2 breakpoints and three trend segments with lengths of 270 weeks (minimum) + 270 weeks (minimum) + 356 weeks = 896 weeks.
The idea is now to find all segments which have the theoretical minimum segment size.
To do so, I created for each h-value all theoretically possible minimum segment sizes and compared them against the real segment sizes.
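The full comparison code is not shown here; the core idea, reduced to a single h-group, looks roughly like this (segment.check is a hypothetical helper of mine, and the breakpoint weeks are chosen to reproduce the 270 + 270 + 356 example above):
### R
segment.check <- function(breakpoints, h, n.weeks = 896) {
  min.size <- round(h * n.weeks)                    # theoretical minimum segment size
  seg.len <- diff(c(0, sort(breakpoints), n.weeks)) # real segment lengths
  data.frame(segment    = seq_along(seg.len),
             length     = seg.len,
             at.minimum = seg.len == min.size)
}
segment.check(breakpoints = c(270, 540), h = 270/896)
# -> three segments of length 270, 270, 356; the first two sit at the minimum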
The result is shown in the following plot, where I also added the 1:1 line of the minimal segment size. This means the first segment of the trend component cannot lie to the left of (i.e. be shorter than) this line.
## Observations: 10,558
## Variables: 6
## $ h           <dbl> 0.01116071, 0.01116071, 0.01116071, 0.01116071, 0....
## $ breakpoints <dbl> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23...
## $ n           <int> 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31...
## $ id          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ values      <dbl> 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,...
## $ direction   <chr> "forward", "forward", "forward", "forward", "forwa...
Selection Criteria
Sorry for the lengthy exercise and the many details of the BFAST results. While writing this blog post I finally settled on three criteria for the selection of the optimal h-parameter.
The optimal h parameter, for the BVB dataset, should
- recognise more than 2 breakpoints,
- not contain any segment of exactly the theoretical minimum segment size,
- have a high median absolute magnitude.
The following three plots show the sequential h-value filtering on the basis of the criteria mentioned above.
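In code, the sequential filtering could look something like this (a sketch against the df built above; the minimum-segment flag is hypothetical, since that check lives in the plot data, not in df):
### R
candidates <- df %>%
  dplyr::filter(n > 2) %>%                  # criterion 1: more than 2 breakpoints
  # dplyr::filter(!has.min.segment) %>%     # criterion 2: hypothetical flag, see the plot data
  dplyr::arrange(dplyr::desc(mag_median))   # criterion 3: high median absolute magnitude first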
2.5 % | breakpoints | 97.5 % | mag | date |
---|---|---|---|---|
114 | 115 | 116 | 1.22994 | 2003-05-08 |
244 | 248 | 265 | 0.43795 | 2005-11-24 |
378 | 397 | 399 | -0.43969 | 2008-10-02 |
505 | 506 | 507 | 1.61659 | 2010-11-04 |
589 | 590 | 591 | 0.14791 | 2012-06-14 |
717 | 718 | 723 | -0.88305 | 2014-11-27 |
813 | 814 | 816 | 1.42852 | 2016-09-29 |
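The date column in the table is not produced by the for-loop itself; one way to translate a breakpoint's week index into a calendar date is to read it off the time axis of the weekly series (a sketch, week.to.date being a hypothetical helper):
### R
week.to.date <- function(week, ts.weekly = ts.bvb.w) {
  # time() gives decimal years; date_decimal() turns them into dates
  as.Date(lubridate::date_decimal(as.numeric(time(ts.weekly))[week]))
}
week.to.date(506) # should land close to 2010-11-04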
The last three remaining h-parameters have between 6 and 7 breakpoints. The first four breakpoints are almost 100% identical in both their timing (including both confidence intervals) and the magnitude of the change.
The fifth breakpoint is spread over the weeks 590, 606 and 616, while the sixth breakpoint is at week 718, 733 and 734.
Of the remaining three h-parameters, 0.0892857 is the only one with 7 breakpoints. The seventh breakpoint is in week 814.
If we, in our youthful recklessness, regard 0.08928571 as the optimal h-parameter value, then we get the following bfast result.
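For reference, this result comes from the same call as in the loop above, just with the chosen h value (80/896):
### R
fit.optimal <- bfast(ts.bvb.w, h = 0.08928571, season = "none", max.iter = 1)
plot(fit.optimal, main = "bfast result with h = 0.08928571 (80/896)")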
Breakpoint 1 - week 115 (2003-05-08)
A sustained negative trend is interrupted by a short-term positive upswing. The time series shows no special events at the time of the breakpoint.
Breakpoint 2 - week 248 (2005-11-24)
A sustained negative trend is interrupted by a short-term positive upswing. The time series shows no special events at the time of the breakpoint.
Breakpoint 3 - week 397 (2008-10-02)
The breakpoint hits exactly the day on which BVB lost 3-4 to Udinese Calcio and dropped out of the Europa League in the first round.
In a broader context, Thomas Doll had resigned 4.5 months earlier (on 19 May 2008) and was replaced by Jürgen Klopp. With this change, the long-lasting negative trend also ended. In the following 2.5 years the trend of the stock value is slightly positive.
Breakpoint 4 - week 506 (2010-11-04)
This is the breakpoint with the largest magnitude, 1.61659 Euro. In this season Dortmund fielded a young and vibrant roster. Exactly a month after the detected breakpoint (4 December 2010), Borussia became Herbstmeister ("Autumn Champion"), an unofficial title going to the league leader at the winter break - and they did so three matches before the winter break.
Breakpoint 5 - week 590 (2012-06-14)
The breakpoint with the lowest magnitude (0.14791 Euro) lies between a short negative trend (approx. 1.5 years) and a longer significant upward trend (approx. 2.5 years).
The trigger for the breakpoint could be BVB winning the double for the first time, beating Bayern 5-2 in the final of the DFB-Pokal just a month earlier (2012-05-12).
Breakpoint 6 - week 718 (2014-11-27)
Winning the World Cup title in Brazil also boosted the BVB share, which reached a new high for the year. However, the breakpoint comes a day after a 2-0 defeat to Arsenal in the Champions League.
Breakpoint 7 - week 814 (2016-09-29)
The sale of Hummels, Guendogan and Mkhitaryan in the summer of 2016 gave the BVB share an enormous boost. At the same time the team coped well with the loss of its key players and started off on a high, winning 4-0 against Borussia Mönchengladbach on the opening day, followed by five straight wins which took them to the top of the Bundesliga.
Benchmarking
With every execution of bfast() I recorded the processing time. As you might expect, the time increases with a decreasing h-parameter. The fastest run took about 10 seconds and the slowest about 10 times longer.
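Since the loop stored the processing time per h value in df, the relationship can be visualized directly (a sketch, assuming the speed column is consistently in seconds):
### R
df %>%
  dplyr::distinct(h, speed) %>%
  ggplot2::ggplot(ggplot2::aes(x = h, y = as.numeric(speed))) +
  ggplot2::geom_point() +
  ggplot2::labs(x = "h-parameter",
                y = "processing time (seconds)",
                title = "bfast() processing time vs. h-parameter")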
When running tens of thousands of different time series, i.e. pixels in an image stack, the waiting time can be considerable.
There are small tricks to shave a few percent off the processing times here and there; maybe more about that in another blog post.
Bottom line
This blog post showed you how to find the optimal h-parameter for detecting and characterizing abrupt changes in the BVB stock price time series.
If you have any questions, suggestions or spotted a mistake, please use the comment function at the bottom of this page.
Previous blog posts are available within the blog archive. Feel free to connect or follow me on Twitter - @Mixed_Pixels.