For over a year now, I have been recording my daily commute by bike, as well as other recreational cycles and runs, on a widely used app called Strava. By accessing the Strava API, it is possible (and not all that hard; I used the steps described here: http://www.open-thoughts.com/2017/01/the-quantified-cyclist-analysing-strava-data-using-r/) to import data about these sessions for analysis in a program such as R. In addition, we can obtain the raw GPS data by downloading .gpx files directly from the Strava profile page, as described in this R package: https://github.com/marcusvolz/strava/ . In this post, I will explore some of these data.
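As a rough sketch of the API step (not the full workflow from the linked post; it assumes you have already registered a Strava API application and obtained an OAuth access token, stored here in a hypothetical my_token variable), pulling a page of activity summaries looks something like this:

library(httr)
library(jsonlite)

# my_token: a hypothetical variable holding a valid OAuth access token
resp <- GET('https://www.strava.com/api/v3/athlete/activities',
            add_headers(Authorization = paste('Bearer', my_token)),
            query = list(per_page = 200, page = 1))

# parse the JSON response into a data frame of activity summaries;
# this becomes the strava_data object used below
strava_data <- fromJSON(content(resp, as = 'text'), flatten = TRUE)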

First we need to combine two different data sets: the GPS data and the data from the Strava API. These tell us different things but should describe the same activities. The .gpx files carry no ID that matches the API data, so we reconstruct each activity's start timestamp from the GPS track and use that as the join key.

library(dplyr)

new_data <- gpx_data %>%
  group_by(id) %>%
  # reconstruct each activity's start time in the ISO 8601 format
  # used by the API (e.g. "2016-02-13T09:31:30Z")
  summarise(start_date = min(time) %>%
              as.character %>%
              gsub(' ', 'T', .) %>%
              paste(., 'Z', sep = '')) %>%
  # match each API record to a GPS activity id via the start time...
  left_join(strava_data, ., by = c('start_date')) %>%
  # ...then attach the API fields to every GPS point of that activity
  left_join(gpx_data, ., by = c('id'))

Here is a summary of the combined dataset:

str(new_data)
## 'data.frame':    1729581 obs. of  22 variables:
##  $ id                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lat                 : num  51.8 51.8 51.8 51.8 51.8 ...
##  $ lon                 : num  -1.26 -1.26 -1.26 -1.26 -1.26 ...
##  $ ele                 : num  71.7 71.9 71.9 71.9 71.9 71.8 71.7 71.6 71.6 71.5 ...
##  $ time                : POSIXct, format: "2016-02-13 09:31:30" "2016-02-13 09:32:02" ...
##  $ dist_to_prev        : num  0 0.02392 0.00476 0.00781 0.00577 ...
##  $ cumdist             : num  0 0.0239 0.0287 0.0365 0.0423 ...
##  $ time_diff_to_prev   : num  0 32 1 2 2 2 2 2 2 1 ...
##  $ cumtime             : num  0 32 33 35 37 39 41 43 45 46 ...
##  $ name                : chr  "Red flag run" "Red flag run" "Red flag run" "Red flag run" ...
##  $ commute             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ distance            : num  8111 8111 8111 8111 8111 ...
##  $ athlete_count       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ total_elevation_gain: num  15 15 15 15 15 15 15 15 15 15 ...
##  $ elapsed_time        : int  3122 3122 3122 3122 3122 3122 3122 3122 3122 3122 ...
##  $ type                : chr  "Run" "Run" "Run" "Run" ...
##  $ kudos_count         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ average_speed       : num  2.63 2.63 2.63 2.63 2.63 ...
##  $ max_speed           : num  4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 ...
##  $ average_watts       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pr_count            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ start_date          : chr  "2016-02-13T09:31:30Z" "2016-02-13T09:31:30Z" "2016-02-13T09:31:30Z" "2016-02-13T09:31:30Z" ...

First, let's rescale some of the features to give more familiar units, and then pick out the runs and rides separately.

# convert units: metres to km, m/s to km/h, seconds to minutes
processed_data <- new_data %>%
  mutate(dist_km = distance / 1000,
         av_speed_kmph = average_speed * 3600 / 1000,
         time_mins = elapsed_time / 60)

Visualising these data gives us a better idea of how they are structured. How different are my runs and rides? Can we tell them apart?

library(ggplot2)

# collapse to one row per activity, since the summary fields are
# repeated on every GPS point
grouped_by_id <- processed_data %>%
  group_by(id) %>%
  summarise(dist_km = unique(dist_km),
            av_speed_kmph = unique(av_speed_kmph),
            type = unique(type))

ggplot(data = grouped_by_id, aes(dist_km, av_speed_kmph)) +
  geom_point(aes(col = type)) +
  theme_bw()

It seems we have two overlapping classes, runs and rides. Although I mostly cycle faster and further than I run, that is not always the case; some of the data may even include walking around a supermarket in the middle of a ride home! Nevertheless, can we parameterise a model to distinguish between runs and rides, based on the speed and distance of an activity?

The data seem to be bimodal, with two overlapping classes, so a sensible first choice of model is a mixture of two bivariate Gaussian distributions. This can be fitted using the expectation-maximisation (EM) algorithm.
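Concretely, we model each observation \(x = (\text{dist\_km}, \text{av\_speed\_kmph})\) as a draw from a weighted sum of \(k\) bivariate Gaussians,

\[ p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j), \qquad \sum_{j=1}^{k} \pi_j = 1, \]

and EM alternates between computing each point's responsibility under the current components (E-step) and re-estimating the weights \(\pi_j\), means \(\mu_j\) and covariances \(\Sigma_j\) from those responsibilities (M-step).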

library(mixtools)
set.seed(123)

# fit a two-component bivariate normal mixture to distance and speed
mvmix <- grouped_by_id %>%
  select(dist_km, av_speed_kmph) %>%
  mvnormalmixEM(., k = 2)
## number of iterations= 29

# plot the data with the fitted mixture components
plot(mvmix, which = 2)

The EM algorithm in this case has identified the run and ride clusters well. Visually we may be able to see other patterns and clusters, including differences between short commuting cycles, long cycles, slower cycles and runs. We could investigate this further by varying the number of clusters, \(k\), in the EM algorithm, as sketched below.
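One rough way to do that comparison (a sketch only; the parameter count assumes full covariance matrices in two dimensions, and each refit uses a random initialisation):

X <- grouped_by_id %>% select(dist_km, av_speed_kmph)
n <- nrow(X)
for (k in 2:4) {
  fit <- mvnormalmixEM(X, k = k)
  # free parameters: (k - 1) weights + 2k means + 3k covariance terms
  p <- 6 * k - 1
  bic <- -2 * fit$loglik + p * log(n)
  cat('k =', k, ' log-likelihood =', round(fit$loglik, 1),
      ' BIC =', round(bic, 1), '\n')
}

Lower BIC would then suggest a better trade-off between fit and model complexity.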

Of course, for this problem we do in fact have access to the class labels (run or ride) for the training and test data sets, and our approach so far has not made use of them. The EM algorithm is an unsupervised learning approach; a supervised approach that uses these labels can give much better results.
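As a sketch of what that supervised alternative could look like (a simple logistic regression, fitted on all the data here for brevity rather than a proper train/test split):

# encode the known labels as a binary outcome
labelled <- grouped_by_id %>% mutate(is_ride = (type == 'Ride'))

# logistic regression of activity type on distance and speed
clf <- glm(is_ride ~ dist_km + av_speed_kmph, data = labelled, family = binomial)

# predicted probability that each activity is a ride
labelled$p_ride <- predict(clf, type = 'response')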

What do the routes actually look like?

For this task, we make use of the brilliant strava package for R (https://github.com/marcusvolz/strava), which allows us to process .gpx files and show routes in a facet plot. We restrict to recent activities here so that we can see what’s going on.

# I have made slight changes to some of the plotting
# functions in a local version of the strava package
devtools::load_all('strava/')

processed_data %>%
  filter(time > as.POSIXct('2017-10-01')) %>%
  filter(time < as.POSIXct('2017-12-31')) %>%
  plot_facets()

Lots of these routes have similar shapes. Why might this be? Well, I repeat the same route each day for my commute. Fortunately I manually tagged these as commutes when recording the data, so we can tell which they are. The commutes are shown in blue, with all other activities shown in red. Perhaps we can gain more insight by removing these commutes from the dataset.
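My local tweak to the plotting functions essentially amounts to colouring each route by the commute flag; a hand-rolled version in plain ggplot2 might look something like this (a sketch, not the actual patched package code):

processed_data %>%
  filter(time > as.POSIXct('2017-10-01'), time < as.POSIXct('2017-12-31')) %>%
  ggplot(aes(lon, lat, group = id, colour = commute)) +
  geom_path(size = 0.3) +                              # one line per GPS track
  scale_colour_manual(values = c('FALSE' = 'red', 'TRUE' = 'blue')) +
  facet_wrap(~id, scales = 'free') +                   # one small panel per activity
  theme_void() +                                       # drop axes; only shapes matter
  theme(legend.position = 'none')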

We can look at all (non-commute) rides from 2017 and plot the routes, with the alpha (transparency) of each route indicating the distance of the ride.

processed_data %>%
  filter(!commute) %>%
  filter(time > as.POSIXct('2017-01-01')) %>%
  filter(time < as.POSIXct('2017-12-31')) %>%
  filter(type=='Ride') %>%
  plot_facets()

Can we measure fitness levels over time?

The best way to measure fitness levels is to do so directly with an objective test, such as a VO2 max test in a lab. Since that isn’t possible with the data available to us, perhaps we can use a proxy for fitness, such as the number of activities per week or the time spent exercising per week, and track this over time.

library(tidyr)

time_series_by_week <- processed_data %>%
  filter(!commute) %>%
  filter(time > as.POSIXct('2017-01-01')) %>%
  filter(time < as.POSIXct('2017-12-31')) %>%
  # ISO week number of each activity's start date
  group_by(week = as.numeric(strftime(start_date, format = '%V'))) %>%
  summarise(num_activities = length(unique(id)),
            total_duration_hrs = sum(unique(elapsed_time)) / 60^2) %>%
  gather(variable, value, -week)

g <- ggplot(time_series_by_week, aes(x = week, y = value, colour = variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous('value', breaks = c(0, 2, 4, 6, 8, 10))
print(g)

This suggests I did more training (distinct from commuting) in the second half of 2017, peaking roughly around September/October, which is in fact when I ran a half-marathon. With a slightly more complex model we could use these proxies to track a measure of fitness over time, similar to the VO2 max estimates provided by popular sports apps; a simple version is sketched below.
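For example, one simple option (a sketch of the idea only, not any app's actual algorithm) is an exponentially weighted moving average of weekly training time, where the smoothing factor \(\alpha\) controls how quickly "fitness" decays in weeks without training:

weekly <- processed_data %>%
  filter(!commute) %>%
  filter(time > as.POSIXct('2017-01-01')) %>%
  filter(time < as.POSIXct('2017-12-31')) %>%
  group_by(week = as.numeric(strftime(start_date, format = '%V'))) %>%
  summarise(hours = sum(unique(elapsed_time)) / 60^2)

# exponentially weighted moving average; alpha = 0.2 is an arbitrary choice
alpha <- 0.2
weekly$fitness <- Reduce(function(prev, x) (1 - alpha) * prev + alpha * x,
                         weekly$hours, accumulate = TRUE)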

Well, that’s all for now. Perhaps I can revisit this in a year and compare with data from 2018, possibly including heart rate data too!