View the source for this report, and all the model-building code, here.

Summary

As my professors eloquently stated:

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

That goal is to build a model that can predict with high accuracy, from a single 1/40th of a second snapshot of those accelerometer measurements, which of the five different lift classifications that snapshot was measuring.

Choosing the Model

This, being the final project in a machine learning course, is where I wanted to show off everything I learned and build a complex boosted ensemble of multiple classifiers. I resisted the urge, and decided instead to provide a parsimonious solution: the simplest solution that works well enough. “Well enough” in this context means correctly predicting the 20 unknown cases provided in the final quiz. For that reason, I chose to first evaluate a random forest model alone, with the intention of incrementally adding complexity if that model was not accurate enough. Further work, if necessary, would begin with gradient boosting and building an ensemble with the previous random forest model.

Data Exploration and Cleaning

The training dataset contains 19622 observations of 160 variables, but 100 of those variables are 97.9% missing values, leaving only 406 observations that contain data for those variables. These are also the only observations in the training dataset where the new_window variable is yes; all others are no.
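As a rough sketch of how that check can be done, here is the same idea on a toy data frame. The column names mimic the dataset, but the values and the 0.5 threshold are made up for illustration and are not taken from my actual code.

```r
# Toy stand-in for the training data (values are invented for illustration).
toy <- data.frame(
  new_window    = c("no", "no", "no", "yes"),
  roll_belt     = c(1.41, 1.42, 1.45, 1.48),
  avg_roll_belt = c(NA, NA, NA, 1.44),  # summary statistic, filled only on summary rows
  stringsAsFactors = FALSE
)

# Share of missing values in each column.
na_share <- colMeans(is.na(toy))

# Columns that are mostly missing (the 0.5 cutoff is an arbitrary assumption).
sparse_cols <- names(na_share)[na_share > 0.5]

# Rows that carry data in the sparse columns should be exactly the new_window == "yes" rows.
has_summary <- rowSums(!is.na(toy[, sparse_cols, drop = FALSE])) > 0
all(has_summary == (toy$new_window == "yes"))

# The figures quoted above fit together the same way:
1 - 406 / 19622  # about 0.979, i.e. 97.9% missing
```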

To understand this, let’s return to the source of the data, a study by Velloso et al. They used a variable-width sliding window technique to generate features for their analysis: for each time window, they calculated summary statistics (features) for that window of time. These same 406 observations are the generated summary data.

If I constrain the focus to the calculated summary data, what’s left is a relatively tiny number of observations for each movement class (see box plot). If I wished to reproduce the analysis in Velloso et al.’s paper, I would reproduce their feature selection process and fit multiple bagged random forest models.

But the goal of this assignment is to predict the class of movement for a single observation, not a series of observations in a window of time. For that reason, this analysis will proceed by discarding the summary observations and cleaning up the remaining data.

For the final cleanup step, division-by-zero errors in the dataset are converted into the value 99999, which, on the scale of the data, is effectively infinite.
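The cleaning steps described above can be sketched roughly as follows. This is a toy illustration, not my actual cleaning code: the data values are invented, and I am assuming here that the division-by-zero errors show up in the raw file as the spreadsheet-style string "#DIV/0!".

```r
# Toy stand-in for the raw training data (values are invented for illustration).
raw <- data.frame(
  new_window    = c("no", "yes", "no"),
  roll_belt     = c("1.41", "1.42", "#DIV/0!"),  # assumed error marker
  avg_roll_belt = c(NA, "1.4", NA),              # summary column
  classe        = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# 1. Discard the summary observations.
kept <- raw[raw$new_window == "no", ]

# 2. Drop columns that have no data left once the summary rows are gone.
kept <- kept[, colSums(!is.na(kept)) > 0, drop = FALSE]

# 3. Replace division-by-zero markers with the 99999 sentinel, then convert to numeric.
kept$roll_belt[kept$roll_belt == "#DIV/0!"] <- "99999"
kept$roll_belt <- as.numeric(kept$roll_belt)

kept
```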

Random Forest

From the creators of Random Forest, Breiman and Cutler:

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run …

This property of random forests allows us to forgo cross-validation and use more data to build a more accurate model. The 20 samples from the testing set will still be held out as the validation set, and the quiz results will be used to gauge the model’s success. After filtering out the summary observations and unnecessary variables, and correcting for “division by zero” errors in numeric fields, I traced the random forest training process for a few iterations and found it needed only a fairly small number of trees to get some amazing results (in much less time). I then generated the following model:

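library(caret)  # provides train() and trainControl(); method="rf" also needs the randomForest package

# Fit a random forest on the filtered training data, predicting classe from all other columns.
modrf <- train(y=trfiltered$classe, 
               x=subset(trfiltered, select=-classe), 
               method="rf",
               verbose=TRUE,     # passed through to randomForest
               ntree=30,         # far fewer than the default of 500 trees
               do.trace = TRUE,  # print the running OOB error as trees are added
               trControl=trainControl(verboseIter=TRUE)  # print a training log
               )
modrf$finalModel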
## 
## Call:
##  randomForest(x = x, y = y, ntree = 25, mtry = param$mtry, do.trace = TRUE,      verbose = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 25
## No. of variables tried at each split: 28
## 
##         OOB estimate of  error rate: 0.35%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 5573    2    2    1    2 0.001254480
## B   11 3774   10    1    1 0.006057414
## C    0    9 3413    0    0 0.002630041
## D    0    2   19 3193    2 0.007151741
## E    0    0    1    5 3601 0.001663432

Model Details

With an out-of-bag (OOB) error estimate of 0.27%, the accuracy of this model is certainly sufficient for this project (and likely for most practical use, too!). No further model alterations should be necessary. The ntree value above was chosen to balance accuracy against training time. Note that the printed output shown above comes from a run with ntree=25, which gave a 0.35% OOB error estimate and would likely also work fine. The verbose options print a bit more information for each iteration of the training process, and do.trace shows the running out-of-bag error as trees are added, which helped me realize I could grow far fewer than 500 trees (the default) on each iteration, saving a lot of compute time.
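As a sanity check, the 0.35% figure and the class.error column in the printed output can be recomputed from the confusion matrix itself. The snippet below just retypes the matrix from that output by hand (rows are the actual classes, columns the predicted ones); it does not touch the fitted model.

```r
# Confusion matrix copied from the printed 25-tree output.
conf <- matrix(
  c(5573,    2,    2,    1,    2,
      11, 3774,   10,    1,    1,
       0,    9, 3413,    0,    0,
       0,    2,   19, 3193,    2,
       0,    0,    1,    5, 3601),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all observations off the diagonal.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(100 * oob_error, 2)  # 0.35 (percent)
```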

Variable importance may be enlightening. Maybe we’ll see if any particular measurements stand out as particularly indicative of good or bad movements.

varImp(modrf$finalModel) %>% mutate(names = row.names(.)) %>% arrange(desc(Overall)) %>% head(10)
##      Overall             names
## 1  2965.7866        num_window
## 2  1858.9375         roll_belt
## 3  1110.5897     pitch_forearm
## 4   892.5217          yaw_belt
## 5   879.0149 magnet_dumbbell_z
## 6   788.5509 magnet_dumbbell_y
## 7   697.8030        pitch_belt
## 8   564.6941      roll_forearm
## 9   393.7598 magnet_dumbbell_x
## 10  369.9687   accel_forearm_x

Oddly, the ever-increasing num_window variable is the most important variable for predicting movement type. This implies that movement types correlate with the time of data collection. I’d be interested to rebuild this model without that variable; its inclusion was actually an oversight. But the model performs well (see below), so I’d rather not change it now.

Interestingly, the rest of the measurements are mainly positional: x, y, z, pitch, roll, and yaw. Only one measurement deals with acceleration. And all 9 of the top 9 actual measurements are from just the belt, forearm, and dumbbell.
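That tally is easy to reproduce from the variable names alone. Here is a small sketch that retypes the nine sensor variables from the importance table and counts them by sensor location and by measurement type; the name-splitting rule is my own assumption about how the columns are named.

```r
# The top sensor variables from the importance table (num_window excluded).
top_vars <- c("roll_belt", "pitch_forearm", "yaw_belt",
              "magnet_dumbbell_z", "magnet_dumbbell_y", "pitch_belt",
              "roll_forearm", "magnet_dumbbell_x", "accel_forearm_x")

parts <- strsplit(top_vars, "_")

# Sensor location: the name token that matches one of the four sensor positions.
locations <- c("belt", "forearm", "arm", "dumbbell")
location <- vapply(parts, function(p) p[p %in% locations][1], character(1))

# Measurement type: the first token of the name.
kind <- vapply(parts, function(p) p[1], character(1))

table(location)  # belt, dumbbell, and forearm appear; arm does not
table(kind)      # a single accel variable among the nine
```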

It’s possible that fewer, less-complex measurement devices could be used to provide the same quality of movement class prediction. The implication is cheap, effective movement coaching!

Validation

To run this model on the testing data, it must be cleaned in the same way the training data was cleaned, except for any modifications of the classe variable, which doesn’t exist in the testing set.
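One simple way to keep the two datasets aligned is to reuse the training column names, minus classe, to select columns from the testing data. The sketch below shows the idea on toy objects; train_cols, test_raw, and testfiltered_toy are hypothetical names with invented values, not the objects from my actual code.

```r
# Columns that survived cleaning in the (toy) training data.
train_cols <- c("num_window", "roll_belt", "classe")

# Toy stand-in for one row of the raw testing data (values are invented).
test_raw <- data.frame(num_window = 74, roll_belt = 123,
                       avg_roll_belt = NA, problem_id = 1)

# Keep exactly the predictor columns used in training; classe is absent from the testing set.
keep <- setdiff(train_cols, "classe")
testfiltered_toy <- test_raw[, keep, drop = FALSE]

names(testfiltered_toy)
```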

All that’s left is to predict the movement classes for these 20 new records:

predict(modrf$finalModel, testfiltered)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

So how does this model fare?

[Course Project Prediction Quiz: 20/20. Quiz Passed!]

Well enough! This project has been the most exciting yet, and I’m interested to see if anyone has yet capitalized on the personal-training possibilities of this technology. I’d love to try it myself :-).

References

Breiman, L.; Cutler, A. Random Forests. 2001. Retrieved July 13, 2016, from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.