Human Activity Recognition with Smartphones
I took an amazing PhD-level course last year, CSE 780 - Data Science, taught by Prof. Paul McNicholas. It was my introduction to data science and I thoroughly enjoyed it. Dr. McNicholas is a fantastic and brilliant professor. We were allowed to pick a dataset and showcase our data science knowledge around it. I chose the Kaggle contest Human Activity Recognition with Smartphones. Below are my findings. I will upload my R code to my GitHub one day.
Devices such as the Fitbit and various smart watches use biometric data to give information to the user on their fitness. The ability to categorize the physical activity that a user is performing when wearing a personal fitness device is crucial for accuracy and provides more useful information to the user. While external devices such as the Fitbit are gaining popularity, there are still more smartphones on the market than wearable electronics. Having a smartphone capable of predicting physical activity would be extremely useful and if the original hardware of the smartphone could be utilized, then it would just require software (like a mobile application) to predict user activity.
Most smartphones have an accelerometer and an internal gyroscope, typically utilized for applications such as GPS navigation, auto-rotation, and mobile video games. The dataset Human Activity Recognition with Smartphones was obtained through the data science competition website Kaggle and was posted by UCI Machine Learning. It contains data recorded (10 299 observations, 562 variables) from 30 individuals performing one of six activities (walking, walking upstairs, walking downstairs, sitting, standing and laying).
The 562 variables were created by recording 3-axial linear acceleration and 3-axial angular velocity at a rate of 50 Hz. The acceleration data could be further separated into gravitational and body-motion components. Statistical analysis was performed on the recordings as a function of time to create additional variables such as the standard deviation, minimum, maximum, energy, entropy and mean of each measured quantity. However, a clear description of each specific variable was not given; the variable labels were vague and not useful for the analysis, so they were omitted from the present study.
The dataset was originally recorded by Anguita et al., who hoped to use smartphones as a way for doctors to monitor the physical activity of elderly patients while they are at home. Using a Support Vector Machine (SVM) learning approach, they achieved 89.3 % classification accuracy.
The aim of the present study is two-fold. First, a supervised machine learning classification approach using all 562 predictors is used to classify an unlabeled test set. Second, a feature selection step is implemented prior to classification to reduce the number of predictors, and a supervised classification step is then applied to the test set. 70 % of the dataset was used for training three algorithms (random forest, single-layer neural networks and gradient boosting), while the remaining 30 % was used for testing the trained algorithms.
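The 70/30 split can be sketched as follows. The original analysis was done in R; this is an illustrative Python/scikit-learn sketch with random stand-in data of the same shape as the real dataset, not the actual Kaggle files.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HAR dataset: 10 299 observations,
# 562 predictors, six activity classes (the real data would be
# loaded from the Kaggle CSV files instead).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_299, 562))
y = rng.integers(0, 6, size=10_299)

# 70 % of observations train the models; the held-out 30 % measures
# the misclassification rate of each trained algorithm. Stratifying
# keeps the six activity classes balanced across the two sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape)
```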
All Predictor Variables – Random Forest.
The random forest algorithm uses many decision trees to determine the class of an observation by aggregating the results of many bootstrap samples (bags). The most frequent classification across the trees determines the class of the observation. The random forest algorithm has an advantage over traditional bootstrap aggregating (bagging) because it samples only a subset of the predictors at each split rather than considering all of them. This is advantageous because it decorrelates the trees, prevents a few strong variables from dominating the model, and allows variable importance to be determined.
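As a rough sketch of the random-forest-versus-bagging comparison (again Python/scikit-learn rather than R's randomForest, and with invented toy data), the only difference between the two is the number of predictors m considered at each split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in: 300 observations, 50 predictors, two of which carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_features = m is the number of predictors considered at each split;
# m = 23 mirrors the best value reported below, while max_features=None
# (all 50 predictors) recovers plain bagging.
forest = RandomForestClassifier(n_estimators=100, max_features=23,
                                oob_score=True, random_state=1).fit(X, y)
bagged = RandomForestClassifier(n_estimators=100, max_features=None,
                                oob_score=True, random_state=1).fit(X, y)
print(forest.oob_score_, bagged.oob_score_)
print(forest.feature_importances_[:2])  # the two informative predictors
```

The out-of-bag score gives a quick accuracy estimate without a separate test set, and the importances show whether any variables dominate.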
Two random forests and one bagged model were compared using the misclassification rate as a benchmark for accuracy. Sampling m = 3 and m = 23 predictors per split produced misclassification rates of ~4 % and ~2.5 % respectively, while bagging produced ~3 %. The best random forest was therefore found when m = 23. A more detailed parameter grid search was not carried out because it is computationally expensive, and random forests were not expected to outcompete boosting and neural networks, so they were not studied extensively. Variable importance for m = 23 with 100 trees can be found in Figure 1a, which shows that no single variable explicitly dominates. 100 trees were found to be satisfactory for convergence, as shown in Figure 1c.
All Predictor Variables – Neural Networks.
Trained neural networks are incredibly powerful at predictive classification. The single-layer neural network implemented in the nnet library within R is similar to the one proposed by Ripley. In a single-layer neural network, there is one layer of hidden units that combine weighted input variables to model the output classification. There are two main degrees of freedom: the number of hidden units (size) and how quickly the weights in the model decay after each iteration. The maximum number of iterations was always set high enough to allow convergence.
Cross-validation was not used for model development because of the size of the problem (10 299 observations, 562 variables); models were built on the training set without a separate validation set. Computation time was limiting with only one core available, so models were only built up to a maximum size of 18. The results of the model development as a function of decay and size can be found in Figure 2a, using the misclassification rate as a benchmark for success. The best single-layer neural network using all predictor variables had a misclassification rate of 1.45 % (10 248 weights, decay = 0.0001 and size = 18).
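The size/decay grid search can be sketched like this. This is a Python stand-in for the R nnet workflow, on invented toy data: scikit-learn's alpha is an L2 weight penalty, which plays a role analogous to (but not identical with) nnet's decay parameter.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy data: 400 observations, 20 predictors, a simple linear boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

best = None
# Grid over the two degrees of freedom: hidden-layer size and weight
# decay; max_iter is set in excess so the fit can converge.
for size in (6, 18):
    for decay in (1e-4, 1e-2):
        net = MLPClassifier(hidden_layer_sizes=(size,), alpha=decay,
                            max_iter=2000, random_state=1).fit(X, y)
        err = 1 - net.score(X, y)
        if best is None or err < best[0]:
            best = (err, size, decay)
print(best)  # (training misclassification rate, size, decay)
```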
All Predictor Variables – Boosting.
Boosting is an ensemble learning technique which combines many decision trees to classify an observation. The generalized boosting method implemented in the gbm package within R is a variant of the Gradient Boosting Machine proposed by Friedman. Each tree is fit to a random subsample of the training set, and successive iterations concentrate on the observations the current ensemble predicts poorly. After many iterations, we are left with an accurate ensemble of decision trees which can be used to classify an unlabeled observation.
There are two main degrees of freedom in the generalized gradient boosting method: the learning rate (shrinkage parameter) and the interaction depth. Half of the training set was used per bag and 500 trees (iterations) were used to find an appropriate set of rules. The number of iterations was always set high enough that the calculation converged.
To home in on an optimal learning rate and interaction depth, boosting was performed on a subset of the data (~3000 observations instead of the ~7000 in the full training set). A learning rate of 0.1 and an interaction depth of 4 produced optimal results of 1.5 % error on the subset. Although a learning rate of 0.01 also gave a 1.5 % error, a learning rate of 0.1 converges faster, requiring fewer iterations.
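The tuning loop looks roughly like the following (Python/scikit-learn stand-in for R's gbm, with invented data; GradientBoostingClassifier's subsample, learning_rate and max_depth correspond to gbm's bag.fraction, shrinkage and interaction.depth):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data with an interaction term, so interaction depth matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# subsample=0.5 mirrors using half of the training set per iteration;
# the learning rate (shrinkage) and max_depth (interaction depth) are
# the two degrees of freedom being tuned.
for lr in (0.01, 0.1):
    gbm = GradientBoostingClassifier(learning_rate=lr, max_depth=4,
                                     subsample=0.5, n_estimators=100,
                                     random_state=1).fit(X, y)
    print(lr, round(1 - gbm.score(X, y), 3))
```

With a fixed tree budget, the smaller learning rate leaves a larger residual error, which is why 0.1 was preferred over 0.01 when both eventually reach the same error.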
With these optimal parameters, 5-fold cross-validation was performed on the full training set to determine the misclassification rate and variable importance. Cross-validation was not used for model selection (parameter tuning) because it is extremely computationally expensive, even when parallelized. Using 5-fold cross-validation on 5 CPUs with an interaction depth of 4 and a learning rate of 0.1, the misclassification rate on the test set was found to be 0.90 %. Cross-validation showed (Figure 4a) that the deviance converged after 293 trees (1000 trees were allowed).
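A parallelized 5-fold cross-validation of the tuned model can be sketched as follows (again a Python/scikit-learn stand-in with toy data, not the original gbm call):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Toy data: a single informative predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(int)

# 5-fold cross-validation spread over 5 workers, mirroring the 5-CPU
# run described above; the mean fold error estimates the
# misclassification rate of the tuned model.
gbm = GradientBoostingClassifier(learning_rate=0.1, max_depth=4,
                                 subsample=0.5, random_state=1)
scores = cross_val_score(gbm, X, y, cv=5, n_jobs=5)
print(1 - scores.mean())  # estimated misclassification rate
```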
Variable importance can be determined from boosting by counting the number of times a specific variable is used in a splitting rule, and it can reveal what the algorithm has difficulty classifying. If we look at the predicted class probabilities for three variables shown in Figure 5, we can observe the main classification problem: differentiating sitting from standing. This is believed to stem from how the data were recorded: the smartphone was fastened to the chest of the subject, and since a person's chest is upright when both sitting and standing, there is significant overlap between these two classes.
It is also clear from Figure 5 that a handful of variables will not suffice to describe the physical activity of the subject; a relatively large subset of the predictor variables is required to differentiate sitting from standing. Using the variable importance from boosting (since it performed better than random forests and neural networks), a subset of the predictor variables was created (shown in Figure 4b) by keeping every variable with importance above 0.001. The choice of 0.001 was arbitrary. While these variables are good for boosting, they may not be optimal for neural networks, so a liberal cut-off was used. These top variables (349 out of 562) are used as the subset for further analysis.
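The importance-thresholding step amounts to a boolean mask over the columns. A minimal Python/scikit-learn sketch on toy data (the original work used gbm's relative influence in R; scikit-learn's importances are likewise normalized, so the 0.001 cut-off is used the same way):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data: 30 predictors, only columns 0 and 1 are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

gbm = GradientBoostingClassifier(random_state=1).fit(X, y)

# Keep every predictor whose relative importance exceeds the (arbitrary)
# 0.001 cut-off; the reduced matrix then feeds the second round of
# random forest, neural network, and boosting models.
keep = gbm.feature_importances_ > 0.001
X_subset = X[:, keep]
print(f"{keep.sum()} of {X.shape[1]} predictors retained")
```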
Boosting Subset – Random Forest.
Using the subset of predictors previously derived from boosting, a random forest was used for classification. With m = 18 variables sampled per split, the resulting test misclassification rate after 300 trees was found to be 2.4 %. Model development (finding the optimal m value) was not carried out extensively in the interest of time. It is clear that random forest does not predict the class of an observation better than boosting or neural networks. This test misclassification error is similar to the result obtained using all predictor variables, which suggests that variables important for boosting may not be especially important to random forests.
Boosting Subset – Neural Networks.
A single-layer neural network was used on the variable subset determined through cross-validated gradient boosting. With two main degrees of freedom (size and decay), a model first had to be developed; cross-validation was used to home in on an appropriate range of parameter values.
The misclassification rates on the test set are shown in Figure 6b; the best-performing neural network was one which was not tested through cross-validation. A decay of 0.01 and a size of 18 produced a misclassification rate of 1.00 %, a marked improvement over the all-predictor-variable neural network's 1.45 %. This model also had only roughly two-thirds as many weights as the original network (6414 compared with 10 248).
Boosting Subset – Boosting.
Boosting was performed on the subset of predictor variables, i.e., all variables with a boosting relative importance greater than 0.001. Using the optimal parameters derived from the original boosting training (learning rate = 0.1, interaction depth = 4), 5-fold cross-validated boosting was performed on the training set using 349 of the 562 predictor variables. 462 iterations were required for convergence of the multinomial deviance (compared to 293 when using all predictor variables). The resulting test misclassification rate was 0.81 %, an improvement over the 0.90 % obtained with all 562 predictor variables.
Conclusions & Comments.
Although not shown, principal component analysis (PCA) was not a viable dimensionality-reduction technique for this dataset; PCA increased the misclassification rate in all three machine learning techniques, likely because lower-variance components are needed for accurate model fitting. However, dimensionality reduction via variable importance ranking was successful, producing a more accurate model in all three machine learning algorithms studied. Boosting and neural networks on the reduced dataset produced the best results, with 0.81 % and 1.00 % misclassification error respectively.
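That PCA failure mode can be illustrated with a small contrived example (Python/scikit-learn; the data, dimensions and the use of logistic regression as a simple stand-in classifier are all invented for illustration): when the class-separating signal lives in a low-variance direction, the leading principal components keep the noise and discard the signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# High-variance noise columns plus one low-variance column that alone
# separates the two classes.
noise = rng.normal(scale=5.0, size=(300, 20))
signal = rng.normal(scale=0.1, size=(300, 1))
X = np.hstack([noise, signal])
y = (signal[:, 0] > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
acc_raw = cross_val_score(clf, X, y, cv=5).mean()

# The leading components track the high-variance noise, so the
# class-separating direction is largely discarded.
Z = PCA(n_components=5).fit_transform(X)
acc_pca = cross_val_score(clf, Z, y, cv=5).mean()
print(acc_raw, acc_pca)
```

On the raw predictors the classifier finds the informative column; after projecting onto the top five components, accuracy collapses toward chance, mirroring how PCA hurt all three models here.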
When compared to the original authors of the dataset, the models outlined in this study perform significantly better. Anguita et al. had a misclassification rate of ~10 %. Kaggle user umairabro had a misclassification rate of ~5 % using a multilayer neural network, although they did not perform rigorous parameter tuning. Kaggle user christopheemilempaga used an SVM without rigorous parameter tuning and had a misclassification rate of ~8 %.
Finely tuned neural networks and finely tuned boosting perform extraordinarily well on the data set and perform significantly better than all published models on the Human Activity Recognition with Smartphones dataset to date. Perfect accuracy is likely not achievable on this dataset due to the similarity of the classes Sitting and Standing. The data was recorded using a chest-mounted smartphone and there are very few differences according to an accelerometer and a gyroscope for these two activities. To produce a more accurate trained model, one could use two smartphones, one attached to the thigh of the participant and one attached to the chest to eliminate this error.
In terms of providing a useful model for real-life applications, data would need to be recorded with a smartphone in the pocket of an individual. This provides other challenges, however. Sitting and Standing would likely be fine in terms of differentiating clusters, but Laying and Sitting would likely have significant overlap. If this smartphone pocket dataset were created, and a model were trained using gradient boosting or single-layer neural networks, then human activity could be tracked in extremely fine detail when coupled with GPS. The location and body position of an individual could be recorded and monitored from anywhere in the world after being sent via the Internet.
UCI Machine Learning. "Human Activity Recognition with Smartphones." Kaggle. https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones. Published October 6, 2016.
D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. "Energy Efficient Smartphone-Based Activity Recognition using Fixed-Point Arithmetic." Journal of Universal Computer Science, Special Issue in Ambient Assisted Living: Home Care, 19(9), May 2013.
L. Breiman (2001). "Random Forests." Machine Learning, 45(1), 5-32.
M. Chatzidakis. CSE 780 – Assignment 3. Instructor: Paul McNicholas. Submitted October 19, 2016.
B. D. Ripley (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
J. H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5), 1189-1232.
Kaggle user umairabro. "Performance Comparison of ML algorithms." https://www.kaggle.com/umairabro/d/uciml/human-activity-recognition-with-smartphones/performance-comparison-of-ml-algorithms.
Kaggle user christopheemilempaga. "SVM fit to UCI HAR data." https://www.kaggle.com/chris36/d/uciml/human-activity-recognition-with-smartphones/svm-fit-to-uci-har-data.