Fit a model to on-body sensor training data to predict the dumbbell lifting actions of 20 test cases.
There are 5 possible classes of dumbbell lifting actions:
* Class A. Lifting the dumbbell correctly
* Class B. Lifting the dumbbell incorrectly by throwing the elbows to the front
* Class C. Lifting the dumbbell incorrectly by lifting the dumbbell only halfway
* Class D. Lifting the dumbbell incorrectly by lowering the dumbbell only halfway
* Class E. Lifting the dumbbell incorrectly by throwing the hips to the front
Data source
http://groupware.les.inf.puc-rio.br/har
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
The data were loaded into R and split into training (75%) and testing (25%) sets.
library(caret)
library(gdata)
# Load the full data set and the 20 quiz cases
data <- read.csv("/Users/angelmak/Dropbox/Course/2015_05_Coursera_PracticalMachineLearning/project_HAR/pml-training.csv")
problem <- read.csv("/Users/angelmak/Dropbox/Course/2015_05_Coursera_PracticalMachineLearning/project_HAR/pml-testing.csv")
# Reproducible 75/25 split on the outcome variable "classe"
set.seed(333)
inTrain <- createDataPartition(y=data$classe, p=0.75, list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
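As a sanity check, the dimensions of the two partitions and the class balance can be inspected; a minimal sketch (output not shown):
# The class proportions should be nearly identical in both partitions
dim(training); dim(testing)
prop.table(table(training$classe))
prop.table(table(testing$classe))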
1.1 A few rows of the data were displayed to get an overall impression.
training[1:5,]
1.2 The summary statistics of all the predictors were generated.
summary(training)
From the information above, the following observations were made:
* Some columns consist almost entirely of NA values and should be removed from further analysis (a quick NA check follows the feature plot below).
* Column “X” (the row number) and “user_name” are irrelevant predictors and should be removed from further analysis.
* Timestamp data may be user/dataset-specific, and using it directly may cause overfitting (see the feature plot below). It should either be transformed into time-period data or removed from further analysis.
# Pairs plot of the timestamp columns against the outcome
col.timestamp <- matchcols(training,with=c("timestamp"))
featurePlot(x=training[,col.timestamp],y=training$classe,plot="pairs")
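To back up the first observation, the fraction of NA values per column can be tabulated; a minimal sketch (output not shown), with columns expected to cluster near 0% or near 100% NA:
# Fraction of NA values in each column of the training partition
na.frac <- colMeans(is.na(training))
table(round(na.frac, 2))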
Based on the observations above, the training data was preprocessed according to the following steps:
# Remove near-zero-variance columns
nsv <- nearZeroVar(training,saveMetrics=TRUE)
var <- rownames(nsv[(nsv$nzv==FALSE),])
training.var <- training[,var]
# Remove columns that contain any NA values
training.var.noNa <- training.var[,(colSums(is.na(training.var)) == 0)]
# Remove the timestamp, row-number ("X") and "user_name" columns
training.var.noNa.noTime <- training.var.noNa
training.var.noNa.noTime[,col.timestamp] <- list(NULL)
training.var.noNa.noTime[,'X'] <- list(NULL)
training.var.noNa.noTime[,'user_name'] <- list(NULL)
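A quick check that the cleaning kept all rows and left no NA values among the retained predictors; a minimal sketch (output not shown):
dim(training.var.noNa.noTime)
sum(is.na(training.var.noNa.noTime))  # expected to be 0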
Below are the predictors that were used for model fitting.
names(training.var.noNa.noTime)
## [1] "num_window" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [7] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [10] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [13] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [16] "pitch_arm" "yaw_arm" "total_accel_arm"
## [19] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [22] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [25] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [28] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [31] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [34] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [37] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [40] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [43] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [46] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [49] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [52] "magnet_forearm_y" "magnet_forearm_z" "classe"
The train and predict functions of the caret package were used.
As a first attempt, a single classification tree (method rpart) was fit. Comparison of the predicted classes with the expected classes showed that this simple classification tree failed to differentiate the 5 classes: the accuracy was low (about 0.49 on the training set) and class D was never predicted. Boosting was used instead to fit a better model.
modelFit.rpart <- train(training.var.noNa.noTime$classe ~.,method="rpart",data=training.var.noNa.noTime)
pred.train.rpart <- predict(modelFit.rpart,newdata=training.var.noNa.noTime)
confusionMatrix(training.var.noNa.noTime$classe,pred.train.rpart)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3802 68 304 0 11
## B 1187 977 684 0 0
## C 1205 86 1276 0 0
## D 1112 420 880 0 0
## E 393 372 725 0 1216
##
## Overall Statistics
##
## Accuracy : 0.494
## 95% CI : (0.4859, 0.5021)
## No Information Rate : 0.5231
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3384
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4938 0.50806 0.3298 NA 0.99104
## Specificity 0.9454 0.85377 0.8810 0.8361 0.88956
## Pos Pred Value 0.9085 0.34305 0.4971 NA 0.44937
## Neg Pred Value 0.6300 0.92030 0.7866 NA 0.99908
## Prevalence 0.5231 0.13066 0.2629 0.0000 0.08337
## Detection Rate 0.2583 0.06638 0.0867 0.0000 0.08262
## Detection Prevalence 0.2843 0.19350 0.1744 0.1639 0.18386
## Balanced Accuracy 0.7196 0.68092 0.6054 NA 0.94030
A boosted tree model (method gbm) was then fit to the training data and used for prediction. The Sys.time() calls bracket the training step to record its running time.
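Note that train uses 25 bootstrap resamples by default to tune the gbm parameters, which accounts for most of the running time recorded below. A lighter resampling scheme would likely be much faster; a hedged sketch (not used for the results reported here):
# 5-fold cross-validation instead of the default 25 bootstrap resamples
ctrl <- trainControl(method="cv", number=5)
# modelFit.gbm.cv <- train(classe ~ ., method="gbm",
#                          data=training.var.noNa.noTime,
#                          trControl=ctrl, verbose=FALSE)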
Sys.time()
## [1] "2015-05-23 21:55:58 PDT"
modelFit.gbm <- train(training.var.noNa.noTime$classe ~.,method="gbm",data=training.var.noNa.noTime,verbose=FALSE)
Sys.time()
## [1] "2015-05-23 22:23:25 PDT"
pred.train.gbm <- predict(modelFit.gbm,newdata=training.var.noNa.noTime)
confusionMatrix(training.var.noNa.noTime$classe,pred.train.gbm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4180 4 0 1 0
## B 11 2829 7 1 0
## C 0 11 2546 7 3
## D 0 7 21 2383 1
## E 0 4 5 11 2686
##
## Overall Statistics
##
## Accuracy : 0.9936
## 95% CI : (0.9922, 0.9948)
## No Information Rate : 0.2848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9919
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9974 0.9909 0.9872 0.9917 0.9985
## Specificity 0.9995 0.9984 0.9983 0.9976 0.9983
## Pos Pred Value 0.9988 0.9933 0.9918 0.9880 0.9926
## Neg Pred Value 0.9990 0.9978 0.9973 0.9984 0.9997
## Prevalence 0.2848 0.1940 0.1752 0.1633 0.1828
## Detection Rate 0.2840 0.1922 0.1730 0.1619 0.1825
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 0.9985 0.9946 0.9927 0.9947 0.9984
varImp(modelFit.gbm)
## gbm variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## num_window 100.000
## roll_belt 92.810
## pitch_forearm 47.218
## magnet_dumbbell_z 33.426
## yaw_belt 30.653
## magnet_dumbbell_y 25.524
## roll_forearm 17.028
## pitch_belt 15.013
## accel_forearm_x 14.178
## magnet_belt_z 13.367
## accel_dumbbell_z 12.316
## roll_dumbbell 10.404
## gyros_dumbbell_y 9.227
## gyros_belt_z 9.133
## accel_dumbbell_y 8.491
## accel_dumbbell_x 7.291
## magnet_forearm_z 6.541
## accel_forearm_z 5.268
## magnet_arm_z 4.897
## roll_arm 3.921
The top predictor shown above (num_window) appears to be correlated with the user who performed the action (see the plot below). Since user-specific data would probably introduce bias or overfitting, the “num_window” column was removed from the training data set and the model fitting was rerun.
# Plot num_window against classe, coloured by user_name
qplot(training.var.noNa$classe,training.var.noNa$num_window,data=training.var.noNa)+geom_point(aes(color=training.var.noNa$user_name))
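The same point can be checked numerically: if each participant occupies a distinct range of window numbers, num_window effectively encodes the user. A minimal sketch (output not shown):
# Range of num_window for each participant
tapply(training.var.noNa$num_window, training.var.noNa$user_name, range)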
# Drop num_window and refit
training.var.noNa.noTime2 <- training.var.noNa.noTime
training.var.noNa.noTime2[,'num_window'] <- list(NULL)
Sys.time()
## [1] "2015-05-23 22:23:27 PDT"
modelFit.gbm2 <- train(training.var.noNa.noTime2$classe ~.,method="gbm",data=training.var.noNa.noTime2,verbose=FALSE)
Sys.time()
## [1] "2015-05-23 22:49:35 PDT"
pred.train.gbm2 <- predict(modelFit.gbm2,newdata=training.var.noNa.noTime2)
confusionMatrix(training.var.noNa.noTime2$classe,pred.train.gbm2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4151 24 9 0 1
## B 77 2715 53 2 1
## C 0 63 2479 21 4
## D 1 3 62 2337 9
## E 1 18 16 31 2640
##
## Overall Statistics
##
## Accuracy : 0.9731
## 95% CI : (0.9704, 0.9756)
## No Information Rate : 0.2874
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.966
## Mcnemar's Test P-Value : 1.014e-15
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9813 0.9617 0.9465 0.9774 0.9944
## Specificity 0.9968 0.9888 0.9927 0.9939 0.9945
## Pos Pred Value 0.9919 0.9533 0.9657 0.9689 0.9756
## Neg Pred Value 0.9925 0.9909 0.9885 0.9956 0.9988
## Prevalence 0.2874 0.1918 0.1779 0.1625 0.1804
## Detection Rate 0.2820 0.1845 0.1684 0.1588 0.1794
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 0.9890 0.9753 0.9696 0.9857 0.9944
varImp(modelFit.gbm2)
## gbm variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.000
## pitch_forearm 50.560
## yaw_belt 37.480
## magnet_dumbbell_z 33.026
## magnet_dumbbell_y 26.444
## roll_forearm 26.254
## magnet_belt_z 17.398
## pitch_belt 16.354
## gyros_belt_z 13.163
## roll_dumbbell 11.777
## accel_forearm_x 11.379
## gyros_dumbbell_y 9.468
## accel_dumbbell_y 9.080
## magnet_forearm_z 7.700
## accel_forearm_z 7.550
## accel_dumbbell_x 6.995
## magnet_forearm_x 6.205
## magnet_arm_z 5.975
## yaw_arm 5.566
## magnet_belt_y 4.907
The performance of the first and second boosted tree models was compared on the testing set. The accuracy of the second model was lower than that of the first, but the difference was small.
Prediction on the testing set using the first boosted tree model
# Restrict the testing set to the columns used by the first model
testing.eq <- testing[,colnames(training.var.noNa.noTime)]
pred.test.gbm <- predict(modelFit.gbm,newdata=testing.eq)
confusionMatrix(testing.eq$classe,pred.test.gbm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 11 928 8 2 0
## C 0 5 846 3 1
## D 0 6 13 785 0
## E 0 6 1 6 888
##
## Overall Statistics
##
## Accuracy : 0.9872
## 95% CI : (0.9836, 0.9901)
## No Information Rate : 0.2865
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9837
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9922 0.9810 0.9747 0.9862 0.9989
## Specificity 0.9997 0.9947 0.9978 0.9954 0.9968
## Pos Pred Value 0.9993 0.9779 0.9895 0.9764 0.9856
## Neg Pred Value 0.9969 0.9954 0.9946 0.9973 0.9998
## Prevalence 0.2865 0.1929 0.1770 0.1623 0.1813
## Detection Rate 0.2843 0.1892 0.1725 0.1601 0.1811
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9959 0.9878 0.9862 0.9908 0.9978
Prediction on the testing set using the second boosted tree model
# Restrict the testing set to the columns used by the second model
testing.eq2 <- testing[,colnames(training.var.noNa.noTime2)]
pred.test.gbm2 <- predict(modelFit.gbm2,newdata=testing.eq2)
confusionMatrix(testing.eq2$classe,pred.test.gbm2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1373 12 9 0 1
## B 28 887 32 1 1
## C 0 26 817 11 1
## D 0 3 30 766 5
## E 6 8 14 15 858
##
## Overall Statistics
##
## Accuracy : 0.9586
## 95% CI : (0.9526, 0.964)
## No Information Rate : 0.2869
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9476
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9758 0.9476 0.9058 0.9660 0.9908
## Specificity 0.9937 0.9844 0.9905 0.9908 0.9894
## Pos Pred Value 0.9842 0.9347 0.9556 0.9527 0.9523
## Neg Pred Value 0.9903 0.9876 0.9790 0.9934 0.9980
## Prevalence 0.2869 0.1909 0.1839 0.1617 0.1766
## Detection Rate 0.2800 0.1809 0.1666 0.1562 0.1750
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9848 0.9660 0.9481 0.9784 0.9901
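The testing-set accuracies can also be computed directly for a side-by-side comparison; a minimal sketch (the values should match the confusion matrices above):
# Direct accuracy comparison of the two models on the testing set
mean(pred.test.gbm == testing.eq$classe)
mean(pred.test.gbm2 == testing.eq2$classe)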
The first and second boosted tree models gave the same prediction results for the 20 test cases in the quiz.
Prediction using the first boosted tree model
# Keep only the predictor columns (drop "classe", the 54th column)
problem.col <- colnames(training.var.noNa.noTime)[-54]
problem.eq <- problem[,problem.col]
pred.problem.gbm <- predict(modelFit.gbm,newdata=problem.eq)
pred.problem.gbm
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Prediction using the second boosted tree model
# Keep only the predictor columns (drop "classe", the 53rd column)
problem.col2 <- colnames(training.var.noNa.noTime2)[-53]
problem.eq2 <- problem[,problem.col2]
pred.problem.gbm2 <- predict(modelFit.gbm2,newdata=problem.eq2)
pred.problem.gbm2
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
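That the two models agree on every quiz case can be confirmed with a one-line check; a minimal sketch:
# TRUE if both models return identical predictions for all 20 cases
all(pred.problem.gbm == pred.problem.gbm2)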
These predictions were submitted as answers.