1. Goal

Fit a model to on-body sensor training data to predict the dumbbell lifting actions of 20 test cases.

There are 5 possible classes of dumbbell lifting actions:
* Class A. Lifting the dumbbell correctly
* Class B. Lifting the dumbbell incorrectly by throwing the elbows to the front
* Class C. Lifting the dumbbell incorrectly by lifting the dumbbell only halfway
* Class D. Lifting the dumbbell incorrectly by lowering the dumbbell only halfway
* Class E. Lifting the dumbbell incorrectly by throwing the hips to the front

Data source
http://groupware.les.inf.puc-rio.br/har
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.

2. Data splitting

The data was loaded into R and split into training (75%) and testing (25%) sets.

library(caret)
library(gdata)

# Load the training data and the 20 quiz test cases
data <- read.csv("/Users/angelmak/Dropbox/Course/2015_05_Coursera_PracticalMachineLearning/project_HAR/pml-training.csv")
problem <- read.csv("/Users/angelmak/Dropbox/Course/2015_05_Coursera_PracticalMachineLearning/project_HAR/pml-testing.csv")

# Split the data: 75% for training, 25% held out for testing
set.seed(333)
inTrain <- createDataPartition(y=data$classe, p=0.75, list=FALSE)

training <- data[inTrain,]
testing <- data[-inTrain,]
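
A quick check (not part of the original code) that the 75/25 split produced the expected set sizes:

dim(training)     # should hold roughly 75% of the rows in pml-training.csv
dim(testing)      # should hold roughly the remaining 25%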

3. Exploratory analysis

3.1 A few lines of the data were displayed to get an overall impression.

training[1:5,]

3.2 Summary statistics of all the predictors were generated.

summary(training)

From the information above, the following observations were made:

  1. Some columns consist mostly of NA values and should be removed from further analysis (one way to quantify this is sketched after the feature plot below).

  2. Column "X" (the row number) and "user_name" are irrelevant predictors and should be removed from further analysis.

  3. Timestamp data may be user/dataset-specific, and using it directly may cause overfitting (see the feature plot below). It should either be transformed into time-period data or removed from further analysis.

# Identify the timestamp columns and plot them pairwise against classe
col.timestamp <- matchcols(training, with=c("timestamp"))
featurePlot(x=training[,col.timestamp], y=training$classe, plot="pairs")
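
A minimal sketch (not part of the original preprocessing) of how observation 1 can be quantified: compute the fraction of missing values in each column of the training split.

na.frac <- colMeans(is.na(training))   # fraction of NA values in each column
summary(na.frac[na.frac > 0])          # distribution over the columns that contain any NA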

4. Data preprocessing

Based on the observations above, the training data was preprocessed as follows:

  1. Near-zero-variance covariates were removed.
nsv <- nearZeroVar(training, saveMetrics=TRUE)    # flag near-zero-variance predictors
var <- rownames(nsv[(nsv$nzv==FALSE),])           # keep only columns that are not near zero variance
training.var <- training[,var]
  2. Columns containing missing data were removed.
training.var.noNa <- training.var[,(colSums(is.na(training.var)) == 0)]   # keep only columns with no NAs
  3. Column X, user_name and the timestamp columns were removed from further analysis.
training.var.noNa.noTime <- training.var.noNa
training.var.noNa.noTime[,col.timestamp] <- list(NULL)   # drop timestamp columns
training.var.noNa.noTime[,'X'] <- list(NULL)             # drop the row-number column
training.var.noNa.noTime[,'user_name'] <- list(NULL)     # drop the user identifier

Below are the predictors that were used for model fitting.

names(training.var.noNa.noTime)
##  [1] "num_window"           "roll_belt"            "pitch_belt"          
##  [4] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
##  [7] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [10] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [13] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [16] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [19] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [22] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [25] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [28] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [31] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [34] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [37] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [40] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [43] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [46] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [49] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [52] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

5. Model fitting using training data

The train and predict functions of the caret package were used.

As a first attempt, a single classification tree (rpart) was fitted. Comparison of the predicted classes with the expected classes showed that this simple classification tree failed to differentiate the 5 classes; the accuracy was low (about 0.49 even on the training set). Boosting was used instead to fit a better model.

# Fit a single classification tree and evaluate it on the training set
modelFit.rpart <- train(classe ~ ., method="rpart", data=training.var.noNa.noTime)
pred.train.rpart <- predict(modelFit.rpart, newdata=training.var.noNa.noTime)
confusionMatrix(training.var.noNa.noTime$classe, pred.train.rpart)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3802   68  304    0   11
##          B 1187  977  684    0    0
##          C 1205   86 1276    0    0
##          D 1112  420  880    0    0
##          E  393  372  725    0 1216
## 
## Overall Statistics
##                                           
##                Accuracy : 0.494           
##                  95% CI : (0.4859, 0.5021)
##     No Information Rate : 0.5231          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3384          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4938  0.50806   0.3298       NA  0.99104
## Specificity            0.9454  0.85377   0.8810   0.8361  0.88956
## Pos Pred Value         0.9085  0.34305   0.4971       NA  0.44937
## Neg Pred Value         0.6300  0.92030   0.7866       NA  0.99908
## Prevalence             0.5231  0.13066   0.2629   0.0000  0.08337
## Detection Rate         0.2583  0.06638   0.0867   0.0000  0.08262
## Detection Prevalence   0.2843  0.19350   0.1744   0.1639  0.18386
## Balanced Accuracy      0.7196  0.68092   0.6054       NA  0.94030

Boosting with trees (gbm) was then used to fit a model to the training data and make predictions.

Sys.time()
## [1] "2015-05-23 21:55:58 PDT"
modelFit.gbm <- train(classe ~ ., method="gbm", data=training.var.noNa.noTime, verbose=FALSE)
Sys.time()
## [1] "2015-05-23 22:23:25 PDT"
pred.train.gbm <- predict(modelFit.gbm,newdata=training.var.noNa.noTime)
confusionMatrix(training.var.noNa.noTime$classe,pred.train.gbm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4180    4    0    1    0
##          B   11 2829    7    1    0
##          C    0   11 2546    7    3
##          D    0    7   21 2383    1
##          E    0    4    5   11 2686
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9936          
##                  95% CI : (0.9922, 0.9948)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9919          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9974   0.9909   0.9872   0.9917   0.9985
## Specificity            0.9995   0.9984   0.9983   0.9976   0.9983
## Pos Pred Value         0.9988   0.9933   0.9918   0.9880   0.9926
## Neg Pred Value         0.9990   0.9978   0.9973   0.9984   0.9997
## Prevalence             0.2848   0.1940   0.1752   0.1633   0.1828
## Detection Rate         0.2840   0.1922   0.1730   0.1619   0.1825
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      0.9985   0.9946   0.9927   0.9947   0.9984
varImp(modelFit.gbm)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                   Overall
## num_window        100.000
## roll_belt          92.810
## pitch_forearm      47.218
## magnet_dumbbell_z  33.426
## yaw_belt           30.653
## magnet_dumbbell_y  25.524
## roll_forearm       17.028
## pitch_belt         15.013
## accel_forearm_x    14.178
## magnet_belt_z      13.367
## accel_dumbbell_z   12.316
## roll_dumbbell      10.404
## gyros_dumbbell_y    9.227
## gyros_belt_z        9.133
## accel_dumbbell_y    8.491
## accel_dumbbell_x    7.291
## magnet_forearm_z    6.541
## accel_forearm_z     5.268
## magnet_arm_z        4.897
## roll_arm            3.921

The top predictor (num_window) shown above appears to be correlated with the users who performed the action. Since user-specific data would likely introduce bias or overfitting, the column "num_window" was removed from the training data set and the model fitting was rerun.

# Plot num_window against classe, colored by user, to show the user-specific structure
qplot(classe, num_window, data=training.var.noNa, color=user_name)
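
A small, hypothetical check (not in the original analysis) of how num_window values distribute across users:

tapply(training.var.noNa$num_window, training.var.noNa$user_name, range)   # range of window numbers per user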

training.var.noNa.noTime2 <- training.var.noNa.noTime
training.var.noNa.noTime2[,'num_window'] <- list(NULL)
Sys.time()
## [1] "2015-05-23 22:23:27 PDT"
modelFit.gbm2 <- train(classe ~ ., method="gbm", data=training.var.noNa.noTime2, verbose=FALSE)
Sys.time()
## [1] "2015-05-23 22:49:35 PDT"
pred.train.gbm2 <- predict(modelFit.gbm2,newdata=training.var.noNa.noTime2)
confusionMatrix(training.var.noNa.noTime2$classe,pred.train.gbm2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4151   24    9    0    1
##          B   77 2715   53    2    1
##          C    0   63 2479   21    4
##          D    1    3   62 2337    9
##          E    1   18   16   31 2640
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9731          
##                  95% CI : (0.9704, 0.9756)
##     No Information Rate : 0.2874          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.966           
##  Mcnemar's Test P-Value : 1.014e-15       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9813   0.9617   0.9465   0.9774   0.9944
## Specificity            0.9968   0.9888   0.9927   0.9939   0.9945
## Pos Pred Value         0.9919   0.9533   0.9657   0.9689   0.9756
## Neg Pred Value         0.9925   0.9909   0.9885   0.9956   0.9988
## Prevalence             0.2874   0.1918   0.1779   0.1625   0.1804
## Detection Rate         0.2820   0.1845   0.1684   0.1588   0.1794
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      0.9890   0.9753   0.9696   0.9857   0.9944
varImp(modelFit.gbm2)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                   Overall
## roll_belt         100.000
## pitch_forearm      50.560
## yaw_belt           37.480
## magnet_dumbbell_z  33.026
## magnet_dumbbell_y  26.444
## roll_forearm       26.254
## magnet_belt_z      17.398
## pitch_belt         16.354
## gyros_belt_z       13.163
## roll_dumbbell      11.777
## accel_forearm_x    11.379
## gyros_dumbbell_y    9.468
## accel_dumbbell_y    9.080
## magnet_forearm_z    7.700
## accel_forearm_z     7.550
## accel_dumbbell_x    6.995
## magnet_forearm_x    6.205
## magnet_arm_z        5.975
## yaw_arm             5.566
## magnet_belt_y       4.907
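
Both gbm fits above took roughly half an hour (see the Sys.time() stamps). As a hedged aside, not part of the original analysis: caret's default resampling is bootstrapping with 25 repetitions, and passing an explicit trainControl with fewer resamples, for example 5-fold cross-validation, would typically shorten the fitting time.

trCtrl <- trainControl(method="cv", number=5)   # 5-fold cross-validation instead of the default bootstrap
modelFit.gbm.cv <- train(classe ~ ., method="gbm", data=training.var.noNa.noTime2,
                         trControl=trCtrl, verbose=FALSE)   # hypothetical fit, not run in this report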

6. Evaluate model in testing data

The performance of the first and second boosted tree models was compared on the testing set. The accuracy of the second model was slightly lower than that of the first, but the difference was small.

Prediction on the testing set using the first boosted tree model

testing.eq <- testing[,colnames(training.var.noNa.noTime)]   # restrict the testing set to the columns used for training
pred.test.gbm <- predict(modelFit.gbm,newdata=testing.eq)
confusionMatrix(testing.eq$classe,pred.test.gbm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    1    0    0    0
##          B   11  928    8    2    0
##          C    0    5  846    3    1
##          D    0    6   13  785    0
##          E    0    6    1    6  888
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9872          
##                  95% CI : (0.9836, 0.9901)
##     No Information Rate : 0.2865          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9837          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9922   0.9810   0.9747   0.9862   0.9989
## Specificity            0.9997   0.9947   0.9978   0.9954   0.9968
## Pos Pred Value         0.9993   0.9779   0.9895   0.9764   0.9856
## Neg Pred Value         0.9969   0.9954   0.9946   0.9973   0.9998
## Prevalence             0.2865   0.1929   0.1770   0.1623   0.1813
## Detection Rate         0.2843   0.1892   0.1725   0.1601   0.1811
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9959   0.9878   0.9862   0.9908   0.9978

Prediction on the testing set using the second boosted tree model

testing.eq2 <- testing[,colnames(training.var.noNa.noTime2)]
pred.test.gbm2 <- predict(modelFit.gbm2,newdata=testing.eq2)
confusionMatrix(testing.eq2$classe,pred.test.gbm2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1373   12    9    0    1
##          B   28  887   32    1    1
##          C    0   26  817   11    1
##          D    0    3   30  766    5
##          E    6    8   14   15  858
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9586         
##                  95% CI : (0.9526, 0.964)
##     No Information Rate : 0.2869         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9476         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9758   0.9476   0.9058   0.9660   0.9908
## Specificity            0.9937   0.9844   0.9905   0.9908   0.9894
## Pos Pred Value         0.9842   0.9347   0.9556   0.9527   0.9523
## Neg Pred Value         0.9903   0.9876   0.9790   0.9934   0.9980
## Prevalence             0.2869   0.1909   0.1839   0.1617   0.1766
## Detection Rate         0.2800   0.1809   0.1666   0.1562   0.1750
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9848   0.9660   0.9481   0.9784   0.9901
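
The comparison stated above can also be read directly off the confusionMatrix objects; a minimal sketch (not part of the original code):

cm.test.gbm  <- confusionMatrix(testing.eq$classe,  pred.test.gbm)
cm.test.gbm2 <- confusionMatrix(testing.eq2$classe, pred.test.gbm2)
cm.test.gbm$overall["Accuracy"] - cm.test.gbm2$overall["Accuracy"]   # difference in out-of-sample accuracy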

7. Making predictions on the 20 test cases

The first and second boosted tree models gave the same prediction results for the 20 test cases in the quiz.

Prediction using the first boosting model

problem.col <- colnames(training.var.noNa.noTime)[-54]   # drop 'classe' (column 54), which is absent from the quiz data
problem.eq <- problem[,problem.col]
pred.problem.gbm <- predict(modelFit.gbm,newdata=problem.eq)
pred.problem.gbm
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Prediction using the second boosting model

problem.col2 <- colnames(training.var.noNa.noTime2)[-53]   # drop 'classe' (column 53 once num_window is removed)
problem.eq2 <- problem[,problem.col2]
pred.problem.gbm2 <- predict(modelFit.gbm2,newdata=problem.eq2)
pred.problem.gbm2
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
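
A one-line check (not in the original code) confirming the agreement stated above:

identical(as.character(pred.problem.gbm), as.character(pred.problem.gbm2))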

These predictions were submitted as answers.
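
A minimal sketch of how the answers might have been written out, one text file per test case (the file-per-answer format follows the course submission instructions of the time and is an assumption here; the loop below was not part of the original code):

answers <- as.character(pred.problem.gbm)
for (i in seq_along(answers)) {
  # write each predicted class to its own file, e.g. problem_id_1.txt
  write.table(answers[i], file=paste0("problem_id_", i, ".txt"),
              quote=FALSE, row.names=FALSE, col.names=FALSE)
}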