Project Description

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Loading and preprocessing the data

First I change the locale to English. Then I download the training and test sets if they are not already available locally.

Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
require("caret")
require("data.table")

if(!file.exists("Human_Activity_Training.csv")) {
  
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
  destfile = "Human_Activity_Training.csv")
  
}

if(!file.exists("Human_Activity_Testing.csv")) {
  
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
  destfile = "Human_Activity_Testing.csv")
  
}

training = fread("Human_Activity_Training.csv")
training = as.data.frame(training)

testing = fread("Human_Activity_Testing.csv")
testing = as.data.frame(testing)

Before continuing, some basic transformations are needed: removing features with too many NAs, since they will not provide any useful information to the model, as well as removing non-numeric features.

tidy_training = training[, which(colSums(is.na(training)) == 0)]
tidy_testing  = testing[, which(colSums(is.na(testing)) == 0)]

We also want to get rid of variables 1 to 7, since they are useless for the analysis (they are documentation variables such as user name, timestamps, and row indexes).

tidy_training=tidy_training[,-c(1:7)]
tidy_testing=tidy_testing[,-c(1:7)]
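Because the NA columns are identified separately in each dataset, the two tidy sets are not guaranteed to end up with the same predictors. The following sanity check is a sketch not present in the original analysis; it assumes the `tidy_training` and `tidy_testing` objects defined above (the testing set has a `problem_id` column instead of `classe`):

```r
# Predictors present in both sets (excludes "classe" and "problem_id",
# since each appears in only one of the two datasets)
common = intersect(names(tidy_training), names(tidy_testing))

# Align the testing set to those predictors, and keep the outcome in training
tidy_testing  = tidy_testing[, common]
tidy_training = tidy_training[, c(common, "classe")]
```

This guarantees that any model fit on `tidy_training` can be applied to `tidy_testing` without missing-column errors.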

Cross Validation

Now let’s split the training dataset further into training and validation sets in order to perform cross validation.

require(caret)
set.seed(123123)

row_indexes = 1:nrow(tidy_training)

inTrain1 = sample(row_indexes, round(nrow(tidy_training)/3), replace = FALSE)

train1 = tidy_training[inTrain1,]
validation1 = tidy_training[-inTrain1,]

inTrain2 = sample(row_indexes[-inTrain1], round(nrow(tidy_training)/3), replace = FALSE)

train2 = tidy_training[inTrain2,]
validation2 = tidy_training[-inTrain2,]

inTrain3 = sample(row_indexes[-c(inTrain2,inTrain1)], nrow(tidy_training)- 2*round(nrow(tidy_training)/3), replace = FALSE)

train3 = tidy_training[inTrain3,]
validation3 = tidy_training[-inTrain3,]
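The three index vectors should partition the rows, so that every row lands in exactly one training fold. A quick check (a sketch, using the objects defined above):

```r
# The folds must be pairwise disjoint...
stopifnot(length(intersect(inTrain1, inTrain2)) == 0,
          length(intersect(inTrain1, inTrain3)) == 0,
          length(intersect(inTrain2, inTrain3)) == 0)

# ...and together must cover every row of tidy_training
stopifnot(length(c(inTrain1, inTrain2, inTrain3)) == nrow(tidy_training))
```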

We will use the random forest algorithm with 15 trees for each model. The training dataset was split randomly into 3 different parts. Each part will be used in turn as the training set, and the model will be validated against the other two parts in order to estimate the out-of-sample error.

In the following part I train the models used for cross validation. To save time, I include the model objects in my repository (https://github.com/Costaspap/Machine-Learning-Coursera-Project), so anyone can load them with the load() command instead of running the train() commands below.

# To save time, download the model objects from the repository and load them
# instead of re-training:
# load("model1"); load("model2"); load("model3"); load("FinalModel")

# Otherwise, train the three models. Note that randomForest's argument is
# "ntree"; a misspelled "ntrees" would be silently ignored.
model1 = train(classe ~ ., data = train1, method = "rf", importance = TRUE, ntree = 15)

model2 = train(classe ~ ., data = train2, method = "rf", importance = TRUE, ntree = 15)

model3 = train(classe ~ ., data = train3, method = "rf", importance = TRUE, ntree = 15)

The random forest algorithm is highly accurate, and with a cleaned dataset and the right features the out-of-sample error is expected to be low.

However let’s evaluate the results:

confusionMatrix(validation1$classe,predict(model1,validation1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3673   14    2    0    0
##          B   59 2488   21    9    0
##          C    0   20 2259   18    0
##          D    0    2   54 2079    0
##          E    0    4    4   25 2350
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9823          
##                  95% CI : (0.9799, 0.9845)
##     No Information Rate : 0.2853          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9776          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9842   0.9842   0.9654   0.9756   1.0000
## Specificity            0.9983   0.9916   0.9965   0.9949   0.9969
## Pos Pred Value         0.9957   0.9655   0.9835   0.9738   0.9862
## Neg Pred Value         0.9937   0.9962   0.9925   0.9952   1.0000
## Prevalence             0.2853   0.1933   0.1789   0.1629   0.1796
## Detection Rate         0.2808   0.1902   0.1727   0.1589   0.1796
## Detection Prevalence   0.2820   0.1970   0.1756   0.1632   0.1822
## Balanced Accuracy      0.9912   0.9879   0.9809   0.9852   0.9985
confusionMatrix(validation2$classe,predict(model2,validation2))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3710    8    2    1    1
##          B   48 2417   30    0    0
##          C    0   30 2208   30    0
##          D    0    1   32 2094   10
##          E    0    2    9   12 2436
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9835          
##                  95% CI : (0.9812, 0.9856)
##     No Information Rate : 0.2873          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9791          
##  Mcnemar's Test P-Value : 2.346e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9872   0.9833   0.9680   0.9799   0.9955
## Specificity            0.9987   0.9927   0.9944   0.9961   0.9978
## Pos Pred Value         0.9968   0.9687   0.9735   0.9799   0.9906
## Neg Pred Value         0.9949   0.9961   0.9932   0.9961   0.9990
## Prevalence             0.2873   0.1879   0.1744   0.1634   0.1871
## Detection Rate         0.2836   0.1848   0.1688   0.1601   0.1862
## Detection Prevalence   0.2845   0.1907   0.1734   0.1634   0.1880
## Balanced Accuracy      0.9930   0.9880   0.9812   0.9880   0.9967
confusionMatrix(validation3$classe,predict(model3,validation3))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3730   11    6    1    1
##          B   36 2448   38    0    0
##          C    0   38 2236    5    0
##          D    0    0   83 2074    3
##          E    0    3    9    5 2355
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9817         
##                  95% CI : (0.9793, 0.984)
##     No Information Rate : 0.2879         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9769         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9904   0.9792   0.9427   0.9947   0.9983
## Specificity            0.9980   0.9930   0.9960   0.9922   0.9984
## Pos Pred Value         0.9949   0.9707   0.9811   0.9602   0.9928
## Neg Pred Value         0.9961   0.9951   0.9874   0.9990   0.9996
## Prevalence             0.2879   0.1911   0.1813   0.1594   0.1803
## Detection Rate         0.2851   0.1871   0.1709   0.1585   0.1800
## Detection Prevalence   0.2866   0.1928   0.1742   0.1651   0.1813
## Balanced Accuracy      0.9942   0.9861   0.9693   0.9935   0.9984

As we can see, each model has an estimated accuracy above 98%, so the out-of-sample error is indeed very low.
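The three per-fold accuracies can be combined into a single cross-validated estimate. A sketch, using the model and validation objects defined above (the `$overall["Accuracy"]` element of a caret confusion matrix holds the overall accuracy):

```r
# Accuracy of each fold's model on its held-out validation set
accs = c(confusionMatrix(validation1$classe, predict(model1, validation1))$overall["Accuracy"],
         confusionMatrix(validation2$classe, predict(model2, validation2))$overall["Accuracy"],
         confusionMatrix(validation3$classe, predict(model3, validation3))$overall["Accuracy"])

mean(accs)      # cross-validated accuracy, roughly 0.982 given the matrices above
1 - mean(accs)  # estimated out-of-sample error, under 2%
```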

Final Model and submission

Our final model uses the same random forest algorithm, is trained on the entire training dataset, and is used to predict the 20 requested test cases in the tidy_testing dataset.

FinalModel = train(classe ~ ., data = tidy_training, method = "rf", importance = TRUE, ntree = 15)

answers = as.character(predict(FinalModel, tidy_testing))

# Write each prediction to its own text file for submission
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers)
predict(FinalModel,tidy_testing)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E