Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements of themselves regularly to improve their health, to find patterns in their behavior, or simply because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
First I set the locale to English. Then I download the training and testing sets, if they are not already available locally.
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
require("caret")
require("data.table")
if (!file.exists("Human_Activity_Training.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                destfile = "Human_Activity_Training.csv")
}
if (!file.exists("Human_Activity_Testing.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                destfile = "Human_Activity_Testing.csv")
}
training = fread("Human_Activity_Training.csv")
training = as.data.frame(training)
testing = fread("Human_Activity_Testing.csv")
testing = as.data.frame(testing)
Before continuing, some basic transformations are needed. Features with many NAs must be removed, as they will not provide any useful information to the model; the same goes for non-numeric features.
tidy_training=training[,which(as.numeric(colSums(is.na(training)))==0)]
tidy_testing=testing[,which(as.numeric(colSums(is.na(testing)))==0)]
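One caveat worth noting: because the NA filter is applied to each file independently, the two tidy datasets could in principle end up with different columns. A more defensive variant, sketched below on small hypothetical stand-in frames rather than the real data, selects the complete columns from the training set and then subsets the testing set to those same predictors:

```r
# Toy stand-ins for the real training/testing tables (hypothetical data).
training <- data.frame(a = 1:3, b = c(NA, 2, 3), classe = c("A", "B", "A"))
testing  <- data.frame(a = 4:6, b = 7:9, problem_id = 1:3)

# Keep only the columns that are complete in the *training* data ...
keep <- names(training)[colSums(is.na(training)) == 0]
tidy_training <- training[, keep]
# ... and use those same predictor columns for the testing data,
# swapping the outcome column for the testing file's problem_id.
tidy_testing <- testing[, c(setdiff(keep, "classe"), "problem_id")]
```

This guarantees that the predictors seen at prediction time match the ones the model was trained on.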
We also want to drop variables 1 to 7, since they are useless for the analysis (they are documentation variables such as user name, timestamps, and row indexes).
tidy_training=tidy_training[,-c(1:7)]
tidy_testing=tidy_testing[,-c(1:7)]
Now let’s further split the training dataset into training and validation sets in order to perform cross validation.
require(caret)
set.seed(123123)
row_indexes = 1:nrow(tidy_training)
inTrain1 = sample(row_indexes, round(nrow(tidy_training)/3), replace = FALSE)
train1 = tidy_training[inTrain1,]
validation1 = tidy_training[-inTrain1,]
inTrain2 = sample(row_indexes[-inTrain1], round(nrow(tidy_training)/3), replace = FALSE)
train2 = tidy_training[inTrain2,]
validation2 = tidy_training[-inTrain2,]
inTrain3 = sample(row_indexes[-c(inTrain2,inTrain1)], nrow(tidy_training)- 2*round(nrow(tidy_training)/3), replace = FALSE)
train3 = tidy_training[inTrain3,]
validation3 = tidy_training[-inTrain3,]
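The same three-way split can be written more compactly. A minimal base-R sketch (using a hypothetical row count of 150 in place of the real `nrow(tidy_training)`) shuffles the row indices once and cuts them into three disjoint folds:

```r
set.seed(123123)
n <- 150                                    # stand-in for nrow(tidy_training)
# Permute the row indices, then assign them round-robin to three folds.
folds <- split(sample(n), rep(1:3, length.out = n))
# Fold k would then train on tidy_training[folds[[k]], ]
# and validate on tidy_training[-folds[[k]], ].
```

caret's `createFolds()` provides the same functionality with stratification by class, which is often preferable for classification problems.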
We will use the random forest algorithm with 15 trees for each model. The training dataset was split randomly into 3 parts. Each part will be used in turn as a training set, with the remaining two parts serving as the validation set to estimate the out-of-sample error.
In the following part I train the models used for cross validation. To save time, I have included the model objects in my repository (https://github.com/Costaspap/Machine-Learning-Coursera-Project), so anyone can load them with the load() command instead of running the train() commands below.
# Load the pre-trained model objects (downloaded from the repository) to save time
load("model1")
load("model2")
load("model3")
load("FinalModel")
# Train the three models (ntree is passed through to randomForest;
# the argument is named ntree, not ntrees)
model1 = train(classe ~ ., data = train1, method = "rf", importance = TRUE, ntree = 15)
model2 = train(classe ~ ., data = train2, method = "rf", importance = TRUE, ntree = 15)
model3 = train(classe ~ ., data = train3, method = "rf", importance = TRUE, ntree = 15)
The random forest algorithm is highly accurate, and with a clean dataset and the right features the out-of-sample error is expected to be low.
Let’s evaluate the results:
confusionMatrix(validation1$classe,predict(model1,validation1))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3673 14 2 0 0
## B 59 2488 21 9 0
## C 0 20 2259 18 0
## D 0 2 54 2079 0
## E 0 4 4 25 2350
##
## Overall Statistics
##
## Accuracy : 0.9823
## 95% CI : (0.9799, 0.9845)
## No Information Rate : 0.2853
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9776
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9842 0.9842 0.9654 0.9756 1.0000
## Specificity 0.9983 0.9916 0.9965 0.9949 0.9969
## Pos Pred Value 0.9957 0.9655 0.9835 0.9738 0.9862
## Neg Pred Value 0.9937 0.9962 0.9925 0.9952 1.0000
## Prevalence 0.2853 0.1933 0.1789 0.1629 0.1796
## Detection Rate 0.2808 0.1902 0.1727 0.1589 0.1796
## Detection Prevalence 0.2820 0.1970 0.1756 0.1632 0.1822
## Balanced Accuracy 0.9912 0.9879 0.9809 0.9852 0.9985
confusionMatrix(validation2$classe,predict(model2,validation2))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3710 8 2 1 1
## B 48 2417 30 0 0
## C 0 30 2208 30 0
## D 0 1 32 2094 10
## E 0 2 9 12 2436
##
## Overall Statistics
##
## Accuracy : 0.9835
## 95% CI : (0.9812, 0.9856)
## No Information Rate : 0.2873
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9791
## Mcnemar's Test P-Value : 2.346e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9872 0.9833 0.9680 0.9799 0.9955
## Specificity 0.9987 0.9927 0.9944 0.9961 0.9978
## Pos Pred Value 0.9968 0.9687 0.9735 0.9799 0.9906
## Neg Pred Value 0.9949 0.9961 0.9932 0.9961 0.9990
## Prevalence 0.2873 0.1879 0.1744 0.1634 0.1871
## Detection Rate 0.2836 0.1848 0.1688 0.1601 0.1862
## Detection Prevalence 0.2845 0.1907 0.1734 0.1634 0.1880
## Balanced Accuracy 0.9930 0.9880 0.9812 0.9880 0.9967
confusionMatrix(validation3$classe,predict(model3,validation3))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3730 11 6 1 1
## B 36 2448 38 0 0
## C 0 38 2236 5 0
## D 0 0 83 2074 3
## E 0 3 9 5 2355
##
## Overall Statistics
##
## Accuracy : 0.9817
## 95% CI : (0.9793, 0.984)
## No Information Rate : 0.2879
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9769
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9904 0.9792 0.9427 0.9947 0.9983
## Specificity 0.9980 0.9930 0.9960 0.9922 0.9984
## Pos Pred Value 0.9949 0.9707 0.9811 0.9602 0.9928
## Neg Pred Value 0.9961 0.9951 0.9874 0.9990 0.9996
## Prevalence 0.2879 0.1911 0.1813 0.1594 0.1803
## Detection Rate 0.2851 0.1871 0.1709 0.1585 0.1800
## Detection Prevalence 0.2866 0.1928 0.1742 0.1651 0.1813
## Balanced Accuracy 0.9942 0.9861 0.9693 0.9935 0.9984
As we can see, each model has an estimated accuracy above 98%, so the out-of-sample error is indeed very low.
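The three validation accuracies can be pooled into a single cross-validated estimate of the out-of-sample error. A small sketch, using the accuracy values reported in the confusion matrices above:

```r
# Validation accuracies of model1, model2, model3 (from the output above).
acc <- c(0.9823, 0.9835, 0.9817)
# The cross-validated out-of-sample error estimate is one minus the mean accuracy.
oos_error <- 1 - mean(acc)
round(oos_error, 4)   # 0.0175, i.e. roughly 1.8%
```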
Our final model uses the same random forest algorithm, is trained on the entire training dataset, and is used to predict the 20 requested test cases in the tidy_testing dataset.
FinalModel = train(classe ~ ., data = tidy_training, method = "rf", importance = TRUE, ntree = 15)
answers=as.character(predict(FinalModel,tidy_testing))
pml_write_files = function(x) {
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers)
predict(FinalModel,tidy_testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E