Hyewon Kwon
Class: Data Integration & Fusion
April 28, 2023
Group B2: Hyewon Kwon, Natalie Chow, Nisarga Kadam
Quality of Sleep Predictor
The goal of this project is to find the regression model that best predicts sleep quality, measured by how long it takes a user to fall asleep, based on their daily activity. Exercise is known to correlate with sleep quality. The data come from FitBit devices, which are a common way for people to track both how much activity they get during the day and various measures of sleep, such as how long they were asleep.
Data Integration
The datasets came from Kaggle and were originally generated in 2016 by thirty respondents among FitBit Tracker users through a distributed survey. The respondents provided their personal tracker data, including minute-level output for physical activity and sleep monitoring. This project used the sleepDay_merged and dailyActivity_merged datasets, which are available at the bottom of that Kaggle page.
- Create a common date column and merge the two datasets on id and date
- Set the target value for the train/test models: sleepMins = TotalTimeInBed - TotalMinutesAsleep
- Drop unnecessary variables: the original 15 variables in the integrated dataset were reduced to 4: "id," "steps," "calories," and "distance"
- Perform an 80/20 training-testing split (a minimal code sketch of these steps follows this list)
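The sketch below shows one way these integration steps could be written with pandas and scikit-learn. It assumes the column names from the Kaggle exports (Id, SleepDay, ActivityDate, TotalSteps, Calories, TotalDistance) and is illustrative rather than the project's actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the two Kaggle exports (file names as published on Kaggle).
sleep = pd.read_csv("sleepDay_merged.csv")
activity = pd.read_csv("dailyActivity_merged.csv")

# Build a common date column so both tables can be merged on id + date.
sleep["date"] = pd.to_datetime(sleep["SleepDay"]).dt.date
activity["date"] = pd.to_datetime(activity["ActivityDate"]).dt.date
merged = activity.merge(sleep, on=["Id", "date"], how="inner")

# Target: minutes spent in bed but not asleep.
merged["sleepMins"] = merged["TotalTimeInBed"] - merged["TotalMinutesAsleep"]

# Keep only id, steps, calories, and distance as predictors.
X = merged[["Id", "TotalSteps", "Calories", "TotalDistance"]]
y = merged["sleepMins"]

# 80/20 training-testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```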
Modeling
To predict how long it would take a user to fall asleep based on their activity that day, we performed regression using four models (a sketch of fitting and comparing them follows this list).
- Lasso
- Ridge
- SVM
- Random Forest
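A hedged sketch of how the four baseline models might be fit and compared by RMSE with scikit-learn, continuing from the split above; the hyperparameters shown are library defaults, not the project's tuned values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

models = {
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "SVM": SVR(),
    "Random Forest": RandomForestRegressor(random_state=42),
}

# Fit each model on the training split and report test RMSE.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE = {rmse:.2f}")
```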
Applying ChatGPT
We utilized ChatGPT by providing it with our existing code and asking, "Given the code, how can I perform hyperparameter tuning to improve my model?" We posed this question for each of the models we used, and ChatGPT suggested improvements to our code. At a high level, using ChatGPT's suggested code generally improved our models or left their performance roughly unchanged.
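The exact code ChatGPT returned is not reproduced here; the sketch below shows the general GridSearchCV pattern it suggested, illustrated on the Random Forest model with an assumed parameter grid.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative parameter grid; the actual grid values were chosen per model.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("CV RMSE:", -search.best_score_)
```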
Results
For the Lasso and Ridge regressions, we obtained similar values of RMSE, our error metric. Despite optimization with GridSearch, the gap between actual and predicted values did not shrink. In the case of Ridge regression, our original model was already well tuned to our data and produced a better RMSE than the GridSearch-tuned version. The SVM model was computationally expensive and was the least accurate model we ran. The Random Forest model produced the best results, with the lowest RMSE both with and without GridSearch. Although we expected our ensemble model, which combined Lasso and Random Forest, our two best-performing models, to perform better, it did not beat the Random Forest model alone.
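As an assumption about how such an ensemble could be built, the sketch below averages Lasso and Random Forest predictions with scikit-learn's VotingRegressor; the project's actual ensemble construction may differ.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Average the predictions of the two best single models.
ensemble = VotingRegressor(
    estimators=[("lasso", Lasso()), ("rf", RandomForestRegressor(random_state=42))]
)
ensemble.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, ensemble.predict(X_test)))
print(f"Ensemble RMSE = {rmse:.2f}")
```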
Contact Us
- Hyewon Kwon
- Natalie Chow
- Nisarga Kadam