
Class: Data Integration & Fusion

April 28, 2023

Group B2: Hyewon Kwon, Natalie Chow, Nisarga Kadam

Quality of Sleep Predictor

The goal of this project is to find the regression model that best predicts quality of sleep, measured by how long it takes a person to fall asleep, from their daily activity. Exercise is known to be correlated with sleep quality. The data come from FitBit devices, a common way for people to track both how much activity they get during the day and various measures of sleep, such as how long they were asleep.

Data Integration

The datasets come from Kaggle and were originally generated in 2016 by thirty FitBit Tracker users who responded to a distributed survey and consented to share their personal tracker data, including minute-level output for physical activity and sleep monitoring. This project uses the sleepDay_merged and dailyActivity_merged datasets, which are available at the bottom of this page.


  • Create a common date column in each dataset and merge the two datasets on id and date

  • Set the target variable for the train/test models: sleepMins = TotalTimeInBed - TotalMinutesAsleep

  • Drop unnecessary variables: the 15 variables in the integrated dataset were reduced to four: "id," "steps," "calories," and "distance."

  • Perform an 80/20 training/testing split (a minimal sketch of these steps appears after this list)
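The snippet below is a minimal sketch of these integration steps using pandas and scikit-learn. The file paths, column names, and the renaming of the kept columns are assumptions based on the standard Kaggle FitBit export, not a copy of our notebook.

```python
# Sketch of the data integration steps (assumed file and column names).
import pandas as pd
from sklearn.model_selection import train_test_split

sleep = pd.read_csv("sleepDay_merged.csv")
activity = pd.read_csv("dailyActivity_merged.csv")

# Create a common date column in each dataset, then merge on id + date.
sleep["date"] = pd.to_datetime(sleep["SleepDay"]).dt.date
activity["date"] = pd.to_datetime(activity["ActivityDate"]).dt.date
merged = sleep.merge(activity, on=["Id", "date"], how="inner")

# Target: minutes spent in bed but not asleep.
merged["sleepMins"] = merged["TotalTimeInBed"] - merged["TotalMinutesAsleep"]

# Keep only the four retained variables plus the target.
data = merged.rename(columns={
    "Id": "id",
    "TotalSteps": "steps",
    "Calories": "calories",
    "TotalDistance": "distance",
})[["id", "steps", "calories", "distance", "sleepMins"]]

# 80/20 training/testing split.
X = data[["steps", "calories", "distance"]]
y = data["sleepMins"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```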

Modeling

To predict how long it would take a person to fall asleep based on their activity that day, we performed regression using four models (a baseline sketch follows the list).

  • Lasso

  • Ridge

  • SVM

  • Random Forest
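The sketch below shows how the four baseline models can be fit and compared on the split above. The scaling pipeline and the scikit-learn default hyperparameters are assumptions; our notebook's exact settings may differ.

```python
# Sketch: fit the four baseline regressors and compare test RMSE.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

models = {
    "Lasso": make_pipeline(StandardScaler(), Lasso()),
    "Ridge": make_pipeline(StandardScaler(), Ridge()),
    "SVM": make_pipeline(StandardScaler(), SVR()),
    "Random Forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE = {rmse:.2f}")
```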

Applying ChatGPT

We utilized ChatGPT by providing it with our existing code and asking, "Given the code, how can I perform hyperparameter tuning to improve my model?" We posed this question for each of the models, and ChatGPT suggested improvements to our code. At a high level, using ChatGPT's suggested code generally improved our models' performance or left it roughly unchanged.
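The snippet below is a sketch of the kind of GridSearch-based tuning ChatGPT suggested, shown here for the Random Forest model. The parameter grid is illustrative only; the grids ChatGPT actually proposed are not reproduced.

```python
# Sketch: GridSearch hyperparameter tuning for the Random Forest model.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```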

Result

For the Lasso and Ridge regressions, we obtained similar values of RMSE, our error metric. Despite optimization using GridSearch, the gap between the actual and predicted values did not shrink; in the case of the Ridge regression, our original model was better tuned to our data and produced a lower RMSE. The SVM model was computationally expensive and was the least accurate model we ran. The Random Forest model produced the best results, with the lowest RMSE both with and without GridSearch. Although we expected our Ensemble model, which combined the Lasso and Random Forest, our best-performing models, to do better, it did not outperform the Random Forest model alone.
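For reference, the sketch below builds an averaging ensemble of the Lasso and Random Forest models using scikit-learn's VotingRegressor. This is an assumption about how such an ensemble could be combined; the exact ensembling method used in the project may differ.

```python
# Sketch: averaging ensemble of Lasso and Random Forest (assumed combiner).
import numpy as np
from sklearn.ensemble import VotingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

ensemble = VotingRegressor([
    ("lasso", make_pipeline(StandardScaler(), Lasso())),
    ("rf", RandomForestRegressor(random_state=42)),
])
ensemble.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, ensemble.predict(X_test)))
print(f"Ensemble RMSE = {rmse:.2f}")
```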

Download Files

Presentation

Datasets

Jupyter Notebook

Final Report

Contact Us

  • Hyewon Kwon

  • Natalie Chow

  • Nisarga Kadam
