Buildings' Energy Consumption Prediction Models
- HUYEN TRAN
- Apr 5, 2022
- 3 min read
Updated: Aug 2, 2022

Motivation
Climate change is a globally relevant, urgent, and multi-faceted issue heavily impacted by energy policy and infrastructure. Warmer temperatures over time are changing weather patterns and disrupting the usual balance of nature. Accurate predictions of energy consumption can help policymakers target retrofitting efforts to maximize emissions reductions.
Where'd we find the data?
The dataset comes from the 2022 Women in Data Science (WiDS) Datathon. It includes roughly 100k building energy usage records collected over seven years across a number of states in the United States. Each record consists of building characteristics (e.g. floor area, facility type), weather data for the building's location (e.g. annual average temperature, annual total precipitation), and the building's energy usage for the given year.
This project was a joint effort among me and three classmates, Junfei Zhou, Shouzheng Huan, and Zhenkun Zang, to predict energy consumption for each building.
How’d we make sense of it?
We started by cleaning the data. First, we encoded the categorical features in both the train and test sets, then we identified the variables with missing values and investigated the reasons behind them. As a result, we imputed the mean value for the missing entries.
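The mean-imputation step can be sketched with sklearn's `SimpleImputer`; this is a minimal illustration (the column names are made up), fitting on the training split only so no test information leaks into the means:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_mean(train: pd.DataFrame, test: pd.DataFrame, num_cols):
    # Fit column means on the training split only, then apply to both,
    # so no information leaks from test into the imputation.
    imp = SimpleImputer(strategy="mean")
    train[num_cols] = imp.fit_transform(train[num_cols])
    test[num_cols] = imp.transform(test[num_cols])
    return train, test
```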
Using Python's sklearn, we built several models: linear regression, polynomial regression with hyperparameter tuning, ridge regression, LASSO regression, SVM with kernels, and LightGBM. We used RMSE as the evaluation metric to compare all the models.
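For reference, RMSE is simple to compute directly; a minimal sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large misses quadratically,
    # reported in the same units as the target (site_eui here).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```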
Exploratory Data Analysis
The dataset consists of 64 variables, mostly building characteristics (e.g. floor area, facility type) and weather data for the location of the building (e.g. annual average temperature, annual total precipitation). Below is a snapshot of all the variables.
Three of the 64 variables are categorical and the rest are numerical. The categorical variables are the anonymized state in which the building is located, the building classification, and the building usage type. These variables are encoded in the data engineering step. Our target variable is Site Energy Usage Intensity (site_eui), the amount of heat and electricity consumed by a building as reflected in utility bills.
Four variables in the dataset contain a large share of missing data: direction_max_wind_speed, direction_peak_wind_speed, max_wind_speed, and days_with_fog, with 54.23%, 55.19%, 54.23%, and 60.45% missing, respectively. The energy_star_rating variable has 35.26% missing data and year_built has 2.43%. Here is a summary of the missing ratios for both the training and testing datasets.
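A per-column missing-ratio summary like the one above can be produced with a one-liner on each DataFrame; a minimal sketch:

```python
import pandas as pd

def missing_ratio(df: pd.DataFrame) -> pd.Series:
    # Fraction of missing values per column, sorted from most to least missing.
    return df.isna().mean().sort_values(ascending=False)
```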

Our target variable, Site Energy Usage Intensity, is right-skewed, which could distort the models; we can handle this either by applying a log transform or by removing outliers to move the distribution closer to normal.
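The log-transform option can be sketched as below; `log1p` is a common choice since it handles zero values safely, and `expm1` maps predictions back to the original scale (this is an illustrative sketch, not necessarily the exact transform used):

```python
import numpy as np

def to_log(y):
    # log(1 + y): compresses the long right tail, safe at y == 0.
    return np.log1p(y)

def from_log(y_log):
    # Inverse transform, used on model predictions.
    return np.expm1(y_log)
```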

Data Engineering
In this section, we apply several data engineering techniques to handle skewed, missing, and categorical data, and create new features along the way (code and graphs attached).
- We filled the missing values with the mean value.
- The original energy usage data is skewed. We handle this by applying a log transform to move it toward a normal distribution.

- We encoded the categorical variables: state_factor, building_class, and building_type.
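The encoding step above can be sketched with sklearn's `LabelEncoder`, fit on the combined train and test values so categories that appear only in the test split still get a code (a minimal sketch using the three column names from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_cats(train, test,
                cat_cols=("state_factor", "building_class", "building_type")):
    # Fit each encoder on train + test together so categories present only
    # in the test split still receive a valid integer code.
    for col in cat_cols:
        le = LabelEncoder()
        le.fit(pd.concat([train[col], test[col]], ignore_index=True))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])
    return train, test
```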
Modelling
In this section, we test different models and pick the one with the best performance. First, we need to decide whether to cluster the data before modeling, so we try K-means clustering.
We get the result that K=3. We then apply linear regression and XGBoost to each cluster and compute their performance metrics. For the XGBoost model, the per-cluster metrics are [23.92, 24.6, 21.9], and for the linear regression model they are [28.5, 27.4, 141412046.2].
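The cluster-then-model step can be sketched as below. This is an illustrative sketch only: `LinearRegression` stands in for both models (the post also fit XGBoost per cluster), and the scores here are in-sample rather than held-out:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def per_cluster_rmse(X, y, n_clusters=3, seed=0):
    # Cluster the feature matrix, then fit and score one model per cluster,
    # returning a list of per-cluster RMSEs like the ones reported above.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X)
    scores = []
    for c in range(n_clusters):
        mask = labels == c
        model = LinearRegression().fit(X[mask], y[mask])
        pred = model.predict(X[mask])
        scores.append(float(np.sqrt(mean_squared_error(y[mask], pred))))
    return scores
```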
For comparison, we also run the models without clustering and find better results: XGBoost reaches an RMSE of 22.6 and the linear model an RMSE of 28.4. In other words, running the models without clustering works better.
We then further split the data and test a set of models to see which works best: linear regression, K-Nearest Neighbors, XGBoost, and LightGBM. We also apply grid search during model selection.
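The grid-search step can be sketched with sklearn's `GridSearchCV`, scored by (negative) RMSE. KNN is shown here with a made-up parameter grid; the same pattern applies to XGBoost and LightGBM with their own grids:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

def tune_knn(X, y):
    # Cross-validated grid search over KNN's main hyperparameter.
    # sklearn maximizes scores, so RMSE is negated; we flip the sign back.
    grid = GridSearchCV(
        KNeighborsRegressor(),
        param_grid={"n_neighbors": [3, 5, 7]},
        scoring="neg_root_mean_squared_error",
        cv=3,
    )
    grid.fit(X, y)
    return grid.best_params_, -grid.best_score_
```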
Result

To get a better sense of how the models stack up against each other, we rank their performance below.
Performance: KNN < Linear < Random Forest < XGBoost < LightGBM
Moreover, we compare the running time of each model and rank accordingly.
Linear (45.8s) < LightGBM (2m 37s) < KNN (4m 50s) < XGBoost (2h) < Random Forest
Comments