3. Project Abstract
● The energy sector is a key
contributor to climate change
● The city has placed an
emphasis on becoming
carbon neutral and
greener in the future.
● The group aims to provide
the city and building owners
with more information to help
optimize resources.
4. Initial Hypotheses
1. The best performing buildings are those most recently constructed and with higher ES ratings.
2. Buildings with improving ES ratings over time exhibit reduced energy and water consumption and GHG
emissions.
3. Water usage is correlated with overall energy usage.
4. Increases in energy and water usage are associated with extreme weather linked to climate change.
5. The COVID-19 pandemic has resulted in decreases in energy and water usage.
6. Given energy usage data for buildings over 50,000 Gross Square Feet in the District of Columbia, as well as
facility Energy Star (ES) ratings and weather data, we can predict future energy usage in a manner that may
generate savings for building owners (or operators).
Primary Hypothesis
Can we predict DC large commercial buildings’ energy usage based on their historical electricity and
natural gas consumption and weather patterns?
5. Project Architecture
Data sources:
● Open Data DC: Building_Energy_Benchmarks (CSV file)
● National Climatic Data Center - NOAA: monthly weather data for DC
Storage:
● PostgreSQL, hosted on Amazon Web Services
Pipeline stages:
● Data Ingestion
● Data Munging and Wrangling
● Data Computation and Analysis
● Data Modeling and Application
● Data Reporting and Visualization
Tools:
● Data Cleaning: Pandas, Jupyter Notebook
● EDA: Matplotlib, NumPy, Pandas, Seaborn
● Regression: Scikit-Learn, Yellowbrick
● Data Product Visualization: Plotly
Features:
● Building Characteristics: Type, Area, Year Built, NG Heating
● Monthly Data: Electricity, Natural Gas
Question:
● Can we predict energy usage based on buildings' historical electricity and natural gas consumption and weather patterns?
7. Data Storage and Initial Analysis
● Relational Database
● Selected for WORM (write once, read many) capabilities
● Integrates well with AWS
● Ensured all group members had access to an updated copy of the database at all times
● Jupyter Notebook was used to run code interactively (REPL)
● Pandas was used for initial transformations and exploration of the data
8. Data Cleaning and Wrangling Tools and Initial Visualizations
[Figures: Instances by Building Type; Heat Map; Map; Violin Plot]
9. Data Cleaning and Wrangling
● Removed columns that are not relevant
● Removed instances that did not have 12 months of valid data
● Removed duplicates
● Calculated energy intensity (kBtu/sq ft)
● Used Pandas Melt function to pivot usage columns to rows
● Met with data owners, the DC Department of Energy and Environment
● Merged property types from 40 to 19
● Eliminated data available only annually (e.g., water usage, GHG emissions)
● Removed renewable energy as a potential feature because of sparsity of
data
● Investigated whether Energy Star scores contributed to energy usage
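The melt and intensity steps listed on this slide can be sketched as follows. This is a minimal illustration, not the team's notebook code: the column names (building_id, reported_sqft, elec_jan, elec_feb) and the values are hypothetical stand-ins for the benchmark file.

```python
import pandas as pd

# Hypothetical wide-format rows: one row per building and year, with one
# column per month of electricity usage (kBtu). All names are assumed.
wide = pd.DataFrame({
    "building_id": [1, 2],
    "year": [2019, 2019],
    "reported_sqft": [60_000, 120_000],
    "elec_jan": [300_000.0, 480_000.0],
    "elec_feb": [270_000.0, 450_000.0],
})

# melt pivots the monthly usage columns into rows (one row per building-month).
tidy = wide.melt(
    id_vars=["building_id", "year", "reported_sqft"],
    value_vars=["elec_jan", "elec_feb"],
    var_name="month",
    value_name="usage_kbtu",
)
tidy["month"] = tidy["month"].str.replace("elec_", "", regex=False)

# Remove exact duplicates, then compute energy intensity (kBtu per sq ft).
tidy = tidy.drop_duplicates()
tidy["intensity"] = tidy["usage_kbtu"] / tidy["reported_sqft"]
```

The long format gives one observation per building and month, which is the shape a monthly regression target needs.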
14. ● Yellowbrick was used to visualize different
models and to help with feature selection
● Models were tested on all buildings and
then on individual building types
● The global model received relatively good R2
and MSE scores, and the models for the larger
individual building types also scored well
17. Additional EDA And Feature Engineering
Some definitions:
● Energy Usage in kBtu (1,000 British Thermal Units)
● Intensity = Energy Usage / Square Footage
● CLDD: Cooling Degree Days
● HTDD: Heating Degree Days
18. EDA Led Us To Focus The Analysis On Electricity Usage
And Take Into Account Seasonal Patterns
● Cyclic Encoding
● Natural Gas usage requires different
considerations for modeling (dropped)
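Cyclic encoding, as named on this slide, can be sketched as below. The slide does not say which variable was encoded; this sketch assumes it was the calendar month, and the function names are hypothetical. Mapping each month to a point on the unit circle keeps December adjacent to January, which a plain 1-12 integer does not.

```python
import math

def encode_month(month: int) -> tuple:
    """Map a calendar month (1-12) to (sin, cos) on the unit circle."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)

def month_distance(a: int, b: int) -> float:
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))
```

With this encoding, month_distance(12, 1) equals month_distance(1, 2), and months six apart (for example January and July) are the farthest from each other.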
20. Global Model Versus Models By Building Type
● Lodging/Residential and Office
buildings: ~80% of the dataset
● Started with simple Linear Regression
based on EDA findings
● Energy Usage or Intensity as the target
variable?
● Explored different combinations of
features
GLOBAL: All Building Types
Building types: Lodging/Residential, Office, Banks, Education, Entertainment,
Food Sales, Food Services, Healthcare, Industrial, Mixed Use, Others, Parking,
Public Services, Religious, Utilities, Services, Tech, Retail, Storage
21. We Made Some Decisions on Target Variables And
Some Features, Based on Modeling
DOEE recommended using: reported floor area and 2018-2019 usage data
Options compared:
● Monthly energy usage for years 2010 - 2019 vs. monthly energy usage for years 2018 - 2019
● One Hot Encoded Foreign Keys, one at a time, vs. all Foreign Keys One Hot Encoded
● Tax Record Floor Area vs. Reported Floor Area
● Target Variable: Intensity vs. Target Variable: Energy Usage
22. Coefficient R2 Increases When One Hot Encoding Both ‘Ward’ And ‘Year Built’
Global Model, LinearRegression
1) Effect of different foreign keys as features:
Years       Area              One Hot                  R2
All years   Tax record area   ‘Ward’                   0.11
All years   Tax record area   ‘Year Built’             0.09
All years   Tax record area   ‘Ward’, ‘Year Built’     0.16
2) Target variable (with One Hot -- ‘Ward’, ‘Year Built’):
Years       Area            Target            R2
2018-2019   Reported area   ‘Intensity’       0.36
2018-2019   Reported area   ‘Energy Usage’    0.73
Also, ‘reported area’ is a better predictor
of Energy Usage in commercial buildings
23. Preparation For Modeling
● Built a separate Jupyter notebook for each model attempt
● Pipeline
○ ColumnTransformer
○ FeatureUnion
○ OneHotEncoder
● Cross validation
○ Time Series Split with 12 folds
○ Returned the Coefficient of Determination (R2) and mean squared error (MSE) of the regressor,
along with the final model, fitted on all of the data
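The preparation steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the team's notebook: the data is synthetic, the column names (ward, reported_sqft, cldd, htdd) are hypothetical, and FeatureUnion is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, time-ordered records standing in for the real dataset (assumed).
rng = np.random.default_rng(0)
n = 260
X = pd.DataFrame({
    "ward": rng.choice(["Ward 1", "Ward 2", "Ward 3"], size=n),
    "reported_sqft": rng.uniform(50_000, 500_000, size=n),
    "cldd": rng.uniform(0, 500, size=n),
    "htdd": rng.uniform(0, 900, size=n),
})
y = 0.005 * X["reported_sqft"] + 0.5 * X["cldd"] + rng.normal(0, 50, size=n)

# One-hot encode the categorical column; pass numeric columns through as-is.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["ward"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# 12 forward-chaining folds: each fold trains on earlier rows, tests on later ones.
cv = TimeSeriesSplit(n_splits=12)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=("r2", "neg_mean_squared_error"))
mean_r2 = scores["test_r2"].mean()
mean_mse = -scores["test_neg_mean_squared_error"].mean()

# Final model, fitted on all of the data.
final_model = model.fit(X, y)
```

TimeSeriesSplit never tests on rows that precede the training rows, so the scores reflect predicting later months from earlier ones rather than interpolating within a shuffled sample.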
24. Algorithm Selection
Global Model:
● LinearRegression
● SGDRegressor, PolynomialFeatures (quad, cub)
● LinearRegression, PolynomialFeatures (quad, cub)
● Random Forest
Models by Building Type:
● Ensemble: LinearRegression, RandomForest, VotingRegressor
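The per-building-type ensemble named above (LinearRegression and RandomForest combined in a VotingRegressor) might look like the sketch below. The data is synthetic and the hyperparameters are assumptions; the slides do not state the settings actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data with one linear and one nonlinear effect (assumed).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 3.0 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.1, size=200)

ensemble = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X, y)

# With no weights given, the ensemble prediction is the plain average
# of the fitted members' predictions.
member_preds = np.column_stack(
    [est.predict(X[:5]) for est in ensemble.estimators_]
)
ensemble_preds = ensemble.predict(X[:5])
```

Averaging lets the linear model carry the overall trend while the forest picks up nonlinear structure.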
25. Results
Global Model:
Model              Mean R2   Std. Dev. R2   Mean RMSE   Std. Dev. RMSE
LinearRegression   0.72      0.04           891         445
RandomForest       0.82      0.04           718         365
LR-PolyFeat(2)     0.84      0.12           681         614
LR-PolyFeat(3)     0.65      0.23           992         811
Average Usage (MBtu) - All Buildings: P25 = 174, P50 = 400, P75 = 900
Lodging/Residential:
Model              Mean R2   Std. Dev. R2   Mean RMSE   Std. Dev. RMSE
Ensemble           0.84      0.04           261         148
Average Usage (MBtu) - Lodg./Res.: P25 = 174, P50 = 400, P75 = 900
Office:
Model              Mean R2   Std. Dev. R2   Mean RMSE   Std. Dev. RMSE
Ensemble           0.89      0.02           331         129
Average Usage (MBtu) - Office: P25 = 455, P50 = 862, P75 = 1,465
26. A Global Model Does OK, Per-Building Models
Excel At Predicting Energy Usage
● Ensemble models for both Lodging/Residential and Office buildings improve on
weaker models such as Linear Regression and are therefore more predictive
● The Linear Regression model with second order polynomial features performs
better than the one with third order polynomial features
● The global model’s RMSE is high relative to the interquartile range of
buildings’ energy usage
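The comparison of RMSE to interquartile range can be made concrete with the figures on the results slide. This sketch assumes the RMSE values are in the same units (MBtu) as the usage percentiles, which the slides do not state; the numbers are transcribed as reported, and RandomForest is picked as the global model only for illustration.

```python
# Values transcribed from the results slide; units assumed to be MBtu.
results = {
    "Global (RandomForest)": {"rmse": 718, "p25": 174, "p75": 900},
    "Lodging/Residential (Ensemble)": {"rmse": 261, "p25": 174, "p75": 900},
    "Office (Ensemble)": {"rmse": 331, "p25": 455, "p75": 1465},
}

def rmse_to_iqr(rmse: float, p25: float, p75: float) -> float:
    """Ratio of model error to the interquartile range of usage."""
    return rmse / (p75 - p25)

ratios = {name: round(rmse_to_iqr(**v), 2) for name, v in results.items()}
# ratios: global 0.99, Lodging/Residential 0.36, Office 0.33
```

Under that assumption, the global model's error is about as wide as the middle half of the usage distribution, while the per-type ensembles err by roughly a third of it.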
27. Conclusions
● It is possible to accurately predict energy used by DC large commercial
buildings with this type of data
● Owner-reported (versus tax-reported) square footage is the most important
driver of energy usage in buildings larger than 50,000 square feet in DC
● Using the methodology presented in this study, DOEE will have a good
picture of how large commercial buildings consume energy
● Also, once new data is available, DOEE will have a better understanding of
the effects of COVID-19 on energy consumption profiles
28. Next Steps
● Update study once 2020 data is available
● Contribute to the DC Open Data Repo
● Work further with DOEE
● Explore additional features in models
● Look more closely at the importance of building location
● Dominique becomes a TA and later an instructor at G-town and then a Super
Star!!!!