An explanation of how Can I Solar? (www.canisolar.com) works. Discussion of linear regression and time series forecasting. Verification of linear regression assumptions; discussion of an automatic model selection pipeline for linear, ARIMA, and exponential smoothing time series models.
2. MOTIVATION
• Residential solar sector grew 51% from 2013 to 2014
• Projected market value of $3.7 billion in 2015
• Complex decision with many variables
• Homeowners want to know:
  • How much money can I save?
  • When will I break even?
3. CAN I SOLAR?
A DATA-DRIVEN WEB APPLICATION
http://www.canisolar.com
4. MODELING INSTALLATION COSTS
• Data on 400,000 installs obtained from National Renewable Energy Laboratory
• Cost of solar installations varies by:
  • size of the array
  • year of installation
  • location of installation
• Multiple linear regression provides good fit and is easily interpretable
• Also tried multilevel modeling and random forest regression
5. MODELING FUTURE ELECTRICITY PRICES
• 15 years of monthly historical electricity prices by state obtained from Energy Information Administration
• Prices and trends vary significantly by state, so no one model works best for all states
• Developed a pipeline to automatically test, validate, and select an appropriate time-series model for each state, e.g.:
  • linear
  • ARIMA
  • exponential smoothing
9. GABRIEL J. MICHAEL
• Ph.D., Political Science, George Washington University
• Used survival regression to model countries' adoption of intellectual property laws
• Postdoc, Yale Law School
• Used NLP with SVMs to classify tweets and regulatory comments on political topics
Exploring the since-demolished PEPCO Benning Generating Station, Washington, DC
Urban explorer, electronics hobbyist
Visualization of Twitter users' connections and sentiment about net neutrality
10. MODELS OF INSTALLATION COSTS
Model                        R2 or Pseudo R2   10-fold CV MSE   Model Form
Simple Linear Regression     0.81              0.089            log(cost) ~ log(size_kw)
Multiple Linear Regression   0.89              0.053            log(cost) ~ log(size_kw) + state + year
Multilevel Model             0.89              0.050            log(cost) ~ log(size_kw) + (log(size_kw) | state/year_installed)
Random Forest Regression     0.93              0.050            log(cost) ~ log(size_kw)
Notes
• Linear regression: easy to interpret and explain
• Multilevel model: confidence and prediction intervals for multilevel models are difficult to interpret
• Random forest: scikit-learn's random forest regressor doesn't support factors, and the R packages are too slow
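A minimal sketch of how a 10-fold cross-validated MSE like the ones above can be computed for the simplest form, log(cost) ~ log(size_kw). The data here are synthetic and generated on the spot (not the NREL installs), and the helper names `fit_simple_ols` and `kfold_cv_mse` are hypothetical; the original analysis was done in R and may have differed in its details.

```python
import math
import random


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def kfold_cv_mse(xs, ys, k=10, seed=0):
    """Average held-out mean squared error over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k


# Synthetic installs (assumption): cost follows a power law in array size,
# with multiplicative noise, so the relationship is linear on the log-log scale.
rng = random.Random(42)
size_kw = [rng.uniform(2.0, 12.0) for _ in range(500)]
cost = [math.exp(8.5 + 0.9 * math.log(s) + rng.gauss(0.0, 0.25)) for s in size_kw]

log_size = [math.log(s) for s in size_kw]
log_cost = [math.log(c) for c in cost]

intercept, slope = fit_simple_ols(log_size, log_cost)
cv_mse = kfold_cv_mse(log_size, log_cost, k=10)
print(f"slope={slope:.3f}  10-fold CV MSE={cv_mse:.3f}")
```

Because both sides are logged, the slope reads as an elasticity: a 1% larger array is associated with roughly a slope-% higher cost.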
11. Per-capita electricity consumption has flattened and even declined in recent years
[Figure: United States, kWh per capita, 1960 to 2011]
12. PHOTOVOLTAIC PERFORMANCE DECLINE OVER TIME
• Industry standard warranties offer guaranteed 90% output at 10 years, 80% output at 25 years
• I use a simple exponential decay curve to calculate performance in month 0 to month 360 (30 years)
Performance = e^(-0.005322 - 0.008935 * Years)
[Figure: Performance (0.0 to 1.0) vs. Year (0 to 30)]
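The decay curve on this slide can be evaluated month by month with a few lines of Python. This is a sketch using the two coefficients printed on the slide; the function and variable names are hypothetical, and converting months to years by dividing by 12 is an assumption about how the monthly values were obtained.

```python
import math

# Coefficients as printed on the slide's fitted curve.
INTERCEPT = -0.005322
RATE_PER_YEAR = -0.008935


def performance(years):
    """Fraction of original output remaining after the given number of years."""
    return math.exp(INTERCEPT + RATE_PER_YEAR * years)


# Month 0 through month 360 inclusive (30 years), one value per month.
monthly_performance = [performance(month / 12) for month in range(361)]

for years in (10, 25, 30):
    print(f"year {years}: {performance(years):.3f}")
```

At 10 and 25 years the curve gives roughly 0.91 and 0.80, in line with the warranty figures quoted above.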
15. BACKEND
• Python 3 + pandas for core classes and program logic
• R for modeling + rpy2 Python interface to R
• MySQL for storage of electricity consumption and price data, and solar installation cost/size data
• MongoDB for storage and retrieval of geolocated insolation data
• Code on GitHub: https://github.com/langelgjm/canisolar
18. ASSUMPTIONS OF LINEAR REGRESSION
• Homoskedasticity (constant variance of errors)
  • Some evidence of heteroskedasticity
  • Could use robust standard errors for intervals, although the confidence intervals are not much wider
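To illustrate what robust standard errors change, here is a sketch that compares the classical slope standard error with the HC0 (White) heteroskedasticity-robust one for a one-predictor regression. The data are synthetic, with error variance deliberately growing in x, and the function name is hypothetical. It is not the model from the talk, and the talk does not say which robust estimator was considered.

```python
import math
import random


def ols_with_standard_errors(xs, ys):
    """Fit y = a + b*x; return (slope, classical SE, HC0 robust SE) for the slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    # Classical formula: assumes a single common error variance.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)

    # HC0 sandwich estimator: each squared residual is weighted by the
    # squared distance of its x value from the mean of x.
    meat = sum(((x - x_bar) ** 2) * (e ** 2) for x, e in zip(xs, residuals))
    robust_se = math.sqrt(meat) / sxx
    return slope, classical_se, robust_se


# Synthetic heteroskedastic data (assumption): noise grows with x.
rng = random.Random(7)
xs = [rng.uniform(0.0, 10.0) for _ in range(2000)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 0.05 * x ** 2) for x in xs]

slope, classical_se, robust_se = ols_with_standard_errors(xs, ys)
ci_classical = (slope - 1.96 * classical_se, slope + 1.96 * classical_se)
ci_robust = (slope - 1.96 * robust_se, slope + 1.96 * robust_se)
print("classical 95% CI:", ci_classical)
print("robust 95% CI:   ", ci_robust)
```

The point estimate is identical in both cases; only the width of the interval changes, which is why mild heteroskedasticity has little practical effect on the intervals.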
19. ASSUMPTIONS OF LINEAR REGRESSION
• Normality of residuals
  • Evidence of non-normal (heavy tailed) error distribution
  • This assumption only necessary for confidence intervals/p-values, not best linear unbiased estimates
  • Could use robust regression with t-distribution
20. ASSUMPTIONS OF LINEAR REGRESSION
• True linear relationship
  • True with simple regression of cost ~ size
• No significant multicollinearity
  • Variance inflation factors relatively low
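For reference, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With only two numeric predictors this reduces to 1 / (1 - r^2), with r their Pearson correlation. The sketch below uses that two-predictor special case on synthetic data; the variable names are hypothetical, and the talk's actual model includes factor predictors (state, year), for which a generalized VIF would be needed.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


rng = random.Random(3)
x1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Case 1 (assumption): a second predictor unrelated to the first.
x2_independent = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# Case 2 (assumption): a second predictor that mostly duplicates the first.
x2_correlated = [x + rng.gauss(0.0, 0.5) for x in x1]

vif_low = vif_two_predictors(x1, x2_independent)
vif_high = vif_two_predictors(x1, x2_correlated)
print(f"independent predictors: VIF = {vif_low:.2f}")
print(f"correlated predictors:  VIF = {vif_high:.2f}")
```

A VIF close to 1 means the predictor's coefficient variance is barely inflated by its overlap with the other predictors.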
21. TIME SERIES MODELING
• No other predictors (time is the only variable)
• Strong a priori reason to believe most states will have an increasing, roughly linear trend in future electricity prices, often with seasonality
22. TIME SERIES MODELING
• States vary significantly from one another in historical prices, trends, and seasonality
• We cannot expect the same model to perform well for all states!
24. LONG-TERM FORECASTING: A SOLUTION
1. Create a handcrafted list of 7 possible models (1 linear, 4 ARIMA, and 2 exponential smoothing)
Model                   Parameters   Seasonal Parameters   Note
Linear                  n/a          n/a
ARIMA                   (1,0,0)      None                  include drift
ARIMA                   (1,1,0)      None                  include drift
ARIMA                   (1,0,0)      (1,0,0)
ARIMA                   (1,0,0)      (1,1,0)
Exponential Smoothing   M            M                     no damping
Exponential Smoothing   A            A                     no damping
25. LONG-TERM FORECASTING: A SOLUTION
2. Train each model on 1/3, 1/2, & 2/3 of historical data; test on the respective remaining proportion of historical data (2 models shown)
26. LONG-TERM FORECASTING: A SOLUTION
3. Select the model with the lowest MSE across all tests
4. Repeat for every U.S. state + DC
5. Sanity check the resulting models
[Figures: forecasts from ARIMA(1,0,0)(1,0,0)[12] with non-zero mean and from ETS(A,A,A), 2000 to 2040; panels labeled NH and MS]
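The selection procedure in steps 1 to 4 can be sketched in plain Python. The three forecasters below (`linear_trend`, `seasonal_naive`, `mean_forecast`) are simple hypothetical stand-ins, not the ARIMA and exponential smoothing models from the talk, which were fitted in R. Averaging the MSE over the three splits is also an assumption about what "lowest MSE across all tests" means. Both example series are synthetic.

```python
import math
import random


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def linear_trend(train, horizon):
    """Extrapolate a least-squares line fitted against the time index."""
    n = len(train)
    t_bar = (n - 1) / 2
    y_bar = sum(train) / n
    stt = sum((t - t_bar) ** 2 for t in range(n))
    slope = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(train)) / stt
    intercept = y_bar - slope * t_bar
    return [intercept + slope * (n + h) for h in range(horizon)]


def seasonal_naive(train, horizon, period=12):
    """Repeat the last observed year of monthly values."""
    last_year = train[-period:]
    return [last_year[h % period] for h in range(horizon)]


def mean_forecast(train, horizon):
    """Forecast the historical mean for every future month."""
    return [sum(train) / len(train)] * horizon


CANDIDATES = {
    "linear_trend": linear_trend,
    "seasonal_naive": seasonal_naive,
    "mean_forecast": mean_forecast,
}
TRAIN_FRACTIONS = (1 / 3, 1 / 2, 2 / 3)


def select_model(series, candidates=CANDIDATES, fractions=TRAIN_FRACTIONS):
    """Score each candidate on every train/test split; return the best name and all scores."""
    scores = {}
    for name, forecast in candidates.items():
        split_errors = []
        for fraction in fractions:
            cut = round(len(series) * fraction)
            train, test = series[:cut], series[cut:]
            split_errors.append(mse(test, forecast(train, len(test))))
        scores[name] = sum(split_errors) / len(split_errors)
    return min(scores, key=scores.get), scores


# Two synthetic "states" (assumption), each with 15 years of monthly prices.
rng = random.Random(11)
trending = [8.0 + 0.02 * t + rng.gauss(0.0, 0.1) for t in range(180)]
seasonal = [10.0 + 2.0 * math.sin(2 * math.pi * t / 12) + rng.gauss(0.0, 0.1)
            for t in range(180)]

best_trending, scores_trending = select_model(trending)
best_seasonal, scores_seasonal = select_model(seasonal)
print("trending series ->", best_trending)
print("seasonal series ->", best_seasonal)
```

Running `select_model` once per state reproduces the "repeat for every U.S. state + DC" step; the final sanity check remains a manual look at each chosen forecast.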