Powering a Virtual Power Station with Big Data
Michael Bironneau
April 2016
[Chart: installed capacity (GW) vs generation (GW)]
[Chart: total power (MW) over one day, 0:00-22:30; average upwards flex 120%, average downwards flex 35%]
Open Energi in the coming year:
• 25-40k messages processed per second
• Total size of data 500TB-800TB
Perspective: here’s what “big data” means to Boeing [1]:
• ~64k messages per second from each aircraft
• Total size of data over 100 petabytes
[1]: http://bit.ly/18kQlMn
Our data is not huge at the moment…
[Chart: size of data (PB), Open Energi vs Boeing]
…but after domestic demand-side response (or something else on that scale)
[Chart: projected size of data (PB), Open Energi vs Boeing]
Why Hortonworks Data Platform
• Can scale quickly to respond to market demands
• Interoperability with existing code
• Fantastic data integration
• Knowledgeable technical support
• Security and data governance
Batch | Our HDP setup
[Diagram: asset data, national electricity data, market data, and other “live” timeseries data are ingested via Flume, then flow through batch and streaming paths into Hive and other applications]
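The Flume ingestion leg above could be wired with a minimal agent definition along these lines; the agent name, port, and HDFS path are illustrative assumptions, not Open Energi’s actual configuration:

```properties
# Hypothetical Flume 1.x agent: Avro source -> memory channel -> HDFS sink.
agent1.sources = assetData
agent1.channels = mem1
agent1.sinks = hdfsSink

# "Live" timeseries events arrive as Avro over the network.
agent1.sources.assetData.type = avro
agent1.sources.assetData.bind = 0.0.0.0
agent1.sources.assetData.port = 4141
agent1.sources.assetData.channels = mem1

agent1.channels.mem1.type = memory
agent1.channels.mem1.capacity = 100000

# Land raw events in date-partitioned HDFS directories for Hive to read.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = mem1
agent1.sinks.hdfsSink.hdfs.path = /data/raw/asset/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
```

Several such agents, one per source system, can fan into the same HDFS landing zone, which Hive then reads via external tables.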
Real-time | (Work ongoing)
[Diagram: asset data feeds ML models backed by HDFS, a cache, and Elasticsearch; processing steps include updating ML models, correlating events, and enriching data]
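The “Correlate Events” step above can be sketched as a windowed join between two time-ordered event streams; the event shapes and the 60-second window here are assumptions for illustration, not the production schema:

```python
from datetime import datetime, timedelta

# Assumed event shape: (timestamp, payload), both streams sorted by time.
# The 60-second tolerance is an illustrative choice, not a production value.
WINDOW = timedelta(seconds=60)


def correlate(asset_events, market_events, window=WINDOW):
    """Pair each asset event with market events within +/- window.

    Both inputs must be sorted by timestamp; runs in O(n + m) by
    sliding a lower bound through the market stream.
    """
    paired = []
    lo = 0
    for ts, payload in asset_events:
        # Advance past market events that fell before the window opens.
        while lo < len(market_events) and market_events[lo][0] < ts - window:
            lo += 1
        matches = []
        i = lo
        while i < len(market_events) and market_events[i][0] <= ts + window:
            matches.append(market_events[i][1])
            i += 1
        paired.append((ts, payload, matches))
    return paired
```

For example, an asset switching event at 12:00:00 would pick up a grid-frequency event from 11:59:30 but ignore one five minutes later.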
Apache Hive | Example

Index semi-structured data (Elasticsearch), use Hive to integrate it with timeseries data and other metadata, and farm out complex analytics to Python:

CREATE EXTERNAL TABLE semi_structured_stuff (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'semi/structured',
              'es.index.auto.create' = 'false');

SELECT something FROM semi_structured_stuff
JOIN metadata m ON …
LEFT JOIN timeseries t ON …

SELECT TRANSFORM(something)
USING 'insane_maths.py'
AS (result)
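A Hive TRANSFORM script such as the hypothetical insane_maths.py above is just a process that reads tab-separated rows on stdin and writes result rows to stdout. A minimal sketch follows; the maths applied is a stand-in, not the deck’s actual analytics:

```python
"""Sketch of a Hive TRANSFORM streaming script.

Hive serialises each selected row as a tab-separated line on stdin and
reads tab-separated result rows back from stdout. The transform below
(a signed log) is illustrative only.
"""
import math
import sys


def transform_value(raw):
    """Apply the 'complex analytics' step to one input field."""
    x = float(raw)
    # Signed log transform: easy to verify, symmetric about zero.
    return math.copysign(math.log1p(abs(x)), x)


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        stdout.write("%.6f\n" % transform_value(fields[0]))


if __name__ == "__main__":
    main()
```

This pattern is what makes “re-use existing Python code” cheap: any script honouring the stdin/stdout contract can be dropped into a HiveQL query.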
Benefits
• Reduced storage cost compared to SAN + SQL Server
• Better utilisation of infrastructure thanks to YARN
• Pain-free integration of multiple data sources with external tables
in Hive
• Scale up/down on demand
• Re-use existing Python code = low development overhead
Dynamic Demand
[Diagram: the Dynamic Demand “wheel of data”: predict & forecast, optimise & explore, verify; surrounded by alerts, simulations, insights via web, machine learning, statistical analysis, event correlation, an expert system, real-time aggregation, and a real-time web feed]
Thanks for listening. Any questions?


Editor's Notes

  • #5 There is a powerful economic case for distributing demand more efficiently using DSR technology, regardless of the future generation mix. The capital cost of building a new peaking power station can be up to £5 million per megawatt of power, while the current cost to aggregate a megawatt via Dynamic Demand sits at around £200,000. DSR provides a no-build approach to capacity challenges which is cleaner, cheaper, more secure and faster than the alternatives.
  • #6 Open Energi is turning the energy system on its head, so that instead of supply adjusting to meet demand, demand adjusts to meet supply. By harnessing small amounts of flexible energy demand from energy-intensive equipment we can create a virtual power station and displace fossil-fuelled peaking power stations. This is enabling a user-led transformation in how our energy system works, so that businesses and consumers are not only making it happen but also seeing the benefits. It is a vital part of our transition to a zero-carbon economy, because we cannot maximise our use of renewables unless our demand for energy becomes more responsive.
  • #7 Dynamic Demand can deliver approx. £85,000 per MW/yr; FCDM / static FFR £22,000-£26,000 per MW/yr; STOR £10,000-£15,000 per MW/yr.
  • #8 We capture data at the finest-grain level, stored as change-of-value (COV) records. The challenge is then aggregating multiple timeseries without downsampling; we also need to downsample all these series to multiple resolutions. They are all irregularly sampled, which is what prevents us from using off-the-shelf timeseries databases.
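One way to aggregate irregular COV series exactly, as the note describes, is to treat each series as a step function and re-evaluate the sum at the union of all change points. A minimal stdlib sketch, with assumed (timestamp, value) data shapes:

```python
import bisect

# Each COV series is a time-sorted list of (timestamp, value) change
# points; between change points the value holds (a step function).
# Summing at the union of change points is exact: no downsampling.


def value_at(series, ts):
    """Value of a COV step series at time ts (None before first point)."""
    i = bisect.bisect_right([t for t, _ in series], ts) - 1
    return series[i][1] if i >= 0 else None


def aggregate_cov(series_list):
    """Exact sum of several COV series, itself a COV series."""
    change_points = sorted({t for s in series_list for t, _ in s})
    out = []
    for ts in change_points:
        vals = (value_at(s, ts) for s in series_list)
        out.append((ts, sum(v for v in vals if v is not None)))
    return out
```

For instance, an asset drawing 10 kW from t=0 joined by one drawing 3 kW from t=2 yields the change points (0, 10), (2, 13), and so on, with no resampling error. A production version would index the change points once rather than rebuilding the key list per lookup.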
  • #13 Confidence that our data platform can scale quickly if needed: the markets we operate in are unpredictable, and when the domestic market takes off our data could increase by two orders of magnitude. Fantastic data integration support: we can easily wrap our existing codebase, and reduce our £/GB by 80% for archival data while retaining the ability to query it. Extensibility: new tools are being added to the ecosystem on a regular basis, and more developers trained in the Hadoop ecosystem means easier on-boarding. Knowledgeable support from Hortonworks. Security and governance built into the platform.
  • #15 This is ongoing work; in particular we haven’t quite figured out the “asset data” → Storm part of the pipeline.
  • #17 Not limited by storage cost: we are able to enrich data to reduce the cost of processing. Better utilisation of infrastructure compared to VMs dedicated to a single service: YARN means we can really get the most out of everything. The ability to mix Python with SQL makes aggregation and downsampling easier and more maintainable. Interactive querying of multiple data sources with Spark in Jupyter. Easy ingestion process using multiple Flume agents. Can still use Elasticsearch for small timeseries.
  • #18 Now let’s have a look at where HDP fits into our big “wheel of data”.
  • #20 Not limited by storage cost: we are able to enrich data to reduce the cost of processing. The ability to mix Python with SQL makes aggregation and downsampling easier and more maintainable. Interactive querying of multiple data sources with Spark in Jupyter. Easy ingestion process using multiple Flume agents. Can still use Elasticsearch for small timeseries.