Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

© Hitachi, Ltd. 2016. All rights reserved.
Hitachi, Ltd. OSS Solution Center
2016/10/27
Yusuke Furuyama
Yang Xie

Contents
1. Leveraging smart meter data [Sample use case for electric utilities]
2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)
3. Additional evaluation with Spark 2.0
4. Summary

1. Leveraging smart meter data [Sample use case for electric utilities]

1-1 Hitachi’s Social innovation business approach

1-2 Situation of electric utilities and their needs
➢ Liberalization of the retail electric power market
• Electric utilities have to adapt to a competitive free market
• Request from the government to cut the power transmission fee
➢ Need to cut the cost of transmission and distribution equipment
• Transmission and distribution equipment has been replaced periodically
• Decide the timing of replacement by the condition of the equipment
Maintenance team needs: Obtain the load status of each piece of equipment
[Figure: Power plant → Transmission lines → Substation → Distribution lines → Home. Players: Electric power generation companies, Transmission and distribution companies, Electricity retailers. Money flow: Electricity rate, Power transmission fee]

1-3 Situation of electric utilities and their needs (future)
➢ Unstable power supply
• Decreasing nuclear plants as a stable power supplier
• Increasing renewable energy supply
➢ Need for high-level demand response
• Rates by time zone (current demand response)
• Many small renewable energy suppliers
• Near real-time demand response for each distribution system
Planning team needs: Obtain near real-time load status of each piece of equipment
[Figure: Power plant → Transmission lines → Substation → Distribution lines → Home. Players: Electric power generation companies, Transmission and distribution companies, Electricity retailers. Money flow: Electricity rate, Power transmission fee]

1-4 Leveraging big data for electric utilities
Analyze the data from smart meters to grasp the load status of equipment
➢ Meet the needs of electric utilities
Power grid
Equipment | Count
Substation | 1,000
Distribution line | 5 per substation
Switch | 20 per distribution line
Transformer | 500 per distribution line
Smart meter | 4 per transformer (10,000,000 in total)
Data flow: Smart meters (data every 30 min.) → Meter Data Management System → Data Analysis System
Concern about the performance needed for near real-time processing
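The total of 10,000,000 meters follows from the fan-out figures on this slide. A minimal sketch of that arithmetic is shown here; the constant and key names are illustrative, only the numbers come from the slide.

```python
# Fan-out of the power grid as stated on the slide (names are illustrative).
SUBSTATIONS = 1_000
LINES_PER_SUBSTATION = 5
SWITCHES_PER_LINE = 20
TRANSFORMERS_PER_LINE = 500
METERS_PER_TRANSFORMER = 4


def equipment_counts():
    """Multiply the per-level fan-out down the hierarchy."""
    lines = SUBSTATIONS * LINES_PER_SUBSTATION
    switches = lines * SWITCHES_PER_LINE
    transformers = lines * TRANSFORMERS_PER_LINE
    meters = transformers * METERS_PER_TRANSFORMER
    return {
        "substations": SUBSTATIONS,
        "distribution_lines": lines,
        "switches": switches,
        "transformers": transformers,
        "smart_meters": meters,
    }


counts = equipment_counts()
for name, n in counts.items():
    print(f"{name:>18}: {n:,}")
```

The same counts (5,000 lines, 100,000 switches, 2,500,000 transformers) reappear later as the number of groups in the per-equipment aggregation.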

1-5 System Component
Meter Data Management System → (raw data) → Data Analysis System → (processed data) → Planner (equipment / demand response)
Data Analysis System
・Data visualization tool (Pentaho)
・Data processing platform (Hadoop, Spark) [target of this session]

2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)

2-1 Use Case for electric utilities
➢ Points of analysis
Needs | Point of analysis
Find the equipment that needs to be replaced | Find the equipment that has a heavy workload
Estimate the timing for replacement | Check the trend of load status
Select the proper capacity of new equipment | Extract the peak of the load
➢ Needs of electric utilities (recap)
・Analyze the data from smart meters to grasp the load status of equipment
・Near real-time

2-2 Contents of performance evaluation
➢ Point of evaluation for near real-time processing
• Concern about performance for processing 10,000,000 meters
• Data comes from each smart meter every 30 min. (48 readings/day, spec of smart meter)
Check whether MapReduce and Spark can process the data within 30 min.
➢ Items for evaluation
Point of analysis | Aggregate per | Items
Find the equipment that has a heavy workload | Equipment | Distribution system, Substation, Switch, Transformer, Meter
Check the trend of load status | Term | 1 day, 1 month (30 days), 1 year (365 days)
Extract the peak of the load | Time zone | Specific 30 min. of each day, 24h

2-3 Target of performance evaluation
➢ Target of evaluation
Time from the start of the aggregation batch for meter data to the end of the batch
[Figure: Meter Data Management System → Data Processing Platform (MapReduce, Spark 1.6) running the aggregation batch (Hive / Spark SQL), measured from Start to End → Data visualization tool (Pentaho)]

2-4 Target of performance evaluation
➢ System Configuration
・1 Master Node (Virtual Machine)
・4 Slave Nodes (Physical Machines), each with local disks
・Network: 10 Gbps LAN, 10 Gbps SW, 1 Gbps LAN
➢ Spec
Master node
Item | Value
CPU cores | 2
Memory | 8 GB
Capacity of disk | 80 GB
# of disks | 1
Slave nodes
Item | Per slave node | Total
CPU cores | 16 | 64
Memory | 128 GB | 512 GB
Capacity of disk | 900 GB | -
# of disks | 6 | 24
Total capacity of disks | 5.4 TB (5,400 GB) | 21.6 TB (21,600 GB)

2-5 Dataset
➢ Smart meter data / day
・One record per meter (meter 1 ... meter 10,000,000): 10,000,000 records/day
・48 power usage columns per record (every 30 min.: 0:00-0:30, 0:30-1:00, ..., 23:30-0:00), plus meter management info
➢ Data size
Term | # of records | Size (CSV) | Size (ORC File)
365 days (1 year) | 3,650 million | 2.475 TB | 1.325 TB
30 days (1 month) | 300 million | 0.205 TB | 0.158 TB
1 day | 10 million | 0.007 TB | 0.005 TB
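The record counts in the table are simply days × 10,000,000 meters. The sketch below reproduces them and derives two rough figures that the slide does not state directly: the average CSV bytes per record and the ORC-to-CSV size ratio. It assumes 1 TB = 10^12 bytes, which is an assumption of this sketch, and the function names are illustrative.

```python
METERS = 10_000_000          # one record per meter per day
READINGS_PER_RECORD = 48     # one column per 30-minute slot

# term in days -> (CSV size in TB, ORC size in TB), copied from the table
SIZES_TB = {365: (2.475, 1.325), 30: (0.205, 0.158), 1: (0.007, 0.005)}


def records(days):
    """Number of daily meter records for the given term."""
    return days * METERS


def csv_bytes_per_record(days):
    """Average CSV bytes per record, assuming 1 TB = 10**12 bytes."""
    csv_tb, _ = SIZES_TB[days]
    return csv_tb * 1e12 / records(days)


def orc_ratio(days):
    """ORC size as a fraction of the CSV size."""
    csv_tb, orc_tb = SIZES_TB[days]
    return orc_tb / csv_tb


for days in SIZES_TB:
    print(f"{days:>3} days: {records(days):>13,} records, "
          f"~{csv_bytes_per_record(days):.0f} CSV bytes/record, "
          f"ORC/CSV = {orc_ratio(days):.2f}")
```

The 1-day sizes are given to only one significant digit, so the derived values for that row are coarse.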

2-6 Contents of performance evaluation (recap)
➢ Point of evaluation
Same items as in 2-2 (aggregate per equipment, term, and time zone), plus:
+ File type
・Text (CSV)
・ORCFile (column-based)

2-7 Comparison of txt with ORCFile (MapReduce)
• Couldn’t finish processing 365 days of 24h data in 30 min. (1800 sec)
• Performance improvement by ORCFile
Result (24h)
Term | Hive + TXT | Hive + ORCFile
1 day | 99 sec | 62 sec
30 days | 338 sec | 224 sec
365 days | 6962 sec | 3286 sec
Result (0.5h)
Term | Hive + TXT | Hive + ORCFile
1 day | 32 sec | 31 sec
30 days | 138 sec | 69 sec
365 days | 3015 sec | 492 sec

2-8 Comparison of txt with ORCFile (Spark 1.6)
• Could finish processing in 30 min. (1800 sec)
• Performance improvement by ORCFile
Result (24h)
Term | Spark SQL + TXT | Spark SQL + ORCFile
1 day | 47 sec | 28 sec
30 days | 154 sec | 50 sec
365 days | 1408 sec | 993 sec
Result (0.5h)
Term | Spark SQL + TXT | Spark SQL + ORCFile
1 day | 30 sec | 26 sec
30 days | 111 sec | 32 sec
365 days | 1263 sec | 169 sec

2-9 Review of the results
Why the processing was fast with ORCFile
Features of ORC File
・Compression
・Reading specific column
➢ Processing 0.5h data
Uses 1 column only (for example, the 0:00 column of every record)
➢ Processing 24h data
Uses all 48 columns (0:00 + 0:30 + ... + 23:30 = SUM/day for every record)
Results
• Processing big data with ORCFile was more effective than processing small data
• Processing 0.5h data with ORCFile was more effective than processing 24h data
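The difference between the two access patterns can be illustrated with a toy column-oriented table. This is only a sketch of the idea of reading specific columns, not of the ORC format itself: the in-memory layout, the column names ("0000" ... "2330"), and the function names are assumptions made for illustration.

```python
import random

# 48 half-hour slots: "0000", "0030", ..., "2330" (illustrative column names)
SLOTS = [f"{h:02d}{m:02d}" for h in range(24) for m in (0, 30)]


def make_columnar(num_meters, seed=0):
    """Toy column store: one list of readings per 30-minute slot."""
    rng = random.Random(seed)
    return {slot: [rng.randint(0, 9) for _ in range(num_meters)] for slot in SLOTS}


def total_for_slot(table, slot):
    """0.5h query: only one column has to be read."""
    column = table[slot]
    return sum(column), len(column)  # (result, number of cells read)


def total_per_day(table):
    """24h query: all 48 columns are read to build SUM/day per meter."""
    num_meters = len(table[SLOTS[0]])
    sums = [sum(table[slot][i] for slot in SLOTS) for i in range(num_meters)]
    return sums, num_meters * len(SLOTS)  # (result, number of cells read)


table = make_columnar(num_meters=1000)
_, cells_half_hour = total_for_slot(table, "0000")
_, cells_full_day = total_per_day(table)
print(f"cells read: 0.5h = {cells_half_hour:,}, 24h = {cells_full_day:,}")
```

With a row-oriented text file, both queries would have to scan every field of every record, which is consistent with the 0.5h case gaining the most from ORCFile.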

2-10 Comparison of MapReduce with Spark 1.6
・Could finish processing the data from 10,000,000 meters in 30 min. using Spark 1.6!
・Spark’s good performance with “per equipment” processing
Result (30days / 24h)
Aggregate per | Hive | Spark SQL | Reduction
Distribution system | 224 sec | 50 sec | 78% down
Per equipment | 718 sec | 145 sec | 80% down
Result (30days / 0.5h)
Aggregate per | Hive | Spark SQL | Reduction
Distribution system | 69 sec | 32 sec | 54% down
Per equipment | 556 sec | 104 sec | 81% down
Result (365days / 24h), against the 30 min. (1800 sec) limit
Aggregate per | Hive | Spark SQL
Distribution system | 3286 sec | 993 sec
Per equipment | 4131 sec | 1359 sec
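The "% down" figures and the 30-minute check can be recomputed from the processing times on this slide. The sketch below does so; the pairing of each time with Hive or Spark SQL is read from the charts, and the function names are illustrative.

```python
DEADLINE_SEC = 30 * 60  # new meter data arrives every 30 min.

# (Hive sec, Spark SQL sec) as read from the charts on this slide
RESULTS = {
    ("30days/24h", "distribution system"): (224, 50),
    ("30days/24h", "per equipment"): (718, 145),
    ("30days/0.5h", "distribution system"): (69, 32),
    ("30days/0.5h", "per equipment"): (556, 104),
    ("365days/24h", "distribution system"): (3286, 993),
    ("365days/24h", "per equipment"): (4131, 1359),
}


def reduction_pct(hive_sec, spark_sec):
    """Reduction in processing time when moving from Hive to Spark SQL."""
    return round((1 - spark_sec / hive_sec) * 100)


def meets_deadline(seconds):
    """True if the batch finishes before the next 30-minute data arrives."""
    return seconds <= DEADLINE_SEC


for (case, target), (hive, spark) in RESULTS.items():
    print(f"{case:>12} {target:<20} {reduction_pct(hive, spark):>3}% down, "
          f"Hive in time: {meets_deadline(hive)}, "
          f"Spark SQL in time: {meets_deadline(spark)}")
```

For the 365-day case, both Spark SQL runs stay under 1800 sec while both Hive runs exceed it, which is the basis for the first bullet on the slide.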

2-11 Review of the results
➢ Why was the processing per equipment more effective than the processing for the entire distribution system when using Spark?
➢ Per equipment
・Smart meter data is aggregated level by level through JOINs: Smart meter data → Transformer → Switch → Distribution Line → Substation → Distribution System
・One output row per piece of equipment (Trans1, Trans2, Trans3, ..., Switch1, Switch2, ..., Line1, ..., Substa.1, ...), each with the 0:00, 0:30, ..., 23:30 columns
・Number of groups: Transformer 2,500,000; Switch 100,000; Distribution Line 5,000; Substation 1,000
➢ For entire distribution system
・Smart meter data is JOINed and aggregated into the Distribution System as a whole (0:00, 0:30, ..., 23:30)
Reasons
・Less disk I/O than MapReduce
・The data (including re-distributed data) is smaller than the total memory of the cluster
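The two aggregation shapes can be made concrete with a small SQL sketch. The slides do not show the actual Hive / Spark SQL queries or schema, so everything here is an assumption for illustration: the table and column names (`meter_data`, `transformer`, `u0000`) are hypothetical, only one usage column and two hierarchy levels are modelled, and SQLite stands in for Hive and Spark SQL purely so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one usage column (0:00 slot) instead of 48.
    CREATE TABLE meter_data (meter_id INTEGER, transformer_id INTEGER, u0000 INTEGER);
    CREATE TABLE transformer (transformer_id INTEGER, switch_id INTEGER);
    INSERT INTO meter_data VALUES (1, 1, 10), (2, 1, 20), (3, 2, 30), (4, 3, 40);
    INSERT INTO transformer VALUES (1, 1), (2, 1), (3, 2);
""")

# Per equipment, level 1: one output row per transformer.
per_transformer = conn.execute("""
    SELECT transformer_id, SUM(u0000)
    FROM meter_data
    GROUP BY transformer_id
    ORDER BY transformer_id
""").fetchall()

# Per equipment, level 2: JOIN with the topology table, one row per switch.
per_switch = conn.execute("""
    SELECT t.switch_id, SUM(m.u0000)
    FROM meter_data AS m
    JOIN transformer AS t ON m.transformer_id = t.transformer_id
    GROUP BY t.switch_id
    ORDER BY t.switch_id
""").fetchall()

# Entire distribution system: a single aggregated row.
whole_system = conn.execute("SELECT SUM(u0000) FROM meter_data").fetchone()[0]

print(per_transformer, per_switch, whole_system)
conn.close()
```

Each extra level adds another JOIN and GROUP BY whose intermediate result must be re-distributed across the cluster, which is where keeping data in memory rather than on disk would be expected to matter most.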

3. Additional evaluation with Spark 2.0

3-1 Evaluation environment
➢ System Configuration
・1 Master Node (Physical)
・6 Slave Nodes (Physical), each with local disks
・Network: 10 Gbps LAN, 10 Gbps SW
➢ Spec
Master node
Item | Value
CPU cores | 16
Memory | 12 GB
Capacity of disk | 900 GB
# of disks | 8
Slave nodes
Item | Per slave node | Total
CPU cores | 20 | 120
Memory | 384 GB | 2304 GB
Capacity of disk | 1200 GB | -
# of disks | 10 | 60
Total capacity of disks | 12.0 TB (12,000 GB) | 72.0 TB (72,000 GB)

3-2 Comparison of Spark 2.0 with Spark 1.6
・Performance improvement of roughly 20% (including disk I/O)
・More effective with small data
Result (per equipment / 24h)
Term | Spark 1.6 | Spark 2.0
365 days | 322 sec | 316 sec
30 days | 102 sec | 87 sec
1 day | 89 sec | 71 sec

Demo

Demo: Data aggregation and visualization
① Show 29 days data: Aggregated Data (29 days) in the data visualization tool (Pentaho)
② Execute aggregation batch (Spark SQL) on Raw Data (1 day)
③ Show 30 days data: Aggregated Data (30 days)
④ Outlier detected!!

4. Summary

4 Summary
➢ Leveraging data from 10,000,000 smart meters for electric utilities
- Built a data analysis system
- Concern about performance
➢ Evaluated the performance of batch processing
- Spark could process the data from 10,000,000 meters in 30 min. (4 nodes)
➢ Evaluated the performance of Spark 2.0
- Performance improvement of roughly 20% (compared to 1.6)
[Figure: Meter Data Management System → Data Processing Platform (MapReduce, Spark) running the aggregation batch (Hive / Spark SQL) → Data visualization tool (Pentaho)]


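Notice regarding the citation of other companies' product names, trademarks, etc.
➢ Hadoop, Hive, and Spark are registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
➢ Other company names and product names mentioned are trademarks or registered trademarks of their respective companies.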

Appendix. Difficulty with large data to be shuffled
➢ Attempted to aggregate raw (48 columns) meter data per equipment
- Extremely slow (Spark 2.0) or job failed (Spark 1.6)
- Processing: iteration of JOIN and GROUP BY + SUM
- Huge data to be shuffled (spilled out from page cache)
- Heavy load on a local disk (OS disk) by shuffle
➢ Add HDFS disks as disks for shuffle
- Performance improved (365 days)
- Performance degraded (1 day / 30 days)
Adding disks for shuffle (Spark 2.0.0), processing time
Target data | Before adding | After adding
1 day | 76 sec | 85 sec
30 days | 223 sec | 236 sec
365 days | 13281 sec | 3650 sec
・Data for Spark (including temporary data) should be smaller than memory.
・It is better to run a trial job beforehand to estimate this.
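The slides do not say which setting was used to add the shuffle disks. One way this is commonly done is Spark's `spark.local.dir` property, which accepts a comma-separated list of scratch directories; the sketch below only builds such a value. The mount points and the helper name are hypothetical, and on YARN the node manager's local directories take precedence over this property, so the right setting depends on the cluster manager.

```python
def shuffle_dirs_conf(disk_mounts, subdir="spark-local"):
    """Build spark-submit arguments that spread scratch space over several disks.

    disk_mounts: mount points of the disks to use (hypothetical paths).
    """
    dirs = [f"{mount.rstrip('/')}/{subdir}" for mount in disk_mounts]
    return ["--conf", "spark.local.dir=" + ",".join(dirs)]


# Hypothetical mount points of the HDFS data disks on a slave node.
args = shuffle_dirs_conf(["/data1", "/data2", "/data3"])
print(" ".join(args))
```

Sharing the HDFS disks with shuffle traffic trades shuffle bandwidth against input read bandwidth, which may be one reason the 1-day and 30-day runs became slightly slower after the change.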
