Huohua: A Distributed Time Series Analysis Framework For Spark

www.twosigma.com
Huohua 火花
Distributed Time Series Analysis
Framework For Spark
June 10, 2016
Wenbo Zhao
Spark Summit 2016

About Me
June 10, 2016
w Software Engineer @
w Focus on analytics related tools, libraries and Systems

$0.0
$500.0
$1,000.0
$1,500.0
$2,000.0
$2,500.0
1/3/1950
1/3/1953
1/3/1956
1/3/1959
1/3/1962
1/3/1965
1/3/1968
1/3/1971
1/3/1974
1/3/1977
1/3/1980
1/3/1983
1/3/1986
1/3/1989
1/3/1992
1/3/1995
1/3/1998
1/3/2001
1/3/2004
1/3/2007
1/3/2010
1/3/2013
1/3/2016
S&P 500
We view everything as a time series
June 10, 2016
w Stock market prices
w Temperatures
w Sensor logs
w Presidential polls
w …
50°F
55°F
60°F
65°F
70°F
75°F
80°F
85°F
90°F
95°F
100°F
New York
San Francisco

What is a time series?
June 10, 2016
w A sequence of observations obtained in successive time order
w Our goal is to forecast future values given past observations
$8.90
$8.95
$8.90
$9.06
$9.10
8:00 11:00 14:00 17:00 20:00
corn price
?

Multivariate time series
June 10, 2016
w We can forecast better by joining multiple time series
w Temporal join is a fundamental operation for time series analysis
w Huohua enables fast distributed temporal join of large scale unaligned time series
$8.90
$8.95
$8.90
$9.06 $9.10
8:00 11:00 14:00 17:00 20:00
corn price
75°F
72°F
71°F
72°F
68°F
67°F
65°F
temperature

What is temporal join?
June 10, 2016
w A particular join function defined by a matching criteria over time
w Examples of criteria
w look-backward– find the most recent observation in the past
w look-forward – find the closest observation in the future
time series 1 time series 2
look-forward
time series 1 time series 2
look-backward
observation

Temporaljoin with look-backwardcriteria
June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
time weather corn price
08:00 AM
10:00 AM
12:00 AM

June 10, 2016
time weather
08:00 AM
10:00 AM
12:00 AM
time corn price
08:00AM
11:00AM
08:00 AM 60 °F
10:00 AM
12:00 AM
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F

June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM

June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F

time corn price
08:00AM
11:00AM
time corn price
08:00AM
11:00AM
time corn price
08:00AM
11:00AM
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
…
…
Hundreds of thousands of data sources
with unaligned timestamps
Thousands of market data sets
We need fast and scalable distributed temporal join

Issues with existing solutions
June 10, 2016
w A single time series may not fit into a single machine
w Forecasting may involve hundreds of time series
w Existing packages don’t support temporal join or can’t handle large time series
w MatLab, R, SAS, Pandas
w Even Spark based solutions fall short
w PairRDDFunctions, DataFrame/Dataset, spark-ts

Huohua – a new time series library for Spark
June 10, 2016
w Goal
w provide a collection of functions to manipulateand analyze time series at scale
w group, temporal join, summarize, aggregate …
w How
w build a time series aware data structure
w extending RDD to TimeSeriesRDD
w optimize using temporal locality
w reduce shuffling
w reduce memory pressure by streaming

What is a TimeSeriesRDD in Huohua?
June 10, 2016
w TimeSeriesRDD extends RDD to represent time series data
w associates a time range to each partition
w tracks partitions’time-rangesthrough operations
w preserves the temporal order
TimeSeriesRDD
operations
time series
functions

TimeSeriesRDD– an RDD representingtime series
June 10, 2016
time temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
RDD
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…

TimeSeriesRDD– an RDD representingtime series
June 10, 2016
range:
[06:00AM, 07:00AM)
range:
[07:00AM, 8:00AM)
range:
[8:00AM, ∞)
TimeSeriesRDDtime temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…

Group function
June 10, 2016
w A group function groups rows with exactly the same timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM San Francisco 60°F
group 1
group 2
group 3
group 4

Group function
June 10, 2016
w A group function groups rows with nearby timestamps
time city temperature
group 1
group 2

Group in Spark
June 10, 2016
w Groups rows with exactly the same timestamps
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM

w Data is shuffled and materialized
Group in Spark
June 10, 2016
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM

Group in Spark
June 10, 2016
RDD
groupBy
RDD
1:00PM 1:00PM
3:00PM 3:00PM
2:00PM
4:00PM
2:00PM
4:00PM

Group in Spark
June 10, 2016
RDD
groupBy
RDD
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM

Group in Spark
June 10, 2016
w Temporal order is not preserved
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM

Group in Spark
June 10, 2016
w Another sort is required
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM

Group in Spark
June 10, 2016
w Another sort is required
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
2:00PM 2:00PM
4:00PM 4:00PM
1:00PM 1:00PM
3:00PM 3:00PM

Group in Spark
June 10, 2016
w Back to correct temporal order
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM

Group in Spark
June 10, 2016
w Back to temporal order
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM

Group in Huohua
June 10, 2016
w Data is grouped locally as streams
TimeSeriesRDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM

Group in Huohua
June 10, 2016
TimeSeriesRDD
1:00PM
2:00PM
1:00PM
3:00PM 3:00PM
4:00PM
4:00PM
2:00PM

Group in Huohua
June 10, 2016
TimeSeriesRDD
1:00PM
2:00PM
1:00PM
3:00PM 3:00PM
4:00PM 4:00PM
2:00PM

Benchmark for group
June 10, 2016
w Running time of count after group
w 16 executors (10G memory and 4 cores per executor)
w data is read from HDFS
0s
20s
40s
60s
80s
100s
20M 40M 60M 80M 100M
RDD DataFrame TimeseriesRDD

Temporaljoin
June 10, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction – whether it should look-backward or look-forward
w window - how much it should look-backward or look-forward
look-backward temporal join
window

Temporaljoin
June 10, 2016
w Temporal join with criteria look-back and window of length 1
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
time series time series

Temporaljoin
June 10, 2016
w How do we do temporal join in TimeSeriesRDD?
TimeSeriesRDD TimeSeriesRDD
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM

Temporaljoin in Huohua
June 10, 2016
w partition time space into disjoint intervals
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM

June 10, 2016
w Build dependencygraph for the joinedTimeSeriesRDD
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM

June 10, 2016
w Temporal join with criteria look-back and window 1
w Join data as streams per partition
1:00AM 1
1:00AM 1:00AM1:00AM
2:00AM
4:00AM
5:00AM
3:00AM
5:00AM

June 10, 2016
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
1:00AM 1:00AM1:00AM
2:00AM

June 10, 2016
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM
3:00AM

June 10, 2016
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM 3:00AM
5:00AM 5:00AM

Benchmark for temporaljoin
June 10, 2016
w Running time of count after temporal join
w 16 executors (10G memory and 4 cores per executor)
w data is read from HDFS
0s
20s
40s
60s
80s
100s
20M 40M 60M 80M 100M
RDD DataFrame TimeseriesRDD

Functions over TimeSeriesRDD
June 10, 2016
w group functions such as window, intervalization etc.
w temporal joins such as look-forward, look-backwardetc.
w summarizers such as average, variance, z-score etc. over
w windows
w Intervals
w cycles

Open Source
June 10, 2016
w Not quite yet …
w https://github.com/twosigma

Future work
June 10, 2016
w Dataframe / Dataset integration
w Speed up
w RicherAPIs
w Python bindings
w More summarizers

Key contributors
June 10, 2016
w Christopher Aycock
w Jonathan Coveney
w Jin Li
w David Medina
w David Palaitis
w Ris Sawyer
w Leif Walsh
w Wenbo Zhao

Huohua: A Distributed Time Series Analysis Framework For Spark

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

More from Jen Aman

More from Jen Aman (20)

Recently uploaded

Recently uploaded (20)

Huohua: A Distributed Time Series Analysis Framework For Spark