SlideShare a Scribd company logo
www.twosigma.com
Huohua 火花
Distributed Time Series Analysis
Framework For Spark
June 10, 2016
Wenbo Zhao
Spark Summit 2016
About Me
June 10, 2016
w Software Engineer @
w Focus on analytics related tools, libraries and Systems
$0.0
$500.0
$1,000.0
$1,500.0
$2,000.0
$2,500.0
1/3/1950
1/3/1953
1/3/1956
1/3/1959
1/3/1962
1/3/1965
1/3/1968
1/3/1971
1/3/1974
1/3/1977
1/3/1980
1/3/1983
1/3/1986
1/3/1989
1/3/1992
1/3/1995
1/3/1998
1/3/2001
1/3/2004
1/3/2007
1/3/2010
1/3/2013
1/3/2016
S&P 500
We view everything as a time series
June 10, 2016
w Stock market prices
w Temperatures
w Sensor logs
w Presidential polls
w …
50°F
55°F
60°F
65°F
70°F
75°F
80°F
85°F
90°F
95°F
100°F
New York
San Francisco
What is a time series?
June 10, 2016
w A sequence of observations obtained in successive time order
w Our goal is to forecast future values given past observations
$8.90
$8.95
$8.90
$9.06
$9.10
8:00 11:00 14:00 17:00 20:00
corn price
?
Multivariate time series
June 10, 2016
w We can forecast better by joining multiple time series
w Temporal join is a fundamental operation for time series analysis
w Huohua enables fast distributed temporal join of large scale unaligned time series
$8.90
$8.95
$8.90
$9.06 $9.10
8:00 11:00 14:00 17:00 20:00
corn price
75°F
72°F
71°F
72°F
68°F
67°F
65°F
temperature
What is temporal join?
June 10, 2016
w A particular join function defined by a matching criteria over time
w Examples of criteria
w look-backward– find the most recent observation in the past
w look-forward – find the closest observation in the future
time series 1 time series 2
look-forward
time series 1 time series 2
look-backward
observation
Temporaljoin with look-backwardcriteria
June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
time weather corn price
08:00 AM
10:00 AM
12:00 AM
Temporaljoin with look-backwardcriteria
June 10, 2016
time weather
08:00 AM
10:00 AM
12:00 AM
time corn price
08:00AM
11:00AM
time weather corn price
08:00 AM 60 °F
10:00 AM
12:00 AM
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
Temporaljoin with look-backwardcriteria
June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
time weather corn price
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM
Temporaljoin with look-backwardcriteria
June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
time weather corn price
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
time corn price
08:00AM
11:00AM
time corn price
08:00AM
11:00AM
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
Temporaljoin with look-backwardcriteria
June 10, 2016
time weather
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
time corn price
08:00AM
11:00AM
time weather corn price
08:00 AM 60 °F
10:00 AM 70 °F
12:00 AM 80 °F
…
…
Hundreds of thousands of data sources
with unaligned timestamps
Thousands of market data sets
We need fast and scalable distributed temporal join
Issues with existing solutions
June 10, 2016
w A single time series may not fit into a single machine
w Forecasting may involve hundreds of time series
w Existing packages don’t support temporal join or can’t handle large time series
w MatLab, R, SAS, Pandas
w Even Spark based solutions fall short
w PairRDDFunctions, DataFrame/Dataset, spark-ts
Huohua – a new time series library for Spark
June 10, 2016
w Goal
w provide a collection of functions to manipulateand analyze time series at scale
w group, temporal join, summarize, aggregate …
w How
w build a time series aware data structure
w extending RDD to TimeSeriesRDD
w optimize using temporal locality
w reduce shuffling
w reduce memory pressure by streaming
What is a TimeSeriesRDD in Huohua?
June 10, 2016
w TimeSeriesRDD extends RDD to represent time series data
w associates a time range to each partition
w tracks partitions’time-rangesthrough operations
w preserves the temporal order
TimeSeriesRDD
operations
time series
functions
TimeSeriesRDD– an RDD representingtime series
June 10, 2016
time temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
RDD
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
TimeSeriesRDD– an RDD representingtime series
June 10, 2016
range:
[06:00AM, 07:00AM)
range:
[07:00AM, 8:00AM)
range:
[8:00AM, ∞)
TimeSeriesRDDtime temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
Group function
June 10, 2016
w A group function groups rows with exactly the same timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM San Francisco 60°F
2:00 PM New York 71°F
2:00 PM San Francisco 61°F
3:00 PM New York 72°F
3:00 PM San Francisco 62°F
4:00 PM New York 73°F
4:00 PM San Francisco 63°F
group 1
group 2
group 3
group 4
Group function
June 10, 2016
w A group function groups rows with nearby timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM San Francisco 60°F
2:00 PM New York 71°F
2:00 PM San Francisco 61°F
3:00 PM New York 72°F
3:00 PM San Francisco 62°F
4:00 PM New York 73°F
4:00 PM San Francisco 63°F
group 1
group 2
Group in Spark
June 10, 2016
w Groups rows with exactly the same timestamps
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
w Data is shuffled and materialized
Group in Spark
June 10, 2016
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
Group in Spark
June 10, 2016
w Data is shuffled and materialized
RDD
groupBy
RDD
1:00PM 1:00PM
3:00PM 3:00PM
2:00PM
4:00PM
2:00PM
4:00PM
Group in Spark
June 10, 2016
w Data is shuffled and materialized
RDD
groupBy
RDD
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
Group in Spark
June 10, 2016
w Temporal order is not preserved
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
Group in Spark
June 10, 2016
w Another sort is required
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
Group in Spark
June 10, 2016
w Another sort is required
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
2:00PM 2:00PM
4:00PM 4:00PM
1:00PM 1:00PM
3:00PM 3:00PM
Group in Spark
June 10, 2016
w Back to correct temporal order
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
Group in Spark
June 10, 2016
w Back to temporal order
RDD
groupBy sortBy
RDD RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
1:00PM 1:00PM
2:00PM 2:00PM
3:00PM 3:00PM
4:00PM 4:00PM
Group in Huohua
June 10, 2016
w Data is grouped locally as streams
TimeSeriesRDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
Group in Huohua
June 10, 2016
w Data is grouped locally as streams
TimeSeriesRDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
Group in Huohua
June 10, 2016
w Data is grouped locally as streams
TimeSeriesRDD
1:00PM
2:00PM
1:00PM
3:00PM 3:00PM
4:00PM
4:00PM
2:00PM
Group in Huohua
June 10, 2016
w Data is grouped locally as streams
TimeSeriesRDD
1:00PM
2:00PM
1:00PM
3:00PM 3:00PM
4:00PM 4:00PM
2:00PM
Benchmark for group
June 10, 2016
w Running time of count after group
w 16 executors (10G memory and 4 cores per executor)
w data is read from HDFS
0s
20s
40s
60s
80s
100s
20M 40M 60M 80M 100M
RDD DataFrame TimeseriesRDD
Temporaljoin
June 10, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction – whether it should look-backward or look-forward
w window - how much it should look-backward or look-forward
look-backward temporal join
window
Temporaljoin
June 10, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction – whether it should look-backward or look-forward
w window - how much it should look-backward or look-forward
look-backward temporal join
window
Temporaljoin
June 10, 2016
w Temporal join with criteria look-back and window of length 1
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
time series time series
Temporaljoin
June 10, 2016
w Temporal join with criteria look-back and window of length 1
w How do we do temporal join in TimeSeriesRDD?
TimeSeriesRDD TimeSeriesRDD
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
Temporaljoin in Huohua
June 10, 2016
w Temporal join with criteria look-back and window of length 1
w partition time space into disjoint intervals
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
Temporaljoin in Huohua
June 10, 2016
w Temporal join with criteria look-back and window of length 1
w Build dependencygraph for the joinedTimeSeriesRDD
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
Temporaljoin in Huohua
June 10, 2016
w Temporal join with criteria look-back and window 1
w Join data as streams per partition
1:00AM 1
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM 1:00AM1:00AM
2:00AM
4:00AM
5:00AM
3:00AM
5:00AM
Temporaljoin in Huohua
June 10, 2016
w Temporal join with criteria look-back and window 1
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM 1:00AM1:00AM
2:00AM
Temporaljoin in Huohua
June 10, 2016
w Temporal join with criteria look-back and window 1
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM
3:00AM
Temporaljoin in Huohua
June 10, 2016
w Temporal join with criteria look-back and window 1
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM 3:00AM
5:00AM 5:00AM
Benchmark for temporaljoin
June 10, 2016
w Running time of count after temporal join
w 16 executors (10G memory and 4 cores per executor)
w data is read from HDFS
0s
20s
40s
60s
80s
100s
20M 40M 60M 80M 100M
RDD DataFrame TimeseriesRDD
Functions over TimeSeriesRDD
June 10, 2016
w group functions such as window, intervalization etc.
w temporal joins such as look-forward, look-backwardetc.
w summarizers such as average, variance, z-score etc. over
w windows
w Intervals
w cycles
Open Source
June 10, 2016
w Not quite yet …
w https://github.com/twosigma
Future work
June 10, 2016
w Dataframe / Dataset integration
w Speed up
w RicherAPIs
w Python bindings
w More summarizers
Key contributors
June 10, 2016
w Christopher Aycock
w Jonathan Coveney
w Jin Li
w David Medina
w David Palaitis
w Ris Sawyer
w Leif Walsh
w Wenbo Zhao
Thank you
June 10, 2016
w QA

More Related Content

Viewers also liked

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
Jen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
Jen Aman
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by Spark
Spark Summit
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Spark Summit
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Spark Summit
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Spark Summit
 
Forecasting Hierarchical Time Series
Forecasting Hierarchical Time SeriesForecasting Hierarchical Time Series
Forecasting Hierarchical Time Series
Rob Hyndman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
gethue
 
Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)
Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)
Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)
Spark Summit
 
CarFast: Achieving Higher Statement Coverage Faster
CarFast: Achieving Higher Statement Coverage FasterCarFast: Achieving Higher Statement Coverage Faster
CarFast: Achieving Higher Statement Coverage FasterSangmin Park
 
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Sangmin Park
 
Falcon: Fault Localization in Concurrent Programs
Falcon: Fault Localization in Concurrent ProgramsFalcon: Fault Localization in Concurrent Programs
Falcon: Fault Localization in Concurrent ProgramsSangmin Park
 
Effective Fault-Localization Techniques for Concurrent Software
Effective Fault-Localization Techniques for Concurrent SoftwareEffective Fault-Localization Techniques for Concurrent Software
Effective Fault-Localization Techniques for Concurrent Software
Sangmin Park
 

Viewers also liked (20)

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by Spark
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Forecasting Hierarchical Time Series
Forecasting Hierarchical Time SeriesForecasting Hierarchical Time Series
Forecasting Hierarchical Time Series
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
 
Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)
Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)
Perspectives on Big Data & Analytics-(Doug Wolfe, CIA)
 
CarFast: Achieving Higher Statement Coverage Faster
CarFast: Achieving Higher Statement Coverage FasterCarFast: Achieving Higher Statement Coverage Faster
CarFast: Achieving Higher Statement Coverage Faster
 
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
 
Falcon: Fault Localization in Concurrent Programs
Falcon: Fault Localization in Concurrent ProgramsFalcon: Fault Localization in Concurrent Programs
Falcon: Fault Localization in Concurrent Programs
 
Effective Fault-Localization Techniques for Concurrent Software
Effective Fault-Localization Techniques for Concurrent SoftwareEffective Fault-Localization Techniques for Concurrent Software
Effective Fault-Localization Techniques for Concurrent Software
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Jen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
Jen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
Jen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
Jen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
Jen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
Jen Aman
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
Jen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
Jen Aman
 

More from Jen Aman (20)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 

Recently uploaded

Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 

Recently uploaded (20)

Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 

Huohua: A Distributed Time Series Analysis Framework For Spark

  • 1. www.twosigma.com Huohua 火花 Distributed Time Series Analysis Framework For Spark June 10, 2016 Wenbo Zhao Spark Summit 2016
  • 2. About Me June 10, 2016 w Software Engineer @ w Focus on analytics related tools, libraries and Systems
  • 3. $0.0 $500.0 $1,000.0 $1,500.0 $2,000.0 $2,500.0 1/3/1950 1/3/1953 1/3/1956 1/3/1959 1/3/1962 1/3/1965 1/3/1968 1/3/1971 1/3/1974 1/3/1977 1/3/1980 1/3/1983 1/3/1986 1/3/1989 1/3/1992 1/3/1995 1/3/1998 1/3/2001 1/3/2004 1/3/2007 1/3/2010 1/3/2013 1/3/2016 S&P 500 We view everything as a time series June 10, 2016 w Stock market prices w Temperatures w Sensor logs w Presidential polls w … 50°F 55°F 60°F 65°F 70°F 75°F 80°F 85°F 90°F 95°F 100°F New York San Francisco
  • 4. What is a time series? June 10, 2016 w A sequence of observations obtained in successive time order w Our goal is to forecast future values given past observations $8.90 $8.95 $8.90 $9.06 $9.10 8:00 11:00 14:00 17:00 20:00 corn price ?
  • 5. Multivariate time series June 10, 2016 w We can forecast better by joining multiple time series w Temporal join is a fundamental operation for time series analysis w Huohua enables fast distributed temporal join of large scale unaligned time series $8.90 $8.95 $8.90 $9.06 $9.10 8:00 11:00 14:00 17:00 20:00 corn price 75°F 72°F 71°F 72°F 68°F 67°F 65°F temperature
  • 6. What is temporal join? June 10, 2016 w A particular join function defined by a matching criteria over time w Examples of criteria w look-backward– find the most recent observation in the past w look-forward – find the closest observation in the future time series 1 time series 2 look-forward time series 1 time series 2 look-backward observation
  • 7. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 10:00 AM 12:00 AM
  • 8. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 10:00 AM 12:00 AM time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 12:00 AM time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F
  • 9. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM
  • 10. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F
  • 11. time corn price 08:00AM 11:00AM time corn price 08:00AM 11:00AM time corn price 08:00AM 11:00AM time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F … … Hundreds of thousands of data sources with unaligned timestamps Thousands of market data sets We need fast and scalable distributed temporal join
  • 12. Issues with existing solutions June 10, 2016 w A single time series may not fit into a single machine w Forecasting may involve hundreds of time series w Existing packages don’t support temporal join or can’t handle large time series w MatLab, R, SAS, Pandas w Even Spark based solutions fall short w PairRDDFunctions, DataFrame/Dataset, spark-ts
  • 13. Huohua – a new time series library for Spark June 10, 2016 w Goal w provide a collection of functions to manipulateand analyze time series at scale w group, temporal join, summarize, aggregate … w How w build a time series aware data structure w extending RDD to TimeSeriesRDD w optimize using temporal locality w reduce shuffling w reduce memory pressure by streaming
  • 14. What is a TimeSeriesRDD in Huohua? June 10, 2016 w TimeSeriesRDD extends RDD to represent time series data w associates a time range to each partition w tracks partitions’time-rangesthrough operations w preserves the temporal order TimeSeriesRDD operations time series functions
  • 15. TimeSeriesRDD– an RDD representingtime series June 10, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … (6:00 AM, 60°F) (6:01 AM, 61°F) … RDD (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) …
  • 16. TimeSeriesRDD– an RDD representingtime series June 10, 2016 range: [06:00AM, 07:00AM) range: [07:00AM, 8:00AM) range: [8:00AM, ∞) TimeSeriesRDDtime temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) …
  • 17. Group function June 10, 2016 w A group function groups rows with exactly the same timestamps time city temperature 1:00 PM New York 70°F 1:00 PM San Francisco 60°F 2:00 PM New York 71°F 2:00 PM San Francisco 61°F 3:00 PM New York 72°F 3:00 PM San Francisco 62°F 4:00 PM New York 73°F 4:00 PM San Francisco 63°F group 1 group 2 group 3 group 4
  • 18. Group function June 10, 2016 w A group function groups rows with nearby timestamps time city temperature 1:00 PM New York 70°F 1:00 PM San Francisco 60°F 2:00 PM New York 71°F 2:00 PM San Francisco 61°F 3:00 PM New York 72°F 3:00 PM San Francisco 62°F 4:00 PM New York 73°F 4:00 PM San Francisco 63°F group 1 group 2
  • 19. Group in Spark June 10, 2016 w Groups rows with exactly the same timestamps RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 20. w Data is shuffled and materialized Group in Spark June 10, 2016 RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 21. Group in Spark June 10, 2016 w Data is shuffled and materialized RDD groupBy RDD 1:00PM 1:00PM 3:00PM 3:00PM 2:00PM 4:00PM 2:00PM 4:00PM
  • 22. Group in Spark June 10, 2016 w Data is shuffled and materialized RDD groupBy RDD 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 23. Group in Spark June 10, 2016 w Temporal order is not preserved RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 24. Group in Spark June 10, 2016 w Another sort is required RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 25. Group in Spark June 10, 2016 w Another sort is required RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 2:00PM 2:00PM 4:00PM 4:00PM 1:00PM 1:00PM 3:00PM 3:00PM
  • 26. Group in Spark June 10, 2016 w Back to correct temporal order RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 27. Group in Spark June 10, 2016 w Back to temporal order RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 28. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 29. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  • 30. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 2:00PM
  • 31. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 2:00PM
  • 32. Benchmark for group June 10, 2016 w Running time of count after group w 16 executors (10G memory and 4 cores per executor) w data is read from HDFS 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M RDD DataFrame TimeseriesRDD
  • 33. Temporaljoin June 10, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction – whether it should look-backward or look-forward w window - how much it should look-backward or look-forward look-backward temporal join window
  • 34. Temporaljoin June 10, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction – whether it should look-backward or look-forward w window - how much it should look-backward or look-forward look-backward temporal join window
  • 35. Temporaljoin June 10, 2016 w Temporal join with criteria look-back and window of length 1 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM time series time series
  • 36. Temporaljoin June 10, 2016 w Temporal join with criteria look-back and window of length 1 w How do we do temporal join in TimeSeriesRDD? TimeSeriesRDD TimeSeriesRDD 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM
  • 37. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window of length 1 w partition time space into disjoint intervals TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM
  • 38. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window of length 1 w Build dependencygraph for the joinedTimeSeriesRDD TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM
  • 39. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams per partition 1:00AM 1 TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM 4:00AM 5:00AM 3:00AM 5:00AM
  • 40. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM
  • 41. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM
  • 42. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 5:00AM 5:00AM
  • 43. Benchmark for temporaljoin June 10, 2016 w Running time of count after temporal join w 16 executors (10G memory and 4 cores per executor) w data is read from HDFS 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M RDD DataFrame TimeseriesRDD
  • 44. Functions over TimeSeriesRDD June 10, 2016 w group functions such as window, intervalization etc. w temporal joins such as look-forward, look-backwardetc. w summarizers such as average, variance, z-score etc. over w windows w Intervals w cycles
  • 45. Open Source June 10, 2016 w Not quite yet … w https://github.com/twosigma
  • 46. Future work June 10, 2016 w Dataframe / Dataset integration w Speed up w RicherAPIs w Python bindings w More summarizers
  • 47. Key contributors June 10, 2016 w Christopher Aycock w Jonathan Coveney w Jin Li w David Medina w David Palaitis w Ris Sawyer w Leif Walsh w Wenbo Zhao
  • 48. Thank you June 10, 2016 w QA