Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Huohua: A Distributed Time Series Analysis Framework For Spark

2,591 views

Published on

Spark Summit 2016 Talk by Wenbo Zhao (Two Sigma Investments)

Published in: Data & Analytics
  • Be the first to comment

Huohua: A Distributed Time Series Analysis Framework For Spark

  1. 1. www.twosigma.com Huohua 火花 Distributed Time Series Analysis Framework For Spark June 10, 2016 Wenbo Zhao Spark Summit 2016
  2. 2. About Me June 10, 2016 w Software Engineer @ w Focus on analytics related tools, libraries and Systems
  3. 3. $0.0 $500.0 $1,000.0 $1,500.0 $2,000.0 $2,500.0 1/3/1950 1/3/1953 1/3/1956 1/3/1959 1/3/1962 1/3/1965 1/3/1968 1/3/1971 1/3/1974 1/3/1977 1/3/1980 1/3/1983 1/3/1986 1/3/1989 1/3/1992 1/3/1995 1/3/1998 1/3/2001 1/3/2004 1/3/2007 1/3/2010 1/3/2013 1/3/2016 S&P 500 We view everything as a time series June 10, 2016 w Stock market prices w Temperatures w Sensor logs w Presidential polls w … 50°F 55°F 60°F 65°F 70°F 75°F 80°F 85°F 90°F 95°F 100°F New York San Francisco
  4. 4. What is a time series? June 10, 2016 w A sequence of observations obtained in successive time order w Our goal is to forecast future values given past observations $8.90 $8.95 $8.90 $9.06 $9.10 8:00 11:00 14:00 17:00 20:00 corn price ?
  5. 5. Multivariate time series June 10, 2016 w We can forecast better by joining multiple time series w Temporal join is a fundamental operation for time series analysis w Huohua enables fast distributed temporal join of large scale unaligned time series $8.90 $8.95 $8.90 $9.06 $9.10 8:00 11:00 14:00 17:00 20:00 corn price 75°F 72°F 71°F 72°F 68°F 67°F 65°F temperature
  6. 6. What is temporal join? June 10, 2016 w A particular join function defined by a matching criteria over time w Examples of criteria w look-backward– find the most recent observation in the past w look-forward – find the closest observation in the future time series 1 time series 2 look-forward time series 1 time series 2 look-backward observation
  7. 7. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 10:00 AM 12:00 AM
  8. 8. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 10:00 AM 12:00 AM time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 12:00 AM time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F
  9. 9. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM
  10. 10. Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F
  11. 11. time corn price 08:00AM 11:00AM time corn price 08:00AM 11:00AM time corn price 08:00AM 11:00AM time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F Temporaljoin with look-backwardcriteria June 10, 2016 time weather 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F time corn price 08:00AM 11:00AM time weather corn price 08:00 AM 60 °F 10:00 AM 70 °F 12:00 AM 80 °F … … Hundreds of thousands of data sources with unaligned timestamps Thousands of market data sets We need fast and scalable distributed temporal join
  12. 12. Issues with existing solutions June 10, 2016 w A single time series may not fit into a single machine w Forecasting may involve hundreds of time series w Existing packages don’t support temporal join or can’t handle large time series w MatLab, R, SAS, Pandas w Even Spark based solutions fall short w PairRDDFunctions, DataFrame/Dataset, spark-ts
  13. 13. Huohua – a new time series library for Spark June 10, 2016 w Goal w provide a collection of functions to manipulateand analyze time series at scale w group, temporal join, summarize, aggregate … w How w build a time series aware data structure w extending RDD to TimeSeriesRDD w optimize using temporal locality w reduce shuffling w reduce memory pressure by streaming
  14. 14. What is a TimeSeriesRDD in Huohua? June 10, 2016 w TimeSeriesRDD extends RDD to represent time series data w associates a time range to each partition w tracks partitions’time-rangesthrough operations w preserves the temporal order TimeSeriesRDD operations time series functions
  15. 15. TimeSeriesRDD– an RDD representingtime series June 10, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … (6:00 AM, 60°F) (6:01 AM, 61°F) … RDD (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) …
  16. 16. TimeSeriesRDD– an RDD representingtime series June 10, 2016 range: [06:00AM, 07:00AM) range: [07:00AM, 8:00AM) range: [8:00AM, ∞) TimeSeriesRDDtime temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) …
  17. 17. Group function June 10, 2016 w A group function groups rows with exactly the same timestamps time city temperature 1:00 PM New York 70°F 1:00 PM San Francisco 60°F 2:00 PM New York 71°F 2:00 PM San Francisco 61°F 3:00 PM New York 72°F 3:00 PM San Francisco 62°F 4:00 PM New York 73°F 4:00 PM San Francisco 63°F group 1 group 2 group 3 group 4
  18. 18. Group function June 10, 2016 w A group function groups rows with nearby timestamps time city temperature 1:00 PM New York 70°F 1:00 PM San Francisco 60°F 2:00 PM New York 71°F 2:00 PM San Francisco 61°F 3:00 PM New York 72°F 3:00 PM San Francisco 62°F 4:00 PM New York 73°F 4:00 PM San Francisco 63°F group 1 group 2
  19. 19. Group in Spark June 10, 2016 w Groups rows with exactly the same timestamps RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  20. 20. w Data is shuffled and materialized Group in Spark June 10, 2016 RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  21. 21. Group in Spark June 10, 2016 w Data is shuffled and materialized RDD groupBy RDD 1:00PM 1:00PM 3:00PM 3:00PM 2:00PM 4:00PM 2:00PM 4:00PM
  22. 22. Group in Spark June 10, 2016 w Data is shuffled and materialized RDD groupBy RDD 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  23. 23. Group in Spark June 10, 2016 w Temporal order is not preserved RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  24. 24. Group in Spark June 10, 2016 w Another sort is required RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  25. 25. Group in Spark June 10, 2016 w Another sort is required RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 2:00PM 2:00PM 4:00PM 4:00PM 1:00PM 1:00PM 3:00PM 3:00PM
  26. 26. Group in Spark June 10, 2016 w Back to correct temporal order RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  27. 27. Group in Spark June 10, 2016 w Back to temporal order RDD groupBy sortBy RDD RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  28. 28. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  29. 29. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM
  30. 30. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 2:00PM
  31. 31. Group in Huohua June 10, 2016 w Data is grouped locally as streams TimeSeriesRDD 1:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 2:00PM
  32. 32. Benchmark for group June 10, 2016 w Running time of count after group w 16 executors (10G memory and 4 cores per executor) w data is read from HDFS 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M RDD DataFrame TimeseriesRDD
  33. 33. Temporaljoin June 10, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction – whether it should look-backward or look-forward w window - how much it should look-backward or look-forward look-backward temporal join window
  34. 34. Temporaljoin June 10, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction – whether it should look-backward or look-forward w window - how much it should look-backward or look-forward look-backward temporal join window
  35. 35. Temporaljoin June 10, 2016 w Temporal join with criteria look-back and window of length 1 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM time series time series
  36. 36. Temporaljoin June 10, 2016 w Temporal join with criteria look-back and window of length 1 w How do we do temporal join in TimeSeriesRDD? TimeSeriesRDD TimeSeriesRDD 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM
  37. 37. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window of length 1 w partition time space into disjoint intervals TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM
  38. 38. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window of length 1 w Build dependencygraph for the joinedTimeSeriesRDD TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM
  39. 39. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams per partition 1:00AM 1 TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM 4:00AM 5:00AM 3:00AM 5:00AM
  40. 40. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM
  41. 41. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM
  42. 42. Temporaljoin in Huohua June 10, 2016 w Temporal join with criteria look-back and window 1 w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 5:00AM 5:00AM
  43. 43. Benchmark for temporaljoin June 10, 2016 w Running time of count after temporal join w 16 executors (10G memory and 4 cores per executor) w data is read from HDFS 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M RDD DataFrame TimeseriesRDD
  44. 44. Functions over TimeSeriesRDD June 10, 2016 w group functions such as window, intervalization etc. w temporal joins such as look-forward, look-backwardetc. w summarizers such as average, variance, z-score etc. over w windows w Intervals w cycles
  45. 45. Open Source June 10, 2016 w Not quite yet … w https://github.com/twosigma
  46. 46. Future work June 10, 2016 w Dataframe / Dataset integration w Speed up w RicherAPIs w Python bindings w More summarizers
  47. 47. Key contributors June 10, 2016 w Christopher Aycock w Jonathan Coveney w Jin Li w David Medina w David Palaitis w Ris Sawyer w Leif Walsh w Wenbo Zhao
  48. 48. Thank you June 10, 2016 w QA

×