
Spark - Migration Story


Spark is a fast and general engine for large-scale data processing that can solve all of your problems.
… Or can it?

This talk covers real-world issues encountered while migrating an existing product to Spark infrastructure.
It is aimed at software engineers who have just started evaluating Spark, as well as those already using it.

Published in: Engineering

  1. Spark: Migration Story
  2. About me: Roman Chukh
     • 11+ years of experience
     • Java / PHP / Ruby / etc.
     • ~1 year with Apache Spark
     • Interested in: data storage / data flow, monitoring, provisioning tools
  3. Agenda
     • Why Spark?
     • Our migration to Spark
     • Issues… and solutions, or workarounds, or at least the lessons learnt
  4. Why Spark?
  5. “[Spark is a] Fast and general-purpose cluster computing platform for large-scale data processing”
  6. Why Spark? API. Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
  7. Why Spark? Active development. Source: https://github.com/apache/spark/pulse/monthly
  8. Why Spark? Community growth. Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
  9. Why Spark? Real-world usage. Source: http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-wendell/6
  10. Why Spark? Real-world usage
      • Largest cluster: 8,000 nodes (Tencent)
      • Largest single job: 1 PB (Alibaba.com, Databricks)
      • Top streaming intake: 1 TB / hour (Janelia.org)
      Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
  11. Migrating to Spark
  12. Migrating to Spark: Before we start. [Diagram: the application's SparkContext talks to a cluster manager, which schedules executors running tasks on worker nodes.]
  13. Migrating to Spark: The product
      • Cloud-based analytics application
      • Won the Big Data Startup Challenge
      • In-house computation engine
  14. Migrating to Spark: Reasons
      • More data
      • More granular data
      • Support various data backends
      • Support machine-learning algorithms
  15. Migrating to Spark: Use cases
      ❏ Supplement the graph database used to store/query big dimensions
      ❏ Supplement the RDBMS for querying high volumes of data
      ❏ Represent the existing computation graph as a flow of Spark-based operations
  16. Migrating to Spark: Star schema. [Diagram: dimensions and metrics feeding process/filter steps in the data-processing flow, producing the result.]
  17. Issues
  18. Issue #1: Low-Level API
  19. Issue #1: Low-Level API. RDD: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”. Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  20. Issue #1: Low-Level API. RDD: Resilient Distributed Dataset
      ❏ Immutable
      ❏ Statically typed: RDD<MyClass>
      ❏ Fault-tolerant: automatically rebuilt on failure
      ❏ Lazily evaluated
  21. Issue #1: Low-Level API. Example workflow: read file line-by-line → get line length → sum lengths → result
  22. Issue #1: Low-Level API. RDD example. lines.txt contains the four lines: some / lines / for / test
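The workflow above (read file, get line lengths, sum them) is the canonical RDD pipeline. A minimal sketch: the PySpark calls in the comments are the standard RDD API (`sc.textFile`, `map`, `reduce`); the executable part mirrors the same lazy transformation chain in plain Python over the lines.txt contents from the slide.

```python
# PySpark equivalent (assuming a SparkContext `sc`):
#   lines   = sc.textFile("lines.txt")
#   lengths = lines.map(lambda line: len(line))      # transformation: lazy
#   total   = lengths.reduce(lambda a, b: a + b)     # action: triggers evaluation
from functools import reduce

lines = iter(["some", "lines", "for", "test"])       # contents of lines.txt
lengths = map(len, lines)                            # lazy, like RDD.map
total = reduce(lambda a, b: a + b, lengths)          # forces evaluation, like an action
print(total)  # 4 + 5 + 3 + 4 = 16
```

Nothing is computed until the final reduction, which is exactly the laziness property listed on the RDD slide.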
  23. Issue #1: Low-Level API. RDD issues
      • Functional transformations (e.g. map/reduce) are not as intuitive
      • Manual memory management
      • High (dev) maintenance cost
  24. Issue #1: Low-Level API. DataFrame overview
      ❏ (Semi-)structured data
      ❏ Columnar storage
      ❏ Graph mutation
      ❏ Code generation
      ❏ "on" by default in 1.5+
      ❏ "always on" in latest master
  25. Issue #1: Low-Level API. DataFrame example. lines.json contains: {"line":"some"} {"line":"lines"} {"line":"for"} {"line":"test"}
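The DataFrame version of the same length count, sketched in plain Python. The `sqlContext.read.json` call in the comment is the Spark 1.x entry point for reading JSON-lines files; the executable part simply parses the slide's lines.json records with the standard library.

```python
# DataFrame equivalent (assuming a Spark 1.x SQLContext `sqlContext`):
#   df = sqlContext.read.json("lines.json")   # schema inferred from the records
#   df.selectExpr("length(line)").show()
import json

raw = ['{"line":"some"}', '{"line":"lines"}', '{"line":"for"}', '{"line":"test"}']
rows = [json.loads(r) for r in raw]            # one record per line, as in lines.json
lengths = [len(row["line"]) for row in rows]
print(sum(lengths))  # 16, same result as the RDD pipeline
```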
  26. Issue #1: Low-Level API. DataFrame vs RDD. Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
  27. Issue #1: Low-Level API. DataFrame graph mutation. Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
  28. Issue #1: Low-Level API. Lessons learnt
      ❏ Be aware of the new features
      ❏ … especially why they were introduced
      ❏ Low-level API != better performance
  29. Issue #2: DataSource Predicates
  30. “The fastest way to process big data is to never read it” Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
  31. Issue #2: DataSource Predicates. Use cases. SQL: SELECT * FROM Table WHERE x > 0. [Diagram: the WHERE x > 0 predicate is pushed down to the RDBMS, and only matching rows enter the Spark flow.]
  32. Issue #2: DataSource Predicates. Use cases. SQL: SELECT * FROM Table WHERE x > 0 AND y < 10. [Diagram: how the two AND-ed predicates are split between the RDBMS and the Spark flow.]
  33. Issue #2: DataSource Predicates. Use cases. SQL: SELECT * FROM Table WHERE x > 0 OR y < 10. [Diagram: how the OR group is handled between the RDBMS and the Spark flow.]
  35. Issue #2: DataSource Predicates. JDBC pushdown … is at a very early stage
      ❏ Only simple predicates: <, <=, >, >=, =
      ❏ Only ‘AND’ predicate groups (no OR support)
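The JDBC limitation above can be pictured as a split: only simple comparison predicates AND-ed together are sent to the source; everything else must stay in Spark as a post-filter. A hypothetical helper (not Spark's actual planner code) sketching that split:

```python
# Operators the slide lists as pushable over JDBC in Spark 1.x.
PUSHABLE_OPS = {"<", "<=", ">", ">=", "="}

def split_predicates(predicates):
    """predicates: list of (column, op, value) tuples combined with AND.

    Returns (pushed, residual): `pushed` becomes the WHERE clause sent to
    the RDBMS; `residual` is evaluated by Spark after the read.
    """
    pushed, residual = [], []
    for col, op, value in predicates:
        (pushed if op in PUSHABLE_OPS else residual).append((col, op, value))
    return pushed, residual

pushed, residual = split_predicates([("x", ">", 0), ("y", "LIKE", "a%")])
print(pushed)    # [('x', '>', 0)]       -> WHERE x > 0 at the source
print(residual)  # [('y', 'LIKE', 'a%')] -> filtered by Spark after the read
```

Note the helper only models AND groups; an OR between the two predicates would force both to stay Spark-side, as the slide describes.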
  36. Issue #2: DataSource Predicates. Apache Parquet … is buggy
      ❏ Parquet < 1.7: PARQUET-136, NPE if all column values are null
      ❏ Parquet 1.7: PARQUET-251, possible incorrect results for String/Decimal/Binary columns
  37. Issue #2: DataSource Predicates. Lessons learnt
      ❏ Know your data format / data storage features
      ❏ … and issues
      ❏ It's hard to check predicate pushdown behavior (SPARK-11390: Pushdown information)
      ❏ Simple aggregation operations are not supported
      ❏ Check out the talk “The Pushdown of Everything”
  38. Issue #3: Spark SQL
  39. Issue #3: Spark (sort of) SQL. Missing functionality
      ❏ Window functions (e.g. row_number): introduced for HiveContext in 1.4, for SQLContext in 1.5
      ❏ Subquery (e.g. NOT EXISTS) support is still missing; some subqueries can be rewritten as a left semi join
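The semi-join rewrite mentioned above, sketched over plain Python lists of dicts. In Spark SQL this corresponds to `SELECT * FROM orders LEFT SEMI JOIN shipped ON orders.id = shipped.order_id`, which keeps exactly the rows an EXISTS subquery would keep; the table and column names here are illustrative, not from the talk.

```python
# Toy tables: orders, and a table of orders that have shipped.
orders = [{"id": 1}, {"id": 2}, {"id": 3}]
shipped = [{"order_id": 2}, {"order_id": 3}]

# Left semi join: keep rows of `orders` that have at least one match in
# `shipped`; columns of `shipped` never appear in the output.
shipped_ids = {s["order_id"] for s in shipped}          # build side
semi_joined = [o for o in orders if o["id"] in shipped_ids]
print(semi_joined)  # [{'id': 2}, {'id': 3}]
```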
  40. Issue #3: Spark (sort of) SQL. Lessons learnt
      ❏ Know your use case
      ❏ Spark SQL is still quite young
      ❏ SQL grammar is incomplete… but actively extended
  41. Issue #4: Round Trips
  42. Issue #4: Round Trips. Background. [Diagram: the data-processing flow calls an internal API to process/filter a dimension; the resulting dimension ids feed the metric filter that produces the result.]
  44. Issue #4: Round Trips. Resolving dimensions. Get the ID for the year 2015: one query (WHERE key = ‘2015’).
  45. Issue #4: Round Trips. Resolving dimensions. Get the IDs of all passed months of the current year: first resolve the dimension id of ‘2015’ (WHERE key = ‘2015’), then query its months (WHERE parent = 2015 AND level = month).
  46. Issue #4: Round Trips. Resolving dimensions. Get the IDs of all passed months of the current year AND their siblings from the previous year: resolve ‘2015’, fetch its months (Jan, Feb, …), then fetch the siblings (WHERE sibling_id = sibling_id - 1): three round trips for one logical request.
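The escalating round trips above can be sketched against a toy in-memory dimension table; the field names and ids are illustrative, not the product's real schema. Each step below is a separate request in the naive flow, where a single self-join-style query to Spark could resolve all of them at once.

```python
# Toy dimension table: years and their months.
dimension = [
    {"id": 10, "key": "2015", "level": "year",  "parent": None},
    {"id": 11, "key": "Jan",  "level": "month", "parent": 10},
    {"id": 12, "key": "Feb",  "level": "month", "parent": 10},
    {"id": 20, "key": "2014", "level": "year",  "parent": None},
    {"id": 21, "key": "Jan",  "level": "month", "parent": 20},
]

# Trip 1: resolve the id of '2015'.
year_id = next(r["id"] for r in dimension if r["key"] == "2015")
# Trip 2: fetch its months.
months = [r for r in dimension if r["parent"] == year_id]
# Trip 3: fetch the siblings of those months in the previous year.
sibling_keys = {m["key"] for m in months}
prev_year_id = next(r["id"] for r in dimension if r["key"] == "2014")
siblings = [r for r in dimension
            if r["parent"] == prev_year_id and r["key"] in sibling_keys]

ids = [m["id"] for m in months] + [s["id"] for s in siblings]
print(sorted(ids))  # [11, 12, 21]
```

Collapsing the three lookups into one request is exactly the "single complex request" Spark is better suited for, per the next slide.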
  47. Issue #4: Round Trips. Lessons learnt
      ❏ Spark is better suited for a single complex request… though not a too complex one yet
      ❏ Invest time in architecture analysis and data flow
      ❏ It might be better to replace the internal API with a more high-level one
  48. Issue #5: Out of Memory
  49. “RAM's cheap, but not that cheap” Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
  50. Issue #5: OOM. Background
      ❏ Receive request
      ❏ Select / filter / process data (on Spark)
      ❏ Collect results
      ❏ … Out Of Memory
  51. Issue #5: OOM. Workaround requirements
      ❏ Same data as before
      ❏ Same external API
  52. Issue #5: OOM. Workaround: before
      ❏ Result holds ~1M objects
      ❏ (Average) object size: 928 bytes
      ❏ Result size: ~880 MB
  53. Issue #5: OOM. Workaround: after
      ❏ Result holds ~1M objects
      ❏ (Average) object size: 272 bytes
      ❏ Result size: ~261 MB
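The before/after numbers above come from trimming per-object overhead. The talk's product is Java-based, but the same effect is easy to demonstrate in Python, where a plain class carries a per-instance `__dict__` while a `__slots__` class does not; the class and field names here are made up for illustration.

```python
import sys

class FatRow:                     # ordinary class: each instance owns a __dict__
    def __init__(self, dim, metric):
        self.dim = dim
        self.metric = metric

class SlimRow:                    # __slots__ removes the per-instance __dict__
    __slots__ = ("dim", "metric")
    def __init__(self, dim, metric):
        self.dim = dim
        self.metric = metric

fat, slim = FatRow(1, 2.0), SlimRow(1, 2.0)
fat_size = sys.getsizeof(fat) + sys.getsizeof(fat.__dict__)  # include dict overhead
slim_size = sys.getsizeof(slim)
print(fat_size, slim_size)        # slim is noticeably smaller per instance
```

Multiplied by ~1M collected result objects, that per-instance saving is what turned the ~880 MB result into ~261 MB.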
  54. Issue #5: OOM. Lessons learnt
      ❏ Invest (more) time in data structures
      ❏ Some Java performance tips: http://java-performance.com/
      ❏ Know your serializer: e.g. Kryo (v2.2.1) prepares an object for deserialization by using the default constructor
  55. Instead of an Epilogue
  56. “The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends” Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
  57. Thanks! Any questions?
  58. Resources
      • http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
      • https://databricks.com/resources/slides
      • https://databricks.com/spark/developer-resources
      • https://github.com/apache/spark/pulse/monthly
      • http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
      • http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
      • http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-wendell/6
      • http://www.slideshare.net/databricks/spark-whats-new-whats-coming
      • http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
      • https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only