
Evolution of Apache Spark



Journey of Spark in 1.x series



  1. Evolution of Apache Spark: Journey of Spark in the 1.x series
  2. ● Madhukara Phatak ● Technical Lead at Tellius ● Consultant and Trainer at datamantra.io ● Consults in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  3. Agenda ● Spark 1.0 ● State of big data ● Change in the ecosystem ● Dawn of structured data ● Working with structured sources ● Dawn of custom memory management ● Evolution of libraries
  4. Spark 1.0 ● Released in May 2014 [1] ● First production-ready, backward-compatible release ● Contains ○ Spark batch ○ Spark Streaming ○ Shark ○ MLlib and GraphX ● Developed over 4 years ● Positioned as a better Hadoop
  5. State of the big data industry ● Map/Reduce was the standard way to do big data processing ● HDFS was the primary source of data ● Tools like Sqoop were developed for moving data into HDFS, which acted as the single source of truth ● All data was assumed to be unstructured by default, and structure was laid on top of it ● Hive and Pig were popular ways to do structured and semi-structured data processing on top of Map/Reduce
  6. Spark 1.0 ideas ● The RDD abstraction supported Map/Reduce-style programming (see the sketch after this slide) ● HDFS was the primary supported source, with memory as the speed-up layer ● Spark Streaming was viewed as faster batch processing rather than true streaming ● To support Hive, Shark was created to generate RDD code instead of Map/Reduce
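A minimal sketch (not from the slides) of Map/Reduce-style programming with the RDD API in Spark 1.x. The object name and the "input.txt" path are illustrative placeholders.

```scala
// Word count in the classic map/reduce shape, expressed on RDDs.
import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-word-count").setMaster("local[*]"))

    val counts = sc.textFile("input.txt")      // read lines from HDFS or local disk
      .flatMap(_.split("\\s+"))                // "map" phase: emit words
      .map(word => (word, 1))                  // key/value pairs, as in Map/Reduce
      .reduceByKey(_ + _)                      // "reduce" phase: sum counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```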
  7. Changes since 2014 ● The big data industry has gone through many radical changes in thinking in the last two years ● Some of those changes started in Spark, and others were influenced by other frameworks ● These changes are important for understanding why the Spark 2.0 abstractions are radically different from those in Spark 1.0 ● Many of these were already discussed in earlier meetups; links to the videos are in the references
  8. Dawn of Structured Data
  9. Usage of big data in 2014 ● Most people were using higher-level tools like Hive and Pig to process data rather than Map/Reduce ● Most data resided in RDBMS databases, and users ETL'd data from MySQL into Hive in order to query it ● So many use cases were about analysing structured data, contrary to the big data world's basic assumption of unstructured data ● A huge amount of time was consumed by ETL and non-optimized Hive workflows
  10. Spark with structured data in 1.2 ● Spark recognised the market's need for structured data and started to evolve the platform to support that use case ● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, an RDD that carried a schema (a hedged sketch follows this slide) ● But this approach was not clean ● Also, even though InputFormats existed for reading structured data, there was no direct Spark API to read from them
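A hedged sketch of the idea behind SchemaRDD: an RDD of domain objects plus a schema that SQL can run against. The code below is written in the 1.3+ form (`toDF`, `implicits`); the 1.2-era SchemaRDD API used a slightly different conversion (`createSchemaRDD`), so treat the exact calls as assumptions. The `Person` class and table name are illustrative.

```scala
// An RDD with a schema: register it as a table and query it with SQL.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object SchemaRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schemardd-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // in 1.2 this was createSchemaRDD; shown here in 1.3+ form

    val people = sc.parallelize(Seq(Person("a", 30), Person("b", 25)))
    people.toDF().registerTempTable("people")   // SchemaRDD in 1.2, DataFrame from 1.3 on

    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)
    sc.stop()
  }
}
```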
  11. DataSource API in Spark 1.3 ● The first unified API for reading from structured and semi-structured sources ● Can read from RDBMSes and NoSQL databases like MongoDB, Cassandra etc. ● An advanced API, like InputFormat, that gives the source a lot of control to optimize data locality ● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem (a reader sketch follows this slide) ● For more info, refer to the Anatomy of DataSource API talk [2]
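A hedged sketch of reading through the DataSource API. The `sqlContext.read` reader syntax shown here is the 1.4+ form (Spark 1.3 exposed the same sources through `sqlContext.load`), and the file path, JDBC URL and table name are illustrative placeholders; the JDBC example also assumes a MySQL driver on the classpath.

```scala
// Reading a semi-structured and a structured source through the same unified API.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataSourceRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("datasource-read").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Semi-structured source: JSON files, schema inferred by the source
    val events = sqlContext.read.format("json").load("events.json")

    // Structured source: an RDBMS table over JDBC
    val users = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/appdb")   // placeholder URL
      .option("dbtable", "users")                           // placeholder table
      .load()

    events.printSchema()
    users.printSchema()
    sc.stop()
  }
}
```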
  12. DataFrame abstraction in Spark ● Spark understood that modifying the RDD abstraction was not good enough ● Many frameworks like Hive and Pig had tried, and failed, to map querying efficiently onto Map/Reduce ● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline than that of RDD (see the sketch after this slide) ● For more info, refer to the Anatomy of DataFrame API talk [3]
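A minimal sketch of the DataFrame abstraction (Spark 1.3+): declarative column expressions that the optimizer can analyse, instead of opaque closures. The column names and sample data are made up for illustration.

```scala
// Filter and aggregate with column expressions; the plan goes through the optimized pipeline.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dataframe-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val sales = Seq(("books", 10.0), ("games", 25.0), ("books", 5.0)).toDF("category", "amount")

    sales
      .filter($"amount" > 7.0)    // an expression the optimizer can see, not an opaque lambda
      .groupBy($"category")
      .sum("amount")
      .show()

    sc.stop()
  }
}
```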
  13. Evolution of In-Memory Processing
  14. In-memory in Spark 1.0 ● Spark was the first open source big data framework to embrace in-memory computing ● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than all other Hadoop ecosystem projects ● The first implementation of in-memory computing followed the typical cache approach of keeping serialized Java bytes (a caching sketch follows this slide) ● This proved to be challenging later on
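A minimal sketch of RDD caching in Spark 1.x, where the `StorageLevel` controls whether cached partitions are kept as deserialized objects or serialized bytes on the JVM heap. The log file path and filter strings are illustrative.

```scala
// Cache a filtered RDD as serialized bytes and reuse it across actions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))

    val errors = sc.textFile("access.log")           // placeholder input path
      .filter(_.contains("ERROR"))
      .persist(StorageLevel.MEMORY_ONLY_SER)         // keep serialized bytes on the JVM heap

    // Both actions reuse the cached partitions instead of re-reading the file
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())
    sc.stop()
  }
}
```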
  15. Challenges of in-memory in Java ● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model ● Java memory management is tuned for short-lived objects, and complete control of memory is given to the JVM ● But as big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate ● Also, as the Java heap grew to cache more data, GC pauses started to kill performance
  16. Custom memory management ● Apache Flink was the first big data system to implement custom memory management in Java ● Flink follows a DataFrame-like API with a custom memory model ● The custom, non-GC-based memory model proved to be highly successful ● Observing this trend in the community, Spark adopted the same approach in Spark 1.4
  17. Tungsten in Spark 1.4 ● Spark released the first version of its custom memory management in the 1.4 release ● It was only supported for DataFrames, as they require the custom memory model ● Custom memory management greatly improved running Spark with larger VM sizes and fewer GC pauses ● It solved OOM issues that plagued earlier versions of Spark (a configuration sketch follows this slide) ● For more info, refer to the Anatomy of In-Memory Management in Spark talk [4]
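A hedged sketch of enabling the Tungsten execution path for DataFrame aggregations. The configuration key names are assumptions about the 1.4/1.5-era flags (the switch was renamed between releases and is on by default in later versions), so verify them against the release you run; the job itself is a throwaway example.

```scala
// Run a DataFrame aggregation with the (assumed) Tungsten switch enabled.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TungstenConfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tungsten-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Assumed 1.5-era switch for Tungsten's managed memory and code generation
    sqlContext.setConf("spark.sql.tungsten.enabled", "true")

    import sqlContext.implicits._
    val df = (1 to 100000).map(i => (i, i % 10)).toDF("n", "bucket")
    df.groupBy($"bucket").count().show()   // aggregation runs through the Tungsten path
    sc.stop()
  }
}
```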
  18. DSLs for data processing
  19. RDD and Map/Reduce APIs ● Spark's RDD API follows a functional programming paradigm similar to Map/Reduce ● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization (see the sketch after this slide) ● The Java Map/Reduce API follows the same pattern but is less elegant than the Scala one ● Both are hard to optimise compared to Pig/Hive ● So we saw a steady increase in custom DSLs in the Hadoop world
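A minimal sketch of why opaque closures block optimization: the RDD filter below is a black box to Spark, while the DataFrame filter is an expression the Catalyst optimizer can inspect, reorder and push down. The sample data and column names are made up.

```scala
// Opaque function object vs declarative expression over the same data.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object OpaqueVsDeclarative {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("opaque-vs-declarative").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val pairs = Seq(("a", 1), ("b", 42), ("c", 7))

    // Opaque: Spark only sees a function object, not what it tests
    sc.parallelize(pairs).filter { case (_, v) => v > 10 }.collect()

    // Declarative: Spark sees the predicate "value > 10" and can optimize around it
    val df = pairs.toDF("key", "value")
    df.filter($"value" > 10).explain()   // prints the optimized plan
    sc.stop()
  }
}
```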
  20. Need for DSLs in Hadoop ● DSLs like Pig or Hive are much easier to understand compared to the Java API ● They are less error prone and help you be very specific ● They can be easily optimised, as a DSL only describes what to do, not how to do it ● Since Java Map/Reduce mixes the what with the how, it is hard to optimize compared to Hive and Pig ● So more and more people preferred these DSLs over platform-level APIs
  21. Challenges of DSLs in Hadoop ● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs ● DSLs often lack the flexibility of a complete programming language ● The Hive/Pig DSLs don't share a single abstraction, so you cannot mix them ● DSLs are powerful for optimization but soon become limited in terms of functionality
  22. Scala as a language to host DSLs ● Scala is one of the first languages to embrace DSLs as first-class citizens ● Scala features like implicits, higher-order functions, structural types etc. make it easy to build DSLs and integrate them with the language (a toy DSL sketch follows this slide) ● This allows any Scala library to embed a DSL and harness the full power of the language ● Many libraries outside big data define their own DSLs, e.g. Slick, akka-http, sbt
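A toy sketch (not from the talk) of how implicits and plain methods let ordinary Scala read like a small DSL. The duration example and all names in it are invented purely for illustration.

```scala
// A tiny duration DSL: "2.minutes + 30.seconds" is plain Scala under the hood.
object DurationDsl {
  final case class Duration(millis: Long) {
    def +(other: Duration): Duration = Duration(millis + other.millis)
  }

  // An implicit class adds "5.seconds"-style syntax to ordinary Ints
  implicit class IntDurationOps(private val n: Int) extends AnyVal {
    def seconds: Duration = Duration(n * 1000L)
    def minutes: Duration = Duration(n * 60000L)
  }

  def main(args: Array[String]): Unit = {
    val timeout = 2.minutes + 30.seconds   // reads like a DSL, compiles as method calls
    println(s"timeout = ${timeout.millis} ms")
  }
}
```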
  23. DF DSL and Spark SQL DSL ● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DataFrame and Spark SQL DSL code over raw RDD code ● Whenever we write this DSL, all the features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive ● Other frameworks like Flink and Beam follow the same ideas with Scala, Java 8 etc. ● You can easily mix and match the DSL with the RDD API (see the sketch after this slide)
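A minimal sketch of mixing Spark SQL, the DataFrame DSL and the RDD API over the same data in Spark 1.x. The table name, columns and sample rows are illustrative.

```scala
// SQL -> DataFrame DSL -> RDD, all in one job.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MixDslAndRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mix-dsl-rdd").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val orders = Seq(("o1", "books", 12.0), ("o2", "games", 40.0)).toDF("id", "category", "amount")
    orders.registerTempTable("orders")

    // Spark SQL
    val expensive = sqlContext.sql("SELECT id, amount FROM orders WHERE amount > 20")

    // DataFrame DSL on the SQL result, then drop down to the RDD API
    val labels = expensive
      .select($"id")
      .rdd                                    // back to an RDD[Row]
      .map(row => s"large-order:${row.getString(0)}")

    labels.collect().foreach(println)
    sc.stop()
  }
}
```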
  24. Dataset DSL in Spark 1.6 ● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5 ● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make it an important pillar of Spark ● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land (a sketch follows this slide) ● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured mindset
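A minimal sketch of the Spark 1.6 Dataset DSL: typed, RDD-like operations that still go through the optimized execution pipeline. The `Sale` case class and sample rows are invented for illustration.

```scala
// Typed Dataset operations in the 1.6 API.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Sale(category: String, amount: Double)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val sales = Seq(Sale("books", 12.0), Sale("games", 40.0), Sale("books", 3.0)).toDS()

    // RDD-like functional style, but on a typed, optimizable Dataset
    val total = sales
      .filter(_.amount > 5.0)
      .map(_.amount)
      .reduce(_ + _)

    println(s"total = $total")
    sc.stop()
  }
}
```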
  25. Evolution of Libraries
  26. Evolution of libraries vs frameworks ● Spark is one of the first big data frameworks to build a platform rather than a collection of frameworks ● A single abstraction results in multiple libraries, not multiple frameworks ● All these libraries benefit from improvements in the runtime ● This let Spark build a large ecosystem in very little time ● To understand the meaning of platform, refer to the Introduction to Flink talk [5]
  27. Data exchange between libraries ● As more and more libraries were added to Spark, having a common way to exchange data became important ● Initially, libraries used RDD as the data exchange format, but soon discovered some limitations ● Limitations of RDD as a data exchange format: ○ No defined schema; each library needs to come up with its own domain objects ○ Too low level ○ Custom serialization is hard to integrate
  28. DataFrame as data exchange format ● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark ● A DataFrame has a schema and can be easily passed around between libraries ● DataFrame is a higher-level abstraction compared to RDD ● As DataFrames are serialized using platform-specific code generation, all libraries follow the same serialization ● Datasets share the same advantages (a library hand-off sketch follows this slide)
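A hedged sketch of DataFrame as the exchange format between libraries: Spark SQL produces a DataFrame and the DataFrame-based spark.ml library consumes it directly, with no per-library domain objects in between. The column names and sample documents are illustrative, and the example assumes the spark-mllib dependency is on the classpath.

```scala
// Hand a DataFrame from the SQL side of an application straight to an ML transformer.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.Tokenizer

object DataFrameHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-handoff").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Produced by the "Spark SQL side" of the application
    val docs = Seq((0L, "spark makes big data simple"), (1L, "dataframes carry schema")).toDF("id", "text")

    // Consumed as-is by the ML library: same DataFrame, no conversion layer
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    tokenizer.transform(docs).show()
    sc.stop()
  }
}
```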
  29. Learnings from Spark 1.x ● Structured/semi-structured data is a first-class citizen of big data processing systems ● Custom memory management and code-generated serialization give the best performance on the JVM ● DataFrame/Dataset are the new abstraction layers for building next-generation big data processing systems ● DSLs are the way forward over Map/Reduce-like APIs ● High-level structured abstractions let libraries coexist happily on a platform
  30. References 1. http://spark.apache.org/news/spark-1-0-0-released.html 2. https://www.youtube.com/watch?v=ckX6fT3kYG0 3. https://www.youtube.com/watch?v=iKOGBr-kOks 4. https://www.youtube.com/watch?v=7nIMpD5TyNs 5. https://www.youtube.com/watch?v=jErEhxP8LYQ
