
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with Akmal Chaudhri

Attend this session to learn how to easily share state in-memory across multiple Spark jobs, either within the same application or between different Spark applications using an implementation of the Spark RDD abstraction provided in Apache Ignite. During the talk, attendees will learn in detail how IgniteRDD – an implementation of native Spark RDD and DataFrame APIs – shares the state of the RDD across other Spark jobs, applications and workers. Examples will show how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames.


  1. Akmal Chaudhri, GridGain Systems: How to share state across multiple Spark jobs using Apache Ignite #EUde9
  2. Agenda •  Introduction to Apache Ignite •  Ignite for Spark •  IgniteContext and IgniteRDD •  Installation and Deployment •  Demos •  Q&A
  3. Introduction to Apache Ignite
  4. Apache Ignite in one slide •  Memory-centric platform –  that is strongly consistent –  and highly available –  with powerful SQL, key-value and processing APIs •  Designed for –  Performance –  Scalability
  5. Apache Ignite •  Data-source agnostic •  Fully fledged compute engine and durable storage •  OLAP and OLTP •  Fully ACID transactions across memory and disk •  In-memory SQL support •  Early ML libraries •  Growing community
  6. Ignite for Spark
  7. Why share state in Spark? •  Long-running applications –  Passing state between jobs •  Disk file system –  Convert RDDs to disk files and back •  Share RDDs in-memory –  Native Spark API –  Native Spark transformations
  8. Ignite for Spark •  Spark RDD abstraction •  Shared in-memory view on data across different Spark jobs, workers or applications •  Implemented as a view over a distributed Ignite cache
  9. Ignite for Spark •  Deployment modes –  Share RDD across tasks on the host –  Share RDD across tasks in the application –  Share RDD globally •  Shared state can run in –  Standalone mode (outlives the Spark application) –  Embedded mode (tied to the lifetime of the Spark application)
  10. Ignite In-Memory File System •  Distributed in-memory file system •  Implements the HDFS API •  Can be transparently plugged into Hadoop or Spark deployments
  11. IgniteContext and IgniteRDD
  12. IgniteContext •  Main entry point to the Spark-Ignite integration •  Takes a SparkContext plus either –  an IgniteConfiguration() closure, or –  a path to an XML configuration file •  Optional Boolean client argument –  true => shared deployment –  false => embedded deployment
  13. IgniteContext examples

      val igniteContext = new IgniteContext(sparkContext,
        () => new IgniteConfiguration())

      val igniteContext = new IgniteContext(sparkContext,
        "examples/config/spark/example-shared-rdd.xml")
  14. IgniteRDD •  Implementation of a Spark RDD representing a live view of an Ignite cache •  Mutable (unlike native RDDs) –  All changes in the Ignite cache are visible to RDD users immediately •  Provides partitioning information to the Spark executor •  Provides affinity information to Spark so that RDD computations can use data locality
  15. Write to Ignite •  Ignite caches operate on key-value pairs •  Spark tuple RDD for key-value pairs and the savePairs method –  Respects RDD partitioning, stores values in parallel where possible •  Value-only RDD and the saveValues method –  IgniteRDD generates a unique affinity-local key for each value stored into the cache
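The value-only path on the slide above can be sketched as follows. This is a sketch only, not code from the talk: it assumes a running Ignite cluster, an existing SparkContext `sc` and IgniteContext `ic`, and an illustrative cache name "valuesRDD"; the IgniteUuid key type reflects the affinity-local keys saveValues generates.

```scala
import org.apache.ignite.lang.IgniteUuid
import org.apache.ignite.spark.IgniteRDD

// Sketch: obtain a live view over an Ignite cache. No keys are supplied
// by the caller; IgniteRDD generates a unique affinity-local key for
// each value it stores, so the cache is typed with IgniteUuid keys.
val valuesRDD: IgniteRDD[IgniteUuid, String] = ic.fromCache("valuesRDD")

// Store a value-only RDD; keys are generated on the fly.
valuesRDD.saveValues(sc.parallelize(Seq("alpha", "beta", "gamma")))
```

Because the generated keys are affinity-local, each value is written to the Ignite node on which the Spark partition is being processed, avoiding an extra network hop.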
  16. Write code example

      val conf = new SparkConf().setAppName("SparkIgniteWriter")
      val sc = new SparkContext(conf)
      val ic = new IgniteContext(sc,
        "examples/config/spark/example-shared-rdd.xml")
      val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
      sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
  17. Read from Ignite •  IgniteRDD is a live view of an Ignite cache –  No need to explicitly load data into the Spark application from Ignite –  All RDD methods are available to use as soon as an instance of IgniteRDD is created
  18. Read code example

      val conf = new SparkConf().setAppName("SparkIgniteReader")
      val sc = new SparkContext(conf)
      val ic = new IgniteContext(sc,
        "examples/config/spark/example-shared-rdd.xml")
      val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
      val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
      println("The count is " + greaterThanFiftyThousand.count())
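The abstract notes that IgniteRDD's in-memory indexing lets SQL queries run much faster than filtering native RDDs. A minimal sketch of that SQL path, assuming the same `sharedRDD` as in the read example and a cache configured with Integer/Integer as indexed types (the cache configuration itself is not shown here):

```scala
// Sketch: run an indexed SQL query through the IgniteRDD instead of a
// full RDD scan. The "Integer" table exists because the cache was
// configured with Integer/Integer indexed types; _val is the cache value.
val df = sharedRDD.sql("select _val from Integer where _val > ?", 50000)
println("SQL count: " + df.count())
```

Unlike the `filter` in the read example, which scans every pair, this query can be answered from the cache's index.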
  19. Installation and Deployment
  20. Installation and Deployment •  Shared deployment •  Embedded deployment •  Maven •  SBT
  21. Shared Deployment •  Standalone mode •  Ignite nodes deployed with Spark worker nodes •  Add the following lines to spark-env.sh

      IGNITE_LIBS="${IGNITE_HOME}/libs/*"

      for file in ${IGNITE_HOME}/libs/*
      do
        if [ -d ${file} ] && [ "${file}" != "${IGNITE_HOME}"/libs/optional ]; then
          IGNITE_LIBS=${IGNITE_LIBS}:${file}/*
        fi
      done

      export SPARK_CLASSPATH=$IGNITE_LIBS
  22. Embedded Deployment •  Ignite nodes are started inside the Spark job processes and are stopped when the job dies •  Ignite code is distributed to worker machines using the Spark deployment mechanism •  Ignite nodes are started on all workers as part of IgniteContext initialization
  23. Maven •  Ignite's Spark artifact is hosted in Maven Central •  Scala 2.11 example

      <dependency>
        <groupId>org.apache.ignite</groupId>
        <artifactId>ignite-spark</artifactId>
        <version>${ignite.version}</version>
      </dependency>
  24. SBT •  Ignite's Spark artifact added to build.sbt •  Scala 2.11 example, where igniteVersion is a val holding the Ignite release in use

      libraryDependencies += "org.apache.ignite" % "ignite-spark" % igniteVersion
  25. Demos
  26. Resources •  Ignite for Spark documentation –  https://apacheignite-fs.readme.io/docs/ignite-for-spark •  Spark Data Frames support in Apache Ignite –  https://issues.apache.org/jira/browse/IGNITE-3084 •  Code examples –  https://github.com/apache/ignite/ => ScalarSharedRDDExample.scala –  https://github.com/apache/ignite/ => SharedRDDExample.java
  27. Any Questions? Thank you for joining us. Follow the conversation. http://ignite.apache.org
