© 2018 GridGain Systems, Inc.
Improving Apache Spark™ In-Memory
Computing with Apache Ignite™
Valentin Kulichenko
GridGain Systems
Apache Ignite is a memory-centric distributed
database, caching, and processing platform
for transactional, analytical, and streaming workloads,
delivering in-memory speeds at petabyte scale.
Apache Ignite Database and Caching Platform
• Memory-Centric Storage
• Ignite Native Persistence (Flash, SSD, Intel 3D XPoint)
• Third-Party Persistence (RDBMS, HDFS, NoSQL)
• SQL, Transactions, Compute, Services, ML, Streaming, Key/Value
• IoT, Financial Services, Pharma & Healthcare, E-Commerce, Travel & Logistics, Telco
Comparing Ignite and Spark
Ignite
• Distributed memory-centric database
• Fully fledged compute platform: SQL, transactions, key-value, collocated processing, ML/DL
• OLAP and OLTP
Spark
• Streaming and compute engine
• Ingests data from HDFS or other storage
• Inclined towards OLAP and focused on MapReduce (MR) payloads
Ignite and Spark Together
Ignite is a memory-centric store for Spark
• No data movement from Ignite to Spark
• In-place query execution
• Boost DataFrame and SQL performance
• Share state and data among Spark jobs
• Faster data and streaming analytics
Ignite and Spark Integration
[Diagram: a Spark application with three Spark workers, each running two Spark jobs, alongside Yarn, Mesos, Docker, and HDFS. Below the workers sits an in-memory shared RDD or DataFrame layer backed by three GridGain nodes.]
• Share state and data among Spark jobs
• No data movement
• Boost DataFrame and SQL performance
• SQL on top of RDDs
• In-place query execution
Ignite Shared RDDs
• Spark RDD abstraction
• Shared view over an Ignite cache/table
• Mutable
• Ignite SQL on top of the RDD APIs
• Indexes and in-place execution
Write to and Read from Ignite
• Standard RDD APIs + Ignite SQL
• No rip-and-replace
• Switch to Ignite as the storage layer
import org.apache.ignite.spark.IgniteRDD
// ic is an existing IgniteContext; sc is the SparkContext
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
// Write: save the pairs (1, 1) through (100000, 100000) into the Ignite cache
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
// Read with the standard RDD API
val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
// Read with Ignite SQL; the result is a DataFrame
val df = sharedRDD.sql("select _val from Integer where _key > 50000")
Ignite DataFrames
• Optimizes Spark's Catalyst engine
• In-place execution on the Ignite side
• No data movement in most scenarios
SQL Queries Execution Flow
1. Initial query
2. Query execution over local data
3. Reduce multiple results into one
[Diagram: two Ignite nodes. One holds Canada with Toronto, Ottawa, Montreal, and Calgary; the other holds India with Mumbai and New Delhi. Arrows are numbered with steps 1 to 3.]
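The three steps of this flow can be sketched in plain Java. This is an illustrative model only, not Ignite code: the `ScatterGatherSketch` class, its `Node` type, and the `query` and `exampleCluster` helpers are hypothetical names invented for the example, while the countries and cities come from the slide. Each node answers the query from its own data (step 2), and the caller merges the partial results into one (step 3).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative model of the execution flow; not the Ignite API.
class ScatterGatherSketch {

    // A hypothetical node that owns one partition of the data: country -> cities.
    static final class Node {
        private final Map<String, List<String>> localData;

        Node(Map<String, List<String>> localData) {
            this.localData = localData;
        }

        // Step 2: run the query over this node's local data only.
        List<String> queryLocal(Predicate<String> cityFilter) {
            List<String> partial = new ArrayList<>();
            for (List<String> cities : localData.values()) {
                for (String city : cities) {
                    if (cityFilter.test(city)) {
                        partial.add(city);
                    }
                }
            }
            return partial;
        }
    }

    // Steps 1 and 3: send the query to every node, then reduce the partial results into one.
    static List<String> query(List<Node> nodes, Predicate<String> cityFilter) {
        List<String> merged = new ArrayList<>();
        for (Node node : nodes) {
            merged.addAll(node.queryLocal(cityFilter));
        }
        Collections.sort(merged);
        return merged;
    }

    // The two nodes shown on the slide.
    static List<Node> exampleCluster() {
        Node canada = new Node(Collections.singletonMap(
            "Canada", Arrays.asList("Toronto", "Ottawa", "Montreal", "Calgary")));
        Node india = new Node(Collections.singletonMap(
            "India", Arrays.asList("Mumbai", "New Delhi")));
        return Arrays.asList(canada, india);
    }

    public static void main(String[] args) {
        // Cities starting with "M", gathered from both nodes.
        System.out.println(query(exampleCluster(), city -> city.startsWith("M")));
        // prints [Montreal, Mumbai]
    }
}
```

Only the filtered partial results travel from each node to the caller, which is the point of executing the query where the data lives.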
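Saving DataFrames
• Store DataFrames in Ignite
• Save modes
  • Append
  • Overwrite
  • ErrorIfExists
  • Ignore
import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
SparkSession spark = ...; // an existing SparkSession (elided on the slide)
String cfgPath = "path/to/config/file";
Dataset<Row> jsonDataFrame = spark.read().json("path/to/file.json");
jsonDataFrame.write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE()) // Data source
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath) // Ignite config
    .mode(SaveMode.Append) // SaveMode
    //... other options
    .save();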
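The four save modes listed above differ only in what happens when the target table already exists. The sketch below models those semantics with an in-memory map in plain Java; it is not Spark or Ignite code, and `SaveModeSketch`, its `Mode` enum, and the `save` and `rows` methods are hypothetical names chosen for illustration. It assumes the general Spark `SaveMode` behavior: append adds rows, overwrite replaces them, error-if-exists throws, and ignore leaves existing data untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the four save modes; not Spark or Ignite code.
class SaveModeSketch {

    enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

    // Hypothetical in-memory "tables": table name -> rows.
    private final Map<String, List<String>> tables = new HashMap<>();

    void save(String table, List<String> rows, Mode mode) {
        if (!tables.containsKey(table)) {
            // With no existing table, every mode simply creates it.
            tables.put(table, new ArrayList<>(rows));
            return;
        }
        switch (mode) {
            case APPEND:
                // Keep the existing rows and add the new ones.
                tables.get(table).addAll(rows);
                break;
            case OVERWRITE:
                // Replace the existing rows with the new ones.
                tables.put(table, new ArrayList<>(rows));
                break;
            case ERROR_IF_EXISTS:
                throw new IllegalStateException("Table already exists: " + table);
            case IGNORE:
                // Leave the existing table untouched.
                break;
        }
    }

    List<String> rows(String table) {
        return tables.get(table);
    }

    public static void main(String[] args) {
        SaveModeSketch store = new SaveModeSketch();
        store.save("person", Arrays.asList("Mary"), Mode.APPEND);
        store.save("person", Arrays.asList("John"), Mode.APPEND);
        store.save("person", Arrays.asList("Jane"), Mode.IGNORE);
        System.out.println(store.rows("person")); // prints [Mary, John]
    }
}
```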
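Reading DataFrames
• Read from Ignite
• Specify format
• Specify config file
import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
SparkSession spark = ...; // an existing SparkSession (elided on the slide)
String cfgPath = "path/to/config/file";
Dataset<Row> df = spark.read()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE()) // Data source
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "person") // Table to read
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath) // Ignite config
    .load();
// Register the Ignite-backed DataFrame so Spark SQL can query it by name
df.createOrReplaceTempView("person");
Dataset<Row> igniteDF = spark.sql(
    "SELECT * FROM person WHERE name = 'Mary Major'");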
DataFrames Demo Setup
• 1 Ignite server node
• SensorDataGenerator
  • Writes random data to a socket
• Stream
  • Connects to the socket, reads sensor data, and streams it via Spark; for each streamed RDD, it creates a DataFrame and saves it into Ignite
• Query
  • Creates another Spark application that uses the DataFrames integration to query data from Ignite
Any Questions?
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
#apacheignite
