Under-the-hood:
Creating Your Own Spark Data Sources
ApacheCon North America 2018
Speaker: Jayesh Thakrar @ Conversant
Slide 1
Why Build Your Own Data Source?
Slide 2
Don't need one for
• text, CSV, ORC, Parquet, JSON files
• JDBC sources
• sources available from a project / vendor, e.g.
• Cassandra, Kafka, MongoDB, etc.
• Teradata, Greenplum (2018), etc.
Need one to
• use special features, e.g.
• Kafka transactions
• exploit RDBMS-specific partition features
• ...because you can, and want to
Conversant Use Cases (mid-2017)
• Greenplum as data source (read/write)
• Incompatibility between Spark, Kafka and
Kafka connector versions
Agenda
1. Introduction to Spark data sources
2. Walk-through sample code
3. Practical considerations
Slide 3
Introduction To
Spark Data Source
Slide 4
Using Data Sources
Slide 5
• Built-in
spark.read.csv("path")
spark.read.orc("path")
spark.read.parquet("path")
• Third-party/custom
spark.read.format("class-name").load() // custom data source
spark.read.format("...").option("...", "...").load() // with options
spark.read.format("...").schema("..").load() // with schema
Spark Data Source
Spark Application → Spark API → Spark Backend (runtime, Spark libraries, etc.) → Spark Data Source → Data
Slide 6
Data Source V2 API
Spark Application → Spark API → Spark Backend (runtime, Spark libraries, etc.) → Spark Data Source → Data
Shiny, new V2 API since Spark 2.3 (Feb 2018) - SPARK-15689
No BaseRelation or scan classes involved (unlike the V1 API)
Slide 7
V2 API: Data Source Types
Data Source Type → Output
• Batch → Dataset
• Microbatch (successor to DStreams) → Structured stream = stream of bounded Datasets
• Continuous → Continuous stream = continuous stream of Row(s)
Slide 8
V2 DATA SOURCE API
• Well-documented design
• V2 API: design doc and feature SPARK-15689
• Continuous streaming: design doc and feature SPARK-20928
• Implemented as Java interfaces (not classes)
• Similar interfaces across all data source types
• Microbatch and Continuous not hardened yet...
e.g. SPARK-23886, SPARK-23887, SPARK-22911
Slide 9
Code Walkthrough
Slide 10
Details
• Project: https://github.com/JThakrar/sparkconn
• Requirements:
• Scala 2.11
• SBT
Slide 11
Reading From Data Source
val data = spark.read.format("DataSource").option("key", "value").schema("..").load()
Step → Action
• spark → the SparkSession entry point
• read → returns a DataFrameReader that orchestrates the data source
• format → lazy lookup of the data source class, by short name or fully-qualified class name
• option → zero, one or more key-value pairs of options for the data source
• schema → optional, user-provided schema
• load → loads the data as a DataFrame. Remember that DataFrames are lazily evaluated.
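The steps in the table above chain together as follows; this is a minimal sketch, and the class name and option key here are illustrative assumptions (the real connector lives in the sparkconn project):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

val data = spark.read
  .format("com.example.SimpleDataSource") // or the shortName() registered by the source
  .option("partitions", "3")              // forwarded to the source as DataSourceOptions
  .load()                                 // lazy: nothing is read yet

data.show()                               // an action triggers the actual read
```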
Slide 12
Reading: Interfaces to Implement
Slide 13
spark.format(...)
Driver: DataSourceRegister → ReadSupport / ReadSupportWithSchema → DataSourceReader → DataReaderFactory (one per partition)
Executors / Tasks / Partitions: each DataReaderFactory creates a DataReader
Need to implement a minimum of 5 interfaces (implementable in as few as 3 classes)
Interface Definitions And Dependency
Slide 14
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
Example V2 API Based Datasource
Slide 15
• Very simple data source that generates a row with a fixed schema of a single
string column. Column name = "string_value"
• Completely self-contained (i.e. no external connection)
• Number of partitions in dataset is user-configurable (default = 5)
• All partitions contain the same number of rows (strings)
• Number of rows per partition is user-configurable (default = 5)
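The whole sample source can be sketched as follows against the Spark 2.3.x V2 read API. This is a simplified, illustrative version (class names and option keys are assumptions); the working implementation is in the sparkconn project:

```scala
import java.util.{ArrayList => JArrayList, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Driver side: entry point, looked up by format("simple") or the class name
class SimpleDataSource extends DataSourceV2 with DataSourceRegister with ReadSupport {
  override def shortName(): String = "simple"

  // Receives the key-value pairs from .option(...)
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleReader(options.get("partitions").orElse("5").toInt,
                     options.get("rowsperpartition").orElse("5").toInt)
}

// Driver side: declares the schema and creates one factory per partition
class SimpleReader(partitions: Int, rowsPerPartition: Int) extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("string_value", StringType, nullable = false)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
    val factories = new JArrayList[DataReaderFactory[Row]]()
    (0 until partitions).foreach(p => factories.add(new SimpleReaderFactory(p, rowsPerPartition)))
    factories
  }
}

// Serialized by the driver and shipped to each executor
class SimpleReaderFactory(partition: Int, rows: Int) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new SimpleDataReader(partition, rows)
}

// Executor side: produces the rows for one partition at task level
class SimpleDataReader(partition: Int, rows: Int) extends DataReader[Row] {
  private var current = 0
  override def next(): Boolean = current < rows        // any rows left?
  override def get(): Row = {                          // return the current row
    current += 1
    Row(s"partition: $partition, row: $current")
  }
  override def close(): Unit = ()                      // release connections, buffers, etc.
}
```

Note that ReadSupportWithSchema would add a createReader(schema, options) variant for user-supplied schemas; this sketch keeps the schema fixed.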
Read Interface: DataSourceRegister
Slide 16
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.ReadSupport
and / or
org.apache.spark.sql.sources.v2.ReadSupportWithSchema
DataSourceRegister is the entry point for your data source.
ReadSupport is then responsible for instantiating the object
implementing DataSourceReader (discussed later). It accepts the
options/parameters and the optional schema from the Spark application.
Read Interface: DataSourceReader
Slide 17
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataSourceReader
This interface requires implementations for:
• determining the schema of data
• determining the number of partitions and creating that many
reader factories below.
Read Interface: DataReaderFactory
Slide 18
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataReaderFactory
This is the "handle" passed by the driver to each executor. It
instantiates the readers below and controls data fetch.
Read Interface: DataReader
Slide 19
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataReader This does the actual work of fetching data from source (at task-level)
Summary
Slide 20
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.ReadSupport
org.apache.spark.sql.sources.v2.ReadSupportWithSchema
DataSourceRegister is the entry point for your connector.
ReadSupport is then responsible for instantiating the object
implementing DataSourceReader below. It accepts the
options/parameters and the optional schema from the Spark application.
org.apache.spark.sql.sources.v2.reader.DataSourceReader
This interface requires implementations for:
• determining the schema of the data
• determining the number of partitions and creating that many
reader factories below.
The DataSourceRegister, DataSourceReader and DataReaderFactory
are instantiated at the driver. The driver then serializes the
DataReaderFactory and sends it to each of the executors.
org.apache.spark.sql.sources.v2.reader.DataReaderFactory
This is the "handle" passed by the driver to each executor. It
instantiates the readers below and controls the data fetch.
org.apache.spark.sql.sources.v2.reader.DataReader This does the actual work of fetching data from the source (at task level).
Because you are implementing interfaces, YOU determine the class constructor parameters and initialization.
Write Interfaces to Implement
Slide 21
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.WriteSupport
DataSourceRegister is the entry point for your connector.
WriteSupport is then responsible for instantiating the object
implementing DataSourceWriter below. It accepts the
options/parameters and the schema.
org.apache.spark.sql.sources.v2.writer.DataSourceWriter
This interface requires implementations for:
• committing a data write
• aborting a data write
• creating writer factories
The DataSourceRegister, DataSourceWriter and DataWriterFactory
are instantiated at the driver. The driver then serializes the
DataWriterFactory and sends it to each of the executors.
org.apache.spark.sql.sources.v2.writer.DataWriterFactory
This is the "handle" passed by the driver to each executor. It
instantiates the writers below and controls the data write.
org.apache.spark.sql.sources.v2.writer.DataWriter
This does the actual work of writing and committing/aborting the
data.
org.apache.spark.sql.sources.v2.writer.WriterCommitMessage
This is a "commit" message passed from each DataWriter back to the
DataSourceWriter.
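The write side follows the same driver/executor split; a minimal sketch against the Spark 2.3.x V2 write API (class names are illustrative, and this sink only counts rows rather than writing anywhere real):

```scala
import java.util.Optional

import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriter, DataWriterFactory, WriterCommitMessage}
import org.apache.spark.sql.types.StructType

// Driver side: entry point for df.write.format("simplesink").save()
class SimpleSink extends DataSourceV2 with DataSourceRegister with WriteSupport {
  override def shortName(): String = "simplesink"

  override def createWriter(jobId: String,
                            schema: StructType,
                            mode: SaveMode,
                            options: DataSourceOptions): Optional[DataSourceWriter] =
    Optional.of(new SimpleSinkWriter)
}

// Driver side: creates the factory and observes the job-level commit/abort
class SimpleSinkWriter extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[Row] = new SimpleSinkWriterFactory
  override def commit(messages: Array[WriterCommitMessage]): Unit = ()  // finalize the job
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()   // undo partial writes
}

// Serialized to executors; one DataWriter per partition attempt
class SimpleSinkWriterFactory extends DataWriterFactory[Row] {
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new SimpleSinkDataWriter(partitionId)
}

// Executor side: writes rows, then reports success via a commit message
class SimpleSinkDataWriter(partitionId: Int) extends DataWriter[Row] {
  private var count = 0L
  override def write(record: Row): Unit = { count += 1 }  // a real sink would buffer/send here
  override def commit(): WriterCommitMessage = SimpleCommitMessage(partitionId, count)
  override def abort(): Unit = ()                          // discard buffered data
}

// Marker interface; carries per-partition results back to the driver
case class SimpleCommitMessage(partitionId: Int, rows: Long) extends WriterCommitMessage
```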
Practical Considerations
Slide 22
Know Your Data Source
• Configuration
• Partitions
• Data schema
• Parallelism approach
• Batch and/or streaming
• Restart / recovery
Slide 23
V2 API IS STILL EVOLVING
• SPARK-22386 - Data Source V2 Improvements
• SPARK-23507 - Migrate existing data sources
• SPARK-24073 - DataReaderFactory renamed (to InputPartition) in 2.4
• SPARK-24252, SPARK-25006 - DataSourceV2: Add catalog support
• So why use V2?
It is future-ready, and any alternative to V2 needs significantly more time and effort!
See https://www.youtube.com/watch?v=O9kpduk5D48
Slide 24
About...
Conversant
• Digital marketing unit of Epsilon under Alliance Data Systems (ADS)
• (Significant) player in internet advertising.
We see about 80% of internet ad bids in the US
• Secret sauce = anonymous cross-device profiles driving personalized messaging
Me (Jayesh Thakrar)
• Sr. Software Engineer (jthakrar@conversantmedia.com)
• https://www.linkedin.com/in/jayeshthakrar/
Slide 25
Questions?
Slide 26
ApacheCon North America 2018: Creating Spark Data Sources

  • 1. Under-the-hood: Creating Your Own Spark Data Sources Speaker: Jayesh Thakrar @ Conversant Slide 1
  • 2. Why Build Your Own Data Source? Slide 2 Don't need • text, csv, ORC, Parquet, JSON files • JDBC sources • When available from project / vendor, e.g. • Cassandra, Kafka, MongoDB, etc • Teradata, Greenplum (2018), etc Need to • Use special features, e.g. • Kafka transactions • Exploit RDBMS specific partition features • Because you can, and want to  Conversant Use Cases (mid-2017) • Greenplum as data source (read/write) • Incompatibility between Spark, Kafka and Kafka connector versions
  • 3. Agenda 1. Introduction to Spark data sources 2. Walk-through sample code 3. Practical considerations Slide 3
  • 4. Introduction To Spark Data Source Slide 4
  • 5. Using Data Sources Slide 5 • Built-in spark.read.csv("path") spark.read.orc("path") spark.read.parquet("path") • Third-party/custom spark.read.format("class-name").load() // custom data source spark.read.format("...").option("...", "...").load() // with options spark.read.format("...").schema("..").load() // with schema
  • 6. Spark Data Source [Architecture diagram: a Spark application calls the Spark API; the Spark backend (runtime, Spark libraries, etc.) invokes the Spark data source, which in turn reads the underlying data.] Slide 6
  • 7. Data Source V2 API [Same architecture diagram as slide 6, with the data source now sitting behind the] Shiny, New V2 API, available since Spark 2.3 (Feb 2018), SPARK-15689. No relations or scans involved (unlike the V1 API). Slide 7
  • 8. V2 API: Data Source Types [Table: Data Source Type → Output] Batch → Dataset; Microbatch (successor to DStreams) → structured stream, i.e. a stream of bounded datasets; Continuous → continuous stream of Row(s). Slide 8
  • 9. V2 DATA SOURCE API • Well-documented design: V2 API design doc, feature SPARK-15689; Continuous Streaming design doc and feature SPARK-20928 • Implemented as Java interfaces (not classes) • Similar interfaces across all data source types • Microbatch and Continuous not hardened yet... e.g. SPARK-23886, SPARK-23887, SPARK-22911 Slide 9
  • 11. Details • Project: https://github.com/JThakrar/sparkconn • Requirements: • Scala 2.11 • SBT Slide 11
  • 12. Reading From Data Source val data = spark.read.format("DataSource").option("key", "value").schema("...").load() [Table: Step → Action] spark → the SparkSession; read → returns a DataFrameReader for data source orchestration; format → lazy lookup of the data source class, by short name or fully-qualified class name; option → zero, one or more key-value pairs of options for the data source; schema → optional, user-provided schema; load → loads the data as a DataFrame (remember DataFrames are lazily evaluated). Slide 12
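The step-by-step table above describes a fluent builder: format, option and schema only accumulate configuration, and nothing touches the source until load (and, being lazy, often not until an action runs). A dependency-free sketch of that builder shape; MiniReader and its members are hypothetical stand-ins, not the real Spark DataFrameReader:

```scala
// Mimics the DataFrameReader flow from slide 12: format/option/schema
// just accumulate state; load() is where the data source class would
// actually be looked up and instantiated.
case class MiniReader(
    source: Option[String] = None,
    options: Map[String, String] = Map.empty,
    schema: Option[String] = None) {
  def format(name: String): MiniReader = copy(source = Some(name))
  def option(key: String, value: String): MiniReader = copy(options = options + (key -> value))
  def withSchema(ddl: String): MiniReader = copy(schema = Some(ddl))
  def load(): String =
    s"load(source=${source.getOrElse("<none>")}, options=$options, schema=${schema.getOrElse("<inferred>")})"
}
```

Each call returns a new immutable reader, which is why the steps can be chained in any order before load.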
  • 13. Reading: Interfaces to Implement [Diagram: on the driver, spark.read.format(...) resolves DataSourceRegister → ReadSupport and/or ReadSupportWithSchema → DataSourceReader, which creates one DataReaderFactory per partition; on the executors, each task turns its DataReaderFactory into a DataReader.] Need to implement a minimum of 5 interfaces and 3 classes. Slide 13
  • 14. Interface Definitions And Dependency Slide 14 libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
  • 15. Example V2 API Based Datasource Slide 15 • Very simple data source that generates rows with a fixed schema of a single string column, column name = "string_value" • Completely self-contained (i.e. no external connection) • Number of partitions in the dataset is user-configurable (default = 5) • All partitions contain the same number of rows (strings) • Number of rows per partition is user-configurable (default = 5)
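The shape of the data this example source produces can be sketched without Spark at all: `numPartitions` partitions, each holding `rowsPerPartition` strings destined for the single "string_value" column. The row text format below is an assumption for illustration; the real implementation lives in the sparkconn project linked on slide 11:

```scala
// Dependency-free model of the example source's output: a configurable
// number of partitions, each with the same configurable number of rows.
// Defaults (5 and 5) mirror the description on slide 15.
case class PartitionData(partitionId: Int, rows: Seq[String])

def generate(numPartitions: Int = 5, rowsPerPartition: Int = 5): Seq[PartitionData] =
  (0 until numPartitions).map { p =>
    // Hypothetical row text; only the single-string-column shape matters.
    PartitionData(p, (0 until rowsPerPartition).map(r => s"partition: $p, row: $r"))
  }
```

With the defaults this yields 25 rows in total, spread evenly across 5 partitions.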
  • 16. Read Interface: DataSourceRegister Slide 16 [Table: Interface → Purpose] org.apache.spark.sql.sources.v2.DataSourceRegister, org.apache.spark.sql.sources.v2.ReadSupport and/or org.apache.spark.sql.sources.v2.ReadSupportWithSchema → DataSourceRegister is the entry point for your data source. ReadSupport is then responsible for instantiating the object implementing DataSourceReader (discussed later). It accepts the options/parameters and optional schema from the Spark application.
  • 17. Read Interface: DataSourceReader Slide 17 Interface Purpose org.apache.spark.sql.sources.v2.reader.DataSourceReader This interface requires implementations for: • determining the schema of data • determining the number of partitions and creating that many reader factories below.
  • 18. Read Interface: DataReaderFactory Slide 18 Interface Purpose org.apache.spark.sql.sources.v2.reader.DataReaderFactory This is the "handle" passed by the driver to each executor. It instantiates the readers below and controls data fetch.
  • 19. Read Interface: DataReader Slide 19 [Table: Interface → Purpose] org.apache.spark.sql.sources.v2.reader.DataReader → This does the actual work of fetching data from the source (at the task level)
  • 20. Summary Slide 20 [Table: Interface → Purpose] org.apache.spark.sql.sources.v2.DataSourceRegister, org.apache.spark.sql.sources.v2.ReadSupport, org.apache.spark.sql.sources.v2.ReadSupportWithSchema → DataSourceRegister is the entry point for your connector; ReadSupport is then responsible for instantiating the object implementing DataSourceReader; it accepts the options/parameters and optional schema from the Spark application. org.apache.spark.sql.sources.v2.reader.DataSourceReader → requires implementations for determining the schema of the data, and for determining the number of partitions and creating that many reader factories. The DataSourceRegister, DataSourceReader and DataReaderFactory are instantiated at the driver; the driver then serializes each DataReaderFactory and sends it to the executors. org.apache.spark.sql.sources.v2.reader.DataReaderFactory → the "handle" passed by the driver to each executor; it instantiates the readers and controls data fetch. org.apache.spark.sql.sources.v2.reader.DataReader → does the actual work of fetching data. Because you are implementing interfaces, YOU can determine the "class" parameters and initialization. Slide 20
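The task-level DataReader named above follows a simple cursor contract: next() advances, get() returns the current element, close() releases resources, and Spark drains the reader with a while-next loop inside each task. A dependency-free Scala sketch of that contract; SimpleReader, StringReader and drain are illustrative names, not the real Spark types:

```scala
// Stand-in for org.apache.spark.sql.sources.v2.reader.DataReader[T]:
// a forward-only cursor that a task drains until next() returns false.
trait SimpleReader[T] extends AutoCloseable {
  def next(): Boolean
  def get(): T
}

// A reader over an in-memory sequence of strings, as the example
// single-string-column source would produce for one partition.
class StringReader(rows: Seq[String]) extends SimpleReader[String] {
  private var index = -1
  override def next(): Boolean = { index += 1; index < rows.length }
  override def get(): String = rows(index)
  override def close(): Unit = ()  // nothing to release for in-memory data
}

// Drain a reader the way a Spark task does: while (next()) collect get().
def drain[T](reader: SimpleReader[T]): Seq[T] = {
  val buffer = scala.collection.mutable.Buffer.empty[T]
  while (reader.next()) buffer += reader.get()
  reader.close()
  buffer.toSeq
}
```

A real implementation would open its connection in the reader's constructor (on the executor) and release it in close().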
  • 21. Write Interfaces to Implement Slide 21 [Table: Interface → Purpose] org.apache.spark.sql.sources.v2.DataSourceRegister, org.apache.spark.sql.sources.v2.WriteSupport → DataSourceRegister is the entry point for your connector; WriteSupport is then responsible for instantiating the object implementing DataSourceWriter; it accepts the options/parameters and the schema. org.apache.spark.sql.sources.v2.writer.DataSourceWriter → requires implementations for committing the data write, aborting the data write, and creating writer factories. The DataSourceRegister, DataSourceWriter and DataWriterFactory are instantiated at the driver; the driver then serializes each DataWriterFactory and sends it to the executors. org.apache.spark.sql.sources.v2.writer.DataWriterFactory → the "handle" passed by the driver to each executor; it instantiates the writers and controls the data write. org.apache.spark.sql.sources.v2.writer.DataWriter → does the actual work of writing and committing/aborting the data. org.apache.spark.sql.sources.v2.writer.WriterCommitMessage → a "commit" message passed from the DataWriter to the DataSourceWriter. Slide 21
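The write path is a two-phase commit: each task-level DataWriter writes its partition and either returns a WriterCommitMessage or aborts, and the driver-side DataSourceWriter performs the global commit once every task has reported success. A dependency-free sketch of that protocol; CommitMessage, PartitionWriter and globalCommit are simplified illustrative names, not the real Spark interfaces:

```scala
// Stand-in for WriterCommitMessage: whatever a task needs to report
// back to the driver (here, just the partition id and row count).
case class CommitMessage(partitionId: Int, rowCount: Long)

// Stand-in for a task-level DataWriter: buffer rows, then either
// commit (return a message) or abort (discard uncommitted rows).
class PartitionWriter(partitionId: Int) {
  private val buffer = scala.collection.mutable.Buffer.empty[String]
  def write(row: String): Unit = buffer += row
  def commit(): CommitMessage = CommitMessage(partitionId, buffer.size.toLong)
  def abort(): Unit = buffer.clear()
}

// Driver-side global commit: only succeeds if every expected
// partition delivered a commit message.
def globalCommit(messages: Seq[CommitMessage], expectedPartitions: Int): Boolean =
  messages.map(_.partitionId).toSet == (0 until expectedPartitions).toSet
```

A real sink would make the task-level commit durable-but-invisible (e.g. a staging table) and flip visibility only in the driver-side commit, so a failed task can be aborted without partial output.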
  • 23. Know Your Data Source • Configuration • Partitions • Data schema • Parallelism approach • Batch and/or streaming • Restart / recovery Slide 23
  • 24. V2 API IS STILL EVOLVING • SPARK-22386 - Data Source V2 improvements • SPARK-23507 - Migrate existing data sources • SPARK-24073 - DataReaderFactory renamed in 2.4 • SPARK-24252, SPARK-25006 - DataSourceV2: add catalog support • So why use V2? It is future-ready, and any alternative to V2 needs significantly more time and effort! See https://www.youtube.com/watch?v=O9kpduk5D48 Slide 24
  • 25. About... Conversant • Digital marketing unit of Epsilon under Alliance Data Systems (ADS) • (Significant) player in internet advertising. We see about 80% of internet ad bids in the US • Secret sauce = anonymous cross-device profiles driving personalized messaging Me (Jayesh Thakrar) • Sr. Software Engineer (jthakrar@conversantmedia.com) • https://www.linkedin.com/in/jayeshthakrar/ Slide 25