SQOOP on SPARK
for Data Ingestion
Veena Basavaraj & Vinoth Chandar
@Uber
Works currently @Uber, focused on building a real-time pipeline for ingestion into Hadoop for batch and stream processing. Previously @LinkedIn, lead on Voldemort; @Oracle, focused on log-based replication, HPC, and stream processing.
Works currently @Uber on streaming systems. Prior to this, worked @Cloudera on ingestion for Hadoop and @LinkedIn on frontend and service infra.
Agenda
• Data Ingestion Today
• Introduction to Apache Sqoop 2
• Sqoop Jobs on Apache Spark
• Insights & Next Steps
In the beginning…
Data Ingestion Tool
• Primary need: transferring data from SQL to HADOOP
• SQOOP solved it well!
Ingestion needs Evolved…
Data Ingestion Pipeline
• The ingestion pipeline can now have
• Non-SQL-like data sources
• Messaging systems as data sources
• Multi-stage pipelines
Sqoop 2
• Generic Data Transfer Service
• FROM: egress data out of ANY source
• TO: ingress data into ANY target
• Pluggable Data Sources
• Server-Client Design
Sqoop 2
• CONNECTOR
• JOB
Connector
• Connectors represent Data Sources
Connector
• Data Source properties are represented via Configs
• LINK config to connect to the data source
• JOB config to read/write data from/to the data source
Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
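For orientation, the pluggable surface described above can be pictured roughly as the interfaces below. This is a simplified sketch: the real Sqoop 2 classes (Partitioner, Extractor, Loader) carry richer context and config arguments, so treat the names and signatures here as illustrative only.

// Simplified sketch of the pluggable connector surface (illustrative signatures only).
import java.io.Serializable;
import java.util.List;

// (P) Partition: split the FROM data source into independent chunks.
interface SimplePartitioner<CONF> {
  List<? extends Serializable> getPartitions(CONF linkAndJobConfig, int maxPartitions);
}

// (E) Extract: read the records that belong to one partition.
interface SimpleExtractor<CONF, RECORD> {
  Iterable<RECORD> extract(CONF linkAndJobConfig, Serializable partition);
}

// (L) Load: write records into the TO data source.
interface SimpleLoader<CONF, RECORD> {
  void load(CONF linkAndJobConfig, Iterable<RECORD> records);
}

// Note: no (T) transform hook here, which is exactly the gap the rest of the talk addresses.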
Sqoop Job
• Creating a Job
• Job Submission
• Job Execution
Let's talk about a MySQL to Kafka example
Create Job
• Create LINKs
• Populate the FROM link Config and create the FROM LINK
• Populate the TO link Config and create the TO LINK
Create MySQL link
Create Kafka link
Create Job
• Create a JOB associating the FROM and TO LINKS
• Populate the FROM and TO Job Configs
• Populate the Driver Config, such as parallelism for extract and load (numExtractors, numLoaders)
Add MySQL From Config
Add Kafka To Config
Create Job API
public static void createJob(String[] jobconfigs) {
  CommandLine cArgs = parseArgs(createOptions(), jobconfigs);
  // Create the FROM and TO links, then the job that associates them
  MLink fromLink = createFromLink("jdbc-connector", jobconfigs);
  MLink toLink = createToLink("kafka-connector", jobconfigs);
  MJob sqoopJob = createJob(fromLink, toLink, jobconfigs);
}
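The createFromLink / createToLink helpers are not expanded in the deck. As a hedged sketch, populating a JDBC FROM link against the Sqoop 2 client API might look roughly like the snippet below; the connector name and the config input keys (linkConfig.connectionString and friends) are assumptions that vary across Sqoop 2 (1.99.x) versions, so check your connector's own config metadata for the exact names.

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MLink;

// Hedged sketch of a createFromLink-style helper. Connector name and
// config keys are assumptions, not taken from the talk.
public class LinkHelper {
  public static MLink createFromLink(SqoopClient client, String jdbcString,
                                     String user, String password) {
    MLink fromLink = client.createLink("jdbc-connector");
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.connectionString").setValue(jdbcString);
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.username").setValue(user);
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.password").setValue(password);
    client.saveLink(fromLink);  // persists the link on the Sqoop 2 server
    return fromLink;
  }
}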
Job Submit
• Sqoop uses the MR engine to transfer data between the FROM and TO data sources
• A Hadoop Configuration object is used to pass the FROM/TO and Driver Configs to the MR engine
• Submits the job via the MR client and tracks job status and stats such as counters
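In other words, the submission step packs the configs into a Hadoop Configuration and hands the job to the MR client, roughly as below. The key names are hypothetical placeholders for whatever constants Sqoop 2 actually uses internally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch of the MR submission path; key names are hypothetical
// stand-ins, and the InputFormat/Mapper/Reducer/OutputFormat wiring to
// the connector APIs is omitted.
public class MrSubmitSketch {
  public static boolean submit(String fromConfigJson, String toConfigJson,
                               String driverConfigJson) throws Exception {
    Configuration conf = new Configuration();
    conf.set("sqoop.job.from.config", fromConfigJson);     // FROM configs
    conf.set("sqoop.job.to.config", toConfigJson);          // TO configs
    conf.set("sqoop.job.driver.config", driverConfigJson);  // driver configs (parallelism, ...)

    Job job = Job.getInstance(conf, "sqoop-transfer");
    boolean success = job.waitForCompletion(true);  // blocks, reports status
    System.out.println(job.getCounters());          // stats such as counters
    return success;
  }
}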
Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
Remember!
Job Execution
• InputFormat/Splits for Partitioning
  • Invokes the FROM Partition API
• Mappers for Extraction
  • Invokes the FROM Extract API
• Reducers for Loading
  • Invokes the TO Load API
• OutputFormat for Commits/Aborts
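To make the mapping concrete, the extraction side boils down to a map task that rehydrates its partition from the input split and drives the connector's Extract API. The sketch below uses hypothetical types; the record and partition handling are stand-ins, not Sqoop's real internals.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hedged sketch: the map task is a thin shell around the FROM Extract API.
public class ExtractMapperSketch
    extends Mapper<NullWritable, Text, Text, NullWritable> {

  @Override
  protected void map(NullWritable key, Text serializedPartition, Context context)
      throws IOException, InterruptedException {
    // Rebuild the partition handed out by the Partition API, ask the FROM
    // connector to extract its records, and emit them so the reduce side
    // can invoke the TO Load API.
    for (String record : extractRecords(serializedPartition.toString())) {
      context.write(new Text(record), NullWritable.get());
    }
  }

  // Placeholder for "invoke FROM Extract API"; not a real Sqoop call.
  private Iterable<String> extractRecords(String partition) {
    return java.util.Collections.singletonList(partition);
  }
}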
So What’s the Scoop?
It turns out…
• Sqoop 2 supports a pluggable Execution Engine
• Why not replace MR with Spark for parallelism?
• Why not extend the Connector APIs to support simple (T) transformations along with (EL)?
Why Apache Spark?
• Why not? Data pipelines can be expressed as Spark jobs
• Speed is a feature! Faster than MapReduce
• A growing community is embracing Apache Spark
• Low effort: less than a few weeks to build a POC
• EL to ETL: nifty transformations can be added easily
Let's talk about the SQOOP on SPARK implementation!
Spark Sqoop Job
• Creating a Job
• Job Submission
• Job Execution
Create Sqoop Spark Job
• Create a SparkContext from the relevant configs
• Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..), which wraps both Sqoop and Spark initialization
• As before, create a Sqoop Job with the createJob API
• Invoke SqoopSparkJob.execute(conf, context)
public class SqoopJDBCHDFSJobDriver {
  public static void main(String[] args) {
    final SqoopSparkJob sparkJob = new SqoopSparkJob();
    CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
    // 1. Initialize Sqoop and Spark, and create the SparkContext
    SparkConf conf = sparkJob.init(cArgs);
    JavaSparkContext context = new JavaSparkContext(conf);
    // 2. Create the FROM (JDBC) and TO (HDFS) links
    MLink fromLink = getJDBCLink();
    MLink toLink = getHDFSLink();
    // 3. Create the Sqoop job associating the two links
    MJob sqoopJob = createJob(fromLink, toLink);
    sparkJob.setJob(sqoopJob);
    // 4. Execute the job on Spark
    sparkJob.execute(conf, context);
  }
}
Spark Job Submission
• We explored a few options!
• Invoke Spark in-process within the Sqoop Server to execute the job
• Use the Remote Spark Context used by Hive-on-Spark to submit
• Use the Sqoop Job as a driver for the spark-submit command
Spark Job Submission
• Build an "uber.jar" with the driver and all the Sqoop dependencies
• Submit the driver program to the YARN client, either programmatically using the Spark YARN client or directly via the command line
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver \
    --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ \
    --jdbcString jdbc://myhost:3306/test --u uber --p hadoop \
    --outputDir hdfs://path/to/output --numE 4 --numL 4
Spark Job Execution
• 3 main stages
• Obtain containers for parallel execution by simply converting the job's partitions into an RDD
• The Partition API determines parallelism; a Map stage uses the Extract API to read records
• Another Map stage uses the Load API to write records
Spark Job Execution
SqoopSparkJob.execute(…) {
  // 1. Convert the job's partitions into an RDD for parallel execution
  List<Partition> sp = getPartitions(request, numMappers);
  JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());
  // 2. Map stage: invoke the Extract API on each partition
  JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
      partitionRDD.map(new SqoopExtractFunction(request));
  // 3. Map stage: invoke the Load API on the extracted records
  extractRDD.map(new SqoopLoadFunction(request)).collect();
}
Spark Job Execution
• We chose to have two map stages for a reason
• Load parallelism can differ from Extract parallelism; for instance, we may need to restrict the TO side based on the number of Kafka partitions on the topic
• We can repartition before we invoke the Load stage (see the sketch below)
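Building on the execute() body above, a hedged sketch of that repartition step; the partition count here is a placeholder that would really come from the TO job config.

// Hedged sketch: shrink (or grow) load parallelism independently of
// extract parallelism before invoking the Load stage.
int numKafkaPartitions = 4;  // assumed; derive from the TO job config in practice

JavaRDD<List<IntermediateDataFormat<?>>> loadInputRDD =
    extractRDD.repartition(numKafkaPartitions);

loadInputRDD.map(new SqoopLoadFunction(request)).collect();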
Micro Benchmark -> MySQL to HDFS
• Table w/ 300K records, numExtractors = numLoaders
• Table w/ 2.8M records, numExtractors = numLoaders (good partitioning!)
What was Easy?
• Reusing existing Connectors; NO changes to the Connector API required
• Built-in support for Standalone and Cluster mode for quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
What was not Easy?
• No clean Spark job-submit API that provides job statistics; we used the YARN UI for job status and health
• We had to convert a bunch of Sqoop core classes, such as the IDF (the internal representation for transferred records), to be serializable
• Managing Hadoop and Spark dependencies together, and the resulting ClassNotFound (CNF) issues, caused some pain
Next Steps!
• Explore alternative ways for Spark Sqoop Job submission
• Expose Spark job stats, such as accumulators, in the submission history
• Proposed Connector Filter API (cleaning, data masking); see the sketch after this list
• We want to work with the Sqoop community to merge this back if it's useful
• https://github.com/vybs/sqoop-on-spark
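To make the proposed Filter API concrete: with the two-map-stage layout above, a transform could slot in as one more map over the extracted RDD. The sketch below is purely illustrative; the class name, the List<String> record shape, and the masking logic are assumptions, not part of Sqoop today.

import java.util.List;
import org.apache.spark.api.java.function.Function;

// Hypothetical filter/transform stage between Extract and Load; records
// are simplified to List<String> for illustration.
class MaskColumnFunction implements Function<List<String>, List<String>> {
  private final int columnToMask;

  MaskColumnFunction(int columnToMask) {
    this.columnToMask = columnToMask;
  }

  @Override
  public List<String> call(List<String> record) {
    record.set(columnToMask, "****");  // simple data-masking example
    return record;
  }
}

// Usage between the extract and load map stages (names follow the earlier sketch):
// extractRDD.map(new MaskColumnFunction(2)).map(new SqoopLoadFunction(request)).collect();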
Questions!
• Apache Sqoop Project - sqoop.apache.org
• Apache Spark Project - spark.apache.org
• Thanks to the Folks @Cloudera and @Uber !!!
• You can reach us @vybs, @byte_array
