SQOOP on SPARK
for Data Ingestion
Veena Basavaraj & Vinoth Chandar
@Uber
Works currently @Uber, focused on building a real-time pipeline for ingestion into Hadoop for batch and stream processing. Previously @LinkedIn, lead on Voldemort; @Oracle, focused on log-based replication, HPC, and stream processing.
Works currently @Uber on streaming systems. Prior to this, worked @Cloudera on ingestion for Hadoop and @LinkedIn on frontend and service infra.
Agenda
• Data Ingestion Today
• Introduction to Apache Sqoop 2
• Sqoop Jobs on Apache Spark
• Insights & Next Steps
In the beginning…
Data Ingestion Tool
• Primary need: transferring data from SQL to HADOOP
• SQOOP solved it well!
Ingestion needs Evolved…
Data Ingestion Pipeline
• The ingestion pipeline can now have
• Non-SQL-like data sources
• Messaging systems as data sources
• Multi-stage pipelines
Sqoop 2
• Generic Data Transfer Service
• FROM: egress data out of ANY source
• TO: ingress data into ANY target
• Pluggable Data Sources
• Server-Client Design
Sqoop 2
• CONNECTOR
• JOB
Connector
• Connectors represent Data Sources
Connector
• Data Source properties are represented via Configs
• LINK config to connect to the data source
• JOB config to read/write data from/to the data source
Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
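For orientation, the pluggable surface described above can be pictured roughly as the interfaces below. This is a simplified sketch: the real Sqoop 2 classes (Partitioner, Extractor, Loader) carry richer context and config arguments, so treat the names and signatures here as illustrative only.

// Simplified sketch of the pluggable connector surface (illustrative signatures only).
import java.io.Serializable;
import java.util.List;

// (P) Partition: split the FROM data source into independent chunks.
interface SimplePartitioner<CONF> {
  List<? extends Serializable> getPartitions(CONF linkAndJobConfig, int maxPartitions);
}

// (E) Extract: read the records that belong to one partition.
interface SimpleExtractor<CONF, RECORD> {
  Iterable<RECORD> extract(CONF linkAndJobConfig, Serializable partition);
}

// (L) Load: write records into the TO data source.
interface SimpleLoader<CONF, RECORD> {
  void load(CONF linkAndJobConfig, Iterable<RECORD> records);
}

// Note: no (T) transform hook here, which is exactly the gap the rest of the talk addresses.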
Sqoop Job
• Creating a Job
• Job Submission
• Job Execution
Let's talk about a MySQL to Kafka example
Create Job
• Create LINKs
• Populate the FROM link Config and create the FROM LINK
• Populate the TO link Config and create the TO LINK
Create MySQL link
Create Kafka link
Create Job
• Create a JOB associating the FROM and TO LINKS
• Populate the FROM and TO Job Configs
• Populate the Driver Config, such as parallelism for extract and load (numExtractors, numLoaders)
Add MySQL From Config
Add Kafka To Config
Create Job API
public static void createJob(String[] jobconfigs) {
  CommandLine cArgs = parseArgs(createOptions(), jobconfigs);
  // Create the FROM and TO links, then the job that associates them
  MLink fromLink = createFromLink("jdbc-connector", jobconfigs);
  MLink toLink = createToLink("kafka-connector", jobconfigs);
  MJob sqoopJob = createJob(fromLink, toLink, jobconfigs);
}
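The createFromLink / createToLink helpers are not expanded in the deck. As a hedged sketch, populating a JDBC FROM link against the Sqoop 2 client API might look roughly like the snippet below; the connector name and the config input keys (linkConfig.connectionString and friends) are assumptions that vary across Sqoop 2 (1.99.x) versions, so check your connector's own config metadata for the exact names.

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MLink;

// Hedged sketch of a createFromLink-style helper. Connector name and
// config keys are assumptions, not taken from the talk.
public class LinkHelper {
  public static MLink createFromLink(SqoopClient client, String jdbcString,
                                     String user, String password) {
    MLink fromLink = client.createLink("jdbc-connector");
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.connectionString").setValue(jdbcString);
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.username").setValue(user);
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.password").setValue(password);
    client.saveLink(fromLink);  // persists the link on the Sqoop 2 server
    return fromLink;
  }
}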
Job Submit
• Sqoop uses the MR engine to transfer data between the FROM and TO data sources
• A Hadoop Configuration object is used to pass the FROM/TO and Driver Configs to the MR engine
• Submits the job via the MR client and tracks job status and stats such as counters
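In other words, the submission step packs the configs into a Hadoop Configuration and hands the job to the MR client, roughly as below. The key names are hypothetical placeholders for whatever constants Sqoop 2 actually uses internally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch of the MR submission path; key names are hypothetical
// stand-ins, and the InputFormat/Mapper/Reducer/OutputFormat wiring to
// the connector APIs is omitted.
public class MrSubmitSketch {
  public static boolean submit(String fromConfigJson, String toConfigJson,
                               String driverConfigJson) throws Exception {
    Configuration conf = new Configuration();
    conf.set("sqoop.job.from.config", fromConfigJson);     // FROM configs
    conf.set("sqoop.job.to.config", toConfigJson);          // TO configs
    conf.set("sqoop.job.driver.config", driverConfigJson);  // driver configs (parallelism, ...)

    Job job = Job.getInstance(conf, "sqoop-transfer");
    boolean success = job.waitForCompletion(true);  // blocks, reports status
    System.out.println(job.getCounters());          // stats such as counters
    return success;
  }
}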
Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
Remember!
Job Execution
• InputFormat/Splits for Partitioning
  • Invokes the FROM Partition API
• Mappers for Extraction
  • Invokes the FROM Extract API
• Reducers for Loading
  • Invokes the TO Load API
• OutputFormat for Commits/Aborts
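To make the mapping concrete, the extraction side boils down to a map task that rehydrates its partition from the input split and drives the connector's Extract API. The sketch below uses hypothetical types; the record and partition handling are stand-ins, not Sqoop's real internals.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hedged sketch: the map task is a thin shell around the FROM Extract API.
public class ExtractMapperSketch
    extends Mapper<NullWritable, Text, Text, NullWritable> {

  @Override
  protected void map(NullWritable key, Text serializedPartition, Context context)
      throws IOException, InterruptedException {
    // Rebuild the partition handed out by the Partition API, ask the FROM
    // connector to extract its records, and emit them so the reduce side
    // can invoke the TO Load API.
    for (String record : extractRecords(serializedPartition.toString())) {
      context.write(new Text(record), NullWritable.get());
    }
  }

  // Placeholder for "invoke FROM Extract API"; not a real Sqoop call.
  private Iterable<String> extractRecords(String partition) {
    return java.util.Collections.singletonList(partition);
  }
}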
So What’s the Scoop?
It turns out…
• Sqoop 2 supports a pluggable Execution Engine
• Why not replace MR with Spark for parallelism?
• Why not extend the Connector APIs to support simple (T) transformations along with (EL)?
Why Apache Spark?
• Why not? Data pipelines can be expressed as Spark jobs
• Speed is a feature! Faster than MapReduce
• A growing community is embracing Apache Spark
• Low effort: less than a few weeks to build a POC
• EL to ETL: nifty transformations can be added easily
Let's talk about the SQOOP on SPARK implementation!
Spark Sqoop Job
• Creating a Job
• Job Submission
• Job Execution
Create Sqoop Spark Job
• Create a SparkContext from the relevant configs
• Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..), which wraps both Sqoop and Spark initialization
• As before, create a Sqoop Job with the createJob API
• Invoke SqoopSparkJob.execute(conf, context)
public class SqoopJDBCHDFSJobDriver {
  public static void main(String[] args) {
    final SqoopSparkJob sparkJob = new SqoopSparkJob();
    CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
    // 1. Initialize Sqoop and Spark, and create the SparkContext
    SparkConf conf = sparkJob.init(cArgs);
    JavaSparkContext context = new JavaSparkContext(conf);
    // 2. Create the FROM (JDBC) and TO (HDFS) links
    MLink fromLink = getJDBCLink();
    MLink toLink = getHDFSLink();
    // 3. Create the Sqoop job associating the two links
    MJob sqoopJob = createJob(fromLink, toLink);
    sparkJob.setJob(sqoopJob);
    // 4. Execute the job on Spark
    sparkJob.execute(conf, context);
  }
}
Spark Job Submission
• We explored a few options!
• Invoke Spark in-process within the Sqoop Server to execute the job
• Use the Remote Spark Context used by Hive-on-Spark to submit
• Use the Sqoop Job as a driver for the spark-submit command
Spark Job Submission
• Build an "uber.jar" with the driver and all the Sqoop dependencies
• Submit the driver program to the YARN client, either programmatically using the Spark YARN client or directly via the command line
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver \
    --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ \
    --jdbcString jdbc://myhost:3306/test --u uber --p hadoop \
    --outputDir hdfs://path/to/output --numE 4 --numL 4
Spark Job Execution
• 3 main stages
• Obtain containers for parallel execution by simply converting the job's partitions into an RDD
• The Partition API determines parallelism; a Map stage uses the Extract API to read records
• Another Map stage uses the Load API to write records
Spark Job Execution
SqoopSparkJob.execute(…) {
  // 1. Convert the job's partitions into an RDD for parallel execution
  List<Partition> sp = getPartitions(request, numMappers);
  JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());
  // 2. Map stage: invoke the Extract API on each partition
  JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
      partitionRDD.map(new SqoopExtractFunction(request));
  // 3. Map stage: invoke the Load API on the extracted records
  extractRDD.map(new SqoopLoadFunction(request)).collect();
}
Spark Job Execution
• We chose to have two map stages for a reason
• Load parallelism can differ from Extract parallelism; for instance, we may need to restrict the TO side based on the number of Kafka partitions on the topic
• We can repartition before we invoke the Load stage (see the sketch below)
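Building on the execute() body above, a hedged sketch of that repartition step; the partition count here is a placeholder that would really come from the TO job config.

// Hedged sketch: shrink (or grow) load parallelism independently of
// extract parallelism before invoking the Load stage.
int numKafkaPartitions = 4;  // assumed; derive from the TO job config in practice

JavaRDD<List<IntermediateDataFormat<?>>> loadInputRDD =
    extractRDD.repartition(numKafkaPartitions);

loadInputRDD.map(new SqoopLoadFunction(request)).collect();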
Micro Benchmark -> MySQL to HDFS
• Table w/ 300K records, numExtractors = numLoaders
• Table w/ 2.8M records, numExtractors = numLoaders (good partitioning!)
What was Easy?
• Reusing existing Connectors; NO changes to the Connector API required
• Built-in support for Standalone and Cluster mode for quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
What was not Easy?
• No clean Spark job-submit API that provides job statistics; we used the YARN UI for job status and health
• We had to convert a bunch of Sqoop core classes, such as the IDF (the internal representation for transferred records), to be serializable
• Managing Hadoop and Spark dependencies together, and the resulting ClassNotFound (CNF) issues, caused some pain
Next Steps!
• Explore alternative ways for Spark Sqoop Job submission
• Expose Spark job stats, such as accumulators, in the submission history
• Proposed Connector Filter API (cleaning, data masking); see the sketch after this list
• We want to work with the Sqoop community to merge this back if it's useful
• https://github.com/vybs/sqoop-on-spark
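To make the proposed Filter API concrete: with the two-map-stage layout above, a transform could slot in as one more map over the extracted RDD. The sketch below is purely illustrative; the class name, the List<String> record shape, and the masking logic are assumptions, not part of Sqoop today.

import java.util.List;
import org.apache.spark.api.java.function.Function;

// Hypothetical filter/transform stage between Extract and Load; records
// are simplified to List<String> for illustration.
class MaskColumnFunction implements Function<List<String>, List<String>> {
  private final int columnToMask;

  MaskColumnFunction(int columnToMask) {
    this.columnToMask = columnToMask;
  }

  @Override
  public List<String> call(List<String> record) {
    record.set(columnToMask, "****");  // simple data-masking example
    return record;
  }
}

// Usage between the extract and load map stages (names follow the earlier sketch):
// extractRDD.map(new MaskColumnFunction(2)).map(new SqoopLoadFunction(request)).collect();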
Questions!
• Apache Sqoop Project - sqoop.apache.org
• Apache Spark Project - spark.apache.org
• Thanks to the Folks @Cloudera and @Uber !!!
• You can reach us @vybs, @byte_array
