SQOOP on SPARK
for Data Ingestion
Veena Basavaraj & Vinoth Chandar
@Uber
Currently @Uber on streaming systems. Previously @Cloudera on ingestion for Hadoop and @LinkedIn on front-end service infra.
2
Currently @Uber, focused on building a real-time pipeline for ingestion to Hadoop. Previously @LinkedIn, lead on Voldemort. In the past, worked on log-based replication, HPC, and stream processing.
Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
3
Sqoop Before
4
SQL → HADOOP
Data Ingestion
• Data Ingestion needs evolved
– Non-SQL data sources
– Messaging Systems as data sources
– Multi-stage pipeline
5
Sqoop Now
• Generic Data Transfer Service
  – FROM ANY source
  – TO ANY target
6
FROM: MYSQL, HDFS, FTP, KAFKA
TO: KAFKA, MONGO, HDFS, MEMSQL
Sqoop How?
• Connectors represent pluggable data sources
• Connectors are configurable
  – LINK configs
  – JOB configs (config sketch after this slide)
7
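For context, here is a rough sketch of how LINK and JOB configs fit together, modeled loosely on the Sqoop 2 client API; the class names, method signatures, and config input keys below (SqoopClient, MLink, MJob, createLink, createJob, and the string keys) are approximations from memory, not the exact API.

  // Rough sketch, not the exact Sqoop 2 API: illustrates the LINK vs JOB split.
  // A LINK captures how to reach a data store (endpoint, credentials);
  // a JOB pairs a FROM link with a TO link plus per-run settings.
  SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

  // LINK config: connection-level settings for the JDBC source
  MLink fromLink = client.createLink("generic-jdbc-connector");
  fromLink.getConnectorLinkConfig()
          .getStringInput("linkConfig.connectionString")
          .setValue("jdbc:mysql://myhost:3306/test");
  client.saveLink(fromLink);

  // LINK config: connection-level settings for the HDFS target
  MLink toLink = client.createLink("hdfs-connector");
  client.saveLink(toLink);

  // JOB config: what this particular job moves (table, output dir, parallelism)
  MJob job = client.createJob(fromLink.getName(), toLink.getName());
  job.getFromJobConfig().getStringInput("fromJobConfig.tableName").setValue("test_table");
  client.saveJob(job);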
Sqoop Connector API
8
Source → Target: partition() → extract() → load()
**No Transform (T) stage yet! (illustrative sketch below)
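To make the contract concrete, here is a minimal illustration of the three-phase shape in Java; these interfaces are simplified stand-ins, not Sqoop's actual SPI classes (Partitioner, Extractor, Loader), which take richer context and configuration parameters.

  // Illustrative only: simplified stand-ins for Sqoop's Partitioner/Extractor/Loader SPI.
  interface SqoopPartitioner<P> {
      // Split the FROM side into independent chunks that can be read in parallel.
      java.util.List<P> partition(int desiredParallelism);
  }

  interface SqoopExtractor<P, R> {
      // Read one partition from the source, emitting records in an intermediate format.
      Iterable<R> extract(P partition);
  }

  interface SqoopLoader<R> {
      // Write a stream of intermediate-format records into the target.
      void load(Iterable<R> records);
  }
  // Note: no transform() hook here, matching the "no T stage yet" caveat above.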
Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
9
It turns out…
• MapReduce is slow!
• We need Connector APIs to support (T) transformations, not just EL
• Good news: the execution engine is also pluggable
10
Why Apache Spark?
• Why not? ETL expressed as Spark jobs
• Faster than MapReduce
• Growing community embracing Apache Spark
11
Why Not Use Spark Data Sources?
12
Sure we can! But…
Why Not Spark Data Sources?
• The Spark Data Sources API is a recent addition
• Existing MR Sqoop jobs can run on Spark with a simple config change
• We can leverage incremental EL & job management within Sqoop
Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
14
Sqoop on Spark
• Creating a Job
• Job Submission
• Job Execution
15
Sqoop Job API
• Create a Sqoop Job
  – Create FROM and TO job configs
  – Create a JOB associating the FROM and TO configs
• SparkContext holds Sqoop Jobs
• Invoke SqoopSparkJob.execute(conf, context) (wiring sketch below)
16
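Pulling those pieces together, here is a hedged sketch of what driving a Sqoop job from Spark might look like; SqoopSparkJob, SqoopSparkJobConf, and their methods follow the naming on this slide and in the prototype repo, and should be read as assumptions rather than a released API.

  // Sketch only: SqoopSparkJob / SqoopSparkJobConf and their setters are assumed names.
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SqoopOnSparkExample {
      public static void main(String[] args) {
          // The SparkContext that "holds" the Sqoop job.
          SparkConf sparkConf = new SparkConf().setAppName("sqoop-jdbc-to-hdfs");
          JavaSparkContext context = new JavaSparkContext(sparkConf);

          // Assumed config object pairing a FROM (e.g. JDBC) config with a TO (e.g. HDFS) config.
          SqoopSparkJobConf conf = new SqoopSparkJobConf();
          conf.setConfDir(args[0]);   // assumed: Sqoop server conf dir, cf. --confDir later
          conf.setNumExtractors(4);   // assumed: parallelism knobs, cf. --numE / --numL later
          conf.setNumLoaders(4);

          // Hand the job to Spark for execution, as on the slide.
          SqoopSparkJob.execute(conf, context);

          context.stop();
      }
  }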
Spark Job Submission
• We explored a few options:
– Invoke Spark in process within the Sqoop Server to
execute the job
– Use Remote Spark Context used by Hive on Spark to
submit
– Sqoop Job as a driver for the Spark submit command
17
Spark Job Submission
• Build an “uber jar” with the driver and all the Sqoop dependencies
• Submit the driver program to YARN, either programmatically using the Spark YARN client (a non-public API) or directly via the spark-submit command line:

  bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver \
    --master yarn /path/to/uber.jar \
    --confDir /path/to/sqoop/server/conf/ \
    --jdbcString jdbc://myhost:3306/test \
    --u uber --p hadoop \
    --outputDir hdfs://path/to/output \
    --numE 4 --numL 4
18
Spark Job Execution
19
MySQL → partitionRDD → .map() → extractRDD → .map() → loadRDD.collect()
Spark Job Execution
SqoopSparkJob.execute(…)
20

  // 1. Compute input splits and parallelize them into an RDD of partitions
  List<Partition> sp = getPartitions(request, numMappers);
  JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());

  // 2. Extract each partition from the source into Sqoop's intermediate data format
  JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
      partitionRDD.map(new SqoopExtractFunction(request));

  // 3. Load the extracted records into the target; collect() forces execution
  extractRDD.map(new SqoopLoadFunction(request)).collect();
Spark Job Execution
21
MySQL → Compute Partitions → .map(): Extract from MySQL → .repartition(): Repartition to limit files on HDFS → .mapPartition(): Load into HDFS
(sketch below)
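To make the repartition step concrete, here is a sketch extending the code from slide 20; SqoopExtractFunction, SqoopLoadFunction, request, and numLoaders come from (or are assumed alongside) the prototype, and while the slide shows loading via mapPartition(), this sketch reuses the per-record map from slide 20 for brevity.

  // Sketch extending slide 20: repartition() caps the number of load tasks,
  // and therefore the number of output files written to HDFS.
  JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());

  JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
      partitionRDD.map(new SqoopExtractFunction(request));   // extract from MySQL

  extractRDD
      .repartition(numLoaders)                // assumed knob limiting files on HDFS
      .map(new SqoopLoadFunction(request))    // load into HDFS (prototype uses mapPartitions)
      .collect();                             // collect() forces the whole pipeline to run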
Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
22
Micro Benchmark: MySQL to HDFS
23
Chart: Table w/ 300K records, numExtractors = numLoaders
Micro Benchmark: MySQL to HDFS
24
Chart: Table w/ 2.8M records, numExtractors = numLoaders (good partitioning!)
What was Easy?
• NO changes to the Connector API were required
• Built-in support for standalone and YARN cluster modes allowed quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
25
What was not Easy?
• No clean public Spark job submission API; we relied on the YARN UI for job status and health
• A bunch of Sqoop core classes, such as IDF, had to be made serializable
• Managing Hadoop and Spark dependencies together in Sqoop caused some pain
26
Next Steps!
• Explore alternative ways for Spark Sqoop
Job Submission with Spark 1.4 additions
• Connector Filter API (filter, data masking)
• SQOOP-1532
– https://github.com/vybs/sqoop-on-spark
27
Sqoop Connector ETL
28
Source → Target: partition() → extract() → transform() → load()
**With Transform (T) stage! (sketch below)
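Continuing the illustrative interfaces from slide 8, this is roughly what the contract looks like once a Transform stage sits between extract() and load(); again a simplified stand-in, not the exact Connector Filter API mentioned on the previous slide.

  // Illustrative only: the three-phase contract from slide 8 plus a transform hook.
  interface SqoopTransformer<R> {
      // Runs between extract() and load(): filtering, data masking, projection, etc.
      R transform(R record);
  }
  // The pipeline becomes partition() -> extract() -> transform() -> load(), e.g.
  //   extractRDD.map(new SqoopTransformFunction(request))   // assumed name
  //             .map(new SqoopLoadFunction(request))
  //             .collect();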
Questions!
• Thanks to the folks @Cloudera and @Uber!
• You can reach us @vybs, @byte_array
29