Sqoop on Spark for Data Ingestion
Presented at Hadoop Summit 2015
  1. SQOOP on SPARK for Data Ingestion - Veena Basavaraj & Vinoth Chandar, @Uber
  2. Works currently @Uber, focused on building a real-time pipeline for ingestion to Hadoop for batch and stream processing; previously @LinkedIn lead on Voldemort and @Oracle focused on log-based replication, HPC and stream processing.
     Works currently @Uber on streaming systems; prior to this worked @Cloudera on ingestion for Hadoop and @LinkedIn on frontend and service infra.
  3. Agenda
     • Data Ingestion Today
     • Introduction to Apache Sqoop2
     • Sqoop Jobs on Apache Spark
     • Insights & Next Steps
  4. In the beginning…
  5. Data Ingestion Tool
     • Primary need: transferring data from SQL to HADOOP
     • SQOOP solved it well!
  6. Ingestion needs evolved…
  7. Data Ingestion Pipeline
     • Ingestion pipeline can now have:
       • Non-SQL data sources
       • Messaging systems as data sources
       • Multi-stage pipelines
  8. Sqoop 2
     • Generic Data Transfer Service
       • FROM: egress data out from ANY source
       • TO: ingress data into ANY source
     • Pluggable Data Sources
     • Server-Client Design
  9. Sqoop 2
     • CONNECTOR
     • JOB
  10. Connector
      • Connectors represent Data Sources
  11. Connector
      • Data Source properties represented via Configs
        • LINK config to connect to the data source
        • JOB config to read/write data from the data source
  12. Connector
      • Data Source properties represented via Configs
        • LINK config to connect to the data source
        • JOB config to read/write data from the data source
  13. Connector API
      • Pluggable Connector API implemented by Connectors
        • Partition (P) API for parallelism
        • (E) Extract API to egress data
        • (L) Load API to ingress data
        • No (T) Transform yet!
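      To make the split of responsibilities concrete, here is a deliberately simplified model of the three hooks. These interfaces are a sketch only and are not the actual Sqoop 2 SPI signatures (the real classes live in org.apache.sqoop.job.etl and also receive link/job configuration plus a context object).

        // Simplified model of the connector hooks, not the real Sqoop 2 SPI.
        import java.io.Serializable;
        import java.util.List;

        interface SimplePartitioner<P extends Serializable> {
          // (P) split the data source into chunks that can be processed in parallel
          List<P> getPartitions(int maxPartitions);
        }

        interface SimpleExtractor<P extends Serializable, R> {
          // (E) read records for one partition out of the FROM source
          Iterable<R> extract(P partition);
        }

        interface SimpleLoader<R> {
          // (L) write a stream of records into the TO source
          void load(Iterable<R> records);
        }
        // No (T) transform hook yet; that gap is exactly what the Spark work revisits.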
  14. Sqoop Job
      • Creating a Job
      • Job Submission
      • Job Execution
  15. Let's talk about the MYSQL to KAFKA example
  16. Create Job
      • Create LINKs
        • Populate FROM link Config and create FROM LINK
        • Populate TO link Config and create TO LINK
  17. Create Job
      • Create LINKs
        • Populate FROM link Config and create FROM LINK (create MySQL link)
        • Populate TO link Config and create TO LINK (create Kafka link)
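      For reference, creating these links through the Sqoop 2 Java client API might look roughly like the sketch below. It follows the documented 1.99.x client usage, but connector names, method signatures and config input keys vary between Sqoop 2 versions, so treat the details as illustrative rather than exact.

        import org.apache.sqoop.client.SqoopClient;
        import org.apache.sqoop.model.MLink;

        public class CreateLinksSketch {
          static MLink createMySqlFromLink() {
            // The client talks to the Sqoop 2 server over REST.
            SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

            // LINK config: how to connect to the FROM data source.
            MLink fromLink = client.createLink("generic-jdbc-connector");
            fromLink.setName("mysql-link");
            fromLink.getConnectorLinkConfig()
                    .getStringInput("linkConfig.connectionString")
                    .setValue("jdbc:mysql://mysql-host:3306/test");
            fromLink.getConnectorLinkConfig()
                    .getStringInput("linkConfig.jdbcDriver")
                    .setValue("com.mysql.jdbc.Driver");
            client.saveLink(fromLink);

            // The Kafka TO link is created the same way from the kafka-connector,
            // filling in its broker/ZooKeeper link inputs.
            return fromLink;
          }
        }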
  18. Create Job
      • Create JOB associating FROM and TO LINKS
      • Populate the FROM and TO Job Config
      • Populate Driver Config such as parallelism for extract and load (numExtractors, numLoaders)
  19. Create Job
      • Create JOB associating FROM and TO LINKS
      • Populate the FROM and TO Job Config (add MySQL From Config, add Kafka To Config)
      • Populate Driver Config such as parallelism for extract and load (numExtractors, numLoaders)
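      The corresponding job creation and driver-level parallelism could then look like the following rough sketch against the same client API; input names such as throttlingConfig.numExtractors and the table/topic keys follow the 1.99.x docs and may differ by version and connector.

        import org.apache.sqoop.client.SqoopClient;
        import org.apache.sqoop.model.MJob;
        import org.apache.sqoop.model.MLink;

        public class CreateJobSketch {
          static MJob createMySqlToKafkaJob(SqoopClient client, MLink fromLink, MLink toLink) {
            // JOB associates the FROM link with the TO link.
            MJob job = client.createJob(fromLink.getName(), toLink.getName());
            job.setName("mysql-to-kafka");

            // FROM/TO job configs: what to read and where to write
            // (exact input keys depend on the connector and Sqoop 2 version).
            job.getFromJobConfig().getStringInput("fromJobConfig.tableName").setValue("users");
            job.getToJobConfig().getStringInput("toJobConfig.topic").setValue("user-events");

            // Driver config: extract/load parallelism.
            job.getDriverConfig().getIntegerInput("throttlingConfig.numExtractors").setValue(4);
            job.getDriverConfig().getIntegerInput("throttlingConfig.numLoaders").setValue(4);

            client.saveJob(job);
            return job;
          }
        }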
  20. Create Job API
      public static void createJob(String[] jobconfigs) {
        CommandLine cArgs = parseArgs(createOptions(), jobconfigs);
        MLink fromLink = createFromLink("jdbc-connector", jobconfigs);
        MLink toLink = createToLink("kafka-connector", jobconfigs);
        MJob sqoopJob = createJob(fromLink, toLink, jobconfigs);
      }
  21. Job Submit
      • Sqoop uses the MR engine to transfer data between the FROM and TO data sources
      • A Hadoop Configuration object is used to pass FROM/TO and Driver Configs to the MR engine
      • Submits the Job via MR-client and tracks job status and stats such as counters
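      A hedged sketch of that submission pattern using the stock Hadoop MapReduce client API; the configuration key names below are placeholders for illustration, not Sqoop's actual keys.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Counters;
        import org.apache.hadoop.mapreduce.Job;

        public class MRSubmitSketch {
          static void submit(String fromConfigJson, String toConfigJson, String driverConfigJson)
              throws Exception {
            Configuration conf = new Configuration();
            // Serialized FROM/TO/driver configs ride along on the Hadoop Configuration.
            conf.set("sqoop.job.from.config", fromConfigJson);
            conf.set("sqoop.job.to.config", toConfigJson);
            conf.set("sqoop.job.driver.config", driverConfigJson);

            Job job = Job.getInstance(conf, "sqoop-mysql-to-kafka");
            // InputFormat/Mapper/Reducer/OutputFormat wiring omitted; see the next slides.
            job.waitForCompletion(true);            // submit via the MR client and wait
            Counters counters = job.getCounters();  // job stats such as record counters
          }
        }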
  22. Connector API (Remember!)
      • Pluggable Connector API implemented by Connectors
        • Partition (P) API for parallelism
        • (E) Extract API to egress data
        • (L) Load API to ingress data
        • No (T) Transform yet!
  23. Job Execution
      • InputFormat/Splits for Partitioning
        • Invokes FROM Partition API
      • Mappers for Extraction
        • Invokes FROM Extract API
      • Reducers for Loading
        • Invokes TO Load API
      • OutputFormat for Commits/Aborts
  24. So What’s the Scoop?
  25. So What’s the Scoop?
  26. It turns out…
      • Sqoop 2 supports a pluggable Execution Engine
      • Why not replace MR with Spark for parallelism?
      • Why not extend the Connector APIs to support simple (T) transformations along with (EL)?
  27. Why Apache Spark?
      • Why not? Data pipelines can be expressed as Spark jobs
      • Speed is a feature! Faster than MapReduce
      • Growing community embracing Apache Spark
      • Low effort: less than a few weeks to build a POC
      • EL to ETL: nifty transformations can be easily added
  28. Let's talk SQOOP on SPARK implementation!
  29. Spark Sqoop Job
      • Creating a Job
      • Job Submission
      • Job Execution
  30. Create Sqoop Spark Job
      • Create a SparkContext from the relevant configs
      • Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..), which wraps both Sqoop and Spark initialization
      • As before, create a Sqoop Job with the createJob API
      • Invoke SqoopSparkJob.execute(conf, context)
  31. Create Sqoop Spark Job
      public class SqoopJDBCHDFSJobDriver {
        public static void main(String[] args) {
          final SqoopSparkJob sparkJob = new SqoopSparkJob();
          CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
          SparkConf conf = sparkJob.init(cArgs);
          JavaSparkContext context = new JavaSparkContext(conf);
          MLink fromLink = getJDBCLink();
          MLink toLink = getHDFSLink();
          MJob sqoopJob = createJob(fromLink, toLink);
          sparkJob.setJob(sqoopJob);
          sparkJob.execute(conf, context);
        }
      }
  32. Spark Job Submission
      • We explored a few options:
        • Invoke Spark in-process within the Sqoop Server to execute the job
        • Use the Remote Spark Context used by Hive on Spark to submit
        • Run the Sqoop Job as a driver for the spark-submit command
  33. Spark Job Submission
      • Build an "uber.jar" with the driver and all the Sqoop dependencies
      • Programmatically, using the Spark YARN client, or directly via the command line, submit the driver program to YARN:
        bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ --jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir hdfs://path/to/output --numE 4 --numL 4
  34. Spark Job Execution
      • 3 main stages:
        • Obtain containers for parallel execution by simply converting the job's partitions to an RDD
        • The Partition API determines parallelism; a map stage uses the Extract API to read records
        • Another map stage uses the Load API to write records
  35. Spark Job Execution
      SqoopSparkJob.execute(…) {
        List<Partition> sp = getPartitions(request, numMappers);
        JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());
        JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
            partitionRDD.map(new SqoopExtractFunction(request));
        extractRDD.map(new SqoopLoadFunction(request)).collect();
      }
  36. Spark Job Execution
      • We chose to have 2 map stages for a reason:
        • Load parallelism can be different from Extract parallelism; for instance, we may need to restrict the TO side based on the number of Kafka partitions on the topic
        • We can repartition before we invoke the Load stage
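      As a sketch of that repartitioning, continuing the execute(…) snippet from slide 35 (numLoaders is assumed here to come from the driver config, e.g. capped at the Kafka topic's partition count):

        // Decouple load parallelism from extract parallelism by repartitioning
        // the extracted records before the load map stage.
        JavaRDD<List<IntermediateDataFormat<?>>> loadInput = extractRDD.repartition(numLoaders);
        loadInput.map(new SqoopLoadFunction(request)).collect();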
  37. Micro Benchmark: MySQL to HDFS
      Table w/ 300K records, numExtractors = numLoaders
  38. Micro Benchmark: MySQL to HDFS
      Table w/ 2.8M records, numExtractors = numLoaders
      Good partitioning!!
  39. What was Easy?
      • Reusing existing Connectors; NO changes to the Connector API required
      • Built-in support for Standalone and Cluster mode for quick end-to-end testing and faster iteration
      • Scheduling Spark Sqoop jobs via Oozie
  40. What was not Easy?
      • No clean Spark job submit API that provides job statistics; we used the YARN UI for job status and health
      • We had to convert a bunch of Sqoop core classes, such as the IDF (the internal representation for records transferred), to be serializable
      • Managing Hadoop and Spark dependencies together; ClassNotFound (CNF) errors caused some pain
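      As an illustration of the serialization point, using an invented stand-in class rather than the actual Sqoop patch: anything referenced from the Spark map functions or carried inside the RDD has to be Java-serializable.

        // Invented stand-in for the kind of change required; not Sqoop's real IDF class.
        public class RecordHolder implements java.io.Serializable {
          private static final long serialVersionUID = 1L;
          private String csvText;                          // simplified record payload
          public String getCsvText() { return csvText; }
          public void setCsvText(String text) { this.csvText = text; }
        }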
  41. Next Steps!
      • Explore alternative ways for Spark Sqoop Job submission
      • Expose Spark job stats, such as accumulators, in the submission history
      • Proposed Connector Filter API (cleaning, data masking)
      • We want to work with the Sqoop community to merge this back if it's useful
      • https://github.com/vybs/sqoop-on-spark
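      The proposed filter hook could plausibly take a shape like the sketch below; the interface and class names are invented purely to illustrate the idea and are not part of Sqoop today.

        // Hypothetical shape of the proposed Connector Filter API (names invented).
        public interface RecordFilter extends java.io.Serializable {
          Object[] apply(Object[] fields);   // per-record cleaning or masking
        }

        // Example: mask a PII column in flight, between Extract and Load.
        class MaskColumnFilter implements RecordFilter {
          private final int column;
          MaskColumnFilter(int column) { this.column = column; }
          @Override public Object[] apply(Object[] fields) {
            Object[] out = fields.clone();
            out[column] = "***masked***";
            return out;
          }
        }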
  42. Questions!
      • Apache Sqoop Project - sqoop.apache.org
      • Apache Spark Project - spark.apache.org
      • Thanks to the folks @Cloudera and @Uber!!!
      • You can reach us @vybs, @byte_array
