
Open Source Big Data Ingestion - Without the Heartburn!

Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, NiFi and StreamSets can keep the data pipeline flowing.



  1. Open Source Big Data Ingestion - Without the Heartburn! Pat Patterson, Community Champion, @metadaddy, pat@streamsets.com
  2. Agenda: The Ingest Problem; Apache Flume; Apache Sqoop; Apache NiFi; StreamSets Data Collector; Demo
  3. Big Data Ingest: Volume, Variety, Velocity, Veracity
  4. Base Case - Custom Code: 'free' like a puppy - difficulty, fragility, maintenance
  5. Apache Flume: Originated at Cloudera. Inspired by Facebook Scribe - open source log aggregation. A decentralized, distributed system of independent agents. 'Off-cluster' only. Opaque, record-oriented payload - byte arrays.
  6. Apache Flume: a Flume Agent carries incoming data to outgoing data through a Source, Channel and Sink, with Interceptors in between. Source - accepts incoming data, scales as required, writes data to the Channel. Interceptor - modifies/drops events in-flight. Channel - stores data in the order received. Sink - removes data from the Channel and sends it to a downstream Agent or Destination. (A minimal interceptor configuration sketch follows the transcript.)
  7. Apache Flume: agents can be chained from source to destination (Log → Flume Agent → Flume Agent → Flume Agent → HDFS). Works well for managing impedance mismatches between source and sink - smoothing out spikes in load. (An agent-chaining configuration sketch follows the transcript.)
  8. Apache Flume - example configuration (a note on running this agent follows the transcript):

     # example.conf: A single-node Flume configuration

     # Name the components on this agent
     a1.sources = r1
     a1.sinks = k1
     a1.channels = c1

     # Describe/configure the source
     a1.sources.r1.type = netcat
     a1.sources.r1.bind = localhost
     a1.sources.r1.port = 44444

     # Describe the sink
     a1.sinks.k1.type = logger

     # Use a channel which buffers events in memory
     a1.channels.c1.type = memory
     a1.channels.c1.capacity = 1000
     a1.channels.c1.transactionCapacity = 100

     # Bind the source and sink to the channel
     a1.sources.r1.channels = c1
     a1.sinks.k1.channel = c1
  9. Apache Flume drawbacks: combinatorial explosion of agents with tasks and record formats; contextual processing is hard; configuration validation is hard; no overall view of the system. Version 1.0 released January 2012; latest version (1.6) released May 2015.
  10. Apache Sqoop: Originated at Cloudera. Transfers bulk data between an RDBMS and Hadoop. Command-line tool. Breaks a table/query into 'n' partitions. 'On-cluster' - runs as a 'map-only' job in MapReduce. 'High-Speed Connectors' can take advantage of low-level database features - Teradata, Exadata, Netezza, etc.
  11. Apache Sqoop - example invocation:

      $ sqoop import-all-tables \
          -m {{cluster_data.worker_node_hostname.length}} \
          --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
          --username=retail_dba \
          --password=cloudera \
          --compression-codec=snappy \
          --as-parquetfile \
          --warehouse-dir=/user/hive/warehouse \
          --hive-import
  12. Apache Sqoop drawbacks: batch mode only; database credentials go on the command line, or get shipped around in the MapReduce configuration (a sketch of keeping the password off the command line follows the transcript). Version 1.0.0 released June 2010; latest version (1.4.6) released April 2015. Sqoop 2 was proposed in SQOOP-365 in October 2011; its latest release (1.99.6, May 2015) is still 'not intended for production deployment'.
  13. Apache NiFi: Originated at the NSA as 'Niagarafiles'. Open sourced December 2014; Apache TLP July 2015. Opaque, file-oriented payload. Distributed system of processors with centralized control. Based on flow-based programming concepts. Data provenance. Web-based user interface.
  14. Apache NiFi (web UI screenshot).
  15. Apache NiFi drawbacks: opaque files lead to the same combinatorial explosion as Flume (ConvertAvroToJSON, ConvertCSVToAvro, ConvertJSONToAvro, ConvertJSONToSQL, ...). Not really big-data native: breaks the principle of data locality and operates as its own cluster.
  16. StreamSets Data Collector: Founded by ex-Cloudera and ex-Informatica employees. Continuous, open source, intent-driven big data ingest. A visible, record-oriented approach fixes the combinatorial explosion. Batch or stream processing. Runs standalone, on a Spark cluster, or on a MapReduce cluster. IDE for pipeline development by 'civilians'. SDK for custom components (origin/processor/destination).
  17. StreamSets Data Collector (screenshot).
  18. StreamSets Data Collector caveats: relatively new - first public release September 2015; so far, the vast majority of commits are from StreamSets staff.
  19. SDC Demo: Apache Kafka → StreamSets Data Collector → Apache Kudu
  20. Conclusion: Flume - good for smoothing out impedance mismatches, but complex to deploy and maintain. Sqoop - good for database dumps, but not enterprise-friendly. NiFi - good for file-oriented flows, but not really big-data oriented. StreamSets Data Collector - good for continuous ingest pipelines, but a relative newcomer.
  21. Thank You! Pat Patterson, Community Champion, @metadaddy, pat@streamsets.com
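
Interceptor configuration sketch (slide 6). A minimal example of attaching an interceptor to the agent from slide 8, using Flume's built-in timestamp interceptor; the component names (a1, r1, i1) follow the slides' convention, and the interceptor name i1 is illustrative:

    # Attach a chain of one interceptor to source r1
    a1.sources.r1.interceptors = i1
    # Built-in interceptor: adds an ingest timestamp to each event's headers
    a1.sources.r1.interceptors.i1.type = timestamp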
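
Agent-chaining sketch (slide 7). Agents are typically chained with an Avro sink on the upstream agent feeding an Avro source on the downstream one; the hostname and port here are illustrative:

    # Upstream agent: forward events to the next hop over Avro RPC
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = collector.example.com
    a1.sinks.k1.port = 4545
    a1.sinks.k1.channel = c1

    # Downstream agent: accept Avro RPC from upstream agents
    a2.sources.r1.type = avro
    a2.sources.r1.bind = 0.0.0.0
    a2.sources.r1.port = 4545
    a2.sources.r1.channels = c1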
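
Running the slide 8 example. Assuming a standard Flume distribution with the configuration saved as conf/example.conf, start the agent and send it a test line over the netcat source's port:

    $ bin/flume-ng agent --conf conf --conf-file conf/example.conf \
        --name a1 -Dflume.root.logger=INFO,console
    $ telnet localhost 44444
    Hello, Flume!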
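
Keeping credentials off the command line (slide 12). Recent Sqoop 1.4.x releases support --password-file, which reads the password from a file (on HDFS by default) instead of taking it as a command-line argument; the connect string, table and paths below are illustrative:

    $ sqoop import \
        --connect jdbc:mysql://db.example.com:3306/retail_db \
        --username retail_dba \
        --password-file /user/retail_dba/.db-password \
        --table orders \
        --target-dir /data/orders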
