Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, NiFi and StreamSets can keep the data pipeline flowing.
5. Originated at Cloudera
Inspired by Facebook Scribe - open source log aggregation
Decentralized, distributed system of independent agents
‘Off-cluster’ only
Opaque, record-oriented payload - byte arrays
Apache Flume
7. Apache Flume
[Diagram: Log -> Flume Agent -> Flume Agent -> Flume Agent -> HDFS]
Works well for managing impedance mismatches between source and sink - smooth out spikes in load
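The channel is what absorbs that mismatch: the source can burst events in faster than the sink drains them, up to the channel's capacity. A minimal Python sketch of this buffering behavior (illustrative only, not Flume code):

```python
from collections import deque

# Illustrative sketch, not Flume's implementation: a bounded in-memory
# channel, as Flume's memory channel behaves, decoupling a bursty
# source from a sink that drains at its own steady pace.
class MemoryChannel:
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        # When the channel fills, the source must back off or fail.
        if len(self.events) >= self.capacity:
            raise OverflowError("channel full")
        self.events.append(event)

    def take(self):
        return self.events.popleft() if self.events else None

channel = MemoryChannel(capacity=1000)

# A spike: the source delivers 100 events at once...
for i in range(100):
    channel.put(f"event-{i}")

# ...while the sink drains only a small batch per cycle.
batch = [channel.take() for _ in range(10)]
print(len(channel.events))  # 90 events still buffered
```

The spike is held in the channel rather than being pushed straight at the sink, which is the impedance-matching role described above.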
8. Apache Flume
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
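The config wires source r1 and sink k1 to channel c1 (the agent is then launched with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`). As a rough sketch of how those bindings fit together, the properties can be parsed and inspected - a toy parser for illustration, not Flume's own:

```python
# Toy parser (not Flume's): read the key = value properties from the
# example config and confirm how source and sink bind to the channel.
conf = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

props = {}
for line in conf.strip().splitlines():
    key, _, value = line.partition("=")
    props[key.strip()] = value.strip()

# Events enter at r1 (netcat on port 44444), buffer in c1, exit via k1.
print(props["a1.sources.r1.channels"])  # c1
print(props["a1.sinks.k1.channel"])     # c1
```

Note the plural `channels` on the source (a source can fan out to several channels) versus the singular `channel` on the sink.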
9. Combinatorial explosion of agents with tasks and record formats
Contextual processing is hard
Configuration validation is hard
No overall view of the system
Version 1.0 released Jan 2012
Latest version (1.6) released May 2015
Apache Flume
10. Originated at Cloudera
Transfer bulk data between RDBMS and Hadoop
Command-line tool
Breaks a table/query into ‘n’ partitions
‘On-cluster’ - runs as a ‘map-only’ job in MapReduce
‘High-Speed Connectors’ can take advantage of low-level database features - Teradata, Exadata, Netezza etc.
Apache Sqoop
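The partitioning works roughly like this: Sqoop finds the min and max of a split column, cuts that range into n slices, and issues one bounded query per map task. A simplified sketch of the idea (not Sqoop's actual splitter):

```python
# Simplified sketch of Sqoop-style split-by partitioning, for
# illustration only: divide the [lo, hi] key range into num_mappers
# roughly equal slices, one WHERE clause per map task.
def split_ranges(lo, hi, num_mappers):
    size = (hi - lo + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        start = lo + round(i * size)
        end = lo + round((i + 1) * size) - 1 if i < num_mappers - 1 else hi
        splits.append((start, end))
    return splits

# e.g. SELECT MIN(id), MAX(id) FROM orders returns (1, 1000); 4 mappers
# ('orders' and 'id' are hypothetical names for the example):
for start, end in split_ranges(1, 1000, 4):
    print(f"SELECT * FROM orders WHERE id >= {start} AND id <= {end}")
```

Each slice becomes one map task, which is why an evenly distributed split column matters: a skewed key leaves some mappers with far more rows than others.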
12. Batch mode only
Database credentials on command line, or shipped around in MapReduce config
Version 1.0.0 released June 2010
Latest version (1.4.6) released April 2015
Sqoop 2 proposed in SQOOP-365, Oct 2011
Latest (1.99.6) released May 2015, still ‘not intended for production deployment’
Apache Sqoop
13. Originated at NSA as Niagarafiles
Open sourced December 2014, Apache TLP July 2015
Opaque, file-oriented payload
Distributed system of processors with centralized control
Based on flow-based programming concepts
Data Provenance
Web-based user interface
Apache NiFi
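The flow-based model plus provenance can be sketched in a few lines: a FlowFile carries an opaque payload and a map of attributes, processors are chained together, and every step is recorded as a provenance event. This is an illustrative toy, not NiFi's API (the processor names echo NiFi's UpdateAttribute/RouteOnAttribute but the code is hypothetical):

```python
from dataclasses import dataclass, field

# Toy model of NiFi's flow-based concepts: content is opaque bytes,
# attributes are metadata, and each hop appends a provenance event.
@dataclass
class FlowFile:
    content: bytes
    attributes: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)

def update_attribute(flowfile, key, value):
    flowfile.attributes[key] = value
    flowfile.provenance.append(("ATTRIBUTES_MODIFIED", key))
    return flowfile

def route_on_attribute(flowfile, key):
    flowfile.provenance.append(("ROUTE", key))
    return "matched" if key in flowfile.attributes else "unmatched"

ff = FlowFile(content=b'{"user": "alice"}')
ff = update_attribute(ff, "mime.type", "application/json")
relationship = route_on_attribute(ff, "mime.type")
print(relationship)                   # matched
print([e[0] for e in ff.provenance])  # ['ATTRIBUTES_MODIFIED', 'ROUTE']
```

The provenance list is the point: because every processor records what it did to each FlowFile, the lineage of any piece of data can be traced back through the flow.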
15. Apache NiFi
Opaque files -> same combinatorial explosion as Flume
ConvertAvroToJSON, ConvertCSVToAvro, ConvertJSONToAvro, ConvertJSONToSQL
Not really big data native
Breaks principle of data locality
Operates as own cluster
16. Founded by ex-Cloudera, Informatica employees
Continuous open source, intent-driven, big data ingest
Visible, record-oriented approach fixes combinatorial explosion
Batch or stream processing
Standalone, Spark cluster, MapReduce cluster
IDE for pipeline development by ‘civilians’
SDK for custom components (origin/processor/destination)
StreamSets Data Collector
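Why a record-oriented payload avoids the combinatorial explosion: if each origin parses its source format into one common record shape, every downstream processor is format-agnostic, so you need n origins plus m processors rather than n×m format-specific conversions. An illustrative sketch (not the StreamSets SDK):

```python
import csv
import io
import json

# Illustrative only: two 'origins' that each normalize a different
# source format into plain dict records.
def csv_origin(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def json_origin(text):
    return [json.loads(line) for line in text.splitlines()]

def mask_field(records, field_name):
    # One processor works on records from ANY origin -
    # no ConvertXToY chain needed.
    for r in records:
        r[field_name] = "***"
    return records

csv_records = mask_field(csv_origin("user,card\nalice,4111-1111\n"), "card")
json_records = mask_field(json_origin('{"user": "bob", "card": "5500-0000"}'), "card")
print(csv_records)   # [{'user': 'alice', 'card': '***'}]
print(json_records)  # [{'user': 'bob', 'card': '***'}]
```

Contrast this with the opaque byte-array or file payloads of Flume and NiFi, where any format-aware step needs a dedicated per-format component.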
20. Flume - good for smoothing out impedance mismatches, but complex to deploy and maintain
Sqoop - good for database dumps, but not enterprise-friendly
NiFi - good for file-oriented flows, but not really big-data oriented
StreamSets Data Collector - good for continuous ingest pipelines, but relative newcomer
Conclusion