Open Source Big Data Ingestion - Without the Heartburn!

Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com

Agenda
The Ingest Problem
Apache Flume
Apache Sqoop
Apache NiFi
StreamSets Data Collector
Demo

Big Data Ingest
Volume
Variety
Velocity
Veracity

Base Case - Custom Code
Free, like a puppy:
Difficulty
Fragility
Maintenance

Apache Flume
Originated at Cloudera
Inspired by Facebook Scribe - open source log aggregation
Decentralized, distributed system of independent agents
‘Off-cluster’ only
Opaque, record-oriented payload - byte arrays

Apache Flume
Flume Agent: Incoming Data -> Source -> Interceptor -> Channel -> Sink -> Outgoing Data
Source
● Accepts incoming Data
● Scales as required
● Writes data to Channel
Interceptor
● Modify/drop events in-flight
Channel
● Stores data in the order received
Sink
● Removes data from Channel
● Sends data to downstream Agent or Destination
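
The agent anatomy above can be sketched in a few lines of Python. This is a toy model, not Flume code: `MemoryChannel`, `drop_blank`, `source_accept` and `sink_drain` are hypothetical names made up for illustration, and the capacity and batch size only loosely mirror a channel's `capacity` and `transactionCapacity` settings.

```python
from collections import deque


class MemoryChannel:
    """Bounded FIFO buffer (hypothetical stand-in for a memory channel)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # A full channel refuses the event; the source must retry or back off.
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self, max_events):
        # Hand out up to max_events in the order they were received.
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch

    def __len__(self):
        return len(self._events)


def drop_blank(event):
    """Interceptor: strip whitespace and drop events that end up empty."""
    stripped = event.strip()
    return stripped or None


def source_accept(channel, incoming, interceptor):
    """Source: accept data, apply the interceptor, write to the channel."""
    rejected = []
    for raw in incoming:
        event = interceptor(raw)
        if event is None:
            continue  # dropped in-flight by the interceptor
        if not channel.put(event):
            rejected.append(event)
    return rejected


def sink_drain(channel, destination, batch_size):
    """Sink: remove one batch from the channel and send it downstream."""
    batch = channel.take(batch_size)
    destination.extend(batch)
    return len(batch)


channel = MemoryChannel(capacity=3)
destination = []
rejected = source_accept(channel, ["a\n", "  ", "b\n", "c\n", "d\n"], drop_blank)
drained = sink_drain(channel, destination, batch_size=2)
```

In this sketch the bounded channel is what absorbs a burst: the source can run ahead of the sink until the buffer fills, after which events are pushed back to the sender.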

Apache Flume
[Diagram: Log, three Flume Agents, HDFS]
Works well for managing impedance mismatches between source and sink - smooth out spikes in load

Apache Flume
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Apache Flume
Combinatorial explosion of agents with tasks and record formats
Contextual processing is hard
Configuration validation is hard
No overall view of the system
Version 1.0 released Jan 2012
Latest version (1.6) released May 2015

Apache Sqoop
Originated at Cloudera
Transfer bulk data between RDBMS and Hadoop
Command-line tool
Breaks a table/query into ‘n’ partitions
‘On-cluster’ - runs as a ‘map-only’ job in MapReduce
‘High-Speed Connectors’ can take advantage of low-level database features - Teradata, Exadata, Netezza etc.
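
The partitioning step can be illustrated with a short Python sketch. It is not Sqoop's implementation: it assumes a numeric split column whose minimum and maximum are already known, and `split_ranges` and `where_clauses` are hypothetical helper names. Each resulting range would be handed to one mapper.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    if max_id < min_id or num_mappers < 1:
        return []
    total = max_id - min_id + 1
    # Never create more partitions than there are key values.
    num_splits = min(num_mappers, total)
    base, extra = divmod(total, num_splits)
    ranges = []
    lo = min_id
    for i in range(num_splits):
        # The first `extra` partitions take one additional key each.
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def where_clauses(column, ranges):
    """Render each range as the predicate one mapper would query with."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


ranges = split_ranges(1, 10, 4)
clauses = where_clauses("id", ranges)
```

Even-width ranges only balance the work when key values are spread evenly; a skewed column leaves some mappers with far more rows than others.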

Apache Sqoop
$ sqoop import-all-tables \
    -m {{cluster_data.worker_node_hostname.length}} \
    --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

Apache Sqoop
Batch mode only
Database credentials on command line, or shipped around in MapReduce config
Version 1.0.0 released June 2010
Latest version (1.4.6) released April 2015
Sqoop 2 proposed in SQOOP-365, Oct 2011
Latest (1.99.6) released May 2015, still ‘not intended for production deployment’

Apache NiFi
Originated at NSA as Niagarafiles
Open sourced December 2014, Apache TLP July 2015
Opaque, file-oriented payload
Distributed system of processors with centralized control
Based on flow-based programming concepts
Data Provenance
Web-based user interface

Apache NiFi
Opaque files -> same combinatorial explosion as Flume
ConvertAvroToJSON, ConvertCSVToAvro,
ConvertJSONToAvro, ConvertJSONToSQL
Not really big data native
Breaks principle of data locality
Operates as own cluster

Founded by ex-Cloudera, Informatica employees
Continuous open source, intent-driven, big data ingest
Visible, record-oriented approach fixes combinatorial explosion
Batch or stream processing
Standalone, Spark cluster, MapReduce cluster
IDE for pipeline development by ‘civilians’
SDK for custom components (origin/processor/destination)
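
One way to read the combinatorial explosion claim is as a count of format converters. The Python sketch below compares one converter per ordered pair of formats with one parser and one generator per format around a shared record. The format list and the generated names are illustrative assumptions, not components of any of the tools discussed.

```python
def pairwise_converters(formats):
    """One converter per ordered (source, destination) pair of formats."""
    return [
        f"Convert{src}To{dst}"
        for src in formats
        for dst in formats
        if src != dst
    ]


def record_codecs(formats):
    """One parser and one generator per format around a shared record model."""
    parsers = [f"{fmt}Parser" for fmt in formats]
    generators = [f"{fmt}Generator" for fmt in formats]
    return parsers + generators


formats = ["Avro", "CSV", "JSON", "XML"]
pairwise = pairwise_converters(formats)  # grows as n * (n - 1)
codecs = record_codecs(formats)          # grows as 2 * n
```

Under this reading, adding a fifth format means eight new pairwise converters but only two new record codecs.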

Relatively new - first public release September 2015
So far, vast majority of commits are from StreamSets staff

SDC Demo
Apache Kafka -> StreamSets Data Collector -> Apache Kudu

Conclusion
Flume - good for smoothing out impedance mismatches, but complex to deploy and maintain
Sqoop - good for database dumps, but not enterprise-friendly
NiFi - good for file-oriented flows, but not really big-data oriented
StreamSets Data Collector - good for continuous ingest pipelines, but a relative newcomer

Thank You!
Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com
