Open source data ingestion

Open Source
Data Collection/Ingestion
Treasure Data, Inc.
www.treasuredata.com

Hello!
- “Committer” of Fluentd
- Treasure Data, Inc.
- Former Algorithmic Trader
- Stanford Math and CS

Table of Contents
1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging
Links to be added after the talk.

Data Collection/Ingestion is HARD

Data Sources
Raw Data
Storage
Processed
Data
Analysis
Environment
(Big) Data Pipeline
Data Collection
and Ingestion
Data Pre-
processing
Data Fetching
Data Engineers

Data Sources
Raw Data
Storage
Processed
Data
Analysis
Environment
If Data Collection Goes Awry...
Data Collection
and Ingestion
Data Pre-
processing
Data Fetching
Data Engineers

Data Collection
- Happens where data originates
- “logging code”
- Batch v. Streaming
- Pull v. Push
log.error(“FUUUUU....WHY!?”)
cln.send({“uid”:1,”action”:”died”})
200 GET a.com/?utm=big%20data

Data Ingestion
- Receives data
- Sometimes coupled with storage
- Routing data Data Ingestion Layer

rsyslog
- The grandfather of data collectors
- Streaming
- Installed by default, widely understood
- Not as easy to extend/configure

rsyslog
https://github.com/rsyslog/rsyslog/blob/master/ChangeLog

Scribe
- Written originally at Facebook
- Streaming
- Fast (C++)
- Nightmare to build, largely
abandoned

Flume-ng
- Written and maintained by
Cloudera (successor to Flume)
- Commercial support by
Cloudera. Track record for
Hadoop
- Java can be heavy-handed for
some orgs/cases

Logstash
- Pluggable architecture, rich
ecosystem
- The “L” of the ELK stack by
Elastic
- JRuby
- HA uses Redis as a queue
http://apuntesdetrabajo.es/?p=263

Heka
- Developed at Mozilla
- Written in Go, extensible w/ Lua
- Plugin system, but compilation
needed (Go’s limitation, may
change)

Fluentd
- Plugin architecture
- Built-in HA
- CRuby (JRuby on the roadmap)
- google-fluentd, td-agent
- Lightweight multi-source, multi-
destination log routing

Embulk
- Plugin architecture
- Focuses on Batch workloads
- Java/JRuby
- Very new! (looking for
contributors!)

RabbitMQ
- Written in Erlang, supported by
Pivotal
- Implements AMQP

Kafka
- Begun at LinkedIn, now Confluent
- Topic-based Message Broker:
Producer/Broker/Consumer
- Distributed design
- Provides at least once, at most
once by consumers

Fluentd!?
- Used (abused?) as a bus/MQ
- tag-based event routing
- Can be combined with
RabbitMQ/Kafka, etc.

Application Logging
- Common ask: “How’s our new feature doing?”
GET
/foobar
API Server
200 {...}

Application Logging
- What NOT to do: synchronous logging
GET
/foobar
API Server200 {...} Data Backend
write
ack

Application Logging
- What NOT to do: synchronous logging
GET
/foobar
API Server200 {...} Local Data
Collector
write Flush
Data
Backendack
Buffer

- Is writing to a local log collector safe?
- What if the log collector retries by error?
But wait...
- A lot of problems to think about!

“Much of the blame, little of the glory”
(Just kidding. The entire data team relies on YOU!)

Thank you!
(...and we are hiring!)
www.treasuredata.com/careers

- Software
- www.fluentd.org
- hekad.readthedocs.org
- logstash.org
- kafka.apache.org
- Embulk.org
- www.rabbitmq.com
- Ideas
- https://engineering.linkedin.com/distributed-systems/log-what-every-
software-engineer-should-know-about-real-time-datas-unifying
- http://radar.oreilly.com/2015/04/the-log-the-lifeblood-of-your-data-
pipeline.htmlL
Bibliography

Open source data ingestion

More Related Content

What's hot

Viewers also liked

Similar to Open source data ingestion

More from Treasure Data, Inc.

Open source data ingestion