Inside Flume


Transcript

  • 1. Inside Flume Henry Robinson henry@cloudera.com @henryr Tuesday, 17 August 2010
  • 2. Who am I?
      • Distributed systems guy
      • Apache ZooKeeper committer
      • I work at Cloudera on Flume, ZooKeeper, Hue, more...
      • p.s. Cloudera is hiring!
    Copyright 2010 Cloudera Inc. All rights reserved. Tuesday, 17 August 2010
  • 3. About Cloudera
      • Software, services and support for Hadoop
      • Built around an open core
          • All our patches get contributed upstream
          • Flume and Hue are open-source
          • We just started the Whirr project
      • We maintain, package and support Cloudera’s Distribution for Hadoop
          • Smoothing off a lot of the rough edges around Hadoop
          • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive, Pig, Hue, Flume and more.
  • 4. What’s the problem?
      • Data collection is currently a priori and ad hoc
          • A priori: decide what you want to collect ahead of time
          • Ad hoc: each kind of data source goes through its own collection path
      • Usually a collection of fragile, custom scripts
  • 5. What is Flume? (and how can it help?)
      • Flume is:
          • A distributed data collection service
          • Scalable
          • Configurable
          • Extensible
          • Manageable
          • Open source
      • How can it help?
          • One-stop solution for data collection of all formats
          • Flexible reliability guarantees allow careful performance tuning
          • Enables quick iteration on new collection strategies
  • 6. The Flume Model
      • Built around the concept of flows
      • A single flow corresponds to a type of data source
          • Like web server logs
          • Or machine monitoring metrics
      • Different flows might have different compression, batching or reliability setups
      • Flume multiplexes many flows onto one service instance
      • Flows are composed of nodes chained together
          • Each Flume process can run many nodes, so resources are shared
          • Each node receives data at its source, and sends it to its sink
  • 7. Flume Flows
      • Three typical flows, all on the same Flume service
      • [Diagram: three flows carrying events and data]
          • Flow 1: Web-clicks (reliable delivery, compressed, batched)
          • Flow 2: Process monitoring (best-effort delivery)
          • Flow 3: Advert impressions (reliable delivery)
  • 8. Anatomy of a Flume node
      • Data come in through a source...
      • ... are optionally processed by one or more decorators...
      • ... and then are transmitted out via a sink
      • Each of these components is (re-)configurable at run-time
      • Each has a very simple API, and a plugin interface that makes customizing Flume very easy
      • These simple abstractions are sufficient to build more complex features like acknowledged delivery, filtering, compression
  • 9. Agents and Collectors
      • Nodes that receive data from an application are called agents
      • Flume supports many sources for agents, including:
          • Syslog
          • Tailing a file
          • Unix processes
          • Scribe API
          • Twitter
      • Nodes that write data to permanent storage are called collectors
          • Most often they write to HDFS
  • 10. Flume Nodes
      • Each role may be played by many different nodes
      • Usually require substantially fewer collectors than agents
      • [Diagram: three node roles chained together]
          • Agent: the source tails Apache HTTPD logs; the sink is the downstream processor node
          • Processor: the source is the upstream agent node; a decorator extracts the browser name from the log string and attaches it to the event; the sink is the downstream collector node
          • Collector: the source is the upstream processor node; the sink is HDFS, at hdfs://namenode/weblogs/%{browser}/
  • 11. Flume Events
      • All data are transformed into a series of events
      • Events are a pair (body, metadata)
          • Body is a string of bytes
          • Metadata is a table mapping keys to values
      • Flume can use this to inform processing
      • Or simply write it with the event
  • 12. The Flume Configuration Language
      • Node configurations are written in a simple language
          my-flume-node : src | { decorator => sink }
      • For example: a configuration to read HTTP log data from a file and send it to a collector:
          web-log-agent : tail("/var/log/httpd.log") | agentBESink
      • On the collector, receive data and bucket it according to browser:
          web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") }
      • Two lines to set up an entire flow
  • 13. Keeping Track of Nodes
      • The master service monitors all Flume nodes
          • A single port-of-call for checking on the health of your Flume service
          • Send commands to the master, and it will forward them to the nodes
      • The Flume Shell is a convenient, scriptable command-line tool
      • Web-based UIs are also available
  • 14. Flume as a Distributed System
      • Fundamental principle: Keep state out of the data path where possible
          • Replication is costly
          • Consistency is problematic
          • Global knowledge is impractical
      • Follow the end-to-end principle: put smarts at the edges
      • Advantages
          • Failures become much cheaper
          • Performance is better
      • Disadvantages
          • Have to weaken some delivery guarantees
  • 15. Scalability and reliability in Flume
      • The data path is ‘horizontally scalable’
          • Add more machines, get more performance
          • Typically the bottleneck is write performance at the collector
          • If machines fail, others automatically take their place
      • The master only requires a few machines
          • Consistency and replication handled by ZooKeeper + gossip
          • A cluster of five or seven machines can handle thousands of nodes
          • Can add more if you manage to hit the limit
  • 16. Flume as Open Source
      • http://github.com/cloudera/flume
      • Already vibrant contributor community
      • Flume 0.9.1 is at release candidate 0 right now
      • Cloudera provides
          • Packages
          • Standardisation
          • Support
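
The node anatomy on slide 8, the event model on slide 11 and the collector configuration on slide 12 can be illustrated with a small Python sketch. This is not Flume's API (Flume itself is written in Java): the names Event, regex_decorator, expand_path, BucketingSink and run_node are hypothetical, the sink only groups event bodies in memory instead of writing to HDFS, and the "unknown" bucket for events with no match is an assumption made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Event:
    """An event as described on slide 11: a byte body plus key/value metadata."""
    body: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


def regex_decorator(pattern: str, key: str) -> Callable[[Event], Event]:
    """Hypothetical decorator: store the first capture group under `key`."""
    compiled = re.compile(pattern.encode())

    def decorate(event: Event) -> Event:
        match = compiled.search(event.body)
        if match:
            event.metadata[key] = match.group(1).decode()
        return event

    return decorate


def expand_path(template: str, event: Event) -> str:
    """Replace each %{key} in the template with the event's metadata value.

    Falling back to "unknown" for a missing key is an assumption of this sketch.
    """
    return re.sub(
        r"%\{(\w+)\}",
        lambda m: event.metadata.get(m.group(1), "unknown"),
        template,
    )


class BucketingSink:
    """Hypothetical sink: groups event bodies by expanded path, in memory only."""

    def __init__(self, template: str) -> None:
        self.template = template
        self.buckets: Dict[str, List[bytes]] = {}

    def append(self, event: Event) -> None:
        path = expand_path(self.template, event)
        self.buckets.setdefault(path, []).append(event.body)


def run_node(
    source: Iterable[Event],
    decorators: List[Callable[[Event], Event]],
    sink: BucketingSink,
) -> None:
    """Pull events from the source, pass them through each decorator, then the sink."""
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink.append(event)


if __name__ == "__main__":
    log_lines = [
        b'GET / "Mozilla/5.0 Firefox/3.6"',
        b'GET /a "Mozilla/4.0 Internet Explorer"',
        b'GET /b "curl/7.20"',
    ]
    sink = BucketingSink("hdfs://namenode/flume-logs/%{browser}")
    run_node(
        (Event(line) for line in log_lines),
        [regex_decorator("(Firefox|Internet Explorer)", "browser")],
        sink,
    )
    for path, bodies in sink.buckets.items():
        print(path, len(bodies))
```

Keeping the decorator a plain function from event to event mirrors the slide's point that the source, decorator and sink abstractions are simple enough to be chained and swapped independently.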