Flume @ Austin HUG 2/17/11

This presentation describes Flume, a distributed log collection system for shipping data to frameworks such as Hadoop and HBase. It provides an overview and describes updates and emerging stories from the community since its open source release. These are the slides from the 2/17/11 Austin, TX HUG.

Published in: Technology

Transcript

  • 1. Flume: Reliable Distributed Streaming Log Collection. Jonathan Hsieh, Henry Robinson, Patrick Hunt. Cloudera, Inc. Hadoop World 2010, 10/12/2010
  • 2. Flume: 4 months after Hadoop World 2010. Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener. Cloudera, Inc. Austin Hadoop Users Group, 2/17/2011
  • 3. Who Am I?
    • Cloudera:
      – Software Engineer on the Platform Team
      – Flume Project Lead / Designer / Architect
    • U of Washington:
      – “On leave” from the PhD program
      – Research in systems and programming languages
    • Previously:
      – Computer security, embedded systems
  • 4. The basic scenario
    • You have a bunch of servers generating log files.
    • You figured out that your logs are valuable and you want to keep them and analyze them.
    • Because of the volume of data, you’ve started using Apache Hadoop or Cloudera’s Distribution of Apache Hadoop.
    • … and you’ve got some ad-hoc, hacked-together scripts that copy data from the servers to HDFS. (“It’s log, log… everyone wants a log!”)
  • 5. Ad-hockery gets complicated
    • Reliability – Will your data still get there … if your scripts fail? … if your hardware fails? … if HDFS goes down? … if EC2 has flaked out?
    • Scale – As you add servers, will your scripts keep up with 100GB+ per day? Will you have tons of small files? Are you going to have tons of connections? Are you willing to suffer more latency to mitigate?
    • Manageability – How do you know if the script failed on machine 172? What about logs from that other system? How do you monitor and configure all the servers? Can you deal with elasticity?
    • Extensibility – Can you service custom logs? Send data to different places like HBase, Hive, or incremental search indexes? Can you do near-realtime?
    • Black box – What happens when the person who wrote it leaves?
  • 6. Cloudera Flume
    Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
    Project principles:
    • Scalability
    • Reliability
    • Extensibility
    • Manageability
    • Openness
  • 7.–10. Flume: The Standard Use Case
    [Diagram, shown in successive builds: many servers, each running a Flume agent, form an agent tier; agents forward events to a collector tier; collectors write into HDFS. Later builds add the Flume Master, which configures both tiers.]
  • 11.–12. Flume’s Key Abstractions
    [Diagram: a node is a source wired to a sink; agent and collector nodes sit in the data path, the Master in the control path.]
    • Data path and control path.
    • Nodes are in the data path:
      – Nodes have a source and a sink.
      – They can take different roles: a typical topology has agent nodes and collector nodes, and optionally processor nodes.
    • Masters are in the control path:
      – Centralized point of configuration.
      – Specify sources and sinks.
      – Can control flows of data between nodes.
      – Use one master, or use many with a ZooKeeper-backed quorum.
  • 13. Can I has the codez?
    node001: tail("/var/log/app/log") | autoE2ESink;
    node002: tail("/var/log/app/log") | autoE2ESink;
    …
    node100: tail("/var/log/app/log") | autoE2ESink;
    collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
    collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
    collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
  • 14. Outline
    • What is Flume?
    • Scalability – Horizontal scalability of all nodes and masters
    • Reliability – Fault tolerance and high availability
    • Extensibility – Unix principle; all kinds of data, all kinds of sources, all kinds of sinks
    • Manageability – Centralized management supporting dynamic reconfiguration
    • Openness – Apache v2.0 license and an active and growing community
  • 15. SCALABILITY
  • 16. Flume: The Standard Use Case
    [Diagram repeated: agent tier → collector tier → HDFS.]
  • 17. Data path is horizontally scalable
    [Diagram: several server agents feeding one collector, which writes to HDFS.]
    • Add collectors to increase availability and to handle more data.
      – Assumes a single agent will not dominate a collector.
      – Fewer connections to HDFS.
      – Larger, more efficient writes to HDFS.
    • Agents have mechanisms for machine-resource tradeoffs:
      – Write the log locally to avoid collector disk I/O bottlenecks and catastrophic failures.
      – Compression and batching (trade CPU for network).
      – Push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks).
  • 18. RELIABILITY
  • 19. Tunable failure recovery modes
    [Diagram: Agent → Collector → HDFS shown for each mode.]
    • Best effort
      – Fire and forget.
    • Store on failure + retry
      – Local acks; local errors detectable.
      – Failover when faults are detected.
    • End-to-end reliability
      – End-to-end acks.
      – Data survives compound failures, and may be retried multiple times.
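    A minimal sketch (not from the slides) of how these three modes map onto agent sink variants in the same configuration language; the sink names agentBESink, agentDFOSink, and agentE2ESink are recalled from Flume OG documentation rather than shown here, so treat them as assumptions:
      Best effort:       nodeBE:  tail("/var/log/app/log") | agentBESink("collector1") ;
      Store on failure:  nodeDFO: tail("/var/log/app/log") | agentDFOSink("collector1") ;
      End-to-end:        nodeE2E: tail("/var/log/app/log") | agentE2ESink("collector1") ;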
  • 20.–21. Load balancing and collector failover
    [Diagram: agents logically partitioned across several collectors, with failover paths between them.]
    • Agents are logically partitioned and send to different collectors. A failover chain is sketched after this slide.
    • Use randomization to pre-specify failovers when many collectors exist:
      – Spread load if a collector goes down.
      – Spread load if new collectors are added to the system.
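    For illustration, a hedged sketch of a pre-specified failover chain in the same language, assuming Flume OG's < primary ? backup > failover combinator and its rpcSink; neither appears on these slides, so treat the names and syntax as assumptions:
      agent1: tail("/var/log/app/log") | < rpcSink("collectorA") ? rpcSink("collectorB") > ;
    The auto sinks shown earlier (e.g. autoE2ESink) let the master build chains like this automatically, randomizing the collector order per agent to spread load as described above.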
  • 22.–24. Control plane is horizontally scalable
    [Diagram, shown in successive builds: nodes talking to a set of masters backed by a ZooKeeper ensemble (ZK1, ZK2, ZK3).]
    • A master controls dynamic configurations of nodes:
      – Uses a consensus protocol to keep state consistent.
      – Scales well for configuration reads.
      – Allows for adaptive repartitioning in the future.
    • Nodes can talk to any master.
    • Masters can talk to an existing ZK ensemble.
  • 25. MANAGEABILITY (Wheeeeee!)
  • 26. Centralized Dataflow Management Interfaces
    • One place to specify node sources, sinks, and data flows.
    • Basic web interface.
    • Flume Shell:
      – Command-line interface.
      – Scriptable.
    • Cloudera Enterprise – Flume Monitor App:
      – Graphical web interface.
  • 27. Configuring Flume
    [Diagram: tail → filter → fan out → { console, roll → hdfs }.]
    Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
    • A concise and precise configuration language for specifying dataflows in a node.
    • Dynamic updates of configurations:
      – Allows for live failover changes.
      – Allows for handling newly provisioned machines.
      – Allows for changing analytics.
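    Because configurations are held centrally by the master and can be updated live, repointing a node is just submitting a new mapping; a hypothetical before/after using only constructs already shown in these slides:
      Before: node001: tail("/var/log/app/log") | autoE2ESink ;
      After:  node001: tail("/var/log/app/log") | [ console, autoE2ESink ] ;
    The second line fans the same stream out to the console for debugging while still forwarding to the collectors.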
  • 28. Output bucketing
    [Diagram: collectors writing time-bucketed files into HDFS:]
    /logs/web/2010/0715/1200/data-xxx.txt
    /logs/web/2010/0715/1200/data-xxy.txt
    /logs/web/2010/0715/1300/data-xxx.txt
    /logs/web/2010/0715/1300/data-xxy.txt
    /logs/web/2010/0715/1400/data-xxx.txt
    …
    node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
    • Automatic output file management:
      – Writes HDFS files into buckets based on time-based tags.
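    The escape sequences in the sink path (%Y, %m, %d, %H) are filled in from those time-based tags, so changing the bucket granularity is just a different path template; a hypothetical daily layout using the same construct:
      node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y-%m%d/", "data")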
  • 29. EXTENSIBILITY
  • 30. Flume is easy to extend
    • Simple source and sink APIs:
      – An event-streaming design.
      – Many simple operations compose for complex behavior.
    • Plug-in architecture, so you can add your own sources, sinks, and decorators.
    [Diagram: source → decorator → fan out → { decorator → sink, sink }.]
  • 31. Variety of Connectors
    • Sources produce data:
      – Console, Exec, Syslog, Scribe, IRC, Twitter.
      – In the works: JMS, AMQP, pubsubhubbub/RSS/Atom.
    • Sinks consume data:
      – Console, local files, HDFS, S3.
      – Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search.
      – In the works: JMS, AMQP.
    • Decorators modify data sent to sinks:
      – Wire batching, compression, sampling, projection, extraction, throughput throttling.
      – Custom near-real-time processing (Meebo).
      – JRuby event modifiers (InfoChimps).
      – Cryptographic extensions (Rearden).
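    A small sketch of composing decorators in the node configuration, assuming Flume OG's { decorator => sink } wrapping syntax and its batch and gzip decorators; the names and syntax are recalled from Flume OG documentation, not from this slide:
      agent5: tail("/var/log/app/log") | { batch(100) => { gzip => autoE2ESink } } ;
    Here events would be batched 100 at a time and compressed on the wire before being handed to the reliable agent sink, trading CPU for network as described on slide 17.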
  • 32.–33. Flume: Multi Datacenter
    [Diagram, shown in two builds: API servers and processor servers run agents that feed a local collector tier; the second build adds a relay that forwards the collector tier's output across datacenters into HDFS.]
  • 34. Flume: Near-Realtime Aggregator
    [Diagram: ad servers run agents feeding a collector; the collector writes to HDFS and to a tracker database for quick reports, while a Hive job over HDFS produces verified reports.]
  • 35. An enterprise story
    [Diagram: API servers on Windows and Linux run Flume agents feeding a collector tier; the collectors write to a Kerberos-secured HDFS cluster, with identities managed through Active Directory / LDAP.]
  • 36. An emerging community story
    [Diagram: servers run Flume agents feeding a collector that fans out to HDFS, HBase, and a search index, supporting Hive and Pig queries on HDFS, key-lookup and range queries on HBase, and incremental / faceted search queries on the index.]
  • 37. OPENNESS AND COMMUNITY
  • 38. Flume is Open Source
    • Apache v2.0 open source license.
      – Independent from the Apache Software Foundation.
    • GitHub source code repository:
      – http://github.com/cloudera/flume
      – Regular tarball update versions every 2-3 months.
      – Regular CDH packaging updates every 3-4 months.
    • Review Board for code review.
    • New external committers wanted!
      – Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer.
      – Independent folks: Bruce Mitchener.
  • 39. Growing user and developer community
    • History:
      – Initial open source release, June 2010.
    • Growth:
      – Pre-Hadoop Summit (late June 2010): 4 followers, 4 forks (original authors).
      – Pre-Hadoop World (October 2010): 174 followers, 34 forks.
      – Pre-CDH3B4 release (February 2011): 288 followers, 51 forks.
  • 40. Support
    • Community-based mailing lists for support (“an answer in a few days”):
      – User: https://groups.google.com/a/cloudera.org/group/flume-user
      – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
    • Community-based IRC chat room (“quick questions, quick answers”):
      – #flume on irc.freenode.net
    • Commercial support with a Cloudera Enterprise subscription:
      – Chat with sales@cloudera.com
  • 41. CONCLUSIONS
  • 42. Summary
    • Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
      – It is centrally managed, which allows for automated and adaptive configurations.
      – This design allows for near-real-time processing.
      – Apache v2.0 license with an active and growing community.
    • Part of Cloudera’s Distribution for Hadoop, about to be refreshed for CDH3b4.
  • 43. Questions? (and shameless plugs)
    • Contact info:
      – jon@cloudera.com
      – Twitter: @jmhsieh
    • Cloudera training in Dallas:
      – Hadoop Training for Developers, March 14-16
      – Hadoop Training for Administrators, March 17-18
      – Sign up at http://cloudera.eventbrite.com
      – 10% discount code for classes: “hug”
    • Cloudera is hiring!
