Flume
Reliable Distributed
Streaming Log Collection

Ian Wrigley
Educational Services, Cloudera, Inc
ian@cloudera.com
Flumetalk

Mike Olson, Ian Wrigley, Andy Kent, Daniel Sikar: Hadoop User Group UK meet - 07/09/2010

Published in: Technology
Transcript of "Flumetalk"

  1. Flume: Reliable Distributed Streaming Log Collection
     Ian Wrigley, Educational Services, Cloudera, Inc
     ian@cloudera.com
  2. Scenario
     •  Situation:
        –  You have hundreds of services producing logs
        –  You’re running a daily cron job on the logs
           •  Rotating the logs
           •  Maybe compressing or otherwise processing them
           •  Transferring them to HDFS (the Hadoop Distributed File System)
     •  Problem:
        –  As the amount of data increases, it takes longer and longer to run the cron job
     7/15/2010
  3. You need a “Flume”
     •  Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them
     •  Open source, Apache v2.0 License
     •  Goals:
        –  Reliability
        –  Scalability
        –  Extensibility
        –  Manageability
     [Image: Columbia Gorge, Broughton Log Flume]
  4. Use cases
     •  Collecting logs from nodes in your Hadoop cluster
     •  Collecting logs from services such as httpd, mail, etc.
     •  Collecting impressions from custom apps for an ad network
     •  But wait, there’s more!
        –  Basic online in-stream analysis
        –  Online in-stream file processing and manipulation
     [Image caption: It’s log, log ... Everyone wants a log!]
  5. Key abstractions
     •  Data path and control path
     •  Nodes are in the data path
        –  Nodes have a source and a sink
        –  They can take different roles
           •  A typical topology has agent nodes and collector nodes
           •  Optionally it has processor nodes
     •  Masters are in the control path
        –  Centralized point of configuration
        –  Specify sources and sinks
        –  Can control flows of data between nodes
        –  Use one master or use many with a ZooKeeper-backed quorum
     [Diagram labels: Agent, Collector, Master]
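The "nodes have a source and a sink" idea on slide 5 can be sketched in a few lines of Python. This is only an illustration of the abstraction: the `Node` class and the `Event`, `Source`, and `Sink` aliases are hypothetical names invented here, not Flume's actual API.

```python
from typing import Callable, Iterable, List

# Hypothetical types for illustration only; these are not Flume classes.
Event = bytes
Source = Callable[[], Iterable[Event]]   # produces events when called
Sink = Callable[[Event], None]           # consumes one event at a time


class Node:
    """A data-path node: reads events from one source and writes them to one sink."""

    def __init__(self, name: str, source: Source, sink: Sink) -> None:
        self.name = name
        self.source = source
        self.sink = sink

    def run(self) -> int:
        """Drain the source into the sink and return the number of events moved."""
        count = 0
        for event in self.source():
            self.sink(event)
            count += 1
        return count


# The "collector" here is just a list that receives whatever the agent forwards.
collected: List[Event] = []
agent = Node("agent1", source=lambda: [b"line 1", b"line 2"], sink=collected.append)
forwarded = agent.run()
```

In this sketch an agent and a collector differ only in which source and sink they are given, which is one way to read "they can take different roles".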
  6. A sample topology
     [Diagram: Agent tier (12 Agents), Collector tier (3 Collectors), Master, and HDFS with paths /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, /logs/web/2010/0715/1400]
  7. Masters control node configuration
     [Diagram: Agent tier (12 Agents), Collector tier (3 Collectors), Storage tier, Master, and HDFS with paths /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, /logs/web/2010/0715/1400]
  8. Outline
     •  What is Flume?
        –  Goals and architecture
     •  Reliability
        –  Fault-tolerance and high availability
     •  Scalability
        –  Horizontal scalability of all nodes and masters
     •  Extensibility
        –  Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
     •  Manageability
        –  Centralized management supporting dynamic reconfiguration
  9. RELIABILITY
     [Image caption: The logs will still get there…]
  10. Tunable data reliability levels
      •  Best effort
         –  Fire and forget
      •  Store on failure + retry
         –  Local acks, local errors detectable
         –  Failover when faults detected
      •  End-to-end reliability
         –  End-to-end acks
         –  Data survives compound failures, and may be retried multiple times
      [Diagram beside each level: Agent, Collector, HDFS]
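A rough Python sketch of how the three levels on slide 10 differ in behaviour when the next hop fails. Everything here is hypothetical (`FlakySink`, the function names, the in-memory backlog), and a successful `send` stands in for an acknowledgement; the slide does not describe how Flume implements these modes.

```python
from typing import List

Event = bytes


class FlakySink:
    """Hypothetical next hop that fails a fixed number of times before recovering."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received: List[Event] = []

    def send(self, event: Event) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("next hop unavailable")
        self.received.append(event)


def best_effort(sink: FlakySink, event: Event) -> bool:
    """Fire and forget: a failed send simply drops the event."""
    try:
        sink.send(event)
        return True
    except ConnectionError:
        return False


def store_on_failure(sink: FlakySink, event: Event, backlog: List[Event]) -> bool:
    """On a locally detected error, keep the event in a backlog for a later retry."""
    try:
        sink.send(event)
        return True
    except ConnectionError:
        backlog.append(event)
        return False


def retry_backlog(sink: FlakySink, backlog: List[Event]) -> int:
    """Resend stored events in order; stop at the first failure."""
    delivered = 0
    while backlog:
        try:
            sink.send(backlog[0])
        except ConnectionError:
            break
        backlog.pop(0)
        delivered += 1
    return delivered


def end_to_end(sink: FlakySink, event: Event, max_attempts: int) -> int:
    """Retry until the send is acknowledged; return how many attempts it took."""
    for attempt in range(1, max_attempts + 1):
        try:
            sink.send(event)
            return attempt
        except ConnectionError:
            continue
    raise RuntimeError("event was never acknowledged")
```

The trade-off this makes visible: best effort costs nothing but can lose data, while the other two modes spend storage and repeated sends to avoid loss.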
  11. SCALABILITY
      [Image caption: Logs jamming the Kemi River]
  12. A sample topology
      [Diagram: Agent tier (12 Agents), Collector tier (3 Collectors), Master, and HDFS with paths /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, /logs/web/2010/0715/1400]
  13. Data path is horizontally scalable
      [Diagram: Agents, Collector, HDFS]
      •  Add collectors to increase availability and to handle more data
         –  Assumes a single agent will not dominate a collector
         –  Fewer connections to HDFS
         –  Larger, more efficient writes to HDFS
      •  Agents have mechanisms for machine resource tradeoffs
         •  Write log locally to avoid collector disk IO bottleneck and catastrophic failures
         •  Compression and batching (trade CPU for network)
         •  Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
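The "compression and batching (trade CPU for network)" point on slide 13 can be illustrated with a small Python sketch. The helper names and the newline framing are assumptions made for this example (events must not contain newlines); the slide does not say which batching or wire format Flume uses.

```python
import gzip
from typing import Iterable, Iterator, List


def batches(events: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Group events into lists of at most `size` items."""
    batch: List[bytes] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def pack(batch: List[bytes]) -> bytes:
    """Join a batch with newlines and gzip it: CPU spent here saves network bytes."""
    return gzip.compress(b"\n".join(batch))


def unpack(payload: bytes) -> List[bytes]:
    """Inverse of pack, as the receiving side would run it."""
    return gzip.decompress(payload).split(b"\n")


# 1000 similar log lines become 10 compressed payloads instead of 1000 small sends.
events = [b"GET /index.html 200"] * 1000
payloads = [pack(b) for b in batches(events, 100)]
raw_bytes = sum(len(e) for e in events)
sent_bytes = sum(len(p) for p in payloads)
```

Repetitive log lines compress well, so `sent_bytes` ends up far smaller than `raw_bytes` here; real savings depend on the data.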
  14. Load balancing
      [Diagram: Agents sending to three Collectors]
      •  Agents are logically partitioned and can send to different collectors
      •  Use randomization to pre-specify failovers when many collectors exist
      •  Spread load if a collector goes down
      •  Spread load if new collectors are added to the system
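One way to read "use randomization to pre-specify failovers" on slide 14 is that each agent gets its own shuffled list of collectors, decided ahead of time. The sketch below assumes that reading; `failover_chain` and `pick_collector` are hypothetical helpers, not Flume functions.

```python
import random
from typing import List, Set


def failover_chain(agent: str, collectors: List[str], seed: int = 0) -> List[str]:
    """Return a per-agent ordering of collectors, fixed in advance.

    Seeding the generator with the agent name keeps each agent's chain stable
    across calls, while different agents tend to get different orders.
    """
    rng = random.Random(f"{seed}:{agent}")
    chain = list(collectors)
    rng.shuffle(chain)
    return chain


def pick_collector(chain: List[str], down: Set[str]) -> str:
    """Use the first collector in the chain that is not known to be down."""
    for collector in chain:
        if collector not in down:
            return collector
    raise RuntimeError("no collector available")


collectors = ["collector1", "collector2", "collector3"]
chain = failover_chain("agent1", collectors)
primary = pick_collector(chain, down=set())
fallback = pick_collector(chain, down={primary})
```

Because each agent's chain is shuffled independently, the agents of a failed collector fall back to different collectors rather than all moving to the same one.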
  16. Control plane is horizontally scalable
      [Diagram: three Nodes, three Masters, and ZooKeeper members ZK1, ZK2, ZK3]
      •  A master controls dynamic configurations of nodes
         –  Uses consensus protocol to keep state consistent
         –  Scales well for configuration reads
         –  Allows for adaptive repartitioning in the future
      •  Nodes can talk to any master
      •  Masters can talk to any ZooKeeper member
  19. EXTENSIBILITY
      [Image caption: Turn raw logs into something useful…]
  20. Flume is easy to extend
      •  Simple source and sink APIs
         –  Event granularity streaming design
         –  Have many simple operations and compose for complex behavior
      •  End-to-end principle
         –  Put smarts and state at the end points. Keep the middle simple
      •  Flume deals with reliability
         –  Just add a new source or add a new sink and Flume has primitives to deal with reliability
  21. Variety of data sources
      •  Can deal with push and pull sources
      •  Supports many legacy event sources
         –  Tailing a file
         –  Output from periodically exec’ed program
         –  Syslog, Syslog-ng
         –  Experimental: IRC / Twitter / Scribe / AMQP
      [Diagram: three App/Agent pairings labeled push, poll, and embed]
  22. Variety of data output
      •  Send data to many sinks
         –  HDFS, Files, Console, RPC
         –  Experimental: HBase, Voldemort, S3, etc…
      •  Supports an extensible variety of output formats and destinations
         –  Output to language-neutral and open data formats (JSON, Avro, text)
         –  Compressed output files in development
      •  Uses decorators to process event data in-flight
         –  Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc…
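The decorator idea on slide 22 (and "have many simple operations and compose for complex behavior" on slide 20) can be sketched as functions that wrap a sink and return a new sink. The three decorators below are hypothetical stand-ins chosen from the slide's list (filtering, sampling, checksumming); they are not Flume's decorator implementations.

```python
import zlib
from typing import Callable, List

Sink = Callable[[bytes], None]


def filter_decorator(predicate: Callable[[bytes], bool], downstream: Sink) -> Sink:
    """Pass an event downstream only if the predicate accepts it."""
    def sink(event: bytes) -> None:
        if predicate(event):
            downstream(event)
    return sink


def sample_decorator(every_n: int, downstream: Sink) -> Sink:
    """Pass every n-th event downstream and drop the rest."""
    count = 0

    def sink(event: bytes) -> None:
        nonlocal count
        count += 1
        if count % every_n == 0:
            downstream(event)
    return sink


def checksum_decorator(downstream: Sink) -> Sink:
    """Append a CRC32 of the event body before passing it on."""
    def sink(event: bytes) -> None:
        downstream(event + b" crc32=%08x" % zlib.crc32(event))
    return sink


# Compose: keep only ERROR lines, then forward every second one of those.
out: List[bytes] = []
pipeline = filter_decorator(lambda e: b"ERROR" in e, sample_decorator(2, out.append))
for line in [b"ERROR 1", b"INFO ok", b"ERROR 2", b"ERROR 3", b"ERROR 4"]:
    pipeline(line)

checked: List[bytes] = []
checksum_decorator(checked.append)(b"payload")
```

Each decorator knows nothing about the others, so new in-flight processing steps can be added without touching sources or sinks.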
  23. MANAGEABILITY
      [Image caption: Wheeeeee!]
  24. Centralized data flow management
      •  Master specifies node sources, sinks and data flows
         –  Simply specify the role of the node: collector, agent
         –  Or specify a custom configuration for a node
      •  Control interfaces:
         –  Flume Shell
         –  Basic Web
         –  HUE + Flume Manager App (Enterprise users)
  25. Output bucketing
      [Diagram: Collectors writing to HDFS files:
         /logs/web/2010/0715/1200/data-xxx.txt
         /logs/web/2010/0715/1200/data-xxy.txt
         /logs/web/2010/0715/1300/data-xxx.txt
         /logs/web/2010/0715/1300/data-xxy.txt
         /logs/web/2010/0715/1400/data-xxx.txt
         …]
      node : collectorSource | collectorSink ("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
      •  Automatic output file management
         –  Write HDFS files in directories using time-based tags
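The time-based tags in the `collectorSink` path on slide 25 can be tried out with Python's `strftime`, which happens to use the same `%Y`, `%m`, `%d`, and `%H` codes as the slide's example. This only illustrates how such a template expands into an hourly bucket; whether Flume's full set of escape sequences matches `strftime` is not stated on the slide, and `bucket_path` is a name made up for this sketch.

```python
from datetime import datetime


def bucket_path(template: str, when: datetime) -> str:
    """Expand the time tags in a path template for the given event time."""
    return when.strftime(template)


# Template copied from the slide; the literal "00" after %H rounds to the hour.
TEMPLATE = "hdfs://namenode/logs/web/%Y/%m%d/%H00"

path = bucket_path(TEMPLATE, datetime(2010, 7, 15, 12, 34))
next_hour = bucket_path(TEMPLATE, datetime(2010, 7, 15, 13, 5))
```

Events stamped 12:34 and 13:05 land in the `1200` and `1300` directories respectively, matching the directory layout shown in the slide's diagram.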
  26. For advanced users
      •  A concise and precise configuration language for specifying arbitrary data paths
         –  Dataflows are essentially DAGs
         –  Control specific event flows
            •  Enable durability mechanisms and failover mechanisms
            •  Tune the parameters of these mechanisms
         –  Dynamic updates of configurations
            •  Allows for live failover changes
            •  Allows for handling newly provisioned machines
            •  Allows for changing analytics
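Since slide 26 describes dataflows as essentially DAGs, a configuration tool could check that a proposed set of flows has no cycles before applying it. The sketch below does that check with Python's standard `graphlib` module (Python 3.9+); the `delivery_order` helper and the node names are hypothetical, and nothing here reflects how the Flume master validates configurations.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def delivery_order(flows: Dict[str, Set[str]]) -> List[str]:
    """Order nodes so that every node appears before the nodes it sends to.

    `flows` maps each node name to the set of nodes it sends events to.
    Raises graphlib.CycleError if the flows do not form a DAG.
    """
    sorter: TopologicalSorter = TopologicalSorter()
    for node, downstream in flows.items():
        sorter.add(node)
        for target in downstream:
            # `target` depends on `node`, so `node` is ordered first.
            sorter.add(target, node)
    return list(sorter.static_order())


flows = {
    "agent1": {"collector"},
    "agent2": {"collector"},
    "collector": {"hdfs"},
}
order = delivery_order(flows)

try:
    delivery_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Rejecting cyclic flows up front matters for dynamic reconfiguration, because a cycle would have nodes forwarding the same events to each other indefinitely.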
  27. CONCLUSIONS
  28. Summary
      •  Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
         –  Tunable data reliability levels
         –  Reliable master backed by ZooKeeper
         –  Write data to HDFS into buckets ready for batch processing
         –  Dynamically configurable nodes
         –  Simplified automated management for agent+collector topologies
      •  Open source, Apache v2.0 license
  29. Contribute!
      •  GitHub source repo
         –  http://github.com/cloudera/flume
      •  Mailing lists
         –  User: https://groups.google.com/a/cloudera.org/group/flume-user
         –  Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
      •  Development trackers
         –  JIRA (bugs / formal feature requests):
            •  https://issues.cloudera.org/browse/FLUME
         –  Review board (code reviews):
            •  http://review.hbase.org -> http://review.cloudera.org
      •  IRC channels
         –  #flume @ irc.freenode.net
  30. Image credits
      •  http://www.flickr.com/photos/victorvonsalza/3327750057/
      •  http://www.flickr.com/photos/victorvonsalza/3207639929/
      •  http://www.flickr.com/photos/victorvonsalza/3327750059/
      •  http://www.emvergeoning.com/?m=200811
      •  http://www.flickr.com/photos/juse/188960076/
      •  http://www.flickr.com/photos/juse/188960076/
      •  http://www.flickr.com/photos/23720661@N08/3186507302/
      •  http://clarksoutdoorchairs.com/log_adirondack_chairs.html
      •  http://www.flickr.com/photos/dboo/3314299591/
