Flumetalk


Mike Olson, Ian Wrigley, Andy Kent, Daniel Sikar: Hadoop User Group UK meet - 07/09/2010

    Flumetalk Presentation Transcript

    • Flume: Reliable Distributed Streaming Log Collection
        Ian Wrigley, Educational Services, Cloudera, Inc., ian@cloudera.com
    • Scenario
        • Situation:
            – You have hundreds of services producing logs
            – You’re running a daily cron job on the logs
                • Rotating the logs
                • Maybe compressing or otherwise processing them
                • Transferring them to HDFS (the Hadoop Distributed File System)
        • Problem:
            – As the amount of data increases, it takes longer and longer to run the cron job
        7/15/2010
    • You need a “Flume”
        • Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them
        • Open source, Apache v2.0 License
        • Goals:
            – Reliability
            – Scalability
            – Extensibility
            – Manageability
        [Image: Columbia Gorge, Broughton Log Flume]
    • Use cases
        • Collecting logs from nodes in your Hadoop cluster
        • Collecting logs from services such as httpd, mail, etc.
        • Collecting impressions from custom apps for an ad network
        • But wait, there’s more!
            – Basic online in-stream analysis
            – Online in-stream file processing and manipulation
        [Image caption: It’s log, log ... Everyone wants a log!]
    • Key abstractions
        • Data path and control path
        • Nodes are in the data path
            – Nodes have a source and a sink
            – They can take different roles
                • A typical topology has agent nodes and collector nodes
                • Optionally it has processor nodes
        • Masters are in the control path
            – Centralized point of configuration
            – Specify sources and sinks
            – Can control flows of data between nodes
            – Use one master or use many with a ZooKeeper-backed quorum
        [Diagram labels: Agent, Collector, Master]
    • A sample topology
        [Diagram: Master; agent tier (twelve agents); collector tier (three collectors); HDFS with directories /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, /logs/web/2010/0715/1400]
    • Masters control node configuration
        [Diagram: Master; agent tier (twelve agents); collector tier (three collectors); storage tier (HDFS) with directories /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, /logs/web/2010/0715/1400]
    • Outline
        • What is Flume?
            – Goals and architecture
        • Reliability
            – Fault tolerance and high availability
        • Scalability
            – Horizontal scalability of all nodes and masters
        • Extensibility
            – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
        • Manageability
            – Centralized management supporting dynamic reconfiguration
    • RELIABILITY
        The logs will still get there…
    • Tunable data reliability levels
        • Best effort
            – Fire and forget
        • Store on failure + retry
            – Local acks, local errors detectable
            – Failover when faults detected
        • End-to-end reliability
            – End-to-end acks
            – Data survives compound failures, and may be retried multiple times
        [Diagram, repeated for each level: Agent, Collector, HDFS]
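The difference between the first two levels can be illustrated with a small simulation. This is only a sketch, not Flume code: `FlakyCollector`, `best_effort`, and `store_on_failure` are hypothetical names, and the "ack" is modeled as a plain return value. End-to-end acknowledgement through to HDFS is not modeled here.

```python
from collections import deque


class FlakyCollector:
    """Hypothetical collector that raises on the configured attempt numbers."""

    def __init__(self, fail_on=()):
        self.fail_on = set(fail_on)
        self.attempts = 0
        self.received = []

    def send(self, event):
        self.attempts += 1
        if self.attempts in self.fail_on:
            raise ConnectionError("collector unavailable")
        self.received.append(event)
        return True  # stands in for a local acknowledgement


def best_effort(events, collector):
    """Fire and forget: a failed send is ignored, so the event is lost."""
    for event in events:
        try:
            collector.send(event)
        except ConnectionError:
            pass


def store_on_failure(events, collector, max_retries=3):
    """Keep events whose send failed in a local buffer and retry them later.

    Returns the events that are still undelivered after max_retries rounds.
    """
    pending = deque()
    for event in events:
        try:
            collector.send(event)
        except ConnectionError:
            pending.append(event)
    for _ in range(max_retries):
        if not pending:
            break
        # Retry each buffered event once per round.
        for _ in range(len(pending)):
            event = pending.popleft()
            try:
                collector.send(event)
            except ConnectionError:
                pending.append(event)
    return list(pending)
```

With a collector that fails on its second attempt, `best_effort` drops the second event, while `store_on_failure` delivers it on a later retry, at the cost of reordering.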
    • SCALABILITY
        [Image: Logs jamming the Kemi River]
    • Data path is horizontally scalable
        • Add collectors to increase availability and to handle more data
            – Assumes a single agent will not dominate a collector
            – Fewer connections to HDFS
            – Larger, more efficient writes to HDFS
        • Agents have mechanisms for machine resource tradeoffs
            – Write log locally to avoid collector disk IO bottleneck and catastrophic failures
            – Compression and batching (trade CPU for network)
            – Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
        [Diagram: four agents, one collector, HDFS]
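The "trade CPU for network" point can be made concrete with a rough sketch. The helpers `pack_batch` and `unpack_batch` are hypothetical, not Flume APIs, and the sketch assumes each event is a single line of text with no embedded newline.

```python
import gzip


def pack_batch(events):
    """Join single-line events into one payload and gzip it.

    Spends CPU on the agent so that fewer bytes cross the network.
    """
    payload = "\n".join(events).encode("utf-8")
    return gzip.compress(payload)


def unpack_batch(blob):
    """Reverse pack_batch on the receiving side."""
    return gzip.decompress(blob).decode("utf-8").split("\n")
```

For repetitive log lines the packed batch is much smaller than the raw bytes; how much smaller depends entirely on the data.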
    • Load balancing
        • Agents are logically partitioned and can send to different collectors
        • Use randomization to pre-specify failovers when many collectors exist
            – Spread load if a collector goes down
            – Spread load if new collectors are added to the system
        [Diagram: six agents, three collectors]
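One way to read "use randomization to pre-specify failovers" is that each agent is given a randomly ordered list of collectors ahead of time and walks down it when a collector is unavailable. The sketch below assumes that reading; `failover_chains` and `pick_collector` are illustrative names, not part of Flume.

```python
import random


def failover_chains(agents, collectors, chain_length=2, seed=0):
    """Assign each agent a random, duplicate-free list of collectors.

    The first entry is the agent's primary collector; later entries are
    its pre-specified failovers.
    """
    rng = random.Random(seed)
    k = min(chain_length, len(collectors))
    return {agent: rng.sample(collectors, k) for agent in agents}


def pick_collector(chain, alive):
    """Return the first collector in the chain that is currently alive."""
    for collector in chain:
        if collector in alive:
            return collector
    return None  # every collector in this agent's chain is down
```

Because the chains differ between agents, the agents of a failed collector tend to spread over several surviving collectors instead of all moving to the same one.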
    • Control plane is horizontally scalable
        • A master controls dynamic configurations of nodes
            – Uses consensus protocol to keep state consistent
            – Scales well for configuration reads
            – Allows for adaptive repartitioning in the future
        • Nodes can talk to any master
        • Masters can talk to any ZooKeeper member
        [Diagram: three nodes, three masters, ZooKeeper members ZK1, ZK2, ZK3]
    • EXTENSIBILITY
        Turn raw logs into something useful…
    • Flume is easy to extend
        • Simple source and sink APIs
            – Event granularity streaming design
            – Have many simple operations and compose for complex behavior
        • End-to-end principle
            – Put smarts and state at the end points. Keep the middle simple
        • Flume deals with reliability
            – Just add a new source or add a new sink and Flume has primitives to deal with reliability
    • Variety of data sources
        • Can deal with push and pull sources
        • Supports many legacy event sources
            – Tailing a file
            – Output from periodically exec’ed program
            – Syslog, Syslog-ng
            – Experimental: IRC / Twitter / Scribe / AMQP
        [Diagram: App and Agent shown three times, labeled push, poll, and embed]
    • Variety of data output
        • Send data to many sinks
            – HDFS, files, console, RPC
            – Experimental: HBase, Voldemort, S3, etc.
        • Supports an extensible variety of output formats and destinations
            – Output to language-neutral and open data formats (JSON, Avro, text)
            – Compressed output files in development
        • Uses decorators to process event data in-flight
            – Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc.
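The decorator idea, small operations wrapped around a sink and composed, can be sketched as follows. This is an illustration in Python, not Flume's Java API; a sink is modeled as any callable that accepts an event, and all function names are hypothetical.

```python
def list_sink(store):
    """Terminal sink for the example: appends whatever it receives to a list."""
    def sink(event):
        store.append(event)
    return sink


def filter_decorator(predicate, downstream):
    """Pass an event downstream only if predicate(event) is true."""
    def sink(event):
        if predicate(event):
            downstream(event)
    return sink


def batch_decorator(size, downstream):
    """Buffer events and forward them downstream in lists of `size`."""
    buffer = []

    def sink(event):
        buffer.append(event)
        if len(buffer) >= size:
            downstream(list(buffer))
            buffer.clear()
    return sink


# Compose: keep only error lines, then deliver them two at a time.
delivered = []
pipeline = filter_decorator(
    lambda line: "ERROR" in line,
    batch_decorator(2, list_sink(delivered)),
)
for line in ["INFO a", "ERROR b", "ERROR c", "ERROR d"]:
    pipeline(line)
```

After the loop, `delivered` holds one batch of two error lines; the third error line is still waiting in the batch buffer. A real implementation would also need a flush on shutdown.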
    • MANAGEABILITY
        [Image caption: Wheeeeee!]
    • Centralized data flow management
        • Master specifies node sources, sinks and data flows
            – Simply specify the role of the node: collector, agent
            – Or specify a custom configuration for a node
        • Control interfaces:
            – Flume Shell
            – Basic Web
            – HUE + Flume Manager App (Enterprise users)
    • Output bucketing
        • Automatic output file management
            – Write HDFS files in directories using time-based tags
        node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
        [Diagram: two collectors and HDFS, with example output files:]
            /logs/web/2010/0715/1200/data-xxx.txt
            /logs/web/2010/0715/1200/data-xxy.txt
            /logs/web/2010/0715/1300/data-xxx.txt
            /logs/web/2010/0715/1300/data-xxy.txt
            /logs/web/2010/0715/1400/data-xxx.txt
            …
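The time-based tags in the sink path can be illustrated with Python's `strftime`. This assumes the `%Y`, `%m`, `%d`, and `%H` escapes in the slide's path behave like the equally named `strftime` fields, which is consistent with the directory names shown but is not confirmed by the slides. `bucket_path` is a hypothetical helper.

```python
from datetime import datetime


def bucket_path(template, event_time):
    """Expand the time escapes in a path template for one event timestamp."""
    return event_time.strftime(template)


template = "hdfs://namenode/logs/web/%Y/%m%d/%H00"
path = bucket_path(template, datetime(2010, 7, 15, 12, 34))
# path == "hdfs://namenode/logs/web/2010/0715/1200"
```

Every event whose timestamp falls in the same hour expands to the same directory, which is what yields the hourly buckets.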
    • For advanced users
        • A concise and precise configuration language for specifying arbitrary data paths
            – Dataflows are essentially DAGs
            – Control specific event flows
                • Enable durability mechanisms and failover mechanisms
                • Tune the parameters of these mechanisms
            – Dynamic updates of configurations
                • Allows for live failover changes
                • Allows for handling newly provisioned machines
                • Allows for changing analytics
    • CONCLUSIONS
    • Summary
        • Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
            – Tunable data reliability levels
            – Reliable master backed by ZooKeeper
            – Write data to HDFS into buckets ready for batch processing
            – Dynamically configurable nodes
            – Simplified automated management for agent+collector topologies
        • Open source, Apache v2.0 License
    • Contribute!
        • GitHub source repo
            – http://github.com/cloudera/flume
        • Mailing lists
            – User: https://groups.google.com/a/cloudera.org/group/flume-user
            – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
        • Development trackers
            – JIRA (bugs / formal feature requests):
                • https://issues.cloudera.org/browse/FLUME
            – Review board (code reviews):
                • http://review.hbase.org -> http://review.cloudera.org
        • IRC channels
            – #flume @ irc.freenode.net
    • Image credits
        • http://www.flickr.com/photos/victorvonsalza/3327750057/
        • http://www.flickr.com/photos/victorvonsalza/3207639929/
        • http://www.flickr.com/photos/victorvonsalza/3327750059/
        • http://www.emvergeoning.com/?m=200811
        • http://www.flickr.com/photos/juse/188960076/
        • http://www.flickr.com/photos/juse/188960076/
        • http://www.flickr.com/photos/23720661@N08/3186507302/
        • http://clarksoutdoorchairs.com/log_adirondack_chairs.html
        • http://www.flickr.com/photos/dboo/3314299591/