Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Slides for presentation I gave to Chicago Hadoop User Group on April 9, 2014

Transcript

  • 1. Apache Flume Getting Logs/Data to Hadoop Steve Hoffman Chicago Hadoop User Group (CHUG) 2014-04-09T10:30:00Z
  • 2.-4. About Me • Steve Hoffman • twitter: @bacoboy • else: http://bit.ly/bacoboy • Tech Guy @Orbitz • Wrote a book on Flume
  • 5. Why do I need Flume? • Created to deal with streaming data/logs to HDFS • Can’t mount HDFS (usually) • Can’t “copy” files to HDFS if the files aren’t closed (aka log files) • Need to buffer “some”, then write and close a file — repeat • May involve multiple hops due to topology (# of machines, datacenter separation, etc.) • A lot can go wrong here…
  • 6. Agent • Java daemon • Has a name (usually ‘agent’) • Receives data from sources and writes events to 1 or more channels • Moves events from 1 channel to a sink; removes them from the channel only if successfully written
  • 7. Events • Headers = Key/Value Pairs — Map<String, String> • Body = byte array — byte[] • For example:
    10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] "HEAD /ping.html HTTP/1.1" 200 0 "-" "-" "-"
    {"timestamp":"1391986793111", "host":"server1.example.com"}
    31302e31302e312e31202d202d205b32392f4a616e2f323031343a30333a33363a3034202d303630305d202248454144202f70696e672e68746d6c20485454502f312e312220323030203020222d2220222d2220222d22
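    As a rough illustration (not from the slides), an Event like the one above could be built in client code with the Flume SDK's EventBuilder; the class name EventExample and the header values below are just placeholders mirroring the slide's example:

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class EventExample {
        public static void main(String[] args) {
            // Headers are an arbitrary String-to-String map; the body is an opaque byte[].
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("timestamp", "1391986793111");
            headers.put("host", "server1.example.com");

            Event event = EventBuilder.withBody(
                "10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] \"HEAD /ping.html HTTP/1.1\" 200 0"
                    .getBytes(StandardCharsets.UTF_8),
                headers);

            System.out.println(event.getHeaders());                                   // the key/value pairs
            System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));  // the raw byte[] body
        }
    }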
  • 8. Channels • Place to hold Events • Memory or File Backed (also JDBC, but why?) • Bounded - Size is configurable • Resources aren’t infinite
  • 9. Sources • Feeds data to one or more Channels • Usually data is pushed to it (listens for data on a socket, e.g. HTTP Source, or from the Avro log4j appender) • Or can periodically poll another system and generate events (e.g. run a command every minute and parse the output into Events, query a DB/Mongo/etc.)
  • 10. Sinks • Move Events from a single Channel to a destination • Only removes from Channel if write successful • HDFSSink you’ll use the most — most likely…
  • 11. Configuration Sample
    # Agent named ‘agent’
    # Input (source)
    agent.sources.r1.type = seq
    agent.sources.r1.channels = c1

    # Output (sink)
    agent.sinks.k1.type = logger
    agent.sinks.k1.channel = c1

    # Channel
    agent.channels.c1.type = memory
    agent.channels.c1.capacity = 1000

    # Wire everything together
    agent.sources = r1
    agent.sinks = k1
    agent.channels = c1
  • 12.-16. Startup (the same configuration, annotated): for the named agent, Flume reads name.{sources|sinks|channels} to find each instance name + type, connects the channel(s), and applies the type-specific configurations. RTM - Flume User Guide https://flume.apache.org/FlumeUserGuide.html or my book :)
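    The slides don't show the launch command itself; for reference, an agent with this configuration would typically be started with the flume-ng script (the config file name agent.conf here is just an assumed example):

    $ bin/flume-ng agent --conf conf --conf-file agent.conf --name agent \
        -Dflume.root.logger=INFO,console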
  • 17. Configuration Sample (logs)
    Creating channels
    Creating instance of channel c1 type memory
    Created channel c1
    Creating instance of source r1, type seq
    Creating instance of sink: k1, type: logger
    Channel c1 connected to [r1, k1]
    Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.SequenceGeneratorSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@19484a05 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
    Event: { headers:{} body: 30 0 }
    Event: { headers:{} body: 31 1 }
    Event: { headers:{} body: 32 2 }
    and so on…
  • 18. Using Cloudera Manager • Same stuff, just in a GUI • Centrally managed in a Database (instead of source control/Git) • Distributed from central location (instead of Chef/Puppet)
  • 19. Multiple destinations need multiple channels
  • 20. Channel Selector • When more than 1 channel specified on Source • Replicating (Each channel gets a copy) - default • Multiplexing (Channel picked based on a header value) • Custom (If these don’t work for you - code one!)
  • 21. Channel Selector: Replicating • Copy sent to all channels associated with Source:
    agent.sources.r1.selector.type=replicating
    agent.sources.r1.channels=c1 c2 c3
    • Can specify “optional” channels:
    agent.sources.r1.selector.optional=c3
    • Transaction success if all non-optional channels take the event (in this case c1 & c2)
  • 22. Channel Selector: Multiplexing • Copy sent to only some of the channels:
    agent.sources.r1.selector.type=multiplexing
    agent.sources.r1.channels=c1 c2 c3 c4
    • Switch based on header key (i.e. {“currency”:“USD”} → c1):
    agent.sources.r1.selector.header=currency
    agent.sources.r1.selector.mapping.USD=c1
    agent.sources.r1.selector.mapping.EUR=c2 c3
    agent.sources.r1.selector.default=c4
  • 23. Interceptors • Zero or more on Source (before written to channel) • Zero or more on Sink (after read from channel) • Or Both • Use for transformations of data in-flight (headers OR body):
    public Event intercept(Event event);
    public List<Event> intercept(List<Event> events);
    • Return null or empty List to drop Events
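    A rough sketch (not from the slides) of what a custom interceptor implementing that interface might look like; the class name DropEmptyInterceptor, its package, and the "app"/"orders" header values are made up:

    package com.example;

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Hypothetical interceptor: drops events with an empty body and tags the rest.
    public class DropEmptyInterceptor implements Interceptor {

        @Override
        public void initialize() { }

        @Override
        public Event intercept(Event event) {
            if (event.getBody() == null || event.getBody().length == 0) {
                return null;                              // returning null drops the event
            }
            event.getHeaders().put("app", "orders");      // example header, not from the slides
            return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            List<Event> out = new ArrayList<Event>(events.size());
            for (Event e : events) {
                Event intercepted = intercept(e);
                if (intercepted != null) {
                    out.add(intercepted);
                }
            }
            return out;
        }

        @Override
        public void close() { }

        // Flume builds interceptors through a nested Builder named in the config, e.g.
        // agent.sources.r1.interceptors.i1.type = com.example.DropEmptyInterceptor$Builder
        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new DropEmptyInterceptor();
            }

            @Override
            public void configure(Context context) { }
        }
    }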
  • 24. Interceptor Chaining • Processed in Order Listed in Configuration (source r1 example):
    agent.sources.r1.interceptors=i1 i2 i3
    agent.sources.r1.interceptors.i1.type=timestamp
    agent.sources.r1.interceptors.i1.preserveExisting=true
    agent.sources.r1.interceptors.i2.type=static
    agent.sources.r1.interceptors.i2.key=datacenter
    agent.sources.r1.interceptors.i2.value=CHI
    agent.sources.r1.interceptors.i3.type=host
    agent.sources.r1.interceptors.i3.hostHeader=relay
    agent.sources.r1.interceptors.i3.useIP=false
    • Resulting Headers added before writing to Channel: {“timestamp”:“1392350333234”, “datacenter”:“CHI”, “relay”:“flumebox.example.com”}
  • 25. Morphlines • Interceptor and Sink forms • See Cloudera Website/Blog • Created to ease transforms and Cloudera Search/Flume integration • An example:
    # convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
    # The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    # or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
    convertTimestamp {
      field : timestamp
      inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
      inputTimezone : America/Chicago
      outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      outputTimezone : UTC
    }
  • 26. Avro • Apache Avro - Data Serialization • http://avro.apache.org/ • Storage Format and Wire Protocol • Self-Describing (schema written with the data) • Supports Compression of Data (not the container — so MapReduce friendly — “splittable”) • Binary friendly — Doesn’t require records separated by \n
  • 27. Avro Source/Sink • Preferred inter-agent transport in Flume • Simple Configuration (host + port for sink and port for source) • Minimal transformation needed for Flume Events • Versions of Avro in client & server don’t need to match — only payload versioning matters (think protocol buffers vs Java serialization)
  • 28. Avro Source/Sink Config
    foo.sources=…
    foo.channels=channel-foo
    foo.channels.channel-foo.type=memory
    foo.sinks=sink-foo
    foo.sinks.sink-foo.channel=channel-foo
    foo.sinks.sink-foo.type=avro
    foo.sinks.sink-foo.hostname=bar.example.com
    foo.sinks.sink-foo.port=12345
    foo.sinks.sink-foo.compression-type=deflate

    bar.sources=datafromfoo
    bar.sources.datafromfoo.type=avro
    bar.sources.datafromfoo.bind=0.0.0.0
    bar.sources.datafromfoo.port=12345
    bar.sources.datafromfoo.compression-type=deflate
    bar.sources.datafromfoo.channels=channel-bar
    bar.channels=channel-bar
    bar.channels.channel-bar.type=memory
    bar.sinks=…
  • 29. log4j Avro Sink • Remember that Web Server pushing data to Source? • Use the Flume Avro log4j appender! • log level, category, etc. become headers in Event • “message” String becomes the body
  • 30. log4j Configuration • log4j.properties sender (include flume-ng-sdk-1.X.X.jar in project):
    log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
    log4j.appender.flume.Hostname=example.com
    log4j.appender.flume.Port=12345
    log4j.appender.flume.UnsafeMode=true

    log4j.logger.org.example.MyClass=DEBUG,flume
    • flume avro receiver:
    agent.sources=logs
    agent.sources.logs.type=avro
    agent.sources.logs.bind=0.0.0.0
    agent.sources.logs.port=12345
    agent.sources.logs.channels=…
  • 31. Avro Client • Send data to AvroSource from command line • Run flume program with avro-client instead of agent parameter:
    $ bin/flume-ng avro-client -H server.example.com -p 12345 [-F input_file]
    • Each line of the file (or stdin if no file given) becomes an event • Useful for testing or injecting data from outside Flume sources (ExecSource vs cronjob which pipes output to avro-client).
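    For example, the cron-plus-pipe pattern mentioned above might look something like this (the report script is hypothetical):

    $ /usr/local/bin/nightly_report.sh | bin/flume-ng avro-client -H server.example.com -p 12345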
  • 32. HDFSSink • Read from Channel and write to a file in HDFS in chunks • Until 1 of 3 things happens: • some amount of time elapses (rollInterval) • some number of records have been written (rollCount) • some size of data has been written (rollSize) • Close that file and start a new one
  • 33. HDFS Configuration
    foo.sources=…
    foo.channels=channel-foo
    foo.channels.channel-foo.type=memory
    foo.sinks=sink-foo
    foo.sinks.sink-foo.channel=channel-foo
    foo.sinks.sink-foo.type=hdfs
    foo.sinks.sink-foo.hdfs.path=hdfs://NN/data/%Y/%m/%d/%H
    foo.sinks.sink-foo.hdfs.rollInterval=60
    foo.sinks.sink-foo.hdfs.filePrefix=log
    foo.sinks.sink-foo.hdfs.fileSuffix=.avro
    foo.sinks.sink-foo.hdfs.inUsePrefix=_
    foo.sinks.sink-foo.serializer=avro_event
    foo.sinks.sink-foo.serializer.compressionCodec=snappy
  • 34. HDFS writing…
    drwxr-x---   - flume flume     0  2014-02-16 17:04  /data/2014/02/16/23
    -rw-r-----   3 flume flume     0  2014-02-16 17:04  /data/2014/02/16/23/_log.1392591607925.avro.tmp
    -rw-r-----   3 flume flume  1877  2014-02-16 17:01  /data/2014/02/16/23/log.1392591607923.avro
    -rw-r-----   3 flume flume  1955  2014-02-16 17:02  /data/2014/02/16/23/log.1392591607924.avro
    -rw-r-----   3 flume flume  2390  2014-02-16 17:04  /data/2014/02/16/23/log.1392591798436.avro
    • The zero length .tmp file is the current file. Won’t see real size until it closes (just like when you do a hadoop fs -put) • Use …hdfs.inUsePrefix=_ to prevent open files from being included in MapReduce jobs
  • 35. Event Serializers • Defines how the Event gets written to Sink
    • Just the body as a UTF-8 String: agent.sinks.foo-sink.serializer=text
    • Headers and Body as UTF-8 String: agent.sinks.foo-sink.serializer=header_and_text
    • Avro (Flume record Schema): agent.sinks.foo-sink.serializer=avro_event
    • Custom (none of the above meets your needs)
  • 36. Lessons Learned
  • 37. Source: https://xkcd.com/1179/ Too Many…
  • 38. Timezones are Evil • Daylight savings time causes problems twice a year (in Spring: no 2am hour. In Fall: twice the data during 2am hour — 02:15? Which one?) • Date processing in MapReduce jobs: Hourly jobs, filters, etc. • Dated paths: hdfs://NN/data/%Y/%m/%d/%H • Use UTC: -Duser.timezone=UTC • Use one of the ISO8601 formats like 2014-02-26T18:00:00.000Z • Sorts the way you usually want • Every time library supports it* - and if not, easy to parse.
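    As a small illustration (not from the slides), producing those ISO8601 UTC timestamps from Java regardless of the host's timezone setting:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class Iso8601Utc {
        public static void main(String[] args) {
            // Format in UTC with an ISO8601 pattern, independent of -Duser.timezone.
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            System.out.println(fmt.format(new Date()));   // e.g. 2014-02-26T18:00:00.000Z
        }
    }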
  • 39. Generally Speaking… • Async handoff doesn’t work under load when bad stuff happens (diagram: a writer and a reader on either side of a filesystem / queue / database / whatever buffer, which is not ∞)
  • 40.-47. Async Handoff Oops (animation): a Flume Agent runs tail -F foo.log while the file is rotated to foo.log.1, foo.log.2, … and a new foo.log appears; the final frame marks the handoff with an X
  • 48. Don’t Use Tail • Tailing a file for input is bad - assumptions are made that aren’t guarantees. • Direct support removed during Flume rewrite • Handoff can go bad with files: when writer faster than reader • With Queue: when reader doesn’t read before expire time • No way to apply “back pressure” to tell tail there is a problem. It isn’t listening…
  • 49. What can I use? • If you can’t use the log4j Avro Appender… • Use logrotate to move old logs to a “spool” directory • SpoolingDirectorySource • Finally, a cron job to remove .COMPLETED files (for delayed delete) OR set deletePolicy=immediate • Alternatively use logrotate with avro-client? (probably other ways too…)
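    A minimal configuration sketch for that approach (not from the slides; the source name, the /var/spool/flume path, and the c1 channel are assumed, and .COMPLETED is the default fileSuffix):

    agent.sources = spool
    agent.sources.spool.type = spooldir
    agent.sources.spool.spoolDir = /var/spool/flume
    agent.sources.spool.fileSuffix = .COMPLETED
    # deletePolicy can be "never" (then clean up .COMPLETED files yourself) or "immediate"
    agent.sources.spool.deletePolicy = never
    agent.sources.spool.channels = c1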
  • 50. RAM or Disk Channels? Source: http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
  • 51. Duplicate Events • Transactions only at Agent level • You may see Events more than once • Distributed Transactions are expensive • Just deal with it in the query/scrub phase — much less costly than trying to prevent it from happening
  • 52. Late Data • Data could be “late”/delayed • Outages • Restarts • Act of Nature • Only sure thing is a “database” — single write + ACK • Depending on your monitoring, it could be REALLY LATE.
  • 53. Monitoring • Know when it breaks so you can fix it before you can’t ingest new data (and it is lost) • This time window is small if volume is high • Flume Monitoring still WIP, but hooks are there
  • 54. Other Operational Concerns • resource utilization - number of open files when writing (file descriptors), disk space used for file channel, disk contention, disk speed* • number of inbound and outbound sockets - may need to tier (Avro Source/Sink) • minimize hops if possible - another place for data to get stuck
  • 55. Not everything is a nail • Flume is great for handling individual records • What if you need to compute an average? • Get a Stream Processing system • Storm (Twitter’s) • Samza (LinkedIn’s) • Others… • Flume can co-exist with these — use the most appropriate tool
  • 56. Questions? …and thanks! Slides @ http://slideshare.net/bacoboy
