Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Slides for a presentation I gave to the Chicago Hadoop User Group on April 9, 2014.

Transcript

  • 1. Apache Flume Getting Logs/Data to Hadoop Steve Hoffman Chicago Hadoop User Group (CHUG) 2014-04-09T10:30:00Z
  • 2. About Me • Steve Hoffman • twitter: @bacoboy else: http://bit.ly/bacoboy
  • 3. About Me • Steve Hoffman • twitter: @bacoboy else: http://bit.ly/bacoboy • Tech Guy @Orbitz
  • 4. About Me • Steve Hoffman • twitter: @bacoboy else: http://bit.ly/bacoboy • Tech Guy @Orbitz • Wrote a book on Flume
  • 5. Why do I need Flume? • Created to deal with streaming data/logs to HDFS • Can’t mount HDFS (usually) • Can’t “copy” files to HDFS if the files aren’t closed (aka log files) • Need to buffer “some”, then write and close a file — repeat • May involve multiple hops due to topology (# of machines, datacenter separation, etc.) • A lot can go wrong here…
  • 6. Agent • Java daemon • Has a name (usually ‘agent’) • Receive data from sources and write events to 1 or more channels • Move events from 1 channel to sink. Remove from channel if successfully written.
  • 7. Events • Headers = Key/Value Pairs — Map<String, String> • Body = byte array — byte[] • For example: 10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] "HEAD /ping.html HTTP/1.1" 200 0 "-" "-" "-" {"timestamp":"1391986793111", "host":"server1.example.com"} 31302e31302e312e31202d202d205b32392f4a616e2f323031343a30333a33363a3034202d303630305d202248454144202f70696e672e68746d6c20485454502f312e312220323030203020222d2220222d2220222d22
  • 8. Channels • Place to hold Events • Memory or File Backed (also JDBC, but why?) • Bounded - Size is configurable • Resources aren’t infinite
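
For reference, a rough sketch of a file-backed channel (more durable than memory, at the cost of disk I/O); the paths and sizes below are made-up placeholders, not values from the talk:

  agent.channels = c1
  agent.channels.c1.type = file
  agent.channels.c1.checkpointDir = /var/lib/flume/checkpoint
  agent.channels.c1.dataDirs = /var/lib/flume/data
  agent.channels.c1.capacity = 100000
  agent.channels.c1.transactionCapacity = 1000
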
  • 9. Sources • Feeds data to one or more Channels • Usually data pushed to it (listen for data on a socket. i.e. HTTP Source) or from Avro log4J appender. • Or can periodically poll another system and generate events (i.e. run a command every minute, and parse output into Event, Query a DB/Mongo/ etc.)
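
As a rough illustration of both styles (the channel name and script path are hypothetical): an HTTP source listens for events pushed to a port, while an exec source runs a command and turns each line of its output into an event:

  # push style: listen on a socket
  agent.sources = web cmd
  agent.sources.web.type = http
  agent.sources.web.bind = 0.0.0.0
  agent.sources.web.port = 44444
  agent.sources.web.channels = c1
  # poll style: run a command, re-running it roughly every minute
  agent.sources.cmd.type = exec
  agent.sources.cmd.command = /usr/local/bin/dump-stats.sh
  agent.sources.cmd.restart = true
  agent.sources.cmd.restartThrottle = 60000
  agent.sources.cmd.channels = c1
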
  • 10. Sinks • Move Events from a single Channel to a destination • Only removes from Channel if write successful • HDFSSink you’ll use the most — most likely…
  • 11. Configuration Sample # Agent named ‘agent’ # Input (source) agent.sources.r1.type = seq agent.sources.r1.channels = c1 # Output (sink) agent.sinks.k1.type = logger agent.sinks.k1.channel = c1 # Channel agent.channels.c1.type = memory agent.channels.c1.capacity = 1000 # Wire everything together agent.sources = r1 agent.sinks = k1 agent.channels = c1
  • 12. Startup # Agent named ‘agent’ # Input (source) agent.sources.r1.type = seq agent.sources.r1.channels = c1 # Output (sink) agent.sinks.k1.type = logger agent.sinks.k1.channel = c1 # Channel agent.channels.c1.type = memory agent.channels.c1.capacity = 1000 # Wire everything together agent.sources = r1 agent.sinks = k1 agent.channels = c1 name.{sources|sinks|channels}
  • 13. Startup # Agent named ‘agent’ # Input (source) agent.sources.r1.type = seq agent.sources.r1.channels = c1 # Output (sink) agent.sinks.k1.type = logger agent.sinks.k1.channel = c1 # Channel agent.channels.c1.type = memory agent.channels.c1.capacity = 1000 # Wire everything together agent.sources = r1 agent.sinks = k1 agent.channels = c1 name.{sources|sinks|channels} Find instance name + type
  • 14. Startup # Agent named ‘agent’ # Input (source) agent.sources.r1.type = seq agent.sources.r1.channels = c1 # Output (sink) agent.sinks.k1.type = logger agent.sinks.k1.channel = c1 # Channel agent.channels.c1.type = memory agent.channels.c1.capacity = 1000 # Wire everything together agent.sources = r1 agent.sinks = k1 agent.channels = c1 name.{sources|sinks|channels} Find instance name + type Connect channel(s)
  • 15. Startup # Agent named ‘agent’ # Input (source) agent.sources.r1.type = seq agent.sources.r1.channels = c1 # Output (sink) agent.sinks.k1.type = logger agent.sinks.k1.channel = c1 # Channel agent.channels.c1.type = memory agent.channels.c1.capacity = 1000 # Wire everything together agent.sources = r1 agent.sinks = k1 agent.channels = c1 name.{sources|sinks|channels} Find instance name + type Connect channel(s) Apply type specific configurations
  • 16. Startup # Agent named ‘agent’ # Input (source) agent.sources.r1.type = seq agent.sources.r1.channels = c1 # Output (sink) agent.sinks.k1.type = logger agent.sinks.k1.channel = c1 # Channel agent.channels.c1.type = memory agent.channels.c1.capacity = 1000 # Wire everything together agent.sources = r1 agent.sinks = k1 agent.channels = c1 name.{sources|sinks|channels} Find instance name + type Connect channel(s) Apply type specific configurations RTM - Flume User Guide https://flume.apache.org/FlumeUserGuide.html or my book :)
  • 17. Configuration Sample (logs) Creating channels Creating instance of channel c1 type memory Created channel c1 Creating instance of source r1, type seq Creating instance of sink: k1, type: logger Channel c1 connected to [r1, k1] Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.SequenceGeneratorSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@19484a05 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} } Event: { headers:{} body: 30 0 } Event: { headers:{} body: 31 1 } Event: { headers:{} body: 32 2 } and so on…
  • 18. Using Cloudera Manager • Same stuff, just in a GUI • Centrally managed in a Database (instead of source control/Git) • Distributed from central location (instead of Chef/Puppet)
  • 19. Multiple destinations need multiple channels
  • 20. Channel Selector • When more than 1 channel specified on Source • Replicating (Each channel gets a copy) - default • Multiplexing (Channel picked based on a header value) • Custom (If these don’t work for you - code one!)
  • 21. Channel Selector
 Replicating • Copy sent to all channels associated with Source agent.sources.r1.selector.type=replicating
 agent.sources.r1.channels=c1  c2  c3   • Can specify “optional” channels agent.sources.r1.selector.optional=c3   • Transaction success if all non-optional channels take the event (in this case c1 & c2)
  • 22. Channel Selector
 Multiplexing • Copy sent to only some of the channels agent.sources.r1.selector.type=multiplexing
 agent.sources.r1.channels=c1  c2  c3  c4   • Switch based on header key 
 (i.e. {“currency”:“USD”} → c1) agent.sources.r1.selector.header=currency
 agent.sources.r1.selector.mapping.USD=c1
 agent.sources.r1.selector.mapping.EUR=c2  c3
 agent.sources.r1.selector.default=c4
  • 23. Interceptors • Zero or more on Source (before written to channel) • Zero or more on Sink (after read from channel) • Or Both • Use for transformations of data in-flight (headers OR body) public  Event  intercept(Event  event);
 public  List<Event>  intercept(List<Event>  events);   • Return null or empty List to drop Events
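
For the custom case, an interceptor is referenced from configuration by its builder class; a hedged sketch, where com.example.ScrubInterceptor is a made-up class implementing Interceptor with a nested Builder:

  agent.sources.r1.interceptors = i1
  agent.sources.r1.interceptors.i1.type = com.example.ScrubInterceptor$Builder
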
  • 24. Interceptor Chaining • Processed in Order Listed in Configuration (source r1 example): agent.sources.r1.interceptors=i1  i2  i3
 agent.sources.r1.interceptors.i1.type=timestamp
 agent.sources.r1.interceptors.i1.preserveExisting=true
 agent.sources.r1.interceptors.i2.type=static
 agent.sources.r1.interceptors.i2.key=datacenter
 agent.sources.r1.interceptors.i2.value=CHI
 agent.sources.r1.interceptors.i3.type=host
 agent.sources.r1.interceptors.i3.hostHeader=relay
 agent.sources.r1.interceptors.i3.useIP=false   • Resulting Headers added before writing to Channel: {“timestamp”:“1392350333234”,  “datacenter”:“CHI”,   “relay”:“flumebox.example.com”}
  • 25. Morphlines • Interceptor and Sink forms. • See Cloudera Website/Blog • Created to ease transforms and Cloudera Search/Flume integration. • An example: # convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
 # The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
 # or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
 convertTimestamp {
 field : timestamp
 inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", 
 "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
 inputTimezone : America/Chicago
 outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
 outputTimezone : UTC
 }
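
To wire a morphline like the one above into an agent, the morphline interceptor from the Cloudera/Kite integration is configured roughly as below (verify class and property names against the Flume User Guide; the file path is a placeholder):

  agent.sources.r1.interceptors = m1
  agent.sources.r1.interceptors.m1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
  agent.sources.r1.interceptors.m1.morphlineFile = /etc/flume-ng/conf/morphline.conf
  agent.sources.r1.interceptors.m1.morphlineId = morphline1
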
  • 26. Avro • Apache Avro - Data Serialization • http://avro.apache.org/ • Storage Format and Wire Protocol • Self-Describing (schema written with the data) • Supports Compression of Data (not container — so MapReduce friendly — “splittable”) • Binary friendly — Doesn’t require records separated by \n
  • 27. Avro Source/Sink • Preferred inter-agent transport in Flume • Simple Configuration (host + port for sink and port for source) • Minimal transformation needed for Flume Events • Versions of Avro in client & server don’t need to match — only payload versioning matters (think protocol buffers vs Java serialization)
  • 28. Avro Source/Sink Config foo.sources=…
 foo.channels=channel-foo
 foo.channels.channel-foo.type=memory
 foo.sinks=sink-foo
 foo.sinks.sink-foo.channel=channel-foo
 foo.sinks.sink-foo.type=avro
 foo.sinks.sink-foo.hostname=bar.example.com
 foo.sinks.sink-foo.port=12345
 foo.sinks.sink-foo.compression-type=deflate
 bar.sources=datafromfoo
 bar.sources.datafromfoo.type=avro
 bar.sources.datafromfoo.bind=0.0.0.0
 bar.sources.datafromfoo.port=12345
 bar.sources.datafromfoo.compression-type=deflate
 bar.sources.datafromfoo.channels=channel-bar
 bar.channels=channel-bar
 bar.channels.channel-bar.type=memory
 bar.sinks=…
  • 29. log4j Avro Sink • Remember that Web Server pushing data to a Source? • Use the Flume Avro log4j appender! • log level, category, etc. become headers in Event • “message” String becomes the body
  • 30. log4j Configuration • log4j.properties sender (include flume-ng-sdk-1.X.X.jar in project): log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
 log4j.appender.flume.Hostname=example.com
 log4j.appender.flume.Port=12345
 log4j.appender.flume.UnsafeMode=true
 
 log4j.logger.org.example.MyClass=DEBUG,flume   • flume avro receiver: agent.sources=logs
 agent.sources.logs.type=avro
 agent.sources.logs.bind=0.0.0.0
 agent.sources.logs.port=12345
 agent.sources.logs.channels=…
  • 31. Avro Client • Send data to AvroSource from command line • Run flume program with avro-client instead of agent parameter $ bin/flume-ng avro-client -H server.example.com -p 12345 [-F input_file] • Each line of the file (or stdin if no file given) becomes an event • Useful for testing or injecting data from outside Flume sources (ExecSource vs cronjob which pipes output to avro-client).
  • 32. HDFSSink • Read from Channel and write to a file in HDFS in chunks • Until 1 of 3 things happens: • some amount of time elapses (rollInterval) • some number of records have been written (rollCount) • some size of data has been written (rollSize) • Close that file and start a new one
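
The three roll triggers correspond to HDFS sink properties; a small sketch with illustrative values (setting a trigger to 0 disables it):

  # rollInterval is in seconds, rollCount in events, rollSize in bytes
  agent.sinks.k1.hdfs.rollInterval = 300
  agent.sinks.k1.hdfs.rollCount = 10000
  agent.sinks.k1.hdfs.rollSize = 134217728
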
  • 33. HDFS Configuration foo.sources=…
 foo.channels=channel-foo
 foo.channels.channel-foo.type=memory
 foo.sinks=sink-foo
 foo.sinks.sink-foo.channel=channel-foo
 foo.sinks.sink-foo.type=hdfs
 foo.sinks.sink-foo.hdfs.path=hdfs://NN/data/%Y/%m/%d/%H
 foo.sinks.sink-foo.hdfs.rollInterval=60
 foo.sinks.sink-foo.hdfs.filePrefix=log
 foo.sinks.sink-foo.hdfs.fileSuffix=.avro
 foo.sinks.sink-foo.hdfs.inUsePrefix=_
 foo.sinks.sink-foo.serializer=avro_event
 foo.sinks.sink-foo.serializer.compressionCodec=snappy
  • 34. HDFS writing… drwxr-x--- - flume flume 0 2014-02-16 17:04 /data/2014/02/16/23
 -rw-r----- 3 flume flume 0 2014-02-16 17:04 /data/2014/02/16/23/_log.1392591607925.avro.tmp
 -rw-r----- 3 flume flume 1877 2014-02-16 17:01 /data/2014/02/16/23/log.1392591607923.avro
 -rw-r----- 3 flume flume 1955 2014-02-16 17:02 /data/2014/02/16/23/log.1392591607924.avro
 -rw-r----- 3 flume flume 2390 2014-02-16 17:04 /data/2014/02/16/23/log.1392591798436.avro • The zero length .tmp file is the current file. Won’t see real size until it closes (just like when you do a hadoop fs -put) • Use …hdfs.inUsePrefix=_ to prevent open files from being included in MapReduce jobs
  • 35. Event Serializers • Defines how the Event gets written to Sink • Just the body as a UTF-8 String agent.sinks.foo-sink.serializer=text • Headers and Body as UTF-8 String agent.sinks.foo-sink.serializer=header_and_text • Avro (Flume record Schema) agent.sinks.foo-sink.serializer=avro_event • Custom (none of the above meets your needs)
  • 36. Lessons Learned
  • 37. Source: https://xkcd.com/1179/ Too Many…
  • 38. Timezones are Evil • Daylight saving time causes problems twice a year (in Spring: no 2am hour. In Fall: twice the data during 2am hour — 02:15? Which one?) • Date processing in MapReduce jobs: Hourly jobs, filters, etc. • Dated paths: hdfs://NN/data/%Y/%m/%d/%H • Use UTC: -Duser.timezone=UTC • Use one of the ISO8601 formats like 2014-02-26T18:00:00.000Z • Sorts the way you usually want • Every time library supports it* - and if not, easy to parse.
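
One way to force the agent JVM into UTC, assuming the stock flume-ng startup script is used, is via JAVA_OPTS in conf/flume-env.sh:

  # conf/flume-env.sh
  export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"
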
  • 39. Generally Speaking… • Async handoff doesn’t work under load when bad stuff happens [diagram: a writer and a reader on either side of a filesystem, queue, database, or whatever — which is Not ∞]
  • 40. Async Handoff Oops Flume Agent tail -F foo.log foo.log
  • 41. Async Handoff Oops Flume Agent tail -F foo.log foo.log.1
  • 42. Async Handoff Oops Flume Agent tail -F foo.log foo.log foo.log.1
  • 43. Async Handoff Oops Flume Agent tail -F foo.log foo.log foo.log.2
  • 44. Async Handoff Oops Flume Agent tail -F foo.log foo.log.1 foo.log.2
  • 45. Async Handoff Oops Flume Agent tail -F foo.log foo.log.1 foo.log.2 foo.log
  • 46. Async Handoff Oops Flume Agent tail -F foo.log foo.log.1 foo.log.2 foo.log
  • 47. Async Handoff Oops Flume Agent tail -F foo.log foo.log.1 foo.log.2 foo.log X
  • 48. Don’t Use Tail • Tailing a file for input is bad - assumptions are made that aren’t guarantees. • Direct support removed during Flume rewrite • Handoff can go bad with files: when the writer is faster than the reader • With a queue: when the reader doesn’t read before the expiration time • No way to apply “back pressure” to tell tail there is a problem. It isn’t listening…
  • 49. What can I use? • If you can’t use the log4j Avro Appender… • Use logrotate to move old logs to “spool” directory • SpoolingDirectorySource • Finally, cron job to remove .COMPLETED files (for delayed delete) OR set deletePolicy=immediate (delete right away) • Alternatively use logrotate with avro-client? (probably other ways too…)
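
A minimal sketch of the spooling-directory approach (the directory path is a placeholder; deletePolicy accepts never, which leaves .COMPLETED files behind, or immediate):

  agent.sources = spool
  agent.sources.spool.type = spooldir
  agent.sources.spool.spoolDir = /var/log/flume-spool
  agent.sources.spool.deletePolicy = never
  agent.sources.spool.channels = c1
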
  • 50. RAM or Disk Channels? Source: http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
  • 51. Duplicate Events • Transactions only at Agent level • You may see Events more than once • Distributed Transactions are expensive • Just deal with in query/scrub phase — much less costly than trying to prevent it from happening
  • 52. Late Data • Data could be “late”/delayed • Outages • Restarts • Act of Nature • Only sure thing is a “database” — single write + ACK • Depending on your monitoring, it could be REALLY LATE.
  • 53. Monitoring • Know when it breaks so you can fix it before you can’t ingest new data (and it is lost) • This time window is small if volume is high • Flume Monitoring still WIP, but hooks are there
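
For example, the built-in counters can be exposed as JSON over HTTP by passing monitoring properties at startup (the port and config file name here are arbitrary):

  bin/flume-ng agent -n agent -c conf -f agent.properties \
      -Dflume.monitoring.type=http -Dflume.monitoring.port=41414
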
  • 54. Other Operational Concerns • resource utilization - number of open files when writing (file descriptors), disk space used for file channel, disk contention, disk speed* • number of inbound and outbound sockets - may need to tier (Avro Source/Sink) • minimize hops if possible - another place for data to get stuck
  • 55. Not everything is a nail • Flume is great for handling individual records • What if you need to compute an average? • Get a Stream Processing system • Storm (Twitter’s) • Samza (LinkedIn’s) • Others… • Flume can co-exist with these — use most appropriate tool
  • 56. Questions? …and thanks! Slides @ http://slideshare.net/bacoboy