Deploying Apache Flume to enable low-latency analytics

The driving question behind redesigns of countless data collection architectures has often been: "how can we make the data available to our analytical systems faster?" Increasingly, the go-to solution for this data collection problem is Apache Flume. This talk explores architectures and techniques for designing a low-latency, Flume-based data collection and delivery system that enables Hadoop-based analytics. It covers getting data into Flume, landing it on HDFS and HBase, and making it queryable as quickly as possible, along with best practices for scaling up collection, addressing de-duplication, and combining streaming and batch processing using Flume and other Hadoop ecosystem components.

Speaker notes:
  • Four tier-1 agents drain into one tier-2 agent, which then distributes its load over two tier-3 agents. You can have a single config file for 3 agents, pass that around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
  • The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and has better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bound by memory. The JDBC channel is not recommended due to slow performance.
  • Avro is standard. Channels support transactions. Flume sources include: Avro, Exec, Syslog, Spooling Directory, HTTP, embedded agent, and JMS.
  • Polling semantics: a sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (which uses the synchronous HBase API) for better performance. The Null sink drops events on the floor.
  • In the upper diagram, the three agents' flow is healthy. In the lower diagram, a sink fails to communicate with its downstream source, so the channel (reservoir) fills up, and that backpressure cascades upstream, buffering against downstream hardware failures. No events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active again.
  • Interactive SQL for Hadoop: responses in seconds vs. minutes or hours, 4-100x faster than Hive, nearly ANSI-92 standard SQL with HiveQL (CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.), ODBC/JDBC drivers, and a compatible SQL interface for existing Hadoop/CDH applications. Native MPP query engine: purpose-built for low-latency queries, with a runtime separate from MapReduce (which is designed for batch processing), and tightly integrated with the Hadoop ecosystem, a major design imperative and differentiator for Cloudera: a single system (no integration), native open file formats compatible across the ecosystem (no copying), a single metadata model (no synchronization), a single set of hardware and system resources (better performance, lower cost), and integrated, end-to-end security. Open source: keeps with our strategy of an open platform (if it stores or processes data, it's open source), Apache-licensed, with code available on GitHub.

1. Deploying Apache Flume for Low-Latency Analytics
Mike Percy
Software Engineer, Cloudera
PMC Member & Committer, Apache Flume
2. About me
• Software engineer @ Cloudera
• Previous life @ Yahoo! in CORE
• PMC member / committer on Apache Flume
• Part-time MSCS student @ Stanford
• I play the drums (when I can find the time!)
• Twitter: @mike_percy
3. What do you mean by low-latency?
• Shooting for a 2-minute SLA
  • from time of occurrence
  • to visibility via interactive SQL
• Pretty modest SLA
• The goal is fast access to all of our "big data"
  • from a single SQL interface
4. Approach
• Stream Avro data into HDFS using Flume
  • Lots of small files
• Convert the Avro data to Parquet in batches
  • Compact into big files
• Efficiently query both data sets using Impala
  • UNION ALL or (soon) VIEW
• Eventually delete the small files
5. Agenda
1. Using Flume
  • Streaming Avro into HDFS using Flume
2. Using Hive & Impala
  • Batch-converting older Avro data to Parquet
  • How to query the whole data set (new and old together)
3. Gotchas
4. A little about HBase
github.com/mpercy/flume-rtq-hadoop-summit-2013
6. Why Flume event streaming is awesome
• Couldn't I just do this with a shell script?
  • What year is this, 2001? There is a better way!
• Scalable collection, aggregation of events (i.e. logs)
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
7. Flume Agent design
8. Multi-agent network architecture
Typical fan-in topology. Shown here:
• Many agents (1 per host) at edge tier
• 4 "collector" agents in that data center
• 3 centralized agents writing to HDFS
9. Events
Flume's core data movement atom: the Event
• An Event has a simple schema
• Event header: Map<String, String>
  • similar in spirit to HTTP headers
• Event body: array of bytes
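As a sketch of what that looks like to client code (not part of the deck), the Flume SDK's EventBuilder produces exactly this header-map-plus-byte-array structure; the header key and CSV body below are illustrative values for the demo schema:

  // Minimal sketch: build a Flume Event with headers and a byte-array body.
  // The header key and the CSV payload are illustrative, not from the talk.
  import java.nio.charset.Charset;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.flume.Event;
  import org.apache.flume.event.EventBuilder;

  public class EventExample {
    public static void main(String[] args) {
      Map<String, String> headers = new HashMap<String, String>();
      headers.put("timestamp", String.valueOf(System.currentTimeMillis()));

      // The body is opaque bytes; here it is one CSV line for CSVAvroSerializer.
      Event event = EventBuilder.withBody(
          "1372244973556,1372244973556,click,30111,10.1.2.3,ylucas,/product/30111,http://example.com",
          Charset.forName("UTF-8"), headers);

      byte[] body = event.getBody();                  // array of bytes
      Map<String, String> hdrs = event.getHeaders();  // Map<String, String>
    }
  }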
10. Channels
• Passive component
• Channel type determines the reliability guarantees
• Stock channel types:
  • JDBC – has performance issues
  • Memory – lower latency for small writes, but not durable
  • File – provides durability; most people use this
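Since the file channel is the one most deployments end up with, here is a minimal configuration sketch; the channel name, directory paths, and capacities are illustrative, not from the talk:

  # illustrative file channel config; spread dataDirs across disks for throughput
  agent.channels.file1.type = file
  agent.channels.file1.checkpointDir = /var/lib/flume/file1/checkpoint
  agent.channels.file1.dataDirs = /data1/flume/file1,/data2/flume/file1
  agent.channels.file1.capacity = 1000000
  agent.channels.file1.transactionCapacity = 10000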
11. Sources
• Event-driven or polling-based
• Most sources can accept batches of events
• Stock source implementations:
  • Avro-RPC – other Java-based Flume agents can send data to this source port
  • Thrift-RPC – for interop with Thrift clients written in any language
  • HTTP – post via a REST service (extensible)
  • JMS – ingest from Java Message Service
  • Syslog-UDP, Syslog-TCP
  • Netcat
  • Scribe
  • Spooling Directory – parse and ingest completed log files
  • Exec – execute shell commands and ingest the output
12. Sinks
• All sinks are polling-based
• Most sinks can process batches of events at a time
• Stock sink implementations (as of 1.4.0 RC1):
  • HDFS
  • HBase (2 variants)
  • SolrCloud
  • ElasticSearch
  • Avro-RPC, Thrift-RPC – Flume agent inter-connect
  • File Roller – write to local filesystem
  • Null, Logger, Seq, Stress – for testing purposes
13. Flume component interactions
• Source: puts events into the local channel
• Channel: stores events until someone takes them
• Sink: takes events from the local channel
• On failure, sinks back off and retry forever until success
14. Flume configuration example
agent1.properties:

  # Flume will start these named components
  agent1.sources = src1
  agent1.channels = ch1
  agent1.sinks = sink1

  # channel config
  agent1.channels.ch1.type = memory

  # source config
  agent1.sources.src1.type = netcat
  agent1.sources.src1.channels = ch1
  agent1.sources.src1.bind = 127.0.0.1
  agent1.sources.src1.port = 8080

  # sink config
  agent1.sinks.sink1.type = logger
  agent1.sinks.sink1.channel = ch1
15. Our Flume config

  agent.channels = mem1
  agent.sources = netcat
  agent.sinks = hdfs1 hdfs2 hdfs3 hdfs4

  agent.channels.mem1.type = memory
  agent.channels.mem1.capacity = 1000000
  agent.channels.mem1.transactionCapacity = 10000

  agent.sources.netcat.type = netcat
  agent.sources.netcat.channels = mem1
  agent.sources.netcat.bind = 0.0.0.0
  agent.sources.netcat.port = 8080
  agent.sources.netcat.ack-every-event = false

  agent.sinks.hdfs1.type = hdfs
  agent.sinks.hdfs1.channel = mem1
  agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
  agent.sinks.hdfs1.hdfs.path = hdfs:///flume/webdata/avro/year=%Y/month=%m/day=%d
  agent.sinks.hdfs1.hdfs.fileType = DataStream
  agent.sinks.hdfs1.hdfs.inUsePrefix = .
  agent.sinks.hdfs1.hdfs.filePrefix = log
  agent.sinks.hdfs1.hdfs.fileSuffix = .s1.avro
  agent.sinks.hdfs1.hdfs.batchSize = 10000
  agent.sinks.hdfs1.hdfs.rollInterval = 30
  agent.sinks.hdfs1.hdfs.rollCount = 0
  agent.sinks.hdfs1.hdfs.rollSize = 0
  agent.sinks.hdfs1.hdfs.idleTimeout = 60
  agent.sinks.hdfs1.hdfs.callTimeout = 25000
  agent.sinks.hdfs1.serializer = com.cloudera.flume.demo.CSVAvroSerializer$Builder

  agent.sinks.hdfs2.type = hdfs
  agent.sinks.hdfs2.channel = mem1
  # etc … sink configs 2 thru 4 omitted for brevity
16. Avro schema

  {
    "type": "record",
    "name": "UserAction",
    "namespace": "com.cloudera.flume.demo",
    "fields": [
      {"name": "txn_id", "type": "long"},
      {"name": "ts", "type": "long"},
      {"name": "action", "type": "string"},
      {"name": "product_id", "type": "long"},
      {"name": "user_ip", "type": "string"},
      {"name": "user_name", "type": "string"},
      {"name": "path", "type": "string"},
      {"name": "referer", "type": "string", "default": ""}
    ]
  }
17. Custom Event Parser / Avro Serializer

  /** Converts comma-separated-value (CSV) input text
      into binary Avro output on HDFS. */
  public class CSVAvroSerializer extends AbstractAvroEventSerializer<UserAction> {

    private OutputStream out;

    private CSVAvroSerializer(OutputStream out) {
      this.out = out;
    }

    protected OutputStream getOutputStream() {
      return out;
    }

    protected Schema getSchema() {
      return UserAction.SCHEMA$;
    }

    protected UserAction convert(Event event) {
      // Each event body is one CSV line whose fields match the UserAction schema.
      String line = new String(event.getBody());
      String[] fields = line.split(",");
      UserAction action = new UserAction();
      action.setTxnId(Long.parseLong(fields[0]));
      action.setTimestamp(Long.parseLong(fields[1]));
      action.setAction(fields[2]);
      action.setProductId(Long.parseLong(fields[3]));
      action.setUserIp(fields[4]);
      action.setUserName(fields[5]);
      action.setPath(fields[6]);
      action.setReferer(fields[7]);
      return action;
    }

    public static class Builder {
      // Builder impl omitted for brevity
    }
  }
18. And now our data flows into Hadoop
Send some data to Flume:

  ./generator.pl | nc localhost 8080

Flume does the rest:

  $ hadoop fs -ls /flume/webdata/avro/year=2013/month=06/day=26
  Found 16 items
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s1.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s2.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s3.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s4.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s1.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s2.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s3.avro
  /flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s4.avro
  . . .
19. Querying our data via SQL: Impala and Hive
Shares everything client-facing:
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• Hue GUI
But built for different purposes:
• Hive: runs on MapReduce and is ideal for batch processing
• Impala: native MPP query engine, ideal for interactive SQL
(Diagram: both engines share storage (HDFS, HBase), file formats (text, RCFile, Parquet, Avro, etc.), metadata, and resource management; each adds its own SQL syntax and compute framework, MapReduce for Hive's batch processing versus Impala's own engine for interactive SQL.)
20. Cloudera Impala
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open Source
• Apache-licensed
21. Impala Query Execution
(Diagram: a SQL app sends a request over ODBC to one of several impalad nodes; each node runs a query planner, query coordinator, and query executor next to the local HDFS DataNode and HBase, and the cluster shares the Hive Metastore, HDFS NameNode, and Statestore.)
1) Request arrives via ODBC/JDBC/Beeswax/Shell
22. Impala Query Execution
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
23. Impala Query Execution
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
24. Parquet File Format: http://parquet.io
Open source, columnar file format for Hadoop developed by Cloudera & Twitter
• Rowgroup format: file contains multiple horiz. slices
• Supports storing each column in a separate file
• Supports fully shredded nested data
• Column values stored in native types
• Supports index pages for fast lookup
• Extensible value encodings
25. Table setup (using the Hive shell)

  CREATE EXTERNAL TABLE webdataIn
  PARTITIONED BY (year int, month int, day int)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/flume/webdata/avro/'
  TBLPROPERTIES ('avro.schema.url' = 'hdfs:///user/mpercy/UserAction.avsc');

  /* this must be done for every new day: */
  ALTER TABLE webdataIn ADD PARTITION (year = 2013, month = 06, day = 20);
26. Conversion to Parquet
Create the archive table from Impala:

  impala> CREATE TABLE webdataArchive (
    txn_id bigint,
    ts bigint,
    action string,
    product_id bigint,
    user_ip string,
    user_name string,
    path string,
    referer string)
  PARTITIONED BY (year int, month int, day int)
  STORED AS PARQUETFILE;
27. Conversion to Parquet
Load the Parquet table from an Impala SELECT:

  ALTER TABLE webdataArchive
    ADD PARTITION (year = 2013, month = 6, day = 20);

  INSERT OVERWRITE TABLE webdataArchive
    PARTITION (year = 2013, month = 6, day = 20)
  SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
  FROM webdataIn
  WHERE year = 2013 AND month = 6 AND day = 20;
28. Partition update & metadata refresh
Impala can't see the Avro data until you add the partition and refresh the metadata for the table
• Pre-create partitions on a daily basis:

  hive> ALTER TABLE webdataIn ADD PARTITION (year = 2013, month = 6, day = 20);

• Refresh metadata for incoming table every minute:

  impala> REFRESH webdataIn
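One way to automate both steps, sketched below with hypothetical cron entries (the paths, schedule, and GNU date usage are assumptions, not from the talk):

  # hypothetical crontab entries; adjust to your cluster and scheduler
  # 00:01 daily: create today's partition in the incoming Avro table (GNU date)
  1 0 * * * hive -e "ALTER TABLE webdataIn ADD PARTITION (year = $(date +\%Y), month = $(date +\%-m), day = $(date +\%-d));"
  # every minute: make newly landed files visible to Impala
  * * * * * impala-shell -q "REFRESH webdataIn"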
29. Querying against both data sets
Query against the correct table based on job schedule:

  SELECT txn_id, product_id, user_name, action
  FROM webdataIn
  WHERE year = 2013 AND month = 6 AND (day = 25 OR day = 26)
    AND product_id = 30111
  UNION ALL
  SELECT txn_id, product_id, user_name, action
  FROM webdataArchive
  WHERE NOT (year = 2013 AND month = 6 AND (day = 25 OR day = 26))
    AND product_id = 30111;
30. Query results

  +---------------+------------+------------+--------+
  | txn_id        | product_id | user_name  | action |
  +---------------+------------+------------+--------+
  | 1372244973556 | 30111      | ylucas     | click  |
  | 1372244169096 | 30111      | fdiaz      | click  |
  | 1372244563287 | 30111      | vbernard   | view   |
  | 1372244268481 | 30111      | ntrevino   | view   |
  | 1372244381025 | 30111      | oramsey    | click  |
  | 1372244242387 | 30111      | wwallace   | click  |
  | 1372244227407 | 30111      | thughes    | click  |
  | 1372245609209 | 30111      | gyoder     | view   |
  | 1372245717845 | 30111      | yblevins   | view   |
  | 1372245733985 | 30111      | moliver    | click  |
  | 1372245756852 | 30111      | pjensen    | view   |
  | 1372245800509 | 30111      | ucrane     | view   |
  | 1372244848845 | 30111      | tryan      | click  |
  | 1372245073616 | 30111      | pgilbert   | view   |
  | 1372245141668 | 30111      | walexander | click  |
  | 1372245362977 | 30111      | agillespie | click  |
  | 1372244892540 | 30111      | ypham      | view   |
  | 1372245466121 | 30111      | yowens     | buy    |
  | 1372245515189 | 30111      | emurray    | view   |
  +---------------+------------+------------+--------+
31. Views
• It would be much better to be able to define a view to simplify your queries instead of using UNION ALL
• Hive currently supports views
• Impala 1.1 will add VIEW support
  • Available next month (July 2013)
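As an illustration of the intent, a Hive-style view over the two tables might look like the sketch below; the date predicates are hard-coded to mirror slide 29 and would in practice be maintained by the conversion job:

  -- hypothetical view combining the streaming and archive tables
  CREATE VIEW webdata AS
  SELECT * FROM (
    SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
    FROM webdataIn
    WHERE year = 2013 AND month = 6 AND (day = 25 OR day = 26)
    UNION ALL
    SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
    FROM webdataArchive
    WHERE NOT (year = 2013 AND month = 6 AND (day = 25 OR day = 26))
  ) unioned;

  -- queries then target the view instead of repeating the UNION ALL
  SELECT txn_id, product_id, user_name, action
  FROM webdata
  WHERE product_id = 30111;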
32. Other considerations
• Don't forget to delete the small files once they have been compacted
• Keep your incoming data table to a minimum
  • REFRESH <table> is expensive
• Flume may write duplicates
  • Your webdataIn table may return duplicate entries
  • Consider doing a de-duplication job via M/R before the INSERT OVERWRITE
  • Your schema should have keys to de-dup on
• DISTINCT might help with both of these issues
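A minimal sketch of the DISTINCT option, folded into the conversion INSERT from slide 27; it assumes duplicate rows are exact copies, so it is lighter-weight than a dedicated M/R de-duplication job:

  -- illustrative only: removes exact-copy duplicates during conversion
  INSERT OVERWRITE TABLE webdataArchive
    PARTITION (year = 2013, month = 6, day = 20)
  SELECT DISTINCT txn_id, ts, action, product_id, user_ip, user_name, path, referer
  FROM webdataIn
  WHERE year = 2013 AND month = 6 AND day = 20;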
33. Even lower latency via Flume & HBase
• Flume has 2 HBase sinks:
  • ASYNC_HBASE is faster, but doesn't support Kerberos or 0.96 yet
  • HBASE is slower but uses the stock libs and supports everything
• Take care of de-dup by using idempotent operations
  • Write to unique keys
• Stream + Batch pattern is useful here if you need counters
  • Approximate counts w/ real-time increment ops
  • Go back and reconcile the streaming numbers with batch-generated accurate numbers, excluding any duplicates created by retries
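For reference, a minimal HBase sink configuration sketch; the sink name, table, and column family below are made-up examples, and a Kerberos-secured cluster would need the HBASE sink instead of asynchbase:

  # illustrative asynchbase sink config (table and column family are hypothetical)
  agent.sinks.hbase1.type = asynchbase
  agent.sinks.hbase1.channel = mem1
  agent.sinks.hbase1.table = webdata
  agent.sinks.hbase1.columnFamily = d
  agent.sinks.hbase1.batchSize = 1000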
34. Flume ingestion strategies
• In this example, we used "netcat" because it's easy
• For a system that needs reliable delivery we should use something else:
  • Flume RPC client API
  • Thrift RPC
  • HTTP source (REST)
  • Spooling Directory source
• Memory channel vs. File channel
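To make the RPC client option concrete, here is a hedged sketch; it assumes the agent exposes an Avro source on collector01:41414 (the demo config above uses a netcat source instead), and the CSV body simply mirrors the UserAction schema:

  // Minimal sketch of the Flume RPC client API; host, port, and payload are
  // illustrative and must match your agent's Avro source configuration.
  import java.nio.charset.Charset;
  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.api.RpcClient;
  import org.apache.flume.api.RpcClientFactory;
  import org.apache.flume.event.EventBuilder;

  public class FlumeRpcExample {
    public static void main(String[] args) throws EventDeliveryException {
      RpcClient client = RpcClientFactory.getDefaultInstance("collector01", 41414);
      try {
        Event event = EventBuilder.withBody(
            "1372244169096,1372244169096,click,30111,10.1.2.4,fdiaz,/product/30111,http://example.com",
            Charset.forName("UTF-8"));
        client.append(event);  // throws EventDeliveryException on failure
      } finally {
        client.close();
      }
    }
  }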
35. For reference
Flume docs
• User Guide: flume.apache.org/FlumeUserGuide.html
• Dev Guide: flume.apache.org/FlumeDeveloperGuide.html
Impala tutorial:
• www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_tutorial.html
36. Flume 1.4.0
• Release candidate is out for version 1.4.0
• Adds:
  • SSL connection security for Avro source
  • Thrift source
  • SolrCloud sink
  • Flume embedding support (embedded agent)
  • File channel performance improvements
  • Easier plugin integration
  • Many other improvements & bug fixes
• Try it: http://people.apache.org/~mpercy/flume/apache-flume-1.4.0-RC1/
37. Flume: How to get involved!
• Join the mailing lists:
  • user-subscribe@flume.apache.org
  • dev-subscribe@flume.apache.org
• Look at the code
  • github.com/apache/flume – mirror of the flume.apache.org git repo
• File a bug (or fix one!)
  • issues.apache.org/jira/browse/FLUME
• More on how to contribute:
  • cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
38. Thank You!
Mike Percy
Work: Software Engineer, Cloudera
Apache: PMC Member & Committer, Apache Flume
Twitter: @mike_percy
Code: github.com/mpercy/flume-rtq-hadoop-summit-2013