4. ● Clients log events specifying a Category name, e.g. ads_view, login_event ...
● Events are grouped together across all clients into the Category
● Events are stored on the Hadoop Distributed File System, bucketed every hour into separate directories (see the path sketch below)
○ /logs/ads_view/2017/05/01/23
○ /logs/login_event/2017/05/01/23
Life of an Event
[Diagram: Clients -> Client Daemon -> HTTP Endpoint -> aggregated by Category -> stored on HDFS]
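As a minimal sketch (not the production client), the hourly HDFS bucket can be derived from a category name and an event timestamp; the UTC zone and the helper name are assumptions for illustration.

  // Sketch: derive /logs/<category>/yyyy/MM/dd/HH for an event, assuming UTC bucketing.
  import java.time.Instant;
  import java.time.ZoneOffset;
  import java.time.format.DateTimeFormatter;

  public final class LogPaths {
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // e.g. hourlyBucket("ads_view", eventTime) -> "/logs/ads_view/2017/05/01/23"
    static String hourlyBucket(String category, Instant eventTime) {
      return "/logs/" + category + "/" + HOURLY.format(eventTime);
    }
  }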
16. Event Collection
Past
Challenges with Scribe
● Too many open file handles to HDFS
○ 600 categories x 1500 aggregators x >6 per hour =~ >5.4M files per hour
● High IO wait on DataNodes at scale
● Max limit on throughput per aggregator
● Difficult to track message drops
● No longer active open source development
17. Event Collection
Present
Apache Flume
● Well defined interfaces
● Open source
● Concept of transactions
● Existing implementations of interfaces
[Diagram: Client -> Flume Agent (Source -> Channel -> Sink) -> HDFS]
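To give a flavor of those well defined interfaces, here is a minimal sketch of a client handing one event to a Flume agent through Flume's stock RPC client. The host, port, event body, and the "category" header name are assumptions, not the actual setup from the talk.

  import java.nio.charset.StandardCharsets;
  import org.apache.flume.Event;
  import org.apache.flume.api.RpcClient;
  import org.apache.flume.api.RpcClientFactory;
  import org.apache.flume.event.EventBuilder;

  public class LogClient {
    public static void main(String[] args) throws Exception {
      // Connect to a Flume agent's source; hostname and port are illustrative.
      RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.local", 4141);
      try {
        Event event = EventBuilder.withBody("{\"ad_id\": 42}", StandardCharsets.UTF_8);
        event.getHeaders().put("category", "ads_view");   // assumed header name
        client.append(event);   // source -> channel -> HDFS sink handles the rest
      } finally {
        client.close();
      }
    }
  }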
18. Event Collection
Present
Category Group
● Combine multiple related categories into a category group
● Provide different properties per group
● A group carries events from multiple categories, so fewer, larger combined sequence files are generated (see the mapping sketch below)
[Diagram: Agent 1, Agent 2, Agent 3 -> Category 1, Category 2, Category 3 -> Category Group]
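For illustration only, a category-to-group mapping could look like the sketch below, reusing the group names that appear later in the deck (ads_group, login_group); the helper class and the fallback rule are assumptions.

  import java.util.Map;

  public final class CategoryGroups {
    // Static mapping from individual categories to their category group.
    static final Map<String, String> CATEGORY_TO_GROUP = Map.of(
        "ads_view", "ads_group",
        "ads_click", "ads_group",
        "login_event", "login_group");

    static String groupFor(String category) {
      // Unknown categories fall back to their own name; this fallback is an assumption.
      return CATEGORY_TO_GROUP.getOrDefault(category, category);
    }
  }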
19. Event Collection
Present
Aggregator Group
● A set of aggregators hosting the same set of category groups
● Easier to manage a group of aggregators hosting a subset of categories
[Diagram: Agents 1, 2, 3, ... 8 -> Aggregator Group 1 and Aggregator Group 2, each hosting its own category group (Group 1, Group 2)]
20. Event Collection
Present
Flume features for category/aggregator groups
● Extend Interceptor to multiplex events into groups (sketch after this list)
● Implement Memory Channel Group to have a separate memory channel per category group
● ZooKeeper registration per category group for service discovery
● Metrics for category groups
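A rough sketch of the "multiplex events into groups" idea against Flume's Interceptor interface; the header names and the CategoryGroups helper from the earlier sketch are assumptions, not the implementation described in the talk.

  import java.util.List;
  import org.apache.flume.Context;
  import org.apache.flume.Event;
  import org.apache.flume.interceptor.Interceptor;

  public class CategoryGroupInterceptor implements Interceptor {
    @Override public void initialize() { }

    @Override
    public Event intercept(Event event) {
      String category = event.getHeaders().get("category");          // assumed header
      if (category != null) {
        // Tag the event so downstream channel selection can route by group.
        event.getHeaders().put("category_group", CategoryGroups.groupFor(category));
      }
      return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
      for (Event e : events) { intercept(e); }
      return events;
    }

    @Override public void close() { }

    public static class Builder implements Interceptor.Builder {
      @Override public Interceptor build() { return new CategoryGroupInterceptor(); }
      @Override public void configure(Context context) { }
    }
  }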
21. Event Collection
Present
Flume performance improvements
● HDFSEventSink batching increased throughput ~5x, reducing spikes on the memory channel (sketch of the idea below)
● Implement buffering in HDFSEventSink instead of using SpillableMemoryChannel
● Stream events at close to network speed
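The batching change itself lives inside HDFSEventSink; the sketch below only illustrates the general idea of taking many events per channel transaction so each HDFS flush amortizes transaction overhead. The batch size and the HDFS write helpers are placeholders, not the actual patch.

  import org.apache.flume.Channel;
  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.Transaction;
  import org.apache.flume.sink.AbstractSink;

  public class BatchingSinkSketch extends AbstractSink {
    private static final int BATCH_SIZE = 10_000;   // assumed batch size

    @Override
    public Status process() throws EventDeliveryException {
      Channel channel = getChannel();
      Transaction tx = channel.getTransaction();
      tx.begin();
      try {
        int taken = 0;
        for (; taken < BATCH_SIZE; taken++) {
          Event event = channel.take();
          if (event == null) break;     // channel drained for now
          writeToHdfs(event);           // placeholder for the real HDFS append
        }
        flushHdfs();                    // one flush per batch instead of per event
        tx.commit();
        return taken == 0 ? Status.BACKOFF : Status.READY;
      } catch (Exception e) {
        tx.rollback();
        throw new EventDeliveryException("failed to drain channel", e);
      } finally {
        tx.close();
      }
    }

    private void writeToHdfs(Event event) { /* elided */ }
    private void flushHdfs() { /* elided */ }
  }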
24. Log Processor Stats
Processing Trillion Events per Day
● 8 Wall Clock Hours: to process one day of data
● >1 PB Data per Day: output of cleaned, compressed, consolidated, and converted data
● 20-50% Disk Space: saved by processing Flume sequence files
25. ● Make processing log data easier for analytics teams
● Disk space is at a premium on analytics clusters
● Too many files still cause increased pressure on the NameNode
● Log data is read many times, and different teams all perform the same pre-processing steps on the same data sets
Log Processor Needs
Processing Trillion Events per Day
26. Log Processor Steps (Datacenter 1)
[Diagram: Category Groups -> Demux Jobs -> Categories]
● ads_group/yyyy/mm/dd/hh -> ads_group_demuxer -> ads_view/yyyy/mm/dd/hh, ads_click/yyyy/mm/dd/hh
● login_group/yyyy/mm/dd/hh -> login_group_demuxer -> login_event/yyyy/mm/dd/hh
29. Log Processor Steps
1. Decode: Base64 encoding from logged data
2. Demux: category groups into individual categories for easier consumption by analytics teams
3. Clean: corrupt, empty, or invalid records so data sets are more reliable
4. [Convert]: some categories into Parquet for fastest use in ad-hoc exploratory tools
5. Consolidate: small files to reduce pressure on the NameNode (see the sketch after this list)
6. Compress: logged data at the highest level to save disk space, from LZO level 3 to LZO level 7
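One way to picture the Consolidate step is merging several small hourly sequence files into one larger file. The key/value types (LongWritable/BytesWritable), the method signature, and the omission of compression are assumptions for this sketch, not the production job.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;

  public class Consolidate {
    // Merge many small sequence files into a single output file.
    public static void merge(Configuration conf, Path[] smallFiles, Path merged) throws Exception {
      try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(merged),
              SequenceFile.Writer.keyClass(LongWritable.class),
              SequenceFile.Writer.valueClass(BytesWritable.class))) {
        LongWritable key = new LongWritable();
        BytesWritable value = new BytesWritable();
        for (Path small : smallFiles) {
          try (SequenceFile.Reader reader =
                   new SequenceFile.Reader(conf, SequenceFile.Reader.file(small))) {
            while (reader.next(key, value)) {
              writer.append(key, value);   // copy every record into the merged file
            }
          }
        }
      }
    }
  }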
30. ● Scribe’s contract amounts to sending a binary blob to a port
● Scribe used newline characters to delimit records in a binary blob batch of records
● Valid records may include newline characters
● Scribe Base64-encoded received binary blobs to avoid confusion with the record delimiter
● Base64 encoding is no longer necessary because we have moved to one serialized Thrift object per binary blob
Why Base64 Decoding?
Legacy Choices
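A small sketch of what the legacy decode step amounts to under the framing above: split a batch on newlines, then Base64-decode each record back to its raw bytes. Method and variable names are illustrative.

  import java.util.ArrayList;
  import java.util.Base64;
  import java.util.List;

  public class LegacyDecode {
    // Each line of the batch is one Base64-encoded binary blob.
    static List<byte[]> decodeBatch(String newlineDelimitedBatch) {
      List<byte[]> records = new ArrayList<>();
      for (String line : newlineDelimitedBatch.split("\n")) {
        if (line.isEmpty()) continue;                   // dropping empties is part of "Clean"
        records.add(Base64.getDecoder().decode(line));  // raw record bytes, e.g. a serialized Thrift object
      }
      return records;
    }
  }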
34. ● One log processor daemon per RT Hadoop cluster, where Flume
aggregates logs
● Primarily responsible for demuxing category groups out of the Flume
sequence files
● The daemon schedules Tez jobs every hour for every category group in a
thread pool
● Daemon atomically presents processed category instances so partial data
can’t be read
● Processing proceeds according to criticality of data or “tiers”
Log Processor Daemon
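A loose sketch of the daemon's hourly scheduling loop, assuming a fixed-size worker pool and a submitTezDemuxJob placeholder; tier ordering and atomic presentation of results are elided.

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  public class LogProcessorDaemon {
    private final ScheduledExecutorService clock = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService workers = Executors.newFixedThreadPool(16); // assumed pool size
    private final List<String> categoryGroups;                                // e.g. ads_group, login_group

    LogProcessorDaemon(List<String> categoryGroups) {
      this.categoryGroups = categoryGroups;
    }

    void start() {
      // Every hour, submit one Tez demux job per category group to the thread pool.
      clock.scheduleAtFixedRate(() -> {
        for (String group : categoryGroups) {
          workers.submit(() -> submitTezDemuxJob(group));
        }
      }, 0, 1, TimeUnit.HOURS);
    }

    private void submitTezDemuxJob(String categoryGroup) { /* elided: build and run the Tez DAG */ }
  }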
● Some categories are significantly larger than other categories (KBs vs. TBs)
● MapReduce demux? Each reducer handles a single category
● Streaming demux? Each spout or channel handles a single category
● Massive skew in partitioning by category causes long running tasks which
slows down job completion time
● Relatively well understood fault tolerance semantics similar to
MapReduce, Spark, etc
Why Apache Tez?
● Tez’s dynamic hash partitioner adjusts partitions at runtime if necessary, allowing large partitions to be split further so that multiple tasks, rather than a single task, process events for one category
○ More info at TEZ-3209.
○ Thanks to team member Ming Ma for the contribution!
● Easier horizontal scaling while simultaneously providing more predictable processing times (see the illustration below)
Dynamic Partitioning
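The sketch below is not TEZ-3209; it is a simpler, static stand-in (salting a known-hot category across several partitions) that only illustrates why spreading one category over multiple tasks removes the long-tail task. The real Tez change adjusts partition counts at runtime; the fanout table and hash scheme here are invented for the example.

  import java.util.Map;
  import java.util.concurrent.ThreadLocalRandom;

  public final class SkewAwarePartitioner {
    // How many partitions an oversized category may fan out to; 1 for everything else.
    private static final Map<String, Integer> FANOUT = Map.of("ads_view", 8);

    static int partitionFor(String category, int numPartitions) {
      int fanout = FANOUT.getOrDefault(category, 1);
      // Salt hot categories so their events land in several partitions instead of one.
      int salt = fanout == 1 ? 0 : ThreadLocalRandom.current().nextInt(fanout);
      return Math.floorMod(category.hashCode() + salt, numPartitions);
    }
  }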
41. Log Replication Stats
Replicating Trillion Events per Day
● >24k Copy Jobs per Day: across all analytics clusters
● >1 PB of Data per Day: replicated to analytics clusters
● ~10 Analytics Clusters
42. ● Collocate logs with compute and disk capacity for analytics teams
○ Cross-data center reads are incredibly expensive
○ Cross-rack reads within data center are still expensive
● Too many users and jobs to fit into only the Hadoop RT clusters
● Hadoop RT clusters have a high SLA, so we cannot allow ad-hoc usage
● Critical data set copies and backups in case of data center failures
Log Replication Needs
Processing Trillion Events per Day
54. Log Replication Steps
Distributing Trillion Events per Day
1. Copy: logged data from all processing clusters to the target cluster
2. Merge: copied data into one directory
3. Present: data atomically by renaming it to an accessible location (see the sketch after this list)
4. Publish: metadata to the Data Abstraction Layer to notify analytics teams that data is ready for consumption
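A minimal sketch of the Present step: data is copied and merged into a temporary location, then made visible with a single HDFS rename so readers never observe partial data. The staging path and helper signature are assumptions.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class PresentAtomically {
    public static void present(Configuration conf, String category, String hourDir) throws Exception {
      FileSystem fs = FileSystem.get(conf);
      Path staging = new Path("/logs/.staging/" + category + "/" + hourDir); // assumed staging layout
      Path visible = new Path("/logs/" + category + "/" + hourDir);
      fs.mkdirs(visible.getParent());
      // HDFS rename is a metadata-only move, so the hour appears all at once.
      if (!fs.rename(staging, visible)) {
        throw new IllegalStateException("rename failed for " + visible);
      }
    }
  }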
55. ● One log replicator daemon per DW, PROD, or COLD Hadoop cluster,
where analytics users run queries and jobs
● Primarily responsible for copying category partitions out of the RT Hadoop
clusters
● The daemons schedule Hadoop DistCp jobs every hour for every category
● Daemon atomically presents processed category instances so partial data
can’t be read
● Replication proceeds according to criticality of data or “tiers”
Log Replicator Daemon
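A rough sketch of kicking off one hourly copy with the Hadoop 2.x DistCp API (DistCpOptions moved to a builder in Hadoop 3); the cluster names and paths are illustrative, not the actual replication configuration.

  import java.util.Collections;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.tools.DistCp;
  import org.apache.hadoop.tools.DistCpOptions;

  public class ReplicateHour {
    public static void copy(Configuration conf, String category, String hourDir) throws Exception {
      Path source = new Path("hdfs://rt-cluster/logs/" + category + "/" + hourDir);   // illustrative
      Path target = new Path("hdfs://dw-cluster/logs/" + category + "/" + hourDir);   // illustrative
      DistCpOptions options = new DistCpOptions(Collections.singletonList(source), target);
      // execute() runs the copy as a MapReduce job and blocks until it finishes by default.
      new DistCp(conf, options).execute();
    }
  }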
57. Future of Log Management
● Flume client improvements for tracing, SLA, and throttling
● Flume client support for message validation before logging
● Centralized configuration management
● Processing and replication every 10 minutes instead of every hour