SlideShare a Scribd company logo
Real-time Analytics at Facebook



Zheng Shao
10/18/2011
Agenda
 1   Analytics and Real-time

 2   Data Freeway

 3   Puma

 4   Future Works
Analytics and Real-time
what and why
Facebook Insights
• Use cases
▪   Websites/Ads/Apps/Pages
▪   Time series
▪   Demographic break-downs
▪   Unique counts/heavy hitters

• Major challenges
▪   Scalability
▪   Latency
Analytics based on Hadoop/Hive
                                              Hourly           Daily
        seconds            seconds         Copier/Loader   Pipeline Jobs


 HTTP             Scribe             NFS              Hive             MySQL
                                                     Hadoop

• 3000-node Hadoop cluster

• Copier/Loader: Map-Reduce hides machine failures

• Pipeline Jobs: Hive allows SQL-like syntax

• Good scalability, but poor latency! 24 – 48 hours.
How to Get Lower Latency?




• Small-batch Processing                    • Stream Processing
▪   Run Map-reduce/Hive every hour, every   ▪   Aggregate the data as soon as it arrives
    15 min, every 5 min, …
                                            ▪   How to solve the reliability problem?
▪   How do we reduce per-batch
    overhead?
Decisions
• Stream Processing wins!



• Data Freeway
▪   Scalable Data Stream Framework

• Puma
▪   Reliable Stream Aggregation Engine
Data Freeway
scalable data stream
Scribe

                                                  Batch
                                                  Copier
                                                               HDFS

                                                  tail/fopen
 Scribe          Scribe     Scribe
                Mid-Tier                    NFS
 Clients                    Writers                           Log
• Simple push/RPC-based logging system                     Consumer


• Open-sourced in 2008. 100 log categories at that time.

• Routing driven by static configuration.
Data Freeway
                                                      Continuous
                                                        Copier
                  C1            C2         DataNode

                                                                     HDFS
                                                                       PTail
                  C1            C2         DataNode
                                                                      (in the
                                                                       plan)
Scribe                                                  PTail
Clients      Calligraphus   Calligraphus     HDFS
               Mid-tier       Writers
                                                                      Log
                                                                   Consumer
                    Zookeeper

• 9GB/sec at peak, 10 sec latency, 2500 log categories
Calligraphus
• RPC  File System
▪   Each log category is represented by 1 or more FS directories
▪   Each directory is an ordered list of files

• Bucketing support
▪   Application buckets are application-defined shards.
▪   Infrastructure buckets allows log streams from x B/s to x GB/s

• Performance
▪   Latency: Call sync every 7 seconds
▪   Throughput: Easily saturate 1Gbit NIC
Continuous Copier
• File System  File System

• Low latency and smooth network usage

• Deployment
▪   Implemented as long-running map-only job
▪   Can move to any simple job scheduler

• Coordination
▪   Use lock files on HDFS for now
▪   Plan to move to Zookeeper
PTail
                     files        checkpoint

directory

directory

directory



  • File System  Stream (  RPC )

  • Reliability
  ▪   Checkpoints inserted into the data stream
  ▪   Can roll back to tail from any data checkpoints
  ▪   No data loss/duplicates
Channel Comparison
           Push / RPC Pull / FS
Latency      1-2 sec   10 sec
Loss/Dups      Few      None
Robustness     Low      High           Scribe
Complexity    Low       High
                                        Push /
                                         RPC
                      PTail + ScribeSend          Calligraphus
                                      Pull / FS

                                  Continuous Copier
Puma
real-time aggregation/storage
Overview


 Log Stream    Aggregations                   Serving
                               Storage
• ~ 1M log lines per second, but light read

• Multiple Group-By operations per log line

• The first key in Group By is always time/date-related

• Complex aggregations: Unique user count, most frequent
  elements
MySQL and HBase: one page
                   MySQL                 HBase
Parallel           Manual sharding       Automatic
                                         load balancing
Fail-over          Manual master/slave   Automatic
                   switch
Read efficiency    High                  Low
Write efficiency   Medium                High
Columnar support   No                    Yes
Puma2 Architecture




   PTail         Puma2         HBase        Serving

• PTail provide parallel data streams

• For each log line, Puma2 issue “increment” operations to
  HBase. Puma2 is symmetric (no sharding).

• HBase: single increment on multiple columns
Puma2: Pros and Cons
• Pros
▪   Puma2 code is very simple.
▪   Puma2 service is very easy to maintain.

• Cons
▪   “Increment” operation is expensive.
▪   Do not support complex aggregations.
▪   Hacky implementation of “most frequent elements”.
▪   Can cause small data duplicates.
Improvements in Puma2
• Puma2
▪   Batching of requests. Didn‟t work well because of long-tail distribution.

• HBase
▪   “Increment” operation optimized by reducing locks.
▪   HBase region/HDFS file locality; short-circuited read.
▪   Reliability improvements under high load.

• Still not good enough!
Puma3 Architecture



                          PTail          Puma3         HBase
• Puma3 is sharded by aggregation key.

• Each shard is a hashmap in memory.
                                                  Serving
• Each entry in hashmap is a pair of
  an aggregation key and a user-defined aggregation.

• HBase as persistent key-value storage.
Puma3 Architecture



                                    PTail           Puma3           HBase




• Write workflow
                                                                Serving
▪   For each log line, extract the columns for key and value.
▪   Look up in the hashmap and call user-defined aggregation
Puma3 Architecture



                                    PTail              Puma3       HBase




• Checkpoint workflow
                                                               Serving
▪   Every 5 min, save modified hashmap entries, PTail checkpoint to
    HBase
▪   On startup (after node failure), load from HBase
Puma3 Architecture



                                  PTail          Puma3           HBase




• Read workflow
                                                             Serving
▪   Read uncommitted: directly serve from the in-memory hashmap; load
    from Hbase on miss.
▪   Read committed: read from HBase and serve.
Puma3 Architecture



                                    PTail            Puma3        HBase




• Join
                                                              Serving
▪   Static join table in HBase.
▪   Distributed hash lookup in user-defined function (udf).
▪   Local cache improves the throughput of the udf a lot.
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪   Use 25% of the boxes to handle the same load.
▪   HBase is really good at write throughput.

• Puma3 needs a lot of memory
▪   Use 60GB of memory per box for the hashmap
▪   SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
▪   Adaptive sampling
▪   Bloom filter (in the plan)

• Most frequent item (in the plan)
▪   Lossy counting
▪   Probabilistic lossy counting
PQL – Puma Query Language
• CREATE INPUT TABLE t („time', • CREATE AGGREGATION „abc‟
  „adid‟, „userid‟);              INSERT INTO l (a, b, c)
                                  SELECT
• CREATE VIEW v AS                   udf.hour(time),
  SELECT *, udf.age(userid)          adid,
  FROM t                             age,
  WHERE udf.age(userid) > 21         count(1),
                                     udf.count_distinc(userid)
                                  FROM v
                                  GROUP BY
• CREATE HBASE TABLE h …             udf.hour(time),
                                     adid,
• CREATE LOGICAL TABLE l …           age;
Future Works
challenges and opportunities
Future Works
• Scheduler Support
▪   Just need simple scheduling because the work load is continuous

• Mass adoption
▪   Migrate most daily reporting queries from Hive

• Open Source
▪   Biggest bottleneck: Java Thrift dependency
▪   Will come one by one
Similar Systems
• STREAM from Stanford

• Flume from Cloudera

• S4 from Yahoo

• Rainbird/Storm from Twitter

• Kafka from Linkedin
Key differences
• Scalable Data Streams
▪   9 GB/sec with < 10 sec of latency
▪   Both Push/RPC-based and Pull/File System-based
▪   Components to support arbitrary combination of channels

• Reliable Stream Aggregations
▪   Good support for Time-based Group By, Table-Stream Lookup Join
▪   Query Language:    Puma : Realtime-MR = Hive : MR
▪   No support for sliding window, stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

More Related Content

What's hot

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
Databricks
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Ingesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsarIngesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsar
Timothy Spann
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Databricks
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Brian O'Neill
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
Databricks
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
 
Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufStructured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Databricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
Provectus
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
Databricks
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
Etu Solution
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
DataWorks Summit/Hadoop Summit
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 

What's hot (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Ingesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsarIngesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsar
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufStructured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 

Similar to Xldb2011 tue 0940_facebook_realtimeanalytics

Hic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookHic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookbaggioss
 
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
larsgeorge
 
[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)baggioss
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
强 王
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统yongboy
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
DataWorks Summit/Hadoop Summit
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
Jean-Baptiste Poullet
 
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
HBaseCon
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
Yifeng Jiang
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
HostedbyConfluent
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messagesyarapavan
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
Data Con LA
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Meet Hadoop Family: part 4
Meet Hadoop Family: part 4Meet Hadoop Family: part 4
Meet Hadoop Family: part 4
caizer_x
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)
Alexey Rybak
 

Similar to Xldb2011 tue 0940_facebook_realtimeanalytics (20)

Hic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookHic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebook
 
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
 
[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Meet Hadoop Family: part 4
Meet Hadoop Family: part 4Meet Hadoop Family: part 4
Meet Hadoop Family: part 4
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)
 

More from liqiang xu

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909liqiang xu
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施
liqiang xu
 
Hdfs comics
Hdfs comicsHdfs comics
Hdfs comics
liqiang xu
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksliqiang xu
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerliqiang xu
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseliqiang xu
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)liqiang xu
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能
liqiang xu
 
Nginx internals
Nginx internalsNginx internals
Nginx internalsliqiang xu
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)liqiang xu
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)
liqiang xu
 

More from liqiang xu (12)

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施
 
Hdfs comics
Hdfs comicsHdfs comics
Hdfs comics
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocks
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastner
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_in
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouse
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)
 

Recently uploaded

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 

Recently uploaded (20)

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 

Xldb2011 tue 0940_facebook_realtimeanalytics

  • 1. Real-time Analytics at Facebook Zheng Shao 10/18/2011
  • 2. Agenda 1 Analytics and Real-time 2 Data Freeway 3 Puma 4 Future Works
  • 4. Facebook Insights • Use cases ▪ Websites/Ads/Apps/Pages ▪ Time series ▪ Demographic break-downs ▪ Unique counts/heavy hitters • Major challenges ▪ Scalability ▪ Latency
  • 5. Analytics based on Hadoop/Hive Hourly Daily seconds seconds Copier/Loader Pipeline Jobs HTTP Scribe NFS Hive MySQL Hadoop • 3000-node Hadoop cluster • Copier/Loader: Map-Reduce hides machine failures • Pipeline Jobs: Hive allows SQL-like syntax • Good scalability, but poor latency! 24 – 48 hours.
  • 6. How to Get Lower Latency? • Small-batch Processing • Stream Processing ▪ Run Map-reduce/Hive every hour, every ▪ Aggregate the data as soon as it arrives 15 min, every 5 min, … ▪ How to solve the reliability problem? ▪ How do we reduce per-batch overhead?
  • 7. Decisions • Stream Processing wins! • Data Freeway ▪ Scalable Data Stream Framework • Puma ▪ Reliable Stream Aggregation Engine
  • 9. Scribe Batch Copier HDFS tail/fopen Scribe Scribe Scribe Mid-Tier NFS Clients Writers Log • Simple push/RPC-based logging system Consumer • Open-sourced in 2008. 100 log categories at that time. • Routing driven by static configuration.
  • 10. Data Freeway Continuous Copier C1 C2 DataNode HDFS PTail C1 C2 DataNode (in the plan) Scribe PTail Clients Calligraphus Calligraphus HDFS Mid-tier Writers Log Consumer Zookeeper • 9GB/sec at peak, 10 sec latency, 2500 log categories
  • 11. Calligraphus • RPC  File System ▪ Each log category is represented by 1 or more FS directories ▪ Each directory is an ordered list of files • Bucketing support ▪ Application buckets are application-defined shards. ▪ Infrastructure buckets allows log streams from x B/s to x GB/s • Performance ▪ Latency: Call sync every 7 seconds ▪ Throughput: Easily saturate 1Gbit NIC
  • 12. Continuous Copier • File System  File System • Low latency and smooth network usage • Deployment ▪ Implemented as long-running map-only job ▪ Can move to any simple job scheduler • Coordination ▪ Use lock files on HDFS for now ▪ Plan to move to Zookeeper
  • 13. PTail files checkpoint directory directory directory • File System  Stream (  RPC ) • Reliability ▪ Checkpoints inserted into the data stream ▪ Can roll back to tail from any data checkpoints ▪ No data loss/duplicates
  • 14. Channel Comparison Push / RPC Pull / FS Latency 1-2 sec 10 sec Loss/Dups Few None Robustness Low High Scribe Complexity Low High Push / RPC PTail + ScribeSend Calligraphus Pull / FS Continuous Copier
  • 16. Overview Log Stream Aggregations Serving Storage • ~ 1M log lines per second, but light read • Multiple Group-By operations per log line • The first key in Group By is always time/date-related • Complex aggregations: Unique user count, most frequent elements
  • 17. MySQL and HBase: one page MySQL HBase Parallel Manual sharding Automatic load balancing Fail-over Manual master/slave Automatic switch Read efficiency High Low Write efficiency Medium High Columnar support No Yes
  • 18. Puma2 Architecture PTail Puma2 HBase Serving • PTail provide parallel data streams • For each log line, Puma2 issue “increment” operations to HBase. Puma2 is symmetric (no sharding). • HBase: single increment on multiple columns
  • 19. Puma2: Pros and Cons • Pros ▪ Puma2 code is very simple. ▪ Puma2 service is very easy to maintain. • Cons ▪ “Increment” operation is expensive. ▪ Do not support complex aggregations. ▪ Hacky implementation of “most frequent elements”. ▪ Can cause small data duplicates.
  • 20. Improvements in Puma2 • Puma2 ▪ Batching of requests. Didn‟t work well because of long-tail distribution. • HBase ▪ “Increment” operation optimized by reducing locks. ▪ HBase region/HDFS file locality; short-circuited read. ▪ Reliability improvements under high load. • Still not good enough!
  • 21. Puma3 Architecture PTail Puma3 HBase • Puma3 is sharded by aggregation key. • Each shard is a hashmap in memory. Serving • Each entry in hashmap is a pair of an aggregation key and a user-defined aggregation. • HBase as persistent key-value storage.
  • 22. Puma3 Architecture PTail Puma3 HBase • Write workflow Serving ▪ For each log line, extract the columns for key and value. ▪ Look up in the hashmap and call user-defined aggregation
  • 23. Puma3 Architecture PTail Puma3 HBase • Checkpoint workflow Serving ▪ Every 5 min, save modified hashmap entries, PTail checkpoint to HBase ▪ On startup (after node failure), load from HBase
  • 24. Puma3 Architecture PTail Puma3 HBase • Read workflow Serving ▪ Read uncommitted: directly serve from the in-memory hashmap; load from Hbase on miss. ▪ Read committed: read from HBase and serve.
  • 25. Puma3 Architecture PTail Puma3 HBase • Join Serving ▪ Static join table in HBase. ▪ Distributed hash lookup in user-defined function (udf). ▪ Local cache improves the throughput of the udf a lot.
  • 26. Puma2 / Puma3 comparison • Puma3 is much better in write throughput ▪ Use 25% of the boxes to handle the same load. ▪ HBase is really good at write throughput. • Puma3 needs a lot of memory ▪ Use 60GB of memory per box for the hashmap ▪ SSD can scale to 10x per box.
  • 27. Puma3 Special Aggregations • Unique Counts Calculation ▪ Adaptive sampling ▪ Bloom filter (in the plan) • Most frequent item (in the plan) ▪ Lossy counting ▪ Probabilistic lossy counting
  • 28. PQL – Puma Query Language • CREATE INPUT TABLE t („time', • CREATE AGGREGATION „abc‟ „adid‟, „userid‟); INSERT INTO l (a, b, c) SELECT • CREATE VIEW v AS udf.hour(time), SELECT *, udf.age(userid) adid, FROM t age, WHERE udf.age(userid) > 21 count(1), udf.count_distinc(userid) FROM v GROUP BY • CREATE HBASE TABLE h … udf.hour(time), adid, • CREATE LOGICAL TABLE l … age;
  • 30. Future Works • Scheduler Support ▪ Just need simple scheduling because the work load is continuous • Mass adoption ▪ Migrate most daily reporting queries from Hive • Open Source ▪ Biggest bottleneck: Java Thrift dependency ▪ Will come one by one
  • 31. Similar Systems • STREAM from Stanford • Flume from Cloudera • S4 from Yahoo • Rainbird/Storm from Twitter • Kafka from Linkedin
  • 32. Key differences • Scalable Data Streams ▪ 9 GB/sec with < 10 sec of latency ▪ Both Push/RPC-based and Pull/File System-based ▪ Components to support arbitrary combination of channels • Reliable Stream Aggregations ▪ Good support for Time-based Group By, Table-Stream Lookup Join ▪ Query Language: Puma : Realtime-MR = Hive : MR ▪ No support for sliding window, stream joins
  • 33. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0