Real-time Analytics at Facebook:
Data Freeway and Puma


Zheng Shao
12/2/2011
Agenda
 1   Analytics and Real-time

 2   Data Freeway

 3   Puma

 4   Future Work
Analytics and Real-time
what and why
Facebook Insights
• Use cases
▪   Websites/Ads/Apps/Pages
▪   Time series
▪   Demographic break-downs
▪   Unique counts/heavy hitters

• Major challenges
▪   Scalability
▪   Latency
Analytics based on Hadoop/Hive
[Diagram] HTTP → Scribe (seconds) → NFS (seconds) → Hive/Hadoop (hourly Copier/Loader) → MySQL (daily Pipeline Jobs)

• 3000-node Hadoop cluster

• Copier/Loader: Map-Reduce hides machine failures

• Pipeline Jobs: Hive allows SQL-like syntax

• Good scalability, but poor latency! 24 – 48 hours.
How to Get Lower Latency?




• Small-batch Processing
▪   Run Map-reduce/Hive every hour, every 15 min, every 5 min, …
▪   How do we reduce per-batch overhead?

• Stream Processing
▪   Aggregate the data as soon as it arrives
▪   How to solve the reliability problem?
Decisions
• Stream Processing wins!



• Data Freeway
▪   Scalable Data Stream Framework

• Puma
▪   Reliable Stream Aggregation Engine
Data Freeway
scalable data stream
Scribe

[Diagram] Scribe Clients → Scribe Mid-Tier → Scribe Writers → NFS; from NFS, a Batch Copier loads into HDFS and tail/fopen feeds the Log Consumer

• Simple push/RPC-based logging system


• Open-sourced in 2008. 100 log categories at that time.

• Routing driven by static configuration.
Data Freeway
[Diagram] Scribe Clients → Calligraphus Mid-tier → Calligraphus Writers → HDFS → PTail → Log Consumer; a Continuous Copier replicates to a second HDFS, from which PTail will also read (in the plan); Zookeeper coordinates the Calligraphus tiers
• 9GB/sec at peak, 10 sec latency, 2500 log categories
Calligraphus
• RPC → File System
▪   Each log category is represented by 1 or more FS directories
▪   Each directory is an ordered list of files

• Bucketing support
▪   Application buckets are application-defined shards.
▪   Infrastructure buckets allow log streams to scale from x B/s to x GB/s (sketched below)

• Performance
▪   Latency: Call sync every 7 seconds
▪   Throughput: Easily saturate 1Gbit NIC
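
A minimal sketch of how a log category could map onto bucketed file-system directories, each holding an ordered list of files. The path layout and helper names here are illustrative assumptions, not the actual Calligraphus on-disk format.

```python
import os

def category_directories(root, category, app_buckets, infra_buckets):
    """Enumerate the FS directories for one log category.

    Assumed layout (illustrative only): one directory per
    (application bucket, infrastructure bucket) pair, so a hot
    application bucket can fan out over several directories."""
    dirs = []
    for app in range(app_buckets):
        for infra in range(infra_buckets):
            dirs.append(os.path.join(root, category, f"app{app:03d}", f"part{infra:03d}"))
    return dirs

def ordered_files(directory):
    """Each directory is an ordered list of files; sorting by name
    works when file names embed the date/sequence number."""
    try:
        return sorted(os.listdir(directory))
    except FileNotFoundError:
        return []

if __name__ == "__main__":
    for d in category_directories("/calligraphus", "ad_clicks", app_buckets=2, infra_buckets=2):
        print(d, ordered_files(d))
```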
Continuous Copier
• File System → File System

• Low latency and smooth network usage

• Deployment
▪   Implemented as long-running map-only job
▪   Can move to any simple job scheduler

• Coordination
▪   Use lock files on HDFS for now
▪   Plan to move to Zookeeper
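
The slide only says that copier workers coordinate through lock files on HDFS. The sketch below shows the general idea with the local filesystem standing in for HDFS and an atomic exclusive create acting as the lock; this is an assumption about the mechanism, not the real implementation.

```python
import os
import socket

def try_lock(lock_dir, file_name):
    """Claim a file by atomically creating <file_name>.lock.
    On HDFS the equivalent would be a create that fails if the path
    already exists; here os.O_EXCL plays that role."""
    lock_path = os.path.join(lock_dir, file_name + ".lock")
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None          # another copier already owns this file
    os.write(fd, socket.gethostname().encode())
    os.close(fd)
    return lock_path

def release_lock(lock_path):
    os.remove(lock_path)

if __name__ == "__main__":
    os.makedirs("/tmp/copier-locks", exist_ok=True)
    lock = try_lock("/tmp/copier-locks", "category_A-2011-12-02.0001")
    if lock:
        print("copying file under", lock)
        release_lock(lock)
```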
PTail
[Diagram] PTail tails the ordered files in several directories; a checkpoint records the current file and offset in each directory
  • File System → Stream ( → RPC )

  • Reliability
  ▪   Checkpoints inserted into the data stream
  ▪   Can roll back to tail from any data checkpoints
  ▪   No data loss/duplicates
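
A hypothetical sketch of the checkpoint idea described above: the current file and byte offset per tailed directory is enough to resume or roll back the stream. The dict layout and function names are assumptions, not PTail's real checkpoint format.

```python
import json
import os

def take_checkpoint(positions):
    """positions: {directory: (current_file, byte_offset)}.
    Serializing this is enough to resume or roll back the stream."""
    return json.dumps(positions)

def resume(checkpoint):
    """Reopen each directory at the recorded file/offset so the stream
    is reproduced with no loss and no duplicates at the boundary."""
    for directory, (fname, offset) in json.loads(checkpoint).items():
        path = os.path.join(directory, fname)
        with open(path, "rb") as f:
            f.seek(offset)
            yield directory, f.read()   # everything after the checkpoint

if __name__ == "__main__":
    ckpt = take_checkpoint({"/data/ad_clicks/part000": ("2011-12-02.0007", 1048576)})
    print(ckpt)
```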
Channel Comparison
             Push / RPC   Pull / FS
Latency      1-2 sec      10 sec
Loss/Dups    Few          None
Robustness   Low          High
Complexity   Low          High

[Diagram] Channel converters: Scribe (Push/RPC → Push/RPC), Calligraphus (Push/RPC → Pull/FS), Continuous Copier (Pull/FS → Pull/FS), PTail + ScribeSend (Pull/FS → Push/RPC)
Puma
real-time aggregation/storage
Overview


[Diagram] Log Stream → Aggregations → Storage → Serving
• ~ 1M log lines per second, but light read

• Multiple Group-By operations per log line

• The first key in Group By is always time/date-related

• Complex aggregations: Unique user count, most frequent
  elements
MySQL and HBase: one page
                   MySQL                        HBase
Parallel           Manual sharding              Automatic load balancing
Fail-over          Manual master/slave switch   Automatic
Read efficiency    High                         Low
Write efficiency   Medium                       High
Columnar support   No                           Yes
Puma2 Architecture




[Diagram] PTail → Puma2 → HBase → Serving

• PTail provides parallel data streams

• For each log line, Puma2 issues “increment” operations to HBase. Puma2 is symmetric (no sharding).

• HBase: single increment on multiple columns
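
To make the Puma2 write path concrete, here is a hedged sketch: one log line turns directly into increments on rows keyed by (time bucket, dimensions), one row per group-by. The in-memory `counters` dict stands in for HBase's multi-column increment; none of this is the real HBase client API.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> {column: counter}.
counters = defaultdict(lambda: defaultdict(int))

def hour_bucket(ts):
    return ts - ts % 3600

def process_log_line(line):
    """Puma2-style processing: no local state, every log line is
    turned directly into increments on the storage layer."""
    ts, adid, age = line["time"], line["adid"], line["age"]
    # The first group-by key is always time/date related.
    for row_key in ((hour_bucket(ts), adid), (hour_bucket(ts), adid, age)):
        # HBase can increment several columns of one row in one call;
        # here we simply bump two columns of the dict row.
        counters[row_key]["impressions"] += 1
        counters[row_key]["clicks"] += line.get("click", 0)

if __name__ == "__main__":
    process_log_line({"time": 1322870461, "adid": 42, "age": 25, "click": 1})
    print(dict(counters))
```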
Puma2: Pros and Cons
• Pros
▪   Puma2 code is very simple.
▪   Puma2 service is very easy to maintain.

• Cons
▪   “Increment” operation is expensive.
▪   Does not support complex aggregations.
▪   Hacky implementation of “most frequent elements”.
▪   Can cause small data duplicates.
Improvements in Puma2
• Puma2
▪   Batching of requests. Didn't work well because of long-tail distribution.

• HBase
▪   “Increment” operation optimized by reducing locks.
▪   HBase region/HDFS file locality; short-circuited read.
▪   Reliability improvements under high load.

• Still not good enough!
Puma3 Architecture



[Diagram] PTail → Puma3 → HBase, with a Serving tier reading the aggregations

• Puma3 is sharded by aggregation key.

• Each shard is a hashmap in memory.
• Each entry in hashmap is a pair of
  an aggregation key and a user-defined aggregation.

• HBase as persistent key-value storage.
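
A minimal sketch of the Puma3 model described on this slide: shards selected by hashing the aggregation key, each shard a plain in-memory dict whose values are user-defined aggregation objects. Class and method names are illustrative assumptions, not Puma code.

```python
class CountAgg:
    """One user-defined aggregation; could equally be sum, avg, unique count, ..."""
    def __init__(self):
        self.value = 0
    def add(self, _log_value):
        self.value += 1

class Puma3Shard:
    def __init__(self, make_agg):
        self.hashmap = {}          # aggregation key -> aggregation object
        self.make_agg = make_agg
    def write(self, key, value):
        # Write workflow: look up the key, create the aggregation on
        # first sight, then fold the value in.
        agg = self.hashmap.setdefault(key, self.make_agg())
        agg.add(value)
    def read_uncommitted(self, key):
        agg = self.hashmap.get(key)
        return agg.value if agg else None   # on a miss, Puma3 would fall back to HBase

NUM_SHARDS = 4
shards = [Puma3Shard(CountAgg) for _ in range(NUM_SHARDS)]

def route(key):
    # Sharding by aggregation key; the upstream PTail/Calligraphus
    # bucketing guarantees each key lands on exactly one shard.
    return shards[hash(key) % NUM_SHARDS]

if __name__ == "__main__":
    key = ("2011-12-02 10:00", "ad_42")
    route(key).write(key, {"userid": 7})
    print(route(key).read_uncommitted(key))
```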
Puma3 Architecture



• Write workflow
▪   For each log line, extract the columns for key and value.
▪   Look up in the hashmap and call the user-defined aggregation with the value.
Puma3 Architecture



• Checkpoint workflow
▪   Every 5 min, save modified hashmap entries and the PTail checkpoint to HBase
▪   On startup (after node failure), load from HBase
▪   Get rid of items in memory once the time window has passed
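
A hedged sketch of this checkpoint workflow: every few minutes the dirty hashmap entries and the PTail checkpoint are written out together, expired time windows are dropped from memory, and on restart the state is reloaded. The `storage` dict is a stand-in for HBase, and the key layout (time bucket first) is an assumption.

```python
import time

storage = {}   # stand-in for the HBase checkpoint/aggregation table

def checkpoint(shard_id, dirty_entries, ptail_checkpoint):
    """Persist modified entries together with the stream position so a
    restart replays from a consistent point."""
    storage[("state", shard_id)] = (dict(dirty_entries), ptail_checkpoint)
    dirty_entries.clear()

def expire(hashmap, window_seconds=3600, now=None):
    """Drop aggregations whose time window has passed (keys assumed to
    start with a time bucket); their final values already live in HBase."""
    now = now if now is not None else time.time()
    for key in [k for k in hashmap if k[0] + window_seconds < now]:
        del hashmap[key]

def recover(shard_id):
    """On startup after a failure, reload the hashmap and resume tailing
    from the saved PTail checkpoint."""
    return storage.get(("state", shard_id), ({}, None))

if __name__ == "__main__":
    dirty = {(1322870400, "ad_42"): 17}
    checkpoint(0, dirty, ptail_checkpoint="file 0007 @ 1048576")
    print(recover(0))
```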
Puma3 Architecture



• Read workflow
▪   Read uncommitted: directly serve from the in-memory hashmap; load from HBase on miss.
▪   Read committed: read from HBase and serve.
Puma3 Architecture



• Join
▪   Static join table in HBase.
▪   Distributed hash lookup in user-defined function (udf).

▪   Local cache improves the throughput of the udf a lot.
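
The join is described as a distributed hash lookup in a UDF backed by a static HBase table, with a local cache in front. The sketch below shows that shape with a plain dict standing in for the HBase lookup; it is an assumption about structure, not the real UDF API.

```python
static_table = {"userid:7": {"age": 25, "gender": "m"}}   # stand-in for the static HBase join table
local_cache = {}

def lookup_udf(row_key):
    """Join UDF: check the per-process cache first, then the remote
    table; the cache is what makes per-log-line joins affordable."""
    if row_key in local_cache:
        return local_cache[row_key]
    value = static_table.get(row_key)      # a remote HBase get() in reality
    local_cache[row_key] = value
    return value

if __name__ == "__main__":
    print(lookup_udf("userid:7")["age"])   # remote fetch, then cached
    print(lookup_udf("userid:7")["age"])   # served from the local cache
```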
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪   Use 25% of the boxes to handle the same load.
▪   HBase is really good at write throughput.

• Puma3 needs a lot of memory
▪   Use 60GB of memory per box for the hashmap
▪   SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
▪   Adaptive sampling
▪   Bloom filter (in the plan)

• Most frequent item (in the plan)
▪   Lossy counting
▪   Probabilistic lossy counting
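
Lossy counting is only listed here as planned; for reference, below is a compact sketch of the standard lossy counting algorithm (Manku & Motwani) that the slide refers to, as an illustration of the technique rather than Puma code.

```python
import math

class LossyCounter:
    """Standard lossy counting: with error bound eps, every 1/eps items
    the small entries are pruned, so memory stays bounded while the
    heavy hitters survive."""
    def __init__(self, eps=0.001):
        self.eps = eps
        self.width = int(math.ceil(1.0 / eps))
        self.counts = {}     # item -> (count, max undercount at insertion)
        self.n = 0
        self.bucket = 1
    def add(self, item):
        self.n += 1
        count, delta = self.counts.get(item, (0, self.bucket - 1))
        self.counts[item] = (count + 1, delta)
        if self.n % self.width == 0:          # end of bucket: prune
            self.counts = {k: (c, d) for k, (c, d) in self.counts.items()
                           if c + d > self.bucket}
            self.bucket += 1
    def frequent(self, support):
        """Items whose true frequency may exceed support * n."""
        threshold = (support - self.eps) * self.n
        return [k for k, (c, _) in self.counts.items() if c >= threshold]

if __name__ == "__main__":
    lc = LossyCounter(eps=0.01)
    for x in ["a"] * 500 + ["b"] * 300 + list("cdefghij") * 25:
        lc.add(x)
    print(lc.frequent(0.2))   # expect the heavy hitters ['a', 'b']
```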
PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');

• CREATE VIEW v AS
    SELECT *, udf.age(userid)
    FROM t
    WHERE udf.age(userid) > 21

• CREATE HBASE TABLE h …

• CREATE LOGICAL TABLE l …

• CREATE AGGREGATION 'abc'
    INSERT INTO l (a, b, c)
    SELECT
      udf.hour(time),
      adid,
      age,
      count(1),
      udf.count_distinct(userid)
    FROM v
    GROUP BY
      udf.hour(time),
      adid,
      age;
Future Work
challenges and opportunities
Future Work
• Scheduler Support
▪   Just need simple scheduling because the workload is continuous

• Mass adoption
▪   Migrate most daily reporting queries from Hive

• Open Source
▪   Biggest bottleneck: Java Thrift dependency
▪   Will come one by one
Similar Systems
• STREAM from Stanford

• Flume from Cloudera

• S4 from Yahoo

• Rainbird/Storm from Twitter

• Kafka from LinkedIn
Key differences
• Scalable Data Streams
▪   9 GB/sec with < 10 sec of latency
▪   Both Push/RPC-based and Pull/File System-based
▪   Components to support arbitrary combination of channels

• Reliable Stream Aggregations
▪   Good support for Time-based Group By, Table-Stream Lookup Join
▪   Query Language:    Puma : Realtime-MR = Hive : MR
▪   No support for sliding window, stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

Editor's Notes

  1. Good morning everyone. My name's Zheng Shao. Today I am going to talk about real-time analytics at Facebook.
  2. This is the agenda of the talk. We will start with why we need realtime analytics, then get into details of how we implemented it, and finally future work and comparisons with other systems.
  3. First of all, what is realtime analytics and why we want to do it.
  4. This is the main use case for our analytics. We have a product called Facebook Insights, which allows website owners, advertisers, Facebook application developers, and Facebook page owners to view the time series of impression/click/action counters, the counters broken down by demographics like gender and age, as well as the unique user counters and heavy hitters like most popular URLs. The major challenges of building the backend of this Insights product are twofold. On one hand, we have a huge amount of data coming from both Facebook and non-Facebook websites. On the other hand, customers of the Insights product really want to have low-latency summaries, so that they can immediately know how popular a new article or a new game is.
  5. We did have an existing complete data warehouse solution at Facebook to handle the Insights workload. In short, log streams got generated from HTTP servers and transferred to NFS via a log collection framework called Scribe, all within seconds, and then got copied/loaded into Hadoop. Summaries got generated from daily pipeline jobs and eventually got loaded into MySQL for serving. Specifically, we have a 3000-node Hadoop cluster to handle the scalability issue. Copier/Loader are map-reduce jobs which handle machine failures automatically. And Pipeline Jobs are written in Hive, which has a SQL-like syntax. Pretty good scalability, until we hit the data center power limit. But latency is terrible.
  6. We had 2 ideas on how to improve the latency. The first one is small-batch processing. Instead of using a batch of 1 day, we can produce much smaller batches. The question is how to reduce per-batch overhead, so that tiny batches like 1 min or less make sense. The second one is stream processing. We can aggregate the data as soon as it arrives. This will produce near-realtime results. The question is how to make the system reliable against hardware failures. It turns out the per-batch overhead of Map-Reduce is so high that it's not practical to have even 5-minute batches on our Hadoop cluster, so we finally decided to go with stream processing.
  7. The rest of the talk will focus on two key systems that we built for realtime analytics. The first one, Data Freeway, is a scalable data stream framework on top of Scribe and HDFS. The second one, Puma, is a reliable stream aggregation engine on top of HBase.
  8. This was our old data stream framework. It has several layers of data transportation. The first transport, from clients to mid-tier, is to reduce the fanout from tens of thousands to hundreds; the second transport is to shuffle the data based on log categories, so that one log category goes to a single writer. Then log data gets written into NFS, which is consumed by the batch copier as well as Unix tail/fopen. In short, it's a simple push/RPC-based logging system. Scribe was open-sourced in 2008, when we had 100 log categories. It quickly got adopted by a lot of other companies. The routing is driven by static configuration, which is flexible but has two problems: 1. it is not scalable, because we need to maintain a config for each box in the writers, and a single writer is not scalable; 2. there is a single point of failure in the writers.
  9. We came up with Data Freeway in 2011. Right now it's handling 9GB/sec of data at peak with 10 sec end-to-end latency, and has over 2500 log categories. It contains 4 major components. The first one is Scribe. It's used only at the client, responsible for sending out data via RPCs. The second one is called Calligraphus. It utilizes Zookeeper to manage the ownership of categories, shuffles the data, and writes to HDFS. The third one is called Continuous Copier, which continuously copies files from one HDFS to another as the files grow. The fourth one is called PTail, which in parallel tails multiple directories on HDFS and writes out to stdout. Right now we directly ptail from the HDFS written by Calligraphus, but we plan to tail from the HDFS written by Continuous Copier in the future. Let's get into the details of these components.
  10. Calligraphus is responsible for getting log data from RPC and writing it to the file system. Each log category is represented by 1 or more FS directories. Each directory is an ordered list of files, with the date in the file name. The files can be compressed. This is a very simple protocol for storing log data, probably the simplest that I can think of. The most interesting feature of Calligraphus is the bucketing support. We have application buckets, which are application-defined shards. These are used for sharded log consumers. Most of the big log consumers are sharded because their log stream is too big. We also support infrastructure buckets, which allow a single application bucket to have a throughput from several bytes per second to several gigabytes per second. Each infrastructure bucket is a directory, so big streams can go to multiple directories at the same time. Calligraphus has pretty high performance. We call file system sync every 7 seconds, which is the major source of data latency right now. The network throughput can easily saturate a 1Gbit NIC, and we are planning to use 10Gbit NICs some time soon.
  11. Continuous Copier is for continuous data transfer from one file system to another. Compared with the batch-based map-reduce copier, it provides much lower latency as well as smooth network usage. Right now it's implemented as a long-running map-only job, but it can be easily moved to any simple job scheduling system other than map-reduce. Right now it uses lock files in HDFS for coordination among different nodes, and we plan to move to Zookeeper very soon. The peak throughput of Continuous Copier in production is about 3GB/sec compressed right now.
  12. The last component in Data Freeway is PTail, which transfers data from a file system to an output stream. The key feature of PTail is the checkpoint. A PTail checkpoint contains the current files and the file offsets in each of the directories. This makes it possible for PTail to roll back to an earlier checkpoint and reproduce the data stream without any data loss/duplicates at the boundary.
  13. To wrap up Data Freeway, we support 2 channels for data transfers. Push via RPC has lower latency, can potentially have some loss/dups when the network has a problem, is less robust with respect to machine failures, and has very low complexity in code. Pull via FS has longer latency, but it does not have any loss/dups and is robust to machine failures. The problem is that the code of the file system, especially HDFS, can be pretty complex, and we still need to identify and fix some bugs there. Data Freeway consists of 4 components that allow data transfer between these 2 channels.
  14. This is the simplified architecture of a typical stream aggregation engine. Log streams get aggregated on a set of machines. The summaries are usually saved to storage for persistence. Online serving gets summaries either from the aggregations directly or from the storage. Usually the write throughput is much higher than the read throughput, because analytics data is only viewed by the owners of the website, for example. In our environment, we have on the order of 1M log lines per second. For each of the log lines, we need to do multiple group-by operations, like by age or by gender. The first key in group by is always time/date-related, which means the summaries will become static after some time. Also, we need to support complex aggregations like unique counts and heavy hitters.
  15. Let's look at our storage choices first. We considered using either MySQL or HBase as our storage engine. HBase is much easier to manage in a distributed environment, which was the major reason that we chose HBase. It also has better write efficiency as well as columnar support. The read efficiency is inferior because HBase's cache is less memory-efficient.
  16. The first architecture that we came up with is called Puma2. We run Puma2 on a set of machines and use PTail to provide parallel data streams. For each log line, Puma2 issues “increment” operations to HBase. Note that Puma2 servers are all symmetric, which means the same row in HBase can be incremented by multiple Puma2 servers at the same time. HBase can do a single increment operation on multiple columns of the same row, so we can use a single increment operation in HBase to handle multiple Group-Bys. Puma2 went into production in March 2011 and is handling 600K log lines on 100 boxes (Puma2 + HBase).
  17. Here are the pros and cons of the Puma2 architecture. The good thing about Puma2 is that it is extremely simple and easy to maintain. The root reason is that Puma2 servers are symmetric and almost stateless. The only state is the PTail checkpoint that is saved to HBase periodically. As a result, we can easily add more boxes or reboot a box if it goes down. However, Puma2 also has its problems. First of all, the HBase increment operation is expensive because it's a read-and-write, and the read is expensive. It's also not possible to support aggregations other than counts, because that would need a lot of customized code in HBase. We did a hacky implementation of “most frequent elements” via multiple layers of “frequent element” tables. Finally, Puma2 can have small data duplicates because “increments” and checkpoint writes are not in a single transaction.
  18. We made some small improvements to Puma2. On the Puma2 service side, an obvious idea is to batch the increment requests to reduce the load on HBase. However, it didn't work well because of the long-tail distribution of Group-By keys. It also made data less accurate, because we cannot save checkpoints in the middle of a batch. On the HBase side, we first optimized the “increment” operation by reducing the number of locks. Another big efficiency improvement came from short-circuited reads, from HBase directly to the HDFS block files on disk instead of via the DataNode daemon. We also improved the HBase reliability under high load. All in all, we are still not happy with Puma2, especially when we try to support unique counters. So we switched to a new architecture called Puma3.
  19. The biggest difference between Puma2 and Puma3 is that in Puma3, we do aggregations in the memory of the Puma3 process instead of in HBase. Local memory operations are much faster, so we can achieve a much higher throughput. In order to do in-memory aggregations, we made Puma3 sharded by aggregation key. That means the input PTail data stream has to be sharded as well, which is supported by the application bucketing feature of Calligraphus. Each shard of Puma3 is basically a hashmap in memory. Each entry of the hashmap is a pair of an aggregation key and a user-defined aggregation, which can be count, sum, avg, or anything else. We use HBase as persistent storage but usually don't read from it.
  20. The write workflow for Puma3 is pretty simple. Basically, for each log line, we extract the columns for key and value. We use the key to look up the in-memory hashmap, and call the user-defined aggregation with the value. Note that, since the log streams are sharded by aggregation key, the same aggregation key won't appear in more than one Puma3 process. This is the key to making Puma3 work.
  21. We checkpoint the state of the Puma3 process into HBase every 5 minutes. Basically, we save all the modified hashmap entries as well as the PTail checkpoint. That means if Puma3 crashes and restarts, it can load the state from HBase via a sequential read, which is pretty fast in HBase. In order to save memory, we also get rid of hashmap entries once the time window for the aggregation has passed, because we are not going to receive new log lines for that time window again.
  22. There are 2 choices for the read workflow. If we want to read uncommitted aggregations, which are usually within 10 seconds of latency, we directly serve from the in-memory hashmap. We go to HBase only on a miss, which will only happen if the time window of the aggregation has passed. If we want to read committed data, Puma3 will read from HBase and serve. Note that an uncommitted aggregation result can decrease in value if the Puma3 process dies before making the next checkpoint. We plan to have a cache layer between serving and Puma3 to make sure numbers don't decrease.
  23. Puma3 also supports joining with a static table in HBase. The join key has to be the row key in the static HBase table. It’s implemented as a simple distributed hash lookup in a user-defined function. We have found that local cache improves the throughput of the udf a lot.
  24. Comparing Puma2 and Puma3, we found that Puma3 is much better in write throughput. We only need to use 25% of the boxes to handle the same workload. The main reason is that HBase is really good at write throughput. At the same time, Puma3 needs a lot of memory. Basically, all aggregations that can still change need to be stored in memory, to keep up with the log stream write throughput. Right now we use 60GB of memory per box for the hashmap. In the future, we may use SSDs, which can easily scale to 10x more space per box.
  25. With Puma3, we can easily support these special aggregations, with some approximation. For unique counts, we have implemented a simple adaptive sampling algorithm that samples more aggressively when the number of unique items increases. We can also easily implement the standard Bloom filter for counting. For the most frequent items, we plan to implement the classic lossy counting algorithm and the probabilistic lossy counting algorithm.
  26. The most important feature of Puma that distinguishes it from other stream processing projects is the language. We have built a SQL-like query language that allows us to define the input stream, the output table, as well as the query itself. Note that the query contains user-defined functions for the Join as well as the Aggregations. Puma3 is right now in the pre-production stage. We plan to push it out to production as soon as we have verified all the summaries against Puma2 and Hive.
  27. Here is a list of things we plan to do next. First is simple scheduling for Puma3. We just need very simple scheduling because the workload is continuous. Most likely we will reuse some existing frameworks. Second is mass adoption inside the company. We plan to migrate most daily reporting queries from Hive, as long as the query is simple enough to be supported by Puma. This will reduce latency as well as improve efficiency, because of the savings in compression/decompression. The third one is open source. Right now, the biggest bottleneck is Java Thrift, which has diverged between Facebook and open source. We plan to open-source the projects one by one, starting with Calligraphus.
  28. There are lots of similar systems in academia as well as other companies.
  29. Instead of comparing them one by one, I will end the presentation with a summary of the key differences. Data Freeway is a scalable data stream framework with 9GB/sec throughput and 10 sec latency. It supports both Push/RPC-based and Pull/File System-based channels. We have components to support arbitrary combinations of channels to adapt to the use case. Puma is a reliable stream aggregation engine. It has good support for time-window-based Group By as well as Table-Stream Lookup Join. It has a query language, which makes Puma to realtime MR what Hive is to MR. Puma has no support, and no plan to support, sliding windows and stream joins, because those are very hard problems that we don't see in our environment.