SlideShare a Scribd company logo
1 of 71
Big Data at Twitter
#chirpdata

     Kevin Weil
     @kevinweil

     Twitter Analytics
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
•   10,000 CDs
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
•   10,000 CDs
•   5 million floppy disks
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
•   10,000 CDs
•   5 million floppy disks
•   225 GB while I give this talk
Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
• Resources
  overwhelmed
• Lost data
Scribe
• Surprise! FB had same problem, built
and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest
Scribe
                           FE   FE   FE
• Runs locally; reliable
in network outage
Scribe
                           FE         FE     FE
• Runs locally; reliable
in network outage
• Nodes only know
downstream writer;              Agg        Agg
hierarchical, scalable
Scribe
                              FE         FE     FE
• Runs locally; reliable
in network outage
• Nodes only know
downstream writer;                 Agg        Agg
hierarchical, scalable
• Pluggable outputs
                       File          HDFS
Scribe at Twitter
• Solved our problem, opened new
vistas
• Currently 30 different categories
logged from javascript, RoR, Scala, etc
• We improved logging, monitoring,
writing to Hadoop, compression
Scribe at Twitter
 • Continuing to work with FB
 • GSoC project! Help make it more
 awesome.


• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
• Uh oh.
Where do I put 7TB/day?
• Need a cluster of
machines
Where do I put 7TB/day?
• Need a cluster of
machines

• ... which adds new
layers of complexity
Hadoop
• Distributed file system
   • Automatic replication, fault
   tolerance
• MapReduce-based parallel computation
   • Key-value based computation
   interface allows for wide applicability
Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4000 node cluster
• Powerful: sorted 1TB random integers
in 62 seconds

• Easy packaging: free Cloudera RPMs
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Two Analysis Challenges
1. Compute friendships in Twitter’s social
graph
    • grep, awk? No way.
    • Data is in MySQL... self join on an n-
    billion row table?
    • n,000,000,000 x n,000,000,000 = ?
Two Analysis Challenges
1. Compute friendships in Twitter’s social
graph
    • grep, awk? No way.
    • Data is in MySQL... self join on an n-
    billion row table?
    • n,000,000,000 x n,000,000,000 = ?
    • I don’t know either.
Two Analysis Challenges
2. Large-scale grouping and counting
   • select count(*) from users? maybe.
   • select count(*) from tweets? uh...
   • Imagine joining them.
   • And grouping.
   • And sorting.
Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the
calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the
calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
Analysis at Scale
• Now we’re rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble
social graph aggregates
• Run pagerank across users to calculate
reputations
But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Pig
• High level language
• Transformations on
sets of records
• Process data one
step at a time
• Easier than SQL?
Why Pig?
 Because I bet you can read
 the following script
A Real Pig Script




• Just for fun... the same calculation in Java
No, Seriously.
Pig Makes it Easy
• 5% of the code
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes
innovation and iteration
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes
innovation and iteration
• More minds contributing = more value
from your data
Counting Big Data
• How many requests per day?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
Correlating Big Data
• Usage difference for mobile users?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
Research on Big Data
• What can we tell from a user’s tweets?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-
bird
• Realtime analytics with Cassandra
Questions?
        Follow me at
        twitter.com/kevinweil

More Related Content

Viewers also liked

Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011ZoeMM
 
The Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalThe Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalRushin Naik
 
DataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestDataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestOllie Parsley
 
Tweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningTweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningSngular Meaning
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Hortonworks
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongFastly
 
Demo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designDemo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designChristine Outram
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
Twitter as a data mining source
Twitter  as  a data mining sourceTwitter  as  a data mining source
Twitter as a data mining sourceAtaxo Group
 
Social media data for Social science research
Social media data for Social science researchSocial media data for Social science research
Social media data for Social science researchDavide Bennato
 
Data Mining on Twitter
Data Mining on TwitterData Mining on Twitter
Data Mining on TwitterPulkit Goyal
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingRapheephan Thongkham-Uan
 

Viewers also liked (20)

Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011
 
The Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalThe Asset Consultancy_PPT _final
The Asset Consultancy_PPT _final
 
DataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestDataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - Devnest
 
Tweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningTweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion mining
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
Demo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designDemo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product design
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
Twitter as a data mining source
Twitter  as  a data mining sourceTwitter  as  a data mining source
Twitter as a data mining source
 
Social media data for Social science research
Social media data for Social science researchSocial media data for Social science research
Social media data for Social science research
 
PPT FOR BIG
PPT FOR BIGPPT FOR BIG
PPT FOR BIG
 
Data Mining on Twitter
Data Mining on TwitterData Mining on Twitter
Data Mining on Twitter
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
 

Similar to Big Data at Twitter, Chirp 2010

Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Codemotion
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examplesYoshitomo Matsubara
 
Multi-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphXMulti-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphXQingbo Hu
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Databricks
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Webmaria.grineva
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorterManchor Ko
 
Infrastructure for cloud_computing
Infrastructure for cloud_computingInfrastructure for cloud_computing
Infrastructure for cloud_computingJULIO GONZALEZ SANZ
 

Similar to Big Data at Twitter, Chirp 2010 (20)

Enar short course
Enar short courseEnar short course
Enar short course
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Multi-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphXMulti-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphX
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
IOE MODULE 6.pptx
IOE MODULE 6.pptxIOE MODULE 6.pptx
IOE MODULE 6.pptx
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Scaling
ScalingScaling
Scaling
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorter
 
Infrastructure for cloud_computing
Infrastructure for cloud_computingInfrastructure for cloud_computing
Infrastructure for cloud_computing
 

Recently uploaded

Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 

Recently uploaded (20)

Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 

Big Data at Twitter, Chirp 2010

  • 1.
  • 2. Big Data at Twitter #chirpdata Kevin Weil @kevinweil Twitter Analytics
  • 3. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 4. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 5. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess?
  • 6. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr)
  • 7. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs
  • 8. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks
  • 9. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks • 225 GB while I give this talk
  • 10. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale
  • 11. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale • Resources overwhelmed • Lost data
  • 12. Scribe • Surprise! FB had same problem, built and open-sourced Scribe • Log collection framework over Thrift • You write log lines, with categories • It does the rest
  • 13. Scribe FE FE FE • Runs locally; reliable in network outage
  • 14. Scribe FE FE FE • Runs locally; reliable in network outage • Nodes only know downstream writer; Agg Agg hierarchical, scalable
  • 15. Scribe FE FE FE • Runs locally; reliable in network outage • Nodes only know downstream writer; Agg Agg hierarchical, scalable • Pluggable outputs File HDFS
  • 16. Scribe at Twitter • Solved our problem, opened new vistas • Currently 30 different categories logged from javascript, RoR, Scala, etc • We improved logging, monitoring, writing to Hadoop, compression
  • 17. Scribe at Twitter • Continuing to work with FB • GSoC project! Help make it more awesome. • http://github.com/traviscrawford/scribe • http://wiki.developers.facebook.com/index.php/User:GSoC
  • 18. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 19. How do you store 7TB/day? • Single machine? • What’s HD write speed?
  • 20. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s
  • 21. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB
  • 22. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB • Uh oh.
  • 23. Where do I put 7TB/day? • Need a cluster of machines
  • 24. Where do I put 7TB/day? • Need a cluster of machines • ... which adds new layers of complexity
  • 25. Hadoop • Distributed file system • Automatic replication, fault tolerance • MapReduce-based parallel computation • Key-value based computation interface allows for wide applicability
  • 26. Hadoop • Open source: top-level Apache project • Scalable: Y! has a 4000 node cluster • Powerful: sorted 1TB random integers in 62 seconds • Easy packaging: free Cloudera RPMs
  • 27. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 28. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 29. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 30. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 31. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 32. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 33. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 34. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n- billion row table? • n,000,000,000 x n,000,000,000 = ?
  • 35. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n- billion row table? • n,000,000,000 x n,000,000,000 = ? • I don’t know either.
  • 36. Two Analysis Challenges 2. Large-scale grouping and counting • select count(*) from users? maybe. • select count(*) from tweets? uh... • Imagine joining them. • And grouping. • And sorting.
  • 37. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment
  • 38. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment • But a fun one!
  • 39. Analysis at Scale • Now we’re rolling • Count all tweets: 12 billion, 5 minutes • Hit FlockDB in parallel to assemble social graph aggregates • Run pagerank across users to calculate reputations
  • 40. But... • Analysis typically in Java • Single-input, two-stage data flow is rigid • Projections, filters: custom code • Joins lengthy, error-prone • n-stage jobs: hard to manage • Exploration requires compilation
  • 41. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 42. Pig • High level language • Transformations on sets of records • Process data one step at a time • Easier than SQL?
  • 43. Why Pig? Because I bet you can read the following script
  • 44. A Real Pig Script • Just for fun... the same calculation in Java
  • 46. Pig Makes it Easy • 5% of the code
  • 47. Pig Makes it Easy • 5% of the code • 5% of the dev time
  • 48. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time
  • 49. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time • Readable, reusable
  • 50. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions
  • 51. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration
  • 52. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration • More minds contributing = more value from your data
  • 53. Counting Big Data • How many requests per day?
  • 54. Counting Big Data • How many requests per day? • Average latency? 95% latency?
  • 55. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour?
  • 56. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day?
  • 57. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries?
  • 58. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries? • Geographic distribution of queries?
  • 59. Correlating Big Data • Usage difference for mobile users?
  • 60. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients?
  • 61. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses
  • 62. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked?
  • 63. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked? • What do successful users use often?
  • 64. Research on Big Data • What can we tell from a user’s tweets?
  • 65. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers?
  • 66. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow?
  • 67. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth?
  • 68. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection
  • 69. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection • Machine learning
  • 70. If We Had More Time... • HBase backing namesearch • LZO compression • Protocol Buffers and Hadoop • Our open source: hadoop-lzo, elephant- bird • Realtime analytics with Cassandra
  • 71. Questions? Follow me at twitter.com/kevinweil

Editor's Notes