SlideShare a Scribd company logo
1 of 58
Hadoop, Pig, and Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter




                           TM
Introduction
‣   Hadoop Overview
‣   Why Pig?
‣   Evolution of Data Processing at Twitter
‣   Pig for Counting
‣   Pig for Correlating
‣   Pig for Research and Data Mining
‣   Conclusions and Next Steps
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms,
    GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social
    graph analysis, ??? of data
Introduction
‣   Hadoop Overview
‣   Why Pig?
‣   Evolution of Data Processing at Twitter
‣   Pig for Counting
‣   Pig for Correlating
‣   Pig for Research and Data Mining
‣   Conclusions and Next Steps
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB
    compressed/day
‣   CERN/LHC: 40 TB/day (15
    PB/year!)
‣   And growth is accelerating
‣   Need multiple machines,
    horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant, handles replication, node failure, etc
‣   MapReduce-based parallel computation
    (even harder to process a PB)
‣   Generic key-value based computation interface
    allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Y! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close to
                 2x faster.
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close to
                 2x faster.
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close to
                 2x faster.
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close to
                 2x faster.
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close to
                 2x faster.
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close to
                 2x faster.
MapReduce?
             ‣   Challenge: how many tweets per
                 user, given tweets table?
             ‣   Input: key=row, value=tweet info
             ‣   Map: output key=user_id, value=1
             ‣   Shuffle: sort by user_id
             ‣   Reduce: for each user_id, sum
             ‣   Output: user_id, tweet count
             ‣   With 2x machines, runs close
                 to 2x faster.
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires             ‣   analytics in Eclipse?
    compilation                                      ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-scale Data
Analysis
‣   The Pig version is:
‣     5% of the code
‣     5% of the time
‣     Within 50% of the execution time.
‣   Innovation increasingly driven from large-scale data analysis
‣     Need fast iteration to understand the right questions
‣     More minds contributing = more value from your data
Introduction
‣   Hadoop Overview
‣   Why Pig?
‣   Evolution of Data Processing at Twitter
‣   Pig for Counting
‣   Pig for Correlating
‣   Pig for Research and Data Mining
‣   Conclusions and Next Steps
MySQL, MySQL, MySQL
‣   We all start there.
‣   But MySQL is not built for analysis.
‣   select count(*) from users? Maybe.
‣   select count(*) from tweets? Uh...
‣   Imagine joining them.
‣   And grouping.
‣   Then sorting.
The Hadoop Ecosystem at Twitter
‣   Cloudera’s free distribution, running Hadoop 0.20.1
‣   Heavily modified Facebook Scribe for log collection -> HDFS*
‣   Heavily modified LZO code for fast, splittable data compression**
‣   Data stored either as LZO-compressed flat files (logs, etc) or
    serialized, LZO-compressed protocol buffers (structured data).
‣   Custom InputFormats, Pig LoadFuncs for the above*
‣   Some Java-based MapReduce, some Hadoop Streaming
‣   Most analysis, and most interesting analyses, done in Pig.
‣   * Open sourced, or on the way. Please come talk afterwards if you’re interested.
‣   ** http://www.github.com/kevinweil/hadoop-lzo
Data?
        ‣   Semi-structured: apache
            logs, search logs, RoR logs,
            mysql query logs, rate limiter
            logs, per-application logs
        ‣   Structured: tweets, users,
            block notifications, phones,
            favorites, saved searches,
            retweets, authentications, sms
            usage, third party clients,
            followings
        ‣   Entangled: the social graph
Introduction
‣   Hadoop Overview
‣   Why Pig?
‣   Evolution of Data Processing at Twitter
‣   Pig for Counting
‣   Pig for Correlating
‣   Pig for Research and Data Mining
‣   Conclusions and Next Steps
Counting Big Data
‣               standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣
Counting Big Data
‣               standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣
Counting Big Data
‣               standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣
Counting Big Data
‣               standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣
Counting Big Data
‣               standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣
Counting Big Data
‣                standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
Counting Big Data
‣   Where are users querying from? The API, the front page, their
    profile page, etc?
‣
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
Correlating Big Data
‣               probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
Correlating Big Data
‣                 probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
Correlating Big Data
‣   What is the correlation between users with registered phones
    and users that tweet?
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣          prediction, graph analysis, natural language
‣   Sentiment analysis
Research on Big Data
‣          prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
Research on Big Data
‣          prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
Research on Big Data
‣          prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
Research on Big Data
‣          prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
Research on Big Data
‣          prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
Research on Big Data
‣            prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
Research on Big Data
‣   How well can we detect bots and other non-human tweeters?
Introduction
‣   Hadoop Overview
‣   Why Pig?
‣   Evolution of Data Processing at Twitter
‣   Pig for Counting
‣   Pig for Correlating
‣   Pig for Research and Data Mining
‣   Conclusions and Next Steps
Why Hadoop?
‣   Data is growing rapidly; need horizontally scalable computation
‣   Fault tolerant; gracefully handles machine failure


‣   ... but writing MapReduce jobs in Java is harder than it should be.
Why Pig?
‣   Pig makes Hadoop accessible
‣   Pig chains together complex job flows
‣   User-defined functions are first class citizens
‣   Vibrant OS community, dedicated team at Y! improving it daily
‣   At Twitter, Pig helps us understand our business faster.
Questions?                                           Follow me at
                                                            twitter.com/kevinweil




‣   If this sounded interesting to you -- that’s because it is. And we’re hiring.

                                                                         TM

More Related Content

What's hot

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterKevin Weil
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
如何快速实现数据编织架构
如何快速实现数据编织架构如何快速实现数据编织架构
如何快速实现数据编织架构Denodo
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 

What's hot (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Apache hive
Apache hiveApache hive
Apache hive
 
如何快速实现数据编织架构
如何快速实现数据编织架构如何快速实现数据编织架构
如何快速实现数据编织架构
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 

Viewers also liked

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 

Viewers also liked (17)

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 

Similar to Hadoop, Pig, and Twitter (NoSQL East 2009)

Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Kevin Weil
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)Kevin Weil
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Kevin Weil
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Kevin Weil
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02matrixvn
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in PracticeC4Media
 
Apache Druid Vision and Roadmap
Apache Druid Vision and RoadmapApache Druid Vision and Roadmap
Apache Druid Vision and RoadmapImply
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et Rpkernevez
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Omid Vahdaty
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyData
 
Big data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-contentBig data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-contentTraining Institute
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 

Similar to Hadoop, Pig, and Twitter (NoSQL East 2009) (20)

Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
 
Apache Druid Vision and Roadmap
Apache Druid Vision and RoadmapApache Druid Vision and Roadmap
Apache Druid Vision and Roadmap
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et R
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Streams
StreamsStreams
Streams
 
Big data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-contentBig data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-content
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Observability at Spotify
Observability at SpotifyObservability at Spotify
Observability at Spotify
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Hadoop, Pig, and Twitter (NoSQL East 2009)

  • 1. Hadoop, Pig, and Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter TM
  • 2. Introduction ‣ Hadoop Overview ‣ Why Pig? ‣ Evolution of Data Processing at Twitter ‣ Pig for Counting ‣ Pig for Correlating ‣ Pig for Research and Data Mining ‣ Conclusions and Next Steps
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, ??? of data
  • 4. Introduction ‣ Hadoop Overview ‣ Why Pig? ‣ Evolution of Data Processing at Twitter ‣ Pig for Counting ‣ Pig for Correlating ‣ Pig for Research and Data Mining ‣ Conclusions and Next Steps
  • 5. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  • 6. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  • 7. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 8. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 9. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 10. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 11. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 12. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 13. MapReduce? ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 14. But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: Hard to manage ‣ Prototyping/exploration requires ‣ analytics in Eclipse? compilation ur doin it wrong...
  • 15. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 16. Why Pig? ‣ Because I bet you can read the following script.
  • 17. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 19. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time ‣ Within 50% of the execution time. ‣ Innovation increasingly driven from large-scale data analysis ‣ Need fast iteration to understand the right questions ‣ More minds contributing = more value from your data
  • 20. Introduction ‣ Hadoop Overview ‣ Why Pig? ‣ Evolution of Data Processing at Twitter ‣ Pig for Counting ‣ Pig for Correlating ‣ Pig for Research and Data Mining ‣ Conclusions and Next Steps
  • 21. MySQL, MySQL, MySQL ‣ We all start there. ‣ But MySQL is not built for analysis. ‣ select count(*) from users? Maybe. ‣ select count(*) from tweets? Uh... ‣ Imagine joining them. ‣ And grouping. ‣ Then sorting.
  • 22. The Hadoop Ecosystem at Twitter ‣ Cloudera’s free distribution, running Hadoop 0.20.1 ‣ Heavily modified Facebook Scribe for log collection -> HDFS* ‣ Heavily modified LZO code for fast, splittable data compression** ‣ Data stored either as LZO-compressed flat files (logs, etc) or serialized, LZO-compressed protocol buffers (structured data). ‣ Custom InputFormats, Pig LoadFuncs for the above* ‣ Some Java-based MapReduce, some Hadoop Streaming ‣ Most analysis, and most interesting analyses, done in Pig. ‣ * Open sourced, or on the way. Please come talk afterwards if you’re interested. ‣ ** http://www.github.com/kevinweil/hadoop-lzo
  • 23. Data? ‣ Semi-structured: apache logs, search logs, RoR logs, mysql query logs, rate limiter logs, per-application logs ‣ Structured: tweets, users, block notifications, phones, favorites, saved searches, retweets, authentications, sms usage, third party clients, followings ‣ Entangled: the social graph
  • 24. Introduction ‣ Hadoop Overview ‣ Why Pig? ‣ Evolution of Data Processing at Twitter ‣ Pig for Counting ‣ Pig for Correlating ‣ Pig for Research and Data Mining ‣ Conclusions and Next Steps
  • 25. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣
  • 26. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣
  • 27. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣
  • 28. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣
  • 29. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣
  • 30. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
  • 31. Counting Big Data ‣ Where are users querying from? The API, the front page, their profile page, etc? ‣
  • 32. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users?
  • 33. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients?
  • 34. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses
  • 35. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time?
  • 36. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked?
  • 37. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often?
  • 38. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions
  • 39. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
  • 40. Correlating Big Data ‣ What is the correlation between users with registered phones and users that tweet?
  • 41. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets?
  • 42. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow?
  • 43. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers?
  • 44. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following?
  • 45. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks?
  • 46. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
  • 47. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis
  • 48. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted?
  • 49. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree?
  • 50. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection
  • 51. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning
  • 52. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection
  • 53. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
  • 54. Research on Big Data ‣ How well can we detect bots and other non-human tweeters?
  • 55. Introduction ‣ Hadoop Overview ‣ Why Pig? ‣ Evolution of Data Processing at Twitter ‣ Pig for Counting ‣ Pig for Correlating ‣ Pig for Research and Data Mining ‣ Conclusions and Next Steps
  • 56. Why Hadoop? ‣ Data is growing rapidly; need horizontally scalable computation ‣ Fault tolerant; gracefully handles machine failure ‣ ... but writing MapReduce jobs in Java is harder than it should be.
  • 57. Why Pig? ‣ Pig makes Hadoop accessible ‣ Pig chains together complex job flows ‣ User-defined functions are first class citizens ‣ Vibrant OS community, dedicated team at Y! improving it daily ‣ At Twitter, Pig helps us understand our business faster.
  • 58. Questions? Follow me at twitter.com/kevinweil ‣ If this sounded interesting to you -- that’s because it is. And we’re hiring. TM