Extracting Insights from Data at Twitter
Prasad Wagle
Technical Lead, Core Data and Metrics, Data Platform
twitter.com/prasadwagle
Jan 26, 2016
● What are the properties of Big Data at Twitter?
● Where do we store it and how do we process it?
● What do we learn from the data?
Overview of the talk
● Velocity: Rate at which data is created
○ 313 million monthly active users. (June 2016)
○ Hundreds of millions of Tweets are sent per day. TPS record:
one-second peak of 143,199 Tweets per second
○ 100 Billion interaction events per day
● Volume: 100s of petabytes of data
● Variety: Tweets, Users, Client events and many more
○ Client event logs use a unified Thrift format for a wide variety of
application events
3Vs of Big Data @Twitter
Data Processing Big Picture
● Production systems
● Real-time: Eventbus, Kafka Streams
● Batch: Hadoop (HDFS, MapReduce)
● Analytics Tools
○ Batch: Scalding, Spark
○ Real-time: Heron
○ Lambda (Batch + Real-time): Summingbird, TSAR
○ Interactive: Presto, Vertica, R
● Analytics Front-ends: Custom Dashboards, Tableau, Apache Zeppelin, Command line tools
● Data Platform: Data Abstraction Layer (DAL), Pipeline Orchestration
● Batch Processing Engine - Hadoop
● Real-time Processing Engine - Heron
● Core Data Libraries - Scalding, Summingbird, TSAR, Parquet
● Data Pipeline - Data Access Layer (DAL), Orchestration
● Interactive SQL - Presto, Vertica
● Data Visualization - Tableau, Apache Zeppelin
● Core Data and Metrics
Data Platform Projects
● Largest Hadoop clusters in the world, some > 10K nodes
● Store 100s of petabytes of data
● More than 100K daily jobs
● Improvements to open source Hadoop software
● hRaven - tool that collects runtime data of Hadoop jobs and lets users
visualize job metrics
○ YARN Timeline Server is next-gen hRaven
● Log pipeline software (Scribe -> HDFS)
○ Scribe is being replaced by Flume
Hadoop
● Heron - a real-time, distributed, fault-tolerant stream processing engine
● Successor to Storm, API compatible with Storm
● Analyze data as it is being produced
● > 400 real-time jobs, 500 billion events processed per day, 25-200 ms latency
● Use cases
○ Real-time impression and engagement counts
○ Real-time trends, recommendations, spam detection
Real-time Processing
● Tools that make it easy to create MapReduce and Heron jobs
● Scalding
○ Scala DSL on top of Cascading
● Summingbird
○ Lambda architecture: real-time and batch
● TSAR: TimeSeries AggregatoR
○ DSL implemented on top of Summingbird
Core Data Libraries
● DAL is a service that simplifies the discovery, usage, and maintainability
of data
● Users work with logical datasets
● A physical dataset describes the serialization of a logical dataset to a
specific location (Hadoop, Vertica) and format
● A logical dataset can simultaneously exist in multiple places
● Users can use logical dataset name to consume data with different
tools like Scalding, Presto
Data Access Layer (DAL)
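The logical/physical split described on this slide can be sketched as a small data model. This is a minimal illustration under stated assumptions, not DAL's actual API: the names `LogicalDataset`, `PhysicalDataset`, `Catalog`, `resolve`, and the example locations are all hypothetical.

```scala
// Hypothetical model of logical vs. physical datasets (not the real DAL API).
final case class PhysicalDataset(
  storage: String,   // storage system, e.g. "hadoop" or "vertica"
  location: String,  // path or table name (made-up values below)
  format: String     // serialization format, e.g. "parquet"
)

final case class LogicalDataset(name: String, physical: List[PhysicalDataset])

object Catalog {
  // One logical dataset that exists in two places at the same time.
  val datasets: Map[String, LogicalDataset] = Map(
    "tweets" -> LogicalDataset(
      "tweets",
      List(
        PhysicalDataset("hadoop", "/example/tweets", "parquet"),
        PhysicalDataset("vertica", "example.tweets", "table")
      )
    )
  )

  // A tool asks for a logical name plus the storage system it can read from.
  def resolve(name: String, storage: String): Option[PhysicalDataset] =
    datasets.get(name).flatMap(_.physical.find(_.storage == storage))
}
```

In this sketch, a batch tool would call `Catalog.resolve("tweets", "hadoop")` while a SQL tool would ask for the `"vertica"` copy; both refer to the data only by its logical name.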
● Eagleeye web application is front-end for end users
● Users discover datasets with Eagleeye
● Eagleeye displays metadata like owners and schema
● Application access to datasets is recorded
● This enables Eagleeye to show dependency graphs for a dataset - jobs that
produce a dataset and jobs that consume it
Data Access Layer (DAL)
Data Discovery
● Statebird service
○ Tracks state of batch jobs
○ Used to manage dependencies
Pipeline Orchestration
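One way to picture state tracking being used for dependency management is a check that a job may start only once all of its upstream jobs have succeeded. The sketch below is an assumption about the general idea, not Statebird's real interface: `JobState`, `Orchestration`, and `canRun` are hypothetical names.

```scala
// Hypothetical job states and dependency check; not Statebird's real API.
sealed trait JobState
case object Pending extends JobState
case object Running extends JobState
case object Succeeded extends JobState
case object Failed extends JobState

object Orchestration {
  // A job may start only when every upstream job has succeeded.
  // Jobs with no recorded dependencies are always allowed to start.
  def canRun(
    job: String,
    deps: Map[String, Set[String]],
    state: Map[String, JobState]
  ): Boolean =
    deps.getOrElse(job, Set.empty[String])
      .forall(d => state.get(d).contains(Succeeded))
}
```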
● Interactive means that results of a query are available in the range of
seconds to a few minutes
● SQL is still the lingua franca for ad hoc data analysis
● Vertica
○ Columnar architecture, high-performance analytic queries
● Presto
○ Queries data stored in HDFS in Parquet format
Interactive SQL
● Custom Dashboards
● Apache Zeppelin Strengths
○ Notebook metaphor - notebook is a collection of notes, each note
is a collection of paragraphs (queries)
○ Web-based report authoring, collaborative like Google Docs
○ Very easy to create a note and then share it
○ > 2K notes, 18K queries
○ Supports JDBC (Presto, Vertica, MySQL)
○ Open source, easy to add new interpreters like Scalding
Data Visualization
● Tableau Strengths
○ Easy to create reports, does not require SQL expertise
○ Built-in analytics functions, e.g. Rank, Percentile
○ Polished visualizations
○ Row-level security
Data Visualization
● Big part of data analysis is data cleansing
● Makes sense to do this once
● Core Data
○ Create pipelines to create “verified” datasets like Users, Tweets,
Interactions
○ Reliable and easy to use
● Core Metrics
○ Create pipelines to compute Twitter’s important metrics
○ DAU, MAU, Tweet Impressions
Core Data and Metrics
Data Processing
● Analytics - Basic Counting
● A/B Testing
● Data Science - Custom analysis
● Data Science - Machine Learning
Data Processing
● Daily/Monthly Active Users
● Number of Tweets, Retweets, Likes
● Tweet Impressions
● Logic is relatively simple
● Challenges: scale and timeliness
○ Results for previous day should be available by 10 am
○ Some metrics are real-time
Basic Counting
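To illustrate why the logic is "relatively simple" compared with the scale problem, here is a minimal in-memory sketch of active-user counting. The `Event` shape and the `ActiveUsers` helper are hypothetical; production pipelines run the same idea over distributed datasets rather than a local `Seq`.

```scala
// Hypothetical event shape; real client events carry many more fields.
final case class Event(userId: Long, day: String)

object ActiveUsers {
  // Daily active users: number of distinct users seen on each day.
  def dau(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.day).map { case (day, es) =>
      day -> es.map(_.userId).distinct.size
    }

  // Active users over a set of days (e.g. a month):
  // distinct users across all of those days, not the sum of daily counts.
  def activeOver(events: Seq[Event], days: Set[String]): Int =
    events.filter(e => days.contains(e.day)).map(_.userId).distinct.size
}
```

Note that a monthly figure cannot be derived by adding daily figures, because the same user is active on many days; the distinct count has to be taken over the whole window.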
● Goal: find the number of impressions and engagements for a tweet
● Real-time
● Used in analytics.twitter.com
Example - Counting Tweet Impressions
aggregate {
  onKeys(
    (TweetId)              // Dimension
  ) produce (
    Count                  // Metric
  ) sinkTo (Manhattan)     // Data Sink
} fromProducer {
  ClientEventSource("client_events")   // Data Source
    .filter { event => isImpressionEvent(event) }
    .map { event =>
      (event.timestamp, ImpressionAttributes(event.tweetId))
    }
}
TSAR job
● TSAR job is converted to a Summingbird job
● Summingbird job creates
○ Real-time pipeline with Heron
○ Batch pipeline with Scalding
● Users access results using TSAR query service
● Write once, run batch and real-time
Example - Counting Tweet Impressions
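The "write once, run batch and real-time" idea rests on the aggregation being mergeable: counts computed over historical data and counts computed over recent events can be added together. The sketch below shows that property with plain Scala collections. It does not use the Summingbird or TSAR APIs, and `ClientEvent`, `ImpressionCounts`, `count`, and `merge` are hypothetical names.

```scala
// Hypothetical event type standing in for a client event.
final case class ClientEvent(tweetId: Long, isImpression: Boolean)

object ImpressionCounts {
  // The counting logic, written once: keep impressions, key by tweet, count.
  def count(events: Seq[ClientEvent]): Map[Long, Long] =
    events.filter(_.isImpression)
      .groupBy(_.tweetId)
      .map { case (id, es) => id -> es.size.toLong }

  // Counts combine by addition, so partial results from the batch and
  // real-time paths can be merged in any order.
  def merge(a: Map[Long, Long], b: Map[Long, Long]): Map[Long, Long] =
    (a.keySet ++ b.keySet)
      .map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L)))
      .toMap
}
```

Here the same `count` function would be applied to the historical events and to the events that arrived since the last batch run, and `merge` would produce the value served to a query.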
● Experimentation is at the heart of Twitter’s product development cycle
● Expertise needed in Statistics and Technology
A/B Testing Framework
● Goal: an informative experiment
● Minimize false positive and false negative errors
● How many users do we need to sample?
● How long should we run the experiment?
A/B Testing Statistics
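The question of how many users to sample can be made concrete with the textbook two-proportion approximation. This is a generic sketch, not Twitter's experimentation framework: the significance level (two-sided 5%) and power (80%) are assumptions hard-coded as normal quantiles, and `SampleSize` and `perBucket` are hypothetical names.

```scala
object SampleSize {
  // Standard normal quantiles, hard-coded as assumptions:
  // two-sided 5% significance and 80% power.
  val zAlpha = 1.96
  val zBeta = 0.8416

  // Approximate users needed per bucket to detect a change in a rate
  // from p1 to p2 (two-proportion normal approximation).
  def perBucket(p1: Double, p2: Double): Long = {
    require(p1 != p2, "rates must differ")
    val variance = p1 * (1 - p1) + p2 * (1 - p2)
    val delta = p2 - p1
    math.ceil(math.pow(zAlpha + zBeta, 2) * variance / (delta * delta)).toLong
  }
}
```

Under these assumptions, detecting a lift from a 10% rate to an 11% rate needs roughly 15,000 users per bucket, and halving the detectable effect roughly quadruples that number, which is why small effects call for larger buckets or longer experiments.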
● Processes 100 billion events daily; compute-intensive
● Metrics computed using Scalding pipeline that combines client event
logs, internal user models, and other datasets.
● Lightweight statistics are computed in a streaming job using TSAR
running on Heron.
A/B Testing Technology
● Cause of spikes and dips in key metrics
● Growth Trends
○ By country, client
● Analysis to understand user behavior
○ Creators vs Consumers
○ Distribution of followers
○ User clusters
● Analysis to inform product feature decisions
Data Science - Custom Analysis
● Recommendations
○ Users: WTF - who to follow
○ Tweets: Algorithmic timeline
● Cortex: deep learning based on the Torch framework
○ Identify NSFW images
○ Recognize what is happening in live feeds
Data Science - Machine Learning
● Product Safety
○ Detect fake accounts
○ Detect tweet spam and abuse
● Ad Targeting
○ Promoted Trends, Accounts and Tweets
○ Show only if it is likely to be interesting and relevant to that user
○ Predict click probability using signals including what a user
chooses to follow, how they interact with a Tweet and what they
retweet
Machine Learning
● Systems (Hadoop, Vertica)
○ Necessary because higher-level abstractions are leaky
● Programming (Scala, Scalding, SQL)
● Math (Statistics, Linear Algebra)
Ideal Talent Stack
[Diagram labels: Systems, Programming, Statistics; Data Engineers, Data Scientists]
Data Platform and Data Science
work hand-in-hand
to extract insights from Big Data at Twitter
Summary
Questions?
● TSAR https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
● DAL https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter
● Heron https://blog.twitter.com/2015/flying-faster-with-twitter-heron
● Heron http://www.slideshare.net/KarthikRamasamy3
● A/B testing https://blog.twitter.com/2015/twitter-experimentation-technical-overview
● A/B testing https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests
● Algorithmic timeline https://support.twitter.com/articles/164083
● Cortex https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/
● Cortex https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/
References

More Related Content

What's hot

Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
Julien Le Dem
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
Timothy Spann
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
Xiang Fu
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
Adam Doyle
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
Databricks
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
Modern Data Stack France
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
Databricks
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine Learning
David Stein
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Machine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflow
Databricks
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
Databricks
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Databricks
 
Graph Data Modeling Best Practices(Eric_Monk).pptx
Graph Data Modeling Best Practices(Eric_Monk).pptxGraph Data Modeling Best Practices(Eric_Monk).pptx
Graph Data Modeling Best Practices(Eric_Monk).pptx
Neo4j
 
Introduction to Apache Heron
Introduction to Apache HeronIntroduction to Apache Heron
Introduction to Apache Heron
Streamlio
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
Databricks
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
Databricks
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
DATAVERSITY
 

What's hot (20)

Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine Learning
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Machine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflow
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
 
Graph Data Modeling Best Practices(Eric_Monk).pptx
Graph Data Modeling Best Practices(Eric_Monk).pptxGraph Data Modeling Best Practices(Eric_Monk).pptx
Graph Data Modeling Best Practices(Eric_Monk).pptx
 
Introduction to Apache Heron
Introduction to Apache HeronIntroduction to Apache Heron
Introduction to Apache Heron
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 

Viewers also liked

Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017
Nitin Kumar
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka Streams
Konrad Malawski
 
Apache Kafka lessons learned @PAYBACK
Apache Kafka lessons learned @PAYBACKApache Kafka lessons learned @PAYBACK
Apache Kafka lessons learned @PAYBACK
Maxim Shelest
 
Getting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics servicesGetting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics services
Vladimir Bychkov
 
Blr hadoop meetup
Blr hadoop meetupBlr hadoop meetup
Blr hadoop meetup
Suneet Grover
 
London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)
Landoop Ltd
 
Storm over gearpump
Storm over gearpumpStorm over gearpump
Storm over gearpump
Tianlun Zhang
 
Kafka connect
Kafka connectKafka connect
Kafka connect
Andrew Stevenson
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabs
Konrad Malawski
 
Processing IoT Data with Apache Kafka
Processing IoT Data with Apache KafkaProcessing IoT Data with Apache Kafka
Processing IoT Data with Apache Kafka
Matthew Howlett
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
Jason Hubbard
 
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
confluent
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
confluent
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
 
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
Kay Lerch
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
Knoldus Inc.
 
Building Kafka-powered Activity Stream
Building Kafka-powered Activity StreamBuilding Kafka-powered Activity Stream
Building Kafka-powered Activity StreamOleksiy Holubyev
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Aldrin Piri
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
Sachin Aggarwal
 

Viewers also liked (20)

Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka Streams
 
Apache Kafka lessons learned @PAYBACK
Apache Kafka lessons learned @PAYBACKApache Kafka lessons learned @PAYBACK
Apache Kafka lessons learned @PAYBACK
 
Getting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics servicesGetting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics services
 
Blr hadoop meetup
Blr hadoop meetupBlr hadoop meetup
Blr hadoop meetup
 
London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)
 
Storm over gearpump
Storm over gearpumpStorm over gearpump
Storm over gearpump
 
Kafka connect
Kafka connectKafka connect
Kafka connect
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabs
 
Processing IoT Data with Apache Kafka
Processing IoT Data with Apache KafkaProcessing IoT Data with Apache Kafka
Processing IoT Data with Apache Kafka
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
 
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
 
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Building Kafka-powered Activity Stream
Building Kafka-powered Activity StreamBuilding Kafka-powered Activity Stream
Building Kafka-powered Activity Stream
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 

Similar to Extracting Insights from Data at Twitter

Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
Li Gao
 
Analytics in Your Enterprise
Analytics in Your EnterpriseAnalytics in Your Enterprise
Analytics in Your Enterprise
WSO2
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
WSO2
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dan Lynn
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
Joachim Draeger
 
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
Data Con LA
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Jaroslav Gergic
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
markgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
Karthik Murugesan
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Databricks
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
Marcin Bielak
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data Platform
VMware Tanzu
 
Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActions
Rob Winters
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
Databricks
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
Dan Lynn
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 

Similar to Extracting Insights from Data at Twitter (20)

Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Analytics in Your Enterprise
Analytics in Your EnterpriseAnalytics in Your Enterprise
Analytics in Your Enterprise
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014



Extracting Insights from Data at Twitter

  • 1. Extracting Insights from Data at Twitter Prasad Wagle Technical Lead, Core Data and Metrics, Data Platform twitter.com/prasadwagle Jan 26, 2016
  • 2. ● What are the properties of Big Data at Twitter? ● Where do we store it and how do we process it? ● What do we learn from the data? Overview of the talk
  • 3. ● Velocity: Rate at which data is created ○ 313 million monthly active users. (June 2016) ○ Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second ○ 100 Billion interaction events per day ● Volume: 100s of petabytes of data ● Variety: Tweets, Users, Client events and many more ○ Client events logs have a unified Thrift format for wide variety of application events 3Vs of Big Data @Twitter
  • 4. Data Processing Big Picture Production systems Batch Scalding Spark Real-time Heron Lambda (Batch + Real-time) Summingbird TSAR Interactive Presto Vertica R Custom Dashboards Tableau Apache Zeppelin Command line tools Batch Hadoop (HDFS MapReduce) Analytics Tools Analytics Front-ends Real-time Eventbus, Kafka Streams Data Abstraction Layer (DAL), Pipeline Orchestration
  • 6. ● Batch Processing Engine - Hadoop ● Real-time Processing Engine - Heron ● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet ● Data Pipeline - Data Access Layer (DAL), Orchestration ● Interactive SQL - Presto, Vertica ● Data Visualization - Tableau, Apache Zeppelin ● Core Data and Metrics Data Platform Projects
  • 7. ● Largest Hadoop clusters in the world, some > 10K nodes ● Store 100s of petabytes of data ● More than 100K daily jobs ● Improvements to open source Hadoop software ● hRaven - tool that collects run-time data of Hadoop jobs and lets users visualize job metrics ○ YARN Timeline Server is next-gen hRaven ● Log pipeline software (Scribe -> HDFS) ○ Scribe is being replaced by Flume Hadoop
  • 8. ● Heron - a real-time, distributed, fault tolerant stream processing engine ● Successor of Storm, API compatible with Storm ● Analyze data as it is being produced ● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency ● Use cases ○ Real-time impression and engagement counts ○ Real-time trends, recommendations, spam detection Real-time Processing
  • 9. ● Tools that make it easy to create MapReduce and Heron jobs ● Scalding ○ Scala DSL on top of Cascading ● Summingbird ○ Lambda architecture: real-time and batch ● Tsar: TimeSeries AggregatoR ○ DSL implemented on top of Summingbird Core Data Libraries
  • 10. ● DAL is a service that simplifies the discovery, usage, and maintainability of data ● Users work with logical datasets ● A physical dataset describes the serialization of a logical dataset to a specific location (Hadoop, Vertica) and format ● A logical dataset can simultaneously exist in multiple places ● Users can use the logical dataset name to consume data with different tools like Scalding, Presto Data Access Layer (DAL)
  • 11. ● Eagleeye web application is the front-end for end users ● Users discover datasets with Eagleeye ● Eagleeye displays metadata like owners and schema ● Application access to datasets is recorded ● Enables Eagleeye to show dependency graphs for a dataset - jobs that produce a dataset and jobs that consume it Data Access Layer (DAL)
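
The logical/physical split described in the two DAL slides can be sketched as a small data model. This is only an illustration under stated assumptions: the names `LogicalDataset`, `PhysicalDataset`, `Catalog`, the tool-to-system mapping, and the example locations are all hypothetical and are not DAL's actual API.

```scala
// Hypothetical sketch of DAL's logical vs. physical datasets; not the real DAL API.
object DalSketch {
  // A physical dataset: one serialization of a logical dataset at a location, in a format.
  final case class PhysicalDataset(system: String, location: String, format: String)

  // A logical dataset: the name users work with, plus every place it currently exists.
  final case class LogicalDataset(name: String, owners: List[String], physical: List[PhysicalDataset])

  // Assumed mapping from a tool to the storage system it reads from.
  val toolToSystem: Map[String, String] =
    Map("scalding" -> "hadoop", "presto" -> "hadoop", "vertica-sql" -> "vertica")

  final class Catalog(datasets: List[LogicalDataset]) {
    private val byName: Map[String, LogicalDataset] = datasets.map(d => d.name -> d).toMap

    // Resolve a logical name to the physical copy a given tool can read, if any.
    def resolve(name: String, tool: String): Option[PhysicalDataset] =
      for {
        ds     <- byName.get(name)
        system <- toolToSystem.get(tool)
        phys   <- ds.physical.find(_.system == system)
      } yield phys
  }

  // Example catalog with one logical dataset stored in two places (made-up locations).
  val example = new Catalog(List(
    LogicalDataset("users", List("core-data"), List(
      PhysicalDataset("hadoop", "/example/users", "parquet"),
      PhysicalDataset("vertica", "example.users", "table")))))
}
```

A consumer only names the logical dataset and the tool; which physical copy gets read is decided by the catalog, which is what lets one dataset live in several places at once.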
  • 15. ● Statebird service ○ Tracks state of batch jobs ○ Used to manage dependencies Pipeline Orchestration
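
One way to picture how tracked job state can drive dependency management is the sketch below. It is an assumption-laden illustration, not Statebird's real interface: `StateStore`, `RunId`, the three states, and the "all upstream jobs succeeded for the same date" rule are invented for the example.

```scala
// Hypothetical sketch of state-based dependency checks; not Statebird's real API.
object StateSketch {
  sealed trait JobState
  case object Running   extends JobState
  case object Succeeded extends JobState
  case object Failed    extends JobState

  // A batch run is identified by the job name and the data date it covers.
  final case class RunId(job: String, date: String)

  final class StateStore {
    private var states = Map.empty[RunId, JobState]

    // Record (or overwrite) the latest state of a run.
    def record(run: RunId, state: JobState): Unit = states = states.updated(run, state)

    def stateOf(run: RunId): Option[JobState] = states.get(run)

    // A job may start for a date only when every upstream job succeeded for that date.
    def canStart(upstream: List[String], date: String): Boolean =
      upstream.forall(job => stateOf(RunId(job, date)).contains(Succeeded))
  }
}
```

A scheduler built on this idea would poll `canStart` before launching a downstream job and call `record` as each run changes state.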
  • 16. ● Interactive means that results of a query are available in the range of seconds to a few minutes ● SQL is still the lingua franca for ad hoc data analysis ● Vertica ○ Columnar architecture, high performance analytics queries ● Presto ○ Data in HDFS in Parquet format Interactive SQL
  • 17. ● Custom Dashboards ● Apache Zeppelin Strengths ○ Notebook metaphor - notebook is a collection of notes, each note is a collection of paragraphs (queries) ○ Web based report authoring, collaborative like Google docs ○ Very easy to create a note and then share it ○ > 2K notes, 18K queries ○ Supports JDBC (Presto, Vertica, MySQL) ○ Open source, Easy to add new interpreters like Scalding Data Visualization
  • 18. ● Tableau Strengths ○ Easy to create reports, does not require SQL expertise ○ Built in analytics functions e.g. Rank, Percentile ○ Polished visualizations ○ Row level security Data Visualization
  • 19. ● Big part of data analysis is data cleansing ● Makes sense to do this once ● Core Data ○ Create pipelines to create “verified” datasets like Users, Tweets, Interactions ○ Reliable and easy to use ● Core Metrics ○ Create pipelines to compute Twitter’s important metrics ○ DAU, MAU, Tweet Impressions Core Data and Metrics
  • 21. ● Analytics - Basic Counting ● A/B Testing ● Data Science - Custom analysis ● Data Science - Machine Learning Data Processing
  • 22. ● Daily/Monthly Active Users ● Number of Tweets, Retweets, Likes ● Tweet Impressions ● Logic is relatively simple ● Challenges: scale and timeliness ○ Results for previous day should be available by 10 am ○ Some metrics are real-time Basic Counting
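
The "relatively simple" logic can be shown on an in-memory collection. This is a minimal sketch, assuming a hypothetical `Event(userId, date)` record with ISO dates and treating a month as a calendar month; the production definitions and the distributed pipelines that meet the scale and timeliness requirements are not shown in the talk.

```scala
// Minimal in-memory sketch of daily and monthly active user counts.
// The Event shape is an assumption; dates are ISO strings (yyyy-MM-dd).
object ActiveUsers {
  final case class Event(userId: Long, date: String)

  // DAU: number of distinct users seen on each day.
  def dau(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.date).map { case (day, es) => day -> es.map(_.userId).distinct.size }

  // MAU: number of distinct users seen in each calendar month (yyyy-MM prefix).
  def mau(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.date.take(7)).map { case (month, es) => month -> es.map(_.userId).distinct.size }
}
```

The counting itself is a group-by plus a distinct; the hard part named on the slide is doing this over the full event volume in time for the morning deadline.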
  • 23. ● Goal: find the number of impressions and engagements for a tweet ● Real-time ● Used in analytics.twitter.com Example - Counting Tweet Impressions
  • 24. TSAR job (slide callouts: Dimension, Metric, Data Sink, Data Source)
        aggregate {
          onKeys( (TweetId) ) produce ( Count ) sinkTo (Manhattan)
        } fromProducer {
          ClientEventSource("client_events")
            .filter { event => isImpressionEvent(event) }
            .map { event => (event.timestamp, ImpressionAttributes(event.tweetId)) }
        }
  • 25. ● TSAR job is converted to a Summingbird job ● Summingbird job creates ○ Real-time pipeline with Heron ○ Batch pipeline with Scalding ● Users access results using TSAR query service ● Write once, run batch and real-time Example - Counting Tweet Impressions
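
The "write once, run batch and real-time" idea can be illustrated with plain Scala collections. This sketch does not use the real TSAR or Summingbird APIs: `ClientEvent`, its fields, and the `"impression"` event type are assumptions. It only shows why one counting function can serve both pipelines, since per-key counts can be merged by addition.

```scala
// Plain-Scala sketch of the impression count; ClientEvent and its fields are assumed,
// and nothing here uses the real TSAR or Summingbird APIs.
object ImpressionCounts {
  final case class ClientEvent(tweetId: Long, timestamp: Long, eventType: String)

  def isImpressionEvent(e: ClientEvent): Boolean = e.eventType == "impression"

  // The counting logic, written once: filter impressions, key by tweet, count.
  def count(events: Seq[ClientEvent]): Map[Long, Long] =
    events.filter(isImpressionEvent)
      .groupBy(_.tweetId)
      .map { case (id, es) => id -> es.size.toLong }

  // Counts combine by addition, so partial results can be merged per key.
  def merge(a: Map[Long, Long], b: Map[Long, Long]): Map[Long, Long] =
    b.foldLeft(a) { case (acc, (id, n)) => acc.updated(id, acc.getOrElse(id, 0L) + n) }

  // Serving view: batch result over older events plus real-time result over recent ones.
  def total(batch: Seq[ClientEvent], realtime: Seq[ClientEvent]): Map[Long, Long] =
    merge(count(batch), count(realtime))
}
```

Because `merge` is associative, the batch and real-time halves can be computed independently and combined at query time, which is the property a lambda-style setup relies on.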
  • 26. ● Experimentation is at the heart of Twitter’s product development cycle ● Expertise needed in Statistics and Technology A/B Testing Framework
  • 27. ● Goal: an informative experiment ● Minimize false positive and false negative errors ● How many users do we need to sample? ● How long should we run the experiment? A/B Testing Statistics
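
For the "how many users" question, a common textbook estimate for comparing the means of two equal-sized buckets is n = 2 (z_alpha + z_beta)^2 sigma^2 / delta^2 per bucket. The sketch below implements that formula; it is not taken from the talk, and the hard-coded z-values assume a two-sided significance level of 0.05 and a power of 0.80. The object and parameter names are illustrative.

```scala
// Textbook sample-size estimate for comparing two means with equal-sized buckets.
// Not from the talk; z-values are fixed for alpha = 0.05 (two-sided) and power = 0.80.
object SampleSize {
  val zAlpha = 1.959964 // standard normal quantile at 1 - 0.05 / 2
  val zBeta  = 0.841621 // standard normal quantile at 0.80

  // sigma: standard deviation of the metric; delta: minimal effect worth detecting.
  def perBucket(sigma: Double, delta: Double): Long = {
    require(sigma > 0 && delta > 0, "sigma and delta must be positive")
    val n = 2.0 * math.pow(zAlpha + zBeta, 2) * sigma * sigma / (delta * delta)
    math.ceil(n).toLong
  }
}
```

With `sigma = 1.0` and `delta = 0.1` this gives 1570 users per bucket. Halving the detectable effect roughly quadruples the requirement, which is why small effects force larger samples or longer experiments.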
  • 28. ● Process 100 B events daily, compute intensive. ● Metrics computed using Scalding pipeline that combines client event logs, internal user models, and other datasets. ● Lightweight statistics are computed in a streaming job using TSAR running on Heron. A/B Testing Technology
  • 29. ● Cause of spikes and dips in key metrics ● Growth Trends ○ By country, client ● Analysis to understand user behavior ○ Creators vs Consumers ○ Distribution of followers ○ User clusters ● Analysis to inform product feature decisions Data Science - Custom Analysis
  • 30. ● Recommendations ○ Users: WTF - who to follow ○ Tweets: Algorithmic timeline ● Cortex, Deep learning based on Torch framework ○ Identify NSFW images ○ Recognize what is happening in live feeds Data Science - Machine Learning
  • 31. ● Product Safety ○ Detect fake accounts ○ Detect tweet spam and abuse ● Ad Targeting ○ Promoted Trends, Accounts and Tweets ○ Show only if it is likely to be interesting and relevant to that user ○ Predict click probability using signals including what a user chooses to follow, how they interact with a Tweet and what they retweet Machine Learning
  • 32. ● Systems (Hadoop, Vertica) ○ Necessary because higher-level abstractions are leaky ● Programming (Scala, Scalding, SQL) ● Math (Statistics, Linear Algebra) Ideal Talent Stack Systems Programming Statistics Data Engineers Data Scientists
  • 33. Data Platform and Data Science work hand-in-hand to extract insights from Big Data at Twitter Summary
  • 35. ● TSAR https://blog.twitter.com/2014/tsar-a-timeseries-aggregator ● DAL https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter ● Heron https://blog.twitter.com/2015/flying-faster-with-twitter-heron ● Heron http://www.slideshare.net/KarthikRamasamy3 ● A/B testing https://blog.twitter.com/2015/twitter-experimentation-technical-overview ● A/B testing https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests ● Algorithmic timeline: https://support.twitter.com/articles/164083 ● Cortex https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/ ● Cortex https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/ References