Extracting Insights from Data at Twitter

  1. Extracting Insights from Data at Twitter
     Prasad Wagle, Technical Lead, Core Data and Metrics, Data Platform
     twitter.com/prasadwagle
     Jan 26, 2016
  2. Overview of the talk
     ● What are the properties of Big Data at Twitter?
     ● Where do we store it and how do we process it?
     ● What do we learn from the data?
  3. 3Vs of Big Data @Twitter
     ● Velocity: rate at which data is created
       ○ 313 million monthly active users (June 2016)
       ○ Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second
       ○ 100 billion interaction events per day
     ● Volume: 100s of petabytes of data
     ● Variety: Tweets, Users, Client events and many more
       ○ Client event logs have a unified Thrift format for a wide variety of application events
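To put the daily totals above on a per-second scale, here is a small arithmetic sketch in plain Scala. The object and method names are illustrative and do not come from the talk; the result is a simple daily average, not a measured rate.

```scala
object Velocity {
  // Number of seconds in one day: 24 * 60 * 60 = 86,400.
  val SecondsPerDay: Long = 24L * 60 * 60

  /** Average number of events per second for a given daily total. */
  def averagePerSecond(eventsPerDay: Double): Double =
    eventsPerDay / SecondsPerDay
}
```

For the 100 billion interaction events per day quoted above, `Velocity.averagePerSecond(100e9)` comes to roughly 1.16 million events per second on average. Peaks, such as the Tweets-per-second record, can be far above the daily average.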
  4. Data Processing Big Picture
     ● Production systems
     ● Batch: Hadoop (HDFS, MapReduce)
     ● Real-time: Eventbus, Kafka Streams
     ● Data Access Layer (DAL), Pipeline Orchestration
     ● Analytics Tools
       ○ Batch: Scalding, Spark
       ○ Real-time: Heron
       ○ Lambda (Batch + Real-time): Summingbird, TSAR
       ○ Interactive: Presto, Vertica, R
     ● Analytics Front-ends: Custom Dashboards, Tableau, Apache Zeppelin, Command line tools
  5. Data Platform
  6. Data Platform Projects
     ● Batch Processing Engine - Hadoop
     ● Real-time Processing Engine - Heron
     ● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet
     ● Data Pipeline - Data Access Layer (DAL), Orchestration
     ● Interactive SQL - Presto, Vertica
     ● Data Visualization - Tableau, Apache Zeppelin
     ● Core Data and Metrics
  7. Hadoop
     ● Largest Hadoop clusters in the world, some > 10K nodes
     ● Store 100s of petabytes of data
     ● More than 100K daily jobs
     ● Improvements to open source Hadoop software
     ● hRaven - tool that collects run-time data of Hadoop jobs and lets users visualize job metrics
       ○ YARN Timeline Server is next-gen hRaven
     ● Log pipeline software (Scribe -> HDFS)
       ○ Scribe is being replaced by Flume
  8. Real-time Processing
     ● Heron - a real-time, distributed, fault-tolerant stream processing engine
     ● Successor of Storm, API compatible with Storm
     ● Analyze data as it is being produced
     ● > 400 real-time jobs, 500 billion events / day processed, 25 - 200 ms latency
     ● Use cases
       ○ Real-time impression and engagement counts
       ○ Real-time trends, recommendations, spam detection
  9. Core Data Libraries
     ● Tools that make it easy to create MapReduce and Heron jobs
     ● Scalding
       ○ Scala DSL on top of Cascading
     ● Summingbird
       ○ Lambda architecture: real-time and batch
     ● Tsar: TimeSeries AggregatoR
       ○ DSL implemented on top of Summingbird
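One way to see why a single aggregation can serve both the batch and the real-time side of a lambda architecture: if partial results are combined with an associative operation, counts computed separately can be merged in any grouping. The sketch below uses plain Scala collections only. It does not use the Summingbird API, and the names `LambdaMerge` and `merge` are illustrative.

```scala
object LambdaMerge {
  /** Merge two partial count maps by adding the values key by key.
    * Addition is associative and commutative, so partial results
    * can be combined in any order or grouping.
    */
  def merge[K](a: Map[K, Long], b: Map[K, Long]): Map[K, Long] =
    b.foldLeft(a) { case (acc, (key, count)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + count)
    }
}
```

For example, merging a batch view `Map("t1" -> 2L, "t2" -> 1L)` with a real-time view `Map("t2" -> 1L, "t3" -> 1L)` gives `Map("t1" -> 2L, "t2" -> 2L, "t3" -> 1L)`.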
  10. Data Access Layer (DAL)
      ● DAL is a service that simplifies the discovery, usage, and maintainability of data
      ● Users work with logical datasets
      ● A physical dataset describes the serialization of a logical dataset to a specific location (Hadoop, Vertica) and format
      ● A logical dataset can simultaneously exist in multiple places
      ● Users can use the logical dataset name to consume data with different tools like Scalding and Presto
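The logical/physical split described on this slide can be pictured as a small data model. This is a hypothetical sketch, not DAL's actual schema or API: the case classes, field names, and the `resolve` helper are assumptions made for illustration.

```scala
object DalSketch {
  // One serialized copy of a dataset: where it lives and in what format.
  final case class PhysicalDataset(system: String, location: String, format: String)

  // The name users work with, plus every place the data currently exists.
  final case class LogicalDataset(name: String, physical: Seq[PhysicalDataset])

  /** Resolve a logical dataset name to its physical copy in the given system, if one exists. */
  def resolve(
      catalog: Map[String, LogicalDataset],
      name: String,
      system: String
  ): Option[PhysicalDataset] =
    catalog.get(name).flatMap(_.physical.find(_.system == system))
}
```

Under this model, a tool only needs the logical name and its own system identifier; the catalog decides which physical copy it reads.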
  11. Data Access Layer (DAL)
      ● The Eagleeye web application is the front-end for end users
      ● Users discover datasets with Eagleeye
      ● Eagleeye displays metadata like owners and schema
      ● Application access to datasets is recorded
      ● This enables Eagleeye to show dependency graphs for a dataset: the jobs that produce it and the jobs that consume it
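A dependency graph of this kind can be derived from nothing more than a log of which job read or wrote which dataset. The sketch below is an assumption about how such records might look; `AccessRecord` and its fields are hypothetical and are not taken from Eagleeye or DAL.

```scala
object Lineage {
  sealed trait Access
  case object Read extends Access
  case object Write extends Access

  // One recorded access: which job touched which dataset, and how.
  final case class AccessRecord(job: String, dataset: String, access: Access)

  /** Jobs that wrote the dataset, i.e. its producers. */
  def producers(log: Seq[AccessRecord], dataset: String): Set[String] =
    log.filter(r => r.dataset == dataset && r.access == Write).map(_.job).toSet

  /** Jobs that read the dataset, i.e. its consumers. */
  def consumers(log: Seq[AccessRecord], dataset: String): Set[String] =
    log.filter(r => r.dataset == dataset && r.access == Read).map(_.job).toSet
}
```

Following producers upstream and consumers downstream, one dataset at a time, yields the full graph.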
  12. Data Discovery
  13. Pipeline Orchestration
      ● Statebird service
        ○ Tracks state of batch jobs
        ○ Used to manage dependencies
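Tracking job state makes dependency management a lookup: a job can start once every job it depends on has succeeded. The following is a minimal sketch of that rule under assumed names. It is not the Statebird API; the states and the `isReady` helper are hypothetical.

```scala
object Orchestration {
  sealed trait JobState
  case object Pending extends JobState
  case object Running extends JobState
  case object Succeeded extends JobState
  case object Failed extends JobState

  /** A job is ready to run when all of its upstream jobs have succeeded.
    * A job with no recorded dependencies is always ready; an upstream job
    * with no recorded state counts as not yet succeeded.
    */
  def isReady(
      job: String,
      dependencies: Map[String, Set[String]],
      states: Map[String, JobState]
  ): Boolean =
    dependencies
      .getOrElse(job, Set.empty[String])
      .forall(upstream => states.get(upstream).contains(Succeeded))
}
```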
  14. Interactive SQL
      ● Interactive means that results of a query are available within seconds to a few minutes
      ● SQL is still the lingua franca for ad hoc data analysis
      ● Vertica
        ○ Columnar architecture, high-performance analytics queries
      ● Presto
        ○ Data in HDFS in Parquet format
  15. Data Visualization
      ● Custom Dashboards
      ● Apache Zeppelin Strengths
        ○ Notebook metaphor - a notebook is a collection of notes, each note is a collection of paragraphs (queries)
        ○ Web-based report authoring, collaborative like Google Docs
        ○ Very easy to create a note and then share it
        ○ > 2K notes, 18K queries
        ○ Supports JDBC (Presto, Vertica, MySQL)
        ○ Open source, easy to add new interpreters like Scalding
  16. Data Visualization
      ● Tableau Strengths
        ○ Easy to create reports, does not require SQL expertise
        ○ Built-in analytics functions, e.g. Rank, Percentile
        ○ Polished visualizations
        ○ Row-level security
  17. Core Data and Metrics
      ● A big part of data analysis is data cleansing
      ● It makes sense to do this once
      ● Core Data
        ○ Create pipelines to create “verified” datasets like Users, Tweets, Interactions
        ○ Reliable and easy to use
      ● Core Metrics
        ○ Create pipelines to compute Twitter’s important metrics
        ○ DAU, MAU, Tweet Impressions
  18. Data Processing
  19. Data Processing
      ● Analytics - Basic Counting
      ● A/B Testing
      ● Data Science - Custom analysis
      ● Data Science - Machine Learning
  20. Basic Counting
      ● Daily/Monthly Active Users
      ● Number of Tweets, Retweets, Likes
      ● Tweet Impressions
      ● Logic is relatively simple
      ● Challenges: scale and timeliness
        ○ Results for the previous day should be available by 10 am
        ○ Some metrics are real-time
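The "relatively simple" logic behind a metric like daily active users is a distinct count per day, as the in-memory sketch below shows. It makes two assumptions that the talk does not state: an event carries a user id and a timestamp, and days are cut at UTC midnight. The names are illustrative; the real challenge named on the slide, scale and timeliness, is exactly what this toy version ignores.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

object ActiveUsers {
  // Hypothetical minimal event: who did something, and when.
  final case class Event(userId: Long, timestamp: Instant)

  /** Number of distinct users seen on each UTC calendar day. */
  def dailyActiveUsers(events: Seq[Event]): Map[LocalDate, Int] =
    events
      .groupBy(e => e.timestamp.atZone(ZoneOffset.UTC).toLocalDate)
      .map { case (day, dayEvents) => day -> dayEvents.map(_.userId).distinct.size }
}
```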
  21. Example - Counting Tweet Impressions
      ● Goal: find the number of impressions and engagements for a Tweet
      ● Real-time
      ● Used in analytics.twitter.com
  22. TSAR job
      aggregate {
        onKeys(
          (TweetId)              // dimension
        ) produce (
          Count                  // metric
        ) sinkTo (Manhattan)     // data sink
      } fromProducer {           // data source
        ClientEventSource("client_events")
          .filter { event => isImpressionEvent(event) }
          .map { event => (event.timestamp, ImpressionAttributes(event.tweetId)) }
      }
  23. Example - Counting Tweet Impressions
      ● The TSAR job is converted to a Summingbird job
      ● The Summingbird job creates
        ○ Real-time pipeline with Heron
        ○ Batch pipeline with Scalding
      ● Users access results using the TSAR query service
      ● Write once, run batch and real-time
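Stripped of the platform, the impression-counting job boils down to filter, key by Tweet id, and count. The sketch below expresses that over an in-memory sequence with plain Scala collections. It is not TSAR or Summingbird code: the `ClientEvent` shape and the `"impression"` event type are assumptions, and `isImpressionEvent` here is a stand-in for whatever predicate the real job uses.

```scala
object ImpressionCount {
  // Hypothetical, simplified client event.
  final case class ClientEvent(eventType: String, tweetId: Long, timestamp: Long)

  // Stand-in predicate; the real definition is not given in the talk.
  def isImpressionEvent(event: ClientEvent): Boolean =
    event.eventType == "impression"

  /** Count impression events per Tweet id (dimension = tweetId, metric = count). */
  def impressionsByTweet(events: Seq[ClientEvent]): Map[Long, Long] =
    events
      .filter(isImpressionEvent)
      .groupBy(_.tweetId)
      .map { case (tweetId, tweetEvents) => tweetId -> tweetEvents.size.toLong }
}
```

The same three steps appear in the job definition: the `filter` call, the key passed to `onKeys`, and the `Count` metric.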
  24. A/B Testing Framework
      ● Experimentation is at the heart of Twitter’s product development cycle
      ● Expertise needed in Statistics and Technology
  25. A/B Testing Statistics
      ● Goal: informative experiment
      ● Minimize false positive and false negative errors
      ● How many users do we need to sample?
      ● How long should we run the experiment?
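The "how many users" question has a standard textbook answer for one simple case: comparing two conversion rates with equal-sized buckets and a two-sided z-test. The sketch below applies that formula, n = (z_alpha + z_beta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2 per bucket. It is not taken from the talk or from Twitter's experimentation framework; the 5% significance level, 80% power, and all names are assumptions chosen for illustration.

```scala
object SampleSize {
  // Standard normal quantiles, hard-coded for the assumed error rates.
  val ZAlpha: Double = 1.959964 // two-sided test, 5% false positive rate
  val ZBeta: Double = 0.841621  // 80% power, i.e. 20% false negative rate

  /** Approximate users needed per bucket to detect a change in a
    * conversion rate from `baseline` to `treatment`.
    */
  def usersPerBucket(baseline: Double, treatment: Double): Long = {
    require(baseline > 0 && baseline < 1, "baseline must be in (0, 1)")
    require(treatment > 0 && treatment < 1, "treatment must be in (0, 1)")
    require(baseline != treatment, "rates must differ")
    val variance = baseline * (1 - baseline) + treatment * (1 - treatment)
    val delta = treatment - baseline
    math.ceil(math.pow(ZAlpha + ZBeta, 2) * variance / (delta * delta)).toLong
  }
}
```

Under these assumptions, detecting a move from a 10% to an 11% rate needs on the order of 15 thousand users per bucket, and halving the detectable difference roughly quadruples that number. This is why small effects demand large or long-running experiments.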
  26. A/B Testing Technology
      ● Processes 100 billion events daily; compute intensive
      ● Metrics are computed using a Scalding pipeline that combines client event logs, internal user models, and other datasets
      ● Lightweight statistics are computed in a streaming job using TSAR running on Heron
  27. Data Science - Custom Analysis
      ● Cause of spikes and dips in key metrics
      ● Growth trends
        ○ By country, client
      ● Analysis to understand user behavior
        ○ Creators vs Consumers
        ○ Distribution of followers
        ○ User clusters
      ● Analysis to inform product feature decisions
  28. Data Science - Machine Learning
      ● Recommendations
        ○ Users: WTF - who to follow
        ○ Tweets: Algorithmic timeline
      ● Cortex: deep learning based on the Torch framework
        ○ Identify NSFW images
        ○ Recognize what is happening in live feeds
  29. Machine Learning
      ● Product Safety
        ○ Detect fake accounts
        ○ Detect Tweet spam and abuse
      ● Ad Targeting
        ○ Promoted Trends, Accounts and Tweets
        ○ Show only if it is likely to be interesting and relevant to that user
        ○ Predict click probability using signals including what a user chooses to follow, how they interact with a Tweet and what they retweet
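As a rough picture of what "predict click probability using signals" can mean, the sketch below scores a set of numeric signals with a logistic model. This is a generic textbook technique swapped in for illustration. The talk does not say which model Twitter uses, and the feature names, weights, and function names here are all hypothetical.

```scala
object ClickModel {
  /** Logistic function: maps any real-valued score into the interval (0, 1). */
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Estimated click probability: sigmoid of the bias plus the weighted sum
    * of the signals. Signals without a learned weight contribute nothing.
    */
  def clickProbability(
      weights: Map[String, Double],
      bias: Double,
      signals: Map[String, Double]
  ): Double = {
    val score = signals.map { case (name, value) => weights.getOrElse(name, 0.0) * value }.sum
    sigmoid(bias + score)
  }
}
```

A signal with a positive weight, for instance a hypothetical `follows_advertiser` flag, pushes the estimate above the baseline set by the bias; a negative weight pushes it below.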
  30. Ideal Talent Stack
      ● Systems (Hadoop, Vertica)
        ○ Necessary because higher-level abstractions are leaky
      ● Programming (Scala, Scalding, SQL)
      ● Math (Statistics, Linear Algebra)
      (Diagram labels: Systems, Programming, Statistics; Data Engineers, Data Scientists)
  31. Summary
      Data Platform and Data Science work hand-in-hand to extract insights from Big Data at Twitter
  32. Questions?
  33. References
      ● TSAR: https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
      ● DAL: https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter
      ● Heron: https://blog.twitter.com/2015/flying-faster-with-twitter-heron
      ● Heron: http://www.slideshare.net/KarthikRamasamy3
      ● A/B testing: https://blog.twitter.com/2015/twitter-experimentation-technical-overview
      ● A/B testing: https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests
      ● Algorithmic timeline: https://support.twitter.com/articles/164083
      ● Cortex: https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/
      ● Cortex: https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/
