Mining Large-Scale Temporal Dynamics with Hadoop



  1. 1. Mining Large-Scale Temporal Dynamics with Hadoop. Nish Parikh, Senior Research Scientist, eBay Research Labs. Joint work with: Evan Chiu, Hui Hong, Neel Sundaresan.
  2. 2. About eBay Research Labs
     •  Who we are
        –  The group's charter is to conduct research and deliver innovative solutions.
     •  What we do
        –  Computer Vision
        –  Economics and Optimization
        –  Human Language Technologies
        –  Data Science
        –  Machine Learning
        –  Infrastructure & Systems
        –  Innovation Programs & Incubations
     •  Basically, we "dig" into data!
  3. 3. Why study temporal dynamics?
     •  Stock Markets
     •  Bio-Medical Signals
     •  Traffic, Weather and Network Systems
     •  Web Search & Ranking
     •  Recommender Systems
     •  eCommerce…
  4. 4. Challenges
     •  Large-scale data
        –  100M+ users
        –  Petabytes of click-stream logs
        –  Billions of user sessions
        –  Billions of unique queries
     •  Noisy data
        –  Robots
        –  API calls
        –  Crawlers, spiders
        –  Tools, scripts
        –  Data biases
     •  Data spread across long time frames
        –  Differences in collection methodologies
     •  Complexity of certain algorithms
  5. 5. Hadoop Cluster at eBay
     •  Nodes
        –  CentOS 4, 64-bit
        –  Intel dual hex-core Xeon, 2.4 GHz
        –  72 GB RAM
        –  12 × 2 TB (24 TB) HDD
        –  SSD for OS
     •  Network
        –  TOR 1 Gbps
        –  Core switches uplink 40 Gbps
     •  Cluster
        –  532–1008 nodes
        –  4000+ cores, 24000 vCPUs
        –  5–18 PB
  6. 6. Mobius – Computation Platform
     •  Application Layer: Click Stream Visualizer, Metrics Dashboard, Research Projects
     •  Mobius Layer: Mobius Studio (Eclipse plugin), Mobius Generic Java Dataset API, Query Language, low-level Dataset access API
     •  eBay Infrastructure & Data Source Layer: Hadoop Cluster, eBay Data (Logs, Tables)
     E. Chiu, W. Mann, J. Shen, G. Singh, Z. Yang
  7. 7. Mobius – Generic Java Dataset API
     •  Java-based, high-level data processing framework built on top of Apache Hadoop.
     •  Tuple-oriented.
     •  Supports job chaining.
     •  Supports high-level operators such as join (inner or outer) and grouping.
     •  Supports filtering.
     •  Used internally at eBay for various data science applications.
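Mobius's actual API is internal to eBay and is not shown on these slides, so here is only a minimal, language-agnostic sketch in Python of what a tuple-oriented inner join operator does under the hood of a MapReduce framework. All names (`inner_join`, the field layout) are hypothetical illustrations, not Mobius calls:

```python
from collections import defaultdict

def inner_join(left, right, key):
    """Reduce-side inner join of two tuple datasets on a shared key,
    mimicking what a high-level join operator compiles to over MapReduce."""
    # "Map" phase: bucket each tuple under its join key, tagged by source.
    buckets = defaultdict(lambda: ([], []))
    for row in left:
        buckets[row[key]][0].append(row)
    for row in right:
        buckets[row[key]][1].append(row)
    # "Reduce" phase: cross-product the per-key groups; keys present on
    # only one side produce nothing (inner-join semantics).
    joined = []
    for k, (ls, rs) in buckets.items():
        for l in ls:
            for r in rs:
                merged = dict(l)
                merged.update({f"r_{f}": v for f, v in r.items() if f != key})
                joined.append(merged)
    return joined

queries = [{"query": "ipod", "volume": 120}, {"query": "xbox", "volume": 80}]
categories = [{"query": "ipod", "category": "Electronics"}]
print(inner_join(queries, categories, "query"))
```

In a real Hadoop job the bucketing is done by the shuffle phase rather than an in-memory dictionary; the sketch only shows the join semantics.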
  8. 8. Mobius – Inner Join
  9. 9. Mobius – Filtering
  10. 10. Mobius – Grouping
  11. 11. Mobius – Chaining
  12. 12. Mining Temporal Data
     •  When it's in your mind, it's in the query logs!
        –  Queries as a proxy for demand
  13. 13. Mining Temporal Data
     •  Data Preparation
        –  Robot Filtering
        –  Session Log Analysis
     •  Data Cleaning
        –  Normalization
        –  De-duplication
     [Figures: Christmas trend – raw data; Christmas trend – prepared data]
  14. 14. Mining Temporal Data – What’s Buzzing?•  Automatic Buzz Detection
  15. 15. Mining Temporal Data – Does History Repeat Itself?
     •  Seasonality and Trend Prediction
        –  Why are searches related to Monopoly pieces popular every October?
        –  Air conditioner searches become popular as summer approaches.
  16. 16. Mining Temporal Data – Temporal Similarity
     [Figure: similar patterns for queries related to Hanukkah]
  17. 17. Preparing Data – Getting Queries from User Sessions
     Typical eBay flow: Search → View → Purchase
     •  Search: specify a query, with optional constraints.
     •  View: click on an item shown on the search results page.
     •  Purchase: buy a fixed-price item or place the winning bid on an auction item.
     Consider only queries typed in by humans. Ignore page views from robots, or views from paid advertisements, campaigns, or natural search links.
  18. 18. Cleaning Data
     •  Apply the default robot detection and removal algorithm
        –  Based on IP, number of actions per day, and agent information.
     •  Find the right flows from the sessions.
        –  Filter out noisy search events.
        –  Remove anomalies due to outlier users.
        –  Limit the impact a single user can have on aggregated data (de-duplication).
  19. 19. Finding the right flow in the session
     •  Session 1: Search → Exit. Flows without any interesting activity, like clicks, may not be considered.
     •  Session 2: Ads/paid search → View → Purchase. Searches coming from advertisements may not be considered.
     •  Session 3: Search → View → Purchase. These kinds of sessions are considered, and their information is aggregated.
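The three session patterns above can be sketched as a simple filter. This is a hypothetical illustration in Python; the event names and the `(event_type, source)` representation are assumptions for the example, not eBay's actual log schema:

```python
def keep_session(events):
    """Decide whether a session contributes to aggregated query data.
    `events` is an ordered list of (event_type, source) pairs."""
    searches = [e for e in events if e[0] == "search"]
    views = [e for e in events if e[0] == "view"]
    # Session 1 pattern: search followed by exit, no clicks -> drop.
    if not views:
        return False
    # Session 2 pattern: traffic originating from ads/paid search -> drop.
    if any(src == "paid" for _, src in events):
        return False
    # Session 3 pattern: organic search -> view (-> purchase) -> keep.
    return bool(searches)

s1 = [("search", "organic")]                                       # search, exit
s2 = [("search", "paid"), ("view", "paid"), ("purchase", "paid")]  # ad-driven
s3 = [("search", "organic"), ("view", "organic"), ("purchase", "organic")]
print([keep_session(s) for s in (s1, s2, s3)])  # [False, False, True]
```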
  20. 20. Data Preparation – MapReduce Flow
     •  Preprocessing stage (M/R): read raw events; group events into sessions; group sessions by GUID; apply the bot-filtering algorithm; find the right flow; emit the query as key and the de-duplicated query volume as value.
     •  Collecting stage (M/R): calculate the sum per key.
     •  Output daily query volume as dailyQueryData; save the result so it can be reused by other apps.
  21. 21. Time Series Generation
     •  Input: dailyQueryData for multi-year time-frames.
     •  Mapper: data cleaning; query normalization. Emits key: query; value: date → query volume.
     •  Reducer: time series formation for all unique queries; the time series indicates total daily activity volume.
     •  Output: vectors of query → volume time series.
     (Data not to scale and only shown as an example.)
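The reducer's job above can be sketched in a few lines of Python. This is a hypothetical single-process illustration (names like `build_time_series` are assumptions), collapsing the map and reduce steps that Hadoop would run in parallel:

```python
from collections import defaultdict

def build_time_series(daily_query_data, dates):
    """Group (query, date, volume) records into one fixed-length
    daily-volume vector per unique query; missing days stay at zero."""
    index = {d: i for i, d in enumerate(dates)}
    series = defaultdict(lambda: [0] * len(dates))
    for query, date, volume in daily_query_data:
        # Query normalization (trim + lowercase) merges spelling variants.
        series[query.strip().lower()][index[date]] += volume
    return dict(series)

records = [
    ("Christmas Tree", "2011-12-01", 500),
    ("christmas tree", "2011-12-02", 650),
    ("ipod", "2011-12-01", 120),
]
ts = build_time_series(records, ["2011-12-01", "2011-12-02"])
print(ts["christmas tree"])  # [500, 650]
```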
  22. 22. Buzz Detection – 2-state automaton model
     •  Arrival of queries is modeled as a stream.
     •  A "low rate" state (q0) and a "high rate" state (q1), with exponential gap densities f0(x) = α0·e^(−α0·x) and f1(x) = α1·e^(−α1·x), where α1 > α0.
     •  The automaton changes state with probability p ∈ (0, 1) between query arrivals.
     •  Let Q = (q_i1, q_i2, …, q_in) be a state sequence. Each state sequence Q induces a density function f_Q over sequences of gaps, of the form f_Q(x1, x2, …, xn) = ∏_{t=1}^{n} f_it(xt).
     N. Parikh, N. Sundaresan. Scalable and Near Real-time Burst Detection from eCommerce Queries. KDD 2008.
  23. 23. Buzz Detection – Modeling Queries as a Stream
     [Figures: frequency of a query over time; gaps between arrival times for queries]
  24. 24. Buzz Detection – 2-state automaton model
     •  Let b denote the number of state transitions in sequence Q.
     •  The prior probability of Q is then p^b · (1 − p)^(n − b) = (p / (1 − p))^b · (1 − p)^n.
     •  Using Bayes' theorem, the cost equation is C(Q | X) = b · ln((1 − p) / p) + Σ_{t=1}^{n} (−ln f_it(xt)).
     •  The sequence that minimizes the cost depends on:
        –  The ease of jumps between the 2 states.
        –  How well the sequence conforms to the rate of query arrivals.
     •  The configurable parameters for the model are α0, α1 and the transition cost p.
        –  α0, α1 are calculated from data in the MR job.
        –  A heuristically determined value of p = 0.38 is used.
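The minimum-cost state sequence for this cost function can be found by dynamic programming over the gaps. The following is a minimal Python sketch of that computation (function name and toy data are illustrative, not the production MR job):

```python
import math

def two_state_sequence(gaps, alpha0, alpha1, p=0.38):
    """Minimum-cost state sequence for the 2-state automaton:
    C(Q|X) = b*ln((1-p)/p) + sum_t(-ln f_it(x_t)),
    with f_i(x) = alpha_i * exp(-alpha_i * x)."""
    trans = math.log((1 - p) / p)            # cost of one state transition

    def emit(alpha, x):
        return alpha * x - math.log(alpha)   # -ln f(x)

    # cost[s] = best cost of any sequence ending in state s; back[] = path.
    cost = [emit(alpha0, gaps[0]), emit(alpha1, gaps[0])]
    back = []
    for x in gaps[1:]:
        new, choices = [], []
        for s, alpha in ((0, alpha0), (1, alpha1)):
            stay = cost[s]
            switch = cost[1 - s] + trans
            choices.append(s if stay <= switch else 1 - s)
            new.append(min(stay, switch) + emit(alpha, x))
        back.append(choices)
        cost = new
    # Trace back the optimal sequence of low (0) / high (1) states.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for choices in reversed(back):
        state = choices[state]
        path.append(state)
    return path[::-1]

# Short gaps mean a high arrival rate: the middle of this stream buzzes.
gaps = [5.0, 5.0, 0.2, 0.1, 0.2, 5.0, 5.0]
print(two_state_sequence(gaps, alpha0=0.2, alpha1=5.0))  # [0, 0, 1, 1, 1, 0, 0]
```

The transition term penalizes jumping between states, so only a sustained burst of closely spaced arrivals pulls the automaton into the high-rate state.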
  25. 25. Query Volume Time Series – 2-State Representation
  26. 26. Time Series Normalization and Buzz Detection
     •  Input: 4–7 years of query time series vectors.
     •  Mapper: normalize the time series; transform the time series to the two-state model; calculate parameters α0, α1 for every query and apply dynamic programming for the 2-state calculation; calculate the probability of being a periodic event query (e.g. superbowl). Emits key: query; value: normalized time series, two-state model, probability of being a seasonal event query.
     •  Reducer: group queries buzzing at similar time intervals. Emits key: time-frame; value: queries that buzz during that time frame.
     •  Output: time-frame → queries buzzing during that time period.
  27. 27. Catman – Application for eBay sellers & buyers
  28. 28. Binary data structure generation from MR job
     •  Created a new FileOutputFormat.
     •  Write time series data to two files:
        –  A binary file with fixed-size records indicating time series volume.
        –  A text file mapping each unique query string to the binary file and offset.
     •  The index created by the reducers is directly loaded by custom servers written in C++.
     •  Used for an internal Query Trends application.
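The fixed-size-record-plus-offset-index layout can be sketched as follows. This is a hypothetical Python illustration of the file format idea (record size, helper names, and the tab-separated index are assumptions); the slide's actual writer is a custom Hadoop FileOutputFormat and the reader is a C++ server:

```python
import io
import struct

RECORD = struct.Struct("<365i")  # one year of daily volumes per query (illustrative)

def write_index(series_by_query):
    """Write fixed-size binary time-series records plus a text index
    mapping each query string to its byte offset in the binary file."""
    binary, index = io.BytesIO(), []
    for query, volumes in series_by_query.items():
        index.append(f"{query}\t{binary.tell()}")
        binary.write(RECORD.pack(*volumes))
    return binary.getvalue(), "\n".join(index)

def lookup(binary, index_text, query):
    """Serve a trends request: find the offset, then unpack one record.
    Fixed-size records make this a single seek, no scan."""
    offsets = dict(line.split("\t") for line in index_text.splitlines())
    off = int(offsets[query])
    return list(RECORD.unpack_from(binary, off))

data = {"ipod": [1] * 365, "christmas tree": [2] * 365}
blob, idx = write_index(data)
print(lookup(blob, idx, "christmas tree")[:3])  # [2, 2, 2]
```

Fixed-size records are what let the serving layer map a query straight to a byte offset; that is the design choice the slide is pointing at.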
  29. 29. Query Trends
  30. 30. Query Trends – Mapping to External Events
  31. 31. Trends – Comparing Queries
  32. 32. Temporal Similarity
     •  1+ billion queries.
     •  Naïve algorithm – quadratic complexity.
     •  Pearson's correlation.
     •  Candidate set reduction:
        –  Correlations are useful only for event-based or seasonal queries.
        –  Correlations are useful in applications only for head and torso queries.
        –  These filters reduce the candidate space from a billion-plus queries to a few million.
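As a reminder of the similarity measure involved, here is a self-contained Python sketch of Pearson's correlation on two toy query-volume series (the series values are invented for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation between two equal-length time series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two toy seasonal series that peak together correlate near +1.
menorah = [1, 1, 8, 9, 8, 1, 1]
dreidel = [2, 2, 9, 10, 9, 2, 2]
print(round(pearson(menorah, dreidel), 3))
```

Computing this for every pair of a billion queries is the quadratic blow-up the slide describes, which is why the candidate-set filters run first.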
  33. 33. Exact correlations amongst candidates – All-pairs similarity on the reduced set
  34. 34. Applications of Temporal Correlations – Query Suggestions
  35. 35. Remarks
     •  Log-mining and time-series-mining algorithms are parallelizable.
     •  Such algorithms are easy to scale using Hadoop.
     •  Hadoop empowers us to look at data sets spanning many years.
     •  Hadoop enables us to iterate faster and hence run more user-facing experiments.
  36. 36. References
     •  N. Parikh, N. Sundaresan. Scalable and Near Real-time Burst Detection from eCommerce Queries. KDD 2008.
     •  N. Sundaresan et al. Scalable Stream Processing and Map Reduce. Hadoop World 2009.
     •  A. Madan. Hadoop at eBay.
     •  N. Sundaresan. Popup Commerce: Towards Building Transient and Thematic Stores. X.Innovate 2011.
     •  P. Pantel et al. Web-Scale Distributional Similarity and Entity Set Expansion. EMNLP 2009.
  37. 37. Acknowledgments: Project Members & Alumni
     •  Long Hoang
     •  Rifat Joyee
     •  Wallace Mann
     •  Karin Mauge
     •  Hill Nguyen
     •  Jack Shen
     •  Gyanit Singh
     •  Zhou Yang
  38. 38. Questions?