Successfully reported this slideshow.

20111104 s4 overview


Published on

S4 is a distributed stream processing system.

Published in: Technology

20111104 s4 overview

  1. 1. Apache S4: A Distributed StreamComputing PlatformPresented at Stanford Infolab – Nov 4, 2011 (migrating from S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins} Presented by Leo Neumeyer (@leoneu) 1
  2. 2. About Me Born in Buenos Aires, Argentina, studied EE. School/Work in Canada (Signal Processing, Speech Coding). SRI Intl (Menlo Park) Speech Lab, DARPA benchmarks, lab founded speech recognition spin-off Nuance Comm Inc. Mindstech: Startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available). Yahoo! Labs: Search advertising (optimization, auctions). Quantbench: mission is to create a marketplace for data scientists, data providers, and investment funds. 2
  3. 3. S4 Project History Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real-time. Open sourced in September 2009. Moved to Apache Incubator in October 2011. 3
  4. 4. Motivation Online Parameter Personalized Search Twitter Trends Optimization given multiple event streamsPredict Market Prices extract information Spam Filtering Automatic Trading using data driven models in real time with low latency Network Intrusion at scale Detection Sensor Networks Its Fun! 4
  5. 5. S4 Architecture Node App App Server App App App PE Prototype App App PE Instance App App Stream App App Unlimited There is one Apps An app is a PE instances number of server process encapsulate graph are clones of nodes. Each per node. The units of work. composed of the prototype. node has one server They can PE prototypes They are process. loads/unloads consume and and streams associated with apps. produce event that produce, a unique key streams. consume, and and contain the transmit msgs. state.S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable,event driven, pluggable platform that allows programmers to easily implementapplications for processing continuous unbounded streams of data. 5
  6. 6. Latency vs. Accuracy Zero Errors Real-TimeLatency ➔ Unconstrained ➔ ConstrainedWhy? ➔ Reproducible results ➔ Limited control over inbound data rate and computing complexityUse ➔ Debug ➔ Process unstructured data ➔ Train Models ➔ Tolerance to small errors ➔ Graceful recovery from inbound data streams 6
  7. 7. Design Actors programming model. Probabilistic thinking in both algorithms and systems. Run on commodity hardware. All in-memory, no disk bottlenecks. Pluggable (Protocols, applications, serialization, etc.) Object oriented design → POJOs Static typing, no string literals, minimize type casting. Science friendly → constant change, ease of use. 7
  8. 8. Programming Model Example: estimate click- through rate in a web application after applying a filter to remove bot traffic. 8
  9. 9. Coding an App 9
  10. 10. Research Areas: Systems Checkpointing strategies Replication strategies Dynamic load balancing Adaptive load management Query languages 10
  11. 11. Fault ToleranceProblem Approaches S4High Availability ➔ Warm/hot failover ➔ Warm failover ➔ Cold failover ➔ Standby nodes + Apache ZookeeperState Loss ➔ Lossy checkpointing ➔ Lossy checkpointing ➔ Lossless checkpoint.(Crashes, systemupdates)Low Latency ➔ Decouple stream ➔ Asynchronous writes processing from ➔ Uncoordinated checkpointing checkpointingApproach: checkpoints are count or time based, pluggable backend tosupport any data store, lazy PE restore, tuning is application dependent.Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011. 11
  12. 12. Resilience in a Distributed Word Count Task 12
  13. 13. Research Areas: Algorithms Self-adaptive models: adaptive language models using small amounts of data. Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in RT. Trend detection: find personal Twitter trends relevant to you. Intrusion detection: summarize high level state of the network and detect unusual patterns. Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors. 13
  14. 14. Personalized Search Ads Goal is to maximize: Revenue Click yield User experience By controlling: Ranking Pricing Filtering PlacementS. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th AnnualInternational Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010. 14
  15. 15. Personalized Search Ads Model ad click intent using recent user activity. More likely to click → show more North ads. Example 1 First query is digital slr camera Next query is canon slr More likely than average to click another ad Example 2 Repeated query without previous clicks Less likely to click another ad 15
  16. 16. Personalized Search Ads Modeling user session Typical features: Number of searches/clicks by user past 24 hrs User COPC: Ratio of observed clicks to predicted clicks Identical query searched before / clicked before Time (seconds) since last search/click Similarity measures: current vs. previous queries Modeling technique: stochastic gradient-descent boosted trees (GDBT) 16
  17. 17. Personalized Search Ads Target P[CLICK|ad,query,user] Approximation P[CLICK|ad,query]* ucp[user,session] Non-personalized User Click Propensity (UCP) long-term model for user session computed using Hadoop computed using S4 17
  18. 18. Personalized Search Ads Results: We can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue. - OR - For a given ad footprint we can increase click yield by ~2%. 18
  19. 19. Thank you! Join the Apache S4 project: 19