20111104 s4 overview



S4 is a distributed stream processing system.




20111104 s4 overview: Presentation Transcript

  • Apache S4: A Distributed Stream Computing Platform
    Presented at Stanford Infolab, Nov 4, 2011
    http://incubator.apache.org/projects/s4 (migrating from http://s4.io)
    S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org
    Presented by Leo Neumeyer (@leoneu)
  • About Me
    Born in Buenos Aires, Argentina; studied EE.
    School and work in Canada (signal processing, speech coding).
    SRI Intl (Menlo Park) Speech Lab: DARPA benchmarks; the lab founded the speech recognition spin-off Nuance Comm Inc.
    Mindstech: startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available).
    Yahoo! Labs: search advertising (optimization, auctions).
    Quantbench: mission is to create a marketplace for data scientists, data providers, and investment funds.
  • S4 Project History
    Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real time.
    Open sourced in September 2009.
    Moved to Apache Incubator in October 2011.
  • Motivation
    Online Parameter Optimization
    Personalized Search
    Twitter Trends
    Predict Market Prices
    Spam Filtering
    Automatic Trading
    Network Intrusion Detection
    Sensor Networks
    Given multiple event streams, extract information using data-driven models in real time, with low latency, at scale.
    It's fun!
  • S4 Architecture
    Diagram elements: Node, Server, App, PE Prototype, PE Instance, Stream.
    Nodes: Unlimited number of nodes. Each node has one process.
    Server: There is one server process per node. The server loads/unloads apps.
    PEs: Encapsulate units of work. They can consume and produce event streams.
    Apps: An app is a graph composed of PE prototypes and streams that produce, consume, and transmit messages.
    PE instances: Clones of the prototype. They are associated with a unique key and contain the state.
    S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable, event-driven, pluggable platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.
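The prototype/instance relationship described on this slide can be illustrated with a minimal sketch. It is written in Python for brevity and is not the S4 Java API: the names `CounterPE`, `Stream`, `key_of`, and `put` are hypothetical, and everything runs in a single process. The stream clones the prototype the first time it sees a key, so each instance holds the state for exactly one key.

```python
import copy


class CounterPE:
    """Hypothetical processing element (PE) that counts events for one key."""

    def __init__(self):
        self.key = None   # set when the prototype is cloned for a key
        self.count = 0    # per-key state lives in the instance

    def on_event(self, event):
        self.count += 1


class Stream:
    """Routes each event to the PE instance that owns the event's key."""

    def __init__(self, key_of, prototype):
        self.key_of = key_of        # function that extracts the key from an event
        self.prototype = prototype  # template that instances are cloned from
        self.instances = {}         # key -> PE instance holding that key's state

    def put(self, event):
        key = self.key_of(event)
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: clone the prototype.
            pe = copy.deepcopy(self.prototype)
            pe.key = key
            self.instances[key] = pe
        pe.on_event(event)


stream = Stream(key_of=lambda e: e["word"], prototype=CounterPE())
for word in ["s4", "stream", "s4"]:
    stream.put({"word": word})

counts = {key: pe.count for key, pe in stream.instances.items()}
print(counts)
```

The prototype itself never receives events in this sketch; only its clones do, which keeps the state of different keys isolated.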
  • Latency vs. Accuracy
    Zero Errors
      Latency: unconstrained.
      Why? Reproducible results.
      Use: debug; train models.
    Real-Time
      Latency: constrained.
      Why? Limited control over inbound data rate and computing complexity.
      Use: process unstructured data; tolerance to small errors; graceful recovery from inbound data streams.
  • Design
    Actors programming model.
    Probabilistic thinking in both algorithms and systems.
    Run on commodity hardware.
    All in-memory, no disk bottlenecks.
    Pluggable (protocols, applications, serialization, etc.).
    Object-oriented design → POJOs.
    Static typing, no string literals, minimize type casting.
    Science friendly → constant change, ease of use.
  • Programming Model
    Example: estimate click-through rate in a web application after applying a filter to remove bot traffic.
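As a rough illustration of this example, the sketch below filters bot traffic and then estimates a click-through rate per ad. It is Python rather than S4 code, and the event fields (`type`, `ad_id`, `user_agent`), the `is_bot` rule, and the class and function names are all assumptions made for the illustration; the slides do not specify the filter or the event schema.

```python
def is_bot(event):
    # Hypothetical filter rule: treat any user agent containing "bot" as a bot.
    return "bot" in event["user_agent"].lower()


class CtrEstimator:
    """Keeps impression and click counts for one ad (the keyed state)."""

    def __init__(self):
        self.impressions = 0
        self.clicks = 0

    def on_event(self, event):
        if event["type"] == "impression":
            self.impressions += 1
        elif event["type"] == "click":
            self.clicks += 1

    def ctr(self):
        # Clicks divided by impressions; 0.0 until an impression is seen.
        return self.clicks / self.impressions if self.impressions else 0.0


def estimate_ctr(events):
    """Drop bot events, then update one estimator per ad_id."""
    estimators = {}
    for event in events:
        if is_bot(event):
            continue
        estimators.setdefault(event["ad_id"], CtrEstimator()).on_event(event)
    return estimators


# Hypothetical event stream for illustration only.
events = [
    {"type": "impression", "ad_id": "a", "user_agent": "Mozilla/5.0"},
    {"type": "click", "ad_id": "a", "user_agent": "Mozilla/5.0"},
    {"type": "impression", "ad_id": "a", "user_agent": "Mozilla/5.0"},
    {"type": "impression", "ad_id": "a", "user_agent": "ExampleBot/1.0"},
    {"type": "click", "ad_id": "a", "user_agent": "ExampleBot/1.0"},
    {"type": "impression", "ad_id": "b", "user_agent": "Mozilla/5.0"},
]
estimators = estimate_ctr(events)
for ad_id, estimator in estimators.items():
    print(ad_id, estimator.ctr())
```

In an S4-style design the filter and the estimator would typically be separate processing elements connected by a stream keyed on the ad identifier; here they are collapsed into one loop to keep the sketch short.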
  • Coding an App
  • Research Areas: Systems
    Checkpointing strategies
    Replication strategies
    Dynamic load balancing
    Adaptive load management
    Query languages
  • Fault Tolerance
    Problem: High Availability
      Approaches: warm/hot failover; cold failover.
      S4: warm failover; standby nodes + Apache ZooKeeper.
    Problem: State Loss (crashes, system updates)
      Approaches: lossy checkpointing; lossless checkpointing.
      S4: lossy checkpointing.
    Problem: Low Latency
      Approaches: decouple stream processing from checkpointing.
      S4: asynchronous writes; uncoordinated checkpointing.
    Approach: checkpoints are count or time based, pluggable backend to support any data store, lazy PE restore, tuning is application dependent.
    Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
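The approach on this slide (count-based checkpoints, a pluggable backend, lazy restore, and lossy recovery) can be sketched as follows. This is a simplified Python illustration, not S4's implementation: `DictBackend`, `CheckpointedCounters`, and `every_n_events` are hypothetical names, the per-key state is just a counter, and writes are synchronous here although the slide lists asynchronous writes.

```python
import pickle


class DictBackend:
    """Hypothetical in-memory store; any object with save/load could be plugged in."""

    def __init__(self):
        self._blobs = {}

    def save(self, key, blob):
        self._blobs[key] = blob

    def load(self, key):
        return self._blobs.get(key)


class CheckpointedCounters:
    """Per-key counters with count-based checkpoints and lazy restore."""

    def __init__(self, backend, every_n_events):
        self.backend = backend
        self.every_n_events = every_n_events
        self.state = {}   # key -> count held in memory
        self.since = {}   # key -> events seen since the last checkpoint

    def on_event(self, key):
        if key not in self.state:
            # Lazy restore: read the checkpoint only when the key is first needed.
            blob = self.backend.load(key)
            self.state[key] = pickle.loads(blob) if blob is not None else 0
            self.since[key] = 0
        self.state[key] += 1
        self.since[key] += 1
        if self.since[key] >= self.every_n_events:
            # Count-based trigger (a time-based trigger would check a clock instead).
            self.backend.save(key, pickle.dumps(self.state[key]))
            self.since[key] = 0


backend = DictBackend()
node = CheckpointedCounters(backend, every_n_events=3)
for _ in range(5):
    node.on_event("s4")   # checkpoint is written after the 3rd event

# Simulate a crash: a fresh node shares only the backend, not the memory.
restarted = CheckpointedCounters(backend, every_n_events=3)
restarted.on_event("s4")  # restores 3 from the checkpoint, then counts 1 more
print(node.state["s4"], restarted.state["s4"])
```

The two events processed after the last checkpoint are lost on restart, which is the sense in which this kind of checkpointing is lossy. A smaller `every_n_events` reduces the loss at the cost of more writes, which is one reason the tuning is application dependent.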
  • Resilience in a Distributed Word Count Task
  • Research Areas: Algorithms
    Self-adaptive models: adaptive language models using small amounts of data.
    Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in real time.
    Trend detection: find personal Twitter trends relevant to you.
    Intrusion detection: summarize the high-level state of the network and detect unusual patterns.
    Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors.
  • Personalized Search Ads
    Goal is to maximize: revenue, click yield, user experience.
    By controlling: ranking, pricing, filtering, placement.
    S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.
  • Personalized Search Ads
    Model ad click intent using recent user activity.
    More likely to click → show more North ads.
    Example 1: the first query is "digital slr camera" and the next query is "canon slr". The user is more likely than average to click another ad.
    Example 2: a repeated query without previous clicks. The user is less likely to click another ad.
  • Personalized Search Ads
    Modeling user session
    Typical features:
      Number of searches/clicks by the user in the past 24 hrs
      User COPC: ratio of observed clicks to predicted clicks
      Identical query searched before / clicked before
      Time (seconds) since last search/click
      Similarity measures: current vs. previous queries
    Modeling technique: stochastic gradient-descent boosted trees (GDBT)
  • Personalized Search Ads
    Target: P[CLICK|ad,query,user]
    Approximation: P[CLICK|ad,query] * ucp[user,session]
      P[CLICK|ad,query]: non-personalized long-term model, computed using Hadoop.
      ucp[user,session]: User Click Propensity (UCP) for the user session, computed using S4.
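To make the approximation concrete, the sketch below scales a non-personalized click probability by a per-session UCP factor and uses the result to decide which ads to show. Only the product form comes from the slide. The clipping to 1.0, the threshold rule, the function names, and all numbers are assumptions for illustration and are not results from the talk.

```python
def personalized_probability(base_probability, ucp):
    """P[CLICK|ad,query] * ucp[user,session], clipped to a valid probability."""
    return min(1.0, base_probability * ucp)


def ads_to_show(base_probabilities, ucp, threshold):
    """Hypothetical filtering rule: keep ads whose personalized probability
    reaches the threshold."""
    return [
        ad
        for ad, p in base_probabilities.items()
        if personalized_probability(p, ucp) >= threshold
    ]


# Hypothetical long-term estimates for three candidate ads.
base = {"ad_1": 0.08, "ad_2": 0.05, "ad_3": 0.03}

engaged = ads_to_show(base, ucp=1.5, threshold=0.06)  # session more likely to click
passive = ads_to_show(base, ucp=0.8, threshold=0.06)  # session less likely to click
print(engaged, passive)
```

Under these assumed numbers the engaged session sees two ads and the passive session sees one, which mirrors the idea of showing more ads to users who are more likely to click.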
  • Personalized Search Ads
    Results: We can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue, OR, for a given ad footprint, we can increase click yield by ~2%.
  • Thank you!
    Join the Apache S4 project:
      s4-user-subscribe@incubator.apache.org
      s4-dev-subscribe@incubator.apache.org