Apache S4: A Distributed Stream
Computing Platform

Presented at Stanford Infolab – Nov 4, 2011

http://incubator.apache.org/projects/s4 (migrating from http://s4.io)


  S4 Committers: {fpj, kishoreg, leoneu, mmorel,
  robbins}@apache.org
  Presented by Leo Neumeyer (@leoneu)


About Me

 Born in Buenos Aires, Argentina, studied EE.
 School/Work in Canada (Signal Processing, Speech Coding).
 SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab
 founded speech recognition spin-off Nuance Comm Inc.
 Mindstech: Startup to teach spoken English in Asia using web
 audio/video (before 2-way media was widely available).
 Yahoo! Labs: Search advertising (optimization, auctions).
 Quantbench: mission is to create a marketplace for data
 scientists, data providers, and investment funds.




S4 Project History

 Started as a research project at Yahoo! Labs in August 2008
 out of the need to personalize search ads in real-time.
 Open sourced in September 2009.
 Moved to Apache Incubator in October 2011.




Motivation


 Given multiple event streams, extract information using data-driven
 models in real time, with low latency, at scale.

 Applications:
  Personalized Search
  Twitter Trends
  Online Parameter Optimization
  Predict Market Prices
  Automatic Trading
  Spam Filtering
  Network Intrusion Detection
  Sensor Networks

 It's Fun!
S4 Architecture

 Node: unlimited number of nodes. Each node has one process.
 Server: there is one server process per node. The server loads/unloads
 apps.
 App: apps encapsulate units of work. They can consume and produce event
 streams.
 PE Prototype and Stream: an app is a graph composed of PE prototypes and
 streams that produce, consume, and transmit messages.
 PE Instance: PE instances are clones of the prototype. They are
 associated with a unique key and contain the state.



S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable,
event driven, pluggable platform that allows programmers to easily implement
applications for processing continuous unbounded streams of data.
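
The relationship between a PE prototype and its keyed PE instances can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the S4 API (S4 itself is written in Java); the names `CounterPE`, `Stream`, `put`, and `key_of` are hypothetical.

```python
import copy


class CounterPE:
    """Hypothetical processing element (PE) that counts events for one key."""

    def __init__(self):
        self.key = None   # set when the prototype is cloned for a key
        self.count = 0    # per-key state lives in the instance

    def on_event(self, event):
        self.count += 1


class Stream:
    """Hypothetical stream that routes each event to the PE instance for its key."""

    def __init__(self, prototype, key_of):
        self.prototype = prototype
        self.key_of = key_of      # function that extracts the key from an event
        self.instances = {}       # key -> PE instance

    def put(self, event):
        key = self.key_of(event)
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: clone the prototype.
            pe = copy.deepcopy(self.prototype)
            pe.key = key
            self.instances[key] = pe
        pe.on_event(event)


stream = Stream(CounterPE(), key_of=lambda e: e["user"])
for user in ["ana", "bob", "ana"]:
    stream.put({"user": user})

counts = {key: pe.count for key, pe in stream.instances.items()}
print(counts)
```

The prototype never receives events itself; it only serves as the template from which one instance per key is created.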
Latency vs. Accuracy


            Zero Errors                Real-Time
Latency     ➔ Unconstrained            ➔ Constrained
Why?        ➔ Reproducible results     ➔ Limited control over inbound data
                                         rate and computing complexity
Use         ➔ Debug                    ➔ Process unstructured data
            ➔ Train models             ➔ Tolerance to small errors
                                       ➔ Graceful recovery from
                                         inbound data streams




Design

 Actors programming model.
 Probabilistic thinking in both algorithms and systems.
 Run on commodity hardware.
 All in-memory, no disk bottlenecks.
 Pluggable (Protocols, applications, serialization, etc.)
 Object oriented design → POJOs
 Static typing, no string literals, minimize type casting.
 Science friendly → constant change, ease of use.




Programming Model


 Example: estimate click-through rate in a web application after
 applying a filter to remove bot traffic.
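
The example above can be sketched as two stages, a filter followed by a counter. This is an illustrative plain-Python sketch, not S4 code; the event fields (`ad_id`, `user_agent`, `clicked`) and the bot rule are assumptions made for the illustration.

```python
from collections import defaultdict


def is_bot(event):
    # Placeholder rule for illustration only; real bot filters are more involved.
    return "bot" in event["user_agent"].lower()


class CtrEstimator:
    """Keeps impression and click counts per ad and reports clicks / impressions."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event):
        self.impressions[event["ad_id"]] += 1
        if event["clicked"]:
            self.clicks[event["ad_id"]] += 1

    def ctr(self, ad_id):
        shown = self.impressions[ad_id]
        return self.clicks[ad_id] / shown if shown else 0.0


events = [
    {"ad_id": "a1", "user_agent": "Mozilla/5.0", "clicked": True},
    {"ad_id": "a1", "user_agent": "ExampleBot/1.0", "clicked": True},
    {"ad_id": "a1", "user_agent": "Mozilla/5.0", "clicked": False},
]

estimator = CtrEstimator()
for event in events:
    if not is_bot(event):          # filter stage
        estimator.on_event(event)  # counting stage

print(estimator.ctr("a1"))
```

In a distributed setting each stage would be its own PE, with the counting stage keyed by ad so that every ad's counts live in a single instance.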




Coding an App




Research Areas: Systems

 Checkpointing strategies
 Replication strategies
 Dynamic load balancing
 Adaptive load management
 Query languages




Fault Tolerance

Problem                    Approaches                   S4
High Availability          ➔ Warm/hot failover          ➔ Warm failover
                           ➔ Cold failover              ➔ Standby nodes +
                                                          Apache ZooKeeper
State Loss                 ➔ Lossy checkpointing        ➔ Lossy checkpointing
(Crashes, system updates)  ➔ Lossless checkpointing
Low Latency                ➔ Decouple stream            ➔ Asynchronous writes
                             processing from            ➔ Uncoordinated
                             checkpointing                checkpointing

Approach: checkpoints are count- or time-based; a pluggable backend
supports any data store; PEs are restored lazily; tuning is application
dependent.
Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
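
A minimal sketch of this approach, assuming a count-based trigger and a dictionary-backed store. It is plain Python, not the S4 checkpointing API; `InMemoryStore`, `CheckpointedCounter`, and the `every` parameter are hypothetical names. For brevity the sketch writes checkpoints synchronously, whereas the table above lists asynchronous writes for S4.

```python
class InMemoryStore:
    """Stand-in backend; any object with save/load could be plugged in."""

    def __init__(self):
        self.data = {}

    def save(self, key, state):
        self.data[key] = state

    def load(self, key):
        return self.data.get(key)


class CheckpointedCounter:
    """Counts events per key and checkpoints every `every` events (count based)."""

    def __init__(self, store, every=3):
        self.store = store
        self.every = every
        self.counts = {}       # in-memory state, lost on a crash
        self.since_save = {}   # events seen per key since its last checkpoint

    def on_event(self, key):
        if key not in self.counts:
            # Lazy restore: state is read back only when the key is seen again.
            self.counts[key] = self.store.load(key) or 0
            self.since_save[key] = 0
        self.counts[key] += 1
        self.since_save[key] += 1
        if self.since_save[key] >= self.every:
            self.store.save(key, self.counts[key])
            self.since_save[key] = 0


store = InMemoryStore()
pe = CheckpointedCounter(store, every=3)
for _ in range(5):
    pe.on_event("hello")

# Simulate a crash: the in-memory state is gone, the store survives.
recovered = CheckpointedCounter(store, every=3)
recovered.on_event("hello")
print(recovered.counts["hello"])
```

The two events received after the last checkpoint are not recovered, which is what makes the scheme lossy: a smaller `every` loses less state at the cost of more writes.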

Resilience in a Distributed Word Count Task




Research Areas: Algorithms

 Self-adaptive models: adaptive language models using small
 amounts of data.
 Personalization: learn from user feedback (clicks, location,
 behavior) to deliver relevant information in real time.
 Trend detection: find personal Twitter trends relevant to you.
 Intrusion detection: summarize high level state of the network
 and detect unusual patterns.
 Sensor networks: large amounts of audio/video and other
 sources require processing, recognition, detection, and
 tracking. Detect events across sensors.




Personalized Search Ads

                                                                 Goal is to maximize:
                                                                  Revenue
                                                                  Click yield
                                                                  User experience

                                                                 By controlling:
                                                                  Ranking
                                                                  Pricing
                                                                  Filtering
                                                                  Placement

S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual
International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.

Personalized Search Ads

 Model ad click intent using recent user activity.
 More likely to click → show more North ads.

 Example 1
  First query is digital slr camera
  Next query is canon slr
  More likely than average to click another ad

 Example 2
  Repeated query without previous clicks
  Less likely to click another ad

Personalized Search Ads

 Modeling user session

 Typical features:
   Number of searches/clicks by user past 24 hrs
   User COPC: Ratio of observed clicks to predicted clicks
   Identical query searched before / clicked before
   Time (seconds) since last search/click
   Similarity measures: current vs. previous queries

 Modeling technique: stochastic gradient-descent boosted
 trees (GDBT)

Personalized Search Ads


   Target
      P[CLICK|ad,query,user]

   Approximation
     P[CLICK|ad,query] * ucp[user,session]

     P[CLICK|ad,query]: non-personalized long-term model,
     computed using Hadoop
     ucp[user,session]: User Click Propensity (UCP) for the user session,
     computed using S4
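
The approximation above amounts to scaling a long-term click probability by a per-session factor. A small worked sketch follows; the numeric values are made up for illustration, and capping the product at 1.0 is an assumption of this sketch, not something stated in the slides.

```python
def personalized_click_probability(base_probability, ucp):
    """Approximate P[CLICK|ad,query,user] as P[CLICK|ad,query] * ucp[user,session].

    The cap at 1.0 is an assumption added here so the result stays a probability.
    """
    return min(1.0, base_probability * ucp)


base = 0.04  # hypothetical long-term estimate for one (ad, query) pair

engaged = personalized_click_probability(base, 1.5)     # session more likely to click
disengaged = personalized_click_probability(base, 0.5)  # session less likely to click

print(engaged, disengaged)
```

A propensity above 1 raises the estimate for the session and a propensity below 1 lowers it, while the long-term model itself is left unchanged.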


Personalized Search Ads

 Results:

  We can reduce the average number of ads (ad footprint) by
  7% without decreasing click yield and revenue.

                - OR -

  For a given ad footprint we can increase click yield by
  ~2%.




Thank you!
 Join the Apache S4 project:

  s4-user-subscribe@incubator.apache.org

  s4-dev-subscribe@incubator.apache.org




20111104 s4 overview

  • 1. Apache S4: A Distributed Stream Computing Platform. Presented at Stanford Infolab – Nov 4, 2011. http://incubator.apache.org/projects/s4 (migrating from http://s4.io). S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org. Presented by Leo Neumeyer (@leoneu).
  • 2. About Me. Born in Buenos Aires, Argentina, studied EE. School/Work in Canada (Signal Processing, Speech Coding). SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab founded speech recognition spin-off Nuance Comm Inc. Mindstech: Startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available). Yahoo! Labs: Search advertising (optimization, auctions). Quantbench: mission is to create a marketplace for data scientists, data providers, and investment funds.
  • 3. S4 Project History. Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real-time. Open sourced in September 2009. Moved to Apache Incubator in October 2011.
  • 4. Motivation. Given multiple event streams, extract information using data driven models, in real time, with low latency, at scale. Applications: Personalized Search; Twitter Trends; Online Parameter Optimization; Predict Market Prices / Automatic Trading; Spam Filtering; Network Intrusion Detection; Sensor Networks. It's Fun!
  • 5. S4 Architecture. Node: unlimited number of nodes; each node has one process. Server: there is one server process per node; the server loads/unloads apps. Apps: apps encapsulate units of work; they can consume and produce event streams. PE prototypes and streams: an app is a graph composed of PE prototypes and streams that produce, consume, and transmit msgs. PE instances: PE instances are clones of the prototype; they are associated with a unique key and contain the state. S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable, event driven, pluggable platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.
  • 6. Latency vs. Accuracy. Zero Errors: latency is unconstrained. Why? Reproducible results. Use: debug; train models. Real-Time: latency is constrained. Why? Limited control over inbound data rate and computing complexity. Use: process unstructured data; tolerance to small errors; graceful recovery from inbound data streams.
  • 7. Design. Actors programming model. Probabilistic thinking in both algorithms and systems. Run on commodity hardware. All in-memory, no disk bottlenecks. Pluggable (protocols, applications, serialization, etc.). Object oriented design → POJOs. Static typing, no string literals, minimize type casting. Science friendly → constant change, ease of use.
  • 8. Programming Model. Example: estimate click-through rate in a web application after applying a filter to remove bot traffic.
  • 10. Research Areas: Systems. Checkpointing strategies; replication strategies; dynamic load balancing; adaptive load management; query languages.
  • 11. Fault Tolerance (for each problem: known approaches, then what S4 does). High Availability: approaches are warm/hot failover and cold failover; S4 uses warm failover with standby nodes + Apache Zookeeper. State Loss (crashes, system updates): approaches are lossy checkpointing and lossless checkpointing; S4 uses lossy checkpointing. Low Latency: the approach is to decouple stream processing from checkpointing; S4 uses asynchronous writes and uncoordinated checkpointing. Approach: checkpoints are count or time based, pluggable backend to support any data store, lazy PE restore, tuning is application dependent. Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
  • 12. Resilience in a Distributed Word Count Task
  • 13. Research Areas: Algorithms. Self-adaptive models: adaptive language models using small amounts of data. Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in RT. Trend detection: find personal Twitter trends relevant to you. Intrusion detection: summarize high level state of the network and detect unusual patterns. Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors.
  • 14. Personalized Search Ads. Goal is to maximize: revenue, click yield, user experience. By controlling: ranking, pricing, filtering, placement. Reference: S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.
  • 15. Personalized Search Ads. Model ad click intent using recent user activity. More likely to click → show more North ads. Example 1: first query is "digital slr camera", next query is "canon slr"; the user is more likely than average to click another ad. Example 2: repeated query without previous clicks; the user is less likely to click another ad.
  • 16. Personalized Search Ads: modeling user session. Typical features: number of searches/clicks by user in the past 24 hrs; user COPC (ratio of observed clicks to predicted clicks); identical query searched before / clicked before; time (seconds) since last search/click; similarity measures (current vs. previous queries). Modeling technique: stochastic gradient-descent boosted trees (GDBT).
  • 17. Personalized Search Ads. Target: P[CLICK|ad,query,user]. Approximation: P[CLICK|ad,query] * ucp[user,session]. The first factor is the non-personalized long-term model, computed using Hadoop. The second factor is the User Click Propensity (UCP) for the user session, computed using S4.
  • 18. Personalized Search Ads. Results: we can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue. Or, for a given ad footprint, we can increase click yield by ~2%.
  • 19. Thank you! Join the Apache S4 project: s4-user-subscribe@incubator.apache.org, s4-dev-subscribe@incubator.apache.org.
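
The example on slide 8 (estimating click-through rate after a bot filter), together with the keyed PE instances described on slide 5, might be sketched as follows. This is a plain-Python simulation in a single process, not S4 code: `Event`, `is_bot`, `CtrPE`, and `run` are hypothetical names introduced for illustration, and the user-agent check is only a toy stand-in for a real bot filter.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # "impression" or "click"
    ad_id: str       # key used to route the event to one PE instance
    user_agent: str  # input to the toy bot filter (an assumption)


def is_bot(event):
    # Toy filter for illustration only; a real filter would use richer signals.
    return "bot" in event.user_agent.lower()


class CtrPE:
    """One instance per key (here: ad_id). The instance holds that key's state."""

    def __init__(self):
        self.impressions = 0
        self.clicks = 0

    def on_event(self, event):
        if event.kind == "impression":
            self.impressions += 1
        elif event.kind == "click":
            self.clicks += 1

    def ctr(self):
        # Avoid division by zero for keys that have seen no impressions.
        return self.clicks / self.impressions if self.impressions else 0.0


def run(events):
    # defaultdict creates a fresh CtrPE the first time a key is seen,
    # loosely mirroring "PE instances are clones of the prototype".
    instances = defaultdict(CtrPE)
    for event in events:
        if is_bot(event):
            continue  # filtered events never reach the counting PE
        instances[event.ad_id].on_event(event)
    return {key: pe.ctr() for key, pe in instances.items()}
```

Calling `run` on a list of `Event` objects returns one click-through rate per `ad_id`, computed only from events that passed the filter. In a distributed setting the key would also decide which node hosts the instance; that routing is omitted here.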
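
The checkpointing approach on slide 11 (count- or time-based checkpoints, a pluggable backend, lazy PE restore) could look roughly like the sketch below. All names (`DictStore`, `CheckpointingCounter`, `every_n_events`, `every_seconds`) are assumptions made for illustration and are not the S4 API. To keep it short, writes here are synchronous, whereas the slide describes asynchronous writes.

```python
import time


class DictStore:
    """Stand-in backend. Any object with save/load could be plugged in."""

    def __init__(self):
        self._data = {}

    def save(self, key, state):
        self._data[key] = dict(state)  # store a copy, not a live reference

    def load(self, key):
        return self._data.get(key)


class CheckpointingCounter:
    """A keyed counter that checkpoints every N events and/or every T seconds."""

    def __init__(self, key, store, every_n_events=None, every_seconds=None,
                 clock=time.monotonic):
        self.key = key
        self.store = store
        self.every_n_events = every_n_events
        self.every_seconds = every_seconds
        self.clock = clock
        self.state = {"count": 0}
        self._restored = False
        self._events_since = 0
        self._last_checkpoint = clock()

    def _restore_once(self):
        # Lazy restore: state is fetched on the first event, not at creation.
        if not self._restored:
            saved = self.store.load(self.key)
            if saved is not None:
                self.state = dict(saved)
            self._restored = True

    def _checkpoint_due(self):
        by_count = (self.every_n_events is not None
                    and self._events_since >= self.every_n_events)
        by_time = (self.every_seconds is not None
                   and self.clock() - self._last_checkpoint >= self.every_seconds)
        return by_count or by_time

    def on_event(self):
        self._restore_once()
        self.state["count"] += 1
        self._events_since += 1
        if self._checkpoint_due():
            self.store.save(self.key, self.state)
            self._events_since = 0
            self._last_checkpoint = self.clock()
```

The scheme is lossy in the sense used on the slide: events processed after the last checkpoint are gone if the instance crashes, so a replacement instance resumes from the last saved count rather than the true one. How often to checkpoint is the application-dependent tuning knob.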
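
The COPC feature from slide 16 and the approximation from slide 17 reduce to a few lines of arithmetic. The sketch below is only that arithmetic: the function names, the neutral value returned when nothing was predicted, and the clipping to [0, 1] are assumptions, and the UCP value itself would come from the session model, which is not reproduced here.

```python
def copc(observed_clicks, predicted_clicks):
    """Ratio of observed clicks to predicted clicks for a user."""
    if predicted_clicks <= 0:
        # Assumption: treat "nothing predicted" as neutral rather than divide by zero.
        return 1.0
    return observed_clicks / predicted_clicks


def personalized_click_prob(base_prob, ucp):
    """Approximate P[CLICK|ad,query,user] as P[CLICK|ad,query] * ucp[user,session].

    base_prob: non-personalized estimate from the long-term model.
    ucp: per-session user click propensity multiplier.
    """
    # Assumption: clip so the product remains a valid probability.
    return min(1.0, max(0.0, base_prob * ucp))
```

A multiplier above 1 raises the non-personalized estimate for users who currently look more likely to click, and a multiplier below 1 lowers it. This matches the split described on slide 17, where the slow-moving factor is computed offline and the session factor is updated in real time.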