Analytics for the Real-Time Web
Maria Grineva
Systems @ ETH Zurich
Real-Time Web

• Web 2.0 + mobile devices = Real-Time Web
• People share what they do now, discuss breaking news on Twitter, share their current locations on Foursquare...
Analytics for the Real-Time Web: New Requirements

• Batch processing (MapReduce) is too slow
• New requirements:
  • real-time processing: aggregate values are computed incrementally, as new data arrives
  • database-intensive: aggregate values are stored in a database that is constantly being updated
Our System: Triggy

• Based on Cassandra, a distributed key-value store
• Provides a programming model similar to MapReduce, adapted to push-style processing
• Extends Cassandra with:
  • push-style procedures, to immediately propagate new data to the computations (sketch below)
  • synchronization, to ensure consistency of aggregate results (counters)
• Easily scalable
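
To make the push-style procedures concrete, here is a minimal sketch in Python; the on_insert/insert names are hypothetical, not Triggy's API, and only illustrate how a procedure registered on a table could fire on every new key-value write, as the slide describes.

```python
# Minimal sketch (hypothetical names, not Triggy's API) of a push-style
# procedure: a callback registered on a table fires on every new key-value
# write, so data is propagated to the computation immediately instead of
# waiting for a batch job.
_procedures = {}            # table name -> list of registered callbacks

def on_insert(table, callback):
    _procedures.setdefault(table, []).append(callback)

def insert(table, key, value):
    # write to the underlying store (omitted), then fire registered procedures
    for callback in _procedures.get(table, []):
        callback(key, value)

on_insert("tweets", lambda key, value: print("pushed to computation:", key))
insert("tweets", "t-0001", {"User_ID": "1042", "Text": "hello"})
```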
Cassandra Overview: Data Model

• Data model: key-value
• Extends the basic key-value model with two levels of nesting
• Super column: used when the second level of nesting is present
• Column family ~ table; key-value pair ~ record (sketch below)
• Keys are stored in sorted order
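
A minimal illustration of the nested data model using plain Python dicts (not the Cassandra client API); the Tweets and User_URLs column families follow the Twitter example in the editor's notes below.

```python
# Illustration of the nested key-value model with plain Python dicts
# (not the Cassandra API). Column family ~ table, key-value pair ~ record.

# One level of nesting: record key -> columns (column name -> value).
tweets = {
    "t-0001": {"User_ID": "1042", "Text": "breaking news"},
    "t-0002": {"User_ID": "1042", "Text": "checking in downtown"},
}

# Two levels of nesting: record key -> super columns -> columns.
# The super column name is a URL, the nested column names are tweet ids.
user_urls = {
    "1042": {
        "http://example.com/story": {"t-0001": b""},   # column values unused
    },
}

# Keys of a column family are kept in sorted order, so range scans by key
# (e.g. the latest N tweet ids of a user) are cheap.
```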
Cassandra Overview: Incremental Scalability

• Incremental scalability requires a mechanism to dynamically partition data over the nodes
• Data is partitioned by key using consistent hashing (sketch below)
• Advantage of consistent hashing: the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected
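
A toy consistent-hash ring sketching the partitioning idea (MD5 positions, key owned by the first node clockwise); this is an illustration, not Cassandra's actual partitioner code.

```python
import bisect
import hashlib

def position(key: str) -> int:
    """Hash a key onto the ring (MD5, as in Cassandra's random partitioning)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: a key belongs to the first node found
    walking clockwise from the key's position."""

    def __init__(self, nodes):
        self._positions = sorted(position(n) for n in nodes)
        self._node_at = {position(n): n for n in nodes}

    def coordinator(self, key: str) -> str:
        pos = position(key)
        i = bisect.bisect_left(self._positions, pos) % len(self._positions)
        return self._node_at[self._positions[i]]

ring = Ring(["node-1", "node-2", "node-3"])
print(ring.coordinator("user:1042"))
# Adding or removing a node only remaps keys on the arc next to that node;
# all other nodes keep the keys they already own.
```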
Cassandra Overview: Log-Structured Storage

• Optimized for write-intensive workloads via log-structured storage (sketch below)
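
A toy log-structured store illustrating the write path (memtable flushed to immutable sorted sstables) and the read path (memtable first, then sstables newest-first); compaction is omitted, and the names are illustrative rather than Cassandra internals.

```python
# Toy log-structured store: writes go to an in-memory buffer (memtable) and
# are periodically flushed as immutable sorted tables (sstables); reads check
# the memtable first, then sstables newest-first. Compaction is omitted.

class LogStructuredStore:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                 # immutable snapshots, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value         # cheap in-memory write
        if len(self.memtable) >= self.memtable_limit:
            # flush: one sequential "disk" write of the sorted buffer
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # most recent version wins
            if key in table:
                return table[key]
        return None
```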
Triggy: Programming Model

• Modified MapReduce to support push-style processing
• Only the reduce function is modified: reduce*
• reduce* incrementally applies a new input value to an already existing aggregate value (sketch below)

  Map(k1, v1) -> list(k2, v2)
  Reduce(k2, list(v2)) -> (k2, v3)
  reduce*(k2, v3, v2) -> (k2, v3')
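
A small sketch of the reduce* contract under these signatures (function names are illustrative, not Triggy's API): only aggregates that can be updated from the previous aggregate plus one new value fit the model; a count or sum qualifies, a median does not.

```python
# Illustrative reduce*: fold one new value into the existing aggregate as soon
# as it arrives. A count or sum fits this contract; a median would still need
# the complete list of values.

def reduce_star_count(key, aggregate, new_value):
    """reduce*(k2, v3, v2) -> (k2, v3'): running count for a key."""
    return key, (aggregate or 0) + new_value

key, aggregate = "user:1042", None
for words_in_tweet in (3, 4, 2):       # values pushed one at a time
    key, aggregate = reduce_star_count(key, aggregate, words_in_tweet)
print(key, aggregate)                  # user:1042 9
```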
Triggy: Programming Model
Triggy: Synchronization

• reduce* executions have to be synchronized for the same key to guarantee correct results
• We make use of Cassandra's partitioning strategy: all intermediate pairs with the same key are routed to the same node
• Synchronization within a node: locks on the keys that are currently being processed (sketch below)
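
A hypothetical sketch of the per-key locking within a node: one lock per in-flight key, so that only one reduce* runs for a given key at a time while different keys proceed in parallel (plain Python threading, not Triggy's implementation).

```python
import threading

# Each key gets its own lock, so at most one reduce* runs for a given key at a
# time while tasks for different keys proceed in parallel.
_table_lock = threading.Lock()     # protects the lock table itself
_key_locks = {}                    # key -> lock for in-flight reduce* tasks
aggregates = {}                    # stands in for the output column family

def _lock_for(key):
    with _table_lock:
        return _key_locks.setdefault(key, threading.Lock())

def run_reduce_star(key, new_value):
    with _lock_for(key):                          # serialize per key
        current = aggregates.get(key, 0)          # read existing aggregate
        aggregates[key] = current + new_value     # read-modify-write is safe
```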
Triggy: Fault Tolerance and Scalability

• No fault-tolerance guarantees
• Intermediate data and data in the queues can be lost
• Triggy is easily scalable because execution and data storage are tightly coupled
• A new node is placed near the most loaded node, and part of that node's data is transferred to it
Experiments

• Generated workload: tweets with user ids (1..100,000) drawn from a uniform distribution
• The load generator issues as many requests as the system with N nodes can handle
• Application: count the number of words posted by each user (sketch below)
  Map: tweet => (user_id, number_of_words_in_tweet)
  Reduce: (user_id, number_of_words_total, number_of_words_in_tweet) => (user_id, number_of_words_total)
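
A runnable sketch of the experiment's application (hypothetical code, not the talk's WordCountMapReducer): map turns a tweet into (user_id, word count) and reduce* adds that count to the user's running total; user ids are drawn uniformly from 1..100,000 as in the generated workload.

```python
import random

def map_tweet(user_id, tweet):
    # Map: tweet => (user_id, number_of_words_in_tweet)
    return user_id, len(tweet.split())

def reduce_star(user_id, total_so_far, words_in_tweet):
    # reduce*: fold the new count into the user's running total
    return user_id, (total_so_far or 0) + words_in_tweet

totals = {}    # output table: user_id -> number_of_words_total

def generate_tweet():
    """Workload generator: user ids drawn uniformly from 1..100,000."""
    return random.randint(1, 100_000), "some generated tweet text"

for _ in range(1000):
    user_id, tweet = generate_tweet()
    uid, count = map_tweet(user_id, tweet)
    _, totals[uid] = reduce_star(uid, totals.get(uid), count)
```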
Similar Systems: Yahoo!'s S4

• Distributed stream processing engine:
  • Programming interface: Processing Elements written in Java
  • Data is routed between Processing Elements by key
  • No database: all processing is done in memory
• Used to estimate click-through rate from user behavior within a time window
Similar Systems: Google's Percolator

• Percolator is database-intensive: based on BigTable
• BigTable:
  • the same data model as in Cassandra
  • the same log-structured storage
  • BigTable is a distributed system with a master; Cassandra is peer-to-peer
• Percolator extends BigTable with:
  • observers (similar to database triggers, for push-style processing)
  • ACID transactions
• Triggy vs. Percolator:
  • MapReduce programming model
  • no ACID transactions (intermediate data can be lost), hence less overhead (what is the real overhead of full transaction support?)
Application: Social Media Optimization for News Sites

• A/B testing for headlines of news stories (sketch below)
• Optimization of the front page to attract more clicks
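
A hypothetical sketch of the headline A/B test: split readers between two headlines for the first five minutes, count clicks incrementally, then keep the winner. In Triggy the click counters would be reduce* aggregates kept in the store; here they are plain dicts.

```python
import time

clicks = {"headline_a": 0, "headline_b": 0}
test_ends = time.time() + 5 * 60          # five-minute test window

def headline_for(reader_id: int) -> str:
    if time.time() < test_ends:
        # during the test, split readers between the two headlines
        return "headline_a" if reader_id % 2 == 0 else "headline_b"
    return max(clicks, key=clicks.get)    # afterwards, serve the winner

def record_click(headline: str) -> None:
    clicks[headline] += 1                 # incrementally updated counter
```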
Application: Real-Time News Recommendations

• TwitterTim.es: news recommendations via Twitter's friend graph
• Now: newspapers are rebuilt every 2 hours; goal: a newspaper that updates in real time
Application: Real-Time Advertising

• Real-time bidding:
  • Sites track your browsing behavior via cookies and sell it to advertising services
  • Web publishers offer up display inventory to advertising services
  • No fixed CPM; instead, each ad impression is sold to the highest bidder
• Retargeting (remarketing):
  • Advertisers can do remarketing after the following events: (1) the user visited your site and left (assume the site is within the Google content network); (2) the user visited your site, added products to their shopping cart, and then left; (3) the user went through the purchase process but stopped somewhere
  • Potentially interesting to extend this with information from social networks
Other Applications

• Recommendations on location check-ins: Foursquare, Facebook Places...
• Social games: monitoring events from millions of users in real time and reacting in real time
What other applications?


Editor's Notes

  2. The Web 2.0 era is characterized by the emergence of large amounts of user-generated content. People started to generate and contribute data on different Web services: blogs, social networks, Wikipedia.

     Today, with the emergence of mobile devices constantly connected to the Internet, the nature of user-generated content has changed. People contribute more often, with smaller posts, and the lifespan of these posts has become shorter.

     New Web services appear that encourage real-time usage:
     1) Twitter: the lifespan of each tweet is shorter than it was for a blog post; the Twitter stream is almost real-time.
     2) Location-based social networks: Foursquare, Facebook Places. People share their current location (or check in) at real venues. This data is real-time sensitive: the user reveals his current location, and recommendations of nearby friends and other interesting places must be made immediately, while the user is there.
  3. So far, analyzing and making use of Web 2.0 data has been accomplished using batch-style processing: data produced over a certain period of time is accumulated and then processed. MapReduce has become the state-of-the-art approach for analytical batch processing of user-generated data.

     Today, Web 2.0 data has become more real-time, and this change implies new requirements for analytical systems. Processing data in batches is too slow for real-time-sensitive data: accumulated data can lose its importance in several hours or even minutes. Therefore, analytical systems must aggregate values in real time, incrementally, as new data arrives. It follows that workloads are database-intensive, because aggregate values are not produced at once, as in batch processing, but are stored in a database that is constantly being updated. For example, Google's new web indexing system, Percolator, is not based on MapReduce anymore: Percolator achieves lower document processing latencies by updating the web index incrementally (database-intensive).
  4. We are working on a system that can process analytical tasks in real time over large amounts of data.

     Our system is based on the Cassandra distributed key-value store. We add two extensions to Cassandra in order to turn it into a system for real-time analytics: push-style procedures and synchronization.

     Push-style procedures act like triggers: you can set one on a table and it fires when a new key-value record is inserted. They make the computation real-time, as they immediately propagate the inserted data to the analytical computations.

     Synchronization: Cassandra is a simple key-value store; there is no mechanism to update a value based on the existing value. For example, to maintain counters, when we need to increment the existing value we first have to query it and then insert a new value. Cassandra has no transactions, which means that between the query and the update another client can also update the value, which leads to inconsistent counters. We add local synchronization into Cassandra, which can synchronize data within a node.

     Furthermore, our system provides a programming model similar to MapReduce, adapted to push-style processing, and is scalable in terms of both computation and data storage.
  5. In a nutshell, the Cassandra data model can be described as follows:

     1) Cassandra is based on a key-value model. A database consists of column families. A column family is a set of key-value pairs. Drawing an analogy with relational databases, you can think of a column family as a table and a key-value pair as a record in a table.

     2) Cassandra extends the basic key-value model with two levels of nesting. At the first level, the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where the key is the name of the column. In other words, a record in a column family has a key and consists of columns. At the second level, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns.

     Let's consider a classical example of a Twitter database to demonstrate these points. The column family Tweets contains records representing tweets. The key of a record is of Time UUID type and is generated when the tweet is received (we will use this feature in the User_Timelines column family below). The record consists of columns (no super columns here); the columns simply represent attributes of tweets, so it is very similar to how one would store them in a relational database.

     The next example is User_Timelines (i.e. tweets posted by a user). Records are keyed by user IDs (referenced by the User_ID columns in the Tweets column family). User_Timelines demonstrates how column names can be used to store values, tweet IDs in this case. The type of the column names is defined as Time UUID, which means that tweet IDs are kept ordered by the time of posting. That is very useful, as we usually want to show the last N tweets of a user. The values of all columns are set to an empty byte array (denoted "-") as they are not used.

     To demonstrate super columns, let us assume that we want to collect statistics about the URLs posted by each user. For that we need to group all the tweets posted by a user by the URLs contained in the tweets. This can be stored using super columns as follows: in User_URLs, the names of the super columns are used to store URLs, and the names of the nested columns are the corresponding tweet IDs.
  6. One of the key features of Cassandra is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes. Cassandra's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts.

     In consistent hashing, the output range of a hash function (normally MD5) is treated as a fixed circular space or ring; that is, the largest hash value wraps around to the smallest hash value.

     Each node in the system is assigned a random value within this space, which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item's position. That node is deemed the coordinator for this key. Thus, each node becomes responsible for the region of the ring between it and the previous node on the ring.

     The principal advantage of consistent hashing is that the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected.

     The problem with using the MD5 hash function for node placement is that the random position assignment of each node on the ring leads to non-uniform load and data distribution. That is why Cassandra analyzes load information on the ring and inserts new nodes near highly loaded nodes, so that an overloaded node can transfer data onto the new node.
  7. Cassandra is optimized for write-intensive workloads, which is a useful feature for us, as computing aggregate values for analytical tasks implies heavy updates to the system.

     Cassandra uses so-called log-structured storage, which was successfully used in BigTable. The idea is that write operations go to a buffer in main memory (the memtable). When the buffer is full, it is written to disk, so the buffer is periodically flushed to disk as an sstable. A separate thread merges the different sstable versions; this process is called compaction.

     A read operation looks up the value first in the memtable and then, if it was not found, in the different sstable versions, moving from the most recent one.

     Such storage is highly optimized for writes and of course makes queries slower, which is always a tradeoff for databases.
  8. MapReduce is a well-established programming model for expressing analytical applications. To support real-time analytical applications, we modify this programming model to support push-style data processing. In particular, we modify the reduce function. Originally, reduce combined a list of input values into a single aggregate value. Our modified function, reduce*, incrementally applies a new input value to an already existing aggregate value. This modification allows a new input value to be applied to the aggregate value as soon as it is produced; in other words, we are able to push new values to the reduce function.

     Figure 1 depicts our modified programming model. reduce* takes as parameters a key, a new value, and the existing aggregate value. It outputs a key-value pair with the same key and the new aggregate value. We did not modify the map function, as it already allows push-style processing. The difference between map and reduce* is that multiple maps can be executed in parallel for the same key, while the execution of reduce* has to be synchronized for the same key to guarantee correct results.

     Note that reduce* exhibits some limitations in comparison to the original reduce: not every reduce function can be converted to its incremental counterpart. For example, to compute the median of a set of values, the previous median and the new value are not enough; the complete set of values needs to be stored to compute the new median.

  9. In order to set up a map/reduce* job, the developer has to provide implementations for both functions and define the input table, from which the data is fed into map, and the output table, to which the output of reduce* is written.

  10. Example: implementation of WordCountMapReducer.
  11. The difference between map and reduce* is that multiple maps can be executed in parallel for the same key, while the execution of reduce* has to be synchronized for the same key to guarantee correct results.

     For that, we extended the nodes of the key-value store with queues and worker threads. Figure 2 shows our extensions. Each node maintains a queue that buffers map and reduce* tasks. Worker threads drain the queues and execute the buffered tasks. Buffering map and reduce* tasks allows the system to handle bursts of input data. Furthermore, the size of the queue gives a rough estimate of the load of a node.

     How to execute map: as described, for each map the developer has to define an input table. Whenever a new key-value pair is written to this table, the node handling this write schedules a new map task by putting it into its local queue. Eventually, a worker thread will execute the map task at this node. Map tasks can be executed in parallel at any node in the system and do not require synchronization, because they do not share any data.

     How to execute reduce*: in contrast to map, the execution of reduce* needs to be synchronized, because several reduce* tasks can potentially update the same aggregate value in parallel, leading to inconsistent data. Cassandra does not provide any synchronization mechanisms. In our system, synchronization is realized in two steps: (1) routing all key-value pairs output by map with the same key to a single node, and (2) synchronizing the execution of reduce* within a node using locks. Routing is implemented by reusing Cassandra's partitioning strategy (consistent hashing): each key-value pair output by map is routed to the node that is primarily responsible for the respective key. At the receiver node, a new reduce* task is submitted to the queue. Multiple worker threads execute these reduce* tasks by reading and incrementing the latest aggregate value. Worker threads are synchronized such that only one worker executes a reduce* task for a given key; for that, we use a lock table that contains the keys being processed by each worker. The output of the reduce* task is written to the table specified in the reduce definition. The table may be replicated to achieve reliability. By writing the result, the node might fire a subsequent map/reduce* task. The result of reduce* can be queried using the key-value store's standard query interface.

     The figure shows the execution of map and reduce* inside our system. Two key-value pairs (k1, v1) and (k1, v2) are written to nodes N1 and N5 of the key-value store. These writes fire map tasks defined on the updated table. Therefore, receiver node N1 puts a map task for pair (k1, v1) into its queue (denoted by m in Figure 2). Similarly, node N5 puts a map task for pair (k1, v2) into its queue. The execution of the map tasks results in three intermediate key-value pairs. Determined by Cassandra's partitioning strategy, the intermediate pair with key k2 is routed to node N2, while the pairs with key k3 are routed to node N3. Nodes N2 and N3 put reduce* tasks into their respective queues (denoted by r*). As described, reduce* tasks are executed locally using locks. New aggregate values are computed and stored into the result table.
  12. Our implementation does not provide fault-tolerance guarantees for the execution of map/reduce* tasks. If the node responsible for executing a map fails while the map task is still in the queue, the map task will never be executed. Also, our synchronization mechanism requires intermediate key-value pairs to be routed to a single node; these intermediate pairs might be lost in case of failures. Nevertheless, once a map/reduce* task has been executed successfully, the results are stored reliably at a number of replica nodes. Thus, only intermediate data can be lost.

     There are a number of reasons for this design decision. First, for many analytical applications losing intermediate data is not critical; for such applications it is more important to see a general trend than exact numbers. Second, only those map/reduce* tasks that are waiting in the queue at the moment a node fails can be lost. If there is no burst of input data, queues are usually empty, so losing intermediate data happens rarely. Third, the execution of map and reduce* tasks is distributed across all nodes of the system, so only a portion of the intermediate data will be lost if a single node fails.

     In order to provide stronger consistency guarantees in case of node failures, we would have to provide exactly-once semantics. Relatively lightweight methods that provide at-least-once semantics are not suitable, as repeated executions invalidate aggregate values. Providing exactly-once semantics requires additional storage and computation overhead and is argued to be too expensive and hard to scale.

     Scalability: in our system, the execution of map and reduce* is distributed across the nodes according to the data partitioning strategy of the key-value store. This allows the system to scale easily, as execution and data storage are tightly coupled. By default, Cassandra provides a mechanism for scaling the data storage: any new node is placed near the most loaded node of the system, and parts of the data from the loaded node are transferred to the new node, thus shedding load between the nodes. We extended Cassandra's load measurement formula to include execution load as well. As in the SEDA architecture, we use the length of the queue to measure execution load. It is a good criterion because it reflects any bottleneck at a node, such as CPU overload or network saturation.
  14. Yahoo! recently open-sourced S4, a system that is close to ours. The differences:

     1) Triggy has the MapReduce programming model, which many developers are familiar with; the programming model of S4 is more general.

     2) Our system is tightly coupled with the database, while S4 processes tasks in memory. Why we think a database-intensive solution is important:

     a) With Triggy, you don't have to worry about the window. You can compute analytics using historical data, within a window, without a window, or with windows of different sizes for different parameters. For example, when monitoring users' browsing behavior via cookies for advertising, some users show enough interest in a certain ad within a short time period, while for other users you may want to monitor and wait much longer.

     b) Triggy is easily scalable. You don't have to scale the computation separately from the database; the tightly coupled solution allows scaling the system with a single knob.
  16. News sites use real-time analytics to optimize their sites and attract more readers.

     1) A/B testing for headlines of news stories. When a news story is first published on the site, there are two different headlines for it. For the first 5 minutes, part of the readers get one headline while the other part gets the other headline. Then the headline that attracted more clicks during those first 5 minutes is chosen.

     2) Optimizing the news layout. The system analyses clicks, likes, and retweets to understand which news stories spark discussions in social media, then puts the most discussed news on the front page to attract even more readers.
  17. The Twitter Tim.es is a personalized news service: http://twittertim.es. It uses your friend relationships on Twitter to recommend news for you.

     Currently, The Twitter Tim.es newspapers are rebuilt every 2 hours (batch processing). It would be nice to have push-style processing, where a news story appears in the newspaper as soon as it is published on Twitter.
  19. What is real-time bidding? Here's the basic gist:

     1) Sites across the web track your browsing behavior via cookies and sell basic data about you to ad service companies. For example, the Google Content Network covers 80% of internet users.

     2) Web publishers offer up display inventory to the RTB market through ad services; rather than signing up for a fixed CPM, they sell each individual ad impression to the highest bidder, based on whom that individual ad is being served to. For example, a retailer agrees to run a display ad campaign for a shoe sale at $5 per 1,000 impressions; that retailer, however, can specify that they will pay $10 per 1,000 impressions for ads that include running shoes if they know that a browser has previously visited the athletics section of their web site.

     The real-time bidding auction happens during the milliseconds while the page is loading. Advertisers have to run their algorithms to decide which ad to show and at what price within this time.

     Google retargeting (or remarketing): a travel company has a site where it features holiday vacations. Users may come to this website, browse the offers, and think about booking a trip, but decide that the deal is still not cheap enough; then they continue to browse the web. If the travel company later decides to offer discounted deals to the Caribbean, it can target the users that already visited its site (interested users) via display ads that these users will see later on other sites.

     Advertisers can do remarketing after the following events: (1) the user visited your site and left (assume the site is within the Google content network); (2) the user visited your site, added products to their shopping cart, and then left; (3) the user went through the purchase process but stopped somewhere; etc.

     These events can be extended with information from social networks. For example, the system could track what the user is posting on Twitter and estimate their interest in different products that can be advertised later.

     You can then pay per click for these people as they search and browse the web (ads will be shown in the search or content network). For retargeting you need to aggregate information about a user in a database; a window approach is not applicable here, because there is no single time frame.