Data Stream Management
Authors
Lukasz Golab & M. Tamer Özsu
Supervised by
Dr. Sakti Pramanik
Presented by
AKM Tauhidul Islam
Outline
• Introduction
o Motivation
o Problem Statement
o Definitions
• Data Stream Management System (DSMS)
• Streaming Data Warehouse (SDW)
• Discussion
Introduction
• Stream data - Produced incrementally over time, rather than
being available in full before its processing begins
• Examples:
o Transaction data streams and log streams: credit card purchases, telecommunications, Web accesses, climate data, GPS tracking, sensor networks, IP networks
• Applications:
o Sensor Networks - E.g. TinyDB
o Network Traffic Analysis - E.g. traffic statistics and critical condition detection
o Financial Tickers - On-line analysis of stock prices to discover correlations and identify trends
o Transaction Log Analysis - E.g. Web click streams and telephone calls
Motivation
• Massive data sets:
o Huge numbers of users, e.g.,
• AT&T long-distance: ~ 300M calls/day
• AT&T IP backbone: ~ 10B IP flows/day
o Highly detailed measurements, e.g.,
• NOAA: satellite-based measurements of earth geodetics
o Huge number of measurement points, e.g.,
• Sensor networks with huge number of sensors
• Near real-time analysis
o ISP: controlling service levels
o NOAA: tornado detection using weather radar
o Hospital: Patient monitoring
• Traditional data feeds
o Simple queries (e.g., value lookup) needed in real-time
o Complex queries (e.g., trend analyses) performed off-line
Problem Statement
                     DBMS                  DSMS
Data                 Persistent relations  Streams, time windows
Data Access          Random                Sequential, one-pass
Updates              Arbitrary             Append only
Update Rates         Relatively low        High, bursty
Processing Model     Query driven          Data driven
Queries              One-time              Continuous
Query Plans          Fixed                 Adaptive
Query Optimizations  One query             Multi-query
Query Answers        Exact                 Exact or approximate
Latency              Relatively high       Low

                     Data Warehouse        SDW
Data                 Historical            Recent and historical
Update Frequency     Low                   High
Update Propagation   Synchronous           Asynchronous
ETL Process          Complex               Fast, lightweight

Fig : Comparison of Data Stream Management Systems and Streaming Data Warehouses with traditional database and warehouse systems
Definitions
• Non-blocking Execution : Query operator Q doesn’t require its entire input before producing results
• Monotonicity : All previous results are preserved as new data arrives
o Q(τ) ⊆ Q(τ′) for query operator Q, whenever τ <= τ′
o Q is monotonic only if it is non-blocking
• Delta : For queries that don’t hold the monotonicity property, the result is updated at time τ by producing positive / negative deltas
• Punctuation : Special tuple containing a predicate that is guaranteed to be satisfied by the remainder of the data stream
• Heartbeat : Punctuation that governs the timestamps of future tuples
• Average slowdown = Tuple response time / shortest processing time
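The monotonicity definition above can be checked on a small example. The sketch below is illustrative only: the stream, the predicate (value > 10) and the window length are made-up assumptions, and the helper names are hypothetical. It evaluates a query on every prefix of a stream and tests whether each result contains the previous one.

```python
def selection(prefix):
    """Selection value > 10: results are only ever added (monotonic)."""
    return {t for t in prefix if t[1] > 10}

def window_contents(prefix, w=2):
    """Tuples from the last w time units: old results expire (non-monotonic)."""
    if not prefix:
        return set()
    now = prefix[-1][0]
    return {t for t in prefix if t[0] > now - w}

def is_monotonic_on(query, stream):
    """Check Q(tau) is a subset of Q(tau') for consecutive prefixes of the stream."""
    results = [query(stream[:i]) for i in range(len(stream) + 1)]
    return all(a <= b for a, b in zip(results, results[1:]))

# (timestamp, value) pairs, ordered by timestamp
stream = [(1, 5), (2, 20), (3, 12), (4, 7)]
print(is_monotonic_on(selection, stream))        # selection never retracts results
print(is_monotonic_on(window_contents, stream))  # window drops expired tuples
```

Checking consecutive prefixes is enough here because the subset relation is transitive.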
Outline
• Introduction
• Data Stream Management System (DSMS)
o Stream Data Models
o Query Language & Semantics
o Query Processing
o Query Optimization
• Streaming Data Warehouse (SDW)
• Discussion
DSMS
• Input Buffer/Monitor
o Captures streaming inputs
o May collect statistics on streams
o Random sampling
• Working storage
o Stores recent stream data
o Used for query processing
• Local Storage
o Used for metadata
o Foreign key mapping
o Naming translation
• Query Processor
o Convert queries into execution plans
o Change plans for different workloads /
input rates
o Contains buffers, operator queues
o Deploys scheduling methods
• Continuous Query Repository
• Results
o May be streamed to users or to other applications
o May be stored in an SDW for further analysis
Fig : i) Abstract reference architecture of a DSMS & ii) A traditional DBMS
Stream Data Models
• Base Streams – Produced by sources, append only
• Derived streams – produced by continuous queries
• Streams have fixed schema
o <timestamp, source IP Addr, source port, destination IP Addr, destination port, size>
• Data Stream Models
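o Describe underlying signals S : [1 ... N] -> R
o Aggregate model – Each item carries the range value for one domain value of the signal
o Cash Register model – Each item carries a partial, non-negative range value (increments only)
o Turnstile model – Each item carries a partial range value, which may be negative
o Reset model – Each item carries a range value that resets the previous value of the signal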
• Stream Windows – important to user and query points of view
o Fixed window
o Sliding window
o Landmark window
o Jumping window – updated every k ticks or k arrivals
o Tumbling window – a jumping window where k equals the window size
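The data stream models listed above differ in how a stream item (j, v) changes the underlying signal S[j]. A minimal sketch, assuming the signal is kept as a Python dictionary and that items arrive as (j, v) pairs; the function names are hypothetical and the aggregate model is simplified to a plain assignment:

```python
def apply_item(signal, j, v, model):
    """Update signal[j] with stream item (j, v) under the given model."""
    if model == "aggregate":
        # The item already carries the range value for j.
        signal[j] = v
    elif model == "cash_register":
        # Partial, non-negative range values: S[j] can only grow.
        if v < 0:
            raise ValueError("cash register items must be non-negative")
        signal[j] = signal.get(j, 0) + v
    elif model == "turnstile":
        # Partial range values that may be negative: S[j] can grow or shrink.
        signal[j] = signal.get(j, 0) + v
    elif model == "reset":
        # The new range value overwrites the previous one.
        signal[j] = v
    else:
        raise ValueError(f"unknown model: {model}")
    return signal

def replay(items, model):
    """Apply a sequence of (j, v) items to an initially empty signal."""
    signal = {}
    for j, v in items:
        apply_item(signal, j, v, model)
    return signal

print(replay([(1, 5), (1, 3)], "cash_register"))
print(replay([(1, 5), (1, -3)], "turnstile"))
print(replay([(1, 5), (1, 3)], "reset"))
```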
Query Language & Semantics
• Query Algebra
o Stream-to-stream
o Mixed Algebra
• Query Operators – Similar syntax to DBMS, very different semantics
• Relation-like query operators
o Selection, projection, union – stateless operators
o Join – window joins
o Aggregate operators
• DSMS-specific operators
o Buffered sort operator
o Random sampling operator
o User defined aggregate functions (UDAF)
• Query Languages
o GSQL
o CQL
o ESL
Query Operators
• Selections and (duplicate-preserving) projections are straightforward
o Local, per-element operators
o Duplicate-eliminating projection is like grouping
o Projection needs to include the ordering attribute
o No restriction for position-ordered streams
• Aggregate expressions:
o distributive: sum, count, min, max
o algebraic: average
o holistic: count-distinct, median
Fig: Simple continuous query operators: i) Selection, ii) Count, iii) Negation
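The three aggregate classes above differ in what must be kept per sub-window in order to combine partial results. A small sketch with made-up data (the two partitions and the helper names are assumptions for illustration): sum merges from partial sums, average merges from (sum, count) pairs, and median cannot be merged from partial medians.

```python
from statistics import median

def merge_sum(partial_sums):
    """Distributive: the aggregate of the partial aggregates is the answer."""
    return sum(partial_sums)

def merge_avg(partials):
    """Algebraic: computed from a fixed-size summary (sum, count) per partition."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

part_a = [4, 8, 6]
part_b = [10, 2, 1]

exact_sum = merge_sum([sum(part_a), sum(part_b)])
exact_avg = merge_avg([(sum(part_a), len(part_a)), (sum(part_b), len(part_b))])

# Holistic: the exact median needs all values, not per-partition medians.
exact_median = median(part_a + part_b)
median_of_medians = median([median(part_a), median(part_b)])

print(exact_sum, exact_avg)
print(exact_median, median_of_medians)  # the two values differ
```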
Query Operators
• Join operators are problematic on streams
o May need to join stream tuples that arrive arbitrarily far apart
o Operate on implicit / explicit windows
• SELECT * FROM S1, S2
WHERE S1.attr = S2.attr
GROUP BY S1.timestamp/60 AS minute
• SELECT * FROM S1, S2
WHERE S1.attr = S2.attr
AND |S1.timestamp - S2.timestamp| <= w
• SELECT * FROM S1 [RANGE w], S2 [RANGE w]
WHERE S1.attr = S2.attr
Fig: Simple continuous query operators: i) Join, ii) Sliding window join with state
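A sliding window join keeps per-stream state and probes the opposite window on each arrival. The following is a minimal sketch, not the implementation of any particular DSMS: it assumes a time-based window of length w, an equality predicate on a single attribute, and timestamps that are non-decreasing across both streams. The class and method names are hypothetical.

```python
from collections import deque

class WindowJoin:
    """Symmetric time-based sliding-window equi-join of streams S1 and S2."""

    def __init__(self, w):
        self.w = w
        # Per-stream state: (timestamp, attr) pairs in arrival order.
        self.state = {"S1": deque(), "S2": deque()}

    def insert(self, stream, ts, attr):
        """Process one arrival and return the join results it produces."""
        other = "S2" if stream == "S1" else "S1"
        window = self.state[other]
        # Expire tuples of the opposite stream that fell out of the window.
        while window and window[0][0] < ts - self.w:
            window.popleft()
        # Probe the opposite window; results are (S1 timestamp, S2 timestamp, attr).
        results = []
        for o_ts, o_attr in window:
            if o_attr == attr:
                pair = (ts, o_ts) if stream == "S1" else (o_ts, ts)
                results.append((pair[0], pair[1], attr))
        # Store the new tuple so later arrivals on the other stream can match it.
        self.state[stream].append((ts, attr))
        return results

join = WindowJoin(w=5)
print(join.insert("S1", 1, "a"))
print(join.insert("S2", 3, "a"))
print(join.insert("S2", 10, "a"))
print(join.insert("S1", 12, "a"))
```

In this sketch a stream's own state is only cleaned when a tuple arrives on the other stream, so expiration is lazy; an eager variant would trade more expiration work for less memory.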
Query Processing
• Declarative queries -> Logical query plan -> Physical plan
o Directed Acyclic Graphs (nodes -> operators, edges -> data flow)
• Queries sharing memory/streams are combined into a single plan
Fig: a) Query plan for two queries: i) a join of streams S1 and S2 with a selection predicate on S1, and ii) an aggregate on S2. b) A continuous query with selection and tumbling window aggregation:
SELECT minute, SUM(size) FROM s
WHERE destination_port <= 80
GROUP BY timestamp/60 AS minute
• Scheduling
o FIFO, Round Robin – simple, not efficient
o Scheduling operators with higher throughput first – low latency
o Scheduling operators with minimum processing time and selectivity first – smaller queues
• Heartbeats & Punctuations
o Typically issued by sources
o Reduce the amount of state needed by operators
o Prevent operators from doing unnecessary work
o Query plans can also issue heartbeats to avoid pipeline stalls and delayed results
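The per-minute SUM(size) query on this slide can be evaluated without blocking because timestamps only move forward: a window is closed as soon as a later timestamp is seen, whether it comes from a tuple or from a heartbeat. A minimal sketch under stated assumptions: the event encoding ("tuple" / "heartbeat" records) and the function name are hypothetical, and a heartbeat with timestamp ts is taken to promise that no later event has a smaller timestamp.

```python
def per_minute_bytes(events):
    """Yield (minute, total_size) for each closed one-minute tumbling window.

    events are ("tuple", ts, destination_port, size) or ("heartbeat", ts),
    ordered by timestamp (seconds).
    """
    current_minute = None  # minute of the open window, if any tuple qualified
    total = 0
    for event in events:
        kind, ts = event[0], event[1]
        minute = ts // 60
        # Any event from a later minute proves the open window is complete.
        if current_minute is not None and minute > current_minute:
            yield (current_minute, total)
            current_minute, total = None, 0
        if kind == "tuple":
            _, _, port, size = event
            if port <= 80:  # selection predicate from the query
                if current_minute is None:
                    current_minute = minute
                total += size

events = [
    ("tuple", 5, 80, 100),
    ("tuple", 30, 443, 999),   # filtered out by the selection
    ("tuple", 50, 80, 50),
    ("heartbeat", 60),         # closes minute 0 even though no tuple arrived
    ("tuple", 130, 22, 10),    # minute 2 stays open until time advances
]
print(list(per_minute_bytes(events)))
```

Without the heartbeat, the result for minute 0 would be delayed until the tuple at second 130 arrives, which is the pipeline stall the slide refers to.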
Query Processing Cont..
• Queries as views & Negative tuples
o Negative tuples are implemented by a sign on explicit windows
o Explicit windows are time-based or count-based
o Generated negative tuples are processed by cascading operators
o Negative tuples on aggregate operators
• Count – easy to compute
• Max/Min – memory intensive
o Twice as many tuples must be processed
• Avoidable for monotonic operators
• Alternative: tag tuples with expiration times
• Operators for which this suffices are known as weakly non-monotonic
Fig: a) Maintaining a view over a sliding window join using negative tuples b) Finding the maximum element in a sliding window
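The contrast between count and max under negative tuples can be sketched as follows. This is an illustrative sketch with hypothetical names, assuming each arriving tuple is signed +1 and its expiration is signalled by the same value signed -1: the count needs a single integer, while the maximum needs the multiset of live values so it can recover when the current maximum expires.

```python
from collections import Counter

class WindowAggregates:
    """Maintains COUNT and MAX over a window driven by signed tuples."""

    def __init__(self):
        self.count = 0           # COUNT: one integer of state
        self.values = Counter()  # MAX: multiset of all live values

    def process(self, value, sign):
        """sign is +1 for an arriving tuple, -1 for its negative tuple."""
        self.count += sign
        self.values[value] += sign
        if self.values[value] == 0:
            del self.values[value]

    def current_max(self):
        return max(self.values) if self.values else None

agg = WindowAggregates()
for v in (5, 9, 3):
    agg.process(v, +1)
print(agg.count, agg.current_max())
agg.process(9, -1)  # the maximum expires; the next largest live value takes over
print(agg.count, agg.current_max())
```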
Query Optimization
• Finds efficient query plans
• A DBMS focuses on minimizing I/O, while a DSMS tries to reduce cost per unit time
• Static Analysis and Query Rewriting
o Ensures a query can be evaluated in a non-blocking fashion with limited memory
• S(A,B,C), T(D,E)
• π_A(σ_{A=D ∧ A>10 ∧ D<20}(S × T)): Yes
• π_A(σ_{A=D}(S × T)): No
• π_A(σ_{B<D ∧ A>10 ∧ D<20}(S × T)): Yes, with duplicate elimination
o Common Rules
• Evaluate inexpensive predicates before complex ones
• Perform selections before joins
o Rules for continuous query operators only
• Selections and explicit time-based windows commute
• Selections and explicit count-based windows don’t commute
o Rewrites based on input constraints
• A join of unbounded streams can be evaluated as a sliding-window join if matching tuples arrive at most t time units apart
• Multi Query Optimization
Fig : Separate and shared query plans for Q1 and Q2
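The rule that selections commute with time-based windows but not with count-based windows can be seen on a small example. The stream, the predicate and the window sizes below are made-up assumptions, and the helper names are hypothetical.

```python
def time_window(stream, now, w):
    """Tuples with timestamp in (now - w, now]."""
    return [t for t in stream if now - w < t[0] <= now]

def count_window(stream, n):
    """The n most recent tuples."""
    return stream[-n:]

def select(stream, pred):
    return [t for t in stream if pred(t)]

# (timestamp, value) pairs in arrival order
stream = [(1, 5), (2, 50), (3, 7), (4, 60), (5, 70)]

def big(t):
    return t[1] > 10

# Time-based window: the order of selection and windowing does not matter.
time_a = select(time_window(stream, now=5, w=3), big)
time_b = time_window(select(stream, big), now=5, w=3)

# Count-based window: selecting first changes which tuples are "the last 3".
count_a = select(count_window(stream, 3), big)
count_b = count_window(select(stream, big), 3)

print(time_a == time_b)
print(count_a, count_b)
```

Pushing the selection below a count-based window lets an old tuple, here (2, 50), re-enter the window, which is why the rewrite is only safe for time-based windows.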
Operator Optimization
• Join
o Need to remove expired tuples
o Expiration at every time tick is costly
o Periodic removal reduces expiration cost but increases join processing cost
o Probe streams with fewer matches
• Aggregation
o Synopses allow efficient re-computation
o Prefix synopses
• Suitable for subtractable aggregates
• E.g. Sum, Count
o Interval synopses
• Suitable for distributive aggregates
• E.g. Min, Max
• Need to access log b intervals
• Basic interval synopses require b accesses
o Holistic aggregates require additional information in synopses
o Algebraic aggregates are computed from derived information
• Avg = Sum / Count
Fig : i) Prefix synopses, ii) Interval synopses, iii) Basic interval synopses
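A prefix synopsis for a subtractable aggregate can be sketched with running sums: the aggregate over any range of intervals is the difference of two stored prefixes. For comparison, a basic interval synopsis for max has to look at every interval in the range. This is a minimal sketch with made-up data and hypothetical function names; it does not show the log b interval structure.

```python
from itertools import accumulate

def build_prefix_sums(interval_sums):
    """prefix[i] holds the sum of intervals 0..i-1, so prefix[0] == 0."""
    return [0] + list(accumulate(interval_sums))

def window_sum(prefix, first, last):
    """Sum over intervals first..last (0-based, inclusive) using two lookups."""
    return prefix[last + 1] - prefix[first]

def window_max_basic(interval_maxes, first, last):
    """Basic interval synopsis: max is not subtractable, so scan each interval."""
    return max(interval_maxes[first:last + 1])

interval_sums = [3, 1, 4, 1, 5]
prefix = build_prefix_sums(interval_sums)
print(window_sum(prefix, 1, 3))

interval_maxes = [3, 1, 4, 1, 5]
print(window_max_basic(interval_maxes, 1, 3))
```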
Query Optimization
• Load Shedding & Approximation
o Random sampling
o Semantic load shedding drops less important tuples
o Objective is to minimize the drop in accuracy
• Challenging for complex query plans with multiple streams and operators
• Load Balancing
o Write part of the stream to disk, if possible
• Adaptive Query Optimization
o Query cost per unit time may change
o Query plan is dynamically re-ordered based on speed, selectivity and queue length
o Trade-off between resulting adaptivity and overhead of dynamic routing
• Distributed Query Optimization
o Parallelizing and distributing the system itself
• Split query plan across nodes
• Partition the streams
o Shifting partial computation to the sources
• In-network processing reduces the communication overhead
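Random load shedding, listed at the top of this slide, can be sketched as follows. The sketch assumes each tuple is kept independently with probability keep_prob and that a SUM is approximated by scaling the sum of the kept tuples by 1 / keep_prob; the function name and the fixed seed are illustrative assumptions.

```python
import random

def shed_and_estimate_sum(values, keep_prob, seed=0):
    """Drop tuples at random and return (kept tuples, scaled SUM estimate).

    keep_prob must be in (0, 1]. Scaling by 1 / keep_prob makes the
    estimate unbiased, at the cost of some variance.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    kept = [v for v in values if rng.random() < keep_prob]
    estimate = sum(kept) / keep_prob
    return kept, estimate

values = list(range(100))
kept, estimate = shed_and_estimate_sum(values, keep_prob=0.5)
print(len(kept), estimate, sum(values))
```

Semantic load shedding would replace the random test with a rule that prefers to drop the tuples that matter least to the query's answer.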
Outline
• Introduction
• Data Stream Management System (DSMS)
• Streaming Data Warehouse (SDW)
o Data ETL
o Update Propagation
o Data Expiration
o Update Scheduling
o Query Processing on SDW
• Discussion
SDW
• Data streams/feeds arrive periodically
• ETL process - data cleaning, standardization and so on
• Table types
o Base tables – Sourced directly from raw files
o Derived tables – Materialized view over base or other derived table
• Update scheduler selects the order in which updates are applied
o Based on dependencies and workloads
Fig : Abstract reference architecture of an SDW
ETL
• Simple tasks – decompression, standardization
• Complex tasks
o Joining new data with relations that hold descriptive attributes
• Relations R are disk-based
• New data is buffered in main memory
• Mesh Join: access blocks of R in sequential order; remove a tuple from the buffer once it has been joined with all blocks of R
o Loading data into tables
• Tables are partitioned into timestamp ranges
• Loads affect a small number of recent partitions
Fig : Partitioning a table on a timestamp attribute
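The Mesh Join idea on this slide can be sketched in a simplified form. The sketch below is an assumption-laden illustration rather than the published algorithm: R is modelled as a list of in-memory blocks standing in for disk blocks, one block is read per step in cyclic order, one stream batch is admitted per step, and every buffered tuple leaves once it has met all blocks. All names are hypothetical.

```python
from collections import deque

def mesh_join(stream_batches, r_blocks, s_key, r_key):
    """Join stream tuples (dicts) with relation R, stored as a list of blocks."""
    n_blocks = len(r_blocks)  # assumed to be at least 1
    buffer = deque()          # entries: [stream_tuple, blocks_seen]
    results = []
    next_block = 0
    batches = iter(stream_batches)
    more_input = True
    while more_input or buffer:
        if more_input:
            batch = next(batches, None)
            if batch is None:
                more_input = False
            else:
                buffer.extend([s, 0] for s in batch)
        if not buffer:
            continue
        # Read the next block of R in sequential (cyclic) order.
        block = r_blocks[next_block]
        next_block = (next_block + 1) % n_blocks
        index = {}
        for r in block:
            index.setdefault(r[r_key], []).append(r)
        # Join the block with every buffered stream tuple.
        for entry in buffer:
            s = entry[0]
            for r in index.get(s[s_key], []):
                results.append((s, r))
            entry[1] += 1
        # The oldest tuples have seen the most blocks; evict the finished ones.
        while buffer and buffer[0][1] == n_blocks:
            buffer.popleft()
    return results

r_blocks = [[{"id": 1, "name": "a"}], [{"id": 2, "name": "b"}]]
batches = [[{"id": 2, "v": 10}], [{"id": 1, "v": 20}]]
for s, r in mesh_join(batches, r_blocks, "id", "id"):
    print(s["v"], r["name"])
```

Each read of a block is shared by all buffered stream tuples, which is what amortizes the cost of scanning the disk-based relation.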
Update Propagation
• Goals
o Propagate changes across layers of derived tables
o Avoid recomputing an entire derived table
o Efficiently identify partition dependencies
• Partition dependencies may not be obvious from the SQL specification
Fig : Updating a partitioned derived table
Fig : Partition dependency
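Once partition dependencies are known, update propagation only needs to touch the derived partitions that depend on a changed base partition. A minimal sketch, assuming the dependencies are available as an explicit map (in practice they may have to be inferred from the view definition); the partition names and the function name are hypothetical.

```python
def partitions_to_refresh(dependencies, updated_base):
    """Return the derived partitions that depend on any updated base partition.

    dependencies maps a derived partition to the base partitions it reads.
    """
    updated = set(updated_base)
    return sorted(d for d, bases in dependencies.items() if updated & set(bases))

# Hypothetical layout: D2 overlaps two base partitions, as a windowed view might.
dependencies = {
    "D1": ["B1", "B2"],
    "D2": ["B2", "B3"],
    "D3": ["B4"],
}
print(partitions_to_refresh(dependencies, ["B3"]))
print(partitions_to_refresh(dependencies, ["B2"]))
```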
Data Expiration
• Tuples may have variable lifetime
• Tables can be partitioned on insertion and expiration timestamps
o Partitions may not have equal size
• One solution is to assign updates in round robin fashion
Fig : Partitioning a table on two attributes: insertion and expiration timestamp
Update Scheduling
• External sources push new data
• There are many data feeds and derived tables
• A scheduler controls resource usage
• Goal: minimize data staleness
• Uses a priority-weighted staleness metric to select the tables whose update reduces it most
Fig : Plot of the staleness of an SDW table over time
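A greedy scheduler driven by priority-weighted staleness can be sketched as follows. The field names, the staleness definition (newest pending data minus newest loaded data) and the division by update cost are assumptions made for illustration; the slide only states that the scheduler picks the tables whose update reduces priority-weighted staleness the most.

```python
def pick_next_update(tables):
    """Return the name of the table whose update pays off most, or None.

    Each table is a dict with:
      priority      - weight of the table
      freshness     - timestamp of the newest data already loaded
      pending_until - timestamp of the newest data waiting to be loaded
      cost          - estimated time to apply the pending update
    """
    def benefit(t):
        staleness_removed = t["pending_until"] - t["freshness"]
        return t["priority"] * staleness_removed / t["cost"]

    candidates = [t for t in tables if t["pending_until"] > t["freshness"]]
    if not candidates:
        return None
    return max(candidates, key=benefit)["name"]

tables = [
    {"name": "flows", "priority": 1, "freshness": 100, "pending_until": 160, "cost": 2.0},
    {"name": "alerts", "priority": 5, "freshness": 150, "pending_until": 160, "cost": 1.0},
    {"name": "idle", "priority": 9, "freshness": 160, "pending_until": 160, "cost": 1.0},
]
print(pick_next_update(tables))
```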
Query Processing
• Overhead of partitioned tables
o Partitions that are too small are difficult to manage
o Partitions that are too big need to be recomputed as new data arrives
o Solution : Bigger partitions as data becomes old
• Data Availability and Concurrency control
o Tables are updated frequently
o Queries should not be blocked and should see consistent data
o Solution : Multi-version concurrency control at partition level
Discussion
• End-to-end data stream management
• A DSMS allows relational-like queries as well as pattern matching and event processing queries
• Query semantics differ from traditional ones
• SDW research problems were introduced recently
• Not covered in the lecture: data mining techniques, fault tolerance and distributed processing
References
1. Data Stream Management, Lukasz Golab & M. Tamer Özsu
2. Data stream management system – introduction, concepts and issues. Morton Lindeberg, University of Oslo

Data Stream Management

  • 1. Data Stream Management Authors Lukasz Golab & M. Tamer Özsu Supervised by Dr. Sakti Pramanik Presented by AKM Tauhidul Islam
  • 2. Outline • Introduction o Motivation o Problem Statement o Definitions • Data Stream Management System (DSMS) • Streaming Data Warehouse (SDW) • Discussion
  • 3. Introduction • Stream data - Produced incrementally over time, rather than being available in full before its processing begins • Examples: • Applications: o Sensor Networks - E.g. TinyDB o Network Traffic Analysis - E.g. Traffic statistics and critical condition detection. o Financial Tickers - On-line analysis of stock prices, discover correlations, identify trends. o Transaction Log Analysis - E.g. Web click streams and telephone calls Transaction data streams Log Streams Credit card purchases, Telecommunications, Web Accesses Climate Data GPS tracking Sensor networks IP networks
  • 4. Motivation • Massive data sets: o Huge numbers of users, e.g., • AT&T long-distance: ~ 300M calls/day • AT&T IP backbone: ~ 10B IP flows/day o Highly detailed measurements, e.g., • NOAA: satellite-based measurements of earth geodetics o Huge number of measurement points, e.g., • Sensor networks with huge number of sensors • Near real-time analysis o ISP: controlling service levels o NOAA: tornado detection using weather radar o Hospital: Patient monitoring • Traditional data feeds o Simple queries (e.g., value lookup) needed in real-time o Complex queries (e.g., trend analyses) performed off-line
  • 5. Problem Statement DBMS DSMS Data Persistent Relations Streams, time windows Data Access Random Sequential, One-pass Updates Arbitrary Append Only Update Rates Relatively Low High, bursty Processing Model Query Driven Data driven Queries One time Continuous Query Plans Fixed Adaptive Query Optimizations One Query Multi-query Query Answers Exact Exact or Approximate Latency Relatively High Low Data Warehouse SDW Data Historical Recent and Historical Update Frequency Low High Update Propagation Synchronous Asynchronous ETL Process Complex Fast, Light- weight Fig : Comparison of Data Stream Management Systems and Streaming Data Warehouses with traditional database and warehouse systems
  • 6. Definitions • Non-blocking Execution : Query operator Q doesn’t require entire input • Monotonicity : All previous results preserved o Q(т) € Q(т’), for query operator Q, where т <= т’ o Q is monotonic only if non-blocking • Delta : Doesn’t hold monotonicity property , produce update result at time т, negative / Positive delta • Punctuation : Special tuple containing a predicate that is guaranteed to be satisfied by the remainder of the data stream • Heartbeat : Punctuations that govern timestamps of future tuples • Average slowdown = Tuple response time/ shortest processing time
  • 7. Outline • Introduction • Data Stream Management System (DSMS) o Stream Data Models o Query Language & Semantics o Query Processing o Query Optimization • Streaming Data Warehouse (SDW) • Discussion
  • 8. DSMS • Input Buffer/Monitor o Captures streaming inputs o May collect statistics on streams o Random sampling • Working storage o Stores recent stream data o Used for query processing • Local Storage o Used for metadata o Foreign key mapping o Naming translation • Query Processor o Convert queries into execution plans o Change plans for different workloads / input rates o Contains buffers, operator queues o Deploys scheduling methods • Continuous Query Repository • Results o May input to users, to other applications o Stored in an SDW for further analysis Fig : i) Abstract reference architecture of a DSMS & ii) A traditional DBMS
  • 9. Stream Data Models • Base streams – produced by sources, append-only • Derived streams – produced by continuous queries • Streams have a fixed schema o <timestamp, source IP addr, source port, destination IP addr, destination port, size> • Data Stream Models o Describe underlying signals S : [1 ... N] -> R o Aggregate model – Range value for a signal o Cash Register model – Partial non-negative range value o Turnstile model – Partial range value o Reset model – Range value; resets the previous value of a signal • Stream Windows – important from both the user and the query points of view o Fixed window o Sliding window o Landmark window o Jumping window – updates every k ticks or k arrivals o Tumbling window – updates every k ticks or k arrivals, where k = window size
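
The window types listed on this slide can be illustrated with a small sketch. It assumes a stream of (timestamp, value) pairs arriving in timestamp order and a time-based window covering (t - width, t]; the function names `sliding_window` and `tumbling_window` are hypothetical and not taken from the slides.

```python
from collections import deque


def sliding_window(stream, width):
    """Yield (timestamp, window contents) after every arrival.

    Assumption: the window at time t holds tuples with timestamp in (t - width, t].
    """
    window = deque()
    for ts, value in stream:
        window.append((ts, value))
        # Expire tuples that have slid out of the window.
        while window[0][0] <= ts - width:
            window.popleft()
        yield ts, [v for _, v in window]


def tumbling_window(stream, width):
    """Yield (window_start, values) once per non-overlapping window of `width` ticks."""
    current, values = None, []
    for ts, value in stream:
        start = (ts // width) * width
        if current is not None and start != current:
            # A tuple from a later window closes the current one.
            yield current, values
            values = []
        current = start
        values.append(value)
    if current is not None:
        # Flush the last window (only meaningful for a finite test stream).
        yield current, values


stream = [(1, "a"), (2, "b"), (4, "c"), (7, "d")]
sliding = list(sliding_window(stream, 3))
tumbling = list(tumbling_window(stream, 3))
```

A jumping window would sit between the two: it keeps the sliding window's contents but only emits every k ticks, with k smaller than the window size.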
  • 10. Query Language & Semantics • Query Algebra o Stream-to-stream o Mixed Algebra • Query Operators – Similar syntax to DBMS, very different semantics • Relation-like query operators o Selection, projection, union – stateless operators o Join – window joins o Aggregate operators • DSMS exclusive operators o Buffered sort operator o Random sampling operator o User defined aggregate functions (UDAF) • Query Languages o GSQL o CQL o ESL
  • 11. Query Operators • Selections, (duplicate preserving) projections are straightforward o Local, per-element operators o Duplicate eliminating projection is like grouping o Projection needs to include ordering attribute o No restriction for position ordered streams • Aggregate expressions: o distributive: sum, count, min, max o algebraic: average o holistic: count-distinct, median Fig: Simple continuous query operators: i) - Selection, ii) Count, iii) Negation
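
The three aggregate classes differ in how much state is needed to combine partial results. The sketch below is an illustration under the usual definitions (distributive: partial results merge directly; algebraic: a fixed-size summary suffices; holistic: no fixed-size summary works in general). All function names are hypothetical.

```python
from statistics import median


def merge_sum(a, b):
    """Distributive: two partial sums combine into the full sum."""
    return a + b


def partial_avg(values):
    """Algebraic: keep a fixed-size summary (sum, count) per partition."""
    return (sum(values), len(values))


def merge_avg(p, q):
    return (p[0] + q[0], p[1] + q[1])


def finalize_avg(p):
    return p[0] / p[1]


part1 = [1, 2, 9]
part2 = [3, 4]
everything = part1 + part2

total = merge_sum(sum(part1), sum(part2))
average = finalize_avg(merge_avg(partial_avg(part1), partial_avg(part2)))

# Holistic: the median of the partial medians is not the median of the data,
# so the operator has to retain the values themselves (or an approximation).
median_of_medians = median([median(part1), median(part2)])
true_median = median(everything)
```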
  • 12. Query Operators • Join operators are problematic on streams o May need to join stream tuples that are arbitrarily far apart o Operate on implicit / explicit windows • SELECT * FROM S1, S2 WHERE S1.attr = S2.attr GROUP BY S1.timestamp/60 AS minute • SELECT * FROM S1, S2 WHERE S1.attr = S2.attr AND |S1.timestamp - S2.timestamp| <= w • SELECT * FROM S1 [RANGE w], S2 [RANGE w] WHERE S1.attr = S2.attr Fig: Simple continuous query operators: i) Join, ii) Sliding window join with state
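
A sliding window join with state can be sketched as a symmetric hash join: each arrival probes the other stream's hash table and is then stored in its own. This is a minimal illustration, assuming timestamps are non-decreasing across both inputs and the join condition is S1.attr = S2.attr with timestamps at most w apart. The class `WindowJoin` and its method names are hypothetical.

```python
from collections import defaultdict, deque


class WindowJoin:
    """Symmetric hash join of streams S1 and S2 over a time window of width w."""

    def __init__(self, w):
        self.w = w
        # One hash table per input: attr -> deque of (timestamp, payload), oldest first.
        self.state = {"S1": defaultdict(deque), "S2": defaultdict(deque)}

    def insert(self, side, ts, attr, payload):
        """Process one arrival and return the join results it produces."""
        other = "S2" if side == "S1" else "S1"
        bucket = self.state[other][attr]
        # Lazily expire tuples in the probed bucket that are more than w older.
        while bucket and bucket[0][0] < ts - self.w:
            bucket.popleft()
        results = []
        for _, other_payload in bucket:
            if side == "S1":
                results.append((payload, other_payload))
            else:
                results.append((other_payload, payload))
        # Store the new tuple so future arrivals on the other side can match it.
        self.state[side][attr].append((ts, payload))
        return results


join = WindowJoin(w=5)
output = []
output += join.insert("S1", 1, "x", "a1")
output += join.insert("S2", 3, "x", "b1")
output += join.insert("S2", 4, "y", "b2")
output += join.insert("S1", 9, "x", "a2")  # b1 (ts=3) has expired: 9 - 3 > 5
output += join.insert("S1", 9, "y", "a3")  # b2 (ts=4) is still inside the window
```

Expiration here only touches the bucket being probed, so state for keys that never reappear is not reclaimed; a real operator would also need periodic cleanup.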
  • 13. Query Processing • Declarative queries -> Logical query plan -> Physical plan o Directed Acyclic Graphs (nodes -> operators, edges -> data flow) • Queries sharing memory/streams are combined into a single plan Fig: a) Query plan for two queries: i) a join of streams S1 and S2 with a selection predicate on S1, and ii) an aggregate on S2. b) A continuous query with selection and tumbling window aggregation • Scheduling o FIFO, Round Robin – simple, not efficient o Favoring operators with higher throughput – low latency o Favoring operators with minimum processing time & selectivity – smaller queues • Heartbeats & Punctuations o Typically issued by sources o Reduce the amount of state needed by operators o Prevent operators from doing unnecessary work o Query plans can also issue heartbeats to avoid pipeline stalls and delayed results SELECT minute, SUM(size) FROM s WHERE destination_port <= 80 GROUP BY timestamp/60 AS minute
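
The continuous query on this slide (selection followed by a tumbling one-minute aggregate) can be sketched as a two-operator pipeline. The event encoding is an assumption made for illustration: data tuples are ("tuple", timestamp, destination_port, size) and heartbeats are ("heartbeat", timestamp), promising that no later tuple has a smaller timestamp. The function names are hypothetical.

```python
def select_port(events, max_port=80):
    """Selection operator: filter data tuples, pass heartbeats through unchanged."""
    for ev in events:
        if ev[0] == "heartbeat" or ev[2] <= max_port:
            yield ev


def sum_per_minute(events):
    """Tumbling-window SUM(size) grouped by timestamp // 60.

    A window is closed as soon as any event (tuple or heartbeat) proves that
    time has moved past it, so results are not delayed by a quiet stream.
    """
    minute, total = None, 0
    for ev in events:
        m = ev[1] // 60
        if minute is not None and m > minute:
            yield (minute, total)
            minute, total = None, 0
        if ev[0] == "tuple":
            if minute is None:
                minute = m
            total += ev[3]


events = [
    ("tuple", 5, 80, 100),
    ("tuple", 20, 443, 999),   # dropped by the selection
    ("tuple", 50, 22, 50),
    ("heartbeat", 60),         # closes minute 0 without waiting for more data
    ("tuple", 130, 80, 7),
    ("heartbeat", 185),        # closes minute 2
]
results = list(sum_per_minute(select_port(events)))
```

Without the heartbeat at timestamp 60, the result for minute 0 would only be emitted when the tuple at timestamp 130 arrived.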
  • 14. Query Processing Cont. • Queries as views & negative tuples o Negative tuples are implemented by a sign on explicit windows o Explicit windows are time-based or count-based o Generated negative tuples are processed by cascading operators o Negative tuples on aggregate operators • Count – easy to compute • Max/Min – memory intensive o Twice as many tuples are processed • Negative tuples can be avoided for some operators • Tag tuples with expiration times • Such operators are known as weak non-monotonic Fig: a) Maintaining a view over a sliding window join using negative tuples b) Finding the maximum element in a sliding window
  • 15. Query Optimization • Finds efficient query plans • DBMSs focus on minimizing I/O, while DSMSs try to reduce cost per unit time • Static Analysis and Query Rewriting o Ensures a query can be evaluated in a non-blocking fashion with limited memory • S(A,B,C), T(D,E) • π_A(σ_{A=D ∧ A>10 ∧ D<20}(S × T)): Yes • π_A(σ_{A=D}(S × T)): No • π_A(σ_{B<D ∧ A>10 ∧ D<20}(S × T)): Yes, if there are no duplicates o Common Rules • Evaluate inexpensive predicates before complex ones • Perform selections before joins o Rules for continuous query operators only • Selections and explicit time-based windows commute • Selections and explicit count-based windows don't commute o Rewrite based on input constraints • Join of unbounded streams is possible if matching tuples arrive at most t time units apart • Multi-Query Optimization Fig: Separate and shared query plans for Q1 and Q2
  • 16. Operator Optimization • Join o Need to remove expired tuples o Expiration at each time tick is costly o Periodic removal reduces expiration cost but increases join processing cost o Probe streams with fewer matches • Aggregation o Synopses allow efficient re-computation o Prefix synopses • Suitable for subtractable aggregates • E.g. Sum, Count o Interval synopses • Suitable for distributive aggregates • E.g. Min, Max • Need to access log b intervals • Basic interval synopses require b accesses o Holistic aggregates require additional info in synopses o Algebraic aggregates are computed from derived info • Avg = Sum / Count Fig: i) Prefix synopses, ii) Interval synopses, iii) Basic interval synopses
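
A prefix synopsis for a subtractable aggregate can be sketched with running sums: the aggregate over any window is the difference of two stored prefixes. This is an illustration only; the class `PrefixSumSynopsis` is hypothetical, positions are 0-based arrival indexes, and the prefix list grows without bound, which a real system would truncate as data expire.

```python
class PrefixSumSynopsis:
    """Stores running sums so that any window sum needs two lookups."""

    def __init__(self):
        # prefix[i] is the sum of the first i values seen so far.
        self.prefix = [0]

    def append(self, value):
        self.prefix.append(self.prefix[-1] + value)

    def window_sum(self, start, end):
        """Sum of the values at positions start .. end-1."""
        return self.prefix[end] - self.prefix[start]

    def window_avg(self, start, end):
        """Algebraic aggregate derived from Sum and Count."""
        return self.window_sum(start, end) / (end - start)


synopsis = PrefixSumSynopsis()
for v in [3, 1, 4, 1, 5]:
    synopsis.append(v)

middle_sum = synopsis.window_sum(1, 4)   # 1 + 4 + 1
last_three = synopsis.window_sum(2, 5)   # 4 + 1 + 5
last_three_avg = synopsis.window_avg(2, 5)
```

The same trick does not work for Min or Max because they cannot be "subtracted", which is why the slide pairs them with interval synopses instead.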
  • 17. Query Optimization • Load Shedding & Approximation o Random sampling o Semantic load shedding drops less important tuples o Objective is to minimize the drop in accuracy • Challenging for complex query plans with multiple streams and operators • Load Balancing o Write part of stream if possible • Adaptive Query Optimization o Query cost per unit time may change o Query plan is dynamically re-ordered based on speed, selectivity and queue length o Trade-off between the resulting adaptivity and the overhead of dynamic routing • Distributed Query Optimization o Parallelizing and distributing the system itself • Split the query plan across nodes • Partition the streams o Shifting partial computation to the sources • In-network processing reduces the communication overhead
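
Random load shedding can be sketched as Bernoulli sampling in front of an operator, with the aggregate scaled back up to compensate. The keep probability, the seeded generator and the function names below are assumptions chosen for illustration; the estimate is approximate by construction.

```python
import random


def shed_load(stream, keep_probability, rng):
    """Keep each tuple independently with the given probability, drop the rest."""
    for item in stream:
        if rng.random() < keep_probability:
            yield item


def estimate_count(sampled_count, keep_probability):
    """Scale a count computed on the sample back to the full stream."""
    return sampled_count / keep_probability


rng = random.Random(0)  # seeded so the sketch is repeatable
keep_probability = 0.1
kept = list(shed_load(range(10_000), keep_probability, rng))
estimated_total = estimate_count(len(kept), keep_probability)
```

Semantic load shedding would replace the random test with a predicate on tuple contents, so that tuples contributing least to the answer are dropped first.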
  • 18. Outline • Introduction • Data Stream Management System (DSMS) • Streaming Data Warehouse (SDW) o Data ETL o Update Propagation o Data Expiration o Update Scheduling o Query Processing on SDW • Discussion
  • 19. SDW • Data streams/feeds arrive periodically • ETL process – data cleaning, standardization and so on • Table types o Base tables – sourced directly from raw files o Derived tables – materialized views over base or other derived tables • Update scheduler selects the order in which tables are updated o Based on dependencies and workloads Fig: Abstract reference architecture of an SDW
  • 20. ETL • Simple tasks – decompression, standardization • Complex tasks o Joining new data with relations holding descriptive attributes • Relations R are disk-based • Data is buffered in main memory • Mesh Join o Accesses blocks of R in sequential order o A tuple is removed from the buffer once it has been joined with all blocks of R o Loading data into tables • Tables are partitioned into timestamp ranges • New data affect a small number of recent partitions Fig: Partitioning a table on a timestamp attribute
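
Partitioning a table on a timestamp attribute means that a newly loaded batch touches only a few recent partitions. The sketch below assumes fixed-width timestamp ranges and an in-memory dictionary standing in for the table; `load_batch` and the partition width are hypothetical.

```python
from collections import defaultdict


def load_batch(partitions, batch, partition_width):
    """Append (timestamp, row) pairs to their partitions.

    Returns the set of partition keys that changed, which is what an update
    propagation step would need in order to refresh dependent derived tables.
    """
    touched = set()
    for ts, row in batch:
        key = ts // partition_width  # partition covers [key * width, (key + 1) * width)
        partitions[key].append((ts, row))
        touched.add(key)
    return touched


table = defaultdict(list)
first = load_batch(table, [(5, "a"), (12, "b")], partition_width=10)
second = load_batch(table, [(14, "c"), (21, "d")], partition_width=10)
```

After the second batch, partition 0 is untouched, so anything derived only from it does not need to be recomputed.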
  • 21. Update Propagation • Goals o Propagate changes across layers of derived tables o Avoid recomputing an entire derived table o Efficiently identify partition dependency • Partition dependencies may not be obvious from the SQL specification Fig : Updating a partitioned derived table Fig : Partition dependency
  • 22. Data Expiration • Tuples may have variable lifetime • Tables can be partitioned on insertion and expiration timestamps o Partitions may not have equal size • One solution is to assign updates in round robin fashion Fig : Partitioning a table on two attributes: insertion and expiration timestamp
  • 23. Update Scheduling • External sources push new data • Many data feeds and derived tables compete for resources • Resource usage is controlled by a scheduler • Goal: minimize data staleness • A priority-weighted staleness metric is used to select the tables whose update reduces it the most Fig: Plot of the staleness of an SDW table over time
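
A priority-weighted staleness scheduler can be sketched as a greedy choice. The definition used here is an assumption for illustration: staleness is the current time minus the timestamp up to which a table has been loaded, and the scheduler picks the table with pending data whose weighted staleness is largest. The `Table` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    priority: float
    freshness: float  # timestamp up to which data has been loaded
    pending: bool     # True if new data is waiting to be applied


def weighted_staleness(table, now):
    return table.priority * (now - table.freshness)


def pick_next_update(tables, now):
    """Greedy choice: update the pending table with the largest weighted staleness."""
    candidates = [t for t in tables if t.pending]
    if not candidates:
        return None
    return max(candidates, key=lambda t: weighted_staleness(t, now))


tables = [
    Table("flows", priority=1, freshness=90, pending=True),     # 1 * 10 = 10
    Table("alerts", priority=5, freshness=97, pending=True),    # 5 * 3  = 15
    Table("archive", priority=10, freshness=0, pending=False),  # nothing to load
]
chosen = pick_next_update(tables, now=100)
```

A table without pending data is skipped no matter how stale it is, since updating it would not reduce staleness.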
  • 24. Query Processing • Overhead of partitioned tables o Partitions that are too small are difficult to manage o Partitions that are too big need to be recomputed as new data arrives o Solution: use bigger partitions as data become old • Data Availability and Concurrency Control o Tables are updated frequently o Queries should not be blocked and should return consistent data o Solution: multi-version concurrency control at the partition level
  • 25. Discussion • End-to-end data stream management • DSMS allows relational like queries as well as pattern matching and event processing queries • Query semantics are different than traditional ones • SDW research problems introduced recently • Didn’t cover data mining techniques, fault tolerance and distributed processing in the lecture
  • 26. References 1. Data stream management, Lukasz Golab & M. Tamer Özsu 2. Data stream management system – introduction, concepts and issues. Morton Lindeberg, University of Oslo

Editor's Notes

  1. Partition dependencies : Change in raw files / tables could be mapped to partitions of data
  2. Reset model: router CPU measurements. Cash register model: Internet packet stream.
  3. Knowing the stream arrival rates and the selectivity of each operator allows us to estimate each operator's output rate. If a punctuation arrives on one of the input streams with the predicate attr != a1 AND attr != a2, we can immediately remove tuples with those attr values from both hash tables.
  4. Other techniques are :