2. What is a data stream?
• Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered
(implicitly by arrival time or explicitly by timestamp) sequence of items.
It is impossible to control the order in which items arrive, nor is it
feasible to locally store a stream in its entirety.”
• Massive volumes of data, items arrive at a high rate.
3. Data Streams
• A data stream is a (potentially unbounded) sequence of tuples. Each
tuple consists of a set of attributes, similar to a row in a database table.
• Transactional data streams: log interactions between entities
• Credit card: purchases by consumers from merchants
• Telecommunications: phone calls by callers to dialed parties
• Web: accesses by clients of resources at servers
• Measurement data streams: monitor evolution of entity states
• Sensor networks: physical phenomena, road traffic
• IP network: traffic at router interfaces
• Earth climate: temperature, moisture at weather stations
4. Examples of Stream Sources
Before proceeding, let us consider some of the ways in which stream data arises naturally.
Sensor Data: Imagine a temperature sensor bobbing about in the ocean, sending back to a base
station a reading of the surface temperature each hour. The data produced by this sensor is a stream
of real numbers. One sensor at this rate is easy to handle, but suppose we deploy a million such
sensors, each sending a 4-byte reading ten times per second. Now we have about 3.5 terabytes arriving
every day, and we definitely need to think about what can be kept in working storage and what can
only be archived.
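The 3.5-terabyte figure can be checked with quick arithmetic. The sketch below is only a back-of-the-envelope calculation under stated assumptions: one million sensors, ten readings per second per sensor, and 4 bytes per reading. The constant names are illustrative.

```python
# Back-of-the-envelope data volume for a large sensor deployment.
# All three inputs below are assumptions used for illustration.
NUM_SENSORS = 1_000_000      # assumed number of deployed sensors
READINGS_PER_SECOND = 10     # assumed reporting rate per sensor
BYTES_PER_READING = 4        # assumed size of one real-valued reading

SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = (
    NUM_SENSORS * READINGS_PER_SECOND * BYTES_PER_READING * SECONDS_PER_DAY
)
terabytes_per_day = bytes_per_day / 1e12  # decimal terabytes

print(f"{terabytes_per_day:.2f} TB/day")
```

Under these assumptions the daily volume comes out at roughly 3.5 TB, which is why the full stream cannot simply be kept in working storage.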
Image Data: Satellites often send down to earth streams consisting of many terabytes of images per
day. Surveillance cameras produce images with lower resolution than satellites, but there can be many
of them, each producing a stream of images at intervals like one second.
Internet and Web Traffic: A switching node in the middle of the Internet receives streams of IP
packets from many inputs and routes them to its outputs. Web sites receive streams of various types.
For example, Google receives several hundred million search queries per day. Yahoo! accepts billions
of “clicks” per day on its various sites.
5. Characteristics of Data Streams
• Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast-changing data that requires a fast, real-time response
• The data stream model nicely captures today's data processing needs
• Random access is expensive, so single-scan algorithms are used (each
item can be looked at only once)
• Store only a summary of the data seen thus far
• Most stream data are low-level or multi-dimensional in nature and
need multi-level, multi-dimensional processing
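The single-scan and summary-only constraints can be illustrated with a running mean and variance. This is a minimal sketch using Welford's online update, chosen here as one common one-pass technique; the class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """One-pass summary (count, mean, variance) of a numeric stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: each item is seen once and then discarded,
        # so memory use stays constant however long the stream runs.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the items seen so far.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in [21.5, 22.0, 21.8, 23.1]:  # hypothetical temperature readings
    stats.update(reading)
print(stats.count, round(stats.mean, 2), round(stats.variance, 3))
```

Only three numbers are kept regardless of stream length, which is the sense in which the summary replaces the data.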
6. Applications of data stream processing
• Data stream processing
• Process queries (compute statistics, activate alarms)
• Apply data mining algorithms
• Requirements
• Real-time processing
• One-pass processing
• Bounded storage (no complete storage of streams)
• Possibly consider several streams
• Let’s go deeper into some examples
• Network management
• Stock monitoring
10. A data-stream-management system (DSMS)
• Streams may be archived in a large archival
store, but we assume it is not possible to answer
queries from the archival store.
• It could be examined only under special
circumstances using time-consuming retrieval
processes.
• There is also a working store, into which
summaries or parts of streams may be placed,
and which can be used for answering queries.
• The working store might be disk, or it might be
main memory, depending on how fast we need
to process queries.
• But either way, it is of sufficiently limited
capacity that it cannot store all the data from all
the streams.
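The split between archival store and working store can be sketched as follows. This is a toy model, not a real DSMS: the class `StreamStores`, its methods, and the use of an in-memory list as a stand-in for the archive are all assumptions made for illustration.

```python
from collections import deque


class StreamStores:
    """Toy model of a DSMS with an archival store and a bounded working store."""

    def __init__(self, working_capacity):
        # Stand-in for the archival store (in practice slow, bulk storage).
        self.archive = []
        # Bounded working store: holds only the most recent items.
        self.working = deque(maxlen=working_capacity)

    def ingest(self, item):
        self.archive.append(item)  # archived, but never used to answer queries
        self.working.append(item)  # oldest item is evicted automatically when full

    def query_max(self):
        # Queries read the working store only, so the answer reflects
        # just the part of the stream that is still held there.
        return max(self.working) if self.working else None


stores = StreamStores(working_capacity=3)
for value in [9, 1, 2, 3]:
    stores.ingest(value)
print(stores.query_max())
```

Note that the query misses the earlier value 9: it has left the working store, and the archive is not consulted.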
11. Generic DSMS Architecture
[Figure: generic DSMS architecture, after Golab & Özsu 2003. Components shown: streaming inputs, input monitor, working storage, summary storage, static storage, query repository, query processor, output buffer, streaming outputs. External inputs: user queries and updates to static data.]
14. DBMS versus DSMS (Data Stream Management System)
DBMS | DSMS
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
“Unbounded” disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
16. Challenges of Stream Data Processing
• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries are often continuous
• Evaluated continuously as stream data arrives
• Answer updated over time
• Queries are often complex
• Beyond element-at-a-time processing
• Beyond stream-at-a-time processing
• Beyond relational queries (scientific, data mining, OLAP)
• Multi-level/multi-dimensional processing and data mining
• Most stream data are low-level or multi-dimensional in nature
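A continuous query differs from a one-time query in that its answer is re-emitted as each item arrives. The sketch below shows the idea with a running average and a threshold alarm; the function name `continuous_average`, the threshold, and the price values are hypothetical.

```python
def continuous_average(stream, alarm_threshold):
    """Yield (running_average, alarm) after every arriving item."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        average = total / count
        # The answer is updated on each arrival instead of computed once.
        yield average, average > alarm_threshold


prices = iter([100.0, 102.0, 110.0, 120.0])  # hypothetical stock ticks
for average, alarm in continuous_average(prices, alarm_threshold=105.0):
    print(f"avg={average:.2f} alarm={alarm}")
```

Because the function is a generator, it never needs the whole stream in memory and works equally well on an unbounded input.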
18. Approximate answers to queries
• When?
• Queries needing unbounded memory
• Too many queries, streams that arrive too rapidly, or response-time
requirements that are too strict
• CPU limit
• Memory limit
• Solution: approximate answers to queries
• Sliding windows
• Sampling and load shedding
• Definition of synopsis
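As one illustration of the sliding-window idea, the sketch below keeps an average over only the last `window_seconds` of a stream. It is a minimal sketch, assuming timestamps arrive in non-decreasing order; the class name `SlidingWindowAverage` and its methods are hypothetical.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the items whose timestamps fall in the most recent window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        self._items.append((timestamp, value))
        self._total += value
        # Expire items that fell out of the window.
        # Assumption: timestamps are non-decreasing.
        cutoff = timestamp - self.window_seconds
        while self._items and self._items[0][0] <= cutoff:
            _, old_value = self._items.popleft()
            self._total -= old_value

    def average(self):
        return self._total / len(self._items) if self._items else None


window = SlidingWindowAverage(window_seconds=10)
for ts, value in [(0, 1.0), (5, 3.0), (12, 5.0)]:
    window.add(ts, value)
print(window.average())
```

The answer is approximate with respect to the full stream, since older items are deliberately forgotten, and memory stays bounded as long as the arrival rate within one window is bounded.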
19. Streaming Computing Approaches
• Two approaches for handling such streams
• Use a time window, and query the window as a static table
• When the collected data cannot be stored, or historical data must be
tracked, summarize the stream instead:
• Sampling
• Filtering
• Counting
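For the sampling item, one widely used technique is reservoir sampling, which keeps a fixed-size uniform sample of a stream whose length is not known in advance. The slides do not name a specific method, so this is an illustrative choice; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

Memory use is proportional to `k` rather than to the stream length, so the sample can be queried at any time as a stand-in for the full stream.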