4. Data Storage Requirements
User data
Multimedia
Phone messages/conversations (500 Kbps – 10 Mbps)
Music (500 Kbps)
TV/Radio broadcasts (500 Kbps – 10 Mbps)
Home movies (10 Mbps)
Images
Computer
Programs
Data files
Operating systems
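The bit rates above translate into storage volume by simple arithmetic. The sketch below assumes decimal units (1 Kbps = 1,000 bits per second, 1 GB = 10^9 bytes); the function name is illustrative and not part of the original slides.

```python
def storage_gigabytes(rate_kbps: float, seconds: float) -> float:
    """Storage in gigabytes (10**9 bytes) for a stream at rate_kbps kilobits/s."""
    bits = rate_kbps * 1000 * seconds
    return bits / 8 / 1e9

# One hour of a 10 Mbps home movie and one hour of 500 Kbps music.
movie_gb = storage_gigabytes(10_000, 3600)
music_gb = storage_gigabytes(500, 3600)
print(f"movie: {movie_gb:.3f} GB/hour, music: {music_gb:.3f} GB/hour")
```

Under these assumptions an hour of home movies needs roughly twenty times the space of an hour of music.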
5. Data Storage Issues
Issues
Query frequency and type
Sampling/recording rates
205 sensors (158,900 Kbps)
Multimedia recordings
Simultaneous playback
Analysis, prediction, decision-making queries
Transaction granularity
Historical data, decay
Security and privacy
Centralized vs. distributed
6. What Data to Store
Type of Data
Raw data
Pre-processed
Compressed
Frequency of Data Storage for
Sensor Data
Tradeoff between precision and
quantity
7. Sensor Data Example
9/8/2002 2:0:1 AM~A5 (Coffee Maker) ON
9/8/2002 1:6:59 AM~A9 (A/C) ON
9/8/2002 3:58:52 AM~A0 (Stereo) ON
9/8/2002 5:57:0 AM~A2 (Kitchen Light) ON
9/8/2002 3:1:42 AM~A5 (Coffee Maker) OFF
9/8/2002 7:8:3 AM~A3 (Stove) ON
9/8/2002 12:54:52 PM~A10 (Bathroom Light) ON
9/8/2002 4:58:5 AM~A0 (Stereo) OFF
9/8/2002 8:1:20 AM~A3 (Stove) OFF
9/8/2002 9:6:10 AM~A8 (Computer) ON
9/8/2002 10:8:19 AM~A4 (Bathtub Heater) ON
9/8/2002 11:9:4 AM~A0 (Stereo) ON
9/8/2002 9:4:5 AM~A8 (Computer) OFF
9/8/2002 10:9:4 AM~A4 (Bathtub Heater) OFF
9/8/2002 2:2:5 PM~A10 (Bathroom Light) OFF
9/8/2002 2:52:37 PM~A0 (Stereo) OFF
9/8/2002 4:2:0 PM~A9 (A/C) OFF
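A log in this format can be turned into structured records before storage. The following is a minimal sketch, assuming the date is month/day/year and that each line has the shape `timestamp~ID (Name) STATE`; the `SensorEvent` type and `parse_event` function are hypothetical names, not part of any system described here.

```python
import re
from datetime import datetime
from typing import NamedTuple

class SensorEvent(NamedTuple):
    timestamp: datetime
    device_id: str
    device_name: str
    state: str

# timestamp ~ device id, device name in parentheses, then ON or OFF
LINE_RE = re.compile(r"^(?P<ts>.+?)~(?P<id>\w+) \((?P<name>[^)]*)\) (?P<state>ON|OFF)$")

def parse_event(line: str) -> SensorEvent:
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    # strptime accepts the unpadded fields used in the log (e.g. 2:0:1).
    ts = datetime.strptime(m["ts"], "%m/%d/%Y %I:%M:%S %p")
    return SensorEvent(ts, m["id"], m["name"], m["state"])

event = parse_event("9/8/2002 12:54:52 PM~A10 (Bathroom Light) ON")
print(event)
```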
8. Media Viewing Example
Watching Events
Date | Day | Mood | Start | End | Device | Program name | Type | Comments | Others | Rating
020302 | Su | normal | 1330 | 1600 | T | nba basketball | sports | dallas mavericks go team | none | 5
020302 | Su | normal | 1700 | 2100 | t | super bowl | sports | gotta watch the commercials | Dad | 5
020402 | m | normal | 1900 | 2000 | t | boston public | drama | hot teachers | none | 5
020402 | m | normal | 2000 | 2100 | t | ally mcbeal | drama | funny lawyers | none | 4
020402 | m | normal | 2300 | 100 | V | WWF RAW | wrestling | testosterone | none | 5
020502 | t | normal | 2100 | 2200 | t | philly | drama | hot lawyers | none | 4
020602 | w | bored | 1830 | 2200 | t | nba basketball | sports | GO MAVS | none | 5
020702 | th | tired | 1900 | 2100 | t | wwf smackdown | wrestling | its me soap | none | 5
020702 | th | tired | 2100 | 2200 | t | ER | drama | good show | none | 4
020802 | f | excited | 1900 | 2230 | t | olympics | sports | gotta watch | none | 4
020902 | sa | excited | 1900 | 2230 | t | olympics | sports | gotta watch | none | 4
021002 | su | ecstatic | 1500 | 1800 | t | NBA allstar game | sports | gotta see what happens | none | 3
012802 | M | normal | 1900 | 2000 | T | Boston Public | Drama | hot chicks teaching | none | 5
012802 | M | normal | 2000 | 2100 | T | Ally McBeal | Drama | hot chicks lawyering | none | 5
9. Multimedia Example
Digital Silhouettes (Predictive Networks)
Predicting web surfing behavior ($$$)
Microsoft (2002) track TV viewing preferences
140 data items for each user
Demographics (50)
Subcategories within gender, age, income,
education, occupation, and race
90 Content preferences
golf, music, yoga
11. Example Data Representations
Relational
We all know…flat tables of atomic attributes with
foreign key relationships
OO
Complex data reps
multivalued, composite
Temporal
Relational model: add valid start, end dates to
each table (versions of info and when valid)
Includes time, events, durations…
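As a rough illustration of the temporal extension to the relational model, the sketch below adds valid start and end dates to a table and asks which version of a row was valid on a given day. It uses Python's built-in `sqlite3` module; the table, column names, and sample values are invented for the example, and the end date is treated as exclusive by assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE thermostat_setting (
        room        TEXT NOT NULL,
        target_temp INTEGER NOT NULL,
        valid_start TEXT NOT NULL,   -- ISO date, inclusive
        valid_end   TEXT NOT NULL    -- ISO date, exclusive
    )""")
# Two versions of the same fact, each with its own validity period.
conn.executemany(
    "INSERT INTO thermostat_setting VALUES (?, ?, ?, ?)",
    [("living room", 68, "2002-01-01", "2002-06-01"),
     ("living room", 72, "2002-06-01", "9999-12-31")],
)

def setting_on(day: str) -> int:
    """Return the living-room target temperature valid on the given ISO date."""
    row = conn.execute(
        "SELECT target_temp FROM thermostat_setting "
        "WHERE room = 'living room' AND valid_start <= ? AND ? < valid_end",
        (day, day),
    ).fetchone()
    return row[0]

print(setting_on("2002-03-15"), setting_on("2002-09-08"))
```

ISO-formatted date strings sort in chronological order, which is why plain string comparison is enough here.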
13. Example Operations for
Temporal Databases
INCLUDES
Rows valid in a certain time period
BEFORE/AFTER a time condition
Set operations
Union, intersection of 2 time periods
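These operations can be sketched on simple closed date periods. The code below is one possible reading, assuming both endpoints are inclusive and that a union is only returned when the two periods overlap; `Period` and the function names are illustrative, not taken from any particular temporal database.

```python
from datetime import date
from typing import NamedTuple, Optional

class Period(NamedTuple):
    start: date  # inclusive
    end: date    # inclusive

def includes(outer: Period, inner: Period) -> bool:
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end

def before(a: Period, b: Period) -> bool:
    """True if a ends strictly before b starts."""
    return a.end < b.start

def intersection(a: Period, b: Period) -> Optional[Period]:
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Period(start, end) if start <= end else None

def union(a: Period, b: Period) -> Optional[Period]:
    """Union as a single period; None when the periods do not overlap."""
    if intersection(a, b) is None:
        return None
    return Period(min(a.start, b.start), max(a.end, b.end))

spring = Period(date(2002, 3, 1), date(2002, 5, 31))
april_to_july = Period(date(2002, 4, 1), date(2002, 7, 31))
print(intersection(spring, april_to_july), union(spring, april_to_july))
```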
14. Active DB
Event-Condition-Action rules
Allow for decisions to be made in the database
instead of a separate application
Relational
Implemented as triggers
Challenges
Rule consistency
(2+ rules do not contradict)
Guaranteed termination
Trigger loops (T1 <->T2)
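A relational trigger maps directly onto the Event-Condition-Action pattern. The sketch below uses SQLite through Python's `sqlite3` module purely for illustration; the tables, the threshold of 30, and the `lights_on` command are made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE light_reading (room TEXT, level INTEGER);
    CREATE TABLE command (room TEXT, action TEXT);

    -- Event: a row is inserted. Condition: level below 30. Action: queue a command.
    CREATE TRIGGER low_light AFTER INSERT ON light_reading
    WHEN NEW.level < 30
    BEGIN
        INSERT INTO command VALUES (NEW.room, 'lights_on');
    END;
""")
conn.execute("INSERT INTO light_reading VALUES ('hall', 80)")  # condition false
conn.execute("INSERT INTO light_reading VALUES ('hall', 12)")  # condition true
commands = conn.execute("SELECT room, action FROM command").fetchall()
print(commands)
```

Because the action here writes to a table that has no triggers of its own, this rule cannot loop; a second trigger on `command` that inserted back into `light_reading` would be an instance of the T1 <-> T2 problem.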
15. Smart Home Active DB Example
Java, Postgres, Jess rules
Event classification (local & composite)
Data Manipulation Events
TV show being viewed (channel, time, genre…)
Temporal Events (instance, recurring)
Set temp to 70 degrees at 7:00am workdays
Exception Events
Power failure
Behavioral Events
Time children home from school; dinner time
16. Active DB Example (TCU)
Title | Event | Condition | Action
TV View Menu | TV turned on | Molly is holding remote | Display shows matching Molly’s preferences
Entry Lighting | Inhabitant enters house | Light level < threshold | Adjust lighting to predetermined level
Aromatherapy | Every Friday night when Hanna sits on sofa | Always | Release aroma
Night Idle | John on sofa idle > 15 minutes, TV & lights are on | No other inhabitant in room | Turn off all devices in the room
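Rules like these can also be expressed outside the database as event/condition/action triples. The sketch below encodes two of the rules in plain Python; the event names, the state dictionary, and the threshold value are assumptions made for the example and do not reflect the TCU implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

@dataclass
class Rule:
    title: str
    event: str
    condition: Callable[[State], bool]
    action: Callable[[State], None]

def dispatch(rules: List[Rule], event: str, state: State) -> List[str]:
    """Fire each rule whose event matches and whose condition holds; return the titles fired."""
    fired = []
    for rule in rules:
        if rule.event == event and rule.condition(state):
            rule.action(state)
            fired.append(rule.title)
    return fired

LIGHT_THRESHOLD = 30  # hypothetical value

rules = [
    Rule("Entry Lighting", "inhabitant_enters",
         lambda s: s["light_level"] < LIGHT_THRESHOLD,
         lambda s: s.update(lights="preset")),
    Rule("Aromatherapy", "hanna_sits_on_sofa_friday_night",
         lambda s: True,  # condition "Always"
         lambda s: s.update(aroma="released")),
]

state = {"light_level": 10}
fired = dispatch(rules, "inhabitant_enters", state)
print(fired, state)
```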
17. Distributed vs. Centralized
Centralized database can produce a
bottleneck
Large volume of data input
Large database
Large volume of queries
In distributed databases, data consistency,
replication, and retrieval can be more
problematic
Consistency of schemas
Retrieval in case the data location is not known
Communication overhead to ensure database
consistency
18. SmartHome
Database Architecture
Centralized vs. distributed?
Answer: Both
Central storage of high demand, persistent data
Distributed storage of low demand, dynamic data
Distributed queries
Push processing toward sensors
Adaptive, hierarchical organization
End-effector autonomy (“smart sensor”)
19. Database Systems
Commercial
DB2
Empress
Informix
Oracle
MS Access
MS SQL
Sybase
Free
Berkeley DB
PostgreSQL
MySQL
20. UTA MavHome DB
Active
Reactive & proactive (e.g., to predict)
Distributed
Information collection agents
Rules
Local Agent: what data they need to collect
Distributed: coordinate overall monitoring of collected
information
Continuous monitoring of events
Extension of SNOOP
21. Microsoft Easy Living DB
(2002)
Relational
Fast & robust, but awkward for some data
World Model DB Describes:
Computing devices
People and their personal preferences/settings
Services
Rooms and doorways
Serves as an abstraction layer between sensors and
applications that use data from sensors
e.g., adding new sensors requires no change to applications
22. Stanford Interactive
Workspace
Uses LORE
A semi-structured XML DB system
Still available, but work stopped in 2000
Data stored is a catalog of (index to)
documents, images, 3-D models,
application-specific domain models
23. Sensor Database Systems
COUGAR project
www.cs.cornell.edu/database/cougar
Query processing over ad-hoc sensor
networks
Small database component (QueryProxy) at
each sensor
Sensor clusters provide local aggregations
(e.g., min, max, mean)
Assumes centralized index of all data sources
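Local aggregation at a sensor cluster can be pictured as each cluster reducing its readings to a small summary that is then combined upstream. This is a generic sketch of that idea, not COUGAR's actual QueryProxy interface; the function names and readings are invented. Carrying a count alongside the mean is what lets the means be combined correctly.

```python
from statistics import mean
from typing import Dict, List

def aggregate_cluster(readings: List[float]) -> Dict[str, float]:
    """Summarize one cluster's raw readings locally."""
    return {"min": min(readings), "max": max(readings),
            "mean": mean(readings), "count": len(readings)}

def combine(summaries: List[Dict[str, float]]) -> Dict[str, float]:
    """Merge cluster summaries without seeing the raw readings."""
    total = sum(s["count"] for s in summaries)
    return {
        "min": min(s["min"] for s in summaries),
        "max": max(s["max"] for s in summaries),
        "mean": sum(s["mean"] * s["count"] for s in summaries) / total,
        "count": total,
    }

kitchen = aggregate_cluster([20.0, 22.0])
bedroom = aggregate_cluster([24.0])
house = combine([kitchen, bedroom])
print(house)
```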
24. Siemens Netabase
“The network is the database.”
Navas and Wynblatt, ACM SIGMOD 2001
Sensor networks
Large number of data sources (10^5)
Volatile data and data organization
“Thin” data servers on scaled-down hardware
Netabase approach
Query decomposition
Characteristic routing (à la IP routing)
Local joins
Query evaluation
26. Data Warehouses
Repositories for data mining activities
Aggregates/summaries of data help efficiency
Optimized for decision-support, not
transaction processing
Definition (Elmasri, page 900)
“A subject-oriented, integrated, non-volatile, time-variant
collection of data in support of
management’s decisions”
Replace “management” with “smart home agents”
27. Warehouse Properties
Very large: 100 gigabytes to many terabytes
Tends to include historical data
Workload: mostly complex queries that access lots of data, and
do many scans, joins, aggregations. Tend to look for "the big
picture".
Updates pumped to warehouse in batches (overnight)
Data may be heavily summarized and/or consolidated in
advance (must be done in batches too, must finish overnight).
Research work has been done (e.g. "materialized views") -- a small
piece of the problem.
02.15.04 from http://redbook.cs.berkeley.edu/lec28.html
28. Data Warehouses
Data Cleaning
Data Migration: simple transformation rules (replace "gender" with "sex")
Data Scrubbing: use domain-specific knowledge (e.g. zip codes) to modify
data. Try parsing and fuzzy matching from multiple sources.
Data Auditing: discover rules and relationships (or signal violations thereof).
Not unlike data mining.
Data Loading
can take a very long time! (Sorting, indexing, summarization, integrity
constraint checking, etc.) Parallelism a must.
Full load: like one big transaction – change from old data to new is atomic.
Incremental loading ("refresh") makes sense for big warehouses, but
transaction model is more complex – have to break the load into lots of
transactions, and commit them periodically to avoid locking
everything. Need to be careful to keep metadata & indices consistent along
the way.
02.15.04 from http://redbook.cs.berkeley.edu/lec28.html
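The incremental-loading idea, breaking a load into many transactions and committing periodically, might look like the following sketch. It uses an in-memory SQLite database via Python's `sqlite3`; the `fact` table, the batch size, and the sample rows are placeholders.

```python
import sqlite3
from itertools import islice

def load_in_batches(conn, rows, batch_size=1000):
    """Insert rows, committing after every batch_size rows; return the number of commits."""
    it = iter(rows)
    commits = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO fact VALUES (?, ?)", batch)
        conn.commit()  # end this transaction so locks are not held for the whole load
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (device TEXT, state TEXT)")
n_commits = load_in_batches(conn, [("A5", "ON")] * 2500, batch_size=1000)
loaded = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(n_commits, loaded)
```

A full load would instead wrap all inserts in a single transaction, trading longer lock times for an atomic switch from old data to new.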
30. Data Mining Definition
Discovery of new information in terms of patterns or
rules from vast amounts of data
Extracts patterns that can’t readily be found by
asking the right questions (queries)
TOO MUCH DATA FOR HUMANS
Emerged from
Artificial Intelligence: machine learning, neural nets, genetic
algorithms
Statistics
Operations Research
31. Data Mining Steps
Data selection -- pick the data needed
Data cleansing
Fix bad data (e.g., spelling, zip codes)
Hard to deal with missing, erroneous, conflicting, redundant
data
Enrichment
Add data (e.g., age, gender, income)
Data transformation
Aggregate (e.g., zip codes into regions)
Data mining
Reporting on discovered Knowledge
32. Types of Results
Association rules
Buy diapers → buy lots of beer
Sequential patterns
Buy house → buy furniture within months
Classification trees
Types of buyers (upscale, bargain-conscious, …)
Why do it?
Make more money
Science & medicine
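An association rule such as diapers → beer is usually judged by two standard measures: support (the fraction of transactions containing all the items) and confidence (of the transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below computes both on a tiny made-up set of baskets.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy data, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
]
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```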
33. Data Mining Goals
Find patterns to predict future events
Find major groupings
Groupings of buyers, stars, diseases …
Find which group something belongs to
creditworthiness
34. Data Mining Results
Association rules
Classification hierarchies
Clustering
Sequential patterns
Patterns within time series
Type of result, inputs & algorithms vary
Often interested in some combination of
these types of Knowledge
35. Clustering
Unsupervised learning techniques
Training samples are unclassified
Vs. supervised learning (classification)
Drug categories for depression
Categories of TV viewers
Categories of buyers (likely, unlikely)
Categories of households?
Single male, mother/children, conventional
(M/D/kids), DINKs.
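As one concrete example of an unsupervised technique, the sketch below runs k-means on one-dimensional data, grouping viewers by weekly TV hours without any labels. K-means is chosen here only as a familiar illustration; the data and starting centroids are invented.

```python
from statistics import mean
from typing import List, Tuple

def kmeans_1d(points: List[float], centroids: List[float],
              iterations: int = 10) -> Tuple[List[float], List[int]]:
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = list(centroids)  # do not mutate the caller's list
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
                  for p in points]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean(members)
    return centroids, labels

weekly_tv_hours = [2.0, 3.0, 4.0, 20.0, 22.0, 25.0]
centroids, labels = kmeans_1d(weekly_tv_hours, centroids=[0.0, 10.0])
print(centroids, labels)
```

The two resulting groups could be read as light and heavy viewers, but the algorithm itself attaches no meaning to them; naming the categories is left to the analyst.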
36. Sequential Patterns
Detecting associations among events
with certain temporal relationships
Example:
Cardiac bypass for blocked arteries
AND within 18 months, high blood urea
THEN kidney failure likely in next 18
months
Particularly important in smart homes
38. Sequential Pattern Discovery
The support for a sequence S is the % of the given
set U of sequences of which S is a subsequence.
That is: how often does S show up?
Find all subsequences from the given sequence sets
that have a user-defined minimum support.
The sequence S1, S2, …, Sn predicts that a
customer who buys itemset S1 is likely to buy
itemset S2, then S3, …
Prediction support based on frequency of this
sequence in the past
Many research issues to create good algos
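The support definition can be checked directly on small data. The sketch below simplifies each element of a sequence to a single event rather than an itemset, and treats S as a subsequence when its events appear in order, not necessarily adjacent; the device sequences are invented.

```python
def is_subsequence(s, seq):
    """True if the events of s occur in seq in the same order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in s)  # 'in' consumes the iterator as it searches

def sequence_support(s, sequences):
    """Fraction of the given sequences of which s is a subsequence."""
    return sum(is_subsequence(s, seq) for seq in sequences) / len(sequences)

mornings = [
    ["Stereo ON", "Kitchen Light ON", "Stove ON"],
    ["Kitchen Light ON", "Stove ON", "Computer ON"],
    ["Stove ON", "Kitchen Light ON"],
    ["Kitchen Light ON", "Coffee Maker ON", "Stove ON"],
]
pattern = ["Kitchen Light ON", "Stove ON"]
pattern_support = sequence_support(pattern, mornings)
print(pattern_support)
```

A discovery algorithm would enumerate candidate subsequences and keep those whose support meets the user-defined minimum; this sketch only evaluates one candidate.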
39. Patterns Within Time Series
Finding 2 patterns that occur over time
2003 stock prices of Choice Homes and
Home Depot
2 products show same sales pattern in
summer but different one in winter
Solar magnetic wind patterns may predict
earth atmospheric changes
40. Time Series Pattern Discovery
Time series are sequences of events
Event could be a transaction (closing daily
stock price)
Look at sequences over n days, or
Longest period in which change is no
greater than 1%
Comparing
Must define similarity measures
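Two of these ideas can be sketched briefly. The first function reads "change no greater than 1%" as the day-to-day relative change, which is an assumption; the second uses Euclidean distance as one simple similarity measure among many. The price series is invented.

```python
import math

def longest_stable_run(prices, max_change=0.01):
    """Number of consecutive observations whose day-to-day change stays within max_change."""
    best = current = 1 if prices else 0
    for prev, cur in zip(prices, prices[1:]):
        if abs(cur - prev) / prev <= max_change:
            current += 1
        else:
            current = 1
        best = max(best, current)
    return best

def euclidean_distance(a, b):
    """Distance between two equal-length series; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

closing_prices = [100.0, 100.5, 101.0, 105.0, 105.2]
stable_days = longest_stable_run(closing_prices)
print(stable_days, euclidean_distance([0, 0], [3, 4]))
```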
41. Other Approaches in Data Mining
Neural nets
Infer a function from a set of examples
Non-parametric curve-fitting
Interpolates to solve new problems
Supervised & unsupervised algorithms
Capabilities
classification
time-series prediction
Disadvantages
can’t see what it learned (not declarative)
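To make "infer a function from a set of examples" concrete, the sketch below trains a single perceptron, the simplest one-neuron case and far smaller than a practical network, on the logical AND function. The learning rate and epoch count are arbitrary choices.

```python
def predict(weights, bias, x):
    """Output 1 when the weighted sum plus bias is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(examples, epochs=20, learning_rate=1):
    """Perceptron learning rule: nudge the weights toward each misclassified example."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_examples)
predictions = [predict(weights, bias, x) for x, _ in and_examples]
print(weights, bias, predictions)
```

The learned weights are just numbers; nothing in them states "both inputs must be on", which is the non-declarative drawback noted above.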
42. Other Approaches in Data Mining
Genetic algorithms
Set up
Representation (strings over an alphabet)
Evaluation (fitness) function
Parameters: # of generations, cross-over rate,
mutation rate, etc.
Randomized (probabilistic operators),
parallel search over search space
Used for problem solving and clustering
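The set-up items above can be seen in a toy genetic algorithm. The sketch below evolves bit strings toward all ones (the "OneMax" problem, picked only because its fitness function is trivial); tournament selection and keeping the best individual each generation are design choices of this example, and all parameter values are arbitrary.

```python
import random

def fitness(bits):
    """Evaluation function: the number of 1s in the string."""
    return sum(bits)

def evolve(length=20, pop_size=30, generations=40,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Representation: strings over the alphabet {0, 1}.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(map(fitness, pop))
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)[:]]  # carry the best individual forward
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness), initial_best

best, initial_best = evolve()
print(fitness(best), initial_best)
```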