Sensor Data Management @ EPFL


          Karl Aberer
Overview

  Sensor Data Management
  –    Global Sensor Networks
  –    Swiss Experiment
  –    Sensor Metadata Management
  –    Time Series compression and retrieval
  –    Sensor data analysis and quality
  –    Economics-based resource allocation in distributed clouds
  –    Cloud-based time series management system
  Web Data Management
  –  Large-scale Semantic Data Integration
  –  Web Stream Data Analysis (Twitter)
Global Sensor Networks (GSN)

Integrates different sensor networks
– Different abstractions, hard to share
– Isolated networks, hard to republish

GSN server:
– Goal: Publishing streams generated by sensor networks
– Storage, archive
– Access to sensor network hardware
– Easy setup, easy to change

Virtual Sensor:
– Processing, filtering, aggregation
– Functional/non-functional properties
– Described in an XML file

GSN: Reference Implementation
– Integrity Service
– Access Control
– GSN/Web/Web-Services
– Notification Manager
– Query Processor
– Query Repository
– Storage Manager
– Virtual Sensor Manager
– Input Stream Manager
– Stream Quality Manager
– Life Cycle Manager
– Pool Of Sensing Devices
Current GSN deployments
        GSN Deployments
Swiss Experiment Infrastructure
  GSN System Overview
  [Figure: infrastructure diagram; remaining labels not recoverable]
Sensor Metadata Management

  Effective Metadata Management in Federated Sensor Networks
  Jeung H., Sarni S., Paparrizos I., Sathe S., Aberer K., Dawes N., Papaioannou T., Lehning M.,
  to appear in SUTC 2010.

  –  GSN: sensor data stream, SMW: metadata
  –  Distributed join, automated metadata generation
  –  Advanced metadata search
  –  Snapshots of SMW for SwissEx
Time Series Compression and Retrieval

  A model M describes the dependency between two sets of variables X and Y
  Models may capture data correlations, derive unknown values, quantify and
    correct measurement errors
    –  They are particularly useful for data compression, data completion and data cleaning


  Our work is on
    –  Deriving lower bounds on the achievable compression ratio for a time series
    –  Defining a suitable model-based storage and indexing scheme for fast
       retrieval
    –  Defining innovative models for data cleaning and data quality estimation


  Publications: ICDE’10, MDM’11, VLDB’11 (under preparation)
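
  The idea of model-based compression can be illustrated with a minimal sketch. It assumes the simplest possible model, a piecewise-constant approximation with a user-chosen maximum error `eps`; this is not the model family of the cited papers, and the names `compress`, `decompress` and the sample readings are hypothetical.

```python
from typing import List, Tuple

# (start index, end index inclusive, constant value used for the segment)
Segment = Tuple[int, int, float]


def compress(values: List[float], eps: float) -> List[Segment]:
    """Greedy piecewise-constant approximation with max absolute error eps."""
    segments: List[Segment] = []
    if not values:
        return segments
    start = 0
    lo = hi = values[0]
    for i in range(1, len(values)):
        v = values[i]
        new_lo, new_hi = min(lo, v), max(hi, v)
        if new_hi - new_lo > 2 * eps:
            # The value no longer fits: close the segment at its midrange.
            segments.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, v, v
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(values) - 1, (lo + hi) / 2))
    return segments


def decompress(segments: List[Segment]) -> List[float]:
    """Rebuild an approximate series from the stored segments."""
    out: List[float] = []
    for start, end, value in segments:
        out.extend([value] * (end - start + 1))
    return out


# Hypothetical temperature readings, for illustration only.
readings = [20.0, 20.1, 19.9, 20.2, 25.0, 25.1, 24.9, 30.0]
segments = compress(readings, eps=0.2)
restored = decompress(segments)
```

  Each segment replaces a run of raw values with one model parameter, so the compression ratio depends on how well the model fits the data for the chosen error bound.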
Parameter Compression
Data Compression
  Towards Multi-Model Approximation of Time-Series
              Thanasis Papaioannou, Mehdi Riahi, Karl Aberer [MDM 2011] (under review)
Probabilistic Data Generation
Sensor Context Extraction
  SeMiTri: A Framework for Semantic Annotation of Heterogeneous Trajectories
                         Z. Yan, D. Chakraborty, C. Parent, S. Spaccapietra, K. Aberer [EDBT 2011]

  Objective: A middleware for automatically annotating trajectories of different types
  of moving objects (cars, people)

  Semantic trajectory:  home -> (bus) -> office -> (metro) -> market -> (walking) -> home

  Semantic Annotation Middleware:
    –  Spatial Join (region)
    –  Map Matching (road network)
    –  Hidden Markov Model, HMM (point of interest)

  GPS episodes:  e1 e2 e3 e4 e5 e6 e7
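
  The region-annotation step can be sketched as a simple spatial join. This is a minimal illustration, assuming regions are axis-aligned bounding boxes and each GPS point takes the label of the first region that contains it; the `Region` class, the `annotate` function and the coordinates are hypothetical and not taken from SeMiTri.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """A named, axis-aligned bounding box (an assumed simplification)."""
    name: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, x: float, y: float) -> bool:
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y


def annotate(points: List[Tuple[float, float]],
             regions: List[Region]) -> List[Optional[str]]:
    """Label each point with the first region containing it, else None."""
    labels: List[Optional[str]] = []
    for x, y in points:
        label = next((r.name for r in regions if r.contains(x, y)), None)
        labels.append(label)
    return labels


# Hypothetical regions and GPS points, for illustration only.
regions = [Region("home", 0, 0, 1, 1), Region("office", 5, 5, 6, 6)]
points = [(0.5, 0.5), (3.0, 3.0), (5.5, 5.2)]
labels = annotate(points, regions)  # ["home", None, "office"]
```

  Points that fall in no region stay unlabeled; in the pipeline above they would be candidates for the map-matching and point-of-interest layers.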
Trusted Privacy-preserving Sensing
Economic Cloud Resource Management

  Objective: high availability and low response time in a cost-effective way in data clouds
    –  Hardware (correlated) failures, highly irregular query rates, NP multi-constrained
       global optimization problem!
  Solution: decentralized virtual economy (‘Skute’)
    –  Partition data using consistent hashing
    –  A virtual node is responsible for a key range
    –  Virtual ring organizes virtual nodes per availability level and per application
    –  Virtual nodes act as economic agents and independently migrate, replicate
       or delete themselves
    –  Skute offers differentiated availability guarantees, as well as automated and
       balanced cloud resource elasticity
  Publications: ACDC’09, ICDE’09, SoCC’10, Cloud’10, CCGrid’11
  Springer book on “Economic Cloud Resource Management”, under preparation
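
  The partitioning step can be sketched as a basic consistent-hashing ring. This is a minimal illustration, assuming SHA-1 ring positions and one position per virtual node; the `HashRing` class and the node and key names are hypothetical, and the economic behaviour of the virtual nodes (migration, replication, deletion) is not modelled.

```python
import bisect
import hashlib
from typing import Dict, List


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (SHA-1 is an assumption)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Each virtual node owns the key range ending at its ring position."""

    def __init__(self) -> None:
        self._positions: List[int] = []
        self._owners: Dict[int, str] = {}

    def add_node(self, name: str) -> None:
        pos = ring_position(name)
        bisect.insort(self._positions, pos)
        self._owners[pos] = name

    def remove_node(self, name: str) -> None:
        pos = ring_position(name)
        self._positions.remove(pos)
        del self._owners[pos]

    def lookup(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring has no virtual nodes")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        if idx == len(self._positions):
            idx = 0  # wrap around to the first node on the ring
        return self._owners[self._positions[idx]]


# Hypothetical virtual nodes and keys, for illustration only.
ring = HashRing()
for i in range(4):
    ring.add_node(f"vnode-{i}")
keys = [f"sensor-{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("vnode-4")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

  Adding a virtual node only reassigns the keys that fall into the new node's range, so every key in `moved` now belongs to `vnode-4` and all other keys keep their owner. This locality is what makes it cheap for nodes to join, leave or replicate independently.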
TimeCloud

  A Cloud System for Massive Time Series Management
    –  Web-based time series management in the cloud
             •  Storage cloud, various time-series visualizations, group-based data sharing, …
             •  Potentially linked to third-party software, e.g. SensorMap, SwissEx Wiki
    –  Storage-and-computing platform for massive time series processing
             •  Built on Hadoop/HBase/GSN with the capability of handling data streams
             •  Very efficient model-based parallel time-series data processing

  [Figure: TimeCloud architecture, linking third parties and data streams; components: time-series compression, efficient data processing based on model-based views, distributed time-series processing]
Overview

  Sensor Data Management
  –    Global Sensor Networks
  –    Swiss Experiment
  –    Sensor Metadata Management
  –    Time Series compression and retrieval
  –    Sensor data analysis and quality
  –    Economics-based resource allocation in distributed clouds
  –    Cloud-based time series management system
  Web Data Management
  –  Large-scale Semantic Data Integration
  –  Web Stream Data Analysis (Twitter)
“The Wisdom of the Network”

Problem
• Schema heterogeneity is an inherent problem for enterprise cooperation networks
• Both manual and automated mapping are error-prone
• Interoperability challenges evolve constantly

Emergent semantics
• Establishing semantic interoperability as a self-organizing process within a
  community or social network
• Mappings are established in a localized, incremental manner

• Create mappings in a pay-as-you-go fashion
• Exploit the knowledge available in the network:
    •   Available mappings in the network
    •   Content features
    •   Social structure of the network
    •   User feedback
    •   Economic incentives
• Apply probabilistic reasoning techniques to improve mapping quality
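
One way to exploit the mappings already available in the network is to follow an attribute around a cycle of mappings and check whether it returns to itself. The sketch below is an illustrative assumption, not the method presented on this slide: mappings are plain attribute-to-attribute dictionaries, and the agreement score is a simple fraction that could serve as one piece of evidence for probabilistic reasoning. All names and schemas are hypothetical.

```python
from typing import Dict, List, Optional

# A mapping translates attribute names of one schema into those of another.
Mapping = Dict[str, str]


def follow_cycle(attribute: str, cycle: List[Mapping]) -> Optional[str]:
    """Translate an attribute along a chain of mappings; None if unmapped."""
    current = attribute
    for mapping in cycle:
        if current not in mapping:
            return None
        current = mapping[current]
    return current


def cycle_agreement(attributes: List[str], cycle: List[Mapping]) -> float:
    """Fraction of fully mapped attributes that return to themselves."""
    results = [(a, follow_cycle(a, cycle)) for a in attributes]
    comparable = [(a, r) for a, r in results if r is not None]
    if not comparable:
        return 0.0
    return sum(1 for a, r in comparable if a == r) / len(comparable)


# Hypothetical schemas A, B and C with one faulty mapping entry.
a_to_b = {"title": "name", "author": "creator"}
b_to_c = {"name": "label", "creator": "writer"}
c_to_a = {"label": "title", "writer": "title"}  # "writer" is mapped wrongly
score = cycle_agreement(["title", "author"], [a_to_b, b_to_c, c_to_a])
```

Here `title` returns to itself while `author` comes back as `title`, so the cycle signals that at least one mapping on the `author` path is wrong without identifying which one.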
Web Data Stream Analysis

  Classifying Twitter messages
    We would like to classify tweets containing a given keyword (e.g. “apple”)
     according to whether they are related to a given company or not
    Won the WePS 2010 tweet classification task
  Thank you for your attention!

  For more information please visit

                      http://lsir.epfl.ch/
