Dan Han, Eleni Stroulia
            University of Alberta
9/20/2012




                MESOCA 2012           1
Outline
            »   Background and Motivation
            »   Related Work
            »   A 3-Dimensional Data Model in HBase
            »   Case Study and Experiment Results
            »   Discussion
            »   Conclusions and Future Work
Migrating Applications
            To the Cloud
            » Cloud is an attractive computing platform
               ˃ Elasticity, Excellent Scalability, High Availability, Low Operating
                 Cost

            » Applications are moving to the cloud
               ˃ Social networking, online shopping, monitoring systems
               ˃ Time-series data: grows monotonically over time
               ˃ Analysis of large scale time-series data
                    + May lead to new knowledge
                    + May lead to improvements of existing services


            » Successful adoption of this paradigm requires a
              new model of storage

Migrating RDBMS Content
            To NoSQL
            » From RDBMS to NoSQL storage systems
               ˃ Enable storage of big data, sorted by row key
               ˃ Scale horizontally across storage nodes easily
               ˃ Offer little data-organization support


            » Migration challenges
               ˃ Little experience and few principles to follow
               ˃ Steep learning curve for programming
               ˃ Much experimentation is required before deployment
                    + Much time is spent in designing the data schema
                    + The “wrong” schema may lead to inefficient, high-latency queries
We need Design Patterns for
            HBase Schemas
            » Our objective is to develop a systematic method for
              guiding data organization in NoSQL databases, given
               ˃ the types of data stored
               ˃ the amount of data
               ˃ the data-usage patterns


            » We start our investigation with HBase
               ˃ A NoSQL database offering, built on top of Hadoop
               ˃ Parallel Distributed Computation
                   + MapReduce Framework
                   + Coprocessor Framework
Related Work
            » Talks at HBaseCon 2012, held in May
               ˃ Data schemas and coprocessors were the two main topics
               ˃ Experience from 30 enterprises, e.g., Facebook, Yapmap, eBay, Adobe


            » Organizing time-series data in period-specific “buckets”
               ˃ OpenTSDB: a distributed, scalable time-series database on top of
                 HBase
               ˃ A data model in Cassandra, another NoSQL database offering
               ˃ Applied in our case study
Data Organization in HBase
            » Cell in HBase
               ˃ (Row, Family: Column, Version) => (X, Y, Z) = value
               ˃ [Figure: 2-D layout with axes X and Y versus 3-D layout with axes X, Y, and Z]

            Schema/dimension   Row                     Family: Column       Version
            2-D                unique id - timestamp   varying properties   current timestamp
            3-D                unique id               varying properties   timestamps

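The difference between the 2-D and 3-D layouts can be sketched with plain Python dictionaries. This is a toy in-memory stand-in, not the HBase client API; the id "p1", the column "prop:mass", and the values are hypothetical and only illustrate where the timestamp ends up.

```python
from collections import defaultdict

# Toy stand-ins for an HBase table (assumption: plain dicts, no real HBase calls).
table_2d = {}                                          # row key -> {column: value}
table_3d = defaultdict(lambda: defaultdict(dict))      # row key -> column -> {version: value}

def put_2d(uid, ts, column, value):
    # 2-D layout: the timestamp is part of the row key, one row per (id, timestamp).
    table_2d.setdefault(f"{uid}-{ts}", {})[column] = value

def put_3d(uid, ts, column, value):
    # 3-D layout: the row key is the id alone; the timestamp is the cell version.
    table_3d[uid][column][ts] = value

# Hypothetical readings for one entity at three timestamps.
for ts, mass in [(24, 1.0), (29, 1.1), (36, 1.3)]:
    put_2d("p1", ts, "prop:mass", mass)
    put_3d("p1", ts, "prop:mass", mass)

rows_2d = sorted(table_2d)                 # three separate rows
history_3d = table_3d["p1"]["prop:mass"]   # one row, three versions
```

In the 2-D layout the history of one entity is spread over several rows, while in the 3-D layout it sits in a single row and is addressed by version.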
Case study:
             The Datasets
            » Cosmology Dataset
               ˃ Product of an N-body simulation
               ˃ Three types of particles: dark matter, gas and star
               ˃ Particles evolve over a series of discrete timestamps
               ˃ Each snapshot records the properties of all particles at
                 the time of the snapshot
               ˃ 9 snapshots, totaling 321,065,547 particles
            » Bixi Dataset
               ˃ Data from a bicycle-renting service in the city of
                 Montreal
               ˃ Every minute, sensors collect statistics about bike usage
                 at each station
               ˃ 100,800 timestamps across 404 stations

Three Schemas
              for the Cosmology Dataset
            Schema/dimension   Row                Family: Column        Version
            Schema1            sid-type-pid       particle properties   No meaning
            Schema2            type-pid           particle properties   Snapshot id
            Schema3            type-reversedpid   particle properties   Snapshot id

            [Figure: 3-D axes X, Y, and Z]

                          Schema1         Schema2      Schema3
            Region        24-2-33446666   2-33446666   2-00005533
            Region        64-2-33559999   2-33550000   2-66664433
            Region        84-2-33550000   2-33559999   2-99995533

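A minimal sketch of how the three row-key layouts could be built. The function names and the 8-digit zero padding are assumptions made for illustration; only the key patterns (sid-type-pid, type-pid, type-reversedpid) come from the table above.

```python
def schema1_key(sid, ptype, pid):
    # Schema1: snapshot id, particle type, and particle id all in the row key.
    return f"{sid}-{ptype}-{pid}"

def schema2_key(ptype, pid):
    # Schema2: the snapshot id moves to the version dimension.
    return f"{ptype}-{pid}"

def schema3_key(ptype, pid, width=8):
    # Schema3: like Schema2, but with the digits of the particle id reversed.
    # Assumption: ids are zero-padded to a fixed width before reversing.
    return f"{ptype}-{str(pid).zfill(width)[::-1]}"

k1 = schema1_key(24, 2, 33446666)
k2 = schema2_key(2, 33550000)
k3 = schema3_key(2, 33550000)
# Consecutive particle ids differ in the leading digit once reversed,
# so they no longer sort next to each other.
neighbours = [schema3_key(2, 33550000), schema3_key(2, 33550001)]
```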
The cosmology dataset
            » Dataset called “cosmo50”
              ˃ 9 snapshots

                               S-ID   Star particles   Total particles
                               24              1,291       33,555,723
                               29              5,568       33,559,998
                               36             20,246       33,574,630
                               45             67,268       33,620,890
                               60            259,219       33,800,108
                               84            907,025       34,369,014
                               128         2,743,966       35,908,164
                               216         6,396,955       38,889,220
                               512        12,417,544       43,787,800

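As a quick arithmetic check, the per-snapshot totals in this table add up to the 321,065,547 particles quoted for the dataset. The variable names are illustrative.

```python
# Total particles per snapshot id, copied from the cosmo50 table.
totals = {
    24: 33_555_723,
    29: 33_559_998,
    36: 33_574_630,
    45: 33_620_890,
    60: 33_800_108,
    84: 34_369_014,
    128: 35_908_164,
    216: 38_889_220,
    512: 43_787_800,
}

grand_total = sum(totals.values())
print(f"{len(totals)} snapshots, {grand_total:,} particles")
```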
Three Schemas
            for the Bixi Dataset
            Schema/dimension   Row        Family: Column       Version
            Schema1            hour-sid   minutes [0,59]       no meaning
            Schema2            hour-sid   monitoring metrics   minutes [0,59]
            Schema3            day-sid    monitoring metrics   minutes [0,1439]

            [Figure: layout of each schema. Schema1: axes X and Time; Schema2: axes X, metrics, and Time; Schema3: axes X, metrics, and Time]

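The three Bixi layouts differ in how one timestamp is split between row key, column, and version. The sketch below maps a single reading to a (row, column, version) triple under each schema. The key formats (for example "2010092413-17"), the function names, and the metric name "nbBikes" are assumptions for illustration only.

```python
from datetime import datetime

def bixi_schema1(ts, sid):
    # Row: hour-sid; column: minute within the hour; version carries no meaning.
    return (f"{ts:%Y%m%d%H}-{sid}", str(ts.minute), None)

def bixi_schema2(ts, sid, metric):
    # Row: hour-sid; column: the metric; version: minute within the hour, 0..59.
    return (f"{ts:%Y%m%d%H}-{sid}", metric, ts.minute)

def bixi_schema3(ts, sid, metric):
    # Row: day-sid; column: the metric; version: minute within the day, 0..1439.
    return (f"{ts:%Y%m%d}-{sid}", metric, ts.hour * 60 + ts.minute)

ts = datetime(2010, 9, 24, 13, 5)      # hypothetical reading time
cell1 = bixi_schema1(ts, 17)
cell2 = bixi_schema2(ts, 17, "nbBikes")
cell3 = bixi_schema3(ts, 17, "nbBikes")
```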
The Bixi dataset
            »   A period of 70 days, from Sep 24, 2010 to Dec 1, 2010
            »   100,800 timestamps
            »   404 stations involved
            »   Stored in an XML file

Experiment Results
            » Experiment Environment
               ˃ A four-node cluster of virtual machines running Ubuntu
               ˃ Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
               ˃ HBase configuration
                    + Replication factor of 2
                    + 5 KB caching size


            » Queries for each dataset
               ˃ Three queries on the cosmology dataset, taken from related research
               ˃ One query on the Bixi dataset, taken from a business requirement


            » Query-processing implementation
               ˃ Native Java API
               ˃ User-level coprocessor implementation

Query1 of Cosmology Dataset
            »   Get all the particles of a type: star
            »   in a single snapshot
            »   with a given property: tform
            »   whose property matches the expression
                ˃ [>0.01;84]
                ˃ [>0.08;128]
                ˃ [>0.05;128]
                ˃ [>0.08;216]
                ˃ [>0.08;512]
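The logic of Query1 can be sketched as a filter over in-memory rows. This is not the HBase or coprocessor implementation used in the experiments; the sample rows are made up, and reading each parameter pair as "threshold;snapshot id" is an assumption.

```python
# Hypothetical particle records: type, snapshot id, particle id, and tform.
particles = [
    {"type": "star", "sid": 84, "pid": 1, "tform": 0.02},
    {"type": "star", "sid": 84, "pid": 2, "tform": 0.005},
    {"type": "gas", "sid": 84, "pid": 3, "tform": 0.5},
    {"type": "star", "sid": 128, "pid": 4, "tform": 0.09},
]

def query1(rows, sid, threshold, ptype="star", prop="tform"):
    # Keep particles of the given type, in one snapshot, whose property exceeds the threshold.
    return [
        r["pid"]
        for r in rows
        if r["type"] == ptype and r["sid"] == sid and r.get(prop, float("-inf")) > threshold
    ]

matches = query1(particles, sid=84, threshold=0.01)   # corresponds to [>0.01;84]
```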
Query2 of Cosmology Dataset
            » Get all the particles added/destroyed
            » between s1 and s2
               ˃ [29;24]
               ˃ [60;24]
               ˃ [84;24]
               ˃ [128;24]
               ˃ [216;128]
               ˃ [216;24]
               ˃ [512;24]
               ˃ [512;128]
               ˃ [512;216]
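At its core, Query2 is a pair of set differences between the particle ids present in two snapshots. A minimal sketch with made-up ids follows; which snapshot plays the role of s1 in each parameter pair is not stated on the slide, so the argument names "earlier" and "later" are an assumption.

```python
def added_and_destroyed(earlier, later):
    # earlier, later: sets of particle ids present in each snapshot.
    added = later - earlier        # ids that appear only in the later snapshot
    destroyed = earlier - later    # ids that appear only in the earlier snapshot
    return added, destroyed

# Hypothetical particle ids in two snapshots.
snapshot_24 = {1, 2, 3, 4}
snapshot_29 = {2, 3, 4, 5, 6}
added, destroyed = added_and_destroyed(snapshot_24, snapshot_29)
```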
Query3 of Cosmology Dataset
            » Get the values of a property: star:eps
            » for a set of particle IDs: a contiguous range of particle IDs
            » across the selected snapshots
               ˃ 10;[24]
               ˃ 10;[24,512]
               ˃ 10;[24,60,128,512]
               ˃ 10;[24,29,60,84,128,512]
               ˃ 10;[24,36,45,60,84,128,216,512]
               ˃ 50;[24,29,84,512]
               ˃ 50;[24,29,36,45,60,84,128,216,512]
               ˃ 100;[24,29,36,45,60,84,128,216,512]
               ˃ 150;[24,29,36,45,60,84,128,216,512]

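With snapshot ids stored as versions, Query3 becomes one lookup per particle followed by a version filter. The sketch below uses a nested dictionary as a stand-in for the table; the row keys and eps values are hypothetical, and this is not the HBase API.

```python
# Toy 3-D table: row key -> column -> {snapshot id: value}.
table = {
    "2-100": {"star:eps": {24: 0.1, 29: 0.2, 512: 0.9}},
    "2-101": {"star:eps": {24: 0.3, 512: 0.7}},
    "2-102": {"star:eps": {29: 0.5}},
}

def query3(table, row_keys, column, snapshots):
    # For each requested row, keep only the versions that are selected snapshots.
    wanted = set(snapshots)
    return {
        key: {sid: val for sid, val in table.get(key, {}).get(column, {}).items() if sid in wanted}
        for key in row_keys
    }

result = query3(table, ["2-100", "2-101"], "star:eps", [24, 512])
```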
Bixi Query
            » For a given list of stations: 200 stations
            » get average bike usage in a given period
               ˃ [1day]
               ˃ [2day]
               ˃ [4day]
               ˃ [8day]
               ˃ [16day]
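The aggregation in the Bixi query can be sketched as an average over the readings of the selected stations within a time window. The data layout, the function name, and the sample values are assumptions; the slide does not specify which metric counts as "bike usage".

```python
from statistics import mean

def average_usage(readings, stations, start, end):
    # readings: {(station id, minute offset): usage value}
    # Average over the selected stations for minutes in [start, end).
    wanted = set(stations)
    values = [
        usage
        for (sid, minute), usage in readings.items()
        if sid in wanted and start <= minute < end
    ]
    return mean(values) if values else None

# Hypothetical readings: station 3 is not in the requested list.
readings = {(1, 0): 4, (1, 1): 6, (2, 0): 11, (3, 0): 100}
avg = average_usage(readings, stations=[1, 2], start=0, end=1440)   # a 1-day window
```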
Discussion
            »   “Qualitative” versus “Quantitative” Suggestions
            »   Dynamic Data versus Static Data
            »   Historical Datasets versus Real-Time Datasets
            »   Supported versus Non-Supported Datasets
Conclusion
            » The objective is to make queries local
            » To do that, you have to design the right key, so that each
              query traverses a range of keys
               ˃ that contains all the answers
               ˃ that contains little irrelevant data
            » But hotspotting occurs when
               ˃ ???
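Because rows are stored in row-key order, a query whose answers share a key prefix can be served by scanning one contiguous key range. The sketch below imitates such a scan over a sorted list with a binary search; the keys and type codes are hypothetical.

```python
from bisect import bisect_left

def scan(sorted_keys, start, stop):
    # Return the keys in [start, stop), mimicking a range scan over sorted row keys.
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

# Hypothetical type-pid row keys, kept in sorted order.
keys = sorted(["1-00000007", "2-00000001", "2-00000002", "2-00000003", "3-00000001"])

# All rows of type "2" form one contiguous range with no irrelevant rows in it.
type2_rows = scan(keys, "2-", "3-")
```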
Conclusion
            » A 3-dimensional data model
               ˃ Data schemas that use the version dimension of HBase
                 can improve performance
            » Fits “write-once, read-many” systems
               ˃ Monitoring systems
               ˃ Sensor-based systems
               ˃ Version-based analysis
Future Work
            » Further evaluation of this data model
               ˃ Scalability
               ˃ Elasticity
               ˃ Utilization
            » Designing data models for other datasets
               ˃ Spatial datasets
               ˃ Graphic datasets
Questions?

                          Thank you

A 3-Dimensional Data Model in HBase for Large Time-Series Datasets