Big Data Schema Design

               Deepak
Overview
•   Schema design is vital for performance.
•   Keywords: non-relational, NoSQL, distributed
•   Underlying file systems: GFS, HDFS
•   Examples: Hadoop, GFS, HBase, Bigtable, etc.
•   Example implementations: Facebook, Walmart,
    etc.
When to use
• Typically for systems with hundreds of
  millions or billions of rows
• Data volumes on the order of hundreds or
  thousands of terabytes
• No advanced query language needed
• Typed columns or other RDBMS features not
  needed
Hadoop Architecture
Hadoop Ecosystem
HBase Architecture
Overview
• HBase runs on top of HDFS
• HDFS was chosen for its fault tolerance,
  checksumming, and failover properties
• Clients use the native Java API or the REST API
• The Master manages the cluster; Region Servers
  manage the data
HBase Data Model
• Table: design-time namespace, has many rows.
• Row: atomic key/value container, with one row
  key
• Column Family: divides columns into physical files
• Column: a key in the k/v container inside a row
• Timestamp: a long in milliseconds; versions are
  sorted in descending order
• Value: a time-versioned value in the k/v container
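
The logical model above can be sketched as a nested map. This is a minimal in-memory illustration in Python, not the HBase client API; the class name `ToyTable` and its methods are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> (family, column) -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed when the table is designed.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Each write adds a new version keyed by its timestamp.
        self.rows[row_key][(family, column)][timestamp] = value

    def get(self, row_key, family, column, max_versions=1):
        versions = self.rows.get(row_key, {}).get((family, column), {})
        # Versions come back newest first (timestamps sorted descending).
        return sorted(versions.items(), reverse=True)[:max_versions]

    def scan(self, start_row, stop_row):
        # Rows are visited in sorted row-key order; stop_row is exclusive.
        for row_key in sorted(self.rows):
            if start_row <= row_key < stop_row:
                yield row_key
```

Columns can be added freely at write time, while an unknown column family is rejected, which mirrors the design-time vs. run-time split described above.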
Distribution
More distribution
Thoughts on the logical view
• The unit of scalability is the Region.
• Rows are not tied to a server. They may be
  moved around for load balancing.
• Add nodes so that there are not too many
  regions per node
• Too many regions per node works against
  distribution
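
A rough sketch of why rows are not tied to a server: a row key maps to a region (a sorted key range), and the region maps to whichever server currently hosts it. The start keys and server names below are made-up values for illustration, and the functions are hypothetical, not HBase APIs.

```python
import bisect

# Assumed layout: each region covers [start_key, next_start_key).
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_TO_SERVER = {"": "server-a", "g": "server-b", "n": "server-a", "t": "server-c"}


def find_region(row_key):
    """Return the start key of the region whose range contains row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_START_KEYS[index]


def move_region(start_key, new_server):
    """Rebalance by reassigning a whole region; row keys are unchanged."""
    REGION_TO_SERVER[start_key] = new_server
```

Moving a region changes only the region-to-server mapping, so load balancing never requires rewriting row keys.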
Column Family
• Each Column Family represents a physical storage
  unit (a directory)
• Data that is queried together should be stored
  together.
• Features such as compression can be enabled per
  Column Family
Bloom Filter
• Generated automatically when an HFile is
  flushed to disk
• Held in main memory
• Contains row keys (RK)
• Column keys (CK) can be stored along with the
  RK, but that might overload the memory.
• Can only filter on what is stored in it.
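
To make the idea concrete, here is a minimal Bloom filter sketch in Python. It is a generic illustration, not HBase's implementation; the bit count, the number of hashes, and the use of SHA-256 are assumptions chosen for simplicity.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by seeding one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

A read can skip any file whose filter reports that the row key is absent. Adding row-plus-column keys makes the filter more selective but grows the number of entries it has to hold in memory.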
Physical View
Key Cardinality
Tall vs Fat Tables
• Fat tables: large amounts of data in the
  columns of each row.
• Tall tables: large numbers of rows.
• Tall is good for searches or scans
• Fat is good for fetches or gets
• Rows don’t split: a row always lives in one region
• Atomicity is only at the row level; when data is
  spread over several rows with compound keys,
  atomicity across those rows is not guaranteed
Key Design
• Sequential keys: for example, a timestamp as the key
• Sequential keys keep hotspotting a single
  region.
• Salting: prefix the key to distribute the records
• Field promotion
• Random keys
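
The key-design options above can be sketched in Python as follows. The bucket count, the key formats, and the function names are assumptions for illustration; field promotion is shown here as moving a field (a device ID) in front of the timestamp in the row key.

```python
import hashlib

NUM_BUCKETS = 4  # assumed number of salt buckets


def salted_key(row_key, num_buckets=NUM_BUCKETS):
    """Prefix the key with a bucket derived from its hash.

    The salt is deterministic, so the same key always lands in the same bucket.
    """
    digest = hashlib.sha256(row_key.encode()).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}-{row_key}"


def promoted_key(device_id, timestamp_ms):
    """Field promotion: lead with a field so writes spread by device."""
    return f"{device_id}-{timestamp_ms:013d}"


def scan_prefixes(num_buckets=NUM_BUCKETS):
    """A range scan over salted keys has to visit every bucket prefix."""
    return [f"{bucket:02d}-" for bucket in range(num_buckets)]
```

Salting spreads sequential writes across regions, at the cost that a time-range scan must be issued once per bucket. Fully random keys spread writes as well, but give up ordered scans altogether.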
Key Design Performance
Summary
• Think twice before you decide on NoSQL
  technologies
• Avoid hotspots
• Store values at appropriate places
• Choose the right keys
• Store inferences in an RDBMS if necessary
Visit us:

   Facebook: http://www.facebook.com/QBurst
        Twitter: http://twitter.com/qburst
 Google+: https://plus.google.com/+qburst/posts
LinkedIn: http://www.linkedin.com/company/qburst
YouTube: http://www.youtube.com/QBurstVideos


                www.qburst.com
