Securely explore your data
WHAT'S NEXT FOR
BIGTABLE?
Adam Fuchs, CTO
Sqrrl Data, Inc.
May 22, 2014

What's Next for Google's BigTable



Adam Fuchs' presentation slides on what's next in the evolution of BigTable implementations (transactions, indexing, etc.) and what these advances could mean for the massive database that gave rise to Google.

Published in: Data & Analytics


  1. Securely explore your data
     WHAT'S NEXT FOR BIGTABLE?
     Adam Fuchs, CTO, Sqrrl Data, Inc.
     May 22, 2014
  2. TODAY’S TALK
     •  History of the World: Part 3
     •  Bigtable/Accumulo Technology Overview
     •  Accumulo Demonstration
     •  Database Technology Survey
     © 2014 Sqrrl Data, Inc. | All Rights Reserved
  3. TIMELINE OF RELEVANT EVENTS
     •  2006: Google’s BigTable Paper
     •  2008: NSA Builds Accumulo
     •  2011: NSA Open Sources Accumulo
     •  2012: Sqrrl Founded
     •  2013: 1st Sqrrl Release and Customers
  4. Accumulo is a:
     •  Apache Software Foundation (ASF) Open-Source Software Project
     •  Clone of Google’s Bigtable
     •  Secure, Sorted Key-Value Store
     •  Row-level ACID (locally) Distributed NoSQL Database
  5. Sqrrl is:
     •  A commercial software company located in Cambridge, MA
     •  A search and exploration platform built with Apache Accumulo
     •  An exciting startup with a long roadmap of challenging problems to solve
     •  Hiring!
  6. (No text on this slide.)
  7. BIGTABLE & ACCUMULO TECH OVERVIEW
     1.  Data Model & API
     2.  Underlying Architecture
     3.  Distinguishing Features
  8. ACCUMULO DATA FORMAT
     An Accumulo key is a 5-tuple, consisting of:
     •  Row: Controls Atomicity
     •  Column Family: Controls Locality
     •  Column Qualifier: Controls Uniqueness
     •  Visibility Label: Controls Access
     •  Timestamp: Controls Versioning

     Accumulo Key/Value Example

     Row       Col. Fam.      Col. Qual.      Visibility    Timestamp   Value
     John Doe  Notes          PCP             PCP_JD        20120912    Patient suffers from an acute …
     John Doe  Test Results   Cholesterol     JD|PCP_JD     20120912    183
     John Doe  Test Results   Mental Health   JD|PSYCH_JD   20120801    Pass
     John Doe  Test Results   X-Ray           JD|PHYS_JD    20120513    1010110110100 …
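The example table above can be mimicked with a small in-memory sketch. This is not the Accumulo API: `Key`, `sort_order`, `is_visible`, and `scan` are hypothetical names, the visibility check handles only flat `A|B` labels like those on the slide, and the sketch assumes keys sort field by field with newer timestamps first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """The five key fields listed on the slide."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_order(key):
    # Assumption: fields compare in order, and newer timestamps sort first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def is_visible(expression, authorizations):
    # Toy evaluator: only flat "A|B" labels, true if the reader holds any term.
    return any(term in authorizations for term in expression.split("|"))


def scan(table, authorizations):
    """Return entries in key order, dropping cells the reader may not see."""
    ordered = sorted(table.items(), key=lambda item: sort_order(item[0]))
    return [(k, v) for k, v in ordered if is_visible(k.visibility, authorizations)]


# The four rows from the slide (values truncated there are truncated here too).
TABLE = {
    Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513): "1010110110100 …",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute …",
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801): "Pass",
}

if __name__ == "__main__":
    for key, value in scan(TABLE, {"PCP_JD"}):
        print(key.family, key.qualifier, value)
```

A reader holding only `PCP_JD` sees the notes and the cholesterol result, while a reader holding `JD` sees the three test results but not the notes.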
  9. THE ACCUMULO CLIENT API
     •  Instance: new ZooKeeperInstance(...), new MockInstance()
     •  Connector (from getConnector(...)): TableOperations, InstanceOperations, SecurityOperations
     •  Scanner, BatchScanner (from createScanner(...), createBatchScanner(...)): Range, IteratorOption; iterator() yields Map.Entry of Key and Value
     •  BatchWriter (from createBatchWriter(...)): addMutation(...) takes a Mutation
  10. ACCUMULO TABLETS
      •  Collections of KV pairs form Tables
      •  Tables are partitioned into Tablets
      •  Metadata tablets hold info about other tablets, forming a 3-level hierarchy
      •  A Tablet is a unit of work for a Tablet Server

      [Diagram]
      Well-Known Location (zookeeper)
        → Root Tablet: -∞ to ∞
          → Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
          → Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
            → Table: Adam’s Table. Data Tablets: -∞ : thing; thing : ∞
            → Table: Encyclopedia. Data Tablets: -∞ : Ocelot; Ocelot : Yak; Yak : ∞
            → Table: Foo. Data Tablet: -∞ to ∞
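Finding the tablet that holds a given row amounts to a binary search over the table's split points. A minimal sketch using the three tables from the diagram follows; `TABLE_SPLITS` and `tablet_for` are hypothetical names, and the sketch assumes each tablet includes its end row and excludes its start row.

```python
from bisect import bisect_left

# Split points per table, taken from the slide's diagram.
TABLE_SPLITS = {
    "Adam's Table": ["thing"],
    "Encyclopedia": ["Ocelot", "Yak"],
    "Foo": [],
}


def tablet_for(table, row):
    """Return the (start, end) range of the tablet that holds `row`.

    Assumption: a tablet covers (start, end], so a row equal to a split
    point belongs to the tablet that ends at that split.
    """
    splits = TABLE_SPLITS[table]
    index = bisect_left(splits, row)
    start = splits[index - 1] if index > 0 else "-∞"
    end = splits[index] if index < len(splits) else "∞"
    return (start, end)


if __name__ == "__main__":
    for row in ["Aardvark", "Panda", "Zebra"]:
        print(row, "->", tablet_for("Encyclopedia", row))
```

The metadata tablets can be searched the same way, which is what makes the three-level lookup (root tablet, metadata tablet, data tablet) cheap for a client.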
  11. ACCUMULO PROCESSES
      [Diagram]
      •  Components: Application (three shown), Tablet Server with Tablets (three shown), Zookeeper (three shown), Master, HDFS
      •  Edge labels: Read/Write; Store/Replicate; Assign/Balance; Delegate Authority (two shown)
  12. TABLET DATA FLOW
      [Diagram of one Tablet]
      •  Writes → In-Memory Map and Write Ahead Log (For Recovery)
      •  Minor Compaction (Iterator Tree): In-Memory Map → Sorted, Indexed File
      •  Merging / Major Compaction (Iterator Tree): Sorted, Indexed Files → Sorted, Indexed File
      •  Reads → Scan (Iterator Tree) over the In-Memory Map and the Sorted, Indexed Files
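The write, flush, and merge path in this diagram can be sketched as a toy log-structured store. This is an illustration under simplifying assumptions, not Accumulo's implementation: the `Tablet` class and its method names are hypothetical, files are plain Python lists, and a sequence number stands in for the timestamp.

```python
import heapq


def newest_per_key(runs):
    """Merge sorted runs of (key, seq, value), keeping the newest value per key."""
    latest = {}
    for key, seq, value in heapq.merge(*runs):
        latest[key] = (seq, value)  # merge order is (key, seq), so later wins
    return [(key, seq, value) for key, (seq, value) in latest.items()]


class Tablet:
    def __init__(self):
        self.write_ahead_log = []   # replayed after a crash (recovery not shown)
        self.in_memory_map = {}     # key -> (seq, value)
        self.sorted_files = []      # each file is a sorted list of (key, seq, value)
        self._seq = 0

    def write(self, key, value):
        self._seq += 1
        self.write_ahead_log.append((key, self._seq, value))
        self.in_memory_map[key] = (self._seq, value)

    def _memory_run(self):
        return sorted((k, s, v) for k, (s, v) in self.in_memory_map.items())

    def minor_compaction(self):
        """Flush the in-memory map to a new sorted file."""
        if self.in_memory_map:
            self.sorted_files.append(self._memory_run())
            self.in_memory_map.clear()
            self.write_ahead_log.clear()  # flushed data no longer needs the log

    def major_compaction(self):
        """Merge all sorted files into one."""
        self.sorted_files = [newest_per_key(self.sorted_files)]

    def scan(self):
        """Read through a merged view of memory and files."""
        runs = self.sorted_files + [self._memory_run()]
        return [(key, value) for key, _, value in newest_per_key(runs)]


if __name__ == "__main__":
    tablet = Tablet()
    tablet.write("b", "1")
    tablet.write("a", "2")
    tablet.minor_compaction()
    tablet.write("b", "3")
    print(tablet.scan())
```

Reads never modify the files; they only merge sorted inputs, which is why the same iterator-style merge can serve scans and both kinds of compaction.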
  13. ITERATOR FRAMEWORK
      Iterator Operations:
      •  File Reads
      •  Block Caching
      •  Merging
      •  Deletion
      •  Isolation
      •  Locality Groups
      •  Range Selection
      •  Column Selection
      •  Cell-level Security
      •  Versioning
      •  Filtering
      •  Aggregation
      •  Partitioned Joins
  14. WORD COUNT: SUMMING AGGREGATING ITERATOR
      [Diagram: Input Corpus]
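The idea behind a summing aggregating iterator for word count can be sketched in a few lines: ingest writes one entry per word occurrence, and an iterator over the sorted entries collapses entries with the same key into their sum. The function names `ingest` and `summing_iterator` and the sample corpus are hypothetical, and this is plain Python rather than Accumulo's iterator interface.

```python
from itertools import groupby


def ingest(corpus):
    """Emit a (word, 1) entry per occurrence, sorted as a table would store them."""
    return sorted((word, 1) for line in corpus for word in line.lower().split())


def summing_iterator(sorted_entries):
    """Collapse adjacent entries that share a key into a single summed entry."""
    for word, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    corpus = ["the quick fox", "the lazy dog the"]
    for word, total in summing_iterator(ingest(corpus)):
        print(word, total)
```

Because summation is associative, the same function can run over partial results, so counts already combined during one compaction can be combined again with newer entries later.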
  15. ACCUMULO LATENCIES
      [Diagram: Ingesters → Tablet Servers → Queriers]
      •  Ingesters: Input → Batch Writer
      •  Tablet Servers (three shown): In-Memory Map, RFiles, Compaction Iterators, Scan Iterators
      •  Queriers: Scanner / Batch Scanner → Output
      •  Latency labels along the path: ~ms, ~ms, ~ms, ms-min
  16. ACCUMULO THROUGHPUT
      [Diagram: Ingesters → Tablet Servers → Queriers]
      •  Ingesters: Input → Batch Writer
      •  Tablet Servers (three shown): In-Memory Map, RFiles, Compaction Iterators, Scan Iterators
      •  Queriers: Scanner / Batch Scanner → Output
      •  Latency labels along the path: ~ms, ~ms, ~ms, ms-min

      Scan: ~1M entries/s per node
      Ingest: ~200K entries/s per node
      Read-Modify-Write Latency: ~ms → >1K entries/s challenging with R-M-W
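One way to read the last figure is as a back-of-the-envelope bound. Assuming read-modify-write updates are applied one after another and each takes about 1 ms (an assumption; the slide only says "~ms"), the achievable rate is roughly the inverse of the latency:

```latex
\text{R-M-W rate} \approx \frac{1}{\text{latency}} = \frac{1}{10^{-3}\,\text{s}} = 10^{3}\ \text{entries/s}
```

That is about two orders of magnitude below the ~200K entries/s quoted for plain ingest.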
  17. DEMO
      Securely explore your data
  18. R-M-W VS. COMPACTION-TIME AGGREGATION
      Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
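The contrast can be illustrated with two toy counters. Both classes are hypothetical stand-ins, not HBase or Accumulo code: the first reads the current value on every update, while the second only appends deltas and defers the summation to a compaction step or to read time.

```python
class ReadModifyWriteCounter:
    """Each increment reads the stored value, modifies it, and writes it back."""

    def __init__(self):
        self.store = {}
        self.reads_on_write_path = 0

    def increment(self, key, delta):
        self.reads_on_write_path += 1            # the read in read-modify-write
        self.store[key] = self.store.get(key, 0) + delta

    def get(self, key):
        return self.store.get(key, 0)


class CombinerCounter:
    """Each increment is a blind append; sums are computed later."""

    def __init__(self):
        self.entries = []                        # (key, delta) pairs
        self.reads_on_write_path = 0             # stays at zero

    def increment(self, key, delta):
        self.entries.append((key, delta))

    def compact(self):
        """Compaction-time aggregation: replace deltas with one sum per key."""
        totals = {}
        for key, delta in self.entries:
            totals[key] = totals.get(key, 0) + delta
        self.entries = sorted(totals.items())

    def get(self, key):
        # Scan-time aggregation over whatever has not been compacted yet.
        return sum(delta for k, delta in self.entries if k == key)


if __name__ == "__main__":
    rmw, combiner = ReadModifyWriteCounter(), CombinerCounter()
    for word in ["the", "fox", "the"]:
        rmw.increment(word, 1)
        combiner.increment(word, 1)
    print(rmw.get("the"), combiner.get("the"))
```

Both counters return the same totals, but only the read-modify-write version puts a read on the write path of every update.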
  19. SURVEY OF DATABASE TECHNOLOGY
      •  Exercises in Center-Seeking
      •  SQL vs. NoSQL
      •  Ingest-time vs. Query-time Analytics
      •  ACID vs. BASE
      •  Normalized vs. Denormalized Data Models
      •  Primary Use Cases for Sqrrl+Accumulo
  20. SQL VS. NOSQL
      NoSQL
      •  Optimized for get/put operations
      •  Specialized for client languages
      •  High concurrency
      •  More client-side control
      Hybrid
      •  Extend and evolve SQL
      •  Standardize and incorporate NoSQL paradigms
      SQL
      •  Optimized for joins
      •  Strong mathematical roots in set theory
      •  Automatic query optimization
  21. INGEST-TIME VS. QUERY-TIME ANALYTICS
      Ingest-Time
      •  Optimized for online statistics
      •  Can reduce storage footprint
      •  Can be indexed for low latency
      •  Leverages a variety of indexes
      •  Requires extensive data organization at ingest
      Hybrid
      •  Create partial summary at ingest (Question-focused datasets, knowledge bases, etc.)
      •  Support ad-hoc queries over summaries
      •  Leverage all known indexing strategies
      Query-Time
      •  Can compute holistic statistics, like ranking, topN, etc.
      •  Ad-hoc analytics: don’t know the query ahead of time
      •  High latency and low concurrency at scale
      •  Leverages block indexes, columnar layout
      •  Ingest can be “stream to disk”
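A small sketch of the three approaches, using made-up events: the ingest-time path keeps a running count per user so lookups are cheap, the query-time path scans the raw events on demand, and the hybrid path answers a holistic question (top N) from the ingest-time summary. `EVENTS`, `IngestTimeCounts`, `query_time_count`, and `top_users` are hypothetical names for illustration only.

```python
from collections import Counter

# Hypothetical (user, method) events standing in for an ingest stream.
EVENTS = [
    ("alice", "POST"), ("bob", "GET"), ("alice", "GET"),
    ("alice", "POST"), ("carol", "GET"), ("bob", "GET"),
]


class IngestTimeCounts:
    """Ingest-time: update an online statistic as each event arrives."""

    def __init__(self):
        self.by_user = Counter()

    def ingest(self, event):
        user, _method = event
        self.by_user[user] += 1

    def count(self, user):
        return self.by_user[user]      # a lookup, no scan of raw data


def query_time_count(raw_events, user):
    """Query-time: raw events were just streamed to storage; scan them now."""
    return sum(1 for event_user, _ in raw_events if event_user == user)


def top_users(summary, n):
    """Hybrid: an ad-hoc, holistic query (top N) over the partial summary."""
    return summary.by_user.most_common(n)


if __name__ == "__main__":
    summary = IngestTimeCounts()
    for event in EVENTS:
        summary.ingest(event)
    print(summary.count("alice"), query_time_count(EVENTS, "alice"))
    print(top_users(summary, 2))
```

The summary is much smaller than the raw events, but it can only answer questions that were anticipated when it was designed; the raw scan can answer anything at a higher cost per query.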
  22. ACID VS. BASE
      ACID
      •  Atomicity: all or nothing for a group of operations
      •  Consistency and Isolation: support simple reasoning for distributed, multithreaded clients
      •  Durability: simple reasoning for whether data might be lost
      Hybrid
      •  Must make some relaxations for performance at scale (under failure modes)
      •  Many options for “Lightweight” transaction support
      •  Accumulo limits atomicity, consistency, and isolation to row-level operations
      BASE
      •  Basically Available: ensure that core operations always complete in an advertised time
      •  Soft-State: relaxation of referential integrity, etc.
      •  Eventual Consistency: relaxation of …
  23. NORMALIZED VS. DENORMALIZED DATA MODELS
      Normalized
      •  “Normal Form Relational Database”
      •  Minimizes data footprint
      •  Minimizes cost of data maintenance
      •  Can lead to expensive joins at query time
      Hybrid
      •  Start with document store
      •  Introduce links/edges for quick joins
      •  Dynamically adapt to flexible or sparse schemas
      •  Similar to property graphs
      Denormalized
      •  “Document Store”
      •  Flexible schema lets applications adapt quickly to changing environments
      •  Pre-joined to eliminate joins at query-time
      •  Optimized for “append-only” data
      •  Can inflate data sizes and slow data ingest
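The trade-off can be made concrete with a tiny made-up data set. In the normalized form each user is stored once and a lookup needs a join; in the denormalized form each document carries its own copy of the user, so lookups need no join but the data is duplicated. `USERS`, `REQUESTS`, and the function names are hypothetical.

```python
# Normalized: users stored once, requests refer to them by id.
USERS = {
    1: {"name": "John Doe", "city": "Cambridge"},
    2: {"name": "Jane Roe", "city": "Waltham"},
}
REQUESTS = [
    {"request_id": 100, "user_id": 1, "method": "POST"},
    {"request_id": 101, "user_id": 1, "method": "GET"},
    {"request_id": 102, "user_id": 2, "method": "GET"},
]


def normalized_city(request_id):
    """Join at query time: find the request, then follow user_id to the user."""
    request = next(r for r in REQUESTS if r["request_id"] == request_id)
    return USERS[request["user_id"]]["city"]


def denormalize(users, requests):
    """Pre-join at ingest: embed a copy of the user in every request document."""
    return [{**request, "user": dict(users[request["user_id"]])} for request in requests]


def denormalized_city(documents, request_id):
    """No join needed: the document already contains the user."""
    document = next(d for d in documents if d["request_id"] == request_id)
    return document["user"]["city"]


def copies_of_name(documents, name):
    """Count how many times a user's name is stored in the documents."""
    return sum(1 for d in documents if d["user"]["name"] == name)


if __name__ == "__main__":
    documents = denormalize(USERS, REQUESTS)
    print(normalized_city(101), denormalized_city(documents, 101))
    print(copies_of_name(documents, "John Doe"))
```

Changing a user's city touches one record in the normalized form but every embedded copy in the denormalized form, which is the maintenance cost the slide refers to.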
  24. KNOWLEDGE-BASE USE CASE
      [Diagram]
      •  Data sources: HR, Netflow, Proxy Logs, Email, Social Media
      •  Sample log record: 2014-04-14 06:36:09 429 73.105.179.202 username@msn.com 500 POST application/json HTTPS “wikipedia.org:443/grouchinesses/?215=felled&297=wading&768=shimmies...” “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31” 208.80.152.201
  25. STREAM PROCESSING USE CASE
      [Diagram: DATA → SPE, with Sqrrl feeding Dashboards, Actions, and Interactive Analysis Tools (Discovery + Forensics)]
      1.  SPE queries Sqrrl to enrich streaming data
      2.  SPE persists results in Sqrrl for future query
      3.  SPE takes action automatically
      4.  SPE issues data-driven alerts
      5.  Sqrrl provides context for dashboards
      6.  Analysis tools use Sqrrl to search and manipulate historical data
  26. SQRRL OPERATIONALIZES ACCUMULO WITH...
      •  Data-Centric Security
      •  Petabyte Scale and Operational Speeds
      •  Document and Graph Data Models
      •  SqrrlQL, including Aggregates, Secure Full-Text Search, and Secure Graph Search
      •  Analytics, including Real-Time Statistics and Hadoop Integrations
  27. MODERNIZING VISUALIZATION
      Sqrrl is building the next generation of operational analytics visualizations
  28. UPCOMING EVENTS
      Accumulo Summit 2014
      •  June 12 in College Park, MD
      •  http://accumulosummit.com
      •  Multiple tracks of talks from the leaders of the Accumulo community
      IEEE HPEC Conference 2014
      •  September 9-11 in Waltham, MA
      •  http://www.ieee-hpec.org/
      •  Accumulo Users Group Meeting as a Special Event
      •  Accumulo tutorial
      Watch for more meetup opportunities coming soon!
