Introduction to Accumulo

An introduction to Apache Accumulo and related technologies for data analysis

Published in: Data & Analytics

Introduction to Accumulo

  1. 1. Introduction to Accumulo. Mario Pastorelli (mario.pastorelli@teralytics.ch), March 7, 2016
  6. 6. History. To accommodate their needs for analysis of large amounts of data on commodity hardware, Google developed three main distributed systems:
       – GFS: distributed filesystem
       – MapReduce: distributed data processing
       – BigTable: distributed storage system for structured data
     Accumulo is an open-source implementation of BigTable.
  9. 9. Distributed Structured Data. Structured data should be
       – distributed, for parallel processing
       – indexed, for fast retrieval (“structured” means that it has some kind of “primary key”)
       – tabular, for easy processing of complex data; each row can potentially have many columns
     Databases offer indexes and tables but don’t scale without significant effort.
     Key-value stores can easily be distributed, but they have limited index support over keys and no support for a tabular format out of the box.
  10. 10. Accumulo. Accumulo is a key-value store with support for tabular data
        – keys are column identifiers, i.e. they uniquely identify a column of a row
        – a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id
  11. 11. Example

        EMAIL                     NAME    LASTNAME  COMPANY
        olismith85@gmail.com      Olivia  Smith     Winsystems
        emily.brown@facebook.com  Emily   Brown     Jones Inc.

                                    ⇓

        KEY (composed of row id and column id)    VALUE
        row id                    column id
        olismith85@gmail.com      NAME            Olivia
        olismith85@gmail.com      LASTNAME        Smith
        olismith85@gmail.com      COMPANY         Winsystems
        emily.brown@facebook.com  NAME            Emily
        emily.brown@facebook.com  LASTNAME        Brown
        emily.brown@facebook.com  COMPANY         Jones Inc.
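The flattening in the example above can be sketched in a few lines of Python. This is only an illustration with plain dictionaries, not the Accumulo client API; the `flatten` helper and the `:` separator between row id and column id are assumptions added for readability.

```python
# Illustrative only: plain Python, not the Accumulo client API.
rows = [
    {"EMAIL": "olismith85@gmail.com", "NAME": "Olivia",
     "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    {"EMAIL": "emily.brown@facebook.com", "NAME": "Emily",
     "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
]


def flatten(row, id_column="EMAIL", sep=":"):
    """Turn one tabular row into (key, value) pairs.

    The key is the row id followed by the column id. `sep` is a
    hypothetical separator chosen for this sketch.
    """
    row_id = row[id_column]
    return [(row_id + sep + column, value)
            for column, value in row.items()
            if column != id_column]


# Sorting mimics the lexicographic key order of the store, so all
# columns of one row end up next to each other.
key_values = sorted(pair for row in rows for pair in flatten(row))

for key, value in key_values:
    print(key, "->", value)
```

Because every key starts with the row id, sorting the keys keeps each row's columns adjacent, which is what lets a key-value store present them as one row.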
  13. 13. Composite Keys. Keys in Accumulo are composite and have the following components:
        – row id: the row the key belongs to
        – column family: the “column group” the key belongs to
        – column qualifier: the column id
        – column visibility: who can access this column
        – timestamp: the version of the key
      A single key-value is stored as

        KEY                                                  VALUE
        row id    column                          timestamp
                  family  qualifier  visibility
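A composite key and its ordering can be modelled as a small sketch. The `Key` tuple below only mirrors the component names from the slide and is not the Accumulo `Key` class; the family name `contact`, the visibility label `public`, and the second company value are made up for illustration. The sketch assumes that newer timestamps sort first, so the latest version of a column is the first one encountered.

```python
from collections import namedtuple

# Component names follow the slide; this is not the Accumulo Key class.
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_order(key):
    """Compare keys component by component, newest timestamp first."""
    return (key.row, key.family, key.qualifier, key.visibility,
            -key.timestamp)


# Hypothetical entries: two versions of one column, plus another row.
entries = {
    Key("olismith85@gmail.com", "contact", "COMPANY", "public", 1): "Winsystems",
    Key("olismith85@gmail.com", "contact", "COMPANY", "public", 2): "Acme",
    Key("emily.brown@facebook.com", "contact", "NAME", "public", 1): "Emily",
}

ordered = sorted(entries, key=sort_order)


def latest_value(row, family, qualifier):
    """Return the value of the newest version of a column, or None."""
    for key in ordered:
        if (key.row, key.family, key.qualifier) == (row, family, qualifier):
            return entries[key]
    return None


print(latest_value("olismith85@gmail.com", "contact", "COMPANY"))
```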
  18. 18. Accumulo features
        – range queries: keys are stored in lexicographical order, which allows querying “semantically close” data; e.g. temporal data can be stored such that aggregation over nearby days is local and fast
        – fast: with a proper key schema a query can take milliseconds
        – scalable: designed to store huge amounts of data over multiple tables
        – built-in cache for recently queried data
        – many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) . . .
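The range-query idea can be illustrated with a sorted list and binary search. This is a toy model, not Accumulo code: the daily counters are invented, and ISO dates are used as keys because their lexicographic order matches chronological order.

```python
from bisect import bisect_left

# Hypothetical daily counters keyed by ISO date.
table = sorted([
    ("2016-03-01", 12),
    ("2016-03-02", 7),
    ("2016-03-05", 3),
    ("2016-02-27", 9),
    ("2016-03-07", 5),
])
keys = [key for key, _ in table]


def range_scan(start, end):
    """Return the entries with start <= key < end.

    Two binary searches find the boundaries; everything between them
    is contiguous because the keys are kept sorted.
    """
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, end)
    return table[lo:hi]


first_days = range_scan("2016-03-01", "2016-03-06")
total = sum(count for _, count in first_days)
print(first_days, total)
```

Aggregating nearby days touches one contiguous slice of the sorted keys, which is why a well-chosen key layout keeps such queries local.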
  19. 19. Example. We want to store and analyze tweets from all around the world.
  20. 20. Example: Tweets analysis. A tweet has the following (simplified) fields:
        – coordinate: geospatial information composed of longitude and latitude
        – created at: UTC time of the tweet
        – id: unique identifier of the tweet
        – user information, such as
            user.id: unique identifier of the user
            user.screen name: user name
            . . .
        – entities, such as hashtags, urls . . .
        – text: tweet content
        – . . .
      How do we store this data in Accumulo?
  22. 22. Example: Tweets analysis. There is no single way to do it; it depends on the query. Two good practices:
        – work with denormalized data
        – specialize tables for each kind of query
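One way to picture the two practices is to write every tweet, in full, to one table per query. The sketch below uses plain dictionaries instead of Accumulo tables; the table names, the `/` separator in row ids, and the sample tweets are all hypothetical.

```python
# Two hypothetical tables, each a dict of row id -> {column: value}.
timeline_table = {}
hashtag_table = {}


def store(tweet):
    """Write the same (denormalized) tweet to every specialized table."""
    columns = {"text": tweet["text"],
               "hashtags": ",".join(tweet["hashtags"])}
    # Query 1: tweets of one user. The row id starts with the user id.
    timeline_table[tweet["user_id"] + "/" + tweet["id"]] = columns
    # Query 2: tweets with one hashtag. The row id starts with the hashtag.
    for tag in tweet["hashtags"]:
        hashtag_table[tag + "/" + tweet["id"]] = columns


def tweets_with_hashtag(tag):
    """Texts of all tweets stored under the given hashtag prefix.

    A real sorted store would answer this with a range scan; the
    sketch simply filters the sorted row ids by prefix.
    """
    prefix = tag + "/"
    return [columns["text"]
            for row_id, columns in sorted(hashtag_table.items())
            if row_id.startswith(prefix)]


# Made-up sample data.
store({"id": "1", "user_id": "42", "text": "hello #accumulo",
       "hashtags": ["accumulo"]})
store({"id": "2", "user_id": "7", "text": "#accumulo and #bigtable",
       "hashtags": ["accumulo", "bigtable"]})

print(tweets_with_hashtag("accumulo"))
```

The tweet content is duplicated across tables on purpose: each table answers its query from a single key prefix, at the cost of extra storage and extra writes.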
  24. 24. Example: Twitter User Timeline schema

        KEY                                                                         VALUE
        row id                     column                            timestamp
                                   family        qualifier   visibility
        user.id + created at + id  “coordinate”                                     lon/lat
                                   “entities”    “hashtags”                         hashtags
                                                 “urls”                             urls
                                   “text”                                           text

      Easy to process the entire timeline or a time interval for the same user.
      Not good for other kinds of analysis:
        – find all the tweets with a given hashtag
        – find all the tweets in New York
        – . . .
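The timeline row id and the "time interval for the same user" query can be sketched as follows. This is a toy model on a sorted list, not Accumulo code. Zero-padding the numeric ids, using ISO 8601 UTC strings for the creation time, and the `|` separator are assumptions of the sketch, chosen so that lexicographic order agrees with numeric and chronological order; the tweets are invented.

```python
from bisect import bisect_left


def row_id(user_id, created_at, tweet_id):
    """Build 'user.id + created at + id' as a sortable string."""
    return "{:020d}|{}|{:020d}".format(user_id, created_at, tweet_id)


# Hypothetical tweets: (user.id, created at, id, text)
tweets = [
    (42, "2016-03-01T09:00:00Z", 1001, "first"),
    (42, "2016-03-05T18:30:00Z", 1002, "second"),
    (7, "2016-03-02T12:00:00Z", 1003, "other user"),
    (42, "2016-03-09T08:15:00Z", 1004, "third"),
]
table = sorted((row_id(u, t, i), text) for u, t, i, text in tweets)
row_ids = [rid for rid, _ in table]


def user_interval(user_id, start, end):
    """Texts of one user's tweets with start <= created at < end."""
    lo = bisect_left(row_ids, "{:020d}|{}".format(user_id, start))
    hi = bisect_left(row_ids, "{:020d}|{}".format(user_id, end))
    return [text for _, text in table[lo:hi]]


print(user_interval(42, "2016-03-01", "2016-03-06"))
```

A hashtag or location query gets no help from this layout: the hashtag is not part of the row id, so answering it would mean reading every row.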
  25. 25. Summary
        – Accumulo is great for storing large amounts of structured data
        – Accumulo is good for interactive queries as well as for more batch-oriented queries
        – Accumulo is a low-level system
            – NoSQL (that’s not good!), which means there is no high-level language to query the data
            – a lot of flexibility, which can easily backfire
  26. 26. Thank you. Questions?
