The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Speaker: Steve Touw, CTO, 42six Solutions a CSC Company

Amino is an open source analytical framework that takes a “building-blocks” approach to data discovery: it pre-computes features about data at the most granular level possible and then lets analysts and data scientists easily combine those features into more complex questions.

The magic behind Amino is found in its custom Accumulo index, which strives to provide fast scans, highly dimensional scans, data compression, and a simple query structure. The index leverages Accumulo iterators to do much of the scan-time logic, which places no limit on the dimensionality of the query. Iterators are what make Accumulo unique and enable the Amino index to execute complex queries.

Presentation Transcript

  • Framework for Big Data Discovery and Analytics © 2013 42six Solutions, All Rights Reserved, www.42six.com
  • Hadoop MapReduce
      • We can look across all our data to answer questions!
      • Problem Statement: Developers can write MapReduce code to analyze data, but don’t know what to look for; the analysts know what to look for, but don’t know how to write code.
      • Technology is not the problem. The problem is enabling the analyst to effectively leverage technology and reuse it.
  • Typical Analyst Workflow:
      • I have an entity I want to learn more about
      • Everything is indexed by entities
      • We can ask questions of Big Data, but they aren’t Big Questions – we always start with an entity
    We should be able to:
      • Have a pattern and see entities that match that pattern
      • Ask complex questions of Big Data
  • Naïve Way: Custom MapReduce job for each question
    Amino Way: Pre-compute features (micro-analytics), the building blocks of questions, and let analysts mix those on the fly to ask complex questions
      • The Amino index executes analysts’ complex questions as a real-time scan: less competition for resources, more scalable
      • Scales to billions of entities and features
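The building-block idea on this slide can be sketched in a few lines of Python. The sample records, the feature names, and the `precompute` and `ask` helpers are invented for illustration and are not part of the Amino API; the sketch only shows features being computed once and then combined on demand.

```python
from collections import defaultdict

# Hypothetical sample records; in Amino these would come from a DataLoader.
records = [
    {"handle": "stevetouw", "followers": 150, "tweets_per_day": 3},
    {"handle": "saarba", "followers": 5000, "tweets_per_day": 2},
    {"handle": "zzrka", "followers": 42, "tweets_per_day": 40},
]

# Hypothetical micro-analytics: each one answers a single small question.
features = {
    "followers_10_to_200": lambda r: 10 <= r["followers"] <= 200,
    "tweets_per_day_0_to_6": lambda r: 0 <= r["tweets_per_day"] <= 6,
    "handle_starts_with_s": lambda r: r["handle"].startswith("s"),
}


def precompute(records, features):
    """Run every feature once and remember which entities satisfy it."""
    index = defaultdict(set)
    for record in records:
        for name, predicate in features.items():
            if predicate(record):
                index[name].add(record["handle"])
    return index


def ask(index, *feature_names):
    """Combine pre-computed features with a logical AND, without re-reading the data."""
    sets = [index[name] for name in feature_names]
    return set.intersection(*sets) if sets else set()


index = precompute(records, features)
matches = ask(
    index,
    "followers_10_to_200",
    "tweets_per_day_0_to_6",
    "handle_starts_with_s",
)
```

The expensive pass over the data happens once in `precompute`; each new question is then only a set operation over results that already exist.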
  • Live Demo: What could go wrong?…
  • Amino Framework
      • Feature Creation API
          • Abstracts the complexities of MapReduce
          • Focus on logic of the feature/micro-analytic
          • Write-once DataLoader for each data source
          • Simple and powerful data joins
      • Amino Index
          • AminoOutputFormat
          • Bulk Ingest into Accumulo
      • Query API
          • Iterators
  • Workflow
  • Benefits
      • Data Agnostic
      • Not a black box
      • Fully scalable
      • Crowd source micro-analytics
      • Inherent cross-datasource linked indexes
      • Encourages sharing of knowledge, discovery
      • Index built to support machine learning
      • Security considered up front – index is in Accumulo
      • Built on open source, for open source
  • Feature Creation
      • Can join multiple datasets
      • Keys are established in the DataLoader
      • Any external job can output this format and it will be indexed properly during indexing jobs
      • Notice there’s no key – that’s on purpose!
  • Index Goals
    Now that all our features are indexed, let’s let the analysts start building!
      • Fast scans
      • Highly dimensional scans
      • Data compression
      • Simple query structure
  • Accumulo Index 1: More Dimensions than Entities
      Row                                    CF            CQ         Value
      Shard Number:Data Source:Bucket Name   Bucket Value  Hash Salt  Compressed Bitmap
    Example:
      Row                CF         CQ  Value
      2:Twitter:handle   stevetouw  0   010011010010011
      • JavaEWAH is a word-aligned compressed variant of the Java BitSet class. It does not achieve the best compression, but rather improves query processing time
      • Indexes in the bit vector represent the features that entity falls in – a feature vector
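As a rough illustration of this layout, the sketch below builds one index entry per entity. It is not Amino code: `IndexEntry`, `feature_vector`, and `make_entry` are hypothetical names, the feature indexes are made up, and a plain Python integer stands in for the JavaEWAH compressed bitmap.

```python
from typing import Iterable, NamedTuple


class IndexEntry(NamedTuple):
    row: str    # shard number:data source:bucket name
    cf: str     # bucket value, i.e. the entity itself
    cq: str     # hash salt
    value: int  # feature bit vector (uncompressed stand-in for JavaEWAH)


def feature_vector(feature_indexes: Iterable[int]) -> int:
    """Set one bit per feature index the entity falls in."""
    bits = 0
    for i in feature_indexes:
        bits |= 1 << i
    return bits


def make_entry(shard, source, bucket, entity, salt, feature_indexes) -> IndexEntry:
    row = f"{shard}:{source}:{bucket}"
    return IndexEntry(row, entity, str(salt), feature_vector(feature_indexes))


def has_feature(entry: IndexEntry, feature_index: int) -> bool:
    """True if the bit for this feature index is set."""
    return bool((entry.value >> feature_index) & 1)


# Feature indexes 1, 4, 5 and 7 are invented for the example.
entry = make_entry(2, "Twitter", "handle", "stevetouw", 0, [1, 4, 5, 7])
```

Because there is only one row per entity and salt, the whole feature vector of an entity travels with its key.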
  • At Query Time…
    Bloom filter based on the lexicographical first and last of each dimension of the query:
      Dimension                          First     Last
      Number of followers: 10 - 200      aachimba  zzrka
      Number of tweets per day: 0 - 6    aaabbb    zyrbb
      Handle starts with letter: S       saarba    szaban   (smallest range)
      • Dimensions map to a query bit vector: 000001001111000101000011100101010011100101
      • Note there is an index for every possible value between the ratio features
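The slide does not say how Amino decides which range is smallest. One way to picture the step, assuming a sorted list of known handles is available, is to count how many keys fall inside each dimension's first/last pair and scan only the narrowest one. The handle list (including the invented `mmiddle`) and the helper names below are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted sample of entity keys (handles).
handles = sorted([
    "aaabbb", "aachimba", "mmiddle", "saarba", "saarra",
    "stevetouw", "szaban", "zyrbb", "zzrka",
])

# (first, last) per query dimension, as given on the slide.
dimension_ranges = {
    "followers 10-200": ("aachimba", "zzrka"),
    "tweets per day 0-6": ("aaabbb", "zyrbb"),
    "handle starts with S": ("saarba", "szaban"),
}


def keys_in_range(sorted_keys, first, last):
    """Count keys k with first <= k <= last using binary search."""
    return bisect_right(sorted_keys, last) - bisect_left(sorted_keys, first)


def smallest_range(sorted_keys, ranges):
    """Return (dimension, (first, last)) for the range covering the fewest keys."""
    return min(
        ranges.items(),
        key=lambda item: keys_in_range(sorted_keys, *item[1]),
    )


name, (first, last) = smallest_range(handles, dimension_ranges)
```

Only the keys between `first` and `last` then need to be checked against the query bit vector.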
  • Accumulo Iterator Time!!
      Row                CF         CQ  Value
      2:Twitter:handle   saarba     0   00101011001110
      2:Twitter:handle   saarra     0   00101111010100
      2:Twitter:handle   stevetouw  0   01111100001100
      2:Twitter:handle   szaban     0   00110011001111
      • Push our query bit vector through the range found in the previous step
      • If the result of the bitwise operation contains an index at each dimension, we have a match!
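The match rule on this slide can be sketched as follows. This is plain Python rather than an Accumulo iterator, and the feature index layout (followers in bits 0 to 3, tweets per day in bits 4 to 7, "starts with S" in bit 8) is an assumption made for the example; the slide does not give the real layout, so its bit strings are not reused here.

```python
def mask(indexes):
    """Build a bit mask with one bit set per feature index."""
    bits = 0
    for i in indexes:
        bits |= 1 << i
    return bits


def matches(entity_bits, dimension_masks):
    """An entity matches when it shares at least one bit with every dimension."""
    return all(entity_bits & m for m in dimension_masks)


# Hypothetical layout: one mask per query dimension.
query = [
    mask(range(0, 4)),  # number of followers: any accepted value
    mask(range(4, 8)),  # tweets per day: any accepted value
    mask([8]),          # handle starts with S
]

# Hypothetical feature vectors for the rows in the scanned range.
rows = {
    "saarba": mask([1, 8]),         # no tweets-per-day feature in range
    "saarra": mask([5, 8]),         # no follower feature in range
    "stevetouw": mask([2, 6, 8]),
    "szaban": mask([0, 7, 8, 11]),  # extra features do not matter
}

hits = [handle for handle, bits in rows.items() if matches(bits, query)]
```

Adding a dimension to the query only adds one more mask to test per row, which is why the number of dimensions does not change how many rows are scanned.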
  • What is the Salt For?
      Row                                    CF            CQ         Value
      Shard Number:Data Source:Bucket Name   Bucket Value  Hash Salt  Compressed Bitmap
      2:Twitter:handle                       stevetouw     0          0100110100100101
      • Collisions are possible (using 32 bit vector)
      • Salt is used to hash the feature indexes, so you need as many matches in the previous step as you have salts
      • We have used 3 salts with 15 billion records and have had no collisions
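A minimal sketch of the salting idea, under several assumptions: the slide's "32 bit vector" is read here as positions drawn from a 32-bit space, SHA-256 stands in for whatever hash Amino actually uses, a Python set of positions stands in for the compressed bitmap, and the feature identifiers are invented.

```python
import hashlib

INDEX_SPACE = 2 ** 32  # assumed reading of "32 bit vector"
SALTS = (0, 1, 2)      # the slide reports using 3 salts


def salted_position(feature_id, salt, space=INDEX_SPACE):
    """Hash a feature id together with a salt into a bit position."""
    digest = hashlib.sha256(f"{salt}:{feature_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % space


def encode(feature_ids, salt):
    """Sparse stand-in for a compressed bitmap: the set of positions that are on."""
    return {salted_position(f, salt) for f in feature_ids}


def has_feature(vectors_by_salt, feature_id):
    """Accept only if the feature's position is set under every salt."""
    return all(
        salted_position(feature_id, salt) in positions
        for salt, positions in vectors_by_salt.items()
    )


# Hypothetical feature identifiers for one entity.
entity_features = ["followers=150", "tweets_per_day=3", "starts_with=s"]
vectors = {salt: encode(entity_features, salt) for salt in SALTS}
```

A false match would need two different features to collide under all three salts at once, which is far less likely than a collision under a single hash.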
  • Benefits of this Index
      • Tables are small, bit vector compression is good, only one row per entity
      • Works great if you have more dimensions than you have entities or the ranges in your dimensions are good bloom filters (like “handle starts with letter …”)
      • No matter how many dimensions, the query will always be as fast as the smallest range
      • All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable
      • Represents a feature vector for your entities – great for machine learning
  • Accumulo Index 2: More Entities than Dimensions
      Row          CF                                  CQ             Value
      shard:salt   Data Source#Bucket Name#FeatureId   Feature Value  Compressed Bitmap
    Example:
      Row   CF                      CQ  Value
      2:0   Twitter#handle#123456   s   0100110100101001
      • 123456 could map to feature “Handle starts with letter”
      • Indexes in the bit vector represent the entities that fall in that feature
      • So handle stevetouw could map to index 73 (for salt 0)
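This second layout is the inverse of the first: one bitmap per feature value, with one bit per entity. The sketch below shows the inversion using Python integers as uncompressed bitmaps. Position 73 for stevetouw comes from the slide; the position for saarba, the follower value `150`, and the helper name are invented for illustration.

```python
from collections import defaultdict


def build_feature_bitmaps(entity_positions, entity_features):
    """Map (column family, feature value) to a bitmap of entity positions.

    entity_positions: handle -> bit position within one shard:salt
    entity_features:  handle -> list of (column family, feature value)
    """
    bitmaps = defaultdict(int)
    for handle, feats in entity_features.items():
        for cf, cq in feats:
            bitmaps[(cf, cq)] |= 1 << entity_positions[handle]
    return dict(bitmaps)


# Position 73 is from the slide; position 5 is made up.
positions = {"stevetouw": 73, "saarba": 5}

# Hypothetical feature assignments for shard 2, salt 0.
feats = {
    "stevetouw": [
        ("Twitter#handle#123456", "s"),
        ("Twitter#handle#444411", "150"),
    ],
    "saarba": [("Twitter#handle#123456", "s")],
}

bitmaps = build_feature_bitmaps(positions, feats)
```

The number of rows now grows with the number of feature values rather than with the number of entities, which suits the case named in the slide title.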
  • That Same Query Again…
      • Number of followers: 10 – 200 (feature id: 444411)
      • Number of tweets per day: 0 – 6 (feature id: 555522)
      • Handle starts with letter: S (feature id: 123456)
      Row   CF                      CQ    Value
      2:0   Twitter#handle#444411   10    0010111011100
      2:0   Twitter#handle#444411   11    0101010101101
      ……
      2:0   Twitter#handle#444411   200   0000001011000
      2:0   Twitter#handle#555522   0     1111110001101
      2:0   Twitter#handle#555522   1     1010100000100
      ……
      2:0   Twitter#handle#555522   6     1111001010000
      2:0   Twitter#handle#123456   S     1111110001101
    Diagram annotations: OR across the values within a feature’s range, AND across the features
      • Magic iterator that handles all the boolean logic
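The boolean logic on this slide can be approximated outside Accumulo as an OR over the bitmaps of every accepted value of a feature, followed by an AND across features. The sketch uses only the rows printed on the slide (the rows elided by "……" are left out), and `or_values` and `evaluate` are hypothetical helpers, not the Amino iterator.

```python
from functools import reduce
from operator import and_, or_

# (column family, column qualifier, bitmap) rows shown on the slide for row 2:0.
rows = [
    ("Twitter#handle#444411", "10", "0010111011100"),
    ("Twitter#handle#444411", "11", "0101010101101"),
    ("Twitter#handle#444411", "200", "0000001011000"),
    ("Twitter#handle#555522", "0", "1111110001101"),
    ("Twitter#handle#555522", "1", "1010100000100"),
    ("Twitter#handle#555522", "6", "1111001010000"),
    ("Twitter#handle#123456", "S", "1111110001101"),
]


def or_values(rows, cf, accept):
    """OR together the bitmaps of every accepted value of one feature."""
    bitmaps = (int(bits, 2) for f, cq, bits in rows if f == cf and accept(cq))
    return reduce(or_, bitmaps, 0)


def evaluate(rows, clauses):
    """AND together the OR-ed bitmap of each (column family, accept) clause."""
    per_feature = [or_values(rows, cf, accept) for cf, accept in clauses]
    return reduce(and_, per_feature) if per_feature else 0


clauses = [
    ("Twitter#handle#444411", lambda cq: 10 <= int(cq) <= 200),
    ("Twitter#handle#555522", lambda cq: 0 <= int(cq) <= 6),
    ("Twitter#handle#123456", lambda cq: cq == "S"),
]

result = evaluate(rows, clauses)
result_bits = format(result, "013b")
```

Each set bit in `result` is the position of an entity in this shard:salt that satisfies all three dimensions; turning positions back into entities is a separate step.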
  • More Details
    (table repeated from the previous slide)
      • The same entity is guaranteed to always land in the same shard:salt no matter the feature
      • We are left with a set of indexes for each salt, now what?
  • Convert Indexes to Entities
      Row     CF                                           CQ            Value
      shard   Index Position#Data Source#Bucket Name#Salt  Bucket Value
    Example:
      Row   CF                   CQ         Value
      2     73#Twitter#handle#0  stevetouw
      • The iterator scans the rows using a CF filter with the indexes desired
      • The iterator ensures it gets the same CQ “# of salts” times before it sends the resulting CQ results back
      • Again, use the power of iterators and pushing code to the data rather than doing the salt set operation in the web tier
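The "same CQ, number-of-salts times" rule can be sketched as a counting step. In the sketch below a dictionary keyed by (index position, salt) stands in for the lookup table; apart from position 73 under salt 0 for stevetouw, every entry is invented, and `resolve` is a hypothetical helper rather than the Amino iterator.

```python
from collections import Counter

# Hypothetical lookup entries: (index position, salt) -> bucket value.
lookup = {
    (73, 0): "stevetouw", (12, 1): "stevetouw", (40, 2): "stevetouw",
    (9, 0): "saarba", (73, 1): "saarba", (31, 2): "saarba",
}


def resolve(lookup, positions_by_salt):
    """Return entities whose position was hit under every salt."""
    seen = Counter()
    for salt, positions in positions_by_salt.items():
        # Count each entity at most once per salt.
        entities = {
            lookup[(p, salt)] for p in positions if (p, salt) in lookup
        }
        seen.update(entities)
    needed = len(positions_by_salt)
    return sorted(entity for entity, count in seen.items() if count == needed)


# Positions left over from the boolean step, one set per salt.
# Position 9 under salt 0 plays the role of a hash collision.
hits = resolve(lookup, {0: {73, 9}, 1: {12}, 2: {40}})
```

The colliding position under salt 0 resolves to saarba, but saarba is not confirmed under the other two salts and is dropped.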
  • Benefits of this Index
      • Tables are small, bit vector compression is good
      • Works great if you have more entities than you have dimensions (most likely scenario)
      • Affords the ability to do full boolean logic in-iterator, rather than just ANDs as in the previous index
      • All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable
  • Conclusion
      • Amino helps non-technical folk leverage MapReduce cleanly and without hogging cluster resources
      • Accumulo iterators are the reason for the index performance
      • Amino is all about sharing and reuse: crowd source the building blocks and save analysts’ hypotheses. The more people touching Amino, the smarter it becomes
      • Open source (documentation needs help): https://github.com/aminocloud/amino
  • Questions?
      • Steve Touw, steve@42six.com
      • Barrett Stabile, bstabile@42six.com
      • Joe Bruner, jbruner@42six.com
      • Sapan Shah, sshah@42six.com