-
1.
Introduction to Apache Accumulo
Boulder/Denver BigData Meetup - March 21, 2012
Jared Winick
@jaredwinick
-
2.
Accumulo /əˈkjuːmjʊloʊ/
1. Sorted, distributed key/value store with
cell-based access control and
customizable server-side processing
-
3.
http://yourmotivational.com/uploads/8604.jpg
-
4.
Annotation Added
Jeff Dean: Designs, Lessons and Advice from Building Large Distributed Systems
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
-
5.
Enables interactive access to…
Trillions of records
petabytes of indexed data
across 100s-1000s of servers
-
6.
Short Accumulo History Lesson
http://www.flickr.com/photos/mr_t_in_dc/4249886990/sizes/l/in/photostream/
-
7.
2006
-
8.
2008
http://upload.wikimedia.org/wikipedia/commons/8/84/National_Security_Agency_headquarters%2C_Fort_Meade%2C_Maryland.jpg
-
9.
2011
-
10.
2012
-
11.
Uses of BigTable and Kin
BigTable:
•Google Analytics [1]
•Crawl [1]
•AppEngine Datastore [2]
•Many more [1]
HBase:
•Messages [3,4,6]
•Insights [5,6]
Cassandra:
•Rainbird (realtime analytics) [7]
Accumulo:
•???
1.) http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
2.) http://code.google.com/appengine/articles/storage_breakdown.html
3.) http://www.facebook.com/note.php?note_id=454991608919
4.) http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf
5.) http://www.facebook.com/note.php?note_id=10150103900258920
6.) http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf
7.) http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
-
12.
Accumulo /əˈkjuːmjʊloʊ/
1. Sorted, distributed key/value store with
cell-based access control and
customizable server-side processing
-
13.
Multi-dimension Key
Key:
– Row ID
– Column: Family, Qualifier, Visibility
– Timestamp
Value
http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
-
14.
Keys Sorted Lexicographically
Row ID, Column Family, Column Qualifier, Column Visibility, Timestamp
Everything is a byte[] except the Timestamp which is a long
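A minimal sketch of this sort order, not Accumulo code. The `Key` tuple and `sort_order` helper are hypothetical names, and the integer timestamps are made-up values. The sketch also assumes that timestamps sort newest first, as the Accumulo documentation describes, so the latest version of a cell is seen first.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int  # the only non-byte[] field

def sort_order(key: Key):
    # bytes compare lexicographically, field by field.
    # Assumption: the timestamp is negated so newer versions sort first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)

keys = [
    Key(b"Bob", b"properties", b"phone", b"private", 20110301),
    Key(b"Alice", b"purchases", b"Xbox", b"public", 20110201),
    Key(b"Alice", b"properties", b"phone", b"private", 20110201),
    Key(b"Alice", b"properties", b"age", b"public", 20110101),
    Key(b"Alice", b"properties", b"age", b"public", 20110301),
]
sorted_keys = sorted(keys, key=sort_order)
```

After sorting, both versions of Alice's `age` cell are adjacent, with the newer one first, and all of Alice's entries come before Bob's.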
-
15.
Physical Layout
(Key = Row ID, Col Fam, Col Qual, Col Vis, Time)
Row ID  Col Fam     Col Qual  Col Vis  Time        Value
Alice   properties  age       public   March 2011  31
Alice   properties  phone     private  Feb 2011    555-1234
Alice   purchases   Xbox      public   Feb 2011    $299
Bob     properties  phone     private  March 2011  555-4321
Bob     purchases   iPhone    public   Feb 2011    $399
-
16.
Queries
•By exact Key or range of Keys
•Data is always returned in sorted order
Query Requirements Drive
Data Model Design
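To illustrate why a sorted store answers both kinds of query cheaply, here is a small sketch that uses a sorted in-memory list and binary search as a stand-in for a table. `TABLE` and `scan` are hypothetical names, not the Accumulo client API; the rows are the ones from the "Physical Layout" slide.

```python
import bisect

# (row, family, qualifier) keys kept in sorted order, as the store would keep them.
TABLE = sorted([
    (("Alice", "properties", "age"), "31"),
    (("Alice", "properties", "phone"), "555-1234"),
    (("Alice", "purchases", "Xbox"), "$299"),
    (("Bob", "properties", "phone"), "555-4321"),
    (("Bob", "purchases", "iPhone"), "$399"),
])

def scan(table, start_row, end_row):
    """Return all entries whose row is in [start_row, end_row], in sorted order."""
    keys = [key for key, _ in table]
    # (start_row,) sorts just before every key that begins with start_row.
    lo = bisect.bisect_left(keys, (start_row,))
    # end_row + "\0" is the smallest string greater than end_row.
    hi = bisect.bisect_left(keys, (end_row + "\0",))
    return table[lo:hi]

alice = scan(TABLE, "Alice", "Alice")   # exact row
everyone = scan(TABLE, "A", "Z")        # range of rows
```

Because a query is only a seek plus a sequential read, anything you want to look up quickly has to be encoded in the key, which is why query requirements drive the data model.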
-
17.
http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
-
18.
[Diagram: Clients read from and write to Accumulo; Hadoop MapReduce runs analytics against Accumulo; Accumulo keeps its storage in Hadoop HDFS and its configuration/state in Zookeeper]
-
19.
[Diagram: a Table is divided into Tablets. Accumulo layer: a Master and many Tablet Servers, each hosting Tablets. Hadoop HDFS layer: a Name Node and many Data Nodes]
-
20.
Tablet Server Failure
1.) Detect Failure (Master notices a failed Tablet Server)
2.) Reassign (Master reassigns its Tablets to the remaining Tablet Servers)
[Diagram: Table and Tablets; Accumulo layer with Master and Tablet Servers; Hadoop HDFS layer with Name Node and Data Nodes]
-
21.
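Writes
1.) Client write goes to the Write-Ahead Log (WAL)
2.) Write is inserted into the Tablet's MemTable on the Tablet Server
[Diagram: Client, Accumulo Tablet Server with WAL, Tablet and MemTable; Hadoop HDFS Data Nodes below]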
-
22.
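Writes
1.) Client write goes to the Write-Ahead Log (WAL)
2.) Write is inserted into the Tablet's MemTable on the Tablet Server
3.) MemTable is flushed to File 1 in Hadoop HDFS
[Diagram: Client, Accumulo Tablet Server with WAL, Tablet and MemTable; File 1 stored on Hadoop HDFS Data Nodes]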
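The three write steps can be sketched as follows. This is a toy model under stated assumptions, not Accumulo's implementation: `TabletSketch` and `memtable_limit` are hypothetical names, plain Python lists stand in for the WAL and for HDFS files, the MemTable is a sorted list rather than a tree, and clearing the log after a flush is an assumption about when logged entries stop being needed.

```python
import bisect

class TabletSketch:
    def __init__(self, memtable_limit=3):
        self.wal = []        # stand-in for the write-ahead log
        self.memtable = []   # sorted list of (key, value) pairs
        self.files = []      # each flushed file is an immutable sorted tuple
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.wal.append((key, value))               # 1.) log first, for recovery
        bisect.insort(self.memtable, (key, value))  # 2.) sorted in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                            # 3.) MemTable is full

    def flush(self):
        # The MemTable is already sorted, so the file is written sequentially.
        self.files.append(tuple(self.memtable))
        self.memtable = []
        self.wal.clear()  # assumption: flushed entries no longer need the log

tablet = TabletSketch(memtable_limit=3)
for key, value in [("b", 1), ("a", 2), ("c", 3), ("d", 4)]:
    tablet.write(key, value)
```

After the four writes, the first three entries sit in one sorted file and the fourth is still in the MemTable and the log.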
-
23.
Compactions
Minor: The process of flushing a MemTable of a Tablet to a single file in HDFS
Major: The process of combining multiple files into a single file
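As a rough illustration of the two compaction types, the sketch below models files as sorted tuples of (key, value) pairs. `minor_compact` and `major_compact` are hypothetical helper names, and the sketch only merges; it does not model versioning or deletes.

```python
import heapq

def minor_compact(memtable):
    """Flush an in-memory mapping of key -> value to one sorted, immutable file."""
    return tuple(sorted(memtable.items()))

def major_compact(files):
    """Combine several sorted files into a single sorted file."""
    # heapq.merge streams the inputs, so no file has to be re-sorted in memory.
    return tuple(heapq.merge(*files))

file_1 = minor_compact({"lunch": 4, "breakfast": 31})
file_2 = minor_compact({"dinner": 7, "coffee": 12})
merged = major_compact([file_1, file_2])
```

Because every input is already sorted, the merge is a single sequential pass over each file.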
-
24.
Tablet Splits
• Tablets are split when they reach a max size
• Always split on row boundary
• Master assigns a split Tablet to another Tablet
server (no data is moved!)
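A small sketch of splitting on a row boundary. `split_tablet` is a hypothetical function, the max size is counted in entries rather than bytes, and picking the row at the midpoint as the split point is an assumption made for illustration.

```python
def split_tablet(entries, max_entries):
    """entries: sorted list of (row, column, value). Returns one or two tablets."""
    if len(entries) <= max_entries:
        return [entries]
    # Pick the row at the midpoint; the left tablet keeps every entry up to
    # and including that row, so a row is never divided between tablets.
    split_row = entries[len(entries) // 2][0]
    left = [e for e in entries if e[0] <= split_row]
    right = [e for e in entries if e[0] > split_row]
    if not right:
        return [entries]  # no usable row boundary in this sketch
    return [left, right]

ENTRIES = [
    ("Alice", "properties:age", "31"),
    ("Alice", "properties:phone", "555-1234"),
    ("Alice", "purchases:Xbox", "$299"),
    ("Bob", "properties:phone", "555-4321"),
    ("Bob", "purchases:iPhone", "$399"),
]
tablets = split_tablet(ENTRIES, max_entries=4)
```

Here all of Alice's entries land in the first tablet and all of Bob's in the second, even though an even split by entry count would have cut Alice's row in two.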
-
25.
Reads
[Diagram: Client reads from a Tablet on an Accumulo Tablet Server; the Tablet combines its MemTable with its files]
-
26.
Accumulo /əˈkjuːmjʊloʊ/
1. Sorted, distributed key/value store with
cell-based access control and
customizable server-side processing
-
27.
Iterators: Server-side programming
http://wiki.eeng.dcu.ie/ee557/287-EE/version/default/part/ImageData/data/server-side_intro.gif
-
28.
Iterators
Can be run at:
•Scan Time
•Minor Compaction
•Major Compaction
Can do things like:
•Aggregation (Combiners)
•Age-Off
•Filtering (access control)
•Transformation
Push Processing to the Data
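The idea of stacking iterators over a sorted stream can be approximated with Python generators. This is an analogy only: real Accumulo iterators are Java classes that run inside the tablet server, and `age_off`, `transform` and the sample entries here are hypothetical.

```python
def age_off(source, now, max_age):
    """Drop entries whose timestamp is older than max_age."""
    for key, timestamp, value in source:
        if now - timestamp <= max_age:
            yield key, timestamp, value

def transform(source, fn):
    """Rewrite each value as it streams past."""
    for key, timestamp, value in source:
        yield key, timestamp, fn(value)

# Sorted (key, timestamp, value) entries, as a tablet would present them.
ENTRIES = [("breakfast", 100, "5"), ("coffee", 40, "2"), ("lunch", 90, "7")]

# Each iterator wraps the one below it; nothing runs until the stack is read.
stack = transform(age_off(ENTRIES, now=100, max_age=20), int)
result = list(stack)
```

The `coffee` entry is 60 time units old and is filtered out before the client ever sees it, which is the sense in which processing is pushed to the data.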
-
29.
Accumulo /əˈkjuːmjʊloʊ/
1. Sorted, distributed key/value store with
cell-based access control and
customizable server-side processing
-
30.
Access Control
• Every key-value has a visibility label
• Label is defined with boolean operators
• Label is arbitrary and ad-hoc
– Public
– Private | Admin
– Finance | (HR & Manager)
• Authorizations presented at scan time
• Data is filtered out automatically by system-level Iterator
-
31.
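Access Control – Typical Architecture
Trusted Zone: Web Server, Accumulo, Enterprise Identity Management
1.) User passes credentials to the Web Server
2.) Web Server looks up the user in Enterprise Identity Management
3.) Enterprise Identity Management returns the user's Authorizations
4.) Web Server proxies the Authorizations to Accumulo
5.) Accumulo returns the visible data to the Web Server
6.) Web Server returns the data to the User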
-
32.
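Access Control – Typical Architecture
Trusted Zone: Web Server, Accumulo, Enterprise Identity Management
Data in Accumulo (visibility label, value):
– SECRET&PROJECT X, 6
– SECRET&PROJECT Y, 8
– SECRET&PROJECT Z, 3
1.) Bob presents his PKI Cert to the Web Server
2.) Web Server looks up Bob in Enterprise Identity Management
3.) Enterprise Identity Management returns Auths: [SECRET, UNCLASSIFIED, PROJECT X, PROJECT Y]
4.) Web Server proxies Bob's Auths to Accumulo
5.) Accumulo returns [6,8] to the Web Server
6.) Web Server returns [6,8] to Bob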
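The filtering in this example can be sketched with a tiny label evaluator. It is an illustration, not Accumulo's visibility parser: `is_visible` and `tokenize` are hypothetical names, terms are allowed to contain spaces so the slide's labels can be used unchanged, and mixed `&` / `|` without parentheses are simply evaluated left to right.

```python
import re

def tokenize(expr):
    # Operators and parentheses are single tokens; anything else is a term.
    return [t.strip() for t in re.findall(r"[&|()]|[^&|()]+", expr) if t.strip()]

def is_visible(expr, auths):
    """True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty label is visible to everyone
    pos = 0

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            pos += 1  # skip the closing ")"
            return result
        return token in auths

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if op == "&" else (result or right)
        return result

    return parse_expr()

cells = [("SECRET&PROJECT X", 6), ("SECRET&PROJECT Y", 8), ("SECRET&PROJECT Z", 3)]
bob = {"SECRET", "UNCLASSIFIED", "PROJECT X", "PROJECT Y"}
visible = [value for label, value in cells if is_visible(label, bob)]
```

With Bob's authorizations, only the PROJECT X and PROJECT Y cells pass the filter, matching the `[6,8]` returned in the diagram.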
-
33.
Demo
-
34.
Application Requirements
Build an application to analyze trends in Twitter
messages.
•Query for word/phrase and view real-time activity
in a time series graph
•View at different time ranges (1 day, 7 days, 30
days, etc)
•Allow multiple query terms to compare activity (e.g.
breakfast, lunch)
•Automatically extract daily trends for the user
-
35.
Demo Setup/Data
• Twitter Streaming API
• Only messages with a US country code
• 1,2,3-grams built
• Data since Dec 24 – Live
• Running on average workstation, 1 SATA disk,
6 GB memory.
• 72GB, 2.6 billion entries and counting
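For readers unfamiliar with n-grams, a minimal sketch of building the 1-, 2- and 3-grams from one message. The `ngrams` helper and its lowercase, whitespace-based tokenization are assumptions; the demo's actual tokenizer is not shown in the slides.

```python
def ngrams(text, max_n=3):
    """Return every 1..max_n word sequence in the text, shortest first."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]

grams = ngrams("Tim Tebow rocks")
```

A three-word message yields six n-grams, so the number of entries grows much faster than the number of tweets.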
-
36.
Data Model
• Tweets table
– Row ID: n-gram
– Column Family: Date Granularity (DAY, HOUR)
– Column Qual: Date Value
– Value: Count
– SummingCombiner (Iterator) used to update Count
Row ID     Col Fam  Col Qual    Value
breakfast  DAY      20120318    31
breakfast  DAY      20120319    56
…          …        …           …
lunch      HOUR     2012031801  3
lunch      HOUR     2012031802  4
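A sketch of how counts could accumulate under this model: each n-gram occurrence writes a value of 1 under a DAY key and an HOUR key, and entries with identical keys are summed. `mutations` and `summing_combine` are hypothetical helpers that only mimic the effect of a summing combiner; they are not the Accumulo API.

```python
from collections import defaultdict
from datetime import datetime

def mutations(ngram, when):
    """One occurrence of an n-gram becomes two (key, 1) entries."""
    return [
        ((ngram, "DAY", when.strftime("%Y%m%d")), 1),
        ((ngram, "HOUR", when.strftime("%Y%m%d%H")), 1),
    ]

def summing_combine(entries):
    """Sum the values of entries that share the same key; return them sorted."""
    totals = defaultdict(int)
    for key, count in entries:
        totals[key] += count
    return sorted(totals.items())

entries = (
    mutations("lunch", datetime(2012, 3, 18, 1, 5))
    + mutations("lunch", datetime(2012, 3, 18, 1, 40))
    + mutations("lunch", datetime(2012, 3, 18, 2, 10))
)
counts = summing_combine(entries)
```

The writer never reads the current count before updating it; it only appends 1s and leaves the summing to the server side.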
-
37.
Data Model
• Trends table
– Row ID: (Date Granularity + Date Value)
– Column Family: (Integer.MAX_VALUE –
trendScore)
– Column Qual: n-gram
– Value: []
Row ID        Col Fam     Col Qual     Value
DAY:20120318  2147483145  church
DAY:20120318  2147483316  hangover
…             …           …            …
DAY:20120319  2147476521  the broncos
DAY:20120319  2147477704  tim tebow
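The `Integer.MAX_VALUE – trendScore` trick makes higher scores sort first in an ascending scan. A small sketch: `trend_key` is a hypothetical helper, the scores 502 and 331 are derived from the first two table rows, and zero-padding the family to ten digits is an assumption so that string order matches numeric order.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE

def trend_key(granularity, date_value, ngram, trend_score):
    row = f"{granularity}:{date_value}"
    # Subtracting from INT_MAX inverts the order; padding keeps widths equal.
    family = f"{INT_MAX - trend_score:010d}"
    return (row, family, ngram)

keys = sorted([
    trend_key("DAY", "20120318", "hangover", 331),
    trend_key("DAY", "20120318", "church", 502),
])
```

Scanning the row `DAY:20120318` from the start therefore returns the day's strongest trends first, with no sorting on the client.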
-
38.
MapReduce Analytics
• Utilize MapReduce for building trends
• AccumuloInputFormat reads from tweets
table
• AccumuloOutputFormat writes to trends
table
• AccumuloStorage LoadFunc for Pig
available on github
-
39.
Summary
•Accumulo exploits locality to enable
interactive access to huge data sets while
adding cell-level access control and server-
side programming
•Nothing in life is free. Accumulo comes with
the complexity and responsibility of
managing a distributed system and designing
indexes on your data
-
40.
References
• Documentation, Mailing Lists, Links
http://incubator.apache.org/accumulo/
• HBase Shootout
http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012
• Trendulo
https://github.com/jaredwinick/trendulo
Obviously most people’s data set isn’t this large. If you can fit your data into memory of a single large server, Accumulo probably isn’t for you.
Insights: 20 billion events/day. Messages: 6+ billion messages -> 75 billion read/write operations/day
Sparse, sorted
A Table is partitioned into Tablets, which are logically assigned to Tablet Servers (they are physically in HDFS). A Tablet is a range of keys.
Tablets are only logically assigned to Tablet Servers by the Accumulo Master. They are physically stored in HDFS. A Tablet is one or more files.
Data is first written to the WAL (outside of HDFS, on a different machine), then inserted into the sorted MemTable (a balanced, sorted binary tree)
When the MemTable is full, it gets flushed to a file which is stored in HDFS (minor compaction). Writes to disk are sequential because the MemTable is sorted
All of these files are always sorted!
TabletServer merges key-values from all its files and its MemTable to present a complete sorted view of data
One of the most powerful features of Accumulo – a lot to learn. Come back to aggregation in demo
Example: Trendistic (http://trendistic.indextank.com)
Documentation is a work in progress…