Introduction to Apache Accumulo

Presented at the Boulder/Denver BigData Meetup on March 21, 2012

  • Obviously most people’s data sets aren’t this large. If you can fit your data into the memory of a single large server, Accumulo probably isn’t for you.
  • 20 billion events/day for Insights; 6+ billion msg -> 75 billion read/write operations/day
  • Sparse, sorted
  • A table is partitioned into tablets, which are logically assigned to tablet servers (they are physically stored in HDFS). A tablet is a range of keys.
  • Tablets are only logically assigned to tablet servers by the Accumulo Master. They are physically stored in HDFS. A tablet is one or more files.
  • Data is first written to the WAL (outside of HDFS, on a different machine), then inserted into the sorted MemTable (a balanced, sorted binary tree).
  • When the MemTable is full, it is flushed to a file stored in HDFS (minor compaction). Writes to disk are sequential because the MemTable is sorted.
  • All of these files are always sorted!
  • TabletServer merges key-values from all its files and its MemTable to present a complete sorted view of data
  • One of the most powerful features of Accumulo – a lot to learn. Come back to aggregation in demo
  • Example: Trendistic (http://trendistic.indextank.com)
  • Documentation is a work in progress…
  • Transcript

    • 1. Introduction to Apache Accumulo. Boulder/Denver BigData Meetup - March 21, 2012. Jared Winick, @jaredwinick
    • 2. Accumulo /əˈkjuˈmj ʊ/ ʊˈlo 1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
    • 3. http://yourmotivational.com/uploads/8604.jpg
    • 4. Annotation Added. Jeff Dean: Designs, Lessons and Advice from Building Large Distributed Systems http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
    • 5. Enables interactive access to… Trillions of records petabytes of indexed data across 100s-1000s of servers
    • 6. Short Accumulo History Lesson http://www.flickr.com/photos/mr_t_in_dc/4249886990/sizes/l/in/photostream/
    • 7. 2006
    • 8. 2008 http://upload.wikimedia.org/wikipedia/commons/8/84/National_Security_Agency_headquarters%2C_Fort_Meade%2C_Maryland.jpg
    • 9. 2011
    • 10. 2012
    • 11. Uses of BigTable and Kin
      • BigTable: Google Analytics [1], Crawl [1], AppEngine Datastore [2], many more [1]
      • HBase: Messages [3, 4, 6], Insights [5, 6]
      • Cassandra: Rainbird (realtime analytics) [7]
      • Accumulo: ???
      1.) http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
      2.) http://code.google.com/appengine/articles/storage_breakdown.html
      3.) http://www.facebook.com/note.php?note_id=454991608919
      4.) http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf
      5.) http://www.facebook.com/note.php?note_id=10150103900258920
      6.) http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf
      7.) http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
    • 12. Accumulo /əˈkjuˈmj ʊ/ ʊˈlo 1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
    • 13. Multi-dimension Key
      • Key: Row ID, Column (Family, Qualifier, Visibility), Timestamp
      • Value
      http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
    • 14. Keys Sorted Lexicographically
      • Row ID, Column Family, Column Qualifier, Column Visibility, Timestamp
      • Everything is a byte[] except the Timestamp, which is a long
    • 15. Physical Layout
      Key                                                    Value
      Row ID   Col Fam      Col Qual   Col Vis   Time         Value
      Alice    properties   age        public    March 2011   31
      Alice    properties   phone      private   Feb 2011     555-1234
      Alice    purchases    Xbox       public    Feb 2011     $299
      Bob      properties   phone      private   March 2011   555-4321
      Bob      purchases    iPhone     public    Feb 2011     $399
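The sort order on slides 14 and 15 can be sketched in a few lines of Python. This is an illustration only, not Accumulo code: the `Key` tuple, the `sort_key` helper, and the integer `YYYYMM` timestamps are assumptions made for the example. Newest-first ordering within otherwise identical keys follows Accumulo's documented behaviour and is not stated on the slide.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for an Accumulo key (not the real API)."""
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int  # assumption: YYYYMM as an integer, for illustration


def sort_key(key: Key):
    # bytes objects compare lexicographically; the timestamp is negated
    # so that newer entries sort before older ones.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# The rows of slide 15, deliberately inserted out of order.
entries = [
    (Key(b"Bob", b"purchases", b"iPhone", b"public", 201102), b"$399"),
    (Key(b"Alice", b"purchases", b"Xbox", b"public", 201102), b"$299"),
    (Key(b"Bob", b"properties", b"phone", b"private", 201103), b"555-4321"),
    (Key(b"Alice", b"properties", b"phone", b"private", 201102), b"555-1234"),
    (Key(b"Alice", b"properties", b"age", b"public", 201103), b"31"),
]
entries.sort(key=lambda entry: sort_key(entry[0]))

for key, value in entries:
    print(key.row.decode(), key.family.decode(), key.qualifier.decode(), value.decode())
```

After sorting, the entries come out in the same order as the table on slide 15.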
    • 16. Queries
      • By exact Key or range of Keys
      • Data is always returned in sorted order
      Query Requirements Drive Data Model Design
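A rough sketch of why "exact key or range of keys" is the natural query shape: on sorted data, a range scan is two binary searches plus a slice. The in-memory `table` list and the `scan` function below are hypothetical stand-ins for illustration, not the Accumulo client API.

```python
import bisect

# Sorted list of ((row, family, qualifier), value), mirroring slide 15.
table = sorted([
    (("Alice", "properties", "age"), "31"),
    (("Alice", "properties", "phone"), "555-1234"),
    (("Alice", "purchases", "Xbox"), "$299"),
    (("Bob", "properties", "phone"), "555-4321"),
    (("Bob", "purchases", "iPhone"), "$399"),
])


def scan(table, start_row, end_row):
    """Return all entries with start_row <= row <= end_row, in sorted order."""
    keys = [key for key, _ in table]
    # A 1-tuple sorts before every full key that shares its row.
    lo = bisect.bisect_left(keys, (start_row,))
    # Appending "\0" gives the smallest string greater than end_row.
    hi = bisect.bisect_left(keys, (end_row + "\0",))
    return table[lo:hi]


for key, value in scan(table, "Alice", "Alice"):
    print(key, value)
```

Anything that cannot be phrased as a contiguous key range means touching the whole table, which is why the slide says query requirements drive the data model design.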
    • 17. http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
    • 18. [Diagram] Clients read from and write to Accumulo; MapReduce runs analytics against it. Accumulo uses Hadoop HDFS for storage and Zookeeper for configuration/state.
    • 19. [Diagram] A Table is divided into Tablets. Accumulo: Tablet Servers and the Master. Hadoop HDFS: Data Nodes and the Name Node.
    • 20. [Diagram] Tablet Server Failure: 1.) Detect Failure 2.) Reassign. Accumulo: Tablet Servers and the Master. Hadoop HDFS: Data Nodes and the Name Node.
    • 21. [Diagram] Writes: the Client writes to an Accumulo Tablet Server, which (1) appends to the Write-Ahead Log (WAL) and (2) inserts into the Tablet's MemTable. Hadoop HDFS Data Nodes sit below.
    • 22. [Diagram] Writes: the Client writes to an Accumulo Tablet Server, which (1) appends to the Write-Ahead Log (WAL), (2) inserts into the Tablet's MemTable, and (3) flushes the MemTable to File 1 on the Hadoop HDFS Data Nodes.
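The three numbered steps of the write path can be sketched as below. This is a toy model under stated assumptions: `TabletSketch`, its `memtable_limit` parameter, and the in-memory lists standing in for the WAL and HDFS files are all hypothetical. The speaker notes describe the MemTable as a balanced sorted tree; a dict that is sorted at flush time is used here only to keep the example short.

```python
class TabletSketch:
    """Toy model of one tablet's write path (not Accumulo code)."""

    def __init__(self, memtable_limit=3):
        self.wal = []        # stand-in for the write-ahead log
        self.memtable = {}   # stand-in for the sorted in-memory map
        self.files = []      # stand-in for files in HDFS; each is a sorted list
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.wal.append((key, value))   # 1. log the mutation first
        self.memtable[key] = value      # 2. then insert into the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compact()        # 3. flush when the MemTable is full

    def minor_compact(self):
        # The flushed file is written in key order, so the write is sequential.
        self.files.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()  # logged entries are now covered by the file


tablet = TabletSketch(memtable_limit=3)
for key, value in [("lunch", 3), ("breakfast", 31), ("dinner", 7), ("snack", 1)]:
    tablet.write(key, value)

print(tablet.files)     # one sorted file holding the first three writes
print(tablet.memtable)  # the fourth write is still only in memory and the log
```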
    • 23. Compactions
      • Minor: the process of flushing a MemTable of a Tablet to a single file in HDFS
      • Major: the process of combining multiple files into a single file
    • 24. Tablet Splits
      • Tablets are split when they reach a max size
      • Always split on row boundary
      • Master assigns a split Tablet to another Tablet server (no data is moved!)
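A minimal sketch of splitting on a row boundary, assuming a tablet is represented as a sorted list of `(row, column, value)` tuples. The `split_tablet` function and its choice of the row nearest the midpoint are illustrative assumptions; the slide does not say how the split row is chosen.

```python
def split_tablet(entries):
    """Split sorted (row, column, value) entries so no row spans two tablets.

    Returns (split_row, left, right), where left holds rows <= split_row,
    or None if the tablet contains fewer than two distinct rows.
    """
    rows = sorted({row for row, _, _ in entries})
    if len(rows) < 2:
        return None
    # Start from the row at the midpoint, then make sure something is
    # left over for the right-hand tablet.
    mid_row = entries[len(entries) // 2][0]
    split_row = mid_row if mid_row != rows[-1] else rows[-2]
    left = [entry for entry in entries if entry[0] <= split_row]
    right = [entry for entry in entries if entry[0] > split_row]
    return split_row, left, right


entries = [
    ("a", "c1", 1), ("a", "c2", 2), ("a", "c3", 3),
    ("b", "c1", 4),
    ("c", "c1", 5),
]
split_row, left, right = split_tablet(entries)
print(split_row, len(left), len(right))
```

In the sketch the two halves are new lists, but in the system described on the slide only the assignment changes: the files stay where they are in HDFS.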
    • 25. [Diagram] Reads: the Client reads from an Accumulo Tablet Server, which combines the Tablet's MemTable with its files.
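Because the MemTable and every file are already sorted, presenting one sorted view is a k-way merge rather than a re-sort. A minimal sketch using the standard library's `heapq.merge`, where the `merged_scan` name and the sample data are assumptions for illustration:

```python
import heapq


def merged_scan(memtable, files):
    """Lazily merge a MemTable (dict) and sorted files into one sorted stream."""
    return heapq.merge(sorted(memtable.items()), *files)


memtable = {"b": 2}
files = [
    [("a", 1), ("d", 4)],  # each file is already sorted by key
    [("c", 3)],
]
print(list(merged_scan(memtable, files)))
```

The same merge, written out to a single new file instead of returned to a client, is essentially what a major compaction does.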
    • 26. Accumulo /əˈkjuˈmj ʊ/ ʊˈlo 1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
    • 27. Iterators: Server-side programming http://wiki.eeng.dcu.ie/ee557/287-EE/version/default/part/ImageData/data/server-side_intro.gif
    • 28. Iterators
      Can be run at:
      • Scan Time
      • Minor Compaction
      • Major Compaction
      Can do things like:
      • Aggregation (Combiners)
      • Age-Off
      • Filtering (access control)
      • Transformation
      Push Processing to the Data
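The iterator idea can be approximated with chained Python generators: each stage consumes a sorted stream of entries and yields a sorted stream. This is a conceptual sketch only. The `age_off` and `summing_combiner` functions and the `(key, timestamp, value)` entry shape are assumptions and do not reflect the real Accumulo iterator interface.

```python
from itertools import groupby


def age_off(entries, min_timestamp):
    """Filter stage: drop entries older than min_timestamp."""
    for key, timestamp, value in entries:
        if timestamp >= min_timestamp:
            yield key, timestamp, value


def summing_combiner(entries):
    """Combiner stage: collapse adjacent entries with the same key into a sum."""
    for key, group in groupby(entries, key=lambda entry: entry[0]):
        group = list(group)
        newest = max(timestamp for _, timestamp, _ in group)
        total = sum(value for _, _, value in group)
        yield key, newest, total


# Sorted by key; (key, timestamp, value)
stored = [
    ("breakfast", 200, 1),
    ("breakfast", 100, 1),
    ("breakfast", 50, 1),
    ("lunch", 150, 2),
]

# Stages stack: the combiner only sees what the filter lets through.
result = list(summing_combiner(age_off(stored, min_timestamp=100)))
print(result)
```

Running the same stack at compaction time instead of scan time would rewrite the stored data in its combined form, which is what "push processing to the data" refers to.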
    • 29. Accumulo /əˈkjuˈmj ʊ/ ʊˈlo 1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
    • 30. Access Control
      • Every key-value has a visibility label
      • Label is defined with boolean operators
      • Label is arbitrary and ad-hoc
        – Public
        – Private | Admin
        – Finance | (HR & Manager)
      • Authorizations presented at scan time
      • Data is filtered out automatically by system-level Iterator
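To make the label semantics concrete, here is a small evaluator for expressions like the three examples on slide 30. It is a sketch written for this transcript, not Accumulo's own implementation: `tokenize` and `is_visible` are hypothetical names, labels are restricted to letters, digits, and underscores, and an empty label is treated as visible to everyone. Requiring parentheses when `&` and `|` are mixed mirrors Accumulo's documented rule.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # assumption: an empty label hides nothing
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is not None and tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


print(is_visible("Finance | (HR & Manager)", {"HR", "Manager"}))
print(is_visible("Finance | (HR & Manager)", {"HR"}))
```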
    • 31. Access Control – Typical Architecture
      [Diagram] The Web Server and Accumulo sit inside a Trusted Zone, alongside Enterprise Identity Management.
      1.) Pass Credentials
      2.) Lookup User
      3.) Return Authorizations
      4.) Proxy Authorization
      5.) Return Visible Data
      6.) Return Data
    • 32. Access Control – Typical Architecture
      [Diagram] Same architecture, with Bob as the user. Accumulo holds three entries: SECRET&PROJECT X, 6; SECRET&PROJECT Y, 8; SECRET&PROJECT Z, 3.
      1.) PKI Cert
      2.) Lookup Bob
      3.) Auths: [SECRET, UNCLASSIFIED, PROJECT X, PROJECT Y]
      4.) Proxy Bob’s Auths
      5.) Return [6,8]
      6.) Return [6,8]
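Bob's example from slide 32 can be traced end to end in a few lines. Everything here is a stand-in: `identity_store` plays the role of Enterprise Identity Management, `scan_as` plays the web server proxying authorizations, and `visible` handles only `&`-joined labels, which is all this slide uses. The label text is copied from the slide as written.

```python
# Stand-in for Enterprise Identity Management (steps 2 and 3).
identity_store = {
    "Bob": {"SECRET", "UNCLASSIFIED", "PROJECT X", "PROJECT Y"},
}

# Stand-in for the entries stored in Accumulo: (visibility label, value).
cells = [
    ("SECRET&PROJECT X", 6),
    ("SECRET&PROJECT Y", 8),
    ("SECRET&PROJECT Z", 3),
]


def visible(label, authorizations):
    """True if every &-joined term of the label is among the authorizations."""
    return all(term.strip() in authorizations for term in label.split("&"))


def scan_as(user):
    """Steps 2-5: look up the user's authorizations, then filter at scan time."""
    authorizations = identity_store.get(user, set())
    return [value for label, value in cells if visible(label, authorizations)]


print(scan_as("Bob"))
```

Bob never sees the PROJECT Z entry, and the filtering happens inside the store rather than in the web application.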
    • 33. Demo
    • 34. Application Requirements
      Build an application to analyze trends in Twitter messages.
      • Query for word/phrase and view real-time activity in a time series graph
      • View at different time ranges (1 day, 7 days, 30 days, etc.)
      • Allow multiple query terms to compare activity (ex. Breakfast, Lunch)
      • Automatically extract daily trends for the user
    • 35. Demo Setup/Data
      • Twitter Streaming API
      • Only messages with US country codes
      • 1-, 2-, and 3-grams built
      • Data since Dec 24 – Live
      • Running on an average workstation, 1 SATA disk, 6 GB memory
      • 72 GB, 2.6 billion entries and counting
    • 36. Data Model
      • Tweets table
        – Row ID: n-gram
        – Column Family: Date Granularity (DAY, HOUR)
        – Column Qual: Date Value
        – Value: Count
        – SummingCombiner (Iterator) used to update Count
      Row ID      Col Fam   Col Qual     Value
      breakfast   DAY       20120318     31
      breakfast   DAY       20120319     56
      …           …         …            …
      lunch       HOUR      2012031801   3
      lunch       HOUR      2012031802   4
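A sketch of how a single tweet could be turned into entries for the tweets table. The tokenization (lowercase, split on whitespace), the `ngrams` and `mutations` helpers, and the `Counter` standing in for the table plus SummingCombiner are all assumptions; the slides do not show the demo's actual ingest code.

```python
from collections import Counter
from datetime import datetime


def ngrams(text, max_n=3):
    """Yield every 1-, 2-, and 3-gram of the text as a space-joined string."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


def mutations(text, when):
    """Yield ((row, family, qualifier), 1) for each n-gram at both granularities."""
    for gram in ngrams(text):
        yield (gram, "DAY", when.strftime("%Y%m%d")), 1
        yield (gram, "HOUR", when.strftime("%Y%m%d%H")), 1


# The Counter plays the role of the table with a summing combiner applied:
# writing the same key again adds to the stored count.
table = Counter()
when = datetime(2012, 3, 18, 1)
for tweet in ["breakfast at lunch", "Lunch"]:
    for key, count in mutations(tweet, when):
        table[key] += count

print(table[("lunch", "HOUR", "2012031801")])
print(table[("breakfast at", "DAY", "20120318")])
```

Since every write is just an increment of 1, the ingest side never has to read the current count first; the combiner does the summing on the server.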
    • 37. Data Model
      • Trends table
        – Row ID: (Date Granularity + Date Value)
        – Column Family: (Integer.MAX_VALUE – trendScore)
        – Column Qual: n-gram
        – Value: []
      Row ID         Col Fam      Col Qual      Value
      DAY:20120318   2147483145   church
      DAY:20120318   2147483316   hangover
      …              …            …             …
      DAY:20120319   2147476521   the broncos
      DAY:20120319   2147477704   tim tebow
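The `Integer.MAX_VALUE – trendScore` trick works because keys are sorted ascending: subtracting the score from the largest int makes the highest-scoring n-gram sort first within a day. A short sketch, where `trend_key` is a hypothetical helper and the zero-padded string encoding is an assumption (the slide does not say how the number is serialized). The scores 502 and 331 are derived from the slide's values for 20120318.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


def trend_key(granularity, date, score, ngram):
    """Build a (row, family, qualifier) key that sorts high scores first."""
    inverted = INT_MAX - score
    # Zero-pad so that lexicographic order matches numeric order.
    return (f"{granularity}:{date}", f"{inverted:010d}", ngram)


keys = sorted([
    trend_key("DAY", "20120318", 331, "hangover"),
    trend_key("DAY", "20120318", 502, "church"),
])
for row, family, qualifier in keys:
    print(row, family, qualifier)
```

Reading the top trends for a day is then a scan of the first few entries in that row, with no sorting needed at query time.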
    • 38. MapReduce Analytics
      • Utilize MapReduce for building trends
      • AccumuloInputFormat reads from tweets table
      • AccumuloOutputFormat writes to trends table
      • AccumuloStorage LoadFunc for Pig available on github
    • 39. Summary
      • Accumulo exploits locality to enable interactive access to huge data sets while adding cell-level access control and server-side programming
      • Nothing in life is free. Accumulo comes with the complexity and responsibility of managing a distributed system and designing indexes on your data
    • 40. References
      • Documentation, Mailing Lists, Links: http://incubator.apache.org/accumulo/
      • HBase Shootout: http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012
      • Trendulo: https://github.com/jaredwinick/trendulo
