Accumulo Nutch/GORA, Storm, and Pig



  1. Large Scale Web Analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
     Jason Trost, @jason_trost
  2. Introductions
     • Jason Trost (@jason_trost)
     • Senior Software Engineer at Endgame Systems
     • Former Accumulo Trainer
     • Apache Accumulo Committer
       – Apache Pig integration with Accumulo
       – Some minor bug fixes
  3. Agenda
     • Technologies Introduction
       – Apache Accumulo
       – Apache Gora
       – Apache Nutch/Gora
       – Storm
     • Accumulo at Endgame
       – Web Crawl Analytics
       – Real-time DNS Processing
       – Operations
  4. Apache Accumulo
     • Accumulo is a BigTable implementation with cell-level security.
     • It is conceptually very similar to HBase, but it has some nice features that HBase currently lacks:
       – Cell-level security
       – No fat-row problem
       – No limitation on column families, or on when column families can be created
       – A server-side, data-local programming abstraction called Iterators
       – Iterators enable fast aggregation, searching, filtering, and streaming Reduce
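To make the cell-level security point concrete, here is a hedged, self-contained sketch (not the real Accumulo API) of the semantics: each cell carries a column-visibility expression, and a scan only returns cells whose expression is satisfied by the scanner's authorizations. Real visibility expressions also support `|` and parentheses; this illustration handles only a single conjunction such as `"admin&audit"`, and the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of cell-level visibility semantics (hypothetical names,
// not Accumulo's ColumnVisibility class): a cell is visible only if every
// term in its "a&b&c" expression is among the scanner's authorizations.
public class VisibilitySketch {

    static boolean visible(String expression, Set<String> authorizations) {
        if (expression.isEmpty()) {
            return true; // an empty visibility means the cell is readable by anyone
        }
        for (String term : expression.split("&")) {
            if (!authorizations.contains(term)) {
                return false; // a required authorization is missing
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> auths = new HashSet<>(Arrays.asList("admin"));
        System.out.println(visible("admin", auths));       // true
        System.out.println(visible("admin&audit", auths)); // false: "audit" not held
        System.out.println(visible("", auths));            // true
    }
}
```

The key difference from row-level schemes is that the check happens per cell at scan time, so one row can mix data at different sensitivity levels.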
  5. Apache Gora
     • Gora is an object-to-datastore mapping layer for arbitrary data stores, both relational (MySQL) and non-relational (HBase, Cassandra, Accumulo, Redis, Voldemort, etc.).
     • It was designed for Big Data applications and has support (interfaces) for Apache Pig, Apache Hive, Cascading, and generic MapReduce.
  6. Apache Nutch/Gora
     • Nutch is a highly scalable web crawler built on Hadoop MapReduce.
     • It was designed from the ground up to be an Internet-scale web crawler and to enable large-scale search applications.
     • Gora enables storing the web crawl data and metadata in Accumulo.
  7. Storm
     • Highly scalable streaming event-processing system
     • Conceptually similar to MapReduce, but operates on streaming data in real time
     • Released by Twitter after they acquired BackType
     • Development led by Nathan Marz
     • At-least-once processing of events
     • Spouts and Bolts are wired together to form computation Topologies
     • Topologies run until killed
  8. Accumulo at Endgame
  9. Web Crawl Analytics
     • Formerly used Heritrix with a Cassandra backend for collection and storage
     • We now use Nutch/Gora to perform large-scale web crawling
     • All pages and HTTP headers are stored in Accumulo
     • We run Pig scripts to pull data out of Accumulo, perform rollups, do pattern matching (using regular expressions), and process the pages with Python scripts
  10. Real-time DNS Processing
     • We used to use MapReduce/Pig to generate daily reports on all DNS event data from files in HDFS; this took several hours.
     • Now we use an internally developed framework called Velocity, built on Storm.
     • In real time, we enrich DNS and security events with IP geo data (country, city, company, vertical) and correlate them with internally developed and maintained DNS blacklists.
     • The events are stored in Accumulo, and custom Accumulo iterators perform rollups.
     • At report-generation time, Accumulo aggregates records server-side.
     • This process now takes minutes, not hours, and we can query for partial results instead of having to wait until the end of the day.
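The enrichment step described above can be sketched in plain Java standing in for a Storm bolt. Everything here is illustrative, not Endgame's actual Velocity framework: the `GeoInfo` type, the `enrich` function, and the lookup tables are assumptions for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the DNS-event enrichment described on the slide:
// look up geo data for the event's IP and a blacklist verdict for its domain,
// then emit the event with the extra fields attached.
public class DnsEnrichSketch {

    static class GeoInfo {
        final String country, city;
        GeoInfo(String country, String city) { this.country = country; this.city = city; }
    }

    // Returns a copy of the event with geo and blacklist fields added.
    static Map<String, String> enrich(Map<String, String> event,
                                      Map<String, GeoInfo> geoByIp,
                                      Set<String> blacklist) {
        Map<String, String> out = new HashMap<>(event);
        GeoInfo geo = geoByIp.get(event.get("ip"));
        if (geo != null) {
            out.put("country", geo.country);
            out.put("city", geo.city);
        }
        out.put("blacklisted", String.valueOf(blacklist.contains(event.get("domain"))));
        return out;
    }
}
```

In a real topology this logic would live in a bolt's `execute` method, with the geo table and blacklist loaded once per worker rather than per event.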
  11. Custom Iterators & Aggregation
     Format at ingest:
       Row       GROUP BY FIELDS
       Col Fam   Constant String
       Col Qual  Event UUID
       Val       -
     • RowID contains a CSV record that represents the fields used to basically perform a GROUP BY
     • Col Qual contains the event UUID
     Format at scan time, after the custom iterator:
       Row       GROUP BY FIELDS
       Col Fam   Constant String
       Col Qual  ""
       Val       "1"
     • Basically strip off the event UUID
     • Set the value to be "1"
     • Prepares Key/Value for input into SummingCombiner
     • Output from SummingCombiner is an accurate count of aggregated records
     • This is, in essence, a streaming Reduce
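The two-step scan on this slide can be simulated end to end with a self-contained sketch (not real Accumulo iterator code): the custom-iterator step strips the event UUID from the column qualifier and rewrites the value to "1", and a SummingCombiner-style step then adds the values for identical keys, yielding a count per GROUP BY row. The `"row|colFam|colQual"` key encoding is an assumption made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Simulation of the slide's streaming Reduce. Keys are encoded as
// "row|colFam|colQual", where colQual holds the event UUID at ingest.
public class StreamingReduceSketch {

    static Map<String, Long> countPerRow(Map<String, String> ingested) {
        Map<String, Long> summed = new TreeMap<>();
        for (String key : ingested.keySet()) {
            // Custom-iterator step: drop the UUID qualifier; the value becomes 1.
            String[] parts = key.split("\\|");
            String collapsed = parts[0] + "|" + parts[1] + "|";
            // SummingCombiner step: identical collapsed keys are added together.
            summed.merge(collapsed, 1L, Long::sum);
        }
        return summed;
    }

    public static void main(String[] args) {
        Map<String, String> table = new TreeMap<>();
        table.put("us,web|stat|uuid-1", "-");
        table.put("us,web|stat|uuid-2", "-");
        table.put("de,dns|stat|uuid-3", "-");
        // Two records share the "us,web" GROUP BY fields, one has "de,dns".
        System.out.println(countPerRow(table));
    }
}
```

In real Accumulo both steps run server-side and data-local, so only the small aggregated counts cross the network to the client.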
  12. Operations with Accumulo
     • Hadoop Streaming jobs tend to kill tablet servers
       – Streaming jobs use more memory than Hadoop allows
       – This can make service memory allocations challenging
       – Reducing the number of Map tasks helped
     • Running tablet servers under supervision is critical
       – Tablet servers fail fast
       – Supervisord or daemontools restart failed processes
       – This has improved our cluster's stability dramatically
     • Pre-splitting tables is very important for throughput
       – Our rows lead with a day, e.g. "20120101"
     • Locality Groups are your friend for Nutch/Gora
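Since the rows lead with a day string, pre-split points can simply be generated per day so that each day's writes land on separate tablets instead of hammering one. A minimal sketch of generating such split points; `daySplits` is a made-up helper, though in real Accumulo the resulting strings would be handed (as `Text` values) to `TableOperations.addSplits`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.SortedSet;
import java.util.TreeSet;

// Generate day-leading split points ("yyyyMMdd") for pre-splitting a table
// whose row IDs begin with the event day.
public class DaySplitsSketch {

    static SortedSet<String> daySplits(LocalDate start, int days) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
        SortedSet<String> splits = new TreeSet<>();
        for (int i = 0; i < days; i++) {
            splits.add(start.plusDays(i).format(fmt)); // e.g. "20120101"
        }
        return splits;
    }
}
```

Splitting ahead of ingest matters because a brand-new table is a single tablet; without pre-splits, every writer initially targets one tablet server regardless of cluster size.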
  13. We’re Hiring
     • Like to work on hard problems with Big Data?
     • Are you familiar with or interested in these technologies?
       – Hadoop, Storm, Django, Nutch/Gora
       – Accumulo, Solr/ElasticSearch, Redis
       – Python, Java, Pig, Node.JS, GitHub
     • Want to contribute to Open Source?
     • We have offices in Atlanta, Washington DC, Baltimore, and San Antonio
  14. Questions?
  15. Contact Info
     • Jason Trost
     • Email:
     • Twitter: @jason_trost
     • Blog: