Large Scale Web Analytics with Accumulo
    (and Nutch/Gora, Pig, and Storm)


                        Jason Trost
                        jtrost@endgames.us
                        @jason_trost
Introductions
•   Jason Trost (jtrost@endgames.us)
•   Senior Software Engineer at Endgame Systems
•   Former Accumulo Trainer
•   Apache Accumulo Committer
    – Apache Pig integration with Accumulo
    – some minor bug fixes
Agenda
• Technologies Introduction
  – Apache Accumulo
  – Apache Gora
  – Apache Nutch/Gora
  – Storm
• Accumulo at Endgame
  – Web Crawl Analytics
  – Real-time DNS Processing
  – Operations
Apache Accumulo
• Accumulo is a BigTable implementation with cell
  level security
• It is conceptually very similar to HBase, but it has
  some nice features that HBase is currently
  lacking.
• Some of these features are:
   –   Cell level security
   –   No fat row problem
   –   No limitation on col fams or when col fams can be created
   –   Server side, data local, programming abstraction called Iterators
   –   Iterators enable fast aggregation, searching, filtering, streaming
       Reduce
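
To make cell-level security concrete, here is a minimal Python sketch of the idea, not Accumulo's implementation. Each cell carries a visibility expression, and a scan returns only the cells whose expression is satisfied by the reader's authorizations. The toy grammar (labels joined by & or |, with parentheses), the sample labels, and the `visible`/`scan` helpers are assumptions made for illustration.

```python
import re


def visible(expression, auths):
    """Return True if the visibility expression is satisfied by auths.

    Toy grammar (an assumption for this sketch): labels joined by & (and)
    or | (or), with parentheses for grouping. Operators are applied left
    to right, so mixed operators should be parenthesized. An empty
    expression is visible to everyone.
    """
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if op == "&" else (result or right)
        return result

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            pos += 1  # skip the closing parenthesis
            return result
        return token in auths

    return parse_expr() if tokens else True


# Hypothetical cells: (row, value, visibility expression)
cells = [
    ("row1", "public page", ""),
    ("row2", "analyst note", "analyst"),
    ("row3", "restricted", "analyst&(ops|admin)"),
]


def scan(cells, auths):
    """Return only the values this set of authorizations may see."""
    return [value for _, value, vis in cells if visible(vis, auths)]
```

A reader holding only the "analyst" label would see the first two cells; adding "ops" or "admin" would expose the third.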
Apache Gora
• Gora is an object relational/non-relational
  mapping for arbitrary data stores including
  both relational (MySQL) and non-relational
  data stores (HBase, Cassandra, Accumulo,
  Redis, Voldemort, etc.).
• It was designed for Big Data applications and
  has support (interfaces) for Apache Pig, Apache
  Hive, Cascading, and generic MapReduce.
Apache Nutch/Gora
• Nutch is a highly scalable web crawler built
  over Hadoop MapReduce.
• It was designed from the ground up to be an
  Internet scale web crawler and to enable large
  scale search applications
• Gora enables the storing of the web crawl
  data and metadata in Accumulo
Storm
• Highly scalable streaming event processing system
• Conceptually similar to MapReduce, but operates on
  streaming data in real-time
• Released by Twitter after they acquired Backtype
• Development led by Nathan Marz
• At-least-once processing of events
• Spouts and Bolts are wired together to form
  computation Topologies
• Topologies run until killed
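
The spout-and-bolt wiring can be pictured with a small single-process Python sketch. This is not Storm's API: it only mimics the dataflow with generators, and the function names and sample events are hypothetical. A real topology runs in parallel across a cluster, never ends on its own, and may replay tuples (at-least-once), so downstream steps have to tolerate duplicates.

```python
from collections import Counter


def dns_spout(events):
    """Spout: emits raw tuples into the topology (here, from a list)."""
    for event in events:
        yield event


def parse_bolt(stream):
    """Bolt: turns 'client domain' strings into (client, domain) tuples."""
    for line in stream:
        client, domain = line.split()
        yield client, domain.lower()


def count_bolt(stream):
    """Bolt: keeps a running count per domain and emits each update."""
    counts = Counter()
    for _, domain in stream:
        counts[domain] += 1
        yield domain, counts[domain]


# Hypothetical input; a real spout would read from a queue or socket.
events = ["10.0.0.1 Example.com", "10.0.0.2 example.com", "10.0.0.1 test.org"]

# "Wiring" the topology: spout -> parse bolt -> count bolt.
topology = count_bolt(parse_bolt(dns_spout(events)))
updates = list(topology)
```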
Accumulo at Endgame
Web Crawl Analytics
• Formerly used Heritrix with a Cassandra backend
  for collection and storage
• We now use Nutch/Gora to perform large-scale
  web crawling
• All pages and HTTP headers are stored in
  Accumulo
• Run Pig scripts for pulling data out of Accumulo,
  performing rollups, performing pattern matching
  (using regular expressions), and processing the
  pages using Python scripts
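
As an illustration of the pattern-matching and rollup step, the sketch below pulls the server product out of stored HTTP headers with a regular expression and counts pages per product. It is a guess at the kind of Python script involved, not the actual Endgame code; the record layout, the sample pages, and the function names are assumptions.

```python
import re
from collections import Counter

# Matches the product name in a "Server:" header line, up to any "/version".
SERVER_RE = re.compile(r"^Server:\s*([^\r\n/ ]+)", re.IGNORECASE | re.MULTILINE)


def server_product(headers):
    """Return the lowercased server product, or 'unknown' if absent."""
    match = SERVER_RE.search(headers)
    return match.group(1).lower() if match else "unknown"


def rollup(pages):
    """Count crawled pages per server product."""
    return Counter(server_product(headers) for _, headers in pages)


# Hypothetical crawl records: (url, raw HTTP response headers)
pages = [
    ("http://a.example/", "HTTP/1.1 200 OK\r\nServer: nginx/1.2.1\r\n"),
    ("http://b.example/", "HTTP/1.1 200 OK\r\nServer: Apache/2.2.22 (Ubuntu)\r\n"),
    ("http://c.example/", "HTTP/1.1 200 OK\r\nserver: nginx\r\n"),
    ("http://d.example/", "HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n"),
]

totals = rollup(pages)
```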
Real-time DNS Processing
• We used to use MapReduce/Pig to generate daily reports on all
  DNS event data from files in HDFS; this took several hours
• Now, we use an internally developed framework called Velocity
  that was built over Storm
• In real-time, enrich DNS and security events with IP geo data
  (country, city, company, vertical), correlate with internally
  developed/maintained DNS blacklists
• Store the events in Accumulo & use custom Accumulo iterators to
  perform rollups
• At report generation time, Accumulo aggregates records server side
• This process now takes minutes, not hours, and we can query for
  partial results instead of having to wait until the end of the day
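
The enrichment step can be sketched as a lookup per event. Velocity is internal and not described further in these slides, so everything below is an assumption: the geo table keyed by network prefix, the blacklist set, the event fields, and the `enrich` function are hypothetical stand-ins.

```python
import ipaddress

# Hypothetical IP geo data, keyed by network prefix.
GEO_TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), {"country": "US", "city": "Atlanta"}),
    (ipaddress.ip_network("198.51.100.0/24"), {"country": "DE", "city": "Berlin"}),
]

# Hypothetical DNS blacklist.
BLACKLIST = {"bad.example"}


def enrich(event):
    """Return a copy of the event with geo fields and a blacklist flag."""
    address = ipaddress.ip_address(event["ip"])
    # First matching prefix wins; unknown addresses get no geo fields.
    geo = next((info for net, info in GEO_TABLE if address in net), {})
    return {**event, **geo, "blacklisted": event["domain"] in BLACKLIST}


enriched = enrich({"ip": "203.0.113.7", "domain": "bad.example"})
```

In a streaming setting a function like this would run once per event inside a bolt, before the event is written to Accumulo.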
Custom Iterators & Aggregation
Ingest Format
Row         GROUP BY FIELDS
Col Fam     Constant String
Col Qual    Event UUID
Val         -

At Ingest
• RowID contains a CSV record that represents the fields used to
  basically perform a GROUP BY
• Col Qual contains the event UUID

Format After Custom Iterator
Row         GROUP BY FIELDS
Col Fam     Constant String
Col Qual    “”
Val         “1”

At Scan time
• Basically strip off the event UUID
• Set the value to be “1”
• Prepares Key/Value for input into SummingCombiner
• Output from SummingCombiner is an accurate count of aggregated
  records
• This is, in essence, a streaming Reduce
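
The two scan-time steps can be imitated in a few lines of Python. This is a sketch of the idea only: real iterators are Java classes running inside the tablet servers, and the sample rows, the column family "e", and the function names here are hypothetical. It relies on the entries arriving in sorted key order, which is how Accumulo stores them.

```python
from itertools import groupby


def strip_uuid(entries):
    """Custom-iterator step: blank the qualifier and set the value to '1'."""
    for (row, fam, _qual), _value in entries:
        yield (row, fam, ""), "1"


def summing_combiner(entries):
    """Combiner step: sum the values of adjacent entries with equal keys."""
    for key, group in groupby(entries, key=lambda kv: kv[0]):
        yield key, str(sum(int(value) for _, value in group))


# Hypothetical ingested entries: ((row, col fam, col qual), value),
# sorted by key the way a scan would return them.
table = sorted([
    (("20120101,US,example.com", "e", "uuid-3"), ""),
    (("20120101,US,example.com", "e", "uuid-1"), ""),
    (("20120101,DE,test.org", "e", "uuid-2"), ""),
])

counts = list(summing_combiner(strip_uuid(table)))
```

Because each event has its own UUID in the qualifier, every event is a distinct entry at ingest, and the count only collapses once the qualifier is stripped at scan time.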
Operations with Accumulo
• Hadoop Streaming jobs tend to kill tablet servers
   – Streaming jobs use more memory than Hadoop allows
   – This can make service memory allocations challenging
   – Reducing number of Map tasks helped
• Running tablet servers under supervision is critical
   – Tablet servers fail fast
   – Supervisord or daemontools restart failed processes
   – Has improved our cluster’s stability dramatically
• Pre-splitting tables is very important for throughput
   – Our rows lead with a day, e.g. “20120101”
• Locality Groups are your friend for Nutch/Gora
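
For rows that lead with a day, split points can be generated ahead of time. The helper below is a minimal sketch with a hypothetical name; it only produces the split strings, which would then be applied to the table (for example with the Accumulo shell's addsplits command). The slides do not say how rows continue after the day, so finer splits within a day are left out.

```python
from datetime import date, timedelta


def day_splits(start, days):
    """Return one split point per day, formatted like the row prefix."""
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


splits = day_splits(date(2012, 1, 1), 3)
```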
We’re Hiring
• Like to work on hard problems with Big Data?
• Are you familiar/interested in these
  technologies?
  – Hadoop, Storm, Django, Nutch/GORA
  – Accumulo, Solr/ElasticSearch, Redis
  – Python, Java, Pig, Node.js, GitHub
• Want to contribute to Open Source?
• We have offices in Atlanta, Washington DC,
  Baltimore, and San Antonio
• www.linkedin.com/jobs/at-Endgame-Systems
Questions?
Contact Info
•   Jason Trost
•   Email: jtrost@endgames.us
•   Twitter: @jason_trost
•   Blog: www.covert.io
