Hw09 Cross Data Center Logs Processing
  • All storage in Hadoop: no filers or SANs

Presentation Transcript

  • Rackspace Hosting – Hadoop World 2009. Stu Hood, Search Team Technical Lead. Date: October 2, 2009. Cross Datacenter Logs Processing
  • Overview
    • Use case
      • Background
      • Log Types
      • Querying
    • Previous Solutions
    • The Hadoop Solution
    • Implementation
      • Collection
      • Index time
      • Query time
    • Advantages of Hadoop
      • Storage
      • Analysis
      • Scalability
      • Community
  • Use Case: Background
    • “Rackapps” - Email and Apps Division
      • Founded 1999, merged with Rackspace 2007
      • Hybrid Mail Hosting
        • 40% of accounts have a mix of Exchange and Rackspace Email
        • Fantastic Control Panel to juggle accounts
        • Webmail client with calendar/contact/note sharing
      • More Apps to come
      • Environment
        • 1K+ servers at 3 of 6 Rackspace datacenters
        • Breakdown - 80% Linux, 20% Windows
          • “Rackspace Email” - custom email and application platform
          • Microsoft Exchange
  • Use Case: Log Types
    • MTA (mail delivery) logs
      • Postfix
      • Exchange
      • Momentum
    • Spam and virus logs
      • Amavis
    • Access logs
      • Dovecot
      • Exchange
      • httpd logs
  • Use Case: Querying
    • Support Team
      • Needs to answer basic questions:
        • Mail Transfer – Was it delivered?
        • Spam – Why was this (not) marked as spam?
        • Access – Who (checked | failed to check) mail?
    • Engineering
      • More advanced questions:
        • Which delivery routes have the highest latency?
        • Which are the spammiest IPs?
        • Where in the world do customers log in from?
    • Elsewhere
      • Cloud teams use Hadoop for even more mission-critical statistics
  • Previous Solutions
    • V1 – Query at the Source
      • Founding – 2006
      • No processing: flat log files on each source machine
      • To query, support escalates a ticket to Engineering
      • Queries take hours
      • 14 days available, single datacenter
    • V2 – Bulk load to MySQL
      • 2006 – 2007
      • Process logs, bulk load into denormalized schema
      • Add merge tables for common query time ranges
      • SQL self joins to find log entries for a path
      • Queries take minutes
      • 1 day available, single datacenter
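The V2 self-join approach can be sketched with SQLite (table and column names here are hypothetical, not from the talk): each log event is one denormalized row keyed by the MTA queue id, and a self join stitches the delivery path of a single message back together.

```python
import sqlite3

# Hypothetical denormalized schema: one row per log event, keyed by queue_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (queue_id TEXT, stage TEXT, ts INTEGER, detail TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?)",
    [
        ("A1B2", "received", 100, "from=alice@example.com"),
        ("A1B2", "delivered", 160, "to=bob@example.com"),
        ("C3D4", "received", 120, "from=carol@example.com"),
    ],
)

# Self join: pair each 'received' event with its matching 'delivered' event
# to reconstruct the path (and latency) of one message.
rows = conn.execute(
    """
    SELECT r.queue_id, r.detail, d.detail, d.ts - r.ts AS latency
    FROM log r JOIN log d ON r.queue_id = d.queue_id
    WHERE r.stage = 'received' AND d.stage = 'delivered'
    """
).fetchall()
print(rows)  # [('A1B2', 'from=alice@example.com', 'to=bob@example.com', 60)]
```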
  • The Hadoop Solution
    • V3 – Lucene Indexes in Hadoop
      • 2007 – Present
      • Raw logs collected and processed in Hadoop
      • Lucene indexes as intermediate format
      • “Realtime” queries via Solr
        • Indexes merged to Solr nodes with 15-minute turnaround
        • 7 days stored uncompressed
        • Queries take seconds
      • Long term querying via MapReduce, high level languages
        • Hadoop InputFormat for Lucene indexes
        • 6 months available for MR queries
        • Queries take minutes
      • Multiple datacenters
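A “realtime” lookup against the Solr tier above might be issued like this sketch (the host, core, and field names are invented; the deck only says queries go through the Solr API, with a custom plugin routing each query to the relevant shards):

```python
from urllib.parse import urlencode

# Hypothetical Solr node and schema fields for a delivery lookup over a
# time range (epoch seconds, as in the Pig example later in the deck).
base = "http://solr01.dc1.example.com:8983/solr/logs/select"
params = {
    "q": "recips:bob@example.com AND timestamp:[1251777901 TO 1252447501]",
    "rows": 20,
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

In the real deployment the sharding plugin means a query like this only touches the nodes holding the relevant time slices, rather than fanning out to all 20 Solr nodes.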
  • The Hadoop Solution: Alternatives
    • Splunk
      • Great for realtime querying, but weak for long term analysis
      • Archived data is not easily queryable
    • Data warehouse package
      • Weak for realtime querying, great for long term analysis
    • Partitioned MySQL
      • Mediocre solution to either goal
      • Needed something similar to MapReduce for sharded MySQL
  • Implementation: Collection
    • Software
      • Transport
        • Syslog-ng
        • SSH tunnel between datacenters
        • Considering Scribe, rsyslog, or alternatives
      • Storage
        • App to deposit to Hadoop using Java API
    • Hardware
      • Per Datacenter
        • 2-4 collector machines
        • hundreds of source machines
      • Single Datacenter
        • 30 node Hadoop cluster
        • 20 Solr nodes
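A minimal sketch of the deposit step: the collectors batch incoming syslog lines and flush each batch as a file. The real app writes to Hadoop via the Java HDFS API; the class name, batch size, and local-directory stand-in for HDFS below are all invented.

```python
import os
import tempfile

class LogDepositor:
    """Batch incoming syslog lines and flush each batch as one file."""

    def __init__(self, root, batch_size=3):
        self.root, self.batch_size = root, batch_size
        self.buf, self.flushed = [], 0

    def append(self, line):
        self.buf.append(line)
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write the buffered lines as one batch file, then reset the buffer.
        if not self.buf:
            return
        path = os.path.join(self.root, f"batch-{self.flushed:05d}.log")
        with open(path, "w") as f:
            f.write("\n".join(self.buf) + "\n")
        self.buf, self.flushed = [], self.flushed + 1

root = tempfile.mkdtemp()
dep = LogDepositor(root)
for line in [
    "postfix: queued A1B2",
    "postfix: sent A1B2",
    "amavis: clean C3D4",
    "dovecot: login bob",
]:
    dep.append(line)
dep.flush()
print(sorted(os.listdir(root)))  # ['batch-00000.log', 'batch-00001.log']
```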
  • Implementation: Indexing/Querying
    • Indexing
      • Separate processing code for schema’d and unschema’d logs
      • SolrOutputFormat generates compressed Lucene indexes
    • Querying
      • “Realtime”
        • Sharded Lucene/Solr instances merge index chunks from Hadoop
        • Using Solr API
          • Plugin to optimize sharding: queries are distributed to relevant nodes
          • Solr merges results
      • Raw Logs
        • Using Hadoop Streaming and unix grep
      • MapReduce
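The raw-log path (“Hadoop Streaming and unix grep”) amounts to a filter-only mapper. A Python stand-in for grep looks like the sketch below; the sample log lines are invented, and in practice the filter would read stdin under the streaming jar rather than a list.

```python
import re

def grep_mapper(lines, pattern):
    """Emit only the log lines matching the pattern, as a streaming mapper would."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

logs = [
    "Oct  2 10:01:01 mta1 postfix/smtp: A1B2: to=<bob@example.com> status=sent",
    "Oct  2 10:01:05 mta1 dovecot: imap-login: user=<alice>",
    "Oct  2 10:02:11 mta2 postfix/smtp: C3D4: to=<bob@example.com> status=bounced",
]
matches = grep_mapper(logs, r"status=sent")
print(matches)
```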
  • Implementation: Example
  • Implementation: Timeframe
      • Development
        • Developed by a team of 1.5 in 3 months
        • Indexing, Statistics
      • Deployment
        • Developers acted as operations team
        • Cloudera deployment resolved this problem
      • Roadblocks
        • Bumped into job-size limitations
          • Resolved now
  • Advantages: Storage
    • Raw Logs
      • 3 days
      • For debugging purposes, use by engineering
      • In HDFS
    • Indexes
      • 7 days
      • Queryable via Solr API
      • On local disk
    • Archived Indexes
      • 6+ months
      • Queryable via Hadoop, or use API to ask for old data to be made accessible in Solr
      • In HDFS
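The three tiers above can be read as an age-based policy. The sketch below is a simplification (in reality raw logs and indexes coexist for the first days, and the tier names are mine); the retention windows come from the slide.

```python
def storage_tier(age_days):
    """Map the age of log data to its storage tier, per the retention windows above."""
    if age_days <= 3:
        return "raw logs in HDFS"          # debugging, engineering use
    if age_days <= 7:
        return "Solr index on local disk"  # queryable via the Solr API
    if age_days <= 180:
        return "archived index in HDFS"    # queryable via MapReduce
    return "expired"

print([storage_tier(d) for d in (1, 5, 90, 400)])
```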
  • Advantages: Analysis
    • Java MapReduce API
      • For optimal performance of frequently run jobs
    • Apache Pig
      • Ideal for one-off queries
      • Interactive development
      • No need to understand MapReduce (SQL replacement)
      • Extensible via UDFs
    • Hadoop Streaming
      • For users comfortable with MapReduce, in a hurry
      • Use any language (frequently Python)
  • Pig Example
    • records = LOAD 'amavis' USING us.webmail.pig.io.SolrSlicer('sender,timestamp,rip,recips', '1251777901', '1252447501');
    • flat = FOREACH records GENERATE FLATTEN(sender), FLATTEN(timestamp), FLATTEN(rip), FLATTEN(recips);
    • filtered = FILTER flat BY sender IS NOT NULL AND sender MATCHES '.*whitehousegov$';
    • cleantimes = FOREACH filtered GENERATE sender,(us.webmail.pig.udf.FromSolrLong(timestamp) / 3600 * 3600) as timestamp,rip,recips;
    • grouped = GROUP cleantimes BY (sender, rip, timestamp);
    • counts = FOREACH grouped GENERATE group, COUNT(*);
    • hostcounts = FOREACH counts GENERATE group.sender, us.webmail.pig.udf.ReverseDNS(group.rip) as host, group.timestamp, $1;
    • dump hostcounts;
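For readers unfamiliar with Pig, the hour-truncation and GROUP/COUNT steps above behave like this Python sketch (the sample tuples are invented; the integer division mirrors the `/ 3600 * 3600` bucketing in the script):

```python
from collections import Counter

# (sender, rip, timestamp) tuples, timestamps in epoch seconds.
records = [
    ("press@whitehousegov", "1.2.3.4", 1251780000),
    ("press@whitehousegov", "1.2.3.4", 1251780500),  # same hour, same IP
    ("press@whitehousegov", "1.2.3.4", 1251784000),  # next hour
]

# Mirror the Pig steps: truncate timestamps to the hour, then GROUP ... COUNT(*).
counts = Counter(
    (sender, rip, ts // 3600 * 3600) for sender, rip, ts in records
)
for key, n in sorted(counts.items()):
    print(key, n)
```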
  • Advantages: Scalability, Cost, Community
    • Scalability
      • Add or remove nodes at any time
      • Linearly increase processing and storage capacity
      • No code changes
    • Cost
      • Only expansion cost is hardware
      • No licensing
    • Community
      • Constant development and improvements
      • Stream of patches adding capability and performance
      • Companies like Cloudera exist to:
        • Abstract away patch selection
        • Trivialize deployment
        • Provide emergency support
  • Fin! Questions?