Speaker note: All storage in Hadoop: no filers or SANs
1. Rackspace Hosting, Hadoop World 2009. Stu Hood – Search Team Technical Lead. Date: October 2, 2009. Cross Datacenter Logs Processing
2. Overview <ul><li>Use case </li></ul><ul><ul><li>Background </li></ul></ul><ul><ul><li>Log Types </li></ul></ul><ul><ul><li>Querying </li></ul></ul><ul><li>Previous Solutions </li></ul><ul><li>The Hadoop Solution </li></ul><ul><li>Implementation </li></ul><ul><ul><li>Collection </li></ul></ul><ul><ul><li>Index time </li></ul></ul><ul><ul><li>Query time </li></ul></ul><ul><li>Advantages of Hadoop </li></ul><ul><ul><li>Storage </li></ul></ul><ul><ul><li>Analysis </li></ul></ul><ul><ul><li>Scalability </li></ul></ul><ul><ul><li>Community </li></ul></ul>
3. Use Case: Background <ul><li>“Rackapps” - Email and Apps Division </li></ul><ul><ul><li>Founded 1999, merged with Rackspace 2007 </li></ul></ul><ul><ul><li>Hybrid Mail Hosting </li></ul></ul><ul><ul><ul><li>40% of accounts have a mix of Exchange and Rackspace Email </li></ul></ul></ul><ul><ul><ul><li>Fantastic Control Panel to juggle accounts </li></ul></ul></ul><ul><ul><ul><li>Webmail client with calendar/contact/note sharing </li></ul></ul></ul><ul><ul><li>More Apps to come </li></ul></ul><ul><ul><li>Environment </li></ul></ul><ul><ul><ul><li>1K+ servers at 3 of 6 Rackspace datacenters </li></ul></ul></ul><ul><ul><ul><li>Breakdown - 80% Linux, 20% Windows </li></ul></ul></ul><ul><ul><ul><ul><li>“Rackspace Email” - custom email and application platform </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Microsoft Exchange </li></ul></ul></ul></ul>
4. Use Case: Log Types <ul><li>MTA (mail delivery) logs </li></ul><ul><ul><li>Postfix </li></ul></ul><ul><ul><li>Exchange </li></ul></ul><ul><ul><li>Momentum </li></ul></ul><ul><li>Spam and virus logs </li></ul><ul><ul><li>Amavis </li></ul></ul><ul><li>Access logs </li></ul><ul><ul><li>Dovecot </li></ul></ul><ul><ul><li>Exchange </li></ul></ul><ul><ul><li>httpd logs </li></ul></ul>
5. Use Case: Querying <ul><li>Support Team </li></ul><ul><ul><li>Needs to answer basic questions: </li></ul></ul><ul><ul><ul><li>Mail Transfer – Was it delivered? </li></ul></ul></ul><ul><ul><ul><li>Spam – Why was this (not) marked as spam? </li></ul></ul></ul><ul><ul><ul><li>Access – Who (checked | failed to check) mail? </li></ul></ul></ul><ul><li>Engineering </li></ul><ul><ul><li>More advanced questions: </li></ul></ul><ul><ul><ul><li>Which delivery routes have the highest latency? </li></ul></ul></ul><ul><ul><ul><li>Which are the spammiest IPs? </li></ul></ul></ul><ul><ul><ul><li>Where in the world do customers log in from? </li></ul></ul></ul><ul><li>Elsewhere </li></ul><ul><ul><li>Cloud teams use Hadoop for even more mission critical statistics </li></ul></ul>
6. Previous Solutions <ul><li>V1 – Query at the Source </li></ul><ul><ul><li>Founding – 2006 </li></ul></ul><ul><ul><li>No processing: flat log files on each source machine </li></ul></ul><ul><ul><li>To query, support escalates a ticket to Engineering </li></ul></ul><ul><ul><li>Queries take hours </li></ul></ul><ul><ul><li>14 days available, single datacenter </li></ul></ul><ul><li>V2 – Bulk load to MySQL </li></ul><ul><ul><li>2006 – 2007 </li></ul></ul><ul><ul><li>Process logs, bulk load into denormalized schema </li></ul></ul><ul><ul><li>Add merge tables for common query time ranges </li></ul></ul><ul><ul><li>SQL self joins to find log entries for a path </li></ul></ul><ul><ul><li>Queries take minutes </li></ul></ul><ul><ul><li>1 day available, single datacenter </li></ul></ul>
7. The Hadoop Solution <ul><li>V3 – Lucene Indexes in Hadoop </li></ul><ul><ul><li>2007 – Present </li></ul></ul><ul><ul><li>Raw logs collected and processed in Hadoop </li></ul></ul><ul><ul><li>Lucene indexes as intermediate format </li></ul></ul><ul><ul><li>“Realtime” queries via Solr </li></ul></ul><ul><ul><ul><li>Indexes merged to Solr nodes with 15 minute turnaround </li></ul></ul></ul><ul><ul><ul><li>7 days stored uncompressed </li></ul></ul></ul><ul><ul><ul><li>Queries take seconds </li></ul></ul></ul><ul><ul><li>Long term querying via MapReduce, high level languages </li></ul></ul><ul><ul><ul><li>Hadoop InputFormat for Lucene indexes </li></ul></ul></ul><ul><ul><ul><li>6 months available for MR queries </li></ul></ul></ul><ul><ul><ul><li>Queries take minutes </li></ul></ul></ul><ul><ul><li>Multiple datacenters </li></ul></ul>
8. The Hadoop Solution: Alternatives <ul><li>Splunk </li></ul><ul><ul><li>Great for realtime querying, but weak for long term analysis </li></ul></ul><ul><ul><li>Archived data is not easily queryable </li></ul></ul><ul><li>Data warehouse package </li></ul><ul><ul><li>Weak for realtime querying, great for long term analysis </li></ul></ul><ul><li>Partitioned MySQL </li></ul><ul><ul><li>Mediocre solution to either goal </li></ul></ul><ul><ul><li>Needed something similar to MapReduce for sharded MySQL </li></ul></ul>
9. Implementation: Collection <ul><li>Software </li></ul><ul><ul><li>Transport </li></ul></ul><ul><ul><ul><li>Syslog-ng </li></ul></ul></ul><ul><ul><ul><li>SSH tunnel between datacenters </li></ul></ul></ul><ul><ul><ul><li>Considering Scribe/rsyslog/? </li></ul></ul></ul><ul><ul><li>Storage </li></ul></ul><ul><ul><ul><li>App to deposit to Hadoop using Java API </li></ul></ul></ul><ul><li>Hardware </li></ul><ul><ul><li>Per Datacenter </li></ul></ul><ul><ul><ul><li>2-4 collector machines </li></ul></ul></ul><ul><ul><ul><li>Hundreds of source machines </li></ul></ul></ul><ul><ul><li>Single Datacenter </li></ul></ul><ul><ul><ul><li>30 node Hadoop cluster </li></ul></ul></ul><ul><ul><ul><li>20 Solr nodes </li></ul></ul></ul>
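The deposit step above is a small app using the Hadoop Java API. Its batching logic can be sketched in Python; the directory layout, the function names, and the plain-file stand-in for HDFS below are all assumptions for illustration, not the deck's actual code:

```python
import os

def log_path(root, datacenter, logtype, epoch_hour):
    # Hypothetical layout: one directory per datacenter / log type / hour.
    return os.path.join(root, datacenter, logtype, str(epoch_hour))

def deposit(root, datacenter, logtype, epoch, lines):
    """Append a batch of raw log lines into its hourly bucket.
    A real collector would write via the Hadoop Java API; a local
    file write stands in here."""
    path = log_path(root, datacenter, logtype, epoch // 3600 * 3600)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "part"), "a") as f:
        for line in lines:
            f.write(line + "\n")
```

Bucketing by hour keeps each batch append-only, which suits HDFS's write-once files.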
10. Implementation: Indexing/Querying <ul><li>Indexing </li></ul><ul><ul><li>Unique processing code for schema’d and unschema’d logs </li></ul></ul><ul><ul><li>SolrOutputFormat generates compressed Lucene indexes </li></ul></ul><ul><li>Querying </li></ul><ul><ul><li>“Realtime” </li></ul></ul><ul><ul><ul><li>Sharded Lucene/Solr instances merge index chunks from Hadoop </li></ul></ul></ul><ul><ul><ul><li>Using Solr API </li></ul></ul></ul><ul><ul><ul><ul><li>Plugin to optimize sharding: queries are distributed to relevant nodes </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Solr merges results </li></ul></ul></ul></ul><ul><ul><li>Raw Logs </li></ul></ul><ul><ul><ul><li>Using Hadoop Streaming and unix grep </li></ul></ul></ul><ul><ul><li>MapReduce </li></ul></ul>
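The sharding plugin mentioned above sends a query only to Solr nodes whose index slices overlap the queried time range. A minimal Python sketch of that routing idea, assuming hourly slices and an invented shard-naming scheme (the real plugin's granularity and naming are not given in the deck):

```python
SLICE_SECONDS = 3600  # assume indexes are merged in hourly slices

def shards_for_range(start_epoch, end_epoch):
    """Return the (hypothetical) shard names whose slice overlaps
    the inclusive [start_epoch, end_epoch] query range."""
    first = start_epoch // SLICE_SECONDS
    last = end_epoch // SLICE_SECONDS
    return ["slice-%d" % (h * SLICE_SECONDS) for h in range(first, last + 1)]

# A 90-minute window touches two hourly slices, so only two shards are queried.
print(shards_for_range(7200, 12600))  # ['slice-7200', 'slice-10800']
```

The payoff is that a narrow time window touches a handful of nodes instead of the whole tier, and Solr's distributed search merges the per-shard results.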
11. Implementation: Example
12. Implementation: Timeframe <ul><ul><li>Development </li></ul></ul><ul><ul><ul><li>Developed by a team of 1.5 in 3 months </li></ul></ul></ul><ul><ul><ul><li>Indexing, Statistics </li></ul></ul></ul><ul><ul><li>Deployment </li></ul></ul><ul><ul><ul><li>Developers acted as operations team </li></ul></ul></ul><ul><ul><ul><li>Cloudera deployment resolved this problem </li></ul></ul></ul><ul><ul><li>Roadblocks </li></ul></ul><ul><ul><ul><li>Bumped into job-size limitations </li></ul></ul></ul><ul><ul><ul><ul><li>Resolved now </li></ul></ul></ul></ul>
13. Advantages: Storage <ul><li>Raw Logs </li></ul><ul><ul><li>3 days </li></ul></ul><ul><ul><li>For debugging purposes, use by engineering </li></ul></ul><ul><ul><li>In HDFS </li></ul></ul><ul><li>Indexes </li></ul><ul><ul><li>7 days </li></ul></ul><ul><ul><li>Queryable via Solr API </li></ul></ul><ul><ul><li>On local disk </li></ul></ul><ul><li>Archived Indexes </li></ul><ul><ul><li>6+ months </li></ul></ul><ul><ul><li>Queryable via Hadoop, or use API to ask for old data to be made accessible in Solr </li></ul></ul><ul><ul><li>In HDFS </li></ul></ul>
14. Advantages: Analysis <ul><li>Java MapReduce API </li></ul><ul><ul><li>For optimal performance of frequently run jobs </li></ul></ul><ul><li>Apache Pig </li></ul><ul><ul><li>Ideal for one-off queries </li></ul></ul><ul><ul><li>Interactive development </li></ul></ul><ul><ul><li>No need to understand MapReduce (SQL replacement) </li></ul></ul><ul><ul><li>Extensible via UDFs </li></ul></ul><ul><li>Hadoop Streaming </li></ul><ul><ul><li>For users comfortable with MapReduce who are in a hurry </li></ul></ul><ul><ul><li>Use any language (frequently Python) </li></ul></ul>
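The Streaming route above boils down to a mapper and a reducer that read stdin and write stdout. This generic count-by-sender sketch shows the shape; the field layout is invented for illustration, and the `__main__` block is a local dry run standing in for a real `hadoop jar ... -mapper ... -reducer ...` invocation:

```python
from itertools import groupby

def mapper(lines):
    # Emit (key, 1) pairs; assume the first whitespace field is the sender.
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0], 1

def reducer(pairs):
    # Hadoop Streaming delivers mapper output sorted by key; sum per key.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    logs = ["a@x.com delivered", "b@y.com delivered", "a@x.com deferred"]
    print(dict(reducer(mapper(logs))))  # {'a@x.com': 2, 'b@y.com': 1}
```

The same pipeline can be dry-run from a shell as `cat log | mapper | sort | reducer`, which is what makes Streaming attractive when you are in a hurry.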
15. Pig Example <ul><li>records = LOAD 'amavis' USING us.webmail.pig.io.SolrSlicer('sender,timestamp,rip,recips', '1251777901', '1252447501'); </li></ul><ul><li>flat = FOREACH records GENERATE FLATTEN(sender), FLATTEN(timestamp), FLATTEN(rip), FLATTEN(recips); </li></ul><ul><li>filtered = FILTER flat BY sender IS NOT NULL AND sender MATCHES '.*whitehousegov$'; </li></ul><ul><li>cleantimes = FOREACH filtered GENERATE sender,(us.webmail.pig.udf.FromSolrLong(timestamp) / 3600 * 3600) as timestamp,rip,recips; </li></ul><ul><li>grouped = GROUP cleantimes BY (sender, rip, timestamp); </li></ul><ul><li>counts = FOREACH grouped GENERATE group, COUNT(*); </li></ul><ul><li>hostcounts = FOREACH counts GENERATE group.sender, us.webmail.pig.udf.ReverseDNS(group.rip) as host, group.timestamp, $1; </li></ul><ul><li>dump hostcounts; </li></ul>
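For readers who don't speak Pig Latin, the filter/group/count core of the script above behaves roughly like this Python sketch. The sample records are invented, field names mirror the script, and the ReverseDNS UDF step is omitted:

```python
from collections import Counter

# Invented sample records: (sender, epoch_timestamp, remote_ip)
records = [
    ("press@whitehousegov", 1251780000, "10.0.0.1"),
    ("press@whitehousegov", 1251780600, "10.0.0.1"),
    ("other@example.com",   1251780000, "10.0.0.2"),
]

# FILTER ... BY sender MATCHES '.*whitehousegov$'
filtered = [r for r in records if r[0].endswith("whitehousegov")]

# timestamp / 3600 * 3600 in the Pig script truncates to the hour;
# GROUP ... BY (sender, rip, timestamp) plus COUNT becomes a Counter.
counts = Counter(
    (sender, rip, ts // 3600 * 3600) for sender, ts, rip in filtered
)
print(counts.most_common())
```

The result is one row per (sender, source IP, hour) bucket, which is exactly the shape the Pig script dumps.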
16. Advantages: Scalability, Cost, Community <ul><li>Scalability </li></ul><ul><ul><li>Add or remove nodes at any time </li></ul></ul><ul><ul><li>Linearly increase processing and storage capacity </li></ul></ul><ul><ul><li>No code changes </li></ul></ul><ul><li>Cost </li></ul><ul><ul><li>Only expansion cost is hardware </li></ul></ul><ul><ul><li>No licensing </li></ul></ul><ul><li>Community </li></ul><ul><ul><li>Constant development and improvements </li></ul></ul><ul><ul><li>Stream of patches adding capability and performance </li></ul></ul><ul><ul><li>Companies like Cloudera exist to: </li></ul></ul><ul><ul><ul><li>Abstract away patch selection </li></ul></ul></ul><ul><ul><ul><li>Trivialize deployment </li></ul></ul></ul><ul><ul><ul><li>Provide emergency support </li></ul></ul></ul>
17. Fin! Questions?