Hw09: Cross Data Center Logs Processing
  • Speaker note: All storage in Hadoop; no filers or SANs
Transcript

    • 1. Cross Datacenter Logs Processing
        Rackspace Hosting, Hadoop World 2009
        Stu Hood, Search Team Technical Lead
        October 2, 2009
    • 2. Overview
        - Use case
            - Background
            - Log types
            - Querying
        - Previous solutions
        - The Hadoop solution
        - Implementation
            - Collection
            - Index time
            - Query time
        - Advantages of Hadoop
            - Storage
            - Analysis
            - Scalability
            - Community
    • 3. Use Case: Background
        - "Rackapps": Email and Apps Division
            - Founded 1999, merged with Rackspace 2007
            - Hybrid mail hosting
                - 40% of accounts have a mix of Exchange and Rackspace Email
                - Fantastic Control Panel to juggle accounts
                - Webmail client with calendar/contact/note sharing
            - More apps to come
            - Environment
                - 1K+ servers at 3 of 6 Rackspace datacenters
                - Breakdown: 80% Linux, 20% Windows
                    - "Rackspace Email": custom email and application platform
                    - Microsoft Exchange
    • 4. Use Case: Log Types
        - MTA (mail delivery) logs: Postfix, Exchange, Momentum
        - Spam and virus logs: Amavis
        - Access logs: Dovecot, Exchange, httpd
    • 5. Use Case: Querying
        - Support Team: needs to answer basic questions
            - Mail transfer: was it delivered?
            - Spam: why was this (not) marked as spam?
            - Access: who (checked | failed to check) mail?
        - Engineering: more advanced questions
            - Which delivery routes have the highest latency?
            - Which are the spammiest IPs?
            - Where in the world do customers log in from?
        - Elsewhere
            - Cloud teams use Hadoop for even more mission-critical statistics
    • 6. Previous Solutions
        - V1: Query at the source
            - Founding to 2006
            - No processing: flat log files on each source machine
            - To query, Support escalates a ticket to Engineering
            - Queries take hours
            - 14 days available, single datacenter
        - V2: Bulk load to MySQL
            - 2006 to 2007
            - Process logs, bulk load into a denormalized schema
            - Add merge tables for common query time ranges
            - SQL self-joins to find log entries for a path
            - Queries take minutes
            - 1 day available, single datacenter
    • 7. The Hadoop Solution
        - V3: Lucene indexes in Hadoop
            - 2007 to present
            - Raw logs collected and processed in Hadoop
            - Lucene indexes as the intermediate format
            - "Realtime" queries via Solr
                - Indexes merged to Solr nodes with 15-minute turnaround
                - 7 days stored uncompressed
                - Queries take seconds
            - Long-term querying via MapReduce and high-level languages
                - Hadoop InputFormat for Lucene indexes
                - 6 months available for MR queries
                - Queries take minutes
            - Multiple datacenters
    • 8. The Hadoop Solution: Alternatives
        - Splunk
            - Great for realtime querying, but weak for long-term analysis
            - Archived data is not easily queryable
        - Data warehouse package
            - Weak for realtime querying, great for long-term analysis
        - Partitioned MySQL
            - Mediocre solution to either goal
            - Needed something similar to MapReduce for sharded MySQL
    • 9. Implementation: Collection
        - Software
            - Transport
                - syslog-ng
                - SSH tunnel between datacenters
                - Considering Scribe/rsyslog/?
            - Storage
                - App to deposit into Hadoop using the Java API
        - Hardware
            - Per datacenter
                - 2-4 collector machines
                - Hundreds of source machines
            - Single datacenter
                - 30-node Hadoop cluster
                - 20 Solr nodes
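The deposit step is described only as an app using the Java API; as a hedged sketch of the essential behavior (batching collected lines into a date-partitioned layout so later jobs can read a narrow time slice), the following Python uses a local directory standing in for HDFS, and every name and path here is hypothetical:

```python
import os
import tempfile
from datetime import datetime, timezone

def deposit_batch(lines, log_type, root):
    """Write one batch of collected log lines into a date-partitioned
    directory layout, mimicking a collector depositing files into HDFS."""
    now = datetime.now(timezone.utc)
    # Partition by log type and hour so downstream MapReduce jobs can
    # select a narrow time slice without scanning everything.
    part = os.path.join(root, log_type, now.strftime("%Y/%m/%d/%H"))
    os.makedirs(part, exist_ok=True)
    path = os.path.join(part, "batch-%s.log" % now.strftime("%M%S%f"))
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path

path = deposit_batch(["postfix: delivered", "postfix: deferred"],
                     "mta", root=tempfile.mkdtemp())
```

The hour-level partitioning is a design choice, not documented in the deck: it keeps individual directories small and makes time-range queries cheap.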
    • 10. Implementation: Indexing/Querying
        - Indexing
            - Unique processing code for schema'd and unschema'd logs
            - SolrOutputFormat generates compressed Lucene indexes
        - Querying
            - "Realtime"
                - Sharded Lucene/Solr instances merge index chunks from Hadoop
                - Using the Solr API
                    - Plugin to optimize sharding: queries are distributed to relevant nodes
                    - Solr merges results
            - Raw logs
                - Using Hadoop Streaming and unix grep
            - MapReduce
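The "Hadoop Streaming and unix grep" path for raw logs amounts to a map-only filter job; a minimal Python mapper with the same effect is sketched below (the sample log lines and pattern are illustrative, not the real log format):

```python
import re

def grep_mapper(stream, pattern):
    """Map-only filter: emit every input line matching the pattern,
    the same behavior as running unix grep as a streaming mapper
    (Hadoop Streaming would feed log lines on stdin)."""
    rx = re.compile(pattern)
    return [line for line in stream if rx.search(line)]

matches = grep_mapper(
    ["Oct  2 mta1 postfix/smtp: status=sent",
     "Oct  2 mta1 postfix/smtp: status=bounced"],
    r"status=bounced")
```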
    • 11. Implementation: Example
    • 12. Implementation: Timeframe
        - Development
            - Developed by a team of 1.5 in 3 months
            - Indexing, statistics
        - Deployment
            - Developers acted as the operations team
            - Cloudera deployment resolved this problem
        - Roadblocks
            - Bumped into job-size limitations
                - Resolved now
    • 13. Advantages: Storage
        - Raw logs
            - 3 days
            - For debugging purposes and use by Engineering
            - In HDFS
        - Indexes
            - 7 days
            - Queryable via the Solr API
            - On local disk
        - Archived indexes
            - 6+ months
            - Queryable via Hadoop, or use the API to ask for old data to be made accessible in Solr
            - In HDFS
    • 14. Advantages: Analysis
        - Java MapReduce API
            - For optimal performance of frequently run jobs
        - Apache Pig
            - Ideal for one-off queries
            - Interactive development
            - No need to understand MapReduce (SQL replacement)
            - Extensible via UDFs
        - Hadoop Streaming
            - For users comfortable with MapReduce who are in a hurry
            - Use any language (frequently Python)
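The Streaming option above is typically a pair of small scripts; here is a minimal Python sketch of the map and reduce halves of a per-sender delivery count (the `sender=` log field format is an assumption for illustration, not the real MTA log layout):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (sender, 1) for each log line that carries a sender field."""
    out = []
    for line in lines:
        for token in line.split():
            if token.startswith("sender="):
                out.append((token[len("sender="):], 1))
    return out

def reducer(pairs):
    """Sum counts per key, as the reduce phase would after the
    shuffle has sorted the pairs by key."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(n for _, n in grp)
            for key, grp in groupby(pairs, key=itemgetter(0))}

counts = reducer(mapper([
    "oct02 mta1 sender=a@example.com status=sent",
    "oct02 mta2 sender=a@example.com status=sent",
    "oct02 mta1 sender=b@example.com status=sent",
]))
```

In a real Streaming job the two functions would live in separate scripts reading stdin and writing tab-separated key/value lines to stdout.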
    • 15. Pig Example

        records = LOAD 'amavis' USING us.webmail.pig.io.SolrSlicer('sender,timestamp,rip,recips', '1251777901', '1252447501');
        flat = FOREACH records GENERATE FLATTEN(sender), FLATTEN(timestamp), FLATTEN(rip), FLATTEN(recips);
        filtered = FILTER flat BY sender IS NOT NULL AND sender MATCHES '.*whitehousegov$';
        cleantimes = FOREACH filtered GENERATE sender, (us.webmail.pig.udf.FromSolrLong(timestamp) / 3600 * 3600) as timestamp, rip, recips;
        grouped = GROUP cleantimes BY (sender, rip, timestamp);
        counts = FOREACH grouped GENERATE group, COUNT(*);
        hostcounts = FOREACH counts GENERATE group.sender, us.webmail.pig.udf.ReverseDNS(group.rip) as host, group.timestamp, $1;
        dump hostcounts;
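For readers unfamiliar with Pig, the script above buckets each timestamp to the hour (integer division by 3600, then multiplication back) and counts messages per (sender, host, hour) group. A rough Python equivalent of just that aggregation, with made-up field values standing in for the loaded records:

```python
from collections import Counter

def hour_bucket(ts):
    # Same rounding as the Pig expression: timestamp / 3600 * 3600
    return ts // 3600 * 3600

def count_by_sender_host_hour(records):
    """records: iterable of (sender, unix_timestamp, host) tuples.
    Returns a count per (sender, host, hour-bucket) key, mirroring
    the GROUP ... / COUNT(*) pair in the Pig script."""
    return Counter((sender, host, hour_bucket(ts))
                   for sender, ts, host in records)

counts = count_by_sender_host_hour([
    ("a@example.com", 1251777901, "mx1.example.net"),
    ("a@example.com", 1251777950, "mx1.example.net"),
    ("a@example.com", 1251781600, "mx1.example.net"),
])
```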
    • 16. Advantages: Scalability, Cost, Community
        - Scalability
            - Add or remove nodes at any time
            - Linearly increase processing and storage capacity
            - No code changes
        - Cost
            - Only expansion cost is hardware
            - No licensing
        - Community
            - Constant development and improvements
            - Stream of patches adding capability and performance
            - Companies like Cloudera exist to:
                - Abstract away patch selection
                - Trivialize deployment
                - Provide emergency support
    • 17. Fin! Questions?
