
HW09: Cross Data Center Logs Processing


  1. Cross Datacenter Logs Processing
     Rackspace Hosting - Hadoop World 2009
     Stu Hood, Search Team Technical Lead
     October 2, 2009
  2. Overview
     - Use case
       - Background
       - Log Types
       - Querying
     - Previous Solutions
     - The Hadoop Solution
     - Implementation
       - Collection
       - Index time
       - Query time
     - Advantages of Hadoop
       - Storage
       - Analysis
       - Scalability
       - Community
  3. Use Case: Background
     - "Rackapps" - Email and Apps Division
       - Founded 1999, merged with Rackspace 2007
       - Hybrid Mail Hosting
         - 40% of accounts have a mix of Exchange and Rackspace Email
         - Fantastic Control Panel to juggle accounts
         - Webmail client with calendar/contact/note sharing
       - More Apps to come
     - Environment
       - 1K+ servers at 3 of 6 Rackspace datacenters
       - Breakdown: 80% Linux, 20% Windows
         - "Rackspace Email" - custom email and application platform
         - Microsoft Exchange
  4. Use Case: Log Types
     - MTA (mail delivery) logs
       - Postfix
       - Exchange
       - Momentum
     - Spam and virus logs
       - Amavis
     - Access logs
       - Dovecot
       - Exchange
       - httpd logs
  5. Use Case: Querying
     - Support Team needs to answer basic questions:
       - Mail Transfer: Was it delivered?
       - Spam: Why was this (not) marked as spam?
       - Access: Who (checked | failed to check) mail?
     - Engineering asks more advanced questions:
       - Which delivery routes have the highest latency?
       - Which are the spammiest IPs?
       - Where in the world do customers log in from?
     - Elsewhere
       - Cloud teams use Hadoop for even more mission-critical statistics
  6. Previous Solutions
     - V1: Query at the Source (founding - 2006)
       - No processing: flat log files on each source machine
       - To query, Support escalates a ticket to Engineering
       - Queries take hours
       - 14 days available, single datacenter
     - V2: Bulk load to MySQL (2006 - 2007)
       - Process logs, bulk load into a denormalized schema
       - Add merge tables for common query time ranges
       - SQL self-joins to find the log entries for a delivery path
       - Queries take minutes
       - 1 day available, single datacenter
  7. The Hadoop Solution
     - V3: Lucene Indexes in Hadoop (2007 - present)
       - Raw logs collected and processed in Hadoop
       - Lucene indexes as the intermediate format
       - "Realtime" queries via Solr
         - Indexes merged to Solr nodes with a 15-minute turnaround
         - 7 days stored uncompressed
         - Queries take seconds
       - Long-term querying via MapReduce and high-level languages
         - Hadoop InputFormat for Lucene indexes
         - 6 months available for MR queries
         - Queries take minutes
       - Multiple datacenters
  8. The Hadoop Solution: Alternatives
     - Splunk
       - Great for realtime querying, but weak for long-term analysis
       - Archived data is not easily queryable
     - Data warehouse package
       - Weak for realtime querying, great for long-term analysis
     - Partitioned MySQL
       - Mediocre at either goal
       - Would have needed something similar to MapReduce for sharded MySQL
  9. Implementation: Collection
     - Software
       - Transport
         - syslog-ng
         - SSH tunnel between datacenters
         - Considering Scribe/rsyslog/?
       - Storage
         - App deposits logs into Hadoop using the Java API
     - Hardware
       - Per datacenter
         - 2-4 collector machines
         - Hundreds of source machines
       - Single datacenter
         - 30-node Hadoop cluster
         - 20 Solr nodes
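The deposit app itself used the Hadoop Java API; the sketch below only illustrates the collection idea in Python: group incoming syslog records into hourly, per-log-type batches so each batch lands in one HDFS-style path. All names here (bucket_path, group_lines, the /logs/raw layout) are invented for illustration, not taken from the talk.

```python
# Illustrative sketch of a collector's batching step. The path layout and
# function names are assumptions; the real app wrote via the Hadoop Java API.
import time


def bucket_path(log_type: str, epoch_seconds: int) -> str:
    """Map a log record's type and timestamp to an hourly HDFS-style path."""
    hour = time.strftime("%Y/%m/%d/%H", time.gmtime(epoch_seconds))
    return f"/logs/raw/{log_type}/{hour}/part"


def group_lines(records):
    """Group (log_type, epoch_seconds, text) records by their target path,
    so each hourly batch can be deposited as a single file."""
    batches = {}
    for log_type, epoch, text in records:
        batches.setdefault(bucket_path(log_type, epoch), []).append(text)
    return batches
```

Batching by hour keeps the number of files in HDFS small, which matters because HDFS favors a modest number of large files over many small ones.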
  10. Implementation: Indexing/Querying
      - Indexing
        - Separate processing code for schema'd and unschema'd logs
        - SolrOutputFormat generates compressed Lucene indexes
      - Querying
        - "Realtime"
          - Sharded Lucene/Solr instances merge index chunks from Hadoop
          - Using the Solr API
            - Plugin to optimize sharding: queries are distributed only to relevant nodes
            - Solr merges results
        - Raw logs
          - Using Hadoop Streaming and Unix grep
        - MapReduce
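One plausible way a shard-selection plugin like the one above can avoid fanning out to every node: if each Solr shard holds index chunks for a known time window, a query with a time range only needs the shards whose windows overlap it. This is a hedged sketch of that idea; the shard layout and names are assumptions, not the actual plugin.

```python
# Hypothetical shard routing: pick only the Solr hosts whose index time
# window overlaps the query's time range. Data layout is an assumption.
from typing import List, Tuple

# (host, window_start, window_end), all times in epoch seconds
Shard = Tuple[str, int, int]


def relevant_shards(shards: List[Shard], q_start: int, q_end: int) -> List[str]:
    """Return the hosts whose half-open window [start, end) overlaps
    the half-open query range [q_start, q_end)."""
    return [host for host, start, end in shards if start < q_end and end > q_start]
```

Routing by time range means a typical "last 24 hours" support query touches a handful of nodes instead of all 20, and Solr's distributed search then merges the per-shard results.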
  11. Implementation: Example
  12. Implementation: Timeframe
      - Development
        - Developed by a team of 1.5 engineers in 3 months
        - Indexing, statistics
      - Deployment
        - Developers acted as the operations team
        - A Cloudera deployment resolved this problem
      - Roadblocks
        - Bumped into job-size limitations (since resolved)
  13. Advantages: Storage
      - Raw logs
        - 3 days
        - For debugging purposes, use by Engineering
        - In HDFS
      - Indexes
        - 7 days
        - Queryable via the Solr API
        - On local disk
      - Archived indexes
        - 6+ months
        - Queryable via Hadoop, or use the API to ask for old data to be made accessible in Solr
        - In HDFS
  14. Advantages: Analysis
      - Java MapReduce API
        - For optimal performance of frequently run jobs
      - Apache Pig
        - Ideal for one-off queries
        - Interactive development
        - No need to understand MapReduce (a SQL replacement)
        - Extensible via UDFs
      - Hadoop Streaming
        - For users comfortable with MapReduce who are in a hurry
        - Use any language (frequently Python)
  15. Pig Example

      records = LOAD 'amavis' USING us.webmail.pig.io.SolrSlicer('sender,timestamp,rip,recips', '1251777901', '1252447501');
      flat = FOREACH records GENERATE FLATTEN(sender), FLATTEN(timestamp), FLATTEN(rip), FLATTEN(recips);
      filtered = FILTER flat BY sender IS NOT NULL AND sender MATCHES '.*whitehousegov$';
      cleantimes = FOREACH filtered GENERATE sender, (us.webmail.pig.udf.FromSolrLong(timestamp) / 3600 * 3600) AS timestamp, rip, recips;
      grouped = GROUP cleantimes BY (sender, rip, timestamp);
      counts = FOREACH grouped GENERATE group, COUNT(cleantimes);
      hostcounts = FOREACH counts GENERATE group.sender, us.webmail.pig.udf.ReverseDNS(group.rip) AS host, group.timestamp, $1;
      DUMP hostcounts;
  16. Advantages: Scalability, Cost, Community
      - Scalability
        - Add or remove nodes at any time
        - Linearly increase processing and storage capacity
        - No code changes
      - Cost
        - The only expansion cost is hardware
        - No licensing
      - Community
        - Constant development and improvement
        - A steady stream of patches adding capability and performance
        - Companies like Cloudera exist to:
          - Abstract away patch selection
          - Trivialize deployment
          - Provide emergency support
  17. Fin! Questions?
