
Mozilla - Anurag Phadke - Hadoop World 2010

Scale in Collecting Distributed Data Via Flume and Querying Through Hive

Anurag Phadke
Mozilla

Learn more @ http://www.cloudera.com/hadoop/

Mozilla - Anurag Phadke - Hadoop World 2010

  1. Scale: in collecting and querying log data in near real-time. Anurag Phadke
  2. Current Scenario: scaling out is a relatively solved problem.
  3. Solutions are aplenty.
  4. The Problem: how do we centrally collect logs from these distributed machines, such that the logs are compressed, arrive in near real-time, and are analyzable via SQL-ish queries?
  5. (image slide, no text)
  6. A Solution: Flume + Hive + Hadoop
  7. A Solution: Hadoop is a clustered solution that provides storage plus the ability to run MapReduce jobs; Hive is a SQL-ish interface for querying the data; Flume collects data from log files, syslog, etc. into a central location.
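     As a rough sketch of the Hive side of this stack (the table and column names are assumptions, chosen to mirror the tinderbox_logs query on slide 18), a partitioned log table might be declared as:

        -- Hypothetical partitioned Hive table for rolled-over log files.
        -- Partitioning by day (ds) matches the ds='2010-10-01' filter
        -- used in the query on slide 18.
        CREATE TABLE web_logs (
          line_number INT,
          line_message STRING
        )
        PARTITIONED BY (ds STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;

     Each day's roll-overs then land in their own ds partition, which is what keeps the date-filtered queries and nightly maintenance jobs on later slides cheap.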
  8. Flume – Hive config
  9. The Solution: an application (web server) writes data locally; Flume agents monitor the logs and transfer them to HDFS on the Hadoop cluster; Hive provides an HQL-ish interface to insert and query the log data; the end user runs SQL-ish commands to generate reports.
  10. The web server writes logs to disk and a Flume agent tails the file for new data. When it is time to rotate logs, the data is written to HDFS on roll-over, a new partition is created via the Hive service/metastore, and the data is moved to the corresponding location/partition in HDFS.
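     The HQL issued on roll-over could look roughly like the following (a sketch; the staging path, file name, and table name are assumptions):

        -- On roll-over: register the freshly closed log file with Hive.
        -- LOAD DATA ... PARTITION moves the file into the partition's
        -- directory in HDFS, i.e. the "move data to corresponding
        -- location/partition" step on this slide.
        LOAD DATA INPATH '/flume/incoming/web_logs.2010-10-01.00000.log'
        INTO TABLE web_logs
        PARTITION (ds = '2010-10-01');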
  11. Failover scenarios – Flume Master crashes: the setup has multiple web servers writing logs to disk, Flume nodes/agents collecting them, and two Flume masters (Flume Master 1 and Flume Master 2) managing the nodes (diagram).
  12. Failover scenarios – Flume Master crashes (continued): the same picture after the crash, with the web servers still writing logs to disk and the Flume nodes/agents continuing against the surviving master (diagram).
  13. Failover scenarios – Hive server crashes: normally the web server writes logs to disk and, on roll-over, the Flume node/agent closes the file and runs an HQL query to load the data into Hive; the Hive server executes the query and returns true.
  14. Failover scenarios – Hive server crashes (continued): if the Hive server is unavailable, the Flume node/agent still closes the file on roll-over, but writes the HQL to a "HIVE MARKER FOLDER" inside HDFS instead of executing it.
  15. Issues: small files are a problem, since a roll-over can result in tiny files (handled by a nightly merge job; a sketch follows below). Real-time querying is not possible at the moment, because data only becomes available on roll-over and the roll-over interval cannot be too small. Data appears at least once, so duplication is an issue.
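     The nightly merge job is not shown in the deck; one hedged way to compact a day's partition in plain HQL (table and column names as assumed above) is to rewrite the partition onto itself:

        -- Hypothetical nightly merge: rewrite yesterday's partition so the
        -- many small roll-over files become a few larger ones. Hive writes
        -- the query result to a temporary location first, then swaps it in.
        SET hive.merge.mapfiles=true;
        INSERT OVERWRITE TABLE web_logs PARTITION (ds = '2010-10-01')
        SELECT line_number, line_message
        FROM web_logs
        WHERE ds = '2010-10-01';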
  16. Tinderbox: the build engine, which compiles and runs test cases.
  17. Tinderbox: developers want to search for specific strings, across their own commits and across everybody else's commits, and find 'n' lines of context surrounding their search query.
  18. Tinderbox: each line gets its own line number inside Hive. Custom query wrapper: SELECT line_number, line_message FROM tinderbox_logs WHERE line_message LIKE '%0xfeeca%' AND ds='2010-10-01' AND line_number:CONTEXT=50;
  19. Tinderbox: the custom query clause line_number:CONTEXT=50 makes the wrapper run the usual query to find the corresponding line number(s), then run a new query with +/- 50 lines of surrounding context (expanded in the sketch below).
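     A sketch of how the wrapper might expand line_number:CONTEXT=50 into plain HQL, assuming the first pass matched line 4200 (the line number is made up for illustration):

        -- Pass 1 (run by the wrapper): find the matching line number(s).
        SELECT line_number
        FROM tinderbox_logs
        WHERE line_message LIKE '%0xfeeca%'
          AND ds = '2010-10-01';

        -- Pass 2 (assuming pass 1 returned line 4200): fetch +/- 50 lines
        -- of surrounding context from the same day's partition.
        SELECT line_number, line_message
        FROM tinderbox_logs
        WHERE ds = '2010-10-01'
          AND line_number >= 4200 - 50
          AND line_number <= 4200 + 50;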
  20. Issues: data duplication in the case of a Flume-node failure is a problem; de-duping via a new query is painful (a sketch follows below).
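     The painful de-dupe query alluded to here could be as blunt as rewriting the partition with DISTINCT (a sketch; duplicates are assumed to be exact row copies):

        -- Hypothetical de-dupe pass: keep one copy of each row that a
        -- retried Flume delivery wrote more than once.
        INSERT OVERWRITE TABLE tinderbox_logs PARTITION (ds = '2010-10-01')
        SELECT DISTINCT line_number, line_message
        FROM tinderbox_logs
        WHERE ds = '2010-10-01';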
  21. Conclusion: currently testing Flume for Socorro (crash reporting) and Tinderbox (build system); no performance issues yet; plan to release a flume-hive patch for 0.9.2.
  22. Thank You. We are HIRING! Anurag Phadke aphadke@mozilla.com, Daniel Einspanjer daniel@mozilla.com, Justin Fitzhugh justin@mozilla.com, Xavier Stevens xstevens@mozilla.com
