ScaleIn
Collecting and Querying log data in near real-time.
Anurag Phadke
Mozilla - Anurag Phadke - Hadoop World 2010

Scale in Collecting Distributed Data Via Flume and Querying Through Hive

Anurag Phadke
Mozilla

Learn more @ http://www.cloudera.com/hadoop/

  • Slide note on Flume architecture:
    Talk about the Flume node and agent: collectorSink(<hdfs>,<prefix>).
  • Mozilla - Anurag Phadke - Hadoop World 2010

    1. ScaleIn: Collecting and Querying log data in near real-time. Anurag Phadke
    2. Current Scenario: Scaling out is a relatively solved problem.
    3. Solutions are aplenty
    4. The Problem: How do we centrally collect logs from these distributed machines: - that are compressed - in near real-time - analyzable via SQLish queries.
    5. (no text)
    6. A Solution: Flume + Hive + Hadoop
    7. A Solution: • Hadoop: Clustered solution that provides storage + ability to run MapReduce jobs • Hive: SQLish interface to query data • Flume: Collect data from log files, syslog etc. to a central location
    8. Flume – HIVE config
    9. The Solution: • Application (web-server) that writes data locally • Flume Agents – Monitor logs and transfer to HDFS • Hadoop Cluster • HIVE – HQLish interface to insert and query log data • End user – Running SQLish commands to generate reports
    10. (diagram) • Web-server writes logs to disk • Flume agent tails the file for new data • Time to rotate logs? On roll-over, data is written to HDFS • HIVE Service/MetaStore: Create new partition • Move data to corresponding location/partition in HDFS
    11. Failover scenarios – Flume Master Crashes (diagram) • Web servers write logs to disk • Flume Node/Agents • Flume Master 1 • Flume Master 2
    12. Failover scenarios – Flume Master Crashes (diagram, continued) • Web servers write logs to disk • Flume Node/Agents • Flume Master 1 • Flume Master 2
    13. Failover scenarios – Hive Server Crashes (diagram) • Web-server writes logs to disk • Flume Node/Agents • On rollover: Close file and run HQL query to load data inside "HIVE" • HIVE Server executes the query • Return true
    14. Failover scenarios – Hive Server Crashes (diagram, continued) • Web-server writes logs to disk • Flume Node/Agents • On rollover: Close file and run HQL query to load data inside "HIVE" • HIVE Server executes the query • Return true • On rollover: Close file and write HQL to "HIVE MARKER FOLDER" inside "HDFS"
    15. Issues • Small files are a problem: a roll-over can result in tiny files. - Nightly merge job • Real-time querying is not possible at the moment. - Data becomes available on roll-over, and the roll-over interval cannot be too small. • Data appears at least once. - Duplication is an issue
    16. TinderBox • Build Engine – Compiles and runs test cases.
    17. TinderBox • Developers want to search for "specific strings" • Across their own commits • Across everybody else's commits • ... and find 'n' lines of context surrounding their "search query"
    18. TinderBox • Each line gets its own line number inside HIVE • Custom query wrapper: SELECT line_number, line_message FROM tinderbox_logs WHERE line_message LIKE '%0xfeeca%' AND ds='2010-10-01' AND line_number:CONTEXT=50;
    19. TinderBox • The custom query clause: line_number:CONTEXT=50; • Runs the usual query and finds the corresponding line number(s) • Runs a new query with +/- 50 lines of corresponding context
    20. Issues • Data duplication in case of Flume-node failure is a problem. - De-dupe via a new query, which is painful.
    21. Conclusion • Currently testing Flume for Socorro (crash reporting) and Tinderbox (build system) • No performance issues yet. • Plan to release flume-hive patch for 0.9.2
    22. Anurag Phadke aphadke@mozilla.com • Daniel Einspanjer daniel@mozilla.com • Justin Fitzhugh justin@mozilla.com • Xavier Stevens xstevens@mozilla.com • We are HIRING! • Thank You
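
The two-step behaviour of the custom `line_number:CONTEXT=50` clause on slides 18 and 19 (run the usual query, then run a second query for the surrounding lines) can be sketched as below. This is only an illustration under stated assumptions, not the wrapper used at Mozilla: the function names, the regular expression, and the choice to merge overlapping windows are all hypothetical, and the code only builds query strings rather than talking to a Hive server.

```python
import re

# Hypothetical parser for the clause shown on the slide: "AND line_number:CONTEXT=50;"
CONTEXT_CLAUSE = re.compile(r"\s+AND\s+(\w+):CONTEXT=(\d+)\s*;?\s*$", re.IGNORECASE)


def split_context_clause(query):
    """Return (plain_query, column, n). column and n are None without a clause."""
    match = CONTEXT_CLAUSE.search(query)
    if match is None:
        return query.rstrip().rstrip(";"), None, None
    return query[:match.start()], match.group(1), int(match.group(2))


def context_ranges(line_numbers, n):
    """Merge the [hit - n, hit + n] windows around each hit into disjoint ranges."""
    ranges = []
    for line in sorted(set(line_numbers)):
        low, high = max(1, line - n), line + n
        if ranges and low <= ranges[-1][1] + 1:
            # Window touches or overlaps the previous one: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], high))
        else:
            ranges.append((low, high))
    return ranges


def context_query(table, partition, column, ranges):
    """Build the follow-up query that fetches the context lines (assumed schema)."""
    where = " OR ".join(
        f"({column} >= {low} AND {column} <= {high})" for low, high in ranges
    )
    return (
        f"SELECT {column}, line_message FROM {table} "
        f"WHERE ds='{partition}' AND ({where}) ORDER BY {column}"
    )
```

In use, the plain query returned by `split_context_clause` would be sent to Hive first; the line numbers it returns would then be passed to `context_ranges`, and the result to `context_query` for the second round trip. Merging overlapping windows is a design choice made here so that hits close to each other do not return the same context lines twice.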