
In-situ MapReduce for Log Processing

1. In-situ MapReduce for Log Processing
   Speaker: LIN Qian
   http://www.comp.nus.edu.sg/~linqian
2. Log analytics
   • Data centers with 1000s of servers
   • Data-intensive computing: store and analyze TBs of logs
   Examples:
   • Click logs – ad targeting, personalization
   • Social media feeds – brand monitoring
   • Purchase logs – fraud detection
   • System logs – anomaly detection, debugging
3. Log analytics today
   • "Store first, query later": logs are first copied from the servers to a dedicated MapReduce cluster, then analyzed
   Problems:
   • Scale – stresses the network and disks
   • Failures – delay analysis or force processing of incomplete data
   • Timeliness – hinders real-time apps
4. In-situ MapReduce (iMR)
   Idea:
   • Move the analysis to the servers, instead of a dedicated cluster
   • MapReduce for continuous data
   • Ability to trade fidelity for latency
   Optimized for:
   • Highly selective workloads – e.g., up to 80% of the data filtered or summarized
   • Online analytics – e.g., ad re-targeting based on the most recent clicks
5. An iMR query
   The same:
   • MapReduce API
     – map(r) → {k,v}: extract/filter data
     – reduce({k,v[]}) → v': data aggregation
     – combine({k,v[]}) → v': early, partial aggregation
   The new:
   • Provides continuous results – because logs are continuous
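To make the API concrete, here is a minimal Python sketch of an iMR-style query that counts ERROR lines per server; the log format and the generator-style map are illustrative assumptions, not the system's actual code.

    # Sketch of an iMR query (assumed log format: "host level message").
    # map filters/extracts at the sources, combine does early partial
    # aggregation in-network, reduce finishes each window at the root.

    def map_fn(record):
        # map(r) -> {k,v}: keep only ERROR lines (highly selective)
        host, level, _msg = record.split(" ", 2)
        if level == "ERROR":
            yield (host, 1)

    def combine_fn(values):
        # combine({k,v[]}) -> v': partial count, safe to apply in-network
        return sum(values)

    def reduce_fn(values):
        # reduce({k,v[]}) -> v': final count for one key in one window
        return sum(values)

    logs = ["web1 ERROR timeout", "web1 INFO ok", "web2 ERROR disk full"]
    print([kv for r in logs for kv in map_fn(r)])  # [('web1', 1), ('web2', 1)]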
6. Continuous MapReduce
   • Input – an infinite stream of log entries
   • Bound the input with sliding windows
     – Range of data (R)
     – Update frequency (S)
   • Output – a stream of results, one for each window
   (Figure: log entries on a 0''–90'' timeline flowing through Map, Combine, Reduce.)
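As a rough illustration of R and S (values taken from the slide's 30-second timeline), this sketch computes which sliding windows a record's timestamp falls into; the function name and formula are assumptions for illustration.

    # Window bookkeeping sketch: R = range, S = slide (update frequency).
    # A record at time t belongs to every window [w, w + R) that covers t.
    R, S = 60, 30  # seconds, matching the 0'', 30'', 60'', 90'' timeline

    def windows_for(t, r=R, s=S):
        # earliest window start whose range [w, w + r) can still include t
        first = (t // s) * s - r + s
        return [w for w in range(max(0, first), t + 1, s) if w <= t < w + r]

    print(windows_for(45))  # [0, 30]: the 0-60s and the 30-90s windows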
7. Processing windows in-network
   • Successive windows overlap in the data they cover
   • Map and combine run in-network, organized as an aggregation tree for efficiency
   • The user's reduce function runs at the tree's root
8. Efficient processing with panes
   • Divide each window into panes (sub-windows P1, P2, ..., P5)
     – Each pane is processed and sent only once
     – The root combines panes to produce the window
   • Eliminates redundant work
     – Saves CPU and network resources, faster analysis
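A sketch of how a root might assemble windows from panes (data structures assumed, not the prototype's code); with R = 60 and S = 30 the pane size is gcd(R, S) = 30, so every pane is combined once and shared by the two windows that overlap it.

    # Pane-based assembly at the root (illustrative model).
    from collections import defaultdict

    PANE = 30                                      # gcd(R, S) seconds
    panes = defaultdict(lambda: defaultdict(int))  # pane_start -> key -> partial

    def add_partial(pane_start, key, value):
        # merge a partial aggregate that a node sends exactly once per pane
        panes[pane_start][key] += value

    def window_result(w_start, r=60):
        # combine the panes covering [w_start, w_start + r) into one result
        out = defaultdict(int)
        for p in range(w_start, w_start + r, PANE):
            for key, value in panes[p].items():
                out[key] += value
        return dict(out)

    add_partial(0, "web1", 3); add_partial(30, "web1", 2)
    print(window_result(0))  # {'web1': 5}: panes reused, not recomputed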
9. Impact of data loss on analysis
   • Servers may get overloaded or fail, losing panes
   Challenges:
   • Characterize incomplete results
   • Allow users to trade fidelity for latency
10. Quantifying data fidelity
    • Data are naturally distributed in
      – Space (server nodes)
      – Time (the processing window)
    • C2 metric
      – Annotates result windows with a "scoreboard"
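The slide does not give the scoreboard's exact structure, but a plausible minimal model is a grid over (node, pane) cells, with fidelity as the fraction of cells that actually arrived:

    # C2 scoreboard sketch (structure assumed): which (node, pane) cells
    # of a result window contributed data, across space and time.

    def c2_fidelity(scoreboard, nodes, pane_starts):
        total = len(nodes) * len(pane_starts)
        present = sum((n, p) in scoreboard for n in nodes for p in pane_starts)
        return present / total if total else 0.0

    # 3 nodes x 2 panes with one lost cell -> fidelity 5/6
    sb = {("s1", 0), ("s1", 30), ("s2", 0), ("s2", 30), ("s3", 0)}
    print(c2_fidelity(sb, ["s1", "s2", "s3"], [0, 30]))  # 0.8333...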
11. Trading fidelity for latency
    • Use C2 to trade fidelity for latency
      – Maximum latency requirement
      – Minimum fidelity requirement
    • Different ways to meet the minimum fidelity
      – 4 useful classes of C2 specifications (slides 12–14)
12. Minimizing result latency
    • Minimum fidelity with the earliest results
    • Gives the most freedom to decrease latency
      – Return the earliest data available
    • Appropriate for uniformly distributed events
13. Sampling non-uniform events
    • Minimum fidelity with random sampling
    • Less freedom to decrease latency
      – The included data may not be the first available
    • Appropriate even for non-uniform data
14. Correlating events across time and space
    Leverage knowledge about the data distribution:
    • Temporal completeness
      – Include all data from a node, or no data at all
    • Spatial completeness
      – Each pane contains data from all nodes
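Building on the scoreboard sketch from slide 10, the two completeness-style C2 classes could be checked as below; the function names are illustrative, not the system's API.

    # Completeness checks over the (node, pane) scoreboard sketch.

    def temporally_complete(scoreboard, node, pane_starts):
        # all data from a node, or treat the node as absent (all-or-nothing)
        return all((node, p) in scoreboard for p in pane_starts)

    def spatially_complete(scoreboard, nodes, pane_start):
        # a pane counts only if every node contributed to it
        return all((n, pane_start) in scoreboard for n in nodes)

    sb = {("s1", 0), ("s1", 30), ("s2", 0)}
    print(temporally_complete(sb, "s1", [0, 30]))   # True
    print(spatially_complete(sb, ["s1", "s2"], 30)) # False: s2 missed pane 30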
15. Prototype
    • Built upon Mortar
      – Sliding windows
      – In-network aggregation trees
    • Extended to support:
      – The MapReduce API
      – Pane-based processing
      – Fault-tolerance mechanisms
16. Processing data in-situ
    • Useful when ...
    • Goal: use the available resources intelligently
    • Load-shedding mechanism
      – Nodes monitor their local processing rate
      – Shed panes that cannot be processed on time
    • Increases result fidelity under time and resource constraints
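A rough sketch of the shedding decision; the rate estimate and its inputs are assumptions, since the slide only says nodes monitor their local processing rate and drop panes they cannot finish on time.

    # Pane-shedding sketch: drop a pane when the node estimates it cannot
    # finish before the pane's deadline, so the panes it does send still
    # arrive in time and raise overall result fidelity.

    def should_shed(pending_records, records_per_sec, seconds_left):
        if records_per_sec <= 0:
            return True  # no progress possible before the deadline
        return pending_records / records_per_sec > seconds_left

    # 5000 records at 800 rec/s needs ~6.25s; with only 5s left, shed it
    print(should_shed(5000, 800, 5))  # True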
17. Evaluation
    • System scalability
    • Usefulness of the C2 metric
      – Understanding incomplete results
      – Trading fidelity for latency
    • Processing data in-situ
      – Improving fidelity under load with load shedding
      – Minimizing the impact on co-located services
18. Scaling
    • Synthetic input data with a word-count reducer
    • 3 reducers provide sufficient processing capacity to handle the 30 map tasks
19. Exploring fidelity–latency trade-offs
    • Data loss affects the accuracy of the computed distribution
      – Temporal completeness: 100% accuracy
      – Spatial completeness and random sampling: >25% decrease
    • C2 allows users to trade fidelity for lower latency
20. In-situ performance
    • iMR runs side-by-side with a real service (Hadoop)
    • Vary the CPU allocated to iMR and measure
      – Result fidelity
      – Hadoop performance (job throughput): <11% overhead
21. Conclusion
    • The in-situ architecture processes logs at the sources, avoids bulk data transfers, and reduces analysis time
    • The processing model admits incomplete data under failures or server load, providing timely analysis
    • The C2 metric helps users understand incomplete data and trade fidelity for latency
    • Pro-active load shedding improves data fidelity under resource and time constraints
