In-situ MapReduce for Log Processing

Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
Log analytics
• Data centers with 1000s of servers
• Data-intensive computing: store and analyze TBs of logs

Examples:
• Click logs
  – ad targeting, personalization
• Social media feeds
  – brand monitoring
• Purchase logs
  – fraud detection
• System logs
  – anomaly detection, debugging
Log analytics today
• “Store first, query later”: servers ship raw logs to a dedicated
  MapReduce cluster for analysis

Problems:
• Scale
  – stresses the network and disks
• Failures
  – delay analysis or force processing of incomplete data
• Timeliness
  – hinders real-time apps

(Figure: servers store logs first; a dedicated MapReduce cluster queries them later.)
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers themselves
• MapReduce for continuous data
• Ability to trade fidelity for latency

Optimized for:
• Highly selective workloads
  – e.g., up to 80% of the data is filtered or summarized
• Online analytics
  – e.g., ad re-targeting based on the most recent clicks

(Figure: MapReduce runs on the log servers in situ, instead of on a dedicated cluster.)
An iMR query
The same:
• MapReduce API
  – map(r) → {k,v} : extract/filter data
  – reduce({k,v[]}) → v’ : data aggregation
  – combine({k,v[]}) → v’ : early, partial aggregation

The new:
• Provides continuous results
  – because logs are continuous
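As a sketch (not the iMR implementation), the three functions can be illustrated with a hypothetical click-count query in Python; the record fields (`type`, `ad_id`) are assumptions made for the example:

```python
from collections import defaultdict

def map_fn(record):
    """map(r) -> {k,v}: extract/filter one log record."""
    if record.get("type") == "click":
        yield (record["ad_id"], 1)

def combine_fn(values):
    """combine({k,v[]}) -> v': early, partial aggregation near the source."""
    return sum(values)

def reduce_fn(partials):
    """reduce({k,v[]}) -> v': final aggregation of partial values."""
    return sum(partials)

records = [
    {"type": "click", "ad_id": "a1"},
    {"type": "view",  "ad_id": "a1"},
    {"type": "click", "ad_id": "a2"},
    {"type": "click", "ad_id": "a1"},
]
groups = defaultdict(list)
for r in records:
    for k, v in map_fn(r):
        groups[k].append(v)
# Partial aggregation (as a server would do), then final reduction
partials = {k: [combine_fn(vs)] for k, vs in groups.items()}
result = {k: reduce_fn(vs) for k, vs in partials.items()}
# result: {"a1": 2, "a2": 1}
```

Because combine and reduce are both sums here, partial aggregation changes nothing about the answer, only where the work happens.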
Continuous MapReduce
• Input
  – an infinite stream of log entries
• Bound input with sliding windows
  – range of data (R)
  – update frequency (S)
• Output
  – a stream of results, one for each window

(Figure: log entries along the time line 0’’, 30’’, 60’’, 90’’ flow through Map/Combine into Reduce, one result per window.)
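A minimal sketch of bounding an infinite stream with sliding windows, assuming integer-second timestamps and treating R and S as above:

```python
def sliding_windows(entries, R, S):
    """Split timestamped entries into sliding windows: each window
    covers [start, start + R) seconds, and a new window starts every
    S seconds, producing one result per window."""
    if not entries:
        return []
    last = max(t for t, _ in entries)
    out = []
    start = 0
    while start <= last:
        out.append((start, [v for t, v in entries if start <= t < start + R]))
        start += S
    return out

# (timestamp in seconds, value); R=60, S=30 match the slide's time line
entries = [(0, "a"), (25, "b"), (40, "c"), (70, "d")]
windows = sliding_windows(entries, R=60, S=30)
# windows: [(0, ["a", "b", "c"]), (30, ["c", "d"]), (60, ["d"])]
```

Note that entry "c" (at 40’’) appears in two windows: overlapping windows share data, which motivates the pane optimization on a later slide.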
Processing windows in-network
• Successive windows overlap in the data they cover
• The user’s reduce function is applied to each window
• An aggregation tree (map/combine at the nodes, reduce at the root)
  is used for efficiency

(Figure: overlapping windows over the time line 0’’–90’’ feed a Map/Combine aggregation tree rooted at Reduce.)
Efficient processing with panes
• Divide the window into panes (sub-windows P1 … P5)
  – each pane is processed and sent only once
  – the root combines panes to produce the window
• Eliminate redundant work
  – save CPU & network resources, faster analysis
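The pane idea can be sketched as follows: each pane’s partial result is computed once, and overlapping windows are produced by combining pane partials at the root (a word-count-style combine is assumed for illustration):

```python
from collections import Counter

def pane_partials(entries, pane_size):
    """Process each pane (sub-window) exactly once: partial word
    counts keyed by pane start time."""
    partials = {}
    for t, words in entries:
        pane = (t // pane_size) * pane_size
        partials.setdefault(pane, Counter()).update(words)
    return partials

def window_result(partials, start, R, pane_size):
    """The root combines the panes covering window [start, start + R);
    no per-entry work is repeated across overlapping windows."""
    total = Counter()
    for pane in range(start, start + R, pane_size):
        total += partials.get(pane, Counter())
    return total

entries = [(5, ["err"]), (35, ["err", "ok"]), (65, ["ok"])]
partials = pane_partials(entries, pane_size=30)
# Windows of R=60 sliding by S=30 reuse the same pane partials:
w0 = window_result(partials, 0, 60, 30)    # combines panes 0 and 30
w1 = window_result(partials, 30, 60, 30)   # combines panes 30 and 60
```

Pane 30 is computed once but contributes to both windows, which is exactly the redundancy the slide says panes eliminate.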
Impact of data loss on analysis
• Servers may get overloaded or fail, so panes can be lost

Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
Quantifying data fidelity
• Data are naturally distributed
  – space (server nodes)
  – time (processing window)

• C2 metric
  – annotates result windows with a “scoreboard”
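One way to picture the scoreboard (an illustrative sketch, not the prototype’s data structure): mark which (node, pane) cells of a result window actually arrived, and report the fraction present as fidelity:

```python
def fidelity(scoreboard, nodes, panes):
    """C2-style fidelity: fraction of (node, pane) cells of the
    result window that actually arrived. `scoreboard` maps each
    node to the set of pane ids it contributed."""
    total = len(nodes) * len(panes)
    present = sum(len(scoreboard.get(n, set()) & set(panes)) for n in nodes)
    return present / total

nodes = ["n1", "n2"]
panes = [0, 30]
scoreboard = {"n1": {0, 30}, "n2": {0}}   # n2 lost pane 30
f = fidelity(scoreboard, nodes, panes)    # 3 of 4 cells present -> 0.75
```

The scoreboard says more than the single number: it also records *where* (which nodes) and *when* (which panes) data is missing, which the next slides exploit.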
Trading fidelity for latency
• Use C2 to trade fidelity for latency
  – maximum latency requirement
  – minimum fidelity requirement

• Different ways to meet minimum fidelity
  – 4 useful classes of C2 specifications
Minimizing result latency
• Minimum fidelity with earliest results
• Gives the most freedom to decrease latency
  – return the earliest data available
• Appropriate for uniformly distributed events
Sampling non-uniform events
• Minimum fidelity with random sampling
• Less freedom to decrease latency
  – included data may not be the first available
• Appropriate even for non-uniform data
Correlating events across time and space

Leverage knowledge about the data distribution:
• Temporal completeness
  – include all data from a node, or no data at all
• Spatial completeness
  – each pane contains data from all nodes
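These two completeness classes can be checked against the same kind of scoreboard; the following sketch assumes a per-node set of received pane ids:

```python
def temporally_complete(scoreboard, panes):
    """Temporal completeness: each node contributes either all panes
    of the window or no data at all."""
    required = set(panes)
    return all(got == required or not got for got in scoreboard.values())

def spatially_complete(scoreboard, nodes, panes):
    """Spatial completeness: every pane contains data from all nodes."""
    return all(all(p in scoreboard.get(n, set()) for n in nodes)
               for p in panes)

panes = [0, 30]
sb = {"n1": {0, 30}, "n2": set()}  # n2 failed for the whole window
t_ok = temporally_complete(sb, panes)            # True: all-or-nothing per node
s_ok = spatially_complete(sb, ["n1", "n2"], panes)  # False: n2 missing everywhere
```

The same incomplete window can satisfy one class and violate the other, which is why the choice of C2 specification depends on how events are correlated across nodes and time.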
Prototype
• Built upon Mortar
  – sliding windows
  – in-network aggregation trees

• Extended to support:
  – MapReduce API
  – pane-based processing
  – fault-tolerance mechanisms
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently

• Load-shedding mechanism
  – nodes monitor their local processing rate
  – shed panes that cannot be processed on time
• Increases result fidelity under time and resource constraints
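A sketch of the shedding decision (the policy details here are assumptions for illustration, not the prototype’s exact mechanism): given the measured processing rate and the latency deadline, a node keeps only the panes it can still finish in time:

```python
def shed_panes(pending, rate, deadline, now):
    """Load-shedding sketch: `pending` holds panes (each with a record
    count) in arrival order, `rate` is the node's measured processing
    rate in records/sec, and `deadline` is the latest completion time.
    Keep panes that fit the remaining processing budget; shed the rest
    (lowering fidelity rather than missing the deadline)."""
    budget = (deadline - now) * rate          # records we can still handle
    kept, shed = [], []
    for pane in pending:                      # oldest pane first
        if pane["records"] <= budget:
            budget -= pane["records"]
            kept.append(pane)
        else:
            shed.append(pane)
    return kept, shed

pending = [{"id": 0, "records": 100},
           {"id": 30, "records": 300},
           {"id": 60, "records": 100}]
# 6 seconds left at 50 records/sec -> budget of 300 records:
# pane 0 fits (100), pane 30 does not (300 > 200 left), pane 60 fits.
kept, shed = shed_panes(pending, rate=50, deadline=10, now=4)
```

Shedding whole panes (rather than arbitrary records) keeps the C2 scoreboard meaningful: each shed pane is simply a missing cell in the result window.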
Evaluation
• System scalability
• Usefulness of C2 metric
  – Understanding incomplete results
  – Trading fidelity for latency
• Processing data in-situ
  – Improving fidelity under load with load
    shedding
  – Minimizing impact on services
Scaling
• Synthetic input data with a word-count reducer
• 3 reducers provide sufficient processing capacity to handle the
  30 map tasks
Exploring fidelity-latency trade-offs

Data loss affects the accuracy of the computed distribution:
• Temporal completeness
• Spatial completeness and random sampling

C2 allows trading fidelity for lower latency.

(Figure annotations: “100% accuracy” and “>25% decrease”.)
In-situ performance
• iMR running side-by-side with a real service (Hadoop)

• Vary the CPU allocated to iMR, measuring
  – result fidelity
  – Hadoop performance (job throughput)

(Figure annotations: “560%” and “<11% overhead”.)
Conclusion
• The in-situ architecture processes logs at the sources, avoids
  bulk data transfers, and reduces analysis time
• The model allows incomplete data under failures or server load
  and provides timely analysis
• The C2 metric helps users understand incomplete data and trade
  fidelity for latency
• iMR proactively sheds load, improving data fidelity under
  resource and time constraints
