In-situ MapReduce for Log Processing

Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
Log analytics
• Data centers with 1000s of servers
• Data-intensive computing: store and analyze TBs of logs

Examples:
• Click logs
  – ad targeting, personalization
• Social media feeds
  – brand monitoring
• Purchase logs
  – fraud detection
• System logs
  – anomaly detection, debugging
Log analytics today
• “Store first, query later”: servers ship raw logs to a dedicated
  MapReduce cluster for analysis

Problems:
• Scale
  – stresses the network and disks
• Failures
  – delay analysis or force processing of incomplete data
• Timeliness
  – hinders real-time apps

(Figure: servers store logs first; a dedicated MapReduce cluster queries them later.)
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers themselves
• MapReduce for continuous data
• Ability to trade fidelity for latency

Optimized for:
• Highly selective workloads
  – e.g., up to 80% of the data is filtered or summarized
• Online analytics
  – e.g., ad re-targeting based on the most recent clicks

(Figure: MapReduce runs on the log servers in situ, instead of on a dedicated cluster.)
An iMR query
The same:
• MapReduce API
  – map(r) → {k,v} : extract/filter data
  – reduce({k,v[]}) → v’ : data aggregation
  – combine({k,v[]}) → v’ : early, partial aggregation

The new:
• Provides continuous results
  – because logs are continuous
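As a sketch (not the iMR implementation), the three functions can be illustrated with a hypothetical click-count query in Python; the record fields (`type`, `ad_id`) are assumptions made for the example:

```python
from collections import defaultdict

def map_fn(record):
    """map(r) -> {k,v}: extract/filter one log record."""
    if record.get("type") == "click":
        yield (record["ad_id"], 1)

def combine_fn(values):
    """combine({k,v[]}) -> v': early, partial aggregation near the source."""
    return sum(values)

def reduce_fn(partials):
    """reduce({k,v[]}) -> v': final aggregation of partial values."""
    return sum(partials)

records = [
    {"type": "click", "ad_id": "a1"},
    {"type": "view",  "ad_id": "a1"},
    {"type": "click", "ad_id": "a2"},
    {"type": "click", "ad_id": "a1"},
]
groups = defaultdict(list)
for r in records:
    for k, v in map_fn(r):
        groups[k].append(v)
# Partial aggregation (as a server would do), then final reduction
partials = {k: [combine_fn(vs)] for k, vs in groups.items()}
result = {k: reduce_fn(vs) for k, vs in partials.items()}
# result: {"a1": 2, "a2": 1}
```

Because combine and reduce are both sums here, partial aggregation changes nothing about the answer, only where the work happens.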
Continuous MapReduce
• Input
  – an infinite stream of log entries
• Bound input with sliding windows
  – range of data (R)
  – update frequency (S)
• Output
  – a stream of results, one for each window

(Figure: log entries along the time line 0’’, 30’’, 60’’, 90’’ flow through Map/Combine into Reduce, one result per window.)
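A minimal sketch of bounding an infinite stream with sliding windows, assuming integer-second timestamps and treating R and S as above:

```python
def sliding_windows(entries, R, S):
    """Split timestamped entries into sliding windows: each window
    covers [start, start + R) seconds, and a new window starts every
    S seconds, producing one result per window."""
    if not entries:
        return []
    last = max(t for t, _ in entries)
    out = []
    start = 0
    while start <= last:
        out.append((start, [v for t, v in entries if start <= t < start + R]))
        start += S
    return out

# (timestamp in seconds, value); R=60, S=30 match the slide's time line
entries = [(0, "a"), (25, "b"), (40, "c"), (70, "d")]
windows = sliding_windows(entries, R=60, S=30)
# windows: [(0, ["a", "b", "c"]), (30, ["c", "d"]), (60, ["d"])]
```

Note that entry "c" (at 40’’) appears in two windows: overlapping windows share data, which motivates the pane optimization on a later slide.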
Processing windows in-network
• Successive windows overlap in the data they cover
• The user’s reduce function is applied to each window
• An aggregation tree (map/combine at the nodes, reduce at the root)
  is used for efficiency

(Figure: overlapping windows over the time line 0’’–90’’ feed a Map/Combine aggregation tree rooted at Reduce.)
Efficient processing with panes
• Divide the window into panes (sub-windows P1 … P5)
  – each pane is processed and sent only once
  – the root combines panes to produce the window
• Eliminate redundant work
  – save CPU & network resources, faster analysis
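The pane idea can be sketched as follows: each pane’s partial result is computed once, and overlapping windows are produced by combining pane partials at the root (a word-count-style combine is assumed for illustration):

```python
from collections import Counter

def pane_partials(entries, pane_size):
    """Process each pane (sub-window) exactly once: partial word
    counts keyed by pane start time."""
    partials = {}
    for t, words in entries:
        pane = (t // pane_size) * pane_size
        partials.setdefault(pane, Counter()).update(words)
    return partials

def window_result(partials, start, R, pane_size):
    """The root combines the panes covering window [start, start + R);
    no per-entry work is repeated across overlapping windows."""
    total = Counter()
    for pane in range(start, start + R, pane_size):
        total += partials.get(pane, Counter())
    return total

entries = [(5, ["err"]), (35, ["err", "ok"]), (65, ["ok"])]
partials = pane_partials(entries, pane_size=30)
# Windows of R=60 sliding by S=30 reuse the same pane partials:
w0 = window_result(partials, 0, 60, 30)    # combines panes 0 and 30
w1 = window_result(partials, 30, 60, 30)   # combines panes 30 and 60
```

Pane 30 is computed once but contributes to both windows, which is exactly the redundancy the slide says panes eliminate.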
Impact of data loss on analysis
• Servers may get overloaded or fail, so panes can be lost

Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
Quantifying data fidelity
• Data are naturally distributed
  – space (server nodes)
  – time (processing window)

• C2 metric
  – annotates result windows with a “scoreboard”
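One way to picture the scoreboard (an illustrative sketch, not the prototype’s data structure): mark which (node, pane) cells of a result window actually arrived, and report the fraction present as fidelity:

```python
def fidelity(scoreboard, nodes, panes):
    """C2-style fidelity: fraction of (node, pane) cells of the
    result window that actually arrived. `scoreboard` maps each
    node to the set of pane ids it contributed."""
    total = len(nodes) * len(panes)
    present = sum(len(scoreboard.get(n, set()) & set(panes)) for n in nodes)
    return present / total

nodes = ["n1", "n2"]
panes = [0, 30]
scoreboard = {"n1": {0, 30}, "n2": {0}}   # n2 lost pane 30
f = fidelity(scoreboard, nodes, panes)    # 3 of 4 cells present -> 0.75
```

The scoreboard says more than the single number: it also records *where* (which nodes) and *when* (which panes) data is missing, which the next slides exploit.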
Trading fidelity for latency
• Use C2 to trade fidelity for latency
  – maximum latency requirement
  – minimum fidelity requirement

• Different ways to meet minimum fidelity
  – 4 useful classes of C2 specifications
Minimizing result latency
• Minimum fidelity with earliest results
• Gives the most freedom to decrease latency
  – return the earliest data available
• Appropriate for uniformly distributed events
Sampling non-uniform events
• Minimum fidelity with random sampling
• Less freedom to decrease latency
  – included data may not be the first available
• Appropriate even for non-uniform data
Correlating events across time and space

Leverage knowledge about the data distribution:
• Temporal completeness
  – include all data from a node, or no data at all
• Spatial completeness
  – each pane contains data from all nodes
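These two completeness classes can be checked against the same kind of scoreboard; the following sketch assumes a per-node set of received pane ids:

```python
def temporally_complete(scoreboard, panes):
    """Temporal completeness: each node contributes either all panes
    of the window or no data at all."""
    required = set(panes)
    return all(got == required or not got for got in scoreboard.values())

def spatially_complete(scoreboard, nodes, panes):
    """Spatial completeness: every pane contains data from all nodes."""
    return all(all(p in scoreboard.get(n, set()) for n in nodes)
               for p in panes)

panes = [0, 30]
sb = {"n1": {0, 30}, "n2": set()}  # n2 failed for the whole window
t_ok = temporally_complete(sb, panes)            # True: all-or-nothing per node
s_ok = spatially_complete(sb, ["n1", "n2"], panes)  # False: n2 missing everywhere
```

The same incomplete window can satisfy one class and violate the other, which is why the choice of C2 specification depends on how events are correlated across nodes and time.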
Prototype
• Built upon Mortar
  – sliding windows
  – in-network aggregation trees

• Extended to support:
  – MapReduce API
  – pane-based processing
  – fault-tolerance mechanisms
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently

• Load-shedding mechanism
  – nodes monitor their local processing rate
  – shed panes that cannot be processed on time
• Increases result fidelity under time and resource constraints
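A sketch of the shedding decision (the policy details here are assumptions for illustration, not the prototype’s exact mechanism): given the measured processing rate and the latency deadline, a node keeps only the panes it can still finish in time:

```python
def shed_panes(pending, rate, deadline, now):
    """Load-shedding sketch: `pending` holds panes (each with a record
    count) in arrival order, `rate` is the node's measured processing
    rate in records/sec, and `deadline` is the latest completion time.
    Keep panes that fit the remaining processing budget; shed the rest
    (lowering fidelity rather than missing the deadline)."""
    budget = (deadline - now) * rate          # records we can still handle
    kept, shed = [], []
    for pane in pending:                      # oldest pane first
        if pane["records"] <= budget:
            budget -= pane["records"]
            kept.append(pane)
        else:
            shed.append(pane)
    return kept, shed

pending = [{"id": 0, "records": 100},
           {"id": 30, "records": 300},
           {"id": 60, "records": 100}]
# 6 seconds left at 50 records/sec -> budget of 300 records:
# pane 0 fits (100), pane 30 does not (300 > 200 left), pane 60 fits.
kept, shed = shed_panes(pending, rate=50, deadline=10, now=4)
```

Shedding whole panes (rather than arbitrary records) keeps the C2 scoreboard meaningful: each shed pane is simply a missing cell in the result window.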
Evaluation
• System scalability
• Usefulness of C2 metric
  – Understanding incomplete results
  – Trading fidelity for latency
• Processing data in-situ
  – Improving fidelity under load with load
    shedding
  – Minimizing impact on services
Scaling
• Synthetic input data with a word-count reducer
• 3 reducers provide sufficient processing capacity to handle the
  30 map tasks
Exploring fidelity-latency trade-offs

Data loss affects the accuracy of the computed distribution:
• Temporal completeness
• Spatial completeness and random sampling

C2 allows trading fidelity for lower latency.

(Figure annotations: “100% accuracy” and “>25% decrease”.)
In-situ performance
• iMR running side-by-side with a real service (Hadoop)

• Vary the CPU allocated to iMR, measuring
  – result fidelity
  – Hadoop performance (job throughput)

(Figure annotations: “560%” and “<11% overhead”.)
Conclusion
• The in-situ architecture processes logs at the sources, avoids
  bulk data transfers, and reduces analysis time
• The model allows incomplete data under failures or server load
  and provides timely analysis
• The C2 metric helps users understand incomplete data and trade
  fidelity for latency
• iMR proactively sheds load, improving data fidelity under
  resource and time constraints
