In-situ MapReduce for Log Processing


Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
Log analytics
• Data centers with 1000s of
  servers

• Data-intensive computing:
  Store and analyze TBs of logs

Examples:
• Click logs
   – ad-targeting, personalization
• Social media feeds
   – brand monitoring
• Purchase logs
   – fraud detection
• System logs
   – anomaly detection, debugging
Log analytics today
• “Store-first-query later”

Problems:
• Scale
   – Stress network and disks
• Failures
   – Delay analysis or process incomplete data
• Timeliness
   – Hinder real-time apps

[Figure labels: Servers; Store first ...; ... query later; MapReduce; Dedicated cluster]
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers
• MapReduce for continuous data
• Ability to trade fidelity for latency

Optimized for:
• Highly selective workloads
   – e.g., up to 80% data filtered or summarized!
• Online analytics
   – e.g., ad re-targeting based on most recent clicks

[Figure labels: Servers; MapReduce; Dedicated cluster]
An iMR query
The same:
• MapReduce API
  – map(r) → {k,v} : extract/filter data
  – reduce({k,v[]}) → v’ : data aggregation
  – combine({k,v[]}) → v’ : early, partial aggregation


The new:
• Provides continuous results
  – because logs are continuous
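
As a rough illustration of the three functions listed on this slide, the Python sketch below counts HTTP status codes in a few log lines. The log format, the function bodies, and the `run` driver are assumptions made for this example; they are not iMR's actual API.

```python
from collections import defaultdict

def map_fn(record):
    """Extract/filter: emit (status, 1) for well-formed lines, drop the rest."""
    parts = record.split()
    if len(parts) == 3 and parts[2].isdigit():
        yield parts[2], 1

def combine_fn(key, values):
    """Early, partial aggregation on the server that produced the data."""
    return sum(values)

def reduce_fn(key, values):
    """Final aggregation over the partial results."""
    return sum(values)

def group(pairs):
    """Group (key, value) pairs into key -> list of values."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def run(node_logs):
    """Hypothetical driver: map and combine per server, then reduce once."""
    partials = []
    for records in node_logs:  # one list of records per server
        mapped = [kv for r in records for kv in map_fn(r)]
        partials += [(k, combine_fn(k, vs)) for k, vs in group(mapped).items()]
    return {k: reduce_fn(k, vs) for k, vs in group(partials).items()}

result = run([
    ["GET /a 200", "GET /b 404", "garbage"],
    ["GET /a 200", "GET /c 200"],
])
print(result)
```

Because `combine_fn` runs where the data is produced, each server ships one partial count per key rather than one pair per log line.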
Continuous MapReduce
• Input
   – An infinite stream of logs
• Bound input with sliding windows
   – Range of data (R)
   – Update frequency (S)
• Output
   – Stream of results, one for each window

[Figure labels: Log entries; Time 0’’, 30’’, 60’’, 90’’; Map; Combine; Reduce]
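
One way to picture a sliding window with range R and update frequency S is the sketch below. It assumes that windows start at multiples of S and cover the half-open interval [start, start + R), and that timestamps are integer seconds; the slide does not state these conventions, and the values R = 60 and S = 30 are chosen only for illustration.

```python
from collections import Counter, defaultdict

def windows_for(timestamp, range_s, slide_s):
    """Start times of every window [start, start + range_s) containing timestamp."""
    start = (timestamp // slide_s) * slide_s
    starts = []
    while start > timestamp - range_s and start >= 0:
        starts.append(start)
        start -= slide_s
    return sorted(starts)

def windowed_counts(events, range_s, slide_s):
    """One result per window: a count of keys seen inside that window."""
    per_window = defaultdict(Counter)
    for ts, key in events:
        for start in windows_for(ts, range_s, slide_s):
            per_window[start][key] += 1
    return {start: dict(c) for start, c in sorted(per_window.items())}

events = [(5, "a"), (40, "b"), (70, "a")]
out = windowed_counts(events, range_s=60, slide_s=30)
print(out)
```

With R larger than S, an event such as `(40, "b")` falls into two windows, which is the overlap that the following slides set out to handle efficiently.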
Processing windows in-network
[Figure labels: User’s reduce function; Overlapping data; Time 0’’, 30’’, 60’’, 90’’; Map; Combine; Reduce]
Aggregation tree for efficiency
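
The idea of an aggregation tree can be sketched as a bottom-up merge of partial results. The tree shape, node names, and the use of key counts as the partial result are all assumptions for this example; the slides do not describe how iMR builds or maintains its tree.

```python
from collections import Counter

def aggregate(tree, node, local):
    """Merge a node's local partial counts with those of its children, bottom-up."""
    total = Counter(local.get(node, {}))
    for child in tree.get(node, []):
        total.update(aggregate(tree, child, local))
    return total

# Hypothetical topology: root has children a and b; a has children c and d.
tree = {"root": ["a", "b"], "a": ["c", "d"]}
local = {"a": {"x": 1}, "b": {"x": 2, "y": 1}, "c": {"y": 4}, "d": {"x": 1}}

root_view = aggregate(tree, "root", local)
print(dict(root_view))
```

Each interior node forwards a single merged summary, so the root receives one message per child instead of one per server.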
Efficient processing with panes
• Divide window into panes (sub-windows)
  – Each pane is processed and sent only once
  – Root combines panes to produce window
• Eliminate redundant work
  – Save CPU & network resources, faster analysis

[Figure labels: P1 P2 P3 P4 P5; Time 0’’, 30’’, 60’’, 90’’; Map; Combine; Reduce]
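
A minimal sketch of pane-based processing follows, assuming the window range is a whole number of panes and the aggregate is a mergeable count. The function names and the pane width of 30 seconds are illustrative, not taken from iMR.

```python
from collections import Counter

def pane_summaries(events, pane_s):
    """Aggregate each event exactly once into its pane (pane index = ts // pane_s)."""
    panes = {}
    for ts, key in events:
        panes.setdefault(ts // pane_s, Counter())[key] += 1
    return panes

def windows_from_panes(panes, panes_per_window):
    """Root step: merge consecutive pane summaries into one result per window."""
    if not panes:
        return {}
    results = {}
    for first in range(0, max(panes) + 1):
        merged = Counter()
        for p in range(first, first + panes_per_window):
            merged.update(panes.get(p, Counter()))
        results[first] = dict(merged)
    return results

events = [(5, "a"), (40, "b"), (70, "a")]
panes = pane_summaries(events, pane_s=30)
windows = windows_from_panes(panes, panes_per_window=2)
print(windows)
```

Every event is touched once when its pane is built; overlapping windows then reuse the stored pane summaries instead of reprocessing the raw log entries.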
Impact of data loss on analysis
• Servers may get overloaded or fail

Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency

[Figure labels: P1 P2 P3 P4 P5; X; Map; Combine; Reduce; ?]
Quantifying data fidelity
• Data are naturally
  distributed
  – Space (server nodes)
  – Time (processing window)


• C2 metric
  – Annotates result windows
    with a “scoreboard”
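
The slide does not show the scoreboard's layout. The sketch below assumes one row per server node (space) and one column per pane (time), with each cell marking whether that node's pane reached the result. The `fraction_included` summary is likewise an assumption, added only to show how such a grid could be condensed into a single number.

```python
def make_scoreboard(nodes, panes, received):
    """received: set of (node, pane) pairs that made it into the result window."""
    return {n: [(n, p) in received for p in panes] for n in nodes}

def fraction_included(scoreboard):
    """Share of (node, pane) cells present in the result."""
    cells = [c for row in scoreboard.values() for c in row]
    return sum(cells) / len(cells)

def render(scoreboard):
    """Text view: '#' for included data, '.' for missing data."""
    return "\n".join(
        f"{node}: " + "".join("#" if ok else "." for ok in row)
        for node, row in scoreboard.items()
    )

nodes = ["n1", "n2", "n3"]
panes = [0, 1, 2, 3]
received = {("n1", 0), ("n1", 1), ("n1", 2), ("n1", 3),
            ("n2", 0), ("n2", 1),
            ("n3", 0), ("n3", 1), ("n3", 2)}

board = make_scoreboard(nodes, panes, received)
print(render(board))
print(fraction_included(board))
```

The grid view shows where the data went missing (which nodes, which panes), information a single percentage would hide.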
Trading fidelity for latency
• Use C2 to trade fidelity for
  latency
  – Maximum latency requirement
  – Minimum fidelity requirement


• Different ways to meet
  minimum fidelity
  – 4 useful classes of C2
    specifications

Minimizing result latency




• Minimum fidelity with earlier results
• Give freedom to decrease latency
  – Return the earliest data available
• Appropriate for uniformly distributed
  events
Sampling non-uniform events




• Minimum fidelity with random sampling
• Less freedom to decrease latency
  – Included data may not be the first
    available
• Appropriate even for non-uniform data
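
The difference between returning the earliest data and sampling at random can be sketched as two selection policies over pane arrival times. The arrival times, the reading of fidelity as a fraction of panes, and both function names are assumptions for illustration only.

```python
import math
import random

def earliest_available(arrivals, min_fidelity):
    """Pick the first-arriving panes until the fidelity target is met."""
    k = math.ceil(min_fidelity * len(arrivals))
    chosen = sorted(arrivals, key=arrivals.get)[:k]
    return chosen, max(arrivals[p] for p in chosen)

def random_sample(arrivals, min_fidelity, rng):
    """Pick panes uniformly at random; the result waits for the slowest pick."""
    k = math.ceil(min_fidelity * len(arrivals))
    chosen = rng.sample(sorted(arrivals), k)
    return chosen, max(arrivals[p] for p in chosen)

# Hypothetical arrival times (seconds) of four panes at the root.
arrivals = {"P1": 3.0, "P2": 1.0, "P3": 7.0, "P4": 2.0}

early_panes, early_latency = earliest_available(arrivals, 0.5)
sample_panes, sample_latency = random_sample(arrivals, 0.5, random.Random(0))
print(early_panes, early_latency)
print(sample_panes, sample_latency)
```

Random sampling can never finish before the earliest-available policy at the same fidelity target, which matches the "less freedom to decrease latency" point; in exchange, its selection does not favor whichever data happens to arrive first.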
Correlating events across time and space

Leverage knowledge about data distribution
• Temporal completeness
  – Include all data from a
    node or no data at all


• Spatial completeness
  – Each pane contains data
    from all nodes
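
The two completeness notions can be expressed as checks over a grid with one row per node and one Boolean per pane. This layout is assumed for the example and is not taken from the slides. Spatial completeness is read here as "every pane that is included has data from all nodes", which is one interpretation of the bullet above.

```python
def temporally_complete(board):
    """Every node contributes all of its panes or none of them."""
    return all(all(row) or not any(row) for row in board.values())

def spatially_complete(board):
    """Every pane that is included has data from all nodes."""
    columns = list(zip(*board.values()))
    return all(all(col) or not any(col) for col in columns)

# Node n2 dropped out entirely; n1 delivered every pane.
temporal = {"n1": [True, True, True], "n2": [False, False, False]}
# The middle pane is missing everywhere; the other panes are complete.
spatial = {"n1": [True, False, True], "n2": [True, False, True]}

print(temporally_complete(temporal), spatially_complete(temporal))
print(temporally_complete(spatial), spatially_complete(spatial))
```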
Prototype
• Built upon Mortar
  – Sliding windows
  – In-network aggregation trees

• Extended to support:
  – MapReduce API
  – Pane-based processing
  – Fault tolerance mechanisms
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently

• Load shedding mechanism
  – Nodes monitor local processing rate
  – Shed panes that cannot be processed on
    time
• Increase result fidelity under time and
  resource constraints
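
A minimal sketch of the load shedding idea, assuming each pane carries a record count and a deadline, and that the node has measured its own processing rate in records per second. The tuple layout, the function name, and the numbers are hypothetical.

```python
def plan_panes(panes, rate, now=0.0):
    """Decide, in arrival order, which panes to process and which to shed.

    panes: list of (pane_id, record_count, deadline_seconds)
    rate:  locally measured processing rate in records per second
    """
    processed, shed = [], []
    clock = now
    for pane_id, records, deadline in panes:
        finish = clock + records / rate
        if finish <= deadline:
            processed.append(pane_id)
            clock = finish
        else:
            # This pane cannot be finished on time, so skip it entirely.
            shed.append(pane_id)
    return processed, shed

panes = [("P1", 1000, 30.0), ("P2", 4000, 60.0), ("P3", 1000, 90.0)]
processed, shed = plan_panes(panes, rate=50.0)
print(processed, shed)
```

In this example, working through P2 would take until t = 100 s and make P3 late as well; shedding P2 up front lets P3 finish on time, so the result contains two panes instead of one.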
Evaluation
• System scalability
• Usefulness of C2 metric
  – Understanding incomplete results
  – Trading fidelity for latency
• Processing data in-situ
  – Improving fidelity under load with load
    shedding
  – Minimizing impact on services
Scaling
• Synthetic input data, word-count reducer
• 3 reducers provide sufficient processing
  to handle the 30 map tasks




Exploring fidelity-latency trade-offs
Data loss affects accuracy of distribution
• Temporal completeness
• Spatial completeness and random sampling

[Figure annotations: 100% accuracy; >25% decrease]

C2 allows trading fidelity for lower latency
In-situ performance
• iMR side-by-side with a real service (Hadoop)
• Vary CPU allocated to iMR
  – Result fidelity
  – Hadoop performance (job throughput)

[Figure annotations: 560%; <11% overhead]
Conclusion
• In-situ architecture processes logs at the
  sources, avoids bulk data transfers, and
  reduces analysis time
• Model allows incomplete data under failures
  or server load, provides timely analysis
• C2 metric helps understand incomplete data
  and trade fidelity for latency
• Pro-actively sheds load, improves data fidelity
  under resource and time constraints
