Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Dependency Driven Analytics
a Compass for Uncharted Data Oceans/Jungles
Ruslan Mavlyutov, Carlo Curino, Boris Asipov, Phil...
The production job “JobA” failed…
impact? debug? re-run?
1) look in the logs
PBs of daily
2) ask local experts
(they know “how” to look)
But don’t bother them too much…
The Problem
Focused analyses of massive, loosely structured, evolving
data has prohibitive cognitive and computational cos...
Focused analyses of massive, loosely structured, evolving
data has prohibitive cognitive and computational costs.
The Prob...
A better vantage point?
Dependency Driven Analytics (DDA)
• Derive a dependency graph (DG) from raw data
The DG serve as:
• Conceptual Map, and
• ...
DDA: infrastructure logs “incarnation”
• The DG stores:
provenance + telemetry
• NODES: jobs / files / machines / tasks / ...
Current implementation
Raw
Data
Extraction
Dependency
Definition
Storage
Querying
Scope/
Cosmos Neo4J
dependency
graph
Sch...
Extract “jobs processing hours”
extStart = EXTRACT * FROM "ProcStarted_%Y%m%d.log"
USING EventExtractor("ProcStarted");
st...
Example: “Measure JobA’s impact”
graph.traversal().V()
.has("JobTemplateName","JobA_*")
.local(
emit().repeat(out()).times...
DDA: Initial Experiments
Improvements of up to:
• 7x less LoC*
• 700x less run-time
• > 50,000x less CPU-time
• > 800x les...
Not all queries are as easy…
Simple search/browsing
Local or agg. queries on
telemetry / provenance
 Graph queries on DG
...
DDA: open challenges
• Automatically “map” the raw data
• Real-time log ingestion at scale
• Scale-out graph management
• ...
Scope
Enterprise SearchInternet of ThingsInfrastructure logs
…
Conclusions
Problem:
• Focused analyses of massive, loosely structured, evolving data has
prohibitive costs
DDA solution:
...
Upcoming SlideShare
Loading in …5
×

Dependency-Driven Analytics: A Compass for Uncharted Data Oceans

363 views

Published on

Slides of Carlo's CIDR 2017 talk on GUIDER
Paper is here: https://exascale.info/assets/pdf/cidr2017_dependency-driven-analytics.pdf

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Dependency-Driven Analytics: A Compass for Uncharted Data Oceans

  1. 1. Dependency Driven Analytics a Compass for Uncharted Data Oceans/Jungles Ruslan Mavlyutov, Carlo Curino, Boris Asipov, Phil Cudre-Mauroux
  2. 2. The production job “JobA” failed… impact? debug? re-run?
  3. 3. 1) look in the logs PBs of daily
  4. 4. 2) ask local experts (they know “how” to look)
  5. 5. But don’t bother them too much…
  6. 6. The Problem Focused analyses of massive, loosely structured, evolving data has prohibitive cognitive and computational costs.
  7. 7. Focused analyses of massive, loosely structured, evolving data has prohibitive cognitive and computational costs. The Problem Cost of understanding raw data Cost of processing raw data
  8. 8. A better vantage point?
  9. 9. Dependency Driven Analytics (DDA) • Derive a dependency graph (DG) from raw data The DG serve as: • Conceptual Map, and • Sparse Index for the raw data DDA today DDA vision • Automation • Language-integration • Real-time • …
  10. 10. DDA: infrastructure logs “incarnation” • The DG stores: provenance + telemetry • NODES: jobs / files / machines / tasks / … • EDGES: job-reads-file, task-runs-on-machine • PROPERTIES: timestamps / resources usage / … Raw data (logs) Query Interface “JobA’s impact?”
  11. 11. Current implementation Raw Data Extraction Dependency Definition Storage Querying Scope/ Cosmos Neo4J dependency graph Schema + extr. rules Big Data System Graph System Raw Data Raw Data
  12. 12. Extract “jobs processing hours” extStart = EXTRACT * FROM "ProcStarted_%Y%m%d.log" USING EventExtractor("ProcStarted"); startData = SELECT ProcessGuid AS ProcessId, CurrentTimeStamp.Value AS StartTime, JobGuid AS JobId FROM extStart WHERE ProcessGuid != null AND JobGuid != null AND CurrentTimeStamp.HasValue; … procH = SELECT endData.JobId, SUM((End - Start).TotalMs)/1000/3600 AS procHours, FROM startData INNER JOIN endData ON startData.ProcessId == endData.ProcessId AND startData.JobId == endData.JobId GROUP BY JobId; OUTPUT (SELECT JobId, procHours FROM procH) TO "processingHours.csv";
  13. 13. Example: “Measure JobA’s impact” graph.traversal().V() .has("JobTemplateName","JobA_*") .local( emit().repeat(out()).times(100) .hasLabel("job").dedup() .values(“procHours").sum() ).mean() …
  14. 14. DDA: Initial Experiments Improvements of up to: • 7x less LoC* • 700x less run-time • > 50,000x less CPU-time • > 800x less I/O * Heavy under-representation of hardness of baseline
  15. 15. Not all queries are as easy… Simple search/browsing Local or agg. queries on telemetry / provenance  Graph queries on DG (i.e., covering index) Complex/AdHoc queries (e.g., debugging)  Mix of DG and raw data querying (clumsy today)  UI (keyword search) Neo4J Scope/ Cosmos Neo4J +
  16. 16. DDA: open challenges • Automatically “map” the raw data • Real-time log ingestion at scale • Scale-out graph management • Leverage specialized graph structures • Integrated language for graph+relational+unstructured
  17. 17. Scope Enterprise SearchInternet of ThingsInfrastructure logs …
  18. 18. Conclusions Problem: • Focused analyses of massive, loosely structured, evolving data has prohibitive costs DDA solution: • Extract a Dependency Graph (DG)  conceptual map + sparse index • Current impl. leverages existing BigData/Graph tech Open challenges: • automation / real-time / scalable graph tech / integrated language

×