SlideShare a Scribd company logo
1 of 18
Download to read offline
Dependency Driven Analytics
a Compass for Uncharted Data Oceans/Jungles
Ruslan Mavlyutov, Carlo Curino, Boris Asipov, Phil Cudre-Mauroux
The production job “JobA” failed…
impact? debug? re-run?
1) look in the logs
PBs of daily
2) ask local experts
(they know “how” to look)
But don’t bother them too much…
The Problem
Focused analyses of massive, loosely structured, evolving
data has prohibitive cognitive and computational costs.
Focused analyses of massive, loosely structured, evolving
data has prohibitive cognitive and computational costs.
The Problem
Cost of understanding raw data Cost of processing raw data
A better vantage point?
Dependency Driven Analytics (DDA)
• Derive a dependency graph (DG) from raw data
The DG serve as:
• Conceptual Map, and
• Sparse Index for the raw data
DDA today DDA vision
• Automation
• Language-integration
• Real-time
• …
DDA: infrastructure logs “incarnation”
• The DG stores:
provenance + telemetry
• NODES: jobs / files / machines / tasks / …
• EDGES: job-reads-file, task-runs-on-machine
• PROPERTIES: timestamps / resources usage / …
Raw data (logs)
Query Interface
“JobA’s impact?”
Current implementation
Raw
Data
Extraction
Dependency
Definition
Storage
Querying
Scope/
Cosmos Neo4J
dependency
graph
Schema +
extr. rules
Big Data
System
Graph
System
Raw
Data
Raw
Data
Extract “jobs processing hours”
extStart = EXTRACT * FROM "ProcStarted_%Y%m%d.log"
USING EventExtractor("ProcStarted");
startData = SELECT ProcessGuid AS ProcessId,
CurrentTimeStamp.Value AS StartTime,
JobGuid AS JobId
FROM extStart
WHERE ProcessGuid != null AND JobGuid != null AND
CurrentTimeStamp.HasValue;
…
procH = SELECT endData.JobId,
SUM((End - Start).TotalMs)/1000/3600 AS procHours,
FROM startData INNER JOIN endData ON startData.ProcessId ==
endData.ProcessId AND startData.JobId == endData.JobId
GROUP BY JobId;
OUTPUT (SELECT JobId, procHours FROM procH) TO "processingHours.csv";
Example: “Measure JobA’s impact”
graph.traversal().V()
.has("JobTemplateName","JobA_*")
.local(
emit().repeat(out()).times(100)
.hasLabel("job").dedup()
.values(“procHours").sum()
).mean()
…
DDA: Initial Experiments
Improvements of up to:
• 7x less LoC*
• 700x less run-time
• > 50,000x less CPU-time
• > 800x less I/O
* Heavy under-representation of hardness of baseline
Not all queries are as easy…
Simple search/browsing
Local or agg. queries on
telemetry / provenance
 Graph queries on DG
(i.e., covering index)
Complex/AdHoc queries
(e.g., debugging)
 Mix of DG and raw data
querying (clumsy today)
 UI (keyword search)
Neo4J
Scope/
Cosmos Neo4J
+
DDA: open challenges
• Automatically “map” the raw data
• Real-time log ingestion at scale
• Scale-out graph management
• Leverage specialized graph structures
• Integrated language for
graph+relational+unstructured
Scope
Enterprise SearchInternet of ThingsInfrastructure logs
…
Conclusions
Problem:
• Focused analyses of massive, loosely structured, evolving data has
prohibitive costs
DDA solution:
• Extract a Dependency Graph (DG)  conceptual map + sparse index
• Current impl. leverages existing BigData/Graph tech
Open challenges:
• automation / real-time / scalable graph tech / integrated language

More Related Content

Similar to Dependency-Driven Analytics: A Compass for Uncharted Data Oceans

Stream dataprocessing101
Stream dataprocessing101Stream dataprocessing101
Stream dataprocessing101Sotaro Kimura
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wyciekówKonrad Kokosa
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationMongoDB
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft AzureDmitry Petukhov
 
10 Key MongoDB Performance Indicators
10 Key MongoDB Performance Indicators  10 Key MongoDB Performance Indicators
10 Key MongoDB Performance Indicators iammutex
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbMongoDB APAC
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Visual Exploration of Large Data sets with D3, crossfilter and dc.js
Visual Exploration of Large Data sets with D3, crossfilter and dc.jsVisual Exploration of Large Data sets with D3, crossfilter and dc.js
Visual Exploration of Large Data sets with D3, crossfilter and dc.jsFlorian Georg
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Everything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebEverything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebJames Rakich
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoopColin Su
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Matthew J Collins
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesBig Data Colombia
 
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...DevOpsDays Tel Aviv
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 

Similar to Dependency-Driven Analytics: A Compass for Uncharted Data Oceans (20)

Stream dataprocessing101
Stream dataprocessing101Stream dataprocessing101
Stream dataprocessing101
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft Azure
 
10 Key MongoDB Performance Indicators
10 Key MongoDB Performance Indicators  10 Key MongoDB Performance Indicators
10 Key MongoDB Performance Indicators
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Visual Exploration of Large Data sets with D3, crossfilter and dc.js
Visual Exploration of Large Data sets with D3, crossfilter and dc.jsVisual Exploration of Large Data sets with D3, crossfilter and dc.js
Visual Exploration of Large Data sets with D3, crossfilter and dc.js
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Everything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebEverything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the Web
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
 
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 

More from eXascale Infolab

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictioneXascale Infolab
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...eXascale Infolab
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex GraphseXascale Infolab
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapeXascale Infolab
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...eXascale Infolab
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...eXascale Infolab
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutioneXascale Infolab
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataeXascale Infolab
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataeXascale Infolab
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingeXascale Infolab
 
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...eXascale Infolab
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingeXascale Infolab
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big DataeXascale Infolab
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)eXascale Infolab
 

More from eXascale Infolab (20)

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
 
Cikm 2018
Cikm 2018Cikm 2018
Cikm 2018
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
 
Crowd scheduling www2016
Crowd scheduling www2016Crowd scheduling www2016
Crowd scheduling www2016
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
SSSW 2015 Sense Making
SSSW 2015 Sense MakingSSSW 2015 Sense Making
SSSW 2015 Sense Making
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web Data
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task Crowdsourcing
 
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big Data
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
 

Recently uploaded

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 

Recently uploaded (20)

Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 

Dependency-Driven Analytics: A Compass for Uncharted Data Oceans

  • 1. Dependency Driven Analytics a Compass for Uncharted Data Oceans/Jungles Ruslan Mavlyutov, Carlo Curino, Boris Asipov, Phil Cudre-Mauroux
  • 2. The production job “JobA” failed… impact? debug? re-run?
  • 3. 1) look in the logs PBs of daily
  • 4. 2) ask local experts (they know “how” to look)
  • 5. But don’t bother them too much…
  • 6. The Problem Focused analyses of massive, loosely structured, evolving data has prohibitive cognitive and computational costs.
  • 7. Focused analyses of massive, loosely structured, evolving data has prohibitive cognitive and computational costs. The Problem Cost of understanding raw data Cost of processing raw data
  • 9. Dependency Driven Analytics (DDA) • Derive a dependency graph (DG) from raw data The DG serve as: • Conceptual Map, and • Sparse Index for the raw data DDA today DDA vision • Automation • Language-integration • Real-time • …
  • 10. DDA: infrastructure logs “incarnation” • The DG stores: provenance + telemetry • NODES: jobs / files / machines / tasks / … • EDGES: job-reads-file, task-runs-on-machine • PROPERTIES: timestamps / resources usage / … Raw data (logs) Query Interface “JobA’s impact?”
  • 12. Extract “jobs processing hours” extStart = EXTRACT * FROM "ProcStarted_%Y%m%d.log" USING EventExtractor("ProcStarted"); startData = SELECT ProcessGuid AS ProcessId, CurrentTimeStamp.Value AS StartTime, JobGuid AS JobId FROM extStart WHERE ProcessGuid != null AND JobGuid != null AND CurrentTimeStamp.HasValue; … procH = SELECT endData.JobId, SUM((End - Start).TotalMs)/1000/3600 AS procHours, FROM startData INNER JOIN endData ON startData.ProcessId == endData.ProcessId AND startData.JobId == endData.JobId GROUP BY JobId; OUTPUT (SELECT JobId, procHours FROM procH) TO "processingHours.csv";
  • 13. Example: “Measure JobA’s impact” graph.traversal().V() .has("JobTemplateName","JobA_*") .local( emit().repeat(out()).times(100) .hasLabel("job").dedup() .values(“procHours").sum() ).mean() …
  • 14. DDA: Initial Experiments Improvements of up to: • 7x less LoC* • 700x less run-time • > 50,000x less CPU-time • > 800x less I/O * Heavy under-representation of hardness of baseline
  • 15. Not all queries are as easy… Simple search/browsing Local or agg. queries on telemetry / provenance  Graph queries on DG (i.e., covering index) Complex/AdHoc queries (e.g., debugging)  Mix of DG and raw data querying (clumsy today)  UI (keyword search) Neo4J Scope/ Cosmos Neo4J +
  • 16. DDA: open challenges • Automatically “map” the raw data • Real-time log ingestion at scale • Scale-out graph management • Leverage specialized graph structures • Integrated language for graph+relational+unstructured
  • 17. Scope Enterprise SearchInternet of ThingsInfrastructure logs …
  • 18. Conclusions Problem: • Focused analyses of massive, loosely structured, evolving data has prohibitive costs DDA solution: • Extract a Dependency Graph (DG)  conceptual map + sparse index • Current impl. leverages existing BigData/Graph tech Open challenges: • automation / real-time / scalable graph tech / integrated language