SlideShare a Scribd company logo

Analyzing Log Data With Apache Spark

Spark Summit 2016 talk by William Benton (Red Hat)

1 of 112
Download to read offline
Analyzing log data with
Apache Spark
William Benton
Red Hat Emerging Technology
BACKGROUND
Challenges of log data
Challenges of log data
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour
Challenges of log data
11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour
Challenges of log data
postgres
httpd
syslog
INFO INFO WARN CRIT DEBUG INFO
GET GET GET POST
WARN WARN INFO INFO INFO
GET (404)
INFO
(ca. 2000)

Recommended

Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jDatabricks
 
New Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overviewNew Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overviewPaolo Dametto
 
DNA MICROARRAY TECHNIQUES
DNA MICROARRAY TECHNIQUESDNA MICROARRAY TECHNIQUES
DNA MICROARRAY TECHNIQUESgayathryp1
 
Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods Mrinal Vashisth
 
CRISPR-Cas9 크리스퍼 유전자 가위
CRISPR-Cas9 크리스퍼 유전자 가위CRISPR-Cas9 크리스퍼 유전자 가위
CRISPR-Cas9 크리스퍼 유전자 가위Pin_network
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
Alphafold2 - Protein Structural Bioinformatics After CASP14
Alphafold2 - Protein Structural Bioinformatics After CASP14Alphafold2 - Protein Structural Bioinformatics After CASP14
Alphafold2 - Protein Structural Bioinformatics After CASP14Purdue University
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Mrinal Vashisth
 

More Related Content

What's hot

Sanger Sequencing By D Gnanasingh Arputhadas
Sanger Sequencing By D Gnanasingh ArputhadasSanger Sequencing By D Gnanasingh Arputhadas
Sanger Sequencing By D Gnanasingh ArputhadasGnanasingh Arputhadas
 
Kegg database resources
Kegg database resources Kegg database resources
Kegg database resources innocent87
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqEnis Afgan
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingShaheen Alam
 
Introduction to ggplot2
Introduction to ggplot2Introduction to ggplot2
Introduction to ggplot2maikroeder
 
Next Generation Sequencing application in virology
Next Generation Sequencing application in virologyNext Generation Sequencing application in virology
Next Generation Sequencing application in virologyEben Titus
 
Multiple alignment
Multiple alignmentMultiple alignment
Multiple alignmentavrilcoghlan
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomicsAjit Shinde
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingUzma Jabeen
 
Next generation sequencing technologies for crop improvement
Next generation sequencing technologies for crop improvementNext generation sequencing technologies for crop improvement
Next generation sequencing technologies for crop improvementanjaligoud
 
Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation SequencingFarid MUSA
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 

What's hot (20)

Sanger Sequencing By D Gnanasingh Arputhadas
Sanger Sequencing By D Gnanasingh ArputhadasSanger Sequencing By D Gnanasingh Arputhadas
Sanger Sequencing By D Gnanasingh Arputhadas
 
Kegg database resources
Kegg database resources Kegg database resources
Kegg database resources
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Introduction to ggplot2
Introduction to ggplot2Introduction to ggplot2
Introduction to ggplot2
 
Next Generation Sequencing application in virology
Next Generation Sequencing application in virologyNext Generation Sequencing application in virology
Next Generation Sequencing application in virology
 
Ensembl annotation
Ensembl annotationEnsembl annotation
Ensembl annotation
 
Multiple alignment
Multiple alignmentMultiple alignment
Multiple alignment
 
Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Basics of association_mapping
Basics of association_mappingBasics of association_mapping
Basics of association_mapping
 
Next generation sequencing technologies for crop improvement
Next generation sequencing technologies for crop improvementNext generation sequencing technologies for crop improvement
Next generation sequencing technologies for crop improvement
 
Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 
BLAST
BLASTBLAST
BLAST
 
Crispr cas9
Crispr cas9 Crispr cas9
Crispr cas9
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 

Similar to Analyzing Log Data With Apache Spark

Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
OSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael MedinOSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael MedinNETWAYS
 
Fire-fighting java big data problems
Fire-fighting java big data problemsFire-fighting java big data problems
Fire-fighting java big data problemsgrepalex
 
Juggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBJuggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBDavid Golden
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillCharles Givre
 
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo LogicOC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo LogicBig Data Joe™ Rossi
 
Keep it simple web development stack
Keep it simple web development stackKeep it simple web development stack
Keep it simple web development stackEric Ahn
 
Informix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all togetherInformix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all togetherKeshav Murthy
 
Rest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemyRest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemyAlessandro Cucci
 
NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013Michael Medin
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-pythonEric Ahn
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-pythonEric Ahn
 
RDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and BeyondRDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and BeyondFadi Maali
 
Introduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf TrainingIntroduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf TrainingSean Cribbs
 
How We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating SystemHow We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating Systemsaulius_vl
 
Context-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresContext-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresSerena Villata
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redactedRyan Breed
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 

Similar to Analyzing Log Data With Apache Spark (20)

Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
OSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael MedinOSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael Medin
 
Fire-fighting java big data problems
Fire-fighting java big data problemsFire-fighting java big data problems
Fire-fighting java big data problems
 
Juggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBJuggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDB
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
 
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo LogicOC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
 
Keep it simple web development stack
Keep it simple web development stackKeep it simple web development stack
Keep it simple web development stack
 
Informix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all togetherInformix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all together
 
Rest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemyRest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemy
 
NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-python
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-python
 
RDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and BeyondRDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and Beyond
 
Introduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf TrainingIntroduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf Training
 
How We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating SystemHow We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating System
 
Context-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresContext-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph Stores
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redacted
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Tips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsTips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsDataArchiva
 
itc limited word file.pdf...............
itc limited word file.pdf...............itc limited word file.pdf...............
itc limited word file.pdf...............mahetamanav24
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)UNCResearchHub
 
Unlocking New Insights Into the World of European Soccer Through the European...
Unlocking New Insights Into the World of European Soccer Through the European...Unlocking New Insights Into the World of European Soccer Through the European...
Unlocking New Insights Into the World of European Soccer Through the European...ThinkInnovation
 
Basics of Creating Graphs / Charts using Microsoft Excel
Basics of Creating Graphs / Charts using Microsoft ExcelBasics of Creating Graphs / Charts using Microsoft Excel
Basics of Creating Graphs / Charts using Microsoft ExcelTope Osanyintuyi
 
fundamentals of digital imaging - POONAM.pptx
fundamentals of digital imaging - POONAM.pptxfundamentals of digital imaging - POONAM.pptx
fundamentals of digital imaging - POONAM.pptxPoonamRijal
 
What you need to know about Generative AI and Data Management?
What you need to know about Generative AI and Data Management?What you need to know about Generative AI and Data Management?
What you need to know about Generative AI and Data Management?Denodo
 
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Thibaud Le Douarin
 
Choose your perfect jacket.pdf
Choose your perfect jacket.pdfChoose your perfect jacket.pdf
Choose your perfect jacket.pdfAlexia Trejo
 
Operations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensOperations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensKondapi V Siva Rama Brahmam
 
ppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxHizkiaJastis
 
Artificial Intelligence for Vision: A walkthrough of recent breakthroughs
Artificial Intelligence for Vision:  A walkthrough of recent breakthroughsArtificial Intelligence for Vision:  A walkthrough of recent breakthroughs
Artificial Intelligence for Vision: A walkthrough of recent breakthroughsNikolas Markou
 
Introduction to data science.pdf-Definition,types and application of Data Sci...
Introduction to data science.pdf-Definition,types and application of Data Sci...Introduction to data science.pdf-Definition,types and application of Data Sci...
Introduction to data science.pdf-Definition,types and application of Data Sci...DrSumathyV
 
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfIIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfAustraliaChapterIIBA
 

Recently uploaded (15)

Tips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsTips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data Goals
 
itc limited word file.pdf...............
itc limited word file.pdf...............itc limited word file.pdf...............
itc limited word file.pdf...............
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)
 
Unlocking New Insights Into the World of European Soccer Through the European...
Unlocking New Insights Into the World of European Soccer Through the European...Unlocking New Insights Into the World of European Soccer Through the European...
Unlocking New Insights Into the World of European Soccer Through the European...
 
Basics of Creating Graphs / Charts using Microsoft Excel
Basics of Creating Graphs / Charts using Microsoft ExcelBasics of Creating Graphs / Charts using Microsoft Excel
Basics of Creating Graphs / Charts using Microsoft Excel
 
fundamentals of digital imaging - POONAM.pptx
fundamentals of digital imaging - POONAM.pptxfundamentals of digital imaging - POONAM.pptx
fundamentals of digital imaging - POONAM.pptx
 
What you need to know about Generative AI and Data Management?
What you need to know about Generative AI and Data Management?What you need to know about Generative AI and Data Management?
What you need to know about Generative AI and Data Management?
 
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
 
Choose your perfect jacket.pdf
Choose your perfect jacket.pdfChoose your perfect jacket.pdf
Choose your perfect jacket.pdf
 
Operations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensOperations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample Screens
 
ppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptx
 
Artificial Intelligence for Vision: A walkthrough of recent breakthroughs
Artificial Intelligence for Vision:  A walkthrough of recent breakthroughsArtificial Intelligence for Vision:  A walkthrough of recent breakthroughs
Artificial Intelligence for Vision: A walkthrough of recent breakthroughs
 
Electricity Year 2023_updated_22022024.pptx
Electricity Year 2023_updated_22022024.pptxElectricity Year 2023_updated_22022024.pptx
Electricity Year 2023_updated_22022024.pptx
 
Introduction to data science.pdf-Definition,types and application of Data Sci...
Introduction to data science.pdf-Definition,types and application of Data Sci...Introduction to data science.pdf-Definition,types and application of Data Sci...
Introduction to data science.pdf-Definition,types and application of Data Sci...
 
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfIIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
 

Analyzing Log Data With Apache Spark

  • 1. Analyzing log data with Apache Spark William Benton Red Hat Emerging Technology
  • 4. Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour
  • 5. Challenges of log data 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour
  • 6. Challenges of log data postgres httpd syslog INFO INFO WARN CRIT DEBUG INFO GET GET GET POST WARN WARN INFO INFO INFO GET (404) INFO (ca. 2000)
  • 7. Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN (ca. 2016)
  • 8. Challenges of log datapostgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN How many services are generating logs in your datacenter today?
  • 10. Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd normalizing Reconciling log record metadata across sources warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  • 11. Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  • 12. Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  • 16. Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.
  • 17. logs .select("level").distinct .map { case Row(s: String) => s } .collect Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … debug, notice, emerg, err, warning, crit, info, severe, alert
  • 18. Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … logs .select("level").distinct .as[String].collect debug, notice, emerg, err, warning, crit, info, severe, alert
  • 19. Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … logs .select("level").distinct .as[String].collect debug, notice, emerg, err, warning, crit, info, severe, alert This class must be declared outside the REPL!
  • 21. From log records to vectors What does it mean for two sets of categorical features to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010
  • 22. From log records to vectors What does it mean for two sets of categorical features to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles -> 00010000 -> 00101000
  • 25. Similarity and distance (q - p) • (q - p)
  • 29. WARN INFO INFOINFO WARN DEBUGINFOINFOINFO WARN WARNINFO INFO INFO WAR INFO INFO Other interesting features host01 host02 host03
  • 30. INFO INFOINFO DEBUGINFOINFO WARNNFO INFO INFO WARN INFO INFOINFO INFO INFO INFOINFOINFO INFOINFO INFO INFO INFO INFO WARN DEBUG Other interesting features host01 host02 host03
  • 31. INFO INFOINFO INFO INFO INFOINFO INFO INFO INFO WARN DEBUG WARN INFO INFO EBUG WARN INFO INFO INFO INFO INFO INFO INFO INFO WARN WARN INFO WARN INFO Other interesting features host01 host02 host03
  • 32. Other interesting features : Great food, great service, a must-visit! : Our whole table got gastroenteritis. : This place is so wonderful that it has ruined all other tacos for me and my family.
  • 33. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK.
  • 34. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.
  • 35. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers. INFO: Phoenix datacenter is on fire; may not rise from ashes.
  • 36. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers. INFO: Phoenix datacenter is on fire; may not rise from ashes. See https://links.freevariable.com/nlp-logs/ for more!
  • 45. A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
  • 46. A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
  • 49. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray
  • 50. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no
  • 51. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no
  • 60. Outliers in log data 0.95
  • 61. Outliers in log data 0.95 0.97
  • 62. Outliers in log data 0.95 0.97 0.92
  • 63. Outliers in log data 0.95 0.97 0.92 0.37 An outlier is any record whose best match was at least 4σ below the mean. 0.94 0.89 0.91 0.93 0.96
  • 65. Out of 310 million log records, we identified 0.0012% as outliers.
  • 68. Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.
  • 74. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt
  • 75. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt at each step, we update each unit by adding its value from the previous step…
  • 76. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt to the example that we considered…
  • 77. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt scaled by a learning factor and the distance from this unit to its best match
  • 79. On-line SOM training sensitive to learning rate not parallel sensitive to example order
  • 80. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
  • 81. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) update the state of every cell in the neighborhood of the best matching unit, weighting by distance
  • 82. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) keep track of the distance weights we’ve seen for a weighted average
  • 83. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) since we can easily merge multiple states, we can train in parallel across many examples
  • 94. driver (using aggregate) workers What if you have a 3 mb model and 2,048 partitions?
  • 101. Sharing models class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here }
  • 102. Sharing models class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here } case class FrozenModel(entries: Array[Double], /* ... */ ) { }
  • 103. Sharing models case class FrozenModel(entries: Array[Double], /* ... */ ) { } class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here def freeze: FrozenModel = // ... } object Model { def thaw(im: FrozenModel): Model = // ... }
  • 104. Sharing models import org.json4s.jackson.Serialization import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite} implicit val formats = Serialization.formats(NoTypeHints) def toJson(m: Model): String = { jwrite(som.freeze) } def fromJson(json: String): Try[Model] = { Try({ Model.thaw(jread[FrozenModel](json)) }) }
  • 105. Sharing models import org.json4s.jackson.Serialization import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite} implicit val formats = Serialization.formats(NoTypeHints) def toJson(m: Model): String = { jwrite(som.freeze) } def fromJson(json: String): Try[Model] = { Try({ Model.thaw(jread[FrozenModel](json)) }) } Also consider how you’ll share feature encoders and other parts of your learning pipeline!
  • 107. Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. If your data are write-once, consider exporting ES indices to Parquet files and analyzing those instead.
  • 108. Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.
  • 109. Memory and partitioning Large JVM heaps can lead to appalling GC pauses and executor timeouts. Use multiple JVMs or off-heap storage (in Spark 2.0!) Tree aggregation can save you both memory and execution time by partially aggregating at worker nodes.
  • 110. Interoperability Avoid brittle or language-specific model serializers when sharing models with non-Spark environments. JSON is imperfect but ubiquitous. However, json4s will serialize case classes for free! See also SPARK-13944, merged recently into 2.0.
  • 111. Feature engineering Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models. Design your feature engineering pipeline so you can translate feature vectors back to factor values.