SlideShare a Scribd company logo
E l a s t i c s e a r c h & B i g D a t a
E l a s t i c s e a r c h M e e t u p 0 3 / 1 5 / 2 0 1 6
2
W H O A M I
• Roy Russo
• VP Engineering, Predikto
• ElasticHQ.org
3
A G E N D A
• What is “Predictive Analytics”?
• Why Elasticsearch & Spark for ETL?
• Elasticsearch in production
• Lessons learned
4
W H Y A M I H E R E ?
&
(Big Data) Predictive Analytics
5
W H O I S P R E D I K T O ?
• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers / Statisticians
• Big Data Architects
• Global 1000
6
W H A T W E D O ?
• Predictive Maintenance
• Predictive Analytics
• Asset Health
• Anomalies
• SaaS
7
H O W D O E S P R E D I C T I V E
A N A L Y T I C S W O R K ?
8
D a t a S o u r c e s
9
P R E D I K T O S A A S P L A T F O R M
OEM Specs
Inventory
Operator
Weather
Financial
Sensors
Asset Mgmt.
INPUT
(APIs)
DATA LAKE TRANSFORM BUSINESS
INTELLIGENC
E
LEARN OUTPUT
(APIs)
MAX
PREDIKTO
APPS
10
T H E P R E D I K T O A D V A N T A G E ( P A T E N T
P E N D I N G )
MAX
Raw data Auto dynamic
engineered
features
Auto dynamic
feature
selection
Method and
algorithm
selection
Dynamic self learning
11
Auto dynamic
engineered features
Average Std. Deviation of pressure
sensor in the last hour, compared to
avg std. dev of pressure sensor last
week for locomotives leaving the
same depot, and in the same series
of make and model.
125,000
X 125
2
M A C H I N E L E A R N I N G A U T O M A T I O N & S C A L A B I L I T Y
Load Raw Data
Load data from Sensors,
Maintenance and Repairs, Usage,
Topology, Weather,…
1
1,000
Auto dynamic
feature selection
37
3
Software algorithms and logic find the
best raw data and/or engineered features
that correlate to the best representation
of the health and condition of the
equipment.
MAX
12
I N P R O D U C T I O N
• Apache Spark Integration
• Why Elasticsearch?
• Nginx
• Elasticsearch Configuration
13
S O W E U S E S P A R K …
• ETL:
• Shared Memory
• Not Disk-Bound
• Distributed workloads
• Scale horizontally
• New node = more capacity
• Spin up. Spin down.
• Runs from Instruction-set
• DAGs
14
E S - H a d o o p : S e t u p
client = Elasticsearch(hosts=[es_uri])
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var +'}}}}}'
es_conf = {
"es.nodes": es_host_name,
"es.port": es_host_port,
"es.resource": es_index + '/' + es_mapping,
"es.query": query
}
es_RDD = sc.newAPIHadoopRDD(
inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf=es_conf)
dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))
15
E S - H a d o o p : T i p s
• Connects to specific shards
• Spark workers - to - ES (primary) shards
• Typing matters
• Beware of ‘index.mapping.coerce’
• Ignores ‘size’ parameter!
16
W H Y E L A S T I C S E A R C H ?
• Time-Series Data
• Fast-Reads
• Fast writes with Bulk Inserts
• Asynch
• Dynamic querying
• Differing schema
• Everything is indexed
• Visualization
• GeoJSON
• Scale horizontally
• New node = more capacity
• Spark-ES Connector
• Python lib
17
M I L L I O N S O F Q U E R I E S …
18
F R O N T I N G W I T H N G I N X
• Reverse Proxy
• Load-Balance across cluster
• Authentication
• read-only access
• block HTTP DELETE
• Log:
• request-body, response-time, response-size, etc…
• Keep-alive connections
• t2.micro often suffices
19
E S S E T U P : C O N F I G
• index.*.slowlog = DEBUG
• Plugins:
• Cloud-AWS
• IAM Roles for auto-discovery
• S3 Snapshot
• Changes in 2.x!
• Curator
• Master nodes:
• N+1/2 = discovery.zen.minimum_master_nodes
• boostrap.mlockall : true
20
E S S E T U P : H A R D W A R E
• r3.2xlarge instance - 64GB Quad Core
• 30GB Heap
• AMI deployment
• 500GB mounted EBS (commonly)
• path.data = /mnt-point
• Cluster-per-customer
21
E S S E T U P : I N D I C E S
• Many indices per project (use-case)
• Mappings differ considerably
• Unixtime in seconds (2.x changes)
• Type coercion (GIGO)
• source: true
• Groovy scripting : off!
• Index segmentation depends
• by… month, month-year, device, etc…
• Sparse data sucks
oil_temp RPM
80
2500
22
U I I M P L I C A T I O N S
• REST API access
• Predikto QDSL
• Dynamic Queries
• Reporting BI Interface

More Related Content

What's hot

xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
Claudiu Barbura
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Databricks
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Databricks
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
Vaclav Kosar
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
Pat Patterson
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15
Zhenxiao Luo
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
Itai Yaffe
 
Effective monitoring with statsd - Alexis lê-quôc
Effective monitoring with statsd - Alexis lê-quôcEffective monitoring with statsd - Alexis lê-quôc
Effective monitoring with statsd - Alexis lê-quôcDevopsdays
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep Dive
Databricks
 
Using druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scaleUsing druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scale
Itai Yaffe
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-Shma
Spark Summit
 
Kibana + timelion: time series with the elastic stack
Kibana + timelion: time series with the elastic stackKibana + timelion: time series with the elastic stack
Kibana + timelion: time series with the elastic stack
Sylvain Wallez
 
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Spark Summit
 
Presto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - LyftPresto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - Lyft
kbajda
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!
Databricks
 
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Lviv Startup Club
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
DataWorks Summit
 
ClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander Zaitsev
ClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander ZaitsevClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander Zaitsev
ClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander Zaitsev
Altinity Ltd
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf Jagerman
Spark Summit
 

What's hot (20)

xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Cassandra in xPatterns
Cassandra in xPatternsCassandra in xPatterns
Cassandra in xPatterns
 
Effective monitoring with statsd - Alexis lê-quôc
Effective monitoring with statsd - Alexis lê-quôcEffective monitoring with statsd - Alexis lê-quôc
Effective monitoring with statsd - Alexis lê-quôc
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep Dive
 
Using druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scaleUsing druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scale
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-Shma
 
Kibana + timelion: time series with the elastic stack
Kibana + timelion: time series with the elastic stackKibana + timelion: time series with the elastic stack
Kibana + timelion: time series with the elastic stack
 
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
 
Presto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - LyftPresto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - Lyft
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!
 
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
ClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander Zaitsev
ClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander ZaitsevClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander Zaitsev
ClickHouse Analytical DBMS: Introduction and Case Studies, by Alexander Zaitsev
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf Jagerman
 

Similar to Elasticsearch Atlanta Meetup 3/15/16

PyATL Meetup, Oct 8, 2015
PyATL Meetup, Oct 8, 2015PyATL Meetup, Oct 8, 2015
PyATL Meetup, Oct 8, 2015
Roy Russo
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
Roy Russo
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
Roy Russo
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Demi Ben-Ari
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Demi Ben-Ari
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Codemotion
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Ali Hodroj
 
DevOpsDays Amsterdam 2016 workshop
DevOpsDays Amsterdam 2016 workshopDevOpsDays Amsterdam 2016 workshop
DevOpsDays Amsterdam 2016 workshop
Arnold Van Wijnbergen
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
Stanka Dalekova
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
Stanka Dalekova
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily Life
Bryan Yang
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
DataWorks Summit/Hadoop Summit
 
Open Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design PatternsOpen Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design Patterns
Matthew Kalan
 
How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...
PerformanceVision (previously SecurActive)
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Demi Ben-Ari
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 

Similar to Elasticsearch Atlanta Meetup 3/15/16 (20)

PyATL Meetup, Oct 8, 2015
PyATL Meetup, Oct 8, 2015PyATL Meetup, Oct 8, 2015
PyATL Meetup, Oct 8, 2015
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BI
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
DevOpsDays Amsterdam 2016 workshop
DevOpsDays Amsterdam 2016 workshopDevOpsDays Amsterdam 2016 workshop
DevOpsDays Amsterdam 2016 workshop
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily Life
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Open Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design PatternsOpen Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design Patterns
 
How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 

Recently uploaded

在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 

Recently uploaded (20)

在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 

Elasticsearch Atlanta Meetup 3/15/16

  • 1. E l a s t i c s e a r c h & B i g D a t a E l a s t i c s e a r c h M e e t u p 0 3 / 1 5 / 2 0 1 6
  • 2. 2 W H O A M I • Roy Russo • VP Engineering, Predikto • ElasticHQ.org
  • 3. 3 A G E N D A • What is “Predictive Analytics”? • Why Elasticsearch & Spark for ETL? • Elasticsearch in production • Lessons learned
  • 4. 4 W H Y A M I H E R E ? & (Big Data) Predictive Analytics
  • 5. 5 W H O I S P R E D I K T O ? • Atlanta-based • Founded in 2012 • Funded • Paying Customers • Mechanical Engineers / Statisticians • Big Data Architects • Global 1000
  • 6. 6 W H A T W E D O ? • Predictive Maintenance • Predictive Analytics • Asset Health • Anomalies • SaaS
  • 7. 7 H O W D O E S P R E D I C T I V E A N A L Y T I C S W O R K ?
  • 8. 8 D a t a S o u r c e s
  • 9. 9 P R E D I K T O S A A S P L A T F O R M OEM Specs Inventory Operator Weather Financial Sensors Asset Mgmt. INPUT (APIs) DATA LAKE TRANSFORM BUSINESS INTELLIGENC E LEARN OUTPUT (APIs) MAX PREDIKTO APPS
  • 10. 10 T H E P R E D I K T O A D V A N T A G E ( P A T E N T P E N D I N G ) MAX Raw data Auto dynamic engineered features Auto dynamic feature selection Method and algorithm selection Dynamic self learning
  • 11. 11 Auto dynamic engineered features Average Std. Deviation of pressure sensor in the last hour, compared to avg std. dev of pressure sensor last week for locomotives leaving the same depot, and in the same series of make and model. 125,000 X 125 2 M A C H I N E L E A R N I N G A U T O M A T I O N & S C A L A B I L I T Y Load Raw Data Load data from Sensors, Maintenance and Repairs, Usage, Topology, Weather,… 1 1,000 Auto dynamic feature selection 37 3 Software algorithms and logic find the best raw data and/or engineered features that correlate to the best representation of the health and condition of the equipment. MAX
  • 12. 12 I N P R O D U C T I O N • Apache Spark Integration • Why Elasticsearch? • Nginx • Elasticsearch Configuration
  • 13. 13 S O W E U S E S P A R K … • ETL: • Shared Memory • Not Disk-Bound • Distributed workloads • Scale horizontally • New node = more capacity • Spin up. Spin down. • Runs from Instruction-set • DAGs
  • 14. 14 E S - H a d o o p : S e t u p client = Elasticsearch(hosts=[es_uri]) query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var +'}}}}}' es_conf = { "es.nodes": es_host_name, "es.port": es_host_port, "es.resource": es_index + '/' + es_mapping, "es.query": query } es_RDD = sc.newAPIHadoopRDD( inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat", keyClass="org.apache.hadoop.io.NullWritable", valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=es_conf) dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))
  • 15. 15 E S - H a d o o p : T i p s • Connects to specific shards • Spark workers - to - ES (primary) shards • Typing matters • Beware of ‘index.mapping.coerce’ • Ignores ‘size’ parameter!
  • 16. 16 W H Y E L A S T I C S E A R C H ? • Time-Series Data • Fast-Reads • Fast writes with Bulk Inserts • Asynch • Dynamic querying • Differing schema • Everything is indexed • Visualization • GeoJSON • Scale horizontally • New node = more capacity • Spark-ES Connector • Python lib
  • 17. 17 M I L L I O N S O F Q U E R I E S …
  • 18. 18 F R O N T I N G W I T H N G I N X • Reverse Proxy • Load-Balance across cluster • Authentication • read-only access • block HTTP DELETE • Log: • request-body, response-time, response-size, etc… • Keep-alive connections • t2.micro often suffices
  • 19. 19 E S S E T U P : C O N F I G • index.*.slowlog = DEBUG • Plugins: • Cloud-AWS • IAM Roles for auto-discovery • S3 Snapshot • Changes in 2.x! • Curator • Master nodes: • N+1/2 = discovery.zen.minimum_master_nodes • boostrap.mlockall : true
  • 20. 20 E S S E T U P : H A R D W A R E • r3.2xlarge instance - 64GB Quad Core • 30GB Heap • AMI deployment • 500GB mounted EBS (commonly) • path.data = /mnt-point • Cluster-per-customer
  • 21. 21 E S S E T U P : I N D I C E S • Many indices per project (use-case) • Mappings differ considerably • Unixtime in seconds (2.x changes) • Type coercion (GIGO) • source: true • Groovy scripting : off! • Index segmentation depends • by… month, month-year, device, etc… • Sparse data sucks oil_temp RPM 80 2500
  • 22. 22 U I I M P L I C A T I O N S • REST API access • Predikto QDSL • Dynamic Queries • Reporting BI Interface

Editor's Notes

  1. Many Data Sources Drive Asset Condition Insights Asset Health Condition monitoring moves away from relying on historical data alone: Increasing deployment of sensors allows for extensive data collection (IoT, M2M) Big Data allows us to store and work with vast amounts of data Predictive Analytics Predictive analytics technology allows us to combine sensor data with any other relevant source of data Automated Predictive Analytics Solution For Assets (Individual/Fleet) Asset Health Score New Measure Of Asset Reliability Where Is My Asset In Its Life? Automated Machine Learning Solution For Industrial Equipment Predikto Asset Condition Score (PAC) New Measure Of Asset Reliability Where Is My Asset In Its Life? Which equipment is healthier? Equipment is coming in for repair, what else should we look at? imo.