SlideShare a Scribd company logo
1 of 25
MAKING ENTERPRISE DATA 
AVAILABLE IN REAL TIME 
WITH ELASTICSEARCH 
Yann Cluchey 
CTO @ Cogenta 
CTO @ GfK Online Pricing Intelligence
What is Enterprise Data?
What is Enterprise Data?
Online Pricing Intelligence 
1. Gather data from 500+ of eCommerce sites 
2. Organise into high quality market view 
3. Competitive intelligence tools
Price, 
Stock, 
Meta 
HTML 
Price, 
Stock, 
Meta Price, 
Price, 
Stock, 
Meta 
Stock, 
Meta 
Custom Crawler 
 Parse web content 
 Discover product data 
 Tracking 20m products 
 Daily+ 
HTML 
HTML 
HTML
Database 
Processing, Storage 
 Enrichment 
 Persistent Storage 
 Product Catalogue 
 + time series data 
Processing
Matcher 
Index Builder 
Database 
Thing #1 - Detection 
 Identify distinct products 
 Automated information retrieval 
 Lucene + custom index builder 
 Continuous process 
 (Humans for QA) 
Lucene 
Index 
GUI
Thing #2 - BI Tools 
 Web Applications 
 Also based on Lucene 
 Batch index build process 
 Per-customer indexes 
Database 
Index Builder 
Customer 
Index 1 
2 
BI Tools 
Customer 
Index 3
Thing #1 - Pain 
 Continuously indexing 
 Track changes, read back out to index 
 Drain on performance 
 Latency, coping with peaks 
 Full rebuild for index schema change 
or inconsistencies 
 Full rebuild doesn’t scale well… 
 Unnecessary work..? 
Database 
Index Builder 
Lucene 
Index 
GUI
Batch Sync 
Customer 
Index 2 
Thing #2 - Pain 
 Twice daily batch rebuild, per customer 
 Very slow 
 Moar customers? 
 Moar data? 
 Moar often? 
 Data set too complex, 
keeps changing 
 Index shipping 
 Moar web servers? 
Database 
Customer 
Index 1 
Index Builder 
BI Tools 
Customer 
Index 3 
Indexing 
Database 
Web Server 1 Web Server 2
Pain Points 
 As data, customers scale, 
processes slow down 
 Adapting to change 
 Easy to layer on, 
hard to make fundamental changes 
 Read vs write concerns 
 Database Maintenance 
Database 
Index Builder 
Index
Goals 
 Eliminate latencies 
Improve scalability 
Improve availability 
Something achievable 
 Your mileage will vary
elasticsearch 
 Open source, distributed search engine 
 Based on Lucene, fully featured API 
 Querying, filtering, aggregation 
 Text processing / IR 
 Schema-free 
 Yummy 
(real-time, sharding, highly available) 
 Silver bullets not included
Indexing 
Database 
Indexing 
Database 
Our Pipeline 
Database 
Processors 
Processors 
Processors 
Processors 
Crawlers 
Crawlers 
Processors 
Processors 
Indexers 
Indexes 
Indexes 
Indexes 
Web Servers 
Web Servers 
Web Servers
Our New Pipeline 
Database 
Processors 
Processors 
Processors 
Processors 
Crawlers 
Crawlers 
Processors 
Processors 
Indexers 
Indexes 
Indexes 
Indexes 
Web Servers 
Web Servers 
Web Servers
Event Hooks 
 Messages fired OnCreate.. and OnUpdate 
 Payload contains everything needed for indexing 
 The data 
 Keys (still mastered in SQL) 
 Versioning 
 Sender has all the information already 
 Use RabbitMQ to control event message flow 
 Messages are durable
Indexing Strategy 
 RESTful API (HTTP, Thrift, Memcache) 
 Use bulk methods 
 They support percolation 
 Rivers (pull) 
 RabbitMQ River 
 JDBC River 
 Mongo/Couch/etc. River 
 Logstash 
Event Q 
Index Q 
Indexer
Model Your Data 
 What’s in your documents? 
 Database = Index 
Table = Type ...? 
 Start backwards 
 What do your applications need? 
 How will they need to query the data? 
 Prototyping! Fail quickly! 
 elasticsearch supports Nested objects, parent/child docs
Joins 
 Events relate to line-items 
 Amazon decreased price 
 Pixmania is running a promotion 
 Need to group by Product 
 Use key/value store 
 Get full Product document 
 Modify it, write it back 
 Enqueue indexing instruction 
3 3 5 
Event Q Indexer 
1 4 
1 
2 
1 
3 
4 
Index Q 
Key/value 
store 
5
Where to join? 
 elasticsearch 
 Consider performance 
 Depends how data is structured/indexed (e.g. parent/child) 
 Compression, collisions 
 In-memory cache (e.g. Memcache) 
 Persistent storage (e.g. Cassandra or Mongo) 
 Two awesome benefits 
 Quickly re-index if needed 
 Updates have access to the full Product data 
 Serialisation is costly
Synchronisation & Concurrency 
 Fault tolerance 
 Code to expect missing data 
 Out of sequence events 
 Concurrency Control 
 Apply Optimistic Concurrency Control at Mongo 
 Optimise for collisions
Synchronisation & Concurrency 
 Synchronisation 
 Out of sequence index instructions 
 elasticsearch external versioning 
 Can rebuild from scratch if need to 
 Consistency 
 Which version is right? 
 Dates 
 Revision numbers from SQL 
 Independent updates
Figures 
 Ingestion 
 20m data points/day (continuously) 
 ~ 200GB 
 3K msgs/second at peak 
 Hardware 
 SQL: 2 x 12-core, 64GB, 72-spindle SAN 
 Indexing: 4 x 4-core, 8GB 
 Mongo: 1 x 4-core, 16GB, 1xSSD 
 Elastic: 5 x 4-core, 16GB, 1xSSD 
Custom-Built 
Lucene 
elasticsearch 
Latency 3 hours < 1 second 
Bottleneck Disk (SQL) CPU
Managing Change 
Key/value 
store 
Client 
Index_A 
Event Q Indexer 
Alias 
Index_B 
Index
Thanks 
 @YannCluchey 
 Concurrency Patterns with MongoDB 
http://slidesha.re/YFOehF 
 Consistency without Consensus 
Peter Bourgon, SoundCloud 
http://bit.ly/1DUAO1R 
 Eventually Consistent Data Structures 
Sean Cribbs, Basho 
https://vimeo.com/43903960

More Related Content

What's hot

What's hot (20)

tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on Everything
 
MongoDB at Baidu
MongoDB at BaiduMongoDB at Baidu
MongoDB at Baidu
 
MongoDB on Azure
MongoDB on AzureMongoDB on Azure
MongoDB on Azure
 
Mindtalk Tech - Behind the scenes
Mindtalk Tech - Behind the scenesMindtalk Tech - Behind the scenes
Mindtalk Tech - Behind the scenes
 
Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
Stream processing at Hotstar
Stream processing at HotstarStream processing at Hotstar
Stream processing at Hotstar
 
Azure Big Data Story
Azure Big Data StoryAzure Big Data Story
Azure Big Data Story
 
Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"
 
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
 
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
 
Distributed Query Service Powered By Presto & Alluxio Across Clouds @Walmart...
 Distributed Query Service Powered By Presto & Alluxio Across Clouds @Walmart... Distributed Query Service Powered By Presto & Alluxio Across Clouds @Walmart...
Distributed Query Service Powered By Presto & Alluxio Across Clouds @Walmart...
 
Rpsonmongodb
RpsonmongodbRpsonmongodb
Rpsonmongodb
 
Enterprise Presto PaaS offering in Google Cloud
Enterprise Presto PaaS offering in Google Cloud Enterprise Presto PaaS offering in Google Cloud
Enterprise Presto PaaS offering in Google Cloud
 
Using Elasticsearch for Analytics
Using Elasticsearch for AnalyticsUsing Elasticsearch for Analytics
Using Elasticsearch for Analytics
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15
 
DBaaS at Scale
DBaaS at ScaleDBaaS at Scale
DBaaS at Scale
 
MongoDB World 2016: Scaling Targeted Notifications in the Music Streaming Wor...
MongoDB World 2016: Scaling Targeted Notifications in the Music Streaming Wor...MongoDB World 2016: Scaling Targeted Notifications in the Music Streaming Wor...
MongoDB World 2016: Scaling Targeted Notifications in the Music Streaming Wor...
 

Similar to GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
VMware Tanzu
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 

Similar to GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch (20)

A Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in ActionA Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in Action
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analytics
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
 
Unlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeUnlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data Lake
 
Elastic Stack: Using data for insight and action
Elastic Stack: Using data for insight and actionElastic Stack: Using data for insight and action
Elastic Stack: Using data for insight and action
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
J1 - Keynote Data Platform - Rohan Kumar
J1 - Keynote Data Platform - Rohan KumarJ1 - Keynote Data Platform - Rohan Kumar
J1 - Keynote Data Platform - Rohan Kumar
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 
Data Architecture at Vente-Exclusive.com - TOTM Exellys
Data Architecture at Vente-Exclusive.com - TOTM ExellysData Architecture at Vente-Exclusive.com - TOTM Exellys
Data Architecture at Vente-Exclusive.com - TOTM Exellys
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

  • 1. MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCH Yann Cluchey CTO @ Cogenta CTO @ GfK Online Pricing Intelligence
  • 4. Online Pricing Intelligence 1. Gather data from 500+ of eCommerce sites 2. Organise into high quality market view 3. Competitive intelligence tools
  • 5. Price, Stock, Meta HTML Price, Stock, Meta Price, Price, Stock, Meta Stock, Meta Custom Crawler  Parse web content  Discover product data  Tracking 20m products  Daily+ HTML HTML HTML
  • 6. Database Processing, Storage  Enrichment  Persistent Storage  Product Catalogue  + time series data Processing
  • 7. Matcher Index Builder Database Thing #1 - Detection  Identify distinct products  Automated information retrieval  Lucene + custom index builder  Continuous process  (Humans for QA) Lucene Index GUI
  • 8. Thing #2 - BI Tools  Web Applications  Also based on Lucene  Batch index build process  Per-customer indexes Database Index Builder Customer Index 1 2 BI Tools Customer Index 3
  • 9. Thing #1 - Pain  Continuously indexing  Track changes, read back out to index  Drain on performance  Latency, coping with peaks  Full rebuild for index schema change or inconsistencies  Full rebuild doesn’t scale well…  Unnecessary work..? Database Index Builder Lucene Index GUI
  • 10. Batch Sync Customer Index 2 Thing #2 - Pain  Twice daily batch rebuild, per customer  Very slow  Moar customers?  Moar data?  Moar often?  Data set too complex, keeps changing  Index shipping  Moar web servers? Database Customer Index 1 Index Builder BI Tools Customer Index 3 Indexing Database Web Server 1 Web Server 2
  • 11. Pain Points  As data, customers scale, processes slow down  Adapting to change  Easy to layer on, hard to make fundamental changes  Read vs write concerns  Database Maintenance Database Index Builder Index
  • 12. Goals  Eliminate latencies Improve scalability Improve availability Something achievable  Your mileage will vary
  • 13. elasticsearch  Open source, distributed search engine  Based on Lucene, fully featured API  Querying, filtering, aggregation  Text processing / IR  Schema-free  Yummy (real-time, sharding, highly available)  Silver bullets not included
  • 14. Indexing Database Indexing Database Our Pipeline Database Processors Processors Processors Processors Crawlers Crawlers Processors Processors Indexers Indexes Indexes Indexes Web Servers Web Servers Web Servers
  • 15. Our New Pipeline Database Processors Processors Processors Processors Crawlers Crawlers Processors Processors Indexers Indexes Indexes Indexes Web Servers Web Servers Web Servers
  • 16. Event Hooks  Messages fired OnCreate.. and OnUpdate  Payload contains everything needed for indexing  The data  Keys (still mastered in SQL)  Versioning  Sender has all the information already  Use RabbitMQ to control event message flow  Messages are durable
  • 17. Indexing Strategy  RESTful API (HTTP, Thrift, Memcache)  Use bulk methods  They support percolation  Rivers (pull)  RabbitMQ River  JDBC River  Mongo/Couch/etc. River  Logstash Event Q Index Q Indexer
  • 18. Model Your Data  What’s in your documents?  Database = Index Table = Type ...?  Start backwards  What do your applications need?  How will they need to query the data?  Prototyping! Fail quickly!  elasticsearch supports Nested objects, parent/child docs
  • 19. Joins  Events relate to line-items  Amazon decreased price  Pixmania is running a promotion  Need to group by Product  Use key/value store  Get full Product document  Modify it, write it back  Enqueue indexing instruction 3 3 5 Event Q Indexer 1 4 1 2 1 3 4 Index Q Key/value store 5
  • 20. Where to join?  elasticsearch  Consider performance  Depends how data is structured/indexed (e.g. parent/child)  Compression, collisions  In-memory cache (e.g. Memcache)  Persistent storage (e.g. Cassandra or Mongo)  Two awesome benefits  Quickly re-index if needed  Updates have access to the full Product data  Serialisation is costly
  • 21. Synchronisation & Concurrency  Fault tolerance  Code to expect missing data  Out of sequence events  Concurrency Control  Apply Optimistic Concurrency Control at Mongo  Optimise for collisions
  • 22. Synchronisation & Concurrency  Synchronisation  Out of sequence index instructions  elasticsearch external versioning  Can rebuild from scratch if need to  Consistency  Which version is right?  Dates  Revision numbers from SQL  Independent updates
  • 23. Figures  Ingestion  20m data points/day (continuously)  ~ 200GB  3K msgs/second at peak  Hardware  SQL: 2 x 12-core, 64GB, 72-spindle SAN  Indexing: 4 x 4-core, 8GB  Mongo: 1 x 4-core, 16GB, 1xSSD  Elastic: 5 x 4-core, 16GB, 1xSSD Custom-Built Lucene elasticsearch Latency 3 hours < 1 second Bottleneck Disk (SQL) CPU
  • 24. Managing Change Key/value store Client Index_A Event Q Indexer Alias Index_B Index
  • 25. Thanks  @YannCluchey  Concurrency Patterns with MongoDB http://slidesha.re/YFOehF  Consistency without Consensus Peter Bourgon, SoundCloud http://bit.ly/1DUAO1R  Eventually Consistent Data Structures Sean Cribbs, Basho https://vimeo.com/43903960

Editor's Notes

  1. Complex, relational Spanning Built up long time World requirements keep changing Designed by children Dangerous at night time Like Minecraft, very valuable Opportunity more value Cost savings, efficiencies or new services Paywall, baggage
  2. My use case Problems we faced How solved Premise can’t start from scratch Need to deliver more
  3. I’m lying Separate arrows, because all processing is async
  4. Two specific things we do once data in system Need to make sense of the raw data Need to understand that these two products are in fact the same Lucene, Customised analysis chain Volume and rate of change = automated
  5. Very similar Batch on twice daily schedule
  6. Embarrass myself. I know better now.. ;-) Effort writing data to db, drag back out again
  7. Very similar process Also using Lucene Batch on twice daily schedule
  8. So where does it hurt? Constantly changing data Hard to read back out (consistently) ..without impacting performance
  9. Get data where needed (quickly) Real-time not requirement, > Microsoft minutes ;-) Consider any one a success. (cost benefit) !scratch, roll out incrementally, reasonable time-frame. So not a lot then! ** My environment is unique and my specific problems won’t apply to you
  10. Sharding, replication, simple clustering
  11. Where is the bottleneck?
  12. Where is the bottleneck now?
  13. Watch out for slow HTTP clients Rivers subscription model Answer, roll YO.. with a queue We use Thrift for API calls, Custom RMQ river (QoS)
  14. Black holes General tip for NoSQL/denorm Countless prototypes, worked but too slow Good at prototyping, bad at failing fast. No silver bullet! Nuances in performance and querying. Blog post.
  15. Since your data is complicated and you haven’t got one document per row.
  16. Can use ES as canonical store Prototyped incremental map/reduce CouchDB view was good logical fit, but didn’t perform Mongo API was lightning fast
  17. Distributed event processing is hard Messages go missing Be forgiving, support recovery OCC pattern with Mongo Clever tricks for performance / high collision rate. My talk.
  18. Use universal versioning scheme ES has very clever versioning, e.g. deletion memory (efficient thanks to Lucene) OCC protects against accidental overwrite. Doesn’t protect against wrong thing. Consistency without Consensus Peter Bourgon, SoundCloud Eventually Consistent Data Structures, Sean Cribbs, Basho
  19. Bottlenecking on everything, CPU, LAN Go SSD Cost of scale is massively reduced Active data only, 90%
  20. Atomic switch Capacity