My talk from GOTO Aarhus, 30th September 2014. Cogenta is a retail intelligence company which tracks ecommerce web sites around the world to provide competitive monitoring and analysis services to retailers. Using its proprietary crawler technology, Lucene and SQL Server, a stream of 20 million raw product data entries is captured and processed each day. This case study looks at how Cogenta uses Elasticsearch to break the shackles imposed by the RDBMS (and a limited budget) to make the data available in real time to its customers.
Cogenta uses SQL as its canonical store & for complex reporting, and Elasticsearch for real-time processing & to drive its SaaS web applications. Elasticsearch is easy to use, delivers the powerful features of Lucene and enables the data & platform cost to scale linearly. But… synchronising your existing data in two places presents some interesting challenges, such as aggregation and concurrency control. This talk will take a detailed look at how Cogenta overcame those challenges with a perpetually changing and asynchronously updated dataset.
http://gotocon.com/aarhus-2014/presentation/Cogenta%20-%20Making%20Enterprise%20Data%20Available%20in%20Real%20Time%20with%20Elasticsearch
4. Online Pricing Intelligence
1. Gather data from 500+ eCommerce sites
2. Organise into high quality market view
3. Competitive intelligence tools
5. Custom Crawler
Parse web content
Discover product data
Tracking 20m products
Daily+
[Diagram: HTML pages in → Custom Crawler → Price, Stock, Meta data out]
6. Database
Processing, Storage, Enrichment
[Diagram: Processing → Persistent Storage (Product Catalogue + time series data)]
7. Thing #1 - Detection
Identify distinct products
Automated information retrieval
Lucene + custom index builder
Continuous process
(Humans for QA)
[Diagram: Database → Matcher / Index Builder → Lucene Index → GUI]
8. Thing #2 - BI Tools
Web Applications
Also based on Lucene
Batch index build process
Per-customer indexes
[Diagram: Database → Index Builder → Customer Indexes 1, 2, 3 → BI Tools]
9. Thing #1 - Pain
Continuously indexing
Track changes, read back out to index
Drain on performance
Latency, coping with peaks
Full rebuild for index schema change or inconsistencies
Full rebuild doesn’t scale well…
Unnecessary work..?
[Diagram: Database → Index Builder → Lucene Index → GUI]
10. Thing #2 - Pain
Twice daily batch rebuild, per customer
Very slow
Moar customers? Moar data? Moar often?
Data set too complex, keeps changing
Index shipping
Moar web servers?
[Diagram: Database → batch sync via Index Builder → Customer Indexes 1, 2, 3 → BI Tools on Web Server 1 and Web Server 2]
11. Pain Points
As data and customers scale, processes slow down
Adapting to change
Easy to layer on, hard to make fundamental changes
Read vs write concerns
Database Maintenance
[Diagram: Database → Index Builder → Index]
12. Goals
Eliminate latencies
Improve scalability
Improve availability
Something achievable
Your mileage will vary
13. elasticsearch
Open source, distributed search engine
Based on Lucene, fully featured API
Querying, filtering, aggregation
Text processing / IR
Schema-free
Yummy
(real-time, sharding, highly available)
Silver bullets not included
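To make the feature list concrete, here is a minimal sketch (not from the talk) combining a full-text query, an exact-match filter and an aggregation via the official Python client; the index name ("products") and fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(
    index="products",
    body={
        "query": {"match": {"title": "galaxy s5"}},    # full-text query
        "post_filter": {"term": {"in_stock": True}},   # exact-match filter
        "aggs": {                                      # aggregated over query hits
            "avg_price": {"avg": {"field": "price"}}
        },
    },
)
print(resp["aggregations"]["avg_price"]["value"])
```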
14. Indexing
[Diagram: the simple picture — Database → Indexing]
Our Pipeline
[Diagram: Crawlers → Processors → Database → Indexers → Indexes → Web Servers]
15. Our New Pipeline
[Diagram: the same pipeline, with the Indexers now fed by event messages from the Processors rather than reads from the Database]
16. Event Hooks
Messages fired OnCreate.. and OnUpdate
Payload contains everything needed for indexing
The data
Keys (still mastered in SQL)
Versioning
Sender has all the information already
Use RabbitMQ to control event message flow
Messages are durable
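A minimal sketch of such an event hook, assuming RabbitMQ via the pika client; the queue name and payload shape are illustrative, but the point from the slide stands: the message carries the data, the SQL-mastered key and a version, so the indexer never has to read back from the database.

```python
import json
import pika

# Durable queue: messages survive a broker restart.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="index-events", durable=True)

def on_update(product_id, version, document):
    """Fired after a row is created or updated in SQL."""
    payload = {
        "key": product_id,   # keys still mastered in SQL
        "version": version,  # for concurrency control downstream
        "data": document,    # everything needed for indexing
    }
    channel.basic_publish(
        exchange="",
        routing_key="index-events",
        body=json.dumps(payload),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )
```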
17. Indexing Strategy
RESTful API (HTTP, Thrift, Memcache)
Use bulk methods
They support percolation
Rivers (pull)
RabbitMQ River
JDBC River
Mongo/Couch/etc. River
Logstash
[Diagram: Event Q → Index Q → Indexer]
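As a sketch of the bulk-indexing advice above (illustrative names and queue, not Cogenta's actual indexer), draining a batch of index instructions into one bulk request:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

def index_batch(messages):
    """Turn a batch of queued index instructions into one bulk request."""
    actions = (
        {
            "_op_type": "index",
            "_index": "products",
            "_id": msg["key"],
            "_source": msg["data"],
        }
        for msg in messages
    )
    return bulk(es, actions, raise_on_error=False)  # (ok_count, errors)
```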
18. Model Your Data
What’s in your documents?
Database = Index
Table = Type ...?
Start backwards
What do your applications need?
How will they need to query the data?
Prototyping! Fail quickly!
elasticsearch supports Nested objects, parent/child docs
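For example, offers could be mapped as nested objects so each retailer's price/stock pair is matched as a unit rather than flattened together. A hypothetical mapping, in 1.x-era syntax:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

mapping = {
    "product": {
        "properties": {
            "name": {"type": "string"},
            "offers": {
                "type": "nested",  # keeps each retailer's fields together
                "properties": {
                    "retailer": {"type": "string", "index": "not_analyzed"},
                    "price":    {"type": "double"},
                    "in_stock": {"type": "boolean"},
                },
            },
        }
    }
}
es.indices.put_mapping(index="products", doc_type="product", body=mapping)
```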
19. Joins
Events relate to line-items
Amazon decreased price
Pixmania is running a promotion
Need to group by Product
Use key/value store
Get full Product document
Modify it, write it back
Enqueue indexing instruction
[Diagram: (1) event arrives from Event Q → (2) fetch full Product from key/value store → (3) modify it → (4) write it back → (5) enqueue indexing instruction on Index Q for the Indexer]
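A minimal sketch of that read-modify-write loop, assuming MongoDB as the key/value store and illustrative field names (concurrency control comes on the next slide):

```python
import pymongo

products = pymongo.MongoClient().cogenta.products  # hypothetical collection

def apply_event(event, index_q):
    # 2. get the full Product document
    doc = products.find_one({"_id": event["product_id"]})
    # 3. modify it: replace this retailer's line-item
    doc["offers"] = [o for o in doc["offers"]
                     if o["retailer"] != event["retailer"]]
    doc["offers"].append({"retailer": event["retailer"],
                          "price": event["price"],
                          "in_stock": event["in_stock"]})
    # 4. write it back (no concurrency control yet -- see next slide)
    products.replace_one({"_id": doc["_id"]}, doc)
    # 5. enqueue an indexing instruction for the whole Product
    index_q.put({"key": doc["_id"], "data": doc})
```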
20. Where to join?
elasticsearch
Consider performance
Depends how data is structured/indexed (e.g. parent/child)
Compression, collisions
In-memory cache (e.g. Memcache)
Persistent storage (e.g. Cassandra or Mongo)
Two awesome benefits
Quickly re-index if needed
Updates have access to the full Product data
Serialisation is costly
21. Synchronisation & Concurrency
Fault tolerance
Code to expect missing data
Out of sequence events
Concurrency Control
Apply Optimistic Concurrency Control at Mongo
Optimise for collisions
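A sketch of the OCC pattern at the Mongo layer, assuming a version field on each document; the losing writer in a collision re-reads and retries:

```python
import pymongo

products = pymongo.MongoClient().cogenta.products  # hypothetical collection

def occ_update(product_id, mutate, max_retries=10):
    for _ in range(max_retries):
        doc = products.find_one({"_id": product_id})
        expected = doc["version"]
        mutate(doc)                        # apply the event to the document
        doc["version"] = expected + 1
        result = products.replace_one(
            {"_id": product_id, "version": expected},  # compare...
            doc,                                       # ...and swap
        )
        if result.modified_count == 1:
            return doc                     # nobody wrote in between
        # collision: another writer got there first; re-read and retry
    raise RuntimeError("gave up after repeated collisions")
```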
22. Synchronisation & Concurrency
Synchronisation
Out of sequence index instructions
elasticsearch external versioning
Can rebuild from scratch if need to
Consistency
Which version is right?
Dates
Revision numbers from SQL
Independent updates
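A sketch of the external-versioning trick: Elasticsearch only applies a write if the supplied version is higher than the stored one, so stale, out-of-sequence index instructions are rejected and can be safely dropped. Names are illustrative.

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch(["http://localhost:9200"])

def handle_index_instruction(msg):
    try:
        es.index(
            index="products",
            id=msg["key"],
            body=msg["data"],
            version=msg["version"],    # revision number mastered in SQL
            version_type="external",   # only index if newer than stored
        )
    except ConflictError:
        pass  # a newer version is already indexed; safe to ignore
```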
23. Figures
Ingestion
20m data points/day (continuously)
~ 200GB
3K msgs/second at peak
Hardware
SQL: 2 x 12-core, 64GB, 72-spindle SAN
Indexing: 4 x 4-core, 8GB
Mongo: 1 x 4-core, 16GB, 1xSSD
Elastic: 5 x 4-core, 16GB, 1xSSD
             Custom-Built Lucene    elasticsearch
Latency      3 hours                < 1 second
Bottleneck   Disk (SQL)             CPU
25. Thanks
@YannCluchey
Concurrency Patterns with MongoDB
http://slidesha.re/YFOehF
Consistency without Consensus
Peter Bourgon, SoundCloud
http://bit.ly/1DUAO1R
Eventually Consistent Data Structures
Sean Cribbs, Basho
https://vimeo.com/43903960
Editor's Notes
Complex, relational
Spanning
Built up long time
World requirements keep changing
Designed by children
Dangerous at night time
Like Minecraft, very valuable
Opportunity more value
Cost savings, efficiencies or new services
Paywall, baggage
My use case
Problems we faced
How solved
Premise can’t start from scratch
Need to deliver more
I’m lying
Separate arrows, because all processing is async
Two specific things we do once data in system
Need to make sense of the raw data
Need to understand that these two products are in fact the same
Lucene, Customised analysis chain
Volume and rate of change = automated
Very similar
Batch on twice daily schedule
Embarrass myself. I know better now.. ;-)
Effort writing data to db, drag back out again
Very similar process
Also using Lucene
Batch on twice daily schedule
So where does it hurt?
Constantly changing data
Hard to read back out (consistently)
..without impacting performance
Get data where needed (quickly)
Real-time not requirement, > Microsoft minutes ;-)
Consider any one a success. (cost benefit)
Not from scratch; roll out incrementally, in a reasonable time-frame.
So not a lot then!
** My environment is unique and my specific problems won’t apply to you
Sharding, replication, simple clustering
Where is the bottleneck?
Where is the bottleneck now?
Watch out for slow HTTP clients
Rivers subscription model
Answer: roll your own, with a queue
We use Thrift for API calls, Custom RMQ river (QoS)
Black holes
General tip for NoSQL/denorm
Countless prototypes, worked but too slow
Good at prototyping, bad at failing fast. No silver bullet!
Nuances in performance and querying. Blog post.
Since your data is complicated and you haven’t got one document per row.
Can use ES as canonical store
Prototyped incremental map/reduce
CouchDB view was good logical fit, but didn’t perform
Mongo API was lightning fast
Distributed event processing is hard
Messages go missing
Be forgiving, support recovery
OCC pattern with Mongo
Clever tricks for performance / high collision rate. My talk.
Use universal versioning scheme
ES has very clever versioning, e.g. deletion memory (efficient thanks to Lucene)
OCC protects against accidental overwrite.
Doesn’t protect against wrong thing.
Consistency without Consensus Peter Bourgon, SoundCloud
Eventually Consistent Data Structures, Sean Cribbs, Basho
Bottlenecking on everything, CPU, LAN
Go SSD
Cost of scale is massively reduced
Active data only, 90%