
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop


  1. Eventual Consistency @WalmartLabs with Kafka, SolrCloud and Hadoop (Ayon Sinha, asinha@walmartlabs.com)
  2. Introductions
     • @WalmartLabs – Building Walmart Global eCommerce from the ground up
     • Data Foundation Team – Build, manage and provide tools for all OLTP operations
  3. Large Scale eCommerce problems
     • Our customers love to shop online 24x7, and we love them for that
     • Reads outnumber writes by many orders of magnitude, and reads have to be blazing fast (every millisecond has a monetary value attached to it, according to some studies)
     • Scaling up only takes you so far; you have to scale out
     • Low-latency analytics absolutely canNOT run on OLTP data stores
     • No full table scans
     • Too many RDBMS column indexes lead to slow writes
  4. Data Foundation Architecture
     (Architecture diagram)
  5. Very large scale and always available means...
     • There is really NO way around Brewer’s CAP theorem (Source: http://blog.mccrory.me/2010/11/03/cap-theorem-and-the-clouds/)
     • Embrace “eventual” consistency and asynchrony
     • Clearly articulate “eventual” to business stakeholders. Computer “eventual” and human “eventual” are entirely different scales.
  6. EC Use cases
     (Slide listing eventual-consistency use cases)
  7. Typical data flow into EC data stores
     (Diagram: web service clients call an orchestrator service in front of the IC and EC web services and their resource tiers; reads are 70-80% of total load. Writes are published to Kafka; an event-driven Kafka consumer updates SolrCloud, and another Kafka consumer feeds Hadoop. A batch layer processes data on Hadoop ("fire job and pull results") and loads it into the faster serving datastore.)
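     As one illustration of the Solr leg of this flow, here is a minimal sketch of a Kafka consumer that batches messages into SolrCloud. It uses the newer Kafka consumer API and SolrJ 4.x rather than whatever versions the deck's system ran; the broker, topic, ZooKeeper hosts and collection names are all hypothetical:

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.solr.client.solrj.impl.CloudSolrServer;
        import org.apache.solr.common.SolrInputDocument;

        public class SolrUpdater {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka1:9092"); // hypothetical broker
                props.put("group.id", "solr-updater");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                consumer.subscribe(Collections.singletonList("item-updates")); // hypothetical topic

                CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
                solr.setDefaultCollection("items"); // hypothetical collection

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (ConsumerRecord<String, String> rec : records) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", rec.key());
                        doc.addField("payload_s", rec.value());
                        batch.add(doc);
                    }
                    if (!batch.isEmpty()) {
                        solr.add(batch, 1000); // commitWithin 1s, not a hard commit per doc
                    }
                }
            }
        }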
  8. Challenges
     • Messaging system: Kafka was already being used and supported by our Big Fast Data team
     • Virtualization
        – Shared CPU and memory among compute tenants is generally bad for search engine infrastructure. If your use case takes off, you will eventually move to dedicated hardware.
        – We started with big, dedicated bare-metal hardware
        – Virtualization requires complete lifecycle management
     • Serialization format – our choice: Avro (schema + data); see the sketch below
     • Hierarchical object to flat
        – If you are familiar with ElasticSearch, you’d say “No problem... maybe”
        – If you are already using HBase, Cassandra or similar, you’d say “No problem... maybe”
        – For Solr people, let's talk about schema.xml and plugin-based flattening
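     Because Avro carries the schema together with the data, producers and consumers can evolve independently. A minimal sketch of Avro binary serialization using the generic API; the Item record is a hypothetical schema, not the one used at WalmartLabs:

        import java.io.ByteArrayOutputStream;
        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericData;
        import org.apache.avro.generic.GenericDatumWriter;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.avro.io.BinaryEncoder;
        import org.apache.avro.io.EncoderFactory;

        public class AvroExample {
            // Hypothetical schema; real schemas are owned by each customer team.
            static final String SCHEMA_JSON =
                "{\"type\":\"record\",\"name\":\"Item\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"price\",\"type\":\"double\"}]}";

            public static byte[] serialize(String id, double price) throws Exception {
                Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", id);
                rec.put("price", price);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
                new GenericDatumWriter<GenericRecord>(schema).write(rec, enc);
                enc.flush();
                return out.toByteArray(); // ships on Kafka alongside its schema
            }
        }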
  9. SolrCloud 101
     • Solr is the web app wrapper on Lucene
     • SolrCloud is the distributed search mode in which a bunch of Solr nodes coordinate using ZooKeeper (Source: SolrCloud Wiki)
  10. Solr schema.xml choices
     • Let each team build their own schema.xml from scratch
        – This would require each customer team to intimately learn search engines, Solr, etc.
        – This would also mean that each time schema.xml changes, everything must be re-indexed.
     • Leverage Solr’s dynamic fields and create a naming convention (example below)
        – This gives the customer a kick-start
        – schema.xml doesn’t need to change often and can be used mostly unchanged from team to team
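     A sketch of what that convention looks like from the client side, assuming stock-style dynamic-field rules in schema.xml (e.g. *_s maps to string, *_i to int, *_d to double); the field names are hypothetical:

        import org.apache.solr.common.SolrInputDocument;

        public class DynamicFieldExample {
            // New attributes are added by naming convention, not by editing schema.xml.
            public static SolrInputDocument buildDoc() {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "12345");
                doc.addField("title_s", "Blue widget"); // matches dynamicField "*_s"
                doc.addField("quantity_i", 42);         // matches dynamicField "*_i"
                doc.addField("price_d", 9.99);          // matches dynamicField "*_d"
                return doc;
            }
        }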
  11. Best possible (unrealistic) scenario
     • No writes
     • No scoring, sorting, faceting
     • 100% document cache hit ratio
     • 99.6% of 192GB physical memory usage
     • 2000+ select/sec
     • 0.3 ms/query
  12. We even got...
     (Screenshot slide)
  13. Initial Solr Settings
     (Settings screenshot)
  14. Getting Worse...
     • Hundreds of ms/query with close to 100% doc cache hit ratio
  15. Most common causes of slowdowns
     • GC pauses. Cure: trial-and-error with help from experts (an illustrative starting point follows)
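     The deck doesn't record the final JVM settings, so this is only an assumption about what such trial-and-error iterates on: a CMS-based launch line of that era, with GC logging turned on so pauses become measurable. The heap size and log path are hypothetical:

        java -Xms16g -Xmx16g \
             -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
             -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
             -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/solr/gc.log \
             -jar start.jar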
  16. More naïve mistakes...
     • ZooKeeper on the same machine as Solr
        – We did not experience this, as we knew this going in
     • Frequent commits (in our case DB-style: 1 doc/update + commit)
        – DON’T commit after every update. A Solr commit is very different from a DBMS commit: it opens a new searcher and warms it up in the background. The “Too many on-deck searchers” warning is a telltale sign.
        – Batch as many docs as your application can tolerate in a single update post (see the sketch after this list)
        – We chose to batch docs for 1 sec
     • IO contention (log level too high)
        – Easy fix
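     A minimal sketch of that batching pattern with SolrJ; the buffer threshold and field names are hypothetical, and commitWithin replaces explicit per-document commits:

        import java.util.ArrayList;
        import java.util.List;
        import org.apache.solr.client.solrj.SolrServer;
        import org.apache.solr.common.SolrInputDocument;

        public class BatchedIndexer {
            private final SolrServer solr;
            private final List<SolrInputDocument> buffer = new ArrayList<>();

            public BatchedIndexer(SolrServer solr) { this.solr = solr; }

            // Instead of solr.add(doc) + solr.commit() per update (which opens a
            // new searcher each time), buffer documents and flush periodically.
            public void onUpdate(String id, String title) throws Exception {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id);
                doc.addField("title_s", title); // dynamic-field convention from slide 10
                buffer.add(doc);
                if (buffer.size() >= 500) flush(); // hypothetical threshold
            }

            public void flush() throws Exception {
                if (buffer.isEmpty()) return;
                solr.add(buffer, 1000); // commitWithin 1s, matching the ~1 sec batching
                buffer.clear();
            }
        }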
  17. Zookeeper
     • Prefer an odd number of nodes for the ensemble, as quorum is N/2 + 1
     • More nodes are not necessarily better
        – 3 nodes is too low, as you can handle only 1 failure
        – 5 nodes is a good balance between HA and write speed. More nodes mean slower writes and slower quorums.
        – We had to go with 9 = 3 nodes in each of 3 clouds; this protects us from a complete outage in one cloud.
     • Pay close attention to ZooKeeper availability, as SolrCloud will only function for a little while after ZK is dead
     • CloudSolrServer (the SolrJ client) relies completely on ZooKeeper for talking to SolrCloud (see the sketch below)
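     A sketch of that dependency from the client side: CloudSolrServer is handed only the ZooKeeper ensemble, never a Solr URL, so losing ZK means losing the client's view of the cluster. Hosts, collection and query are hypothetical:

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.impl.CloudSolrServer;
        import org.apache.solr.client.solrj.response.QueryResponse;

        public class ClientExample {
            public static void main(String[] args) throws Exception {
                // Only ZK hosts are given; Solr node addresses come from ZK itself.
                CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
                solr.setDefaultCollection("items");
                QueryResponse rsp = solr.query(new SolrQuery("title_s:widget"));
                System.out.println(rsp.getResults().getNumFound());
            }
        }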
  18. How do you do Disaster Recovery?
     • SolrCloud is a CP model (CAP theorem)
     • You should not add a replica from another data center; every write will get excruciatingly slow
     • Use Kafka or another messaging system to send data cross-DC (sketch below)
     • Get used to cross-DC eventual consistency. Monitor for tolerance thresholds.
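     A minimal sketch of the shipping side of that cross-DC flow, using the newer Kafka producer API rather than whatever version the deck's system ran; brokers, topic and payload are hypothetical:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.Producer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class CrossDcShipper {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka-dc2-1:9092"); // brokers in the remote DC
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
                Producer<String, byte[]> producer = new KafkaProducer<>(props);
                byte[] payload = new byte[0]; // e.g. the Avro bytes from slide 8's sketch
                producer.send(new ProducerRecord<>("item-updates-dc2", "item-1", payload));
                producer.close(); // a consumer in the remote DC applies updates, eventually
            }
        }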
  19. Metrics Monitoring
     • We poll metrics from MBeans and push them to Graphite servers (a sketch follows)
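     A sketch of one such poll, run in-process with Solr (remote polling would go through a JMXConnector instead). The MBean name, attribute and Graphite host are assumptions; actual Solr MBean names vary by version and core:

        import java.io.OutputStreamWriter;
        import java.io.Writer;
        import java.lang.management.ManagementFactory;
        import java.net.Socket;
        import javax.management.MBeanServer;
        import javax.management.ObjectName;

        public class GraphitePoller {
            public static void main(String[] args) throws Exception {
                MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
                // Hypothetical Solr 4.x-style MBean for a request handler's stats.
                ObjectName name = new ObjectName(
                    "solr/items:type=standard,id=org.apache.solr.handler.component.SearchHandler");
                Object avgTime = mbs.getAttribute(name, "avgTimePerRequest");
                // Graphite's plaintext protocol: "<metric> <value> <epoch-seconds>\n"
                try (Socket sock = new Socket("graphite.example.com", 2003);
                     Writer out = new OutputStreamWriter(sock.getOutputStream())) {
                    long now = System.currentTimeMillis() / 1000;
                    out.write("solr.items.avgTimePerRequest " + avgTime + " " + now + "\n");
                    out.flush();
                }
            }
        }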
  20. Real Query Performance
     (Performance graph)
  21. Real Update Performance
     (Performance graph)
  22. Real Customer Results
     (Results slide)
  23. Q&A
