Cassandra @walmartlabs
Slides from Cassandra Meetup 2012/03/15

  • Products are inherently multi-dimensional and mostly multi-variant
      – Dimensions include
          • Business Unit (Walmart, Sam’s Club, ASDA etc.)
          • Geography (US, Canada, UK etc.)
          • Language (en_US, fr_CA, en_UK etc.)
          • Supply Chain (Owned Inventory, Direct Ship, Marketplace etc.)
          • Channel (Website, Retail/Store, Mobile, Facebook etc.)
      – Variants include
          • Size (S, M, L, XL etc.)
          • Color (Red, Green, Blue etc.)
          • Capacity (8 GB, 16 GB etc.)
  • A true Global Product content is typically agnostic of any specific dimensions or variants
  • Items as we know and see them are actually Product Offerings
      – Representing content and behavior changes captured at every dimension and variant intersection
  • What you shop for is different from what you order is different from what you actually get!

  • Notice the need for the concept of dimensions and variants to capture and maintain data at each level
  • Ingest external catalogs, even if we do not plan to sell the items right away
  • On a scale of hundreds of millions of unique SKUs
      – Base-variant and pre-configured bundles create order-of-magnitude increases in these estimates

  • How do we give our customers access to the largest assortment in the world?
      – As the digital arm of the world's largest retailer, we need not only to give existing customers access to an endless shelf, but also to offer a broad enough assortment to enter the consideration set of non-Walmart shoppers. This means millions and millions of items.
      – We do so in a manner that is scalable and gives the consumer the right product information to make an informed decision about whether or not the product will meet their needs.

  • Ultimate shopping experience
      – Customer finds everything that he/she needs intuitively and in the right place, whether browsing or searching
  • Fine-grained analytics & planning
      – Fine-grained analytics helps us put the right kind of products on our shelves (physical or virtual) at the right level of availability (inventory) and pricing
  • Standards exist, but are severely limiting
      – E.g. the GPC hierarchical classification and attribution structure
  • Products landscape changes dramatically every day
      – E.g. when tablets, a radically new form factor, arrive on the market, we want to be able to adopt and sell them ASAP, not wait for a cumbersome change control process caused by inflexible categorization and attribution

  • Ability to look up and potentially match products and offerings/items by any combination of attributes and other dimensional criteria
  • Item-Item Relationships & Collections
      – Hierarchical
          • Base-variants (e.g. iPhone 4S 16/32/64 GB)
      – Graph
          • Bundles
          • Hard, Fixed, Inflexible or Configurable
          • Components & Ingredients
          • Accessories & Replacements
          • Case Packs & Vendor Packs
  • Low Latency, High Throughput, Highly Available
      – Sellers typically update 40-50% of their offerings at some level each day
      – Based on global projections, this may be comparable to the scale of social media feeds
      – Accept, process, search, retrieve and analyze large volumes of data 24x7

  • Multiple Column Families
      – Product fragments
      – Custom consistency enabler
          • Separate the “data” from the “index” or “event log”
          • Use to separate “Work In Progress” from golden copy
          • Implicit versioning and potential archiving/purging requirements
          • Tunable consistency levels per API call (Read/Write)
      – Custom row caching at column family level
          • Optimize for read-intensive vs. write-intensive column families
  • Single keyspace to hold all data fragments
      – Tighter control of replication factor (DC + 3 or 5), strategy (NetworkTopologyStrategy, formerly known as DatacenterShardStrategy)
      – Additional keyspaces only for supporting data
          • Lower priority, loosely coupled or completely decoupled
          • E.g. purgeable audit & history logs

  • Flexible, selective denormalization
      – Bi-directional relationships
      – Capture more than just foreign keys
      – Indices
      – Merge records to create product offering in the application/DaaS layer
      – Right balance of optimization of the retrieval algorithm vs. space
  • Secondary indexes for faster attribute-level queries, but simple queries only
      – However, complex queries may need to be supplemented with other tools, as we will see later
  • Dynamic composites
      – capture 1-n levels of dimension intersections
      – define flexible comparators for different column key levels
  • Column slicing to retrieve the right offerings (i.e. intersections)
      – No need to use Order Preserving Partitioner
  • Categorization and structure is completely handled outside of the data store
      – Cassandra only used to capture attribute values

  • Solr for additional indexing and querying capabilities
      – Mainly at attribute value level
      – Pattern matching
      – Non-standard comparisons and range checks
  • HDFS/Hadoop for “extreme” bulk/batch operations
      – Large file/content streaming and parallel processing
      – Corresponding response aggregation
      – Hadoop “append”

    1. Cassandra @walmartlabs
    2. Cassandra @walmartlabs
        • Cassandra adoption at Walmart
            – Using the DataStax distribution http://www.datastax.com/
        • Introduction to the talks
        • Hiring @labs
        Walmart eCommerce
    3. Cassandra @walmartlabs
        • Introduction to the talks
            – Walmartlabs
                • @labs – Using Cassandra for real-time stream processing
                • @services – Using Cassandra for product and items
            – DataStax
                • Data modeling with Cassandra
    4. Cassandra @walmartlabs
        • Hiring @labs
            – Cassandra admins
            – Java engineers
            – http://www.walmartlabs.com/open-positions/
    5. Cassandra for Real-time Stream Processing
        Karl Mueller, @WalmartLabs
        Wang Lam, @WalmartLabs
    6. Data-stream computation
        • “Big” data: MapReduce (Hadoop)
            – Map and Reduce steps
            – Batch process large input (e.g., from HDFS)
            – Hadoop distributes computation
        • Fast data: MapUpdate (Muppet)
            – Map and Update steps
            – Continuously process streaming input
            – Muppet maintains computation
            – Muppet manages memory/storage
        2012 Cassandra for Real-Time Stream Processing @WalmartLabs
    7. The MapReduce framework (Hadoop)
        • Event
            – A <key, value> pair of data
        • Map
            – A function that performs (stateless) computation on incoming events
        • Reduce
            – A function that combines all input for a particular key
        • Application
            – Map -> Reduce
    8. The MapUpdate framework (Muppet)
        • Event
            – A <key, value> pair of data
        • Map
            – A function that performs (stateless) computation on incoming events
        • Update
            – A function that updates a slate using incoming events
        • Application
            – A directed graph of Mappers and Updaters
    9. A MapUpdate application
    10. The Map (Foursquare::CheckinMapper)
        sub map {
            my $self  = shift;
            my $event = shift;

            my $checkin = $event->{checkin};
            # Round the check-in time down to the start of its 900-second (15-minute) slot.
            my $timeslot = int($checkin->{created} / 900) * 900;
            $event->{kosmix}->{timeslot} = $timeslot;
            $event->{kosmix}->{interval} = 900;

            # Tag the event with a retailer when the venue name matches; later matches win.
            my $venue_name = $checkin->{venue}->{name};
            my $retailer = 0;
            $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
            $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
            $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

            if ($retailer) {
                $event->{kosmix}->{retailer} = $retailer;
                # Publish to the next stream, keyed by retailer and timeslot.
                $self->publish("FoursquareRetailerCheckin", $event, $retailer.".".$timeslot);
            }
        }
        2012 ISD YBM Tech Fair - Big Fast Data @WalmartLabs
    11. The Update (Foursquare::RetailerUpdater)
        use Muppet::Updater;
        package Foursquare::RetailerUpdater;
        @ISA = qw( Muppet::Updater );
        use strict;

        sub update {
            my $self   = shift;
            my $event  = shift;
            my $slate  = shift;
            my $config = shift;
            my $key    = shift;

            # Copy the bucket description into the slate, then count this check-in.
            $slate->{timeslot} = $event->{kosmix}->{timeslot};
            $slate->{interval} = $event->{kosmix}->{interval};
            $slate->{retailer} = $event->{kosmix}->{retailer};
            $slate->{count} += 1;
        }
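The Map and Update slides together form a small MapUpdate pipeline. The following is a minimal Python sketch of that flow, not Muppet's actual API: `map_checkin`, `update_slate` and `run` are hypothetical names, and an in-memory dict stands in for the slate store. It mirrors the slides' logic: bucket each check-in into a 900-second timeslot, key it by retailer and timeslot, and increment a count in the slate for that key.

```python
import re
from collections import defaultdict

INTERVAL = 900  # seconds per timeslot, as on the Map slide

# Patterns copied from the Map slide; as in the Perl code, a later match overrides an earlier one.
RETAILERS = [
    ("ToysRUs", re.compile(r"toys.*r.*us", re.I)),
    ("Walmart", re.compile(r"wal.*mart", re.I)),
    ("SamsClub", re.compile(r"sam.*club", re.I)),
]


def map_checkin(event):
    """Stateless map step: return a list of (key, event) pairs to publish."""
    checkin = event["checkin"]
    timeslot = (checkin["created"] // INTERVAL) * INTERVAL
    retailer = None
    for label, pattern in RETAILERS:
        if pattern.search(checkin["venue"]["name"]):
            retailer = label
    if retailer is None:
        return []  # not a tracked retailer: publish nothing
    mapped = {"retailer": retailer, "timeslot": timeslot, "interval": INTERVAL}
    return [(f"{retailer}.{timeslot}", mapped)]


def update_slate(slate, event):
    """Update step: fold one mapped event into the slate for its key."""
    slate["timeslot"] = event["timeslot"]
    slate["interval"] = event["interval"]
    slate["retailer"] = event["retailer"]
    slate["count"] = slate.get("count", 0) + 1


def run(events):
    """Drive events through map then update; slates live in a plain dict here."""
    slates = defaultdict(dict)
    for event in events:
        for key, mapped in map_checkin(event):
            update_slate(slates[key], mapped)
    return dict(slates)
```

Each slate ends up holding the check-in count for one retailer in one 15-minute window, which is the per-key state the later slides store in Cassandra.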
    12. Example results
    13. Muppet Processing
        • Slates are 1–100 KB in size
        • Local cache on Muppet Node
            – 85% reads from cache
            – Write-through delayed cache
            – ~750K slates in cache per node
        • Remote slates read through Muppet API
        • Cassandra is the permanent datastore
        • Slates tend to be updated and read in batches
            – 10-50 at a time
    14. Muppet & Cassandra Architecture
        [Diagram. Labels: ~100x Muppet Node (Processes, API, Slate Cache, Delay); 16x Cassandra (8x RAID0 SSD, 1.2TB RAW)]
    15. Datastore Requirements
        • Consistent, low response time
            – 10ms or less for slate reads on average
        • 1+ billion keys, future expansion maybe 5-10 billion
        • Value is whole set of data
            – Slate losses in small amounts OK
        • Datastore gets entirely “cold” reads
            – Muppet Cache: 85% for reads
            – Datastore cannot rely on cache for performance
    16. Why Cassandra?
        • Timeframe: Early 2010
            – Low latency: a rare feature among NoSQL
            – Most NoSQL favors throughput over response time
            – New “Best NoSQL evur!!” every 2 months
        • Cassandra:
            – Open-Source, active community, Clustering a core feature
        • Simple is good
            – Peer networking, Data file format, key distribution
        • QUORUM consistency good middle ground
            – AP focus in CAP aligns well with our needs
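QUORUM sits between consistency levels ONE and ALL: it needs a majority of replicas rather than a single one or every one. The arithmetic behind that "middle ground" can be sketched as follows. The replication factors used are illustrative assumptions, not the deployment's actual settings, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, `quorum(3)` is 2: one replica can be down, and a QUORUM read still overlaps any QUORUM write, whereas reading and writing at ONE gives no such overlap.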
    17. Why Cassandra – the Challenges
        • Seeks are going to be difficult
            – Overwrites mean nightly compactions
            – Compactions blow up seek performance
            – 90%+ cold reads means lots of seeks
            – Head and body reads can produce a lot of seeks
        • Slates as an atomic unit means no bulk column slice reads
        • Likely to have unfavorable read:write ratio
            – Early estimates: 1:3, or even worse
        • Oh yeah, spinning disks hate seeks. Uh oh!
    18. Frequent Row Overwrites in Cassandra
        [Diagram. Labels: HEAD (Many Seeks); BODY (Some Seeks); TAIL (Few Seeks); Full Compaction; Growth During Day; Data Files (SSTables)]
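Why overwrites cost seeks can be illustrated with a toy model. This is a simplified sketch, not Cassandra's storage engine: each data file is a plain dict mapping a key to `(timestamp, value)`, and the number of files containing the key stands in for the number of disk seeks. A row overwritten several times since the last compaction has versions in several files, and a read must consult all of them to find the newest; compaction merges the files so that only one version remains.

```python
def read(sstables, key):
    """Return (value, files_touched); the version with the newest timestamp wins."""
    newest = None
    touched = 0
    for table in sstables:
        if key in table:
            touched += 1  # in this toy model, one file consulted ~ one seek
            timestamp, value = table[key]
            if newest is None or timestamp > newest[0]:
                newest = (timestamp, value)
    return (newest[1] if newest else None), touched


def compact(sstables):
    """Merge all files into one, keeping only the newest version of each key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return [merged]
```

In this model a slate rewritten three times during the day costs three lookups before compaction and one afterwards, which is the growth-then-compaction cycle the diagram labels describe.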
    19. Solution
        • Cassandra + SSDs !!
        • Expensive in terms of space, cheap in terms of IOps
        • Random seeks “free”
        • Good performance during nightly compactions
    20. Compaction Effect on System
    21. How did Cassandra do?
        • Average latency below 10ms, often 5-8ms
        • Read:write ratio: 1:2
            – Today, 1:1
        • Compacting 500GB every night in <4 hours
        • Individual C* nodes handled over 1500 rps/wps
        • SSD cost: well worth it
    22. Helping Cassandra out
        • Muppet absorbs writes in local cache
            – Write on # of updates or staleness
            – Reduces write counts in Cassandra
            – More efficient
        • Compress all slates on Muppet nodes
            – Easier to scale than C* nodes doing compression
            – Less disk IO, less network
            – CPU on Muppet nodes cheap
        • Expire data via TTL
            – Muppet apps decide data-keep length
        • Java GC tuning flattened out CPU and GC stops
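The first two ideas on this slide, absorbing writes in a local cache and compressing slates before they reach Cassandra, can be sketched together. This is an illustrative Python model only: the class name `DelayedWriteCache`, the thresholds `max_updates` and `max_age`, the JSON encoding, and the dict used in place of Cassandra are all assumptions, since the slides do not show Muppet's cache implementation.

```python
import json
import time
import zlib


class DelayedWriteCache:
    """Buffers slate updates and writes a compressed slate to the store
    once a key has seen enough updates or has been dirty for too long."""

    def __init__(self, store, max_updates=10, max_age=5.0, clock=time.monotonic):
        self.store = store            # dict-like stand-in for the datastore
        self.max_updates = max_updates
        self.max_age = max_age        # seconds a dirty slate may wait
        self.clock = clock
        self.writes = 0               # writes actually sent to the store
        self._dirty = {}              # key -> [slate, pending_updates, first_dirty_time]

    def update(self, key, slate):
        now = self.clock()
        entry = self._dirty.setdefault(key, [slate, 0, now])
        entry[0] = slate
        entry[1] += 1
        if entry[1] >= self.max_updates or now - entry[2] >= self.max_age:
            self._flush(key)

    def flush_stale(self):
        """Write out every slate that has been dirty for at least max_age."""
        now = self.clock()
        stale = [k for k, e in self._dirty.items() if now - e[2] >= self.max_age]
        for key in stale:
            self._flush(key)

    def read(self, key):
        if key in self._dirty:        # served from the cache, no store read
            return self._dirty[key][0]
        blob = self.store.get(key)
        return None if blob is None else json.loads(zlib.decompress(blob))

    def _flush(self, key):
        slate, _, _ = self._dirty.pop(key)
        # Compress on the caching node so the store only handles opaque bytes.
        self.store[key] = zlib.compress(json.dumps(slate).encode("utf-8"))
        self.writes += 1
```

With `max_updates=3`, three updates to the same key produce a single write to the store, which is the reduction in write count the slide refers to.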
    23. Recent and Future
        • Cassandra 0.8.x
            – Faster compaction
            – Stability
            – Performance
        • Cassandra 1.0.x
            – Close to deployment @WML
            – LevelDB is very, very interesting
            – Cache memory changes make large caches feasible!
            – Row[Column] latest-only: very nice
            – SSDs no longer needed? Possibly!
                • Depends on cold seek requirements
    24. Lessons
        • Simple is usually faster and cheaper
            – Add complexity only where needed
        • Best solution can usually be made to work
        • Proactive monitoring very important
            – Trend graph everything relevant!
        • Failing fast is better than succeeding late
        • No substitute for understanding your platform
        • Spend money when it will save you time and complexity
    25. Q&A
    26. Using Cassandra for Products & Items
        Rajkumar Venkat
        rvenkat@walmartlabs.com
    27. First Challenge
        Build a truly Global Product Catalog
    28. Dimensions, Products & Product Offerings - Example
    29. Second Challenge
        Catalog (& Categorize) Any Sellable Item
    30. Flexible Categorization & Attribution
        • The right kind of categorization and attribution is crucial to making sense of the enormity of product data
        • Ultimate shopping experience
        • Fine-grained analytics & planning
        • Standards exist, but severely limiting
        • Product landscape changes dramatically every day
    31. Other excerpts from the “shopping list”
        • Lookup and potentially match products and offerings by any combination of attributes and other dimensional criteria
        • Item-Item Relationships & Collections
            – Hierarchical
            – Graph
        • Low Latency, High Throughput, Highly Available
        • A scalable but unified system of record for all product and offering data
    32. Translating to Cassandra
        • Modeling options
            1. Product as a “wide row” encompassing all offerings
            2. Product assembled from several offering “fragment” rows
        • Multiple Column Families
            – Product fragments
            – Custom consistency enabler
            – Custom row caching at column family level
        • Single keyspace to hold all core data fragments
            – Tighter control of replication factor, strategy
            – Additional keyspaces only for supporting data
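Modeling option 2 implies that the application merges fragment rows into a product offering at read time. The sketch below shows one way that merge could work, under stated assumptions: fragments are keyed by a tuple of product id plus zero or more dimension values, more specific fragments override less specific ones, and the row keys, attribute names and values are invented for illustration. The talk does not specify the actual key layout or override rules.

```python
def assemble_offering(fragments, product_id, dimensions):
    """Overlay fragment rows from least specific to most specific.

    fragments:  dict mapping a row key tuple (product_id, dim1, dim2, ...)
                to a dict of attribute values (a stand-in for fragment rows).
    dimensions: tuple of dimension values, e.g. (business_unit, geography).
    """
    offering = dict(fragments.get((product_id,), {}))  # dimension-agnostic base
    for depth in range(1, len(dimensions) + 1):
        key = (product_id, *dimensions[:depth])
        offering.update(fragments.get(key, {}))        # more specific wins
    return offering
```

Under this layout the dimension-agnostic product content is stored once, and each offering row carries only what differs at that dimension intersection.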
    33. Translating to Cassandra (contd.)
        • Flexible, selective denormalization
        • Secondary indexes for faster attribute-level queries
        • Dynamic composites
            – define flexible comparators for different column key levels
            – capture 1-n levels of dimension intersections
        • Column slicing to retrieve the right offerings
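Composite column names and column slicing can be modeled in a few lines. This is a conceptual Python sketch, not a Cassandra client: `WideRow` is a hypothetical class that keeps column names (tuples of dimension values) in sorted order, the way columns within a row are kept sorted by their comparator, so that all columns sharing a prefix are contiguous and can be fetched as one slice. The dimension values are made up for illustration.

```python
from bisect import bisect_left


class WideRow:
    """One row whose column names are composites (tuples), kept sorted."""

    def __init__(self):
        self._names = []    # sorted composite column names
        self._values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, prefix):
        """Return (name, value) pairs for columns whose name starts with prefix."""
        start = bisect_left(self._names, prefix)  # first name >= prefix
        result = []
        for name in self._names[start:]:
            if name[:len(prefix)] != prefix:
                break                             # left the contiguous prefix range
            result.append((name, self._values[name]))
        return result
```

Because the slice works on sorted columns inside one row, it does not depend on how row keys are distributed across nodes.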
    34. The “Supporting Cast”
        • Solr for additional indexing and querying capabilities
            – Mainly for attribute values
                • Pattern matching
                • Non-standard type comparisons
                • Range checks
    35. Queries?
