Cassandra atwalmartlabsmeetup201203
 


Slides from Cassandra Meetup 2012/03/15

  • Products are inherently multi-dimensional and mostly multi-variant.
    Dimensions include:
        – Business Unit (Walmart, Sam’s Club, ASDA etc.)
        – Geography (US, Canada, UK etc.)
        – Language (en_US, fr_CA, en_UK etc.)
        – Supply Chain (Owned Inventory, Direct Ship, Marketplace etc.)
        – Channel (Website, Retail/Store, Mobile, Facebook etc.)
    Variants include:
        – Size (S, M, L, XL etc.)
        – Color (Red, Green, Blue etc.)
        – Capacity (8 GB, 16 GB etc.)
    The content of a true Global Product is typically agnostic of any specific dimension or variant.
    Items as we know and see them are actually Product Offerings, representing content and behavior changes captured at every dimension and variant intersection.
    What you shop for is different from what you order, which is different from what you actually get!
  • Notice the need for the concept of dimensions and variants to capture and maintain data at each level.
    Ingest external catalogs, even if we do not plan to sell the items right away.
    On a scale of hundreds of millions of unique SKUs.
    Base-variant combinations and pre-configured bundles create order-of-magnitude increases in these estimates.
  • How do we give our customers access to the largest assortment in the world? As the digital arm of the world's largest retailer, we need not only to give existing customers access to an endless shelf, but also to offer an assortment broad enough to enter the consideration set of non-Walmart shoppers. This means millions and millions of items. And we must do so in a manner that is scalable and gives the consumer the right product information to make an informed decision about whether or not the product will meet their needs.
  • Ultimate shopping experience: the customer finds everything he/she needs intuitively and in the right place, whether browsing or searching.
    Fine-grained analytics & planning: fine-grained analytics helps us put the right kind of products on our shelves (physical or virtual) at the right level of availability (inventory) and pricing.
    Standards exist but are severely limiting, e.g. the GPC hierarchical classification and attribution structure.
    The product landscape changes dramatically every day. For example, when tablets, a radically new form factor, arrive on the market, we want to be able to adopt and sell them ASAP rather than wait for a cumbersome change control process caused by inflexible categorization and attribution.
  • Ability to look up and potentially match products and offerings/items by any combination of attributes and other dimensional criteria
    Item-Item Relationships & Collections
        – Hierarchical
            – Base-variants (e.g. iPhone 4S 16/32/64 GB)
        – Graph
            – Bundles: hard, fixed, inflexible or configurable
            – Components & Ingredients
            – Accessories & Replacements
            – Case Packs & Vendor Packs
    Low Latency, High Throughput, Highly Available
        – Sellers typically update 40-50% of their offerings at some level each day
        – Based on global projections, this may be comparable to the scale of social media feeds
        – Accept, process, search, retrieve and analyze large volumes of data 24x7
  • Multiple Column Families
        – Product fragments
        – Custom consistency enabler
        – Separate the “data” from the “index” or “event log”
        – Use to separate “Work In Progress” from the golden copy
        – Implicit versioning and potential archiving/purging requirements
        – Tunable consistency levels per API call (Read/Write)
        – Custom row caching at column family level
            – Optimize for read-intensive vs. write-intensive column families
    Single keyspace to hold all data fragments
        – Tighter control of replication factor (DC + 3 or 5) and strategy (NetworkTopologyStrategy, formerly known as DatacenterShardStrategy)
        – Additional keyspaces only for supporting data
            – Lower priority, loosely coupled or completely decoupled
            – E.g. purgeable audit & history logs
  • Flexible, selective denormalization
        – Bi-directional relationships
        – Capture more than just foreign keys
        – Indices
        – Merge records to create the product offering in the application/DaaS layer
        – Right balance between optimizing the retrieval algorithm and space
    Secondary indexes for faster attribute-level queries, but simple queries only
        – Complex queries may need to be supplemented with other tools, as we will see later
    Dynamic composites
        – Capture 1-n levels of dimension intersections
        – Define flexible comparators for different column key levels
    Column slicing to retrieve the right offerings (i.e. intersections)
        – No need to use the Order Preserving Partitioner
    Categorization and structure are handled completely outside of the data store
        – Cassandra is only used to capture attribute values
  • Solr for additional indexing and querying capabilities
        – Mainly at attribute value level
        – Pattern matching
        – Non-standard comparisons and range checks
    HDFS/Hadoop for “extreme” bulk/batch operations
        – Large file/content streaming and parallel processing
        – Corresponding response aggregation
        – Hadoop “append”

Cassandra atwalmartlabsmeetup201203 Presentation Transcript

  • Cassandra @walmartlabs
  • Cassandra @walmartlabs
      • Cassandra adoption at Walmart
          – Using the DataStax distribution http://www.datastax.com/
      • Introduction to the talks
      • Hiring @labs
    Walmart eCommerce
  • Cassandra @walmartlabs
      • Introduction to the talks
          – Walmartlabs
              • @labs – Using Cassandra for real-time stream processing
              • @services – Using Cassandra for products and items
          – DataStax
              • Data modeling with Cassandra
  • Cassandra @walmartlabs
      • Hiring @labs
          – Cassandra admins
          – Java engineers
          – http://www.walmartlabs.com/open-positions/
  • Cassandra for Real-time Stream Processing
    Karl Mueller, @WalmartLabs
    Wang Lam, @WalmartLabs
  • Data-stream computation
      • “Big” data: MapReduce (Hadoop)
          – Map and Reduce steps
          – Batch process large input (e.g., from HDFS)
          – Hadoop distributes computation
      • Fast data: MapUpdate (Muppet)
          – Map and Update steps
          – Continuously process streaming input
          – Muppet maintains computation
          – Muppet manages memory/storage
    2012 Cassandra for Real-Time Stream Processing @WalmartLabs
  • The MapReduce framework (Hadoop)
      • Event
          – A <key, value> pair of data
      • Map
          – A function that performs (stateless) computation on incoming events
      • Reduce
          – A function that combines all input for a particular key
      • Application
          – Map -> Reduce
  • The MapUpdate framework (Muppet)
      • Event
          – A <key, value> pair of data
      • Map
          – A function that performs (stateless) computation on incoming events
      • Update
          – A function that updates a slate using incoming events
      • Application
          – A directed graph of Mappers and Updaters
  • A MapUpdate application
  • The Map (Foursquare::CheckinMapper)
    sub map {
        my $self  = shift;
        my $event = shift;
        my $checkin = $event->{checkin};

        # Bucket the check-in into a 15-minute (900-second) timeslot.
        my $timeslot = int($checkin->{created} / 900) * 900;
        $event->{kosmix}->{timeslot} = $timeslot;
        $event->{kosmix}->{interval} = 900;

        # Tag the retailer by venue name; labels are quoted so they are plain strings.
        my $venue_name = $checkin->{venue}->{name};
        my $retailer = 0;
        $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
        $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
        $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

        # Publish only retailer check-ins, keyed by "<retailer>.<timeslot>".
        if ($retailer) {
            $event->{kosmix}->{retailer} = $retailer;
            $self->publish("FoursquareRetailerCheckin", $event, $retailer.".".$timeslot);
        }
    }
    2012 ISD YBM Tech Fair - Big Fast Data @WalmartLabs
  • The Update (Foursquare::RetailerUpdater)
    use Muppet::Updater;
    package Foursquare::RetailerUpdater;
    @ISA = qw( Muppet::Updater );
    use strict;

    sub update {
        my $self   = shift;
        my $event  = shift;
        my $slate  = shift;
        my $config = shift;
        my $key    = shift;

        # Copy the bucket metadata onto the slate and count this check-in.
        $slate->{timeslot} = $event->{kosmix}->{timeslot};
        $slate->{interval} = $event->{kosmix}->{interval};
        $slate->{retailer} = $event->{kosmix}->{retailer};
        $slate->{count} += 1;
    }
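To see how an event flows from a mapper to an updater, here is a minimal in-memory sketch. It is not Muppet's API: the `%slates` hash, `map_checkin`, `update_slate` and `process` are hypothetical stand-ins for the runtime's slate store and routing, and the events are simplified to a venue name and a timestamp.

```perl
use strict;
use warnings;

# Toy in-memory stand-in for a MapUpdate runtime (hypothetical, not Muppet's API).
my %slates;    # slate store keyed by "<retailer>.<timeslot>"

# Map step: stateless; buckets a check-in into a 15-minute slot and tags the retailer.
sub map_checkin {
    my ($event) = @_;
    my $timeslot = int($event->{created} / 900) * 900;
    my $retailer;
    $retailer = 'Walmart'  if $event->{venue} =~ /wal.*mart/i;
    $retailer = 'SamsClub' if $event->{venue} =~ /sam.*club/i;
    return unless $retailer;    # drop events for other venues
    return ("$retailer.$timeslot", { retailer => $retailer, timeslot => $timeslot });
}

# Update step: folds one mapped event into the slate for its key.
sub update_slate {
    my ($key, $mapped) = @_;
    my $slate = $slates{$key} //= { count => 0 };
    $slate->{retailer} = $mapped->{retailer};
    $slate->{timeslot} = $mapped->{timeslot};
    $slate->{count}   += 1;
    return $slate;
}

# Driver: route each event through map, then update.
sub process {
    my (@events) = @_;
    for my $event (@events) {
        my ($key, $mapped) = map_checkin($event);
        update_slate($key, $mapped) if defined $key;
    }
    return \%slates;
}

process(
    { venue => 'Walmart Supercenter', created => 1000 },
    { venue => 'Wal-Mart',            created => 1700 },
    { venue => 'Corner Cafe',         created => 1200 },
);
printf "%s => %d\n", $_, $slates{$_}{count} for sort keys %slates;
```

Both Walmart check-ins fall into the timeslot starting at 900, so the script prints `Walmart.900 => 2`; the cafe check-in is dropped by the map step.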
  • Example results
  • Muppet Processing
      • Slates are 1–100 KB in size
      • Local cache on Muppet Node
          – 85% reads from cache
          – Write-through delayed cache
          – ~750K slates in cache per node
      • Remote slates read through Muppet API
      • Cassandra is the permanent datastore
      • Slates tend to be updated and read in batches
          – 10-50 at a time
  • Muppet & Cassandra Architecture (diagram). Labels: ~100x Muppet Node (Processes, API, Slate Cache, Delay); 16x Cassandra (8x RAID0 SSD, 1.2TB RAW)
  • Datastore Requirements
      • Consistent, low response time
          – 10ms or less for slate reads on average
      • 1+ billion keys, future expansion maybe 5-10 billion
      • Value is whole set of data
          – Slate losses in small amounts OK
      • Datastore gets entirely “cold” reads
          – Muppet Cache: 85% for reads
          – Datastore cannot rely on cache for performance
  • Why Cassandra?
      • Timeframe: Early 2010
          – Low latency: a rare feature among NoSQL
          – Most NoSQL favors throughput over response time
          – New “Best NoSQL evur!!” every 2 months
      • Cassandra:
          – Open-Source, active community, Clustering a core feature
      • Simple is good
          – Peer networking, Data file format, key distribution
      • QUORUM consistency good middle ground
          – AP focus in CAP aligns well with our needs
  • Why Cassandra – the Challenges
      • Seeks are going to be difficult
          – Overwrites mean nightly compactions
          – Compactions blow up seek performance
          – 90%+ cold reads means lots of seeks
          – Head and body reads can produce a lot of seeks
      • Slates as an atomic unit means no bulk column slice reads
      • Likely to have unfavorable read:write ratio
          – Early estimates: 1:3, or even worse
      • Oh yeah, spinning disks hate seeks. Uh oh!
  • Frequent Row Overwrites in Cassandra (diagram). Labels: Data Files (SSTables); Growth During Day; TAIL: Few Seeks; BODY: Some Seeks; HEAD: Many Seeks; Full Compaction
  • Solution
      • Cassandra + SSDs !!
      • Expensive in terms of space, cheap in terms of IOps
      • Random seeks “free”
      • Good performance during nightly compactions
  • Compaction Effect on System
  • How did Cassandra do?
      • Average latency below 10ms, often 5-8ms
      • Read-write ratio: 1:2
          – Today, 1:1
      • Compacting 500GB every night in <4 hours
      • Individual C* nodes handled over 1500 rps/wps
      • SSD cost: well worth it
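As a rough check on the compaction figure, the implied sustained throughput can be worked out. This assumes decimal gigabytes and uses the 4-hour upper bound as the duration; the slide does not say whether the 500GB is per node or across the cluster.

```latex
\frac{500\ \text{GB}}{4\ \text{h}}
  = \frac{500{,}000\ \text{MB}}{14{,}400\ \text{s}}
  \approx 34.7\ \text{MB/s}
```

Because the compaction finished in under 4 hours, the actual sustained rate was somewhat higher than this.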
  • Helping Cassandra out
      • Muppet absorbs writes in local cache
          – Write on # of updates or staleness
          – Reduces write counts in Cassandra
          – More efficient
      • Compress all slates on Muppet nodes
          – Easier to scale than C* nodes doing compression
          – Less disk IO, less network
          – CPU on Muppet nodes cheap
      • Expire data via TTL
          – Muppet apps decide data-keep length
      • Java GC tuning flattened out CPU and GC stops
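The "write on # of updates or staleness" policy can be sketched as a small write-back buffer. This is an illustration under assumptions, not Muppet's implementation: the thresholds, the `%cache` layout, and the names `cache_update` and `flush_if_due` are all hypothetical, and `@flushed` stands in for writes sent to Cassandra.

```perl
use strict;
use warnings;

# Hypothetical thresholds; the slides do not give the real values.
my $MAX_UPDATES = 3;     # flush after this many buffered updates
my $MAX_AGE_SEC = 30;    # or once the oldest unflushed update is this stale

my %cache;               # key => { slate, dirty, first_dirty_at }
my @flushed;             # stand-in for writes sent to the datastore

# Write the slate out once either threshold is crossed, then reset the counter.
sub flush_if_due {
    my ($key, $now) = @_;
    my $entry = $cache{$key} or return 0;
    return 0 if $entry->{dirty} == 0;
    my $too_many = $entry->{dirty} >= $MAX_UPDATES;
    my $too_old  = $now - $entry->{first_dirty_at} >= $MAX_AGE_SEC;
    return 0 unless $too_many || $too_old;
    push @flushed, [ $key, $entry->{slate} ];    # a real system would write to Cassandra here
    $entry->{dirty} = 0;
    return 1;
}

# Buffer an update in the local cache instead of writing through immediately.
sub cache_update {
    my ($key, $slate, $now) = @_;
    my $entry = $cache{$key} //= { dirty => 0 };
    $entry->{slate} = $slate;
    $entry->{first_dirty_at} = $now if $entry->{dirty} == 0;
    $entry->{dirty}++;
    flush_if_due($key, $now);
    return;
}

# Three updates to the same slate result in a single datastore write.
cache_update('Walmart.900', { count => $_ }, $_) for 1 .. 3;
print scalar(@flushed), " write(s) for 3 updates\n";
```

A periodic sweep calling `flush_if_due` for every cached key would cover slates that stop receiving updates before reaching the count threshold.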
  • Recent and Future
      • Cassandra 0.8.x
          – Faster compaction
          – Stability
          – Performance
      • Cassandra 1.0.x
          – Close to deployment @WML
          – LevelDB is very, very interesting
          – Cache memory changes make large caches feasible!
          – Row[Column] latest-only: very nice
          – SSDs no longer needed? Possibly!
              • Depends on cold seek requirements
  • Lessons
      • Simple is usually faster and cheaper
          – Add complexity only where needed
      • Best solution can usually be made to work
      • Proactive monitoring very important
          – Trend graph everything relevant!
      • Failing fast is better than succeeding late
      • No substitute for understanding your platform
      • Spend money when it will save you time and complexity
  • Q&A
  • Using Cassandra for Products & Items
    Rajkumar Venkat
    rvenkat@walmartlabs.com
  • First Challenge: Build a truly Global Product Catalog
  • Dimensions, Products & Product Offerings - Example
  • Second Challenge: Catalog (& Categorize) Any Sellable Item
  • Flexible Categorization & Attribution
      • The right kind of categorization and attribution is crucial to making sense of the enormity of product data
      • Ultimate shopping experience
      • Fine-grained analytics & planning
      • Standards exist, but are severely limiting
      • Product landscape changes dramatically every day
  • Other excerpts from the “shopping list”
      • Look up and potentially match products and offerings by any combination of attributes and other dimensional criteria
      • Item-Item Relationships & Collections
          • Hierarchical
          • Graph
      • Low Latency, High Throughput, Highly Available
      • A scalable but unified system of record for all product and offering data
  • Translating to Cassandra
      • Modeling options
          1. Product as a “wide row” encompassing all offerings
          2. Product assembled from several offering “fragment” rows
      • Multiple Column Families
          • Product fragments
          • Custom consistency enabler
          • Custom row caching at column family level
      • Single keyspace to hold all core data fragments
          • Tighter control of replication factor, strategy
          • Additional keyspaces only for supporting data
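Modeling option 2, assembling a product offering from fragment rows in the application layer, might look like the sketch below. Everything here is illustrative: the `%fragments` hash stands in for rows read from Cassandra, and the row-key scheme (`sku123|US|Website`), attribute names and `assemble_offering` are assumptions, not the schema from the talk.

```perl
use strict;
use warnings;

# Hypothetical fragment rows: a dimension-agnostic global product plus overrides
# stored per dimension intersection. Keys and attribute names are illustrative.
my %fragments = (
    'sku123'            => { title => 'Tablet 16GB', color => 'Black' },
    'sku123|US'         => { currency => 'USD' },
    'sku123|US|Website' => { price => 199, ship => 'Direct Ship' },
    'sku123|UK'         => { currency => 'GBP' },
);

# Build an offering by overlaying fragments from least to most specific.
sub assemble_offering {
    my ($product_id, @dimensions) = @_;
    my %offering;
    my $key = $product_id;
    for my $level (undef, @dimensions) {
        $key .= "|$level" if defined $level;
        my $fragment = $fragments{$key} or next;    # skip intersections with no data
        %offering = (%offering, %$fragment);        # later (more specific) values win
    }
    return \%offering;
}

my $offer = assemble_offering('sku123', 'US', 'Website');
print join(', ', map { "$_=$offer->{$_}" } sort keys %$offer), "\n";
```

The global fragment supplies the shared attributes, and each deeper intersection only stores what differs, which is one way to read "flexible, selective denormalization".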
  • Translating to Cassandra (contd.)
      • Flexible, selective denormalization
      • Secondary indexes for faster attribute-level queries
      • Dynamic composites
          • define flexible comparators for different column key levels
          • capture 1-n levels of dimension intersections
      • Column slicing to retrieve the right offerings
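The idea behind composite column names and column slicing can be illustrated in memory. This is a sketch of the concept only, not a Cassandra client call: in Cassandra the comparator ordering and the slice happen server-side. The component order (business unit, geography, channel), the sample values and the function names are assumptions.

```perl
use strict;
use warnings;

# One wide row, modelled as [composite column name, value] pairs. Components are
# (business unit, geography, channel); all names and values are illustrative.
my @row = (
    [ [ 'ASDA',    'UK', 'Website' ], 'offer-3' ],
    [ [ 'Walmart', 'US', 'Mobile'  ], 'offer-2' ],
    [ [ 'Walmart', 'CA', 'Website' ], 'offer-4' ],
    [ [ 'Walmart', 'US', 'Website' ], 'offer-1' ],
);

# Compare composite names component by component, as a composite comparator would.
sub compare_composite {
    my ($left, $right) = @_;
    for my $i (0 .. $#$left) {
        return 1 if $i > $#$right;    # left is longer and equal so far
        my $order = $left->[$i] cmp $right->[$i];
        return $order if $order;
    }
    return @$left <=> @$right;        # equal, or left is a shorter prefix
}

# Return the columns whose composite name starts with the given prefix, in sorted order.
# Assumes every column name has at least as many components as the prefix.
sub slice_by_prefix {
    my (@prefix) = @_;
    my @sorted = sort { compare_composite($a->[0], $b->[0]) } @row;
    return grep {
        my $name = $_->[0];
        !( grep { $name->[$_] ne $prefix[$_] } 0 .. $#prefix );
    } @sorted;
}

# All Walmart US offerings, regardless of channel.
print join(', ', map { $_->[1] } slice_by_prefix('Walmart', 'US')), "\n";
```

Because columns sort by their components in order, every offering for a given dimension intersection is contiguous within the row, so one slice retrieves it without needing ordered row keys.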
  • The “Supporting Cast”
      • Solr for additional indexing and querying capabilities
          • Mainly for attribute values
          • Pattern matching
          • Non-standard type comparisons
          • Range checks
  • Queries?