
Big Data Without a Big Database (Surge 2012)



These days it is not uncommon to have hundreds of gigabytes of data that must be sliced and diced, then delivered fast and rendered quickly. Typical solutions involve lots of caching and expensive hardware with lots of memory. And while those solutions certainly can work, they aren’t always cost-effective or feasible in certain environments (like in the cloud). This talk covers strategies for caching large datasets without tons of expensive hardware, through software and data design instead.

It’s common wisdom that serving your data from memory dramatically improves application performance and is a key to scaling. However, caching large datasets brings its own challenges: distribution, consistency, dealing with memory limits, and optimizing data loading, just to name a few. This talk goes through some of the challenges, and solutions, involved in achieving fast data queries in the cloud. The audience will come away armed with a number of practical techniques for organizing and building caches for non-transactional datasets, applicable to scaling existing systems or designing new ones.



  1. Big Data Without a Big Database, Kate Matsudaira (@katemats)
  2. Two Kinds of Data

                                 reference data             user data
     nickname:                   "non-transactional"        "transactional"
     examples:                   product/offer catalogs,    user accounts,
                                 service catalogs,          shopping cart/orders,
                                 static geolocation data,   user messages,
                                 dictionaries, ...          ...
     created/modified by:        business (you)             users
     sensitivity to staleness:   low                        high
     plan for growth:            easy                       hard
     access optimization:        mostly read                read/write
  3. (diagram: user data vs. reference data)
  4. (diagram: user data vs. reference data)
  5. (diagram: user data vs. reference data)
  6. Reference Data Needs Speed
  7. Performance Reminder
     main memory read:    0.0001 ms (100 ns)
     network round trip:  0.5 ms (500,000 ns)
     disk seek:           10 ms (10,000,000 ns)
  8. The Beginning (diagram: load balancers route to webapps, which call
     services backed by one BIG DATABASE fed by a data loader)
     Problems: availability, performance, scalability
  9. Replication (diagram: the same architecture plus a database REPLICA)
     Problems: operational overhead, performance, scalability
  10. Local Caching (diagram: each service adds a local cache in front of the
      BIG DATABASE and its replica)
      Problems: operational overhead, scalability, long-tail performance,
      consistency
  11. The Long Tail Problem
      80% of requests query 10% of entries (the head);
      20% of requests query the remaining 90% of entries (the tail).
  12. Big Cache (diagram: services share one BIG CACHE in front of the
      database and replica; a data loader preloads the cache)
      Problems: operational overhead, long-tail performance, scalability,
      consistency
  13. Do I look like I need a cache?
      memcached(b), ElastiCache (AWS), Oracle Coherence
  14. Distributed caches: dynamically assign keys to the "nodes"; targeted at
      generic data/use cases; scale horizontally; dynamically rebalance data;
      make no assumptions about loading/updating data; poor performance on
      cold starts.
  15. Big Cache Technologies
      - Additional hardware
      - Additional configuration
      - Additional monitoring      } operational overhead
      - Extra network hop
      - Slow scanning
      - Additional deserialization } performance
  16. NoSQL to the Rescue? (diagram: services backed by a NoSQL database and
      replica, fed by a data loader)
      Problems: some operational overhead, some performance, some scalability
  17. Remote Store Retrieval Latency (client -> network -> remote store)
      TCP request:           0.5 ms
      lookup/write response: 0.5 ms
      TCP response:          0.5 ms
      read/parse response:   0.25 ms
      Total time to retrieve a single value: 1.75 ms
  18. Total time to retrieve a single value:
        from remote store: 1.75 ms
        from memory: 0.001 ms (10 main memory reads)
      Sequential access of 1 million random keys:
        from remote store: ~30 minutes
        from memory: ~1 second
  19. The Truth About Databases
  20. “What I'm going to call the hot data cliff: as the size of your hot data
      set (data frequently read at sustained rates above disk I/O capacity)
      approaches available memory, write operation bursts that exceed disk
      write I/O capacity can create a thrashing death spiral, where hot disk
      pages that MongoDB desperately needs are evicted from disk cache by the
      OS as it consumes more buffer space to hold the writes in memory.”
      (on MongoDB)
  21. “Redis is an in-memory but persistent on disk database, so it represents
      a different trade off where very high write and read speed is achieved
      with the limitation of data sets that can't be larger than memory.”
      (on Redis)
  22. They are fast if everything fits into memory.
  23. Can you keep it in memory yourself? (diagram: each service holds a full
      data cache populated by its own loader from the BIG DATABASE)
      Benefits: operational relief, performance gain, scales infinitely
      Problem: consistency
  24. Fixing Consistency
      1. Deployment "cells"
      2. Sticky user sessions
      (diagram: a deployment cell of webapps and services, each with a full
      data cache loader, behind a load balancer)
  25. (image slide)
  26. "Programmers waste enormous amounts of time thinking about, or worrying
      about, the speed of noncritical parts of their programs, and these
      attempts at efficiency actually have a strong negative impact when
      debugging and maintenance are considered. We should forget about small
      efficiencies, say about 97% of the time: premature optimization is the
      root of all evil. Yet we should not pass up our opportunities in that
      critical 3%." Donald Knuth
  27. How do you fit all that data in memory?
  28. The Answer
  29. Domain Model Design
      “Domain Layer (or Model Layer): Responsible for representing concepts of
      the business, information about the business situation, and business
      rules. State that reflects the business situation is controlled and used
      here, even though the technical details of storing it are delegated to
      the infrastructure. This layer is the heart of business software.”
      Eric Evans, Domain-Driven Design, 2003
  30. Domain Model Design Guidelines
      #1 Keep it immutable
      #2 Use independent hierarchies
      #3 Optimize data
  31. intern() your immutables (diagram: two object graphs with duplicate
      nodes collapse to shared instances after interning)
  32. // assumes a static import of java.util.Collections.synchronizedMap
      private final Map<Class<?>, Map<Object, WeakReference<Object>>> cache =
          new ConcurrentHashMap<Class<?>, Map<Object, WeakReference<Object>>>();

      public <T> T intern(T o) {
        if (o == null) return null;
        Class<?> c = o.getClass();
        Map<Object, WeakReference<Object>> m = cache.get(c);
        if (m == null)
          cache.put(c, m = synchronizedMap(new WeakHashMap<Object, WeakReference<Object>>()));
        WeakReference<Object> r = m.get(o);
        @SuppressWarnings("unchecked")
        T v = (r == null) ? null : (T) r.get();
        if (v == null) {
          v = o;
          m.put(v, new WeakReference<Object>(v));
        }
        return v;
      }
  33. Use Independent Hierarchies (diagram: one big Product object split into
      independent hierarchies keyed by productId: Product Summary, Offers,
      Specifications, Reviews, Description, Model History, Rumors, Product
      Info)
  34. Collection Optimization
  35. Leverage Primitive Keys/Values
      collection with 10,000 elements [0 .. 9,999]   size in memory
      java.util.ArrayList<Integer>                   200K
      java.util.HashSet<Integer>                     546K
      gnu.trove.list.array.TIntArrayList             40K
      gnu.trove.set.hash.TIntHashSet                 102K
      Trove ("High Performance Collections for Java")
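As a rough sketch of why primitive-backed collections win, here is a minimal growable int list in the spirit of Trove's TIntArrayList. This is illustrative only, not Trove's actual implementation: the point is that values live in a bare int[], so there is no boxed Integer object (with its ~16-byte header) per element.

```java
// Minimal primitive-backed list: stores ints in a raw array and grows
// by doubling, the same basic idea as gnu.trove.list.array.TIntArrayList.
class IntArrayList {
    private int[] data = new int[10];
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            // double the backing array when full
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    public int get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException();
        return data[index];
    }

    public int size() { return size; }
}
```

The per-element cost is exactly 4 bytes plus amortized growth slack, versus a pointer plus a boxed object in ArrayList<Integer>.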
  36. Optimize small immutable collections
      Collections with a small number of entries (up to ~20):

      class ImmutableMap<K, V> implements Map<K, V>, Serializable { ... }

      class MapN<K, V> extends ImmutableMap<K, V> {
        final K k1, k2, ..., kN;
        final V v1, v2, ..., vN;
        @Override public boolean containsKey(Object key) {
          if (eq(key, k1)) return true;
          if (eq(key, k2)) return true;
          ...
          return false;
        }
        ...
  37. Space Savings
      java.util.HashMap:      128 bytes + 32 bytes per entry
      compact immutable map:   24 bytes +  8 bytes per entry
  38. Numeric Data Optimization
  39. Price History Example
  40. Example: Price History
      Problem:
      - Store daily prices for 1M products, 2 offers per product
      - Average price history length per product: ~2 years
      Total price points: (1M + 2M) * 730 = ~2 billion
  41. Price History, First Attempt
      TreeMap<Date, Double>: 88 bytes per entry * 2 billion = ~180 GB
  42. Typical Shopping Price History (chart: price vs. days, starting near
      $100; the price changes at days 0, 20, 60, 70, 90, 100, 120, 121)
  43. Run Length Encoding
      a a a a a a b b b c c c c c c  ->  6 a 3 b 6 c
  44. Price History Optimization
      Encoding (short values):
      - positive: price (adjusted to scale)
      - negative: run length (precedes the price)
      - zero: unavailable
      Example: -20 100 -40 150 -10 140 -20 100 -10 0 -20 100 90 -9 80
      - Drop pennies
      - Store prices in primitive short (use a scale factor to represent
        prices greater than Short.MAX_VALUE)
      Memory: 15 * 2 + 16 (array) + 24 (start date) + 4 (scale factor)
            = 74 bytes
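A minimal decoder for this encoding might look like the following sketch (the class and method names are hypothetical, not the talk's actual code): a negative value is a run length that applies to the price that follows it, a standalone non-negative value is a single-day price, and 0 marks an unavailable day.

```java
// Decodes the run-length-encoded price history described above into
// one price per day (0 = unavailable). Illustrative sketch only.
class PriceRle {
    static short[] decode(short[] encoded) {
        java.util.List<Short> days = new java.util.ArrayList<>();
        int i = 0;
        while (i < encoded.length) {
            int run = 1;
            if (encoded[i] < 0) {       // negative value = run length marker
                run = -encoded[i];
                i++;
            }
            short price = encoded[i++]; // price (or 0 for unavailable)
            for (int d = 0; d < run; d++) days.add(price);
        }
        short[] out = new short[days.size()];
        for (int d = 0; d < out.length; d++) out[d] = days.get(d);
        return out;
    }
}
```

Decoding the slide's 15-short example yields 130 daily prices, matching the change points on the chart (100 for days 0-19, 150 for 20-59, and so on).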
  45. Savings
      Reduction compared to TreeMap<Date, Double>: 155x
      Estimated memory for 2 billion price points: 1.2 GB << 180 GB
  46. Price History Model

      public class PriceHistory {
        private final Date startDate; // or use org.joda.time.LocalDate
        private final short[] encoded;
        private final int scaleFactor;

        public PriceHistory(SortedMap<Date, Double> prices) { ... } // encode
        public SortedMap<Date, Double> getPricesByDate() { ... }    // decode
        public Date getStartDate() { return startDate; }

        // Computations below are implemented directly against the encoded data
        public Date getEndDate() { ... }
        public Double getMinPrice() { ... }
        public int getNumChanges(double minChangeAmt, double minChangePct, boolean abs) { ... }
        public PriceHistory trim(Date startDate, Date endDate) { ... }
        public PriceHistory interpolate() { ... }
  47. Know Your Data
  48. Compress Text
  49. String Compression: byte arrays
      - Use the minimum character set encoding

      static Charset UTF8 = Charset.forName("UTF-8");
      String s = "The quick brown fox jumps over the lazy dog"; // 136 bytes
      byte[] b = "The quick brown fox jumps over the lazy dog".getBytes(UTF8); // 64 bytes
      String s1 = "Hello"; // 5 chars, 64 bytes
      byte[] b1 = "Hello".getBytes(UTF8); // 24 bytes

      byte[] toBytes(String s) { return s == null ? null : s.getBytes(UTF8); }
      String toString(byte[] b) { return b == null ? null : new String(b, UTF8); }
  50. String Compression: shared prefix
      - Great for URLs

      public class PrefixedString {
        private PrefixedString prefix;
        private byte[] suffix;

        . . .

        @Override public int hashCode() { ... }
        @Override public boolean equals(Object o) { ... }
      }
  51. String Compression: short alphanumeric case-insensitive strings

      public abstract class AlphaNumericString {
        public static AlphaNumericString make(String s) {
          try { return new Numeric(Long.parseLong(s, Character.MAX_RADIX)); }
          catch (NumberFormatException e) { return new Alpha(s.getBytes(UTF8)); }
        }
        protected abstract String value();
        @Override public String toString() { return value(); }

        private static class Numeric extends AlphaNumericString {
          long value;
          Numeric(long value) { this.value = value; }
          @Override protected String value() { return Long.toString(value, Character.MAX_RADIX); }
          @Override public int hashCode() { ... }
          @Override public boolean equals(Object o) { ... }
        }

        private static class Alpha extends AlphaNumericString {
          byte[] value;
          Alpha(byte[] value) { this.value = value; }
          @Override protected String value() { return new String(value, UTF8); }
          @Override public int hashCode() { ... }
          @Override public boolean equals(Object o) { ... }
        }
      }
  52. String Compression: Large Strings
      Just convert to byte[] first, then compress (gzip, bzip2).
      Become the master of your strings!
  53. JVM Tuning
  54. JVM Tuning
      - Make sure to use compressed pointers (-XX:+UseCompressedOops)
      - Use a low-pause GC (Concurrent Mark Sweep, G1)
      - Overprovision heap by ~30%
      - Adjust generation sizes/ratios
  55. JVM Tuning
      - Print garbage collection details
      - If GC pauses are still prohibitive, consider partitioning
  56. Cache Loading (diagram: each service's full data cache loader pulls
      "cooked" datasets from a reliable file store such as S3)
  57. - Keep the format simple (CSV, JSON)
      - Poll for updates
      - Final datasets should be compressed and stored (e.g., S3)
      - Poll frequency == data inconsistency threshold
  58. Cache Loading: Time Sensitivity
  59. Cache Loading: low time sensitivity data
      /tax-rates
        /date=2012-05-01  tax-rates.2012-05-01.csv.gz
        /date=2012-06-01  tax-rates.2012-06-01.csv.gz
        /date=2012-07-01  tax-rates.2012-07-01.csv.gz
  60. Cache Loading: medium/high time sensitivity
      /prices
        /full
          /date=2012-07-01  price-obs.2012-07-01.csv.gz
          /date=2012-07-02
        /inc
          /date=2012-07-01  2012-07-01T00-10-00.csv.gz
                            2012-07-01T00-20-00.csv.gz
  61. Cache Loading Strategy: Swap
      The cache is immutable, so no locking is required. Works well for
      infrequently updated datasets, and for datasets that need to be
      refreshed whole with each update.
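The swap strategy can be sketched with a single AtomicReference holding an immutable snapshot. This is a minimal illustration of the idea, not the talk's actual code, and it assumes Java 9+ for Map.of/Map.copyOf:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Readers always see a complete, immutable snapshot; a refresh builds a
// whole new map off to the side and publishes it in one atomic step.
// No locks are needed on the read path.
class SwappableCache<K, V> {
    private final AtomicReference<Map<K, V>> current =
            new AtomicReference<>(Map.of());

    public V get(K key) {
        return current.get().get(key);          // lock-free read
    }

    /** Called by the cache loader after cooking a full new dataset. */
    public void swap(Map<K, V> freshSnapshot) {
        current.set(Map.copyOf(freshSnapshot)); // atomic publish
    }
}
```

Because the old snapshot stays intact until the last reader drops it, in-flight requests never observe a half-loaded dataset.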
  62. - Deletions can be tricky
      - Avoid full synchronization
      - Use one container per partition
  63. Concurrent Locking with a Trove Map

      public class LongCache<V> {
        private TLongObjectMap<V> map = new TLongObjectHashMap<V>();
        private ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private Lock r = lock.readLock(), w = lock.writeLock();

        public V get(long k) {
          r.lock(); try { return map.get(k); } finally { r.unlock(); }
        }
        public V update(long k, V v) {
          w.lock(); try { return map.put(k, v); } finally { w.unlock(); }
        }
        public V remove(long k) {
          w.lock(); try { return map.remove(k); } finally { w.unlock(); }
        }
      }
  64. - Periodically generate serialized data/state
      - Validate with CRC or hash
      - Keep local copies
  65. Dependent Caches (diagram: load balancer routes to service instances; a
      status aggregator checks caches A, B, C, and D via a health check
      servlet)
  66. Deployment Cell Status (diagram: within a deployment cell, status
      aggregators in the webapp and each service roll up to a cell status
      aggregator over HTTP or JMX; the load balancer uses the health check)
  67. Hierarchical Status Aggregation
  69. Data doesn't fit into the heap? Keep a smaller heap.
  70. Partitioning Decision Tree
      Does my data fit in a single VM?
        yes -> don't partition
        no  -> Can I partition statically?
          yes -> use fixed partitioning
          no  -> use dynamic partitioning
      (difficulty increases toward dynamic partitioning)
  71. Fixed Partitions
  72. Fixed Partition Assignment (diagram: a deployment cell in which each
      webapp hosts the full set of partitions p1-p4 behind the load balancer)
  73. Dynamic Partitioning
  74. Dynamic Partition Assignment
      - Each partition lives on at least 2 nodes: primary/secondary
      - Partition (set) determined via a simple computation
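The "simple computation" could be as little as a stable modulo over the key's hash, as in this illustrative sketch (not necessarily the exact scheme used in the talk):

```java
// Maps any key to one of N fixed partitions. Math.floorMod keeps the
// result non-negative even when hashCode() is negative, which the plain
// % operator would not.
class Partitioner {
    private final int numPartitions;

    Partitioner(int numPartitions) { this.numPartitions = numPartitions; }

    int partitionFor(Object key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

The same computation runs on every node, so any node can locate a key's partition without a lookup service; rebalancing then only requires agreeing on the partition-to-node assignment.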
  75. Member nodes operate with "active state" from the leader node. If the
      leader fails, a new leader is re-elected. (diagram: nodes A-F with a
      leader)
  76. Each member node learns its target state from the leader. All nodes load
      partitions to achieve their target state, then transition into an active
      state. (diagram: leader E distributing target state to nodes A-F)
  77. Dynamic Partitioning (diagram: primary and secondary copies of
      partitions p1-p9 spread across webapps behind load balancers)
      The extra network hop is back.
  78. Ad-hoc Cache Querying
  79. Ad-hoc Querying
      - Groovy
      - JRuby
      - MVEL
      - JoSQL (Java Object SQL)
  80. Organizing Queries: break up the query into
      - Extractor expressions (like SQL "SELECT" clause)
      - Filter (evaluates to boolean, like SQL "WHERE" clause)
      - Sorter (like SQL "ORDER BY" clause)
      - Limit (like SQL "LIMIT" clause)
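This decomposition maps naturally onto Java streams; a minimal sketch (names are illustrative, not the talk's actual API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Applies the four query pieces in order against an in-memory collection:
// filter (WHERE), sorter (ORDER BY), limit (LIMIT), extractor (SELECT).
class QueryRunner {
    static <T, R> List<R> run(List<T> data,
                              Predicate<T> filter,
                              Comparator<T> sorter,
                              long limit,
                              Function<T, R> extractor) {
        return data.stream()
                .filter(filter)
                .sorted(sorter)
                .limit(limit)
                .map(extractor)
                .collect(Collectors.toList());
    }
}
```

Keeping the pieces separate is what lets the next slide push filter/sort/limit down into each partition while extraction happens once at the end.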
  81. Parallel (fan-out) Query Execution
      Prune partitions, then in each partition (1..N) apply filter, sort, and
      limit to produce intermediate results; finally sort, limit, and extract
      the intermediate results to produce the final results.
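A sketch of the per-partition filter/sort/limit followed by a final reduction, here specialized to a top-k query over integer partitions (illustrative only, using parallel streams; the original system's executor is not shown in the deck):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Each partition independently sorts and truncates to the limit; the
// coordinator then re-sorts only the small intermediate results and
// applies the final limit. This is correct because every element of the
// global top-k must appear in some partition's local top-k.
class FanOutQuery {
    static List<Integer> topK(List<List<Integer>> partitions, int k) {
        List<Integer> intermediate = partitions.parallelStream()
                .flatMap(p -> p.stream()
                        .sorted(Comparator.reverseOrder()) // per-partition sort
                        .limit(k))                         // per-partition limit
                .collect(Collectors.toList());
        return intermediate.stream()                       // final reduction
                .sorted(Comparator.reverseOrder())
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Only N*k intermediate values cross partition boundaries instead of the full dataset, which is what makes the multi-level reduction on the next slides cheap.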
  82. Parallel (fan-out) Query: Multi-level Reduction (diagram: partitions
      p1..pN reduced through intermediate reducers r1..r4)
  83. Optimizing for Network Topology Using Multi-level Reduction (diagram:
      partitions p1-p18 grouped into locations 1-3; reductions happen over
      faster/higher-capacity links before crossing slower/lower-capacity
      links)
  84. JoSQL Example (from the JoSQL site):

      SELECT *
      FROM
      WHERE  name $LIKE "%.html"
      AND    toDate (lastModified)
             BETWEEN toDate (01-12-2004)
             AND toDate (31-12-2004)
  85. JoSQL Example: find the average number of significant price changes and
      the average price history by product category, for a specific seller and
      time period (first half of 2012)

      select @priceChangeRate, product, @avgPriceHistory
      from products
      where offers.lowestPrice >= 100
        and offer(offers, 12).priceHistory.startDate <= date(2012-01-01)
        and offer(offers, 12).priceHistory.endDate >= date(2012-06-01)
        and decideRank.rank < 200
      group by category(product.primaryCategoryId).name
      group by order 1
      execute on group_by_results
        sum(getNumberOfDailyPriceChanges(
              trimPriceHistory(offer(offers, 12).priceHistory, 2012-01-01, 2012-06-01),
              offer(offers, 12).priceHistory.startDate, 15, 5)) / count(1)
          as priceChangeRate,
        avgPriceHistory(trimPriceHistory(offer(offers, 12).priceHistory,
              2012-01-01, 2012-06-01)) as avgPriceHistory
  86. Total time to execute where clause on all objects: 6010.0 ms
      where clause average over 3,393,731 objects: 0.0011 ms
      Total time to execute 2 expressions on GROUP_BY_RESULTS objects: 3.0 ms
      Group operation took: 1.0 ms
      Group column collection and sort: 46.0 ms
  87. Monitoring Cache Readiness
      - rolled-up status
      - aggregate metrics:
        - staleness
        - total counts
        - counts by partition
        - counts by attribute
        - average cardinality of one-to-many relationships
  89. The End
      Much of the credit for this talk goes to Decide's Chief Architect, Leon
      Stein, for developing the technology and creating the content. Thank
      you, Leon.