From cache to in-memory data grid. Introduction to Hazelcast.

This presentation:
* covers the basics of caching and popular cache types
* explains the evolution from a simple cache to a distributed cache, and from a distributed cache to an IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or as a promotion of Hazelcast as the best solution

Published in: Engineering, Technology

  1. 1. From cache to in-memory data grid. Introduction to Hazelcast. By Taras Matyashovsky
  2. 2. Introduction
  3. 3. About me • Software engineer/TL • Worked for outsourcing companies, product companies, and tried my hand at startups/freelancing • 7+ years of production Java experience • Fan of Agile methodologies, CSM
  4. 4. What? • This presentation: • covers the basics of caching and popular cache types • explains the evolution from a simple cache to a distributed cache, and from a distributed cache to an IMDG • does not describe the use of NoSQL solutions for caching • is not intended as a product comparison or as a promotion of Hazelcast as the best solution
  5. 5. Why? • to expand horizons regarding modern distributed architectures and solutions • to share experience from my current project, where Infinispan was replaced with Hazelcast as the in-memory distributed cache solution
  6. 6. Agenda 1st part: • Why software caches? • Common cache attributes • Cache access patterns • Cache types • Distributed cache vs. IMDG
  7. 7. Agenda 2nd part: • Hazelcast in a nutshell • Hazelcast configuration • Live demo sessions • in-memory distributed cache • write-through cache with Postgres as storage • search in distributed cache • parallel processing using executor service and entry processor • Infinispan vs. Hazelcast • Best practices and personal recommendations
  8. 8. Caching Basics
  9. 9. Why Software Caching? • application performance: • many concurrent users • time and cost overhead of accessing application data stored in an RDBMS or file system • database-access bottlenecks caused by too many simultaneous requests
  10. 10. So Software Caches • improve response times by reducing data access latency • offload persistent storage by reducing the number of trips to data sources • avoid the cost of repeatedly creating objects • share objects between threads • only work for IO-bound applications
  11. 11. So Software Caches are essential for modern high-load applications
  12. 12. But • memory size • is limited • can become unacceptably huge • synchronization complexity • consistency between the cached data state and data source’s original data • durability • correct cache invalidation • scalability
  13. 13. Common Cache Attributes • maximum size, e.g. quantity of entries • cache algorithm used for invalidation/eviction, e.g.: • least recently used (LRU) • least frequently used (LFU) • FIFO • eviction percentage • expiration, e.g.: • time-to-live (TTL) • absolute/relative time-based expiration
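
The attributes on the slide above (maximum size plus an LRU eviction algorithm) can be illustrated with a minimal sketch built on the JDK's `LinkedHashMap`. The class name `LruCache` is hypothetical and is not taken from the presentation or from any product it mentions; expiration (TTL) is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class: a size-bounded cache with LRU eviction.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries; // "maximum size" attribute, as a quantity of entries

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after every insertion; returning true evicts the
        // least recently used entry once the size limit is exceeded.
        return size() > maxEntries;
    }
}
```

With a limit of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was touched more recently.
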
  14. 14. Cache Access Patterns • cache aside • read-through • refresh-ahead • write-through • write-behind
  15. 15. Cache Aside Pattern • the application is responsible for reading from and writing to the storage; the cache doesn't interact with the storage at all • the cache is “kept aside” as a faster and more scalable in-memory data store (diagram: Client, Cache, Storage)
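
A minimal sketch of the cache-aside pattern, assuming a plain `Map` stands in for the storage. All names here (`CacheAsideRepository`, `find`, `save`, `storageReads`) are hypothetical; the point is only that the application code, not the cache, talks to the storage.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration class; the "storage" map stands in for a database.
class CacheAsideRepository {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> storage;
    int storageReads = 0; // counts trips to storage (illustration only, not thread-safe)

    CacheAsideRepository(Map<String, String> storage) {
        this.storage = storage;
    }

    String find(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                 // cache hit: storage is not touched
        }
        storageReads++;
        String value = storage.get(key);   // the application reads the storage itself
        if (value != null) {
            cache.put(key, value);         // ...and populates the cache itself
        }
        return value;
    }

    void save(String key, String value) {
        storage.put(key, value);           // the application writes the storage itself
        cache.remove(key);                 // ...and invalidates the stale cache entry
    }
}
```
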
  16. 16. Read-Through/Write-Through • the application treats the cache as the main data store and reads/writes data from/to it • the cache is responsible for reading this data from and writing it to the database (diagram: Client, Cache, Storage)
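
For contrast with cache-aside, here is a minimal read-through/write-through sketch in which the cache owns the storage access. The class `ThroughCache` and its `loader`/`writer` parameters are hypothetical names chosen for this illustration, not an API of any product mentioned in the slides.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical illustration class: the cache, not the application, talks to storage.
class ThroughCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // reads one value from storage
    private final BiConsumer<K, V> writer; // writes one value to storage

    ThroughCache(Function<K, V> loader, BiConsumer<K, V> writer) {
        this.loader = loader;
        this.writer = writer;
    }

    V get(K key) {
        // Read-through: on a miss the cache itself invokes the loader.
        // If the loader returns null, nothing is cached and null is returned.
        return entries.computeIfAbsent(key, loader);
    }

    void put(K key, V value) {
        // Write-through: storage is updated synchronously, before put() returns.
        writer.accept(key, value);
        entries.put(key, value);
    }
}
```

The application only calls `get` and `put`; which storage sits behind the cache is decided once, when the cache is constructed.
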
  17. 17. Write-Behind Pattern • modified cache entries are asynchronously written to the storage after a configurable delay (diagram: Client, Cache, Storage)
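
A minimal write-behind sketch. To keep it deterministic, the asynchronous part is reduced to an explicit `flush()` call; in a real setup a scheduler would invoke it after the configured delay. The class `WriteBehindCache` and its methods are hypothetical names for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class: writes are queued and pushed to storage later.
class WriteBehindCache<K, V> {
    private final Map<K, V> entries = new LinkedHashMap<>();
    private final Map<K, V> dirty = new LinkedHashMap<>(); // pending writes, coalesced per key
    private final Map<K, V> storage;                        // stands in for a database

    WriteBehindCache(Map<K, V> storage) {
        this.storage = storage;
    }

    synchronized void put(K key, V value) {
        entries.put(key, value); // visible to readers of the cache immediately
        dirty.put(key, value);   // storage is NOT touched yet
    }

    synchronized V get(K key) {
        return entries.get(key);
    }

    // Writes all pending modifications to storage and returns how many were written.
    synchronized int flush() {
        int written = dirty.size();
        storage.putAll(dirty);
        dirty.clear();
        return written;
    }
}
```

Two updates to the same key before a flush result in a single storage write, which is one source of the throughput gain. It also shows the risk: until `flush()` runs, the storage holds older data than the cache.
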
  18. 18. Refresh-Ahead Pattern • automatically and asynchronously reload (refresh) any recently accessed cache entry from the cache loader prior to its expiration (diagram: Client, Cache, Storage)
  19. 19. Cache Strategy Selection RT/WT vs. cache-aside: • RT/WT simplifies application code • cache-aside may have blocking behavior • cache-aside may be preferable when there are multiple cache updates triggered to the same storage from different cache servers
  20. 20. Cache Strategy Selection Write-through vs. write-behind: • write-behind caching may deliver considerably higher throughput and reduced latency compared to write-through caching • an implication of write-behind caching is that database updates occur outside of the cache transaction • a write-behind transaction can conflict with an external update
  21. 21. Cache Types
  22. 22. Cache Types • local cache • replicated cache • distributed cache • remote cache • near cache
  23. 23. Local Cache a cache that is local to (completely contained within) a particular cluster node
  24. 24. Local Cache Pros: • simplicity • performance • no serialization/deserialization overhead Cons: • not fault-tolerant • scalability
  25. 25. Local Cache Solutions: • EhCache • Google Guava • Infinispan local cache mode
  26. 26. Replicated Cache a cache that replicates its data to all cluster nodes
  27. 27. Get in Replicated Cache Each cluster node (JVM) accesses the data from its own memory, i.e. local read:
  28. 28. Put in Replicated Cache Pushing the new version of the data to all other cluster nodes:
  29. 29. Replicated Cache Pros: • best read performance • fault-tolerant • linear performance scalability for reads Cons: • poor write performance • additional network load • poor and limited scalability for writes • memory consumption
  30. 30. Replicated Cache Solutions: • open-source: • Infinispan • commercial: • Oracle Coherence • EhCache + Terracotta
  31. 31. Distributed Cache a cache that partitions its data among all cluster nodes
  32. 32. Get in Distributed Cache Access often must go over the network to another cluster node:
  33. 33. Put in Distributed Cache Resolving known limitation of replicated cache:
  34. 34. Put in Distributed Cache • the data is being sent to a primary cluster node and a backup cluster node if backup count is 1 • modifications to the cache are not considered complete until all backups have acknowledged receipt of the modification, i.e. slight performance penalty • such overhead guarantees that data consistency is maintained and no data is lost
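
How a key ends up on a primary node and a backup node can be sketched with a toy partition table. This is an assumption-laden simplification: real products typically hash the serialized key and maintain a partition table that is rebalanced when members join or leave. The class `PartitionTable`, its methods, and the partition count used in the example are hypothetical.

```java
import java.util.List;

// Hypothetical illustration class: maps a key to a partition, and a partition
// to a primary node plus one backup node (backup count of 1).
class PartitionTable {
    private final List<String> nodes;
    private final int partitionCount;

    PartitionTable(List<String> nodes, int partitionCount) {
        if (nodes.size() < 2) {
            throw new IllegalArgumentException("need at least 2 nodes for a backup");
        }
        this.nodes = nodes;
        this.partitionCount = partitionCount;
    }

    int partitionOf(Object key) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    String primaryOf(Object key) {
        return nodes.get(partitionOf(key) % nodes.size());
    }

    String backupOf(Object key) {
        // The backup lives on the next node, so it never shares a node with the primary.
        return nodes.get((partitionOf(key) + 1) % nodes.size());
    }
}
```

With nodes `n1`, `n2`, `n3`, every key has a stable partition, a primary owner, and a different backup owner. If the primary fails, the backup copy is the one that gets promoted.
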
  35. 35. Failover in Distributed Cache Failover involves promoting backup data to be primary storage:
  36. 36. Local Storage in Distributed Cache Certain cluster nodes can be configured to store data, and others to be configured to not store data:
  37. 37. Distributed Cache Pros: • linear performance scalability for reads and writes • fault-tolerant Cons: • increased latency of reads (due to network round-trip and serialization/deserialization expenses)
  38. 38. Distributed Cache Summary Distributed in-memory key/value stores support a simple set of “put” and “get” operations and, optionally, read-through and write-through behavior for reading and writing values from and to underlying disk-based storage such as an RDBMS
  39. 39. Distributed Cache Summary Depending on the product additional features like: • ACID transactions • eviction policies • replication vs. partitioning • active backups also became available as the products matured
  40. 40. Distributed Cache Solutions: • open-source: • Infinispan • Hazelcast • NoSQL storages, e.g. Redis, Cassandra, MongoDB, etc. • commercial: • Oracle Coherence • Terracotta
  41. 41. Remote Cache a cache that is located remotely and is accessed by one or more clients
  42. 42. Remote Cache The majority of existing distributed/replicated cache solutions support 2 modes: • embedded mode • when the cache instance is started within the same JVM as your application • client-server mode • when a remote cache instance is started and clients connect to it using a variety of different protocols
  43. 43. Remote Cache Solutions: • Infinispan remote cache mode • Hazelcast client-server mode • Memcached
  44. 44. Near Cache a hybrid cache; it typically fronts a distributed cache or a remote cache with a local cache
  45. 45. Get in Near Cache When an object is fetched from a remote node, it is put into the local cache, so subsequent requests are handled by the local node retrieving from the local cache:
  46. 46. Near Cache Pros: • it is best used for read-only data Cons: • increases memory usage since the near cache items need to be stored in the memory of the member • reduces consistency
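
The near cache trade-off (fewer remote calls, weaker consistency) can be shown with a minimal sketch. The class `NearCache`, the `remoteGet` function standing in for a network call, and the `remoteCalls` counter are hypothetical names for illustration, not part of any product's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration class: a local map in front of a remote cache.
class NearCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteGet; // stands in for a network round-trip
    int remoteCalls = 0;                    // illustration only, not thread-safe

    NearCache(Function<K, V> remoteGet) {
        this.remoteGet = remoteGet;
    }

    V get(K key) {
        V value = local.get(key);
        if (value != null) {
            return value;                   // served locally, no network hop
        }
        remoteCalls++;
        value = remoteGet.apply(key);       // fetched from the remote node
        if (value != null) {
            local.put(key, value);          // kept locally for subsequent reads
        }
        return value;
    }

    void invalidate(K key) {
        local.remove(key);                  // next read goes to the remote node again
    }
}
```

If the remote value changes after the first read, the near cache keeps returning the old value until the entry is invalidated, which is why the slide recommends it for read-only data.
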
  47. 47. In-memory Data Grid
  48. 48. In-memory Data Grid (IMDG)
  49. 49. In-memory Data Grid In-memory distributed cache plus: • ability to support co-location of computations with data in a distributed context and to move computation to data • distributed MPP processing based on standard SQL and/or Map/Reduce, which makes it possible to compute effectively over data stored in memory across the cluster
  50. 50. IMDC vs. IMDG • in-memory distributed caches were developed in response to a growing need for data high-availability • in-memory data grids were developed to respond to the growing complexities of data processing
  51. 51. IMDG in a nutshell Adding distributed SQL and/or MapReduce type processing required a complete re-thinking of distributed caches, as focus has shifted from pure data management to hybrid data and compute management
  52. 52. In-memory Data Grid Solutions
  53. 53. Hazelcast
  54. 54. Hazelcast The leading open-source in-memory data grid; a free alternative to proprietary solutions such as Oracle Coherence, VMware Pivotal GemFire and Software AG Terracotta
  55. 55. Hazelcast Use-Cases • scale your application • share data across cluster • partition your data • balance the load • send/receive messages • process in parallel on many JVMs, i.e. MPP
  56. 56. Hazelcast Features • dynamic clustering, backup, discovery, fail-over • distributed map, queue, set, list, lock, semaphore, topic, executor service, etc. • transaction support • map/reduce API • Java client for accessing the cluster remotely
  57. 57. Hazelcast Configuration • programmatic configuration • XML configuration • Spring configuration Nuance: It is very important that the configuration on all members in the cluster is exactly the same, regardless of whether you use the XML-based configuration or the programmatic configuration.
  58. 58. Sample Application
  59. 59. Live Demo “Configuration”
  60. 60. Sample Application Technologies: • Spring Boot 1.0.1 • Hazelcast 3.2 • Postgres 9.3 Application: • RESTful web service to get/put data from/to cache • RESTful web service to execute tasks in the cluster • one instance of Hazelcast per application * Some samples are not optimal and were created just to demonstrate usage of the existing Hazelcast API
  61. 61. Global Hazelcast Configuration The global Hazelcast configuration is defined in a separate config in the common module. It contains a skeleton for the future Hazelcast instance as well as global configuration settings: • instance configuration skeleton • common properties • group name and password • TCP based network configuration • join config • multicast and TCP/IP config • default distributed map configuration skeleton
  62. 62. Hazelcast Instance Each module that uses Hazelcast for distributed cache should have its own separate Hazelcast instance. The “Hazelcast Instance” is a factory for creating individual cache objects. Each cache has a name and potentially distinct configuration settings (expiration, eviction, replication, and more). Multiple instances can live within the same JVM.
  63. 63. Hazelcast Cluster Group Groups are used in order to have multiple isolated clusters on the same network instead of a single cluster. A JVM can host multiple Hazelcast instances (nodes). Each node can participate in only one group: it joins its own group and does not interfere with others. To achieve this, the group name and group password configuration properties are used.
  64. 64. Hazelcast Network Config In our environment the multicast mechanism for joining the cluster is not supported, so only the TCP/IP cluster approach will be used. In this case there should be one or more well-known members to connect to.
  65. 65. Live Demo “Map Store”
  66. 66. Hazelcast Map Store • useful for reading and writing map entries from and to an external data source • one instance per map per node will be created • word of caution: the map store should NOT call distributed map operations, otherwise you might run into deadlocks
  67. 67. Hazelcast Map Store • map pre-population via the loadAllKeys method, which returns the set of all “hot” keys that need to be loaded for the partitions owned by the member • write-through vs. write-behind using the “write-delay-seconds” configuration (0 for write-through, greater than 0 for write-behind) • MapLoaderLifecycleSupport to be notified of lifecycle events, i.e. init and destroy
  68. 68. Live Demo “Executor Service”
  69. 69. Hazelcast Executor Service • extends java.util.concurrent.ExecutorService, but is designed to be used in a distributed environment • scaling up via thread pool size • scaling out is automatic via the addition of new Hazelcast instances
  70. 70. Hazelcast Executor Service • provides different ways to route tasks: • any member • specific member • the member hosting a specific key • all or subset of members • supports execution callback
  71. 71. Hazelcast Executor Service Drawbacks: • work-queue has no high availability: • each member will create local ThreadPoolExecutors with ordinary work-queues that do the real work but are not backed up by Hazelcast • work-queue is not partitioned: • it could be that one member has a lot of unprocessed work while another is idle • no customizable load balancing
  72. 72. Hazelcast Features More useful features: • entry listener • transactions support, e.g. local, distributed • map reduce API out-of-the-box • custom serialization/deserialization mechanism • distributed topic • clients
  73. 73. Hazelcast Missing Features Missing useful features: • update configuration in running cluster • load balancing for executor service
  74. 74. Infinispan vs. Hazelcast
  75. 75. Infinispan vs. Hazelcast: Pros • Infinispan: • backed by a relatively large company for use in largely distributed environments (JBoss) • has been in active use for several years • well-written documentation • a lot of examples of different configurations as well as solutions to common problems • Hazelcast: • easy setup • more performant than Infinispan • simple node/cluster discovery mechanism • relies on only 1 jar to be included on the classpath • brief documentation complemented by simple code samples
  76. 76. Infinispan vs. Hazelcast: Cons • Infinispan: • relies on JGroups, which has proven to be buggy, especially under high load • configuration can be overly complex • ~9 jars are needed in order to get Infinispan up and running • code appears very complex and hard to debug/trace • Hazelcast: • backed by a startup based in Palo Alto and Turkey, which just received Series A 2.5 M funding from Bain Capital Ventures • customization points are fairly limited • some exceptions can be difficult to diagnose due to poorly written exception messages • still quite buggy
  77. 77. Hazelcast Summary
  78. 78. Best Practices • each Hazelcast instance should have its own unique instance name • each Hazelcast instance should have its own unique group name and password • each Hazelcast instance should start on a separate port according to predefined ranges
  79. 79. Personal Recommendations • use XML configuration in production, but don’t use the spring:hz schema. Our Spring-based “lego bricks” approach for building the resulting Hazelcast instance is quite decent. • don’t use Hazelcast for local caches, as it was never designed for that purpose and always performs serialization/deserialization • don’t use library-specific classes; use common collections, e.g. ConcurrentMap, and you will be able to replace the underlying cache solution easily
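
The last recommendation can be sketched as follows: application code depends only on `java.util.concurrent.ConcurrentMap`, so the backing implementation can be swapped. The class `SessionRegistry` and its methods are hypothetical. Hazelcast's `IMap` extends `ConcurrentMap`, so a map obtained from a Hazelcast instance could be passed in where this sketch uses a plain `ConcurrentHashMap`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration class: knows nothing about the cache product in use.
class SessionRegistry {
    private final ConcurrentMap<String, String> sessions;

    // Any ConcurrentMap works: a local ConcurrentHashMap in tests,
    // a distributed map supplied by the cache library in production.
    SessionRegistry(ConcurrentMap<String, String> sessions) {
        this.sessions = sessions;
    }

    // Returns true only for the first registration of a session id.
    boolean register(String sessionId, String user) {
        return sessions.putIfAbsent(sessionId, user) == null;
    }

    String userOf(String sessionId) {
        return sessions.get(sessionId);
    }

    // Convenience factory for local use and tests.
    static SessionRegistry local() {
        return new SessionRegistry(new ConcurrentHashMap<String, String>());
    }
}
```

Only the place that constructs `SessionRegistry` needs to know which cache solution is in use.
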
  80. 80. Hazelcast Drawbacks • still quite buggy • poor documentation for more complex cases • enterprise edition costs money, but includes: • elastic memory • JAAS security • .NET and C++ clients
  81. 81. Q/A?
  82. 82. Thank you! by Taras Matyashovsky
  83. 83. References • http://docs.oracle.com/cd/E18686_01/coh.37/e18677/cache_intro.htm • http://coherence.oracle.com/display/COH31UG/Read-Through,+Write-Through,+Refresh-Ahead+and+Write-Behind+Caching • http://blog.tekmindsolutions.com/oracle-coherence-diffrence-between-replicated-cache-vs-partitioneddistributed-cache/ • http://www.slideshare.net/MaxAlexejev/from-distributed-caches-to-inmemory-data-grids • http://www.slideshare.net/jaxlondon2012/clustering-your-application-with-hazelcast • http://www.gridgain.com/blog/fyi/cache-data-grid-database/ • http://gridgaintech.wordpress.com/2013/10/19/distributed-caching-is-dead-long-live/ • http://www.hazelcast.com/resources/the-book-of-hazelcast/ • https://labs.consol.de/java-caches/part-3-3-peer-to-peer-with-hazelcast/ • http://hazelcast.com/resources/thinking-distributed-the-hazelcast-way/ • https://github.com/tmatyashovsky/hazelcast-samples/
