Apache Geode
Performance is key. Consistency is a must.
Yogesh..BG
11th Mar 2016
Agenda
• Introduction
• Applications
• Architecture
• Features
• Performance
• Comparison with other products
10/25/2016 Confidential 2
Introduction
• Geode began as GemFire, built by GemStone Systems, a company first known for its Smalltalk products.
• First deployed as the data engine for Wall Street trading platforms in the financial sector.
• Low-latency, high-concurrency data management system.
• In-memory data management platform.
Applications
• GIRE (Rapipago), a leading financial company in Argentina
• 19 million transactions per month
• Southwest Airlines
• Southwest.com is the world’s largest airline website by number of visitors.
• SBI, China CITIC Bank, Philips, BMW, Union Bank, Allstate
Architecture
Elements are:
• Cache
• Region – Local, replicated & partitioned
• Locators
• Functions
• Listeners
Figure: Geode Cluster. Cache1: region1, R:region3, R:region2, region4. Cache2: R:region1, region3, region2, R:region4.
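The difference between the replicated and partitioned regions listed above can be sketched in a few lines. This is a minimal illustration, not Geode's API: the Member class, both put functions, and the CRC32-based placement are assumptions made for the example.

```python
import zlib


class Member:
    """Hypothetical cache member holding a local key/value store for one region."""

    def __init__(self, name):
        self.name = name
        self.data = {}


def put_replicated(members, key, value):
    # Replicated region: every member keeps an identical copy of the entry.
    for member in members:
        member.data[key] = value


def put_partitioned(members, key, value):
    # Partitioned region: each entry lives on one member, picked here by a
    # deterministic hash of the key (an assumption for this sketch).
    owner = members[zlib.crc32(key.encode("utf-8")) % len(members)]
    owner.data[key] = value
    return owner


replicated = [Member("cache1"), Member("cache2")]
partitioned = [Member("cache1"), Member("cache2")]
for k in ("a", "b", "c", "d"):
    put_replicated(replicated, k, k.upper())
    put_partitioned(partitioned, k, k.upper())

for m in replicated:
    print("replicated ", m.name, sorted(m.data))
for m in partitioned:
    print("partitioned", m.name, sorted(m.data))
```

Replication trades memory for local reads on every member; partitioning spreads the data so that the total capacity grows with the number of members.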
Features
• Distributed cloud architecture
• Pools memory, CPU, network resources, and optionally local disk
• Uses dynamic replication and data partitioning techniques
• Reliable asynchronous event notifications
• Thousands of concurrent distributed transactions (JTA compliant)
• Shared-nothing persistence architecture
Features
• Asynchronous and synchronous cache update propagation. Delta propagation.
• Horizontally scalable
• Querying and Indexing
• Super fast write-ahead-logging (WAL) persistence
• Compression, eviction and expiration of data
• User functions
Features
• HDFS Store – analytics job
• Rebalancing
• Integrated security: DATA_READ, DATA_WRITE, MONITOR, ADMIN (HTTP/HTTPS authentication for REST)
• JVSD – for analyzing performance issues
• Off-heap memory
• REST APIs
Internals of Geode
• Optimized caching layer with minimal thread and process switches.
• Highly concurrent data structures to minimize contention points.
• Servers manage object graphs in serialized form, which reduces GC pressure.
• Batched operations to the database.
• Uses TCP/IP, UDP unicast, and UDP multicast for member communication.
• Serialization
How to use?
Figure: Client, Insert Person(UID, name, age), Replicate. Buckets shown: Bucket 1, Bucket 2, Bucket 3, Bucket 2, Bucket 1, Bucket x.
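One way to picture the insert-and-replicate flow on this slide is the sketch below. It is not Geode code: the bucket count, the CRC32 hash, the round-robin placement of copies, and all function names are assumptions chosen for illustration. Only the Person(UID, name, age) shape and the limit of 3 redundant copies come from the slides.

```python
import zlib
from dataclasses import dataclass

NUM_BUCKETS = 8        # hypothetical bucket count, kept small for readability
REDUNDANT_COPIES = 1   # the slides state Geode allows up to 3 redundant copies


@dataclass(frozen=True)
class Person:
    uid: str
    name: str
    age: int


def bucket_for(uid):
    # Deterministic hash of the key, reduced to a bucket id.
    return zlib.crc32(uid.encode("utf-8")) % NUM_BUCKETS


def hosts_for(bucket, members):
    # Primary host plus redundant hosts, assigned round robin (an assumption).
    copies = min(REDUNDANT_COPIES + 1, len(members))
    return [members[(bucket + i) % len(members)] for i in range(copies)]


def insert(person, members, storage):
    # Write the entry into the same bucket on every host that carries a copy.
    bucket = bucket_for(person.uid)
    hosts = hosts_for(bucket, members)
    for host in hosts:
        storage.setdefault(host, {}).setdefault(bucket, {})[person.uid] = person
    return bucket, hosts


members = ["server1", "server2", "server3"]
storage = {}
bucket, hosts = insert(Person("u-1", "Ann", 30), members, storage)
print("bucket", bucket, "stored on", hosts)
```

Because the bucket is derived from the key alone, any client can compute where an entry lives, and losing one host still leaves a copy of each of its buckets on another member.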
Query and index
Performance
• 10 times the read-and-write throughput of traditional disk-based databases.
• 4-40 times better application performance.
• 10 million concurrent users
• Proven latencies of 10-100 ms in the China railway system
Horizontal Scaling: Consistent Latency and CPU
Geode and Redis
• GemFireRedisServer understands the Redis protocol.
• Keys represent regions, and the namespace stays within OQL boundaries.
• Redis is a single-threaded server. It is not designed to benefit from multiple CPU cores.
• With Redis Cluster you can scale up the number of data structures, but not the data structures themselves (in Geode, partitioned regions).
• Replication: slaves lose their data when they start up and sync with the master. In Geode, you can have up to 3 redundant copies (for partitioned regions). Replication is asynchronous in Redis.
• Persistence: AOF keeps keys and values in the same file, so the entire file must be parsed on restart.
• Redis uses Sentinel for managing HA.
• Network partition
Condition    No pipelining and 1KB payloads    Pipelining 16 requests at a time
Operation    Redis        GemFireRedis         Redis          GemFireRedis
SET          100894.94    87627.06             109277.91      109109.55
GET          103504.02    102988.52            113583.70      113523.87
INCR         99662.14     92251.61             1061300.75     575023.25
SADD         99559.35     92254.50             989119.69      644678.81
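To make the table easier to read, the sketch below computes GemFireRedis throughput as a fraction of Redis throughput for each operation. It assumes the figures are throughput numbers where higher is better; the slides do not state the unit. The dictionary layout and function name are illustrative only.

```python
# (Redis, GemFireRedis) pairs copied from the table above.
results = {
    "no_pipelining_1kb": {
        "SET": (100894.94, 87627.06),
        "GET": (103504.02, 102988.52),
        "INCR": (99662.14, 92251.61),
        "SADD": (99559.35, 92254.50),
    },
    "pipelining_16": {
        "SET": (109277.91, 109109.55),
        "GET": (113583.70, 113523.87),
        "INCR": (1061300.75, 575023.25),
        "SADD": (989119.69, 644678.81),
    },
}


def relative_throughput(redis, gemfire_redis):
    # Fraction of the Redis figure that GemFireRedis reaches.
    return gemfire_redis / redis


ratios = {
    condition: {op: round(relative_throughput(*pair), 3) for op, pair in ops.items()}
    for condition, ops in results.items()
}

for condition, ops in ratios.items():
    for op, ratio in ops.items():
        print(f"{condition:18s} {op:5s} {ratio:.3f}")
```

On these figures GemFireRedis stays within about 13% of Redis without pipelining and is nearly level for pipelined SET and GET, while pipelined INCR and SADD reach roughly 54% and 65% of the Redis numbers.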
Geode and Cassandra
Geode and Hazelcast
https://hazelcast.com/resources/benchmark-pivotal-gemfire-vs-hazelcast/
Thank You
Yogesh..BG

Editor's Notes

  • #4 When to use:
      • Application developers who need extremely fast processing and consistent data usage
      • Thousands of concurrent transactions
      • Access to hundreds of terabytes of operational data in memory
      • Real-time analytics
      • Parallel compute grid
      • 10 million user transactions a day
    Products in this space:
      • VMware GemFire (Java)
      • Oracle Coherence (Java)
      • Huawei Cache SDK over Redis (Java)
      • Alachisoft NCache (.NET)
      • GigaSpaces XAP Elastic Caching Edition (Java)
      • Hazelcast (Java)
      • ScaleOut StateServer (.NET)
      • JBoss (Red Hat) Infinispan
  • #5 Deployments include China National Railways, which uses Geode to run railway ticketing for the entire country of China. A 10-node cluster manages 2 terabytes of "hot data" (one month of ticket data) in memory, with 10 backup nodes for high availability and elastic scale. Latencies are 10-100 milliseconds, and the system booked 2.5 million tickets per day on average.
    http://blog.gopivotal.com/case-studies-2/china-railway-corp-for-chinese-new-year-chunyun
    Indian Railways: 500,000 tickets daily; 40,000 concurrent users logged on to purchase tickets during tatkal hours; 200,000 concurrent users without impacting performance; stable performance booking approximately 150,000 tickets per hour.
    http://pivotal.io/big-data/case-study/distributed-in-memory-data-management-solution
    GIRE: billing, collection, payment, and transaction processing via the web, call center, and retail service centers.
    http://pivotal.io/big-data/case-study/enabling-real-time-transactions-and-analysis-gire
    Southwest Airlines:
    http://pivotal.io/agile/case-study/transforming-it-and-development-for-the-worlds-largest-airline-website-southwest-airlines
  • #6 Main concepts and components
    Caches are an abstraction that describes a node in a Geode distributed system. Application architects can arrange these nodes in peer-to-peer or client/server topologies.
    Within each cache, you define data regions. Data regions are analogous to tables in a relational database and manage data in a distributed fashion as name/value pairs. A replicated region stores identical copies of the data on each cache member of a distributed system. A partitioned region spreads the data among cache members. After the system is configured, client applications can access the distributed data in regions without knowledge of the underlying system architecture. You can define listeners to create notifications about when data has changed, and you can define expiration criteria to delete obsolete data in a region.
    For large production systems, Geode provides locators. Locators provide both discovery and load balancing services. You configure clients with a list of locator services, and the locators maintain a dynamic list of member servers. By default, Geode clients and servers use port 40404 and multicast to discover each other.
    Functions: can be MapReduce-style, stored procedures, data-parallel or member-oriented.
    Listeners: CacheListener/CacheWriter, AsyncEventListener.
  • #7 Cluster: failure detection, dynamic scalability, and network partition detection algorithms.
    Indexing:
      • RangeIndex - Uses a ConcurrentNavigableMap to map a key to a RegionEntryToValuesMap. A RegionEntryToValuesMap is a map that uses the entry as the key and a struct as the value. An example of the struct (notice the index iter naming associated with the struct and how the struct is a combination of portfolio and position):
        struct(index_iter1:Portfolio [ID=8 status=active type=type2 pkid=8 XYZ:Position secId=XYZ out=100.0 type=a id=7 mktValue=8.0, AOL:Position secId=AOL out=5000.0 type=a id=5 mktValue=6.0, APPL:Position secId=APPL out=6000.0 type=a id=6 mktValue=7.0, P1:Position secId=MSFT out=4000.0 type=a id=4 mktValue=5.0, P2:null ],index_iter2:Position secId=APPL out=6000.0 type=a id=6 mktValue=7.0)
      • CompactRangeIndex - A memory-efficient but slightly restricted version of RangeIndex. The engine prefers it over RangeIndex where possible. Uses a ConcurrentNavigableMap to store a key and value pair, where the value can be a RegionEntry, an IndexElemArray that contains RegionEntries, or an IndexConcurrentHashSet that contains RegionEntries. The ConcurrentNavigableMap is also passed a Comparator that allows indexes to match across different numeric types.
      • MapRangeIndex - Contains a map where the key is the map key and the value is a range index. For example, for portfolio.positions['key'] = 'IBM', the map range index would have a map with a key of 'key' whose value is a range index. That range index would have another map where the key is 'IBM' and the value is a RegionEntryToValuesMap. The RegionEntryToValuesMap would be a map where the key is the entry itself and the value is 'IBM'.
      • CompactMapRangeIndex - Similar to MapRangeIndex, but a map of CompactRangeIndexes instead. Its restrictions are similar to those between CompactRangeIndex and RangeIndex.
      • HashIndex - A memory-saving index that does not store key values; instead it extracts the key from the object and uses the hash of the key to slot the RegionEntry into an array.
      • PrimaryKeyIndex - A very lightweight index that hints to the query engine that it should do a region.get(key).
      • PartitionedIndex - A collection of indexes, one for each bucket of the region.
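The sorted-map idea behind the range indexes can be sketched as follows. This is a simplified stand-in, not Geode's implementation: it uses a sorted Python list with bisect in place of a ConcurrentNavigableMap, it is not thread-safe, and the class and method names are hypothetical. The sample values are the mktValue figures from the struct example.

```python
import bisect


class RangeIndexSketch:
    """Maps an indexed value to the region entry keys that carry that value."""

    def __init__(self):
        self._values = []    # sorted distinct indexed values
        self._entries = {}   # indexed value -> list of region entry keys

    def add(self, indexed_value, entry_key):
        # Keep the value list sorted so that range lookups can binary-search it.
        if indexed_value not in self._entries:
            bisect.insort(self._values, indexed_value)
            self._entries[indexed_value] = []
        self._entries[indexed_value].append(entry_key)

    def range(self, low, high):
        # Entry keys whose indexed value v satisfies low <= v <= high.
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return [key for v in self._values[start:end] for key in self._entries[v]]


index = RangeIndexSketch()
for sec_id, mkt_value in [("XYZ", 8.0), ("AOL", 6.0), ("APPL", 7.0), ("MSFT", 5.0)]:
    index.add(mkt_value, sec_id)

# Integer bounds match float values, loosely echoing the cross-numeric Comparator.
print(index.range(6, 7))
```

A range query touches only the slice of sorted values between the two bounds, which is the property a navigable map gives the query engine.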
  • #9 HDFS:
      • The shared-nothing architecture impacts full data scans for analytics.
      • Region data persisted on HDFS can be accessed directly from HDFS without impacting cluster performance.
      • Supports high-performance data reads.
      • Supports an HDFS data loader.
      • Eviction logic is based on LRU.
      • Secondary indexing is done on the data stored in the HDFS store.
      • Replicated HDFS regions.
      • Each write operation is cached in memory and in HDFS buffers simultaneously.
      • Offline access: a tool is provided to parse region data, even when Geode is offline.
    Rebalancing:
      • Currently manual.
      • The decision to rebalance is based on the data distribution and the maximum memory configuration of each node.
      • As Geode monitors the data size, it can also trigger rebalancing automatically.
      • Auto-balancing will redistribute the data load periodically and prevent conditions leading to failures.
      • The threshold at which the cluster is considered off balance will be configurable.
      • The impact of auto-rebalancing can be avoided by scheduling it.
      • Rebalancing can be turned off.
      • New node additions can be flagged for rebalancing.
  • #10 Geode partitions subscription management (interest registration and continuous queries) across server data stores, ensuring that a subscription is processed only once for all interested clients. The resulting improvements in CPU use and bandwidth utilization improve throughput and reduce latency for client subscriptions.
  • #14 http://www.slideshare.net/ApacheGeode/open-sourcing-gemfire-apache-geode
  • #15 Creating a key with non-printable characters such as UTF-8 0x01, 0x02, etc. can cause failures with Geode.
    People are supposed to launch several Redis instances to scale out on several cores if needed. It is not really fair to compare one single Redis instance to a multi-threaded data store.
    GemFireRedisServer relies on the highly concurrent nature of Geode to be concurrent. Each server instance starts 4 * (number of processor cores) threads for processing client requests, but this can be configured by system property so that either one thread per connection is created or a specific number of client handler threads is requested.
    With the Sentinel approach, there is no real protection from network partition. The documentation mentions that a write quorum should be used to guard against writing to a primary on the losing side; however, since replication is asynchronous, there will still be some amount of data loss. (This will be fixed with redis-cluster, which no longer needs sentinels for partition detection.)
    Geode has network partition detection built in. The servers on the losing side shut down or fence themselves so that clients cannot connect to them.
    https://cwiki.apache.org/confluence/display/GEODE/Geode+Redis+Adapter
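The default handler-thread sizing described in note #15 is simple arithmetic, shown here as a small sketch. The function name is hypothetical and this is not the adapter's code; only the 4 * (number of processor cores) rule comes from the note.

```python
import os


def default_handler_threads(cores=None):
    # 4 threads per processor core; fall back to 1 core if the count is unknown.
    if cores is None:
        cores = os.cpu_count() or 1
    return 4 * cores


print("8 cores ->", default_handler_threads(8), "threads")
print("this machine ->", default_handler_threads(), "threads")
```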