Introduction to Apache Geode
(incubating)
Cork, September 2015
Anthony Baker (@metatype) William Markito (@william_markito)
• Introduction to Geode
• Geode concepts and usage
• The Geode open source project
• StockPrediction demo
Agenda
Introduction
4
"…an in-memory, distributed database with strong
consistency built to support low latency
transactional applications at extreme scale.”
History
2004 2008 2014
•  Massive increase in data
volumes
•  Falling margins per
transaction
•  Increasing cost of IT
maintenance
•  Need for elasticity in
systems
•  Financial Services
Providers (every major
Wall Street bank)
•  Department of Defense
•  Real Time response needs
•  Time to market constraints
•  Need for flexible data
models across enterprise
•  Distributed development
•  Persistence + In-memory
•  Global data visibility needs
•  Fast Ingest needs for data
•  Need to allow devices to
hook into enterprise data
•  Always on
•  Largest travel Portal
•  Airlines
•  Trade clearing
•  Online gambling
•  Largest Telcos
•  Large mfrers
•  Largest Payroll processor
•  Auto insurance giants
•  Largest rail systems on
earth
• 1000+ customers in production
• Cutting edge use cases
Interesting use cases
China Railway

Corporation
5,700 train stations
4.5 million tickets per day
20 million daily users
1.4 billion page views per day
40,000 visits per second
*http://pivotal.io/big-data/pivotal-gemfire
Indian Railways
7,000 stations
72,000 miles of track
23 million passengers daily
120,000 concurrent users
10,000 transactions per minute
Interesting use cases
Indian RailwaysChina Railway Corporation
World: ~7,349,000,000
~36% of the world population
Population: 1,251,695,6161,401,586,609
8
Application Patterns
• Caching for speed and scale
➡ Read-Through, Write-Through, Write-Behind
• As the OLTP system of record
➡ Data in-memory for low-latency, on disk for durability
• Parallel compute engine
• Real-time analytics
Concepts
• Cache
• Region
• Member
• Client Cache
• Functions
• Listeners
Geode Concepts and Usage
• Cache
• In-memory storage and management for your data
• Configurable through XML, Spring, Java API or CLI
• Collection of Region
Region
Region
Region
Cache
JVM
Concepts
Concepts
• Region
• Distributed java.util.Map on steroids (Key/Value)
• Consistent API regardless of where or how data is stored
• Observable (reactive)
• Highly available, redundant on cache Member (s).
• Querying
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Concepts
• Region
• Local, Replicated or Partitioned
• In-memory or persistent
• Redundant
• LRU
• Overflow
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
LOCAL	
LOCAL_HEAP_LRU	
LOCAL_OVERFLOW	
LOCAL_PERSISTENT	
LOCAL_PERSISTENT_OVERFLOW	
PARTITION	
PARTITION_HEAP_LRU	
PARTITION_OVERFLOW	
PARTITION_PERSISTENT	
PARTITION_PERSISTENT_OVERFLOW	
PARTITION_PROXY	
PARTITION_PROXY_REDUNDANT	
PARTITION_REDUNDANT	
PARTITION_REDUNDANT_HEAP_LRU	
PARTITION_REDUNDANT_OVERFLOW	
PARTITION_REDUNDANT_PERSISTENT	
PARTITION_REDUNDANT_PERSISTENT_OVERFLOW	
REPLICATE	
REPLICATE_HEAP_LRU	
REPLICATE_OVERFLOW	
REPLICATE_PERSISTENT	
REPLICATE_PERSISTENT_OVERFLOW	
REPLICATE_PROXY
• Persistent Regions
• Durability
• WAL for efficient writing
• Consistent recovery
• Compaction
Concepts
Modify
k1->v5
Create
k6->v6
Create
k2->v2
Create
k4->v4
Oplog2.crf
Member
1
Modify
k4->v7Oplog3.crf
Put k4->v7
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Server 1 Server N
• Member
• A process that has a connection to the system
• A process that has created a cache
• Embeddable within your application
Concepts
Client
Locator
Server
Concepts
• Client cache
• A process connected to the Geode server(s)
• Can have a local copy of the data
• Can be notified about events on the servers
Application
GemFire Server
Region
Region
RegionClient Cache
Concepts
• Functions
• Used for distributed concurrent processing 

(Map/Reduce, stored procedure)
• Highly available
• Data oriented
• Member oriented
Submit (f1)
f1 , f2 , … fn
Execute

Functions
Concepts
• Functions
Server Server
FunctionService.onRegion.withFilter.execute
ResultCollector.getResult
Server Distributed System
execute
Server
Server
6
1
result
execute
execute
result
result
2
5
3
4
3 4
Server
Partitioned Region
Data Store - X
Partitioned Region
Data Store - Y
Partitioned Region
Data Store - Z
Partitioned Region
Data Accessor
Partitioned Region
Data Accessor
filter = Keys X, Y
Client Region
Concepts
• Listeners
• CacheWriter / CacheListener
• AsyncEventListener (queue / batch)
• Parallel or Serial
• Conflation
Core Beliefs
Performance
Consistency
Resiliency
Why Apache Geode?
© Copyright 2014 Pivotal. All rights reserved.
Pivotal GemFire High Availability and Fault Tolerance in 6 acts
Failing data copies are
replaced transparently
Data is replicated to other
clusters and sites (WAN)
Network segmentations are
identified and fixed automatically
Client and cluster disconnections
are handled gracefully
Data is persisted on local
disk for ultimate durability
“split brain”
Failed function executions
are restarted automatically
restart
• Minimize copying
• Minimize contention points
• Flexible consistency model
• Partitioning and parallelism
• Avoid disk seeks
What makes it fast?
Benchmarks and Testing
0
2
4
6
8
10
12
14
16
18
0
1
2
3
4
5
6
2 4 6 8 10
Speedup
Server(Hosts
speedup
latency4(ms)
CPU4%
The Geode Project
Why Open Source? Why ASF?
• Open source is fundamentally changing software buying patterns
• Customers get transparency and co-development of features
• It’s the community that matters
• ASF provides a framework for open source
Geode Will Be a Significant Apache Project
• 1M+ LOC, over a 1000 person years invested into cutting edge R&D
• Thousands of production customers in very demanding verticals
• Cutting edge use cases that have shaped product thinking
• A core technology team that has stayed together since founding
• Performance differentiators that are baked into every aspect of the product
Geode versus GemFire
• Geode is a project supported by the OSS community
• GemFire is product from Pivotal, based on Geode source
• We donated everything but the kitchen sink*
• Development process follows “The Apache Way”
* Multi-site WAN replication, continuous queries, and native (C/C++/.NET) client
"Talk is cheap, show me the code"
• Clone & Build
git	clone	https://github.com/apache/incubator-geode	
cd	incubator-geode

./gradlew	build	-Dskip.tests=true
• Start a server
cd	gemfire-assembly/build/install/apache-geode		
./bin/gfsh		
gfsh>	start	locator	--name=locator		
gfsh>	start	server	--name=server		
gfsh>	create	region	--name=myRegion	--type=REPLICATE
29
&
Stock Predictions 

with 

Apache Geode, Spark, and SpringXD
• RDD
• Dataframe
• Driver
• Worker
Quick intro to Apache Spark
"An RDD in Spark is simply an immutable distributed collection of objects.
Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster. RDDs can contain any type of Python, Java, or Scala objects,
including user-defined classes."
• RDD
• Dataframe
• Driver
• Worker
Quick intro to Apache Spark
“A dataframe is a distributed collection of rows organized into named columns. An
abstraction for selecting, filtering and plotting structured data (pandas), previously
known as SchemaRDD."
• RDD
• Dataframe
• Driver
• Worker
Quick intro to Apache Spark
Live Data
Apache Geode / GemFire
1- Live data is ingested into the grid
2 - Trained ML model compares
new data to historical patterns
3 - Results are pushed
immediately to deployed
applications
Machine Learning
model
4 - Re-training is triggered, updating
the model with the latest historical data
Spring XD
Spring XD
Data
Temperature
Hot
Warm
Machine Learning Concepts
medium
avg (x+1)
relative
strength (x)
medium avg (x)
price(x)
Machine Learning Model
(e.g. Linear Regression)
medium
avg (x+1)
relative
strength (x)
medium avg (x)
price(x)
Machine Learning Model
(e.g. Linear Regression)
Features Label
Transform Sink
SpringXD
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable
Cloud-Native
Machine Learning
Enrich Filter
Split
Dashboard
Indicators
1
2
Predict
3
Real data
Simulator
/Stocks
/TechIndicators
/Predictions
• Off-heap memory storage
• HDFS persistence
• Lucene indexes
• Spark connector
• Cloud Foundry service
• Distributed transactions
…and other ideas from the Geode community!
Roadmap
How to Get Involved
• http://geode.incubator.apache.org
• Join the mailing lists; ask a question, answer a question, learn
dev@geode.incubator.apache.org
user@geode.incubator.apache.org
• File a bug in JIRA
• Update the wiki, website, or documentation
• Create example applications
• Use it in your project! We need you!
Questions?
Thank you!

Apache Geode Meetup, Cork, Ireland at CIT

  • 1.
    Introduction to ApacheGeode (incubating) Cork, September 2015 Anthony Baker (@metatype) William Markito (@william_markito)
  • 2.
    • Introduction toGeode • Geode concepts and usage • The Geode open source project • StockPrediction demo Agenda
  • 3.
  • 4.
    4 "…an in-memory, distributeddatabase with strong consistency built to support low latency transactional applications at extreme scale.”
  • 5.
    History 2004 2008 2014 • Massive increase in data volumes •  Falling margins per transaction •  Increasing cost of IT maintenance •  Need for elasticity in systems •  Financial Services Providers (every major Wall Street bank) •  Department of Defense •  Real Time response needs •  Time to market constraints •  Need for flexible data models across enterprise •  Distributed development •  Persistence + In-memory •  Global data visibility needs •  Fast Ingest needs for data •  Need to allow devices to hook into enterprise data •  Always on •  Largest travel Portal •  Airlines •  Trade clearing •  Online gambling •  Largest Telcos •  Large mfrers •  Largest Payroll processor •  Auto insurance giants •  Largest rail systems on earth • 1000+ customers in production • Cutting edge use cases
  • 6.
    Interesting use cases ChinaRailway
 Corporation 5,700 train stations 4.5 million tickets per day 20 million daily users 1.4 billion page views per day 40,000 visits per second *http://pivotal.io/big-data/pivotal-gemfire Indian Railways 7,000 stations 72,000 miles of track 23 million passengers daily 120,000 concurrent users 10,000 transactions per minute
  • 7.
    Interesting use cases IndianRailwaysChina Railway Corporation World: ~7,349,000,000 ~36% of the world population Population: 1,251,695,6161,401,586,609
  • 8.
    8 Application Patterns • Cachingfor speed and scale ➡ Read-Through, Write-Through, Write-Behind • As the OLTP system of record ➡ Data in-memory for low-latency, on disk for durability • Parallel compute engine • Real-time analytics
  • 9.
  • 10.
    • Cache • Region •Member • Client Cache • Functions • Listeners Geode Concepts and Usage
  • 11.
    • Cache • In-memorystorage and management for your data • Configurable through XML, Spring, Java API or CLI • Collection of Region Region Region Region Cache JVM Concepts
  • 12.
    Concepts • Region • Distributedjava.util.Map on steroids (Key/Value) • Consistent API regardless of where or how data is stored • Observable (reactive) • Highly available, redundant on cache Member (s). • Querying Region Cache java.util.Map JVM Key Value K01 May K02 Tim
  • 13.
    Concepts • Region • Local,Replicated or Partitioned • In-memory or persistent • Redundant • LRU • Overflow Region Cache java.util.Map JVM Key Value K01 May K02 Tim Region Cache java.util.Map JVM Key Value K01 May K02 Tim LOCAL LOCAL_HEAP_LRU LOCAL_OVERFLOW LOCAL_PERSISTENT LOCAL_PERSISTENT_OVERFLOW PARTITION PARTITION_HEAP_LRU PARTITION_OVERFLOW PARTITION_PERSISTENT PARTITION_PERSISTENT_OVERFLOW PARTITION_PROXY PARTITION_PROXY_REDUNDANT PARTITION_REDUNDANT PARTITION_REDUNDANT_HEAP_LRU PARTITION_REDUNDANT_OVERFLOW PARTITION_REDUNDANT_PERSISTENT PARTITION_REDUNDANT_PERSISTENT_OVERFLOW REPLICATE REPLICATE_HEAP_LRU REPLICATE_OVERFLOW REPLICATE_PERSISTENT REPLICATE_PERSISTENT_OVERFLOW REPLICATE_PROXY
  • 14.
    • Persistent Regions •Durability • WAL for efficient writing • Consistent recovery • Compaction Concepts Modify k1->v5 Create k6->v6 Create k2->v2 Create k4->v4 Oplog2.crf Member 1 Modify k4->v7Oplog3.crf Put k4->v7 Region Cache java.util.Map JVM Key Value K01 May K02 Tim Region Cache java.util.Map JVM Key Value K01 May K02 Tim Server 1 Server N
  • 15.
    • Member • Aprocess that has a connection to the system • A process that has created a cache • Embeddable within your application Concepts Client Locator Server
  • 16.
    Concepts • Client cache •A process connected to the Geode server(s) • Can have a local copy of the data • Can be notified about events on the servers Application GemFire Server Region Region RegionClient Cache
  • 17.
    Concepts • Functions • Usedfor distributed concurrent processing 
 (Map/Reduce, stored procedure) • Highly available • Data oriented • Member oriented Submit (f1) f1 , f2 , … fn Execute
 Functions
  • 18.
    Concepts • Functions Server Server FunctionService.onRegion.withFilter.execute ResultCollector.getResult ServerDistributed System execute Server Server 6 1 result execute execute result result 2 5 3 4 3 4 Server Partitioned Region Data Store - X Partitioned Region Data Store - Y Partitioned Region Data Store - Z Partitioned Region Data Accessor Partitioned Region Data Accessor filter = Keys X, Y Client Region
  • 19.
    Concepts • Listeners • CacheWriter/ CacheListener • AsyncEventListener (queue / batch) • Parallel or Serial • Conflation
  • 20.
  • 21.
    Why Apache Geode? ©Copyright 2014 Pivotal. All rights reserved. Pivotal GemFire High Availability and Fault Tolerance in 6 acts Failing data copies are replaced transparently Data is replicated to other clusters and sites (WAN) Network segmentations are identified and fixed automatically Client and cluster disconnections are handled gracefully Data is persisted on local disk for ultimate durability “split brain” Failed function executions are restarted automatically restart
  • 22.
    • Minimize copying •Minimize contention points • Flexible consistency model • Partitioning and parallelism • Avoid disk seeks What makes it fast?
  • 23.
    Benchmarks and Testing 0 2 4 6 8 10 12 14 16 18 0 1 2 3 4 5 6 24 6 8 10 Speedup Server(Hosts speedup latency4(ms) CPU4%
  • 24.
  • 25.
    Why Open Source?Why ASF? • Open source is fundamentally changing software buying patterns • Customers get transparency and co-development of features • It’s the community that matters • ASF provides a framework for open source
  • 26.
    Geode Will Bea Significant Apache Project • 1M+ LOC, over a 1000 person years invested into cutting edge R&D • Thousands of production customers in very demanding verticals • Cutting edge use cases that have shaped product thinking • A core technology team that has stayed together since founding • Performance differentiators that are baked into every aspect of the product
  • 27.
    Geode versus GemFire •Geode is a project supported by the OSS community • GemFire is product from Pivotal, based on Geode source • We donated everything but the kitchen sink* • Development process follows “The Apache Way” * Multi-site WAN replication, continuous queries, and native (C/C++/.NET) client
  • 28.
    "Talk is cheap,show me the code" • Clone & Build git clone https://github.com/apache/incubator-geode cd incubator-geode
 ./gradlew build -Dskip.tests=true • Start a server cd gemfire-assembly/build/install/apache-geode ./bin/gfsh gfsh> start locator --name=locator gfsh> start server --name=server gfsh> create region --name=myRegion --type=REPLICATE
  • 29.
  • 30.
    Stock Predictions 
 with
 Apache Geode, Spark, and SpringXD
  • 31.
    • RDD • Dataframe •Driver • Worker Quick intro to Apache Spark "An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."
  • 32.
    • RDD • Dataframe •Driver • Worker Quick intro to Apache Spark “A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."
  • 33.
    • RDD • Dataframe •Driver • Worker Quick intro to Apache Spark
  • 34.
    Live Data Apache Geode/ GemFire 1- Live data is ingested into the grid 2 - Trained ML model compares new data to historical patterns 3 - Results are pushed immediately to deployed applications Machine Learning model 4 - Re-training is triggered, updating the model with the latest historical data Spring XD Spring XD Data Temperature Hot Warm
  • 35.
    Machine Learning Concepts medium avg(x+1) relative strength (x) medium avg (x) price(x) Machine Learning Model (e.g. Linear Regression)
  • 36.
    medium avg (x+1) relative strength (x) mediumavg (x) price(x) Machine Learning Model (e.g. Linear Regression) Features Label
  • 37.
    Transform Sink SpringXD Extensible Open-Source Fault-Tolerant Horizontally Scalable Cloud-Native MachineLearning Enrich Filter Split Dashboard Indicators 1 2 Predict 3 Real data Simulator /Stocks /TechIndicators /Predictions
  • 38.
    • Off-heap memorystorage • HDFS persistence • Lucene indexes • Spark connector • Cloud Foundry service • Distributed transactions …and other ideas from the Geode community! Roadmap
  • 39.
    How to GetInvolved • http://geode.incubator.apache.org • Join the mailing lists; ask a question, answer a question, learn dev@geode.incubator.apache.org user@geode.incubator.apache.org • File a bug in JIRA • Update the wiki, website, or documentation • Create example applications • Use it in your project! We need you!
  • 40.