Cloud Computing Applications
Hazelcast, Spark and Ignite
Joseph S. Kuo a.k.a. CyberJos
About Me
.Played with a pile of languages and architectures while studying mathematics in college
.22 years of programming experience, 17 years with Java
.Former IT instructor; has worked for a gaming cloud platform company, a global e-commerce company, a well-known information security company, and a social trend analysis company
.Hopes to keep writing code and playing with technology for life
Agenda
.Briefing of Hazelcast
.More about Hazelcast
.Spark Introduction
.Hazelcast and Spark
.About Apache Ignite
.Things between Ignite and Hazelcast
Briefing of Hazelcast
What is Hazelcast?
Hazelcast is an in-memory data grid which
distributes data evenly among the nodes of
a computing cluster, and shares the available
processing power and storage space to
provide services. It is also tolerant of
failures and node loss.
Features
.Distributed Caching: Queue, Set, List, Map,
MultiMap, Lock, Topic, AtomicReference,
AtomicLong, IdGenerator, Ringbuffer,
Semaphores
.Distributed Compute: Entry Processor,
Executor Service, User Defined Services
.Distributed Query: Query, Aggregators,
Listener with Predicate, MapReduce
Features (Cont.)
.Integrated Clustering: Hibernate 2nd Level Cache, Grails 3, JCA Resource Adapter
.Standards: JCache, Apache jclouds
.Cloud and Virtualization Support: Docker, AWS, Azure, Discovery Service Provider Interface, Kubernetes, ZooKeeper Discovery
.Client-Server Protocols: Memcache, Open
Binary Client Protocol, REST
Use Cases
.In-Memory Data Grid
.Caching
.In-Memory NoSQL
.Messaging
.Application Scaling
.Clustering
In-Memory Data Grid
.Scale-out Computing: shared CPU power
.Resilience: survives node failure without data loss or performance degradation
.Programming Model: easily code clusters
.Fast, Big Data: handle large sets in RAM
.Dynamic Scalability: join/leave a cluster
.Elastic Main Memory: memory pool
Caching
.Elastic Memcached: Hazelcast has been used as an alternative to Memcached.
.Hibernate 2nd Level Cache: Hibernate organizes caching into 1st and 2nd level caches; Hazelcast can serve as the 2nd level cache.
.Spring Cache: Hazelcast supports the Spring Cache abstraction, which lets it plug into any Spring application (see the sketch below).
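A minimal Spring Java-config sketch of this integration (assumes the hazelcast-spring module is on the classpath; class and bean names are illustrative):

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.spring.cache.HazelcastCacheManager;

@Configuration
@EnableCaching
public class CacheConfig {
    // Back Spring's cache abstraction with Hazelcast: each Spring
    // cache name maps to a distributed Hazelcast IMap.
    @Bean
    public CacheManager cacheManager() {
        return new HazelcastCacheManager(Hazelcast.newHazelcastInstance());
    }
}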
In-Memory NoSQL
.Scalability: size of RAM vs Disk
By joining nodes in a cluster, we can pool
RAM to store maps, and each node's CPU and
RAM become available to the whole cluster.
.Volatility: volatility of RAM vs Disk
It uses P2P data distribution so there is no
single point of failure. By default, data is
stored in two locations in the cluster.
In-Memory NoSQL (Cont.)
.Rebalancing
It automatically rebalances data if a node
crashes. Shuffling data has a negative effect
as it consumes network, CPU and RAM.
.Going Native
The High-Density Memory Store can avoid
GC pauses. It uses NIO DirectByteBuffers
and does not require any defragmentation.
Messaging
Hazelcast provides a distributed Topic
mechanism for publishing messages that are
delivered to multiple subscribers.
Publishing and subscribing are cluster-wide.
Messages are ordered: listeners process the
messages in the order they are actually
published.
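A minimal embedded sketch of a cluster-wide topic (topic name and message are illustrative):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;

public class TopicMain {
    public static void main(final String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        ITopic<String> topic = instance.getTopic("news");
        // Every registered listener in the cluster receives the
        // message, in the order it was published.
        topic.addMessageListener(msg ->
                System.out.println("Got: " + msg.getMessageObject()));
        topic.publish("Hello, subscribers!");
    }
}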
Application Scaling
.Elastic Scalability: new servers join a
cluster automatically
.Super Speeds: memory transaction speed
.High Availability: can deploy in backup
pairs or even WAN replicated
.Fault Tolerance: no single point of failure
.Cloud Readiness: deploy right into EC2
Clustering
Hazelcast easily handles session clustering
with in-memory performance, linear
scalability as you add new nodes, and
reliability. This is a great way to ensure that
session information is maintained when
you cluster web servers. You can also use a
similar pattern for managing user identities.
Dependency
.Maven
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast</artifactId>
    <version>3.7.2</version>
</dependency>
.Gradle
dependencies {
    compile 'com.hazelcast:hazelcast:3.7.2'
}
More about Hazelcast
What’s New in Hazelcast 3.4
.High-Density Memory Store
.Hazelcast Configuration Import
.Back Pressure
What’s New in Hazelcast 3.5
.Async Back Pressure
.Client Configuration Import
.Cluster Quorum
.Hazelcast Client Protocol
.Listener for Lost Partitions
.Increased Visibility of Slow Operations
.Sub-Listener Interfaces for Map Listener
What’s New in Hazelcast 3.6
.High-Density Memory Store for Map
.Discovery SPI
.Client Protocol & Version Compatibility
.Support for cloud providers by jclouds®
.Hot Restart Persistence
.Lite Members
.Lots of Features for Hazelcast JCache
.Hazelcast Docker image
What’s New in Hazelcast 3.7
.Custom Eviction Policies
.Discovery SPI for Azure
.Hazelcast CLI with Scripting
.OpenShift and CloudFoundry Plugin
.Apache Spark Connector
.Alignment of WAN Replication Clusters
.Fault Tolerant Executor Service
Sample Code
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

import java.util.Map;

public class GetStartedMain {
    public static void main(final String[] args) {
        // Start an embedded Hazelcast member with the default configuration.
        Config cfg = new Config();
        HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
        // getMap returns a distributed map shared across the cluster.
        Map<Long, String> map = instance.getMap("test");
        map.put(1L, "Demo");
        System.out.println(map.get(1L));
    }
}
Sharding – 4 nodes
How Is Data Partitioned?
Data entries are distributed into partitions
by using a hashing algorithm on the key or name:
.the key or name is serialized (converted into a byte array),
.this byte array is hashed, and
.the result of the hash is taken modulo the number of partitions.
Partition ID
The result of this modulo - MOD (hash
result, partition count) - is the partition in
which the data will be stored, that is the
partition ID. For ALL members you have in
your cluster, the partition ID for a given key
will always be the same.
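As a sketch, the three steps boil down to the following (the helper names are illustrative, not Hazelcast's actual source; Hazelcast hashes with MurmurHash3 and defaults to 271 partitions):

// Conceptual computation of a partition ID.
byte[] keyBytes = serialize(key);                   // 1. serialize the key
int hash = murmurHash3(keyBytes);                   // 2. hash the byte array
int partitionId = Math.abs(hash) % partitionCount;  // 3. modulo the partition count

// The real value can be looked up through the public API:
int id = instance.getPartitionService().getPartition(key).getPartitionId();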
Partition Table
When we start a member, a partition table
is created within it. This table stores the
partition IDs and the cluster members to
which they belong. The purpose of this
table is to make all members (including lite
members) in the cluster aware of this
information, ensuring that each member
knows where the data is.
Partition Table (Cont.)
The oldest member in the cluster (the one
that started first) periodically sends the
partition table to all members. In this way
each member in the cluster is informed
about any changes to partition ownership.
The ownerships may be changed when a
new member joins the cluster, or when a
member leaves the cluster.
Repartitioning
Repartitioning is the process of redistributing
partition ownerships. It happens:
.when a member joins the cluster, or
.when a member leaves the cluster.
In these cases, the partition table in the
oldest member is updated with the new
partition ownerships.
Topology - Embedded
Topology - Client/Server
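In the client/server topology, the application connects to an existing cluster as a lightweight client instead of embedding a member. A minimal sketch (the address and map name are illustrative):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

import java.util.Map;

public class ClientMain {
    public static void main(final String[] args) {
        // The client holds no partitions; it only talks to the cluster.
        ClientConfig config = new ClientConfig();
        config.getNetworkConfig().addAddress("127.0.0.1:5701");
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
        Map<Long, String> map = client.getMap("test");
        System.out.println(map.get(1L));
    }
}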
Spark Introduction
What is Spark?
.Spark is a fast and general-purpose
cluster computing system. It provides high-
level APIs and an optimized engine that
supports general execution graphs. It also
supports a rich set of higher-level tools.
.It provides an interface for programming
entire clusters with implicit data parallelism
and fault-tolerance.
Advantages
.Speed
Run programs up to 100x faster than
Hadoop MapReduce in memory, or 10x
faster on disk.
.Ease of Use
Write applications quickly. Spark offers over
80 high-level operators to build parallel
applications.
Advantages (Cont.)
.Generality
Combine SQL, streaming and complex
analytics libraries seamlessly in the same
application.
.Run Everywhere
Supports multiple cluster managers and
distributed storage systems.
Features
.Resilient distributed dataset (RDD)
.Fault Tolerant
.Map-reduce cluster computing
.Built-in libraries
.Languages: Java, Scala, Python and R
.Interactive shell (Python, Scala, R) and
web-based UI
RDD
A resilient distributed dataset is a read-only
distributed collection of elements partitioned
across the nodes of the cluster that can be
operated on in parallel. It can stay in
memory and fall back to disk gracefully. An
RDD in memory (cached) can be reused
efficiently across parallel operations.
Finally, RDDs automatically recover from
node failures.
RDD Operations
Two types of operations can be performed on an
RDD (see the sketch below):
.transformations like map and filter, which result in another RDD
.actions like count, which result in an output
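A minimal Java sketch of the two kinds of operations (the local master and sample data are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOpsMain {
    public static void main(final String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-ops").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0); // transformation: lazy
        long count = evens.count();                               // action: triggers execution
        System.out.println(count);
        sc.stop();
    }
}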
RDD Operations (Cont.)
RDD Fault Recovery
Directed Acyclic Graph
Cluster Topology
Dependency
.Maven
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
.Gradle
dependencies {
    compile 'org.apache.spark:spark-core_2.11:2.0.0'
}
Spark Node with Docker
.Pull image (Spark 2.0)
docker pull maguowei/spark
.Launch a Spark node
docker run -it -p 4040:4040 maguowei/spark pyspark
docker run -it -p 4040:4040 maguowei/spark spark-shell
.Monitoring
http://localhost:4040/
Spark Cluster with Docker
.Launch master image (driver program)
docker run -it -h sandbox1 -p 7077:7077 -p 8080:8080 maguowei/spark bash
.Append text to /etc/hosts
172.17.0.2 sandbox1
172.17.0.3 sandbox2
.Launch the master node
/opt/spark-2.0.0-bin-hadoop2.7/sbin/start-master.sh
.Monitoring
http://localhost:8080/
Spark Cluster with Docker (Cont.)
.Launch worker images
docker run -it -h sandbox2 maguowei/spark bash
.Append text to /etc/hosts
172.17.0.2 sandbox1
172.17.0.3 sandbox2
.Launch a worker node
/opt/spark-2.0.0-bin-hadoop2.7/sbin/start-slave.sh spark://sandbox1:7077
.Run tasks
docker exec <CONTAINER_ID> run-example <class> <arg>
Use the same version in all places
Use the same version in all places
Use the same version in all places
It is very important, so it is said three times
Hazelcast and Spark
What is this Connector?
A plug-in which allows maps and caches to
be used as shared RDD caches by Spark
using the Spark RDD API.
What is this Connector?
[Diagram: clients on each side; Hazelcast (MapReduce) and Spark (MapReduce) are bridged by the Hazelcast Spark Connector.]
Features
.Read/Write support for Hazelcast Maps
.Read/Write support for Hazelcast Caches
Requirements
.Hazelcast 3.7.x
.Apache Spark 1.6.1
Dependency
.Maven
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast-spark</artifactId>
    <version>0.1</version>
</dependency>
.Gradle
dependencies {
    compile 'com.hazelcast:hazelcast-spark:0.1'
}
Properties
The options for the SparkConf object
.hazelcast.server.addresses: 127.0.0.1:5701 (Comma
separated list)
.hazelcast.server.groupName: dev
.hazelcast.server.groupPass: dev-pass
.hazelcast.spark.valueBatchingEnabled: true
.hazelcast.spark.readBatchSize: 1000
.hazelcast.spark.writeBatchSize: 1000
.hazelcast.spark.clientXmlPath
Creating the SparkContext
SparkConf conf = new SparkConf()
        .set("hazelcast.server.addresses", "127.0.0.1:5701")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.valueBatchingEnabled", "true")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000");

JavaSparkContext jsc = new JavaSparkContext("spark://127.0.0.1:7077", "appname", conf);

// Provide Hazelcast functions to the Spark context.
HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
Read Data from Hazelcast
// read
HazelcastJavaRDD rddFromMap = hsc.fromHazelcastMap("map-name-to-be-loaded");
HazelcastJavaRDD rddFromCache = hsc.fromHazelcastCache("cache-name-to-be-loaded");
Write Data to Hazelcast
import static com.hazelcast.spark.connector.HazelcastJavaPairRDDFunctions.javaPairRddFunctions;

JavaPairRDD<Object, Long> rdd = hsc.parallelize(new ArrayList<Object>() {{
    add(1);
    add(2);
    add(3);
}}).zipWithIndex();

// write ("name" is the target map/cache name)
javaPairRddFunctions(rdd).saveToHazelcastMap(name);
javaPairRddFunctions(rdd).saveToHazelcastCache(name);
About Apache Ignite
What is Ignite?
Apache Ignite In-Memory Data Fabric is a
high-performance, integrated and
distributed in-memory platform for
computing and transacting on large-scale
data sets in real-time, orders of magnitude
faster than possible with traditional disk-
based or flash technologies.
Features
.Data Grid
.Compute Grid
.Streaming and CEP
.Data Structures
.Messaging and Events
.Service Grid
Data Grid
.Distributed Caching: Key-Value Store,
Partitioning & Replication, Client-Side Cache
.Cluster Resiliency: Self-Healing Cluster
.Memory Formats: On-heap, Off-heap,
Tiered Storage
.Marshalling: Binary Protocol
.Distributed Transactions and Locks: ACID,
Deadlock-free, Cross-partition, Locks
Data Grid (Cont.)
.Distributed Query: SQL Queries, Joins,
Continuous Queries, Indexing, Consistency,
Fault-Tolerance (see the SQL sketch below)
.Persistence: Write-Through, Read-Through,
Write-Behind Caching, Automatic Persistence
.Standards: JCache, SQL, JDBC, OSGi
.Integrations: DB, Hibernate L2 Cache,
Session Clustering, Spring Caching
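For instance, a fragment sketching a distributed SQL query (the cache and Person type are illustrative and must be configured for SQL indexing):

import java.util.List;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.SqlFieldsQuery;

// Assumes a running Ignite instance and an SQL-indexed value type.
IgniteCache<Long, Person> people = ignite.cache("people");
List<List<?>> rows = people.query(
        new SqlFieldsQuery("select name from Person where age > ?")
                .setArgs(30)).getAll();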
Compute Grid
.Distributed Closure Execution
.Clustered Executor Service
.MapReduce and ForkJoin
.Load Balancing
.Fault-Tolerance
.Job Scheduling
.Checkpointing
Streaming and CEP
Ignite streaming allows you to process
continuous, never-ending streams of data in
a scalable and fault-tolerant fashion. Data
can be injected into Ignite at very high
rates, easily exceeding millions of events
per second on a moderately sized cluster.
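A minimal ingestion sketch using IgniteDataStreamer (the cache name and event shape are illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamMain {
    public static void main(final String[] args) {
        Ignite ignite = Ignition.start();
        ignite.getOrCreateCache("events");
        // The streamer batches entries and routes them to owning nodes.
        try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("events")) {
            for (int i = 0; i < 1_000_000; i++) {
                streamer.addData(i, "event-" + i);
            }
        }
    }
}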
Data Structures
.Queue and Set
.Atomic Types
.CountDownLatch
.IdGenerator
.Semaphore
Messaging and Events
.Topic-Based Messaging (see the sketch below)
.Point-to-Point Messaging
.Event Notifications
.Automatic Batching
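A small sketch of topic-based messaging (the topic name is illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class MessagingMain {
    public static void main(final String[] args) {
        Ignite ignite = Ignition.start();
        // Listen on a topic; returning true keeps the listener subscribed.
        ignite.message().localListen("alerts", (nodeId, msg) -> {
            System.out.println("Got: " + msg);
            return true;
        });
        // Deliver to all nodes listening on the topic.
        ignite.message().send("alerts", "Hello, Ignite!");
    }
}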
Service Grid
Dependency
.Maven
<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-core</artifactId>
    <version>1.7.0</version>
</dependency>
.Gradle
dependencies {
    compile 'org.apache.ignite:ignite-core:1.7.0'
}
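A minimal getting-started sketch paralleling the earlier Hazelcast sample (the cache name is illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class GetStartedIgnite {
    public static void main(final String[] args) {
        // Start an embedded Ignite node with the default configuration.
        Ignite ignite = Ignition.start();
        IgniteCache<Long, String> cache = ignite.getOrCreateCache("test");
        cache.put(1L, "Demo");
        System.out.println(cache.get(1L));
    }
}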
Things between Ignite & Hazelcast
Benchmark Fight
.GridGain posted: GridGain vs Hazelcast
Benchmarks
.The benchmark was also posted to the Hazelcast forum
.Hazelcast's CEO removed that post
.Hazelcast fought back and claimed that GridGain had cheated
.GridGain re-tested and published a clarification
Difference
Feature            Ignite         Hazelcast
Off-heap Memory    Configurable   Enterprise
Off-heap Indexing  Yes            No
Continuous Query   Yes            Enterprise
SSL Encryption     Yes            Enterprise
SQL Query          Full ANSI 99   Limited
Join Query         Yes            No
Data Consistency   Yes            Partial
Difference (Cont.)
Feature          Ignite                       Hazelcast
Deadlock-free    Yes                          No
Compute Grid     MapReduce, ForkJoin,         MapReduce
                 LoadBalance, ...
Streaming/CEP    Yes                          No
Service Grid     Yes                          No
Language         .Net/C#/C++/Node.js          .Net/C#/C++
Data Structures  Less                         More
Plug-in          Less                         More
It doesn’t matter which you select
How you use it does matter
References
.Hazelcast: http://hazelcast.org/
.Hazelcast Doc: http://hazelcast.org/documentation/
.Spark: http://spark.apache.org/
.Hazelcast Spark Connector:
https://github.com/hazelcast/hazelcast-spark
.Apache Ignite: https://ignite.apache.org/
.Sample Code: https://github.com/CyberJos/jcconf2016-hazelcast-spark
Thank You!!
