This session covers building applications on distributed, scalable systems, also known as cloud computing. We will introduce not only the basics of Hazelcast but also its internals, and show how it works with Spark, the most popular MapReduce library. We will also introduce another in-memory platform, Apache Ignite, and compare it with Hazelcast to see how the two differ. Finally, we will give a demonstration of Hazelcast and Spark working together to form a cloud-based service that is distributed, flexible, reliable, available, scalable and stable. You can find the demo code here: https://github.com/CyberJos/jcconf2016-hazelcast-spark
https://cyberjos.blog/java/seminar/jcconf-2016-cloud-computing-applications-hazelcast-spark-and-ignite/
3. Agenda
.Briefing of Hazelcast
.More about Hazelcast
.Spark Introduction
.Hazelcast and Spark
.About Apache Ignite
.Things between Ignite and Hazelcast
6. What is Hazelcast?
Hazelcast is an in-memory data grid which
distributes data evenly among the nodes of
a computing cluster, and shares available
processing power and storage space to
provide services. It also tolerates failures
and node loss.
7. Features
.Distributed Caching: Queue, Set, List, Map,
MultiMap, Lock, Topic, AtomicReference,
AtomicLong, IdGenerator, Ringbuffer,
Semaphores
.Distributed Compute: Entry Processor,
Executor Service, User Defined Services
.Distributed Query: Query, Aggregators,
Listener with Predicate, MapReduce
8. Features (Cont.)
.Integrated Clustering: Hibernate 2nd Level
Cache, Grails 3, JCS Resource Adapter
.Standards: JCache, Apache jclouds
.Cloud and Virtualization Support: Docker,
AWS, Azure, Discovery Service Provider
Interface, Kubernetes, Zookeeper Discovery
.Client-Server Protocols: Memcache, Open
Binary Client Protocol, REST
10. In-Memory Data Grid
.Scale-out Computing: shared CPU power
.Resilience: failure & data loss/performance
.Programming Model: easily code clusters
.Fast, Big Data: handle large sets in RAM
.Dynamic Scalability: join/leave a cluster
.Elastic Main Memory: memory pool
11. Caching
.Elastic Memcached: Hazelcast has been
used as an alternative to Memcached.
.Hibernate 2nd Level Cache: It organizes
caching into 1st and 2nd level caches.
.Spring Cache: It supports Spring Cache
which allows it to plug in to any Spring
application.
12. In-Memory NoSQL
.Scalability: size of RAM vs DISK
By joining nodes in a cluster, we can gather
RAM to store maps, and the CPU and RAM
resources become available to the network.
.Volatility: volatility of RAM vs Disk
It uses P2P data distribution so that there
is no single point of failure. By default,
data is stored in two locations in the cluster.
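The "two locations" default corresponds to a backup count of 1 (one primary copy plus one synchronous backup), which can be raised per map. A minimal sketch using the Hazelcast 3.x `Config` API; the map name "test" and the class name are illustrative:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

// Sketch: configuring how many synchronous backups a map keeps.
// Backup count 1 (the default) means two copies of each entry in the
// cluster: one primary and one backup.
public class BackupConfigSketch {

    static Config withBackups(String mapName, int backupCount) {
        Config config = new Config();
        config.addMapConfig(new MapConfig(mapName).setBackupCount(backupCount));
        return config;
    }

    public static void main(String[] args) {
        // Three copies of each entry: one primary and two backups.
        Config config = withBackups("test", 2);
        System.out.println(config.getMapConfig("test").getBackupCount());
    }
}
```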
13. In-Memory NoSQL (Cont.)
.Rebalancing
It automatically rebalances data if a node
crashes. Shuffling data has a negative effect
as it consumes network, CPU and RAM.
.Going Native
The High-Density Memory Store can avoid
GC pauses. It uses NIO DirectByteBuffers
and does not require any defragmentation.
14. Messaging
Hazelcast provides Topic for distribution
mechanism for publishing messages that
are delivered to multiple subscribers.
Publish and subscriptions are cluster-wide.
Messages are ordered, that is, listeners will
process the messages in the order they are
actually published.
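A minimal sketch of the Topic publish/subscribe pattern described above, using the Hazelcast 3.x `ITopic` API; the topic name "news" and the class name are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;

// Sketch of cluster-wide publish/subscribe with a Hazelcast Topic.
// Every registered listener receives messages in publish order.
public class TopicSketch {

    static String publishAndReceive(String message) throws InterruptedException {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        try {
            ITopic<String> topic = instance.getTopic("news");
            BlockingQueue<String> received = new ArrayBlockingQueue<>(1);
            // Listener is invoked asynchronously for each published message.
            topic.addMessageListener(m -> received.offer(m.getMessageObject()));
            topic.publish(message);
            return received.poll(10, TimeUnit.SECONDS); // wait for async delivery
        } finally {
            instance.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(publishAndReceive("Hello, subscribers!"));
    }
}
```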
15. Application Scaling
.Elastic Scalability: new servers join a
cluster automatically
.Super Speeds: memory transaction speed
.High Availability: can deploy in backup
pairs or even WAN replicated
.Fault Tolerance: no single point of failure
.Cloud Readiness: deploy right into EC2
16. Clustering
Hazelcast easily handles Session
Clustering with in-memory performance,
linear scalability as you add new nodes,
and reliability. This is a great way to
ensure that session information is
maintained when you are clustering web
servers. You can also use a similar pattern
for managing user identities.
19. What’s New in Hazelcast 3.4
.High-Density Memory Store
.Hazelcast Configuration Import
.Back Pressure
20. What’s New in Hazelcast 3.5
.Async Back Pressure
.Client Configuration Import
.Cluster Quorum
.Hazelcast Client Protocol
.Listener for Lost Partitions
.Increased Visibility of Slow Operations
.Sub-Listener Interfaces for Map Listener
21. What’s New in Hazelcast 3.6
.High-Density Memory Store for Map
.Discovery SPI
.Client Protocol & Version Compatibility
.Support for cloud providers by jclouds®
.Hot Restart Persistence
.Lite Members
.Lots of Features for Hazelcast JCache
.Hazelcast Docker image
22. What’s New in Hazelcast 3.7
.Custom Eviction Policies
.Discovery SPI for Azure
.Hazelcast CLI with Scripting
.OpenShift and CloudFoundry Plugin
.Apache Spark Connector
.Alignment of WAN Replication Clusters
.Fault Tolerant Executor Service
23. Sample Code
import java.util.Map;

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class GetStartedMain {
    public static void main(final String[] args) {
        Config cfg = new Config();
        HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
        Map<Long, String> map = instance.getMap("test");
        map.put(1L, "Demo");
        System.out.println(map.get(1L));
    }
}
25. How Is Data Partitioned?
Data entries are distributed into partitions
by hashing their key or name:
.the key or name is serialized (converted
into a byte array),
.this byte array is hashed, and
.the hash result is taken modulo the
number of partitions.
26. Partition ID
The result of this modulo - MOD (hash
result, partition count) - is the partition in
which the data will be stored, that is the
partition ID. For ALL members you have in
your cluster, the partition ID for a given key
will always be the same.
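The three steps above can be sketched in plain Java. This is an illustration only: Hazelcast's real serialization and hash function differ, and the class name is hypothetical; 271 is Hazelcast's default partition count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of partition ID assignment: serialize the key, hash the
// bytes, then take MOD(hash result, partition count).
public class PartitionIdSketch {
    static final int PARTITION_COUNT = 271; // Hazelcast's default

    static int partitionId(String key) {
        byte[] serialized = key.getBytes(StandardCharsets.UTF_8); // step 1: serialize
        int hash = Arrays.hashCode(serialized);                   // step 2: hash
        return Math.floorMod(hash, PARTITION_COUNT);              // step 3: modulo
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition ID,
        // no matter which member computes it.
        System.out.println(partitionId("test"));
        System.out.println(partitionId("test") == partitionId("test"));
    }
}
```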
27. Partition Table
When we start a member, a partition table
is created within it. This table stores the
partition IDs and the cluster members to
which they belong. The purpose of this
table is to make all members (including lite
members) in the cluster aware of this
information, ensuring that each member
knows where the data is.
28. Partition Table (Cont.)
The oldest member in the cluster (the one
that started first) periodically sends the
partition table to all members. In this way
each member in the cluster is informed
about any changes to partition ownership.
The ownerships may be changed when a
new member joins the cluster, or when a
member leaves the cluster.
29. Repartitioning
Repartitioning is the process of redistribution
of partition ownerships:
.When a member joins the cluster.
.When a member leaves the cluster.
In these cases, the partition table in the
oldest member is updated with the new
partition ownerships.
35. What is Spark?
.Spark is a fast and general-purpose
cluster computing system. It provides high-
level APIs and an optimized engine that
supports general execution graphs. It also
supports a rich set of higher-level tools.
.It provides an interface for programming
entire clusters with implicit data parallelism
and fault-tolerance.
36. Advantages
.Speed
Run programs up to 100x faster than
Hadoop MapReduce in memory, or 10x
faster on disk.
.Ease of Use
Write applications quickly. Spark offers over
80 high-level operators for building parallel
applications.
37. Advantages (Cont.)
.Generality
Combine SQL, streaming and complex
analytics libraries seamlessly in the same
application.
.Run Everywhere
Supports multiple cluster managers and
distributed storage systems.
38. Features
.Resilient distributed dataset (RDD)
.Fault Tolerant
.Map-reduce cluster computing
.Built-in libraries
.Languages: Java, Scala, Python and R
.Interactive shell (Python, Scala, R) and
web-based UI
39. RDD
Resilient distributed dataset is a read-only
distributed data set of elements partitioned
across the nodes of the cluster that can be
operated on in parallel. It can stay in
memory and fall back to disk gracefully. An
RDD in memory (cached) can be reused
efficiently across parallel operations.
Finally, RDD automatically recovers from
node failures.
40. RDD Operations
Two types of things that can be done on an
RDD:
.transformations like map, filter than
results in another RDD
.actions like count that result in an output
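The two operation types can be sketched with Spark's Java API. Transformations build new RDDs lazily; the action at the end triggers the actual computation. The class name is illustrative, and `local[*]` runs Spark in-process for demonstration:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch of RDD transformations (map, filter) vs. actions (count).
public class RddOperationsSketch {

    static long runDemo() {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            JavaRDD<Integer> doubled = numbers.map(n -> n * 2);    // transformation -> new RDD
            JavaRDD<Integer> bigOnes = doubled.filter(n -> n > 4); // transformation -> new RDD
            return bigOnes.count();                                // action -> output
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // 3 (the values 6, 8 and 10 pass the filter)
    }
}
```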
67. What is Ignite?
Apache Ignite In-Memory Data Fabric is a
high-performance, integrated and
distributed in-memory platform for
computing and transacting on large-scale
data sets in real-time, orders of magnitude
faster than possible with traditional disk-
based or flash technologies.
73. Computing Grid
.Distributed Closure Execution
.Clustered Executor Service
.MapReduce and ForkJoin
.Load Balancing
.Fault-Tolerance
.Job Scheduling
.Checkpointing
74. Streaming and CEP
Ignite streaming allows to process
continuous never-ending streams of data in
scalable and fault-tolerant fashion. The
rates at which data can be injected into
Ignite can be very high and easily exceed
millions of events per second on a
moderately sized cluster.
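A minimal sketch of high-rate ingestion with Ignite's `IgniteDataStreamer`, which batches updates for throughput. The cache name "events" and the class name are illustrative:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

// Sketch: stream many entries into an Ignite cache via a data streamer.
public class StreamingSketch {

    static int streamEvents(int count) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("events");
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("events")) {
                for (int i = 0; i < count; i++) {
                    streamer.addData(i, "event-" + i); // buffered, delivered asynchronously
                }
            } // closing the streamer flushes any buffered data
            return cache.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(streamEvents(1000));
    }
}
```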
81. Benchmark Fight
.GridGain posted: GridGain vs Hazelcast
Benchmarks
.It was also posted to Hazelcast Forum
.Hazelcast CEO removed that post
.Hazelcast fought back and claimed that
GridGain cheated
.GridGain re-tested and clarified
83. Difference
                    Ignite          Hazelcast
Off-heap Memory     Configurable    Enterprise
Off-heap Indexing   Yes             No
Continuous Query    Yes             Enterprise
SSL Encryption      Yes             Enterprise
SQL Query           Full ANSI 99    Limited
Join Query          Yes             No
Data Consistency    Yes             Partial
84. Difference (Cont.)
                    Ignite                  Hazelcast
Deadlock-free       Yes                     No
Computing Grid      MapReduce, ForkJoin,    MapReduce
                    LoadBalance, ...
Streaming/CEP       Yes                     No
Service Grid        Yes                     No
Language            .Net/C#/C++/Node.js     .Net/C#/C++
Data Structures     Less                    More
Plug-in             Less                    More