Scabi is a simple, lightweight Cluster Computing and Storage framework for BigData processing, written purely in Java. Scabi provides high performance computing and storage with ease of use; users can get started with Scabi within a few minutes. Scabi is free to use. https://www.github.com/dilshadmustafa/scabi
TABLE OF CONTENTS
1. Scabi Overview
2. Scabi Data Driven Framework
3. Scabi Framework Constructs
4. Scabi Data Ring
5. Processing Huge Data Sets in Scabi
6. Single Hardware Vs Scabi Cluster Performance
7. User Files in Scabifs
8. Map/Reduce In Scabi
9. BigData Processing In Cloud
10. Peta Scale In Cloud
11. How to quickly run Scabi
12. Example 1 – MapReduce, Median computing examples
13. Example 2 – Scabifs examples
PART 1
TABLE OF CONTENTS
1. Scabi Compute Driven Framework Overview
2. Scabi Compute Driven Framework
3. Scabi Cluster
4. Scabi - Distributed Storage & Retrieval
5. Scabi Namespace
6. Submitting User Jobs, Programs in Scabi
7. Single Hardware Vs Scabi Cluster Performance
8. User Files in Scabi
9. Local Filesystem Vs Scabi Cluster Performance
10. User Tables, Data in Scabi
11. Map/Reduce In Scabi
12. BigData Processing In Cloud
13. Peta Scale In Cloud
14. Scabi Namespace Operations
15. How to quickly run Scabi
16. Scabi Performance Tuning
17. How to quickly build Scabi from GitHub
18. InfiniBand Support
19. APIs / Libraries used by Scabi
20. Scabi Test Environment
21. Example 1 - Complex and time consuming computing examples
22. Example 2 - Distributed Store & Retrieval examples
23. Example 3 - CRUD examples
24. Example 4 – Map/Reduce example
25. Example 5 – Scabi Namespace examples
PART 2
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi is a simple, lightweight Cluster Computing & Storage micro framework for BigData processing, written purely in Java.
Scabi provides two frameworks for processing: (a) a Data-driven framework and (b) a Compute-driven framework. Both frameworks share the same underlying core. Part 1 of this presentation covers the Data-driven framework and Part 2 covers the Compute-driven framework.
(a) Data-driven framework
In the data-driven framework, Scabi processes partitions of huge datasets in parallel by loading these partitions into memory and executing User-defined operations on those partitions (the partition data and its operations are together referred to as a Data Unit) in the Scabi Cluster.
The framework is highly fault tolerant and keeps executing the Data Units even when any number of systems in the Scabi Cluster fail at any time. A Data Unit uses an in-memory, off-heap, unbounded storage data structure that enables fast processing of huge data sets. This makes it possible to run algorithms such as complex MapReduce operations, ensemble machine learning algorithms and iterative algorithms, and gives us the capability to process Petabytes to Exabytes+ of multiple datasets within minutes.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Data Units across the Scabi Cluster and executing them there. The Scabi Compute Services and Meta Services weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Services by networking commodity hardware.
Scabi Overview
(b) Compute-driven framework
In the compute-driven framework, Scabi processes User-defined computations, Algorithms or jobs in parallel by splitting them into Compute Units and executing them in the Scabi Cluster.
The framework is highly fault tolerant and keeps executing the Compute Units even when any number of systems in the Scabi Cluster fail at any time. The framework takes care of distributed computing and load balancing in the Scabi Cluster. This gives us the capability to perform complex and time-consuming computations by aggregating and combining the processing power of many individual systems.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Users' jobs and programs across the Scabi Cluster and executing them there. The Scabi Compute Servers and Meta Servers weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Servers by networking commodity hardware. This means Users do not need specialized computing hardware with thousands of CPUs or CPU cores, or special network hardware.
The Scabi framework provides a simple API to easily distribute storage and retrieval of User files and data by using Scabi Namespaces. The micro framework with the cluster provides high availability of User files and data by keeping versions of User files.
Scabi Overview
In the data-driven framework, Scabi processes partitions of huge datasets in parallel by loading these partitions into memory and executing User-defined operations on those partitions (the partition data and its operations are together referred to as a Data Unit) in the Scabi Cluster.
The framework is highly fault tolerant and keeps executing the Data Units even when any number of systems in the Scabi Cluster fail at any time. A Data Unit uses an in-memory, off-heap, unbounded storage data structure that enables fast processing of huge data sets. This makes it possible to run algorithms such as complex MapReduce operations, ensemble machine learning algorithms and iterative algorithms, and gives us the capability to process Petabytes to Exabytes+ of multiple datasets within minutes.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Data Units across the Scabi Cluster and executing them there. The Scabi Compute Services and Meta Services weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Services by networking commodity hardware.
Scabi Data Driven Framework
[Figure 1.1: m Compute Hardware, each running n Compute Services, each executing p Data Units (DU). Each of the M datasets is split into Data Partitions (DP); N is the total number of Data Units across all Compute Services and all Compute Hardware for a dataset. The Driver Code coordinates the Data Units, and the Data Ring distributed storage system holds the partition data. DU = Data Unit, DP = Data Partition; m, n, p are variable numbers.]
Scabi Data Driven Framework (continued)
[Figure 1.2: structure of a Data Unit (DU). A Data Unit holds Data Partitions (DP), and each Data Partition is made up of Data Pages (DPE). Data Pages live in memory (in-memory, off-heap), in a local cache (memory-mapped local files) and in unbounded storage provided by the Data Ring distributed storage system. A Most Recently Used (MRU) set of Data Pages (DPE) is kept locally. Data Page size = 64 MB (configurable); Time To Live (TTL) = 1000 ms (configurable). p, k are variable numbers; M is the total number of datasets.]
Scabi Data Driven Framework (continued)
Scabi Framework Constructs
Scabi's data-driven framework comprises four core constructs: Data, Data Unit, Data Partition and Data Ring.
Data
Data is the construct that orchestrates a data cluster of Data Units across the Compute Services in the Scabi Cluster. It manages all of the User's datasets. The User can give each dataset a string identifier, referred to as a Data Id.
Data Unit
Data Unit is the construct that represents a data partition of a User dataset along with its User-defined set of operations. Data Units are executed in parallel in the Compute Services.
Data Partition
Data Partition is an in-memory, off-heap, unbounded storage data structure that uses memory, a local cache and the distributed storage system (Data Ring) to store a portion of a dataset. A Data Partition has unbounded storage because its storage is not limited to any particular system's hard disk or storage; the storage is provided by the Data Ring. A Data Partition maintains a Most Recently Used (MRU) set of Data Pages locally (basically 64 MB page files with a Time To Live (TTL) of 1000 ms), which are memory-mapped files, in-memory and off-heap, enabling faster processing in memory with a smaller memory footprint.
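The local MRU page cache described above can be illustrated with a small Java sketch. This is a minimal illustration of the MRU-page idea only: the class and method names are mine, the TTL behaviour is omitted, and eviction to the Data Ring is stubbed out; it is not Scabi's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative MRU page cache in the spirit of the Data Partition above.
// Hypothetical names; not Scabi's actual code.
public class MruPageCache {
    static final int PAGE_SIZE = 64 * 1024 * 1024; // 64 MB, configurable in Scabi
    static final long TTL_MS = 1000;               // Time To Live, configurable in Scabi (not modeled here)

    private final int maxPages;
    // accessOrder=true orders entries least-recently-used first,
    // so the eldest entry is the one to evict.
    private final LinkedHashMap<Integer, byte[]> pages =
            new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                    if (size() > maxPages) {
                        evictToDataRing(eldest.getKey(), eldest.getValue());
                        return true;
                    }
                    return false;
                }
            };

    public MruPageCache(int maxPages) { this.maxPages = maxPages; }

    public void put(int pageNo, byte[] data) { pages.put(pageNo, data); }

    // Reading a page marks it most recently used.
    public byte[] get(int pageNo) { return pages.get(pageNo); }

    public int size() { return pages.size(); }

    // In Scabi the evicted page would be persisted to the Data Ring
    // distributed storage system; here it is simply dropped.
    void evictToDataRing(int pageNo, byte[] data) { }
}
```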
Scabi Framework Constructs
Data Unit class
The Data Unit class is used to load data into each Data Unit and to tell the framework how many Data Units are to be created.
Data Ring
The Data Ring is a distributed storage system that holds all the partition data of all the User's datasets.
This can be:
(a) A network or distributed file system (POSIX or non-POSIX; it need not be fully POSIX compliant). Examples: NFS, IBM GPFS, Lustre
(b) A FUSE-mounted distributed file system. Examples: Red Hat GlusterFS, Scality, GFS2, Apache HDFS, MapR FS, Ceph FS, IBM Cleversafe
(c) A non-filesystem store. Example: SeaweedFS
(d) An S3 or Object Storage system. Examples: Minio, Cloudian, Riak
(e) Any other storage system with an HTTP, REST, S3 or custom interface. Support for any storage system can be added by implementing the interface IStorageHandler.java
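Scabi's actual IStorageHandler interface is not shown in this deck. Purely as an illustration of the pluggable-storage-handler pattern it describes, a handler might take a shape like the following; the interface and method names below are hypothetical, not Scabi's API.

```java
import java.util.HashMap;
import java.util.Map;

// Generic illustration of a pluggable storage-handler pattern.
// All names here are hypothetical, not Scabi's real IStorageHandler.
interface StorageHandler {
    void write(String path, byte[] data);
    byte[] read(String path);   // returns null if absent
    void delete(String path);
}

// A trivial in-memory backend standing in for an HTTP / REST / S3 store.
class InMemoryStorageHandler implements StorageHandler {
    private final Map<String, byte[]> store = new HashMap<>();
    public void write(String path, byte[] data) { store.put(path, data.clone()); }
    public byte[] read(String path) { return store.get(path); }
    public void delete(String path) { store.remove(path); }
}
```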
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Users' jobs and programs across the Scabi Cluster and executing them there. The Scabi Compute Servers and Meta Servers weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Servers by networking commodity hardware. This means Users do not need specialized computing hardware with thousands of CPUs or CPU cores, or special network hardware.
The Scabi framework provides a simple API to easily distribute storage and retrieval of User files and data by using Scabi Namespaces. The micro framework with the cluster provides high availability of User files and data by keeping versions of User files.
Scabi Compute Driven Framework Overview
Scabi provides a single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh).
User files of various sizes can be stored and retrieved using Scabi Namespaces, just like in a shared or cluster file system.
Massively parallel Map/Reduce, Aggregations and Geospatial queries can be performed on the User's tables in various databases without actually moving the data across the network.
Programs running on the User's Client system, as well as those running in the Scabi Cluster, can access the User's files and tables in the distributed databases through the Scabi Namespace URL: scabi:<namespace>:<resource name>. Using Scabi URLs also eliminates the need to pass huge volumes of data around the network, avoiding saturation of the network bandwidth.
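The documented URL form scabi:<namespace>:<resource name> can be split into its parts with a few lines of Java. This is an illustrative sketch, not Scabi's API; the class name and fields are mine.

```java
// Minimal sketch of picking apart the documented URL form
// scabi:<namespace>:<resource name>. Illustrative only, not Scabi's API.
public class ScabiUrl {
    public final String namespace;
    public final String resource;

    private ScabiUrl(String namespace, String resource) {
        this.namespace = namespace;
        this.resource = resource;
    }

    public static ScabiUrl parse(String url) {
        // Split into exactly three parts: scheme, namespace, resource.
        // limit=3 keeps any ':' inside the resource name intact.
        String[] parts = url.split(":", 3);
        if (parts.length != 3 || !parts[0].equals("scabi"))
            throw new IllegalArgumentException("not a scabi URL: " + url);
        return new ScabiUrl(parts[1], parts[2]);
    }
}
```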
Scabi Compute Driven Framework Overview (continued)
In Scabi, each User job or program is sliced into multiple split jobs. Each split job is known as a Compute Unit (CU). The total number of Compute Units is specified by the Scabi User. Each Compute Unit is executed separately in any of the Compute Servers available in the Scabi Cluster, and each Compute Server can execute multiple CUs concurrently.
There can be multiple Compute Servers running on the same hardware as well as on different hardware. All Compute Servers are connected to a Meta Server, and all Meta Servers are connected to each other, forming a Scabi Cluster. The Scabi Cluster can easily be scaled out horizontally by adding more Compute Hardware, starting more Compute Servers, running Compute Servers with more threads each, and adding Meta Servers, each with its own cluster of Compute Servers. A Meta Server is added by starting a new Meta Server and pointing it to an existing Meta Server, forming a mega cluster.
Scabi Compute Driven Framework
Scabi Cluster
[Figure shows a Scabi Cluster with m Compute Hardware running n Compute Servers, each running p Compute Units (CU), connected to one Meta Server.]
Scabi Cluster (cont'd)
[Figure shows a Scabi Cluster with multiple Compute Hardware running multiple Compute Servers, each running multiple Compute Units; it scales out horizontally by adding more Compute Hardware and starting more Compute Servers and Meta Servers. CH = Compute Hardware, CS = Compute Server.]
Scabi provides storage and retrieval for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh).
Scabi maintains two versions of each User file at any time: the current version and the immediate previous version of each file will always be available in the system. After each completed file upload operation, the uploaded file is marked as the latest version, and the last version (based on server timestamp) that existed in the system prior to the upload is marked as the immediate previous version. All other versions are removed from the system.
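The two-version retention rule above can be expressed as a small sketch. The method below mirrors the described rule (keep the upload as latest, keep the newest pre-existing version as immediate previous, drop the rest); it is an illustration, not Scabi's internal code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-version retention rule described above.
// Timestamps stand in for server timestamps; illustrative only.
public class VersionRetention {
    // versions: timestamps of versions existing before the upload;
    // uploadedAt: timestamp of the just-completed upload.
    // Returns the timestamps to keep, newest first.
    public static List<Long> keepAfterUpload(List<Long> versions, long uploadedAt) {
        List<Long> keep = new ArrayList<>();
        keep.add(uploadedAt);                    // marked as latest
        versions.stream().max(Long::compare)     // newest pre-existing version
                .ifPresent(keep::add);           // marked as immediate previous
        return keep;                             // all other versions are removed
    }
}
```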
Scabi relies on MongoDB's Replica Sets to provide high availability for the User's data through MongoDB's replicating / secondary servers. MongoDB's replication is transparent to Scabi users and is enabled by configuring MongoDB directly.
To provide load balancing for the User's file and database operations and to scale out horizontally, Scabi relies on MongoDB's Sharding to provide high performance access to the User's data. Scabi users can configure the MongoDB database directly to use Sharding.
Scabi - Distributed Storage & Retrieval
Scabi provides a single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh).
Each Scabi Namespace for User files, App-specific tables, unstructured document data, properties and Java files corresponds to a MongoDB database, as configured by the Scabi user when registering the namespace in the Meta Server. Scabi Namespaces can be registered to use the same or different MongoDB databases, which can be distributed and located anywhere on a network accessible to the Scabi Cluster and the User's Client system.
Programs running on the User's Client system, as well as those running in the Scabi Cluster, can access the User's resources stored in the distributed databases through the Scabi namespace URL: scabi:<namespace>:<resource name>
Users, as well as programs running in the Scabi Cluster, can perform various operations: registering a new namespace, and read / write operations for the various types of User data: User files, tables, unstructured document data, properties and Java files (.class, .jar, .bsh).
Scabi Namespace
[Figure: a Scabi Client and a Meta Server connected to databases DB-1, DB-2, …, DB-n and to m Compute Hardware running Compute Servers with Compute Units.]
Figure shows one scenario with a Scabi Client writing User files and table data to DB-2 and DB-n and submitting split jobs / Compute Units to various Compute Servers for execution. The CUs then process the User files and table data by accessing DB-2 and DB-n, writing results back to the DB or returning results to the Scabi Client. The Client either directly receives the results from the Compute Units or reads the results from User files and table data in DB-2 and DB-n.
The Scabi Client and Compute Servers resolve the namespace URL scabi:<namespace>:<resource name> into a specific DB by contacting the Meta Server.
Scabi Namespace (continued)
Submitting User Jobs, Programs in Scabi
User programs use the Scabi Client API to split jobs or programs into Compute Units. Users extend the DComputeUnit class and implement the compute() method. Users can then use the DCompute class to submit the Compute Units to the Scabi Cluster for execution in the Compute Servers.
1. Example (a) for Splitting a User's Job
Let's take a Prime number example. To check if a given number N is Prime, we need to divide N by all the previous prime numbers, or by 2 and all previous odd numbers up to the square root of N, to check whether N is divisible. If N contains millions of digits, this becomes a time-consuming computation for a single PC, or even for computer hardware with thousands of cores or CPUs.
To give an idea for comparison, Java's long has a maximum of 19 digits and double has a maximum of about 308 digits.
Job: Check if N is divisible by 2 or by any odd number up to √N: 2, 3, 5, …, √N
To split this job into Compute Units, extend the DComputeUnit class and implement the compute() method. The User specifies the number of times the job has to be split, e.g. 100,000; then 100,000 DComputeUnit objects will be executed in various Compute Servers in the Scabi Cluster.
Submitting User jobs or programs for execution in the Scabi Cluster involves the following steps:
1. Splitting the User's job or program
2. Extending the DComputeUnit class and implementing the compute() method
3. Using the DCompute class to submit the DComputeUnit class for execution in the Scabi Cluster
4. Retrieving the execution results
Submitting User Jobs, Programs in Scabi (continued)
1. Example (a) for Splitting a User's Job (continued)
[Figure: each DComputeUnit object checks divisibility of N only over its own range of numbers. CU #1 (getCU() = 1) covers 2, 3, 5, …, P1; CU #2 covers P1+2, P1+4, …, P2; …; CU #100,000 covers P99999+2, P99999+4, …, P100000; where Pi = i × (√N / getTU() Total Units).]
Submitting User Jobs, Programs in Scabi (continued)
1. Example (b) for Splitting a User's Job
A meteorological department's data contains geographical temperature variations from 1980 to 2015, automatically recorded by instrumentation devices every hour. The department needs to obtain the mean average of temperature variations on a per-month basis for its research purposes. The calculation becomes complex as they want to apply a complex statistical formula. If the data contains millions of records to be processed, this becomes a time-consuming computation for a single PC, or even for computer hardware with thousands of cores or CPUs.
Job: Calculate the mean average of temperature variations on a per-month basis from 1980 to 2015 and apply the department's statistical formula
To split this job into Compute Units, extend the DComputeUnit class and implement the compute() method. The User specifies the number of times the job has to be split, e.g. 36 to split by year; then 36 DComputeUnit objects will be executed in various Compute Servers in the Scabi Cluster.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
1. Example (b) for Splitting a User's Job (continued)
[Figure: each DComputeUnit object processes the 12 months of one year. CU #1 covers months 1–12 of 1980, CU #2 covers 1980+(2−1) = 1981, …, CU #36 covers 1980+(36−1) = 2015.]
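The split rule in Example (b) maps each Compute Unit number to one year, as sketched below (an illustrative helper, not part of the Scabi API):

```java
// The split rule of Example (b): year = 1980 + (cuNumber - 1),
// each unit covering months 1..12 of its year. Illustrative helper only.
public class YearSplit {
    static final int BASE_YEAR = 1980;

    static int yearForUnit(int cuNumber) {
        return BASE_YEAR + (cuNumber - 1);
    }
}
```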
Submitting User Jobs, Programs in Scabi (continued)
2. Extend the DComputeUnit Class and implement the compute() method
Example (a) MyFirstUnit Class
public class MyFirstUnit extends DComputeUnit {
    public String compute(DComputeContext context) {
        int totalUnits = context.getTU();
        int thisUnit = context.getCU();
        String result = "Hello from this unit CU #" + thisUnit;
        return result;
    }
}
The code above creates the class MyFirstUnit by extending DComputeUnit and implements the compute() method. The context object is passed by the Scabi framework to each Compute Unit object running in the Compute Servers.
getTU() gives the total number of Compute Units (split jobs) as specified by the User; getCU() gives the Compute Unit number of this particular Compute Unit object running in the Compute Server.
Submitting User Jobs, Programs in Scabi (continued)
2. Extend the DComputeUnit Class and implement the compute() method (continued)
Example (b) MyPrimeCheckUnit Class
public class MyPrimeCheckUnit extends DComputeUnit {
    public String compute(DComputeContext context) {
        int totalUnits = context.getTU();
        int thisUnit = context.getCU();
        BigInteger number = new BigInteger(context.getInput().getString("NumberToCheck"));
        // check if number is even.
        // If number > 2 and divisible by 2, then number is not Prime; return a not-prime result immediately
        …
        // Obtain square root of number
        BigInteger sqrtof = sqrt(number);
        // calculate chunkSize = sqrt(number) / totalUnits
        BigInteger chunkSize = …;
        // calculate starting number for division, start = (thisUnit – 1) * chunkSize + 1
        // make start an odd number > 1 if not already
        BigInteger start = …;
        // calculate ending number for division, end = thisUnit * chunkSize
        // make end an odd number if not already
        BigInteger end = …;
Submitting User Jobs, Programs in Scabi (continued)
2. Extend the DComputeUnit Class and implement the compute() method (continued)
Example (b) MyPrimeCheckUnit Class (continued)
        // check if number is divisible by the numbers from start to end
        for (…) {
            …
        }
        String result = …;
        return result;
    }
}
The above code is abbreviated to focus on the concept and save space. The abbreviated code is provided with comments and is self-explanatory.
We first calculate the chunk size, which is √N / getTU() (Total Units).
We then calculate the start and end numbers to be used for the division check:
start = (thisUnit – 1) * chunk size + 1
end = thisUnit * chunk size
For more details, please refer to the Java code.
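For readers who want a runnable version of the range-check logic, the sketch below implements the same chunkSize / start / end formulas as a plain Java program. It is an illustration of the deck's formulas, not Scabi's actual code: the minimum chunk size of 1, the last unit extending to √N, and the d < N guard are my additions so that small inputs are handled correctly, and BigInteger.sqrt() requires Java 9+.

```java
import java.math.BigInteger;

// Runnable sketch of the deck's range-splitting prime check. Each "unit"
// checks one chunk of odd candidate divisors, mirroring getCU()/getTU().
public class PrimeRangeCheck {

    // chunkSize = sqrt(N) / totalUnits, as in the slides (minimum 1 so small N works).
    static BigInteger chunkSize(BigInteger n, int totalUnits) {
        return n.sqrt().divide(BigInteger.valueOf(totalUnits)).max(BigInteger.ONE);
    }

    // True if this unit's share of odd candidate divisors contains a divisor of n.
    static boolean hasDivisorInUnit(BigInteger n, int thisUnit, int totalUnits) {
        BigInteger chunk = chunkSize(n, totalUnits);
        // start = (thisUnit - 1) * chunkSize + 1, made odd and > 1
        BigInteger start = BigInteger.valueOf(thisUnit - 1L).multiply(chunk).add(BigInteger.ONE);
        if (!start.testBit(0)) start = start.add(BigInteger.ONE);
        if (start.equals(BigInteger.ONE)) start = BigInteger.valueOf(3);
        // end = thisUnit * chunkSize, made odd; the last unit extends to sqrt(n)
        // to cover integer-division rounding.
        BigInteger end = BigInteger.valueOf(thisUnit).multiply(chunk);
        if (thisUnit == totalUnits) end = end.max(n.sqrt());
        if (!end.testBit(0)) end = end.add(BigInteger.ONE);
        for (BigInteger d = start; d.compareTo(end) <= 0; d = d.add(BigInteger.TWO)) {
            // guard d < n so tiny n never "divides" by itself
            if (d.compareTo(n) < 0 && n.mod(d).signum() == 0) return true;
        }
        return false;
    }

    // Driver combining all units (in Scabi each unit would run in a Compute Server).
    static boolean isPrime(BigInteger n, int totalUnits) {
        if (n.compareTo(BigInteger.TWO) < 0) return false;
        if (!n.testBit(0)) return n.equals(BigInteger.TWO); // even: prime only if 2
        for (int u = 1; u <= totalUnits; u++) {
            if (hasDivisorInUnit(n, u, totalUnits)) return false;
        }
        return true;
    }
}
```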
After the DComputeUnit class or object is created, Users can use the DCompute class to submit the Compute Units to the Scabi Cluster for execution in the Compute Servers.
The DCompute class uses asynchronous, non-blocking network I/O to submit the Compute Units to the Compute Servers for execution. It determines the optimal number of threads the User's Client system can handle, based on the Client system's memory and number of CPUs, as well as the number of threads sufficient to submit all the Compute Units. This class can be used to submit a very large number of Compute Units / split jobs to the Scabi Cluster for execution.
Users can also explicitly specify the number of threads to be created by using the maxThreads() method. The performance of execution of the Compute Units in the Scabi Cluster is limited mostly by the number of Compute Hardware and Compute Servers available in the Scabi Cluster.
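As an illustration of the kind of sizing described above, the heuristic below derives a submission thread count from the CPU count, available memory and the number of Compute Units. The formula and the ~1 MB-stack-per-thread assumption are mine, not Scabi's actual policy.

```java
// Illustrative heuristic for a submission thread count, in the spirit of the
// description above. Not Scabi's actual sizing logic.
public class ThreadSizing {
    static int submissionThreads(int cpus, long freeMemoryBytes, int totalComputeUnits) {
        int byCpu = cpus * 4;                           // a few I/O-bound threads per CPU
        long stackPerThread = 1L << 20;                 // assume ~1 MB of stack per thread
        int byMemory = (int) Math.min(Integer.MAX_VALUE, freeMemoryBytes / stackPerThread);
        // never more threads than Compute Units to submit, never fewer than 1
        return Math.max(1, Math.min(totalComputeUnits, Math.min(byCpu, byMemory)));
    }
}
```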
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class
3. Submitting Compute Units for execution in the Cluster
To give a theoretical example, the following code submits 1 billion Compute Units or split jobs to check whether the input number is a Prime number. The number of Compute Hardware and Compute Servers running in the Scabi Cluster is the limiting factor in the performance of execution of the Compute Units.
The code below shows four different ways to do it:
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
The MyPrimeCheckUnit class extends the DComputeUnit class and is explained in the prior slides in the section "Extend the DComputeUnit Class and implement the compute() method", Example (b).
json is a Dson object containing the input number to check for Prime. It can contain potentially millions of digits. To give an idea for comparison, Java's long has a maximum of 19 digits and double has a maximum of about 308 digits.
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
DCompute c = new DCompute(meta);
c.executeClass(MyPrimeCheckUnit.class).split(1000000000).input(json).output(map).perform();
c.finish();
Method 1 – Submitting Compute Units with Class
3. Submitting Compute Units for execution in the Cluster (continued)
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
MyPrimeCheckUnit cu = new MyPrimeCheckUnit();
DCompute c = new DCompute(meta);
c.executeObject(cu).split(1000000000).input(json).output(map).perform();
c.finish();
Method 2 – Submitting Compute Units with object reference
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
DCompute c = new DCompute(meta);
c.addJar("MyPrimeCheckUnit.jar"); // Add all Java libraries / jar files like this
c.executeCode("import MyPrimeCheckUnit;" +
    "cu = new MyPrimeCheckUnit();" +
    "return cu.compute(context);");
c.split(1000000000).input(json).output(map).perform();
c.finish();
Method 3 – Submitting Compute Units with Java Source Code
3. Submitting Compute Units for execution in the Cluster (continued)
The Compute Units submitted through the executeClass(), executeObject(), executeCode() and executeJar() methods are all executed in the Compute Servers available in the Cluster.
The DCompute class provides various methods to do the following:
(1) Specify the number of times the job or program has to be split into Compute Units
(2) Specify the range, i.e. which set of Compute Units is to be submitted to the Scabi Cluster
(3) Provide input data to the Compute Units
(4) Specify where to store the output results of each Compute Unit after execution in the Scabi Cluster
(5) Explicitly specify and override the maximum number of threads to be created to submit the Compute Units to the Scabi Cluster
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
DCompute c = new DCompute(meta);
c.executeJar("MyPrimeCheckUnit.jar", "MyPrimeCheckUnit");
c.split(1000000000).input(json).output(map).perform();
c.finish();
Method 4 – Submitting Compute Units with Jar file
3. Submitting Compute Units for execution in the Cluster (continued)
As the Compute Units submitted through the executeClass(), executeObject(), executeCode() and executeJar() methods are all executed in the Compute Servers available in the Cluster, the Compute Servers will not have the User's jar files for the classes used by the DComputeUnit object (cu in the example below).
Use the addJar() method to add all the supporting jar files.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        MyPrimeCheckUnit pcu = new MyPrimeCheckUnit();
        return pcu.compute(context);
    }
};
DCompute c = new DCompute(meta);
// Add all Java libraries, jar files like this
c.addJar("MyPrimeCheckUnit.jar");
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();
Adding jar files, Java libraries to the Compute Units
3. Submitting Compute Units for execution in the Cluster (continued)
The example below uses the executeObject() method to submit a Compute Unit. The Compute Unit internally submits its own Compute Units / split jobs for execution in the Cluster.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        DCompute c2 = new DCompute(meta2);
        c2.executeClass(MyPrimeCheckUnit.class);
        c2.input(context.getInput()).split(1).output(out2);
        c2.perform();
        c2.finish();
        … … …
    }
};
DCompute c = new DCompute(meta);
// Add all Java libraries, jar files like this
c.addJar("MyPrimeCheckUnit.jar");
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();
Submitting Compute Units from within a Compute Unit
3. Submitting Compute Units for execution in the Cluster (continued)
The CUs submitted through this CU run in any of the Compute Servers. The jar file paths on the User's Client system will not be valid inside a Compute Server. Users can use the addComputeUnitJars() method to add all the jars that were added to this CU (cu in the code below) by the User earlier, when submitting this CU.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        DCompute c2 = new DCompute(meta2);
        c2.addComputeUnitJars();
        c2.executeCode("import MyPrimeCheckUnit;" +
            "cu = new MyPrimeCheckUnit();" +
            "return cu.compute(context);");
        c2.input(context.getInput()).split(1).output(out2);
        c2.perform();
        c2.finish();
    }
};
DCompute c = new DCompute(meta);
c.addJar("MyPrimeCheckUnit.jar"); // Add all Java libraries / jar files like this
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();
Providing User's jar files to Compute Units submitted from within a Compute Unit
3. Submitting Compute Units for execution in the Cluster (continued)
Submitting User Jobs, Programs in Scabi (continued)
Compute Units are just like any other Java program. We need to include jar files in the program to use the APIs of specific storage systems (like Amazon S3, Google Cloud Storage), other file systems, databases (JDBC, Oracle, DB2, Cassandra, CouchDB, Redis, etc.), other third-party Java libraries, Java Machine Learning libraries (J48/C4.5/C5, JavaBayes, Weka, etc.) and other external Java libraries.
To access these systems or other Java libraries from inside a Compute Unit, add the jar files using the .addJar() method before submitting the Compute Units for execution using the .perform() method of the DCompute class.
In each DCompute object, a maximum of Long.MAX_VALUE (2^63 − 1, i.e. 9223372036854775807) splits can be created. By creating additional DCompute objects and passing additional parameter values in the json input (passed in the .input() method of DCompute), such as "TotalDComputeObjects" and "ThisDComputeObject", a theoretically unbounded number of splits can be created and handled in the System, limited only by the hardware and memory used in the System.
DCompute Class (continued)
3. Submitting Compute Units for execution in the Cluster (continued)
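With "TotalDComputeObjects" and "ThisDComputeObject" passed in the Dson input as described above, each Compute Unit can derive a globally unique split number. The numbering formula below is an illustration; Scabi does not prescribe it.

```java
// Sketch of numbering splits across several DCompute objects. With the
// parameters described above, unit numbers can be made globally unique.
// The formula is illustrative, not part of the Scabi API.
public class GlobalSplit {
    // thisObject is the 1-based "ThisDComputeObject" value;
    // localUnit is this unit's getCU(), also 1-based.
    static long globalUnitNumber(long splitsPerObject, int thisObject, long localUnit) {
        return (thisObject - 1L) * splitsPerObject + localUnit;
    }
}
```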
36. After the finish() method is called on the Dcompute object, results will be available in the
HashMap object supplied in the .output() method.
The map will contain the split number / Compute Unit number as the key with value as the
result returned by each Compute Unit after execution in the Compute Server in the Cluster.
Submitting User Jobs, Programs in Scabi (continued)
4. Retrieving Results After Execution
In DCompute class, Compute Units / split jobs will be submitted for execution in parallel in
the available Compute Servers in the Scabi Cluster.
The framework will try to execute the Compute Units in as many Compute Servers as the
total number of splits specified by the User. This is limited only by the number of Compute
Servers running in the Scabi Cluster.
Users can use the maxRetry() method to specify the maximum number of retries attempted
in case of a network communication error with a Compute Server. In those cases,
submission of the Compute Units will be retried on other working Compute Servers in
the Cluster.
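The retry behaviour described above can be modeled with a small standalone sketch. This is an illustrative plain-Java model, not Scabi's actual implementation; the method and server names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Standalone model of the retry behaviour described above: a submission is
// attempted against the available servers in turn, allowing up to `maxRetry`
// retries after a failure. This is an illustration, not Scabi's actual code.
public class RetrySubmit {

    // Tries `submit` against each server; returns the first successful result.
    // Gives up once `maxRetry` failed attempts have been exhausted.
    public static <R> R submitWithRetry(List<String> servers,
                                        Function<String, R> submit,
                                        int maxRetry) {
        RuntimeException last = null;
        int failures = 0;
        for (String server : servers) {
            if (failures > maxRetry) break;
            try {
                return submit.apply(server);
            } catch (RuntimeException e) {   // e.g. a network communication error
                last = e;
                failures++;
            }
        }
        throw new RuntimeException("all retries exhausted", last);
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("cs1:5001", "cs2:5002", "cs3:5003");
        // First two servers "fail"; the submission is retried on the third.
        String result = submitWithRetry(servers,
                s -> {
                    if (!s.startsWith("cs3")) throw new RuntimeException("down: " + s);
                    return "ok from " + s;
                },
                2);
        System.out.println(result);  // ok from cs3:5003
    }
}
```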
3. Submitting Compute Units for execution in the Cluster (continued)
37. Single Hardware Vs Scabi Cluster Performance
[Graph: execution time vs. number of threads/processes on a single machine.]
In a single-hardware scenario, the User has a system with N CPUs/cores. Users get the
maximum performance by running their program with the maximum number of threads or
in multiple processes. If the User's system has 16 GB of memory and, assuming each
thread consumes 1 MB of stack memory, the maximum number of threads that can be
created on the User's client system is approximately 16,384. If the User's system has only
4 CPUs, the system will then spend most of its time in thread context switching.
The above graph shows that the maximum performance (minimum time in the graph) is
achieved only at a certain number of threads. As the number of threads increases, the
CPUs spend most of their time in thread context switching, and a very high number of
threads actually becomes counter-productive for the performance of the system. The graph
shows that at very high thread counts, performance actually drops.
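The thread-count estimate quoted above is simple arithmetic, sketched here as a standalone snippet:

```java
// Back-of-envelope calculation from the text: with 16 GB of RAM and ~1 MB of
// stack per thread, roughly 16 * 1024 threads fit before memory is exhausted.
public class ThreadLimit {

    // Approximate maximum thread count given total memory and per-thread
    // stack size, both expressed in MB.
    public static long maxThreads(long totalMemoryMb, long stackPerThreadMb) {
        return totalMemoryMb / stackPerThreadMb;
    }

    public static void main(String[] args) {
        long threads = maxThreads(16 * 1024, 1);   // 16 GB total, 1 MB stacks
        System.out.println(threads);               // 16384
    }
}
```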
38. Single Hardware Vs Scabi Cluster Performance
[Graph: Scabi Cluster performance, execution time vs. m compute hardware x n Compute Servers.]
In the previous single-hardware scenario, the User's system has 16 GB of memory and,
assuming each thread consumes 1 MB of stack memory, the maximum number of threads
that can be created on the User's client system is approximately 16,384.
Using a Scabi Cluster, the User can run programs on multiple such systems. For example, if
the User has m such systems, programs can effectively run on m * 16,384 threads, allowing
the User to scale out horizontally by adding more compute hardware, starting more Compute
Servers, running Compute Servers with more threads per Compute Server, or adding more
Meta Servers, each with its own Cluster of Compute Servers.
The Compute Units submitted through the DCompute class run concurrently across the
Compute Servers in the Scabi Cluster, as well as concurrently on multiple threads within
each Compute Server.
39. User file operations in Scabi are carried out using the DFile Class. The DFile class provides
methods to set Namespace to access, put() method to store files in Scabi and get() method
to retrieve files from Scabi.
Scabi maintains two versions of each User file at any time. The current version and the
immediate previous version of each file will always be available in the system. After each
completed file upload operation, the uploaded file is marked as the latest version, and the
last version (based on server timestamp) that existed in the system prior to the upload is
marked as the immediate previous version. All other versions are removed from the
system.
Scabi provides single, unified and uniform namespace for various types of User data: files,
tables, unstructured document data (Collections), properties and Java files (.class, .jar,
.bsh).
Each Scabi Namespace for User files, User App-specific tables, unstructured document
data, properties and Java files corresponds to a MongoDB database, as configured by the
Scabi user while registering the namespace with the Meta Server. Scabi Namespaces can
be registered to use the same or different MongoDB databases, which can be distributed,
located anywhere, and connected to a network accessible by the Scabi Cluster and the
User's Client system.
User Files In Scabi (continued)
40. User Files In Scabi
Scabi connects to MongoDB databases for various User file and User database
operations. Each Scabi Namespace may correspond to a separate MongoDB database.
Scabi's Meta Server also connects to MongoDB to read/write metadata about Compute
Servers as well as Scabi Namespaces.
Scabi relies on MongoDB's Replica Sets to provide high availability for Users' data
through MongoDB's replication / secondary servers. To provide load balancing for
various User file and database operations and to scale out horizontally, Scabi relies on
MongoDB's sharding.
41. The following Scabi Namespace URLs refer to different files:-
scabi:MyOrg.MyFiles:myfile.txt
scabi:MyOrg.MyFiles:/myfile.txt
scabi:MyOrg.MyFiles:./myfile.txt
scabi:MyOrg.MyFiles:/home/dilshad/myfile.txt
scabi:MyOrg.MyFiles:myorg.mydept.myfile.txt
scabi:MyOrg.MyFiles:myorg-mydept-myfile.txt
scabi:MyOrg.MyFiles:/usr/myfile.txt
scabi:MyOrg.Bangalore:myfile.txt
scabi:MyOrg.California:myfile.txt
After the files are stored in Scabi using the put() method, the Scabi Namespace URL can be
passed around between the Scabi Client system and Compute Units to access input files for
processing by the Compute Units. Compute Units can also write results to files in Scabi and
return a Scabi Namespace URL of the file in the result if the result contains a huge volume
of data.
Programs running in the User’s Client system as well as those running in the Scabi Cluster
can access the User’s files stored in the distributed databases through the Scabi
Namespace URL: scabi:<namespace>:<file name>
Users, as well as programs running in the Scabi Cluster, can perform various operations,
viz. registering new namespaces and performing read/write operations on the files.
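A Scabi Namespace URL of the form scabi:&lt;namespace&gt;:&lt;file name&gt; is straightforward to take apart. The parser below is a hypothetical standalone illustration, not part of the Scabi API:

```java
// Hypothetical parser for the URL format described above:
// scabi:<namespace>:<file name>, e.g. "scabi:MyOrg.MyFiles:/home/dilshad/myfile.txt".
public class ScabiUrl {
    public final String namespace;
    public final String fileName;

    public ScabiUrl(String url) {
        if (!url.startsWith("scabi:"))
            throw new IllegalArgumentException("not a Scabi Namespace URL: " + url);
        String rest = url.substring("scabi:".length());
        int sep = rest.indexOf(':');   // first ':' splits namespace from file name
        if (sep < 0)
            throw new IllegalArgumentException("missing file name: " + url);
        this.namespace = rest.substring(0, sep);
        this.fileName = rest.substring(sep + 1);
    }

    public static void main(String[] args) {
        ScabiUrl u = new ScabiUrl("scabi:MyOrg.MyFiles:/home/dilshad/myfile.txt");
        System.out.println(u.namespace);  // MyOrg.MyFiles
        System.out.println(u.fileName);   // /home/dilshad/myfile.txt
    }
}
```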
User Files In Scabi (continued)
42. [Figure: a Scabi Client uses the Scabi framework for file operations; the framework
resolves Namespaces by contacting the Meta Server, and DB-1 stores the latest and
previous version of each file.]
Figure shows a Client using Scabi micro framework to do file operations in Scabi.
Namespaces are resolved by the framework by contacting the Meta Server.
User Files In Scabi (continued)
43. [Figure: file operations between a Scabi Client and Compute Units. 1) the Client writes
scabi:MyOrg.MyFiles:input.txt; 2) Compute Units read scabi:MyOrg.MyFiles:input.txt;
3) Compute Units write scabi:MyOrg.MyFiles:results.txt; 4) the Client reads
scabi:MyOrg.MyFiles:results.txt. The framework resolves Namespaces via the Meta Server,
DB-1 stores the latest and previous version of each file, and the Compute Units run across
n Compute Servers on m compute hardware.]
Both the Scabi Client and the Compute Units running in Compute Servers can perform
read/write operations on files in Scabi. After a file is stored in Scabi, only the Scabi
Namespace URL of the file needs to be conveyed, instead of transferring the actual
contents of the file between the Scabi Client and the Compute Units.
User Files In Scabi (continued)
44. The DFile Class is used for all file operations in Scabi. The Scabi micro framework
resolves the namespaces in Scabi Namespace URLs by contacting the Meta Server.
The DFile class internally uses a streaming mechanism to read and write files in Scabi.
This means the entire file contents are not loaded into memory. For example, if the User's
Client system has 2 GB of memory and a 6 GB file is written using the DFile class, the
contents of the file are transferred through a 64 MB internal buffer; the entire 6 GB of file
data is never held in memory. The DFile class is lightweight and adds little overhead when
transferring data. To give an example, on a low-end system with 2 GB RAM, a 6 GB file is
transferred within 4 minutes.
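The streaming mechanism described above can be sketched with plain Java streams. This is an illustrative standalone copy loop, not DFile's actual implementation; the 64 MB figure matches the buffer size quoted in the text:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative fixed-buffer streaming copy, as described for DFile above:
// data flows through a bounded buffer, so memory use is independent of file size.
public class StreamingCopy {

    // Copies `in` to `out` through a buffer of `bufferSize` bytes; returns bytes copied.
    public static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[300];               // stands in for a large file
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // DFile is described as using a 64 MB buffer; 64 bytes keeps this demo small.
        long copied = copy(new ByteArrayInputStream(data), sink, 64);
        System.out.println(copied);                // 300
    }
}
```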
The Scabi framework maintains two versions of each file: the latest version and the
previous version. This ensures that when transferring very large files, if a network error
occurs, the system will still contain the previous version and the corrupted upload will be
discarded.
High availability of data is achieved by enabling MongoDB Replication of primary and
secondary servers of the underlying MongoDB server instances that correspond to each
Scabi Namespace.
High loads can be handled by using MongoDB Sharding of the underlying MongoDB
server instances that correspond to each Scabi Namespace.
User Files In Scabi (continued)
DFile Class
45. User Files In Scabi (continued)
DFile Class (continued)
DFile f = new DFile(meta);
f.put("scabi:MyOrg.MyFiles:myfile1.txt", "myfile1.txt");
FileInputStream fis = new FileInputStream("myfile3.txt");
f.put("scabi:MyOrg.MyFiles:myfile3.txt", fis);
fis.close();
To put a file from local file system into Scabi:
To put a file from an input stream into Scabi:
The following code examples demonstrate various file operations using DFile class:
f.copy("scabi:MyOrg.MyFiles:myfile4.txt", "scabi:MyOrg.MyFiles:myfile1.txt");
To copy a file already in Scabi into another Scabi Namespace or to another file:
46. User Files In Scabi (continued)
DFile Class (continued)
f.get("scabi:MyOrg.MyFiles:myfile1.txt", "fileout1.txt");
FileOutputStream fos = new FileOutputStream("fileout3.txt");
f.get("scabi:MyOrg.MyFiles:myfile1.txt", fos);
fos.close();
To get a file already in Scabi to local file system:
To get a file already in Scabi to an output stream:
47. Local Filesystem Vs Scabi Cluster Performance
[Graphs: time vs. number of bytes read/written, comparing a local file system with a
Scabi Cluster using m Scabi Namespaces x n MongoDB instances.]
48. User Tables In Scabi
User table operations in Scabi are carried out using the Dao, DTable, DObject, DResultSet
classes.
Dao class - provides methods to set Namespace to access, getTable() method to get DTable.
DTable class - provides select, insert, update and delete operations; can directly
embed MongoDB queries to query or filter data; can expose the underlying
Mongo Collection; and supports Map/Reduce, the Aggregation Framework
(Aggregation Pipeline) and Geospatial queries
DResultSet class - contains the results returned by DTable methods
The following Scabi Namespace URLs refer to different tables:-
scabi:MyOrg.MyTables:mytable
scabi:MyOrg.MyTables:/mytable
scabi:MyOrg.MyTables:./mytable
scabi:MyOrg.MyTables:/home/dilshad/mytable
scabi:MyOrg.MyTables:myorg.mydept.mytable
scabi:MyOrg.MyTables:myorg-mydept-mytable
scabi:MyOrg.MyTables:/usr/mytable
scabi:MyOrg.Bangalore.Tables:mytable
scabi:MyOrg.California.Tables:mytable
49. After the data are stored in Scabi using DTable methods, the Scabi Namespace URL can
be passed around between the Scabi Client system and Compute Units to access data for
processing by the Compute Units. Compute Units can also write results to tables in Scabi
and return a Scabi Namespace URL of the table in the result if the result contains a huge
volume of data.
Programs running in the User’s Client system as well as those running in the Scabi Cluster
can access the User’s tables stored in the distributed databases through the Scabi
Namespace URL: scabi:<namespace>:<table name>
Users, as well as programs running in the Scabi Cluster, can perform various operations,
viz. registering new namespaces and performing read/write operations on the tables.
User Tables In Scabi (continued)
51. [Figure: table operations between a Scabi Client and Compute Units. 1) the Client writes
scabi:MyOrg.MyTables:emp_table; 2) Compute Units read scabi:MyOrg.MyTables:emp_table;
3) Compute Units write scabi:MyOrg.MyTables:emp_results; 4) the Client reads
scabi:MyOrg.MyTables:emp_results. The framework resolves Namespaces via the Meta
Server, table data is stored in DB-1, and the Compute Units run across n Compute Servers
on m compute hardware.]
Both the Scabi Client and the Compute Units running in Compute Servers can perform
select, insert, update and delete operations on tables in Scabi. After table data is stored in
Scabi, only the Scabi Namespace URL of the table needs to be conveyed, instead of
transferring the actual contents of the table between the Scabi Client and the Compute Units.
User Tables In Scabi (continued)
52. User Tables In Scabi (continued)
DTable table = dao.createTable("scabi:MyOrg.MyTables:Table1");
DDocument d = new DDocument();
d.append("EmployeeName", "Karthik").append("EmployeeNumber", "3000");
d.append("Age", 40);
table.insert(d);
DDocument d2 = new DDocument();
d2.put("Age", 45);
DDocument updateObj = new DDocument();
updateObj.put("$set", d2);
table.update(eq("EmployeeName", "Balaji"), updateObj);
To create a table in Scabi:
To insert data into a table in Scabi:
To update records in a table in Scabi:
The following code examples demonstrate various database operations using Dao class:
Dao Class (continued)
DTable table = dao.getTable("scabi:MyOrg.MyTables:Table1");
To get existing table in Scabi:
53. User Tables In Scabi (continued)
DResultSet result = table.find(or(eq("EmployeeNumber", "3003"), lt("Age", 40)));
while (result.hasNext()) {
DDocument d3 = result.next();
… … …
}
To query data in a table in Scabi:
Dao Class (continued)
MongoCollection<Document> c = table.getCollection();
To access underlying MongoCollection from DTable:
String map = "function() { for (var key in this) { emit(key, null); } }";
String reduce = "function(key, s) { if (\"Age\" == key) return true; else return false; }";
MapReduceIterable<Document> out = c.mapReduce(map, reduce);
for (Document o : out) {
System.out.println("Key name is : " + o.get("_id").toString());
System.out.println(o.toString());
}
Map/Reduce example directly on the MongoCollection:
54. Map/Reduce In Scabi
MongoCollection<Document> c = table.getCollection();
To access underlying MongoCollection from DTable:
String map = "function() { for (var key in this) { emit(key, null); } }";
String reduce = "function(key, s) { if (\"Age\" == key) return true; else return false; }";
MapReduceIterable<Document> out = c.mapReduce(map, reduce);
for (Document o : out) {
System.out.println("Key name is : " + o.get("_id").toString());
System.out.println(o.toString());
}
Map/Reduce example directly on the MongoCollection:
Map/Reduce functions can be directly executed on the Mongo Collection natively.
55. Map/Reduce In Scabi (continued)
MongoDB’s Map/Reduce functionality can be directly used/invoked from within each Compute
Unit. The following optimizations can be performed on Map/Reduce with MongoDB:
1) Map/Reduce optimized with sort on indexed fields
2) Incremental Map/Reduce (Map/Reduce with query filter to read only newer records)
3) Concurrent Map/Reduce (each Map/Reduce over specific ranges)
4) Map/Reduce over sharded collection
MongoDB’s Aggregation Framework functionality (Aggregation Pipeline) can be directly
used/invoked from within each Compute Unit. The following optimizations can be performed
on aggregate() with MongoDB:
1) Concurrently doing aggregate() (each aggregate() over specific split keys)
2) aggregate() over sharded collection
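Optimization (3) for Map/Reduce, and optimization (1) for aggregate(), both rely on dividing a key range into contiguous sub-ranges, one per Compute Unit. A minimal standalone sketch of that split (the helper is hypothetical, not part of Scabi or the MongoDB driver):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch for concurrent Map/Reduce over specific ranges: divide a
// key range [lo, hi) into `parts` contiguous sub-ranges, so each Compute Unit
// can run Map/Reduce (or aggregate()) with a query filter restricted to its own range.
public class RangeSplitter {

    // Returns {start, end} pairs covering [lo, hi); any remainder is spread
    // over the first ranges, so sizes differ by at most one.
    public static List<long[]> split(long lo, long hi, int parts) {
        long span = hi - lo;
        long base = span / parts, extra = span % parts;
        List<long[]> ranges = new ArrayList<>();
        long start = lo;
        for (int i = 0; i < parts; i++) {
            long end = start + base + (i < extra ? 1 : 0);
            ranges.add(new long[] { start, end });
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Split ids [0, 10) across 3 Compute Units: [0,4), [4,7), [7,10).
        for (long[] r : split(0, 10, 3)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```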
56. Map/Reduce In Scabi (continued)
Alternatively, we can also programmatically read/write data from the Mongo Collection or
from any other data source from each Compute Unit.
Compute Units are just like any other Java program. We need to include the jar files in the
program to start using the API of specific storage systems (like Amazon S3, Google Cloud
Storage), other file systems, databases (JDBC, Oracle, DB2, Cassandra, CouchDB,
Redis, etc.), other third-party Java libraries, Java Machine Learning libraries (J48/C4.5/C5,
JavaBayes, Weka, etc.), other Java external libraries.
To access these systems from inside a Compute Unit, add the jar files using the .addJar()
method before submitting the Compute Units for execution using the .perform() method in
DCompute class.
In each Compute Unit, load data directly from any data source: shared file system, SAN,
NAS, Alluxio in-memory distributed file system (formerly Tachyon), JDBC, Amazon S3, etc.
and perform the Map/Reduce and Aggregations and write back the results to the data
source.
57. Map/Reduce In Scabi (continued)
For unstructured data, the DFile class can be used to store large data files and retrieve
only specific portions/partitions (e.g. a 64 MB partition) of a file from each Compute Unit
to process the data concurrently.
Alternatively, very large files can be split into several small files and stored using DFile and
processed from each Compute Unit concurrently.
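Partitioned reads as described above come down to computing offsets and lengths over the file. A standalone sketch of that arithmetic (a hypothetical helper, not the DFile API):

```java
// Standalone sketch of the partition arithmetic behind partitioned reads:
// a file of `fileSize` bytes is divided into partitions of `partitionSize`
// bytes (e.g. 64 MB); partition i starts at offset i * partitionSize.
public class FilePartitions {

    // Number of partitions needed to cover the file (the last one may be short).
    public static long partitionCount(long fileSize, long partitionSize) {
        return (fileSize + partitionSize - 1) / partitionSize;
    }

    // Length in bytes of partition `index` (0-based).
    public static long partitionLength(long fileSize, long partitionSize, long index) {
        long offset = index * partitionSize;
        return Math.min(partitionSize, fileSize - offset);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 150 * mb, part = 64 * mb;     // 150 MB file, 64 MB partitions
        System.out.println(partitionCount(fileSize, part));          // 3
        System.out.println(partitionLength(fileSize, part, 2) / mb); // 22 (last partition)
    }
}
```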
For structured or semi-structured data, the Dao class can be used to do Map/Reduce,
Aggregations and other data computations. By spreading the data into multiple databases,
database instances, collections or using sharded databases and collections, massively
parallel data computations can be implemented (discussed further in section “Peta Scale
With Cloud”).
From each Compute Unit, read data over specific ranges or specific split keys from the
database to perform data computations concurrently across multiple Compute Units.
58. BigData solutions rely on data locality to handle data processing on large data sets. The
data locality is achieved through local disk I/O (large number of Direct Attached Storage
devices attached to each node) and in-memory (large RAM in each node).
Both the Direct Attached Storage (DAS) architecture and large RAM sizes tend to raise
costs quickly as data reaches Peta scale, after factoring in maintenance costs, backups of
previous months' Peta-scale data sets, the hardware upgrade cycle, etc. For example, in
Amazon AWS, EBS with PIOPS and S3 are relatively cost-effective compared with huge
numbers of DAS devices physically attached to each node across tens of thousands of
nodes for storing Peta-scale data. EBS with PIOPS can provide near data locality.
If Map/Reduce is run incrementally and as a batch job, S3 provides a cost-effective
solution for storing Petabytes of data (as in the Hadoop implementations at Netflix,
Pinterest, etc.).
In-memory distributed file systems like Alluxio (formerly Tachyon) can provide faster
access and act as a distributed memory cache layer. From each Compute Unit we can
directly read/write to the in-memory file system.
Peta scale data can be divided and stored in multiple MongoDB instances or databases.
From each Compute Unit we can directly read/write to a corresponding MongoDB
instance or database.
BigData Processing In Cloud
59. Peta Scale With Cloud
Scabi micro framework can be used to implement solutions to many different kinds of
problems, including Map/Reduce and Aggregation problems, parallel algorithms, and
running Machine Learning algorithms in parallel over huge data sets stored in
heterogeneous data sources.
Below is one example of handling Peta scale. In this example, divide the Petabytes of
data into multiple MongoDB databases to speed up Map/Reduce data computations, and
assign them to different Scabi Namespaces (DataSet1, ..., DataSetN) using the
meta.namespaceRegister() method. Then submit multiple Compute Units to the Scabi
Cluster.
In each Compute Unit, get a table from the assigned Scabi Namespace and access the
Mongo Collection using MongoCollection c = table.getCollection().
Then run Map/Reduce directly on the Mongo Collection natively using c.mapReduce(map,
reduce) (refer Example5). This way, the actual data is not moved around the network.
Data can be spread based on any of the arrangements below:
1. Multiple different databases in different MongoDB instances
2. Multiple different collections in the same or different databases, in the same or different
MongoDB instances
3. Sharded databases and collections
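The overall pattern, with each Compute Unit reducing its own data set and the per-unit results merged afterwards, can be modeled in plain Java. In the standalone simulation below, threads stand in for Compute Units and in-memory lists stand in for MongoDB databases:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Standalone model of the grid pattern described above: each "Compute Unit"
// (a thread here) runs map/reduce over its own "database" (an in-memory list),
// and the per-unit word counts are merged at the end.
public class ParallelWordCount {

    // Map/reduce over one partition: count word occurrences.
    static Map<String, Integer> countWords(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : partition) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static Map<String, Integer> run(List<List<String>> partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (List<String> p : partitions)
                futures.add(pool.submit(() -> countWords(p)));
            Map<String, Integer> merged = new HashMap<>();   // final reduce step
            for (Future<Map<String, Integer>> f : futures)
                f.get().forEach((k, v) -> merged.merge(k, v, Integer::sum));
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("scabi", "mongo", "scabi"),
                Arrays.asList("mongo", "scabi"));
        System.out.println(run(partitions));
    }
}
```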
60. Peta Scale With Cloud (continued)
The figure shows one example of implementing massively parallel Map/Reduce and
Aggregations: a Grid of Compute Units is logically aligned to a corresponding grid of
MongoDB instances, databases or collections.
[Figure: a grid of Compute Units (CU), each paired with its own MongoDB
instance/database (DB), and each invoking c.mapReduce() and c.aggregate() locally
on its own data.]
61. Scabi Namespace Operations
Dson dson = new Dson();
dson.add("Namespace", "MyCompany-Tables");
dson.add("Type", DNamespace.APPTABLE);
dson.add("Host", "localhost");
dson.add("Port", "27017");
dson.add("UserID", "myuser");
dson.add("Pwd", "hello");
dson.add("SystemSpecificName", "MyCompanyDB");
dson.add("SystemType", "MongoDB");
if (!meta.namespaceExists("MyCompany-Tables")) {
System.out.println("Register new namespace");
String uuid = meta.namespaceRegister(dson);
}
To create a new namespace:
62. How to quickly run Scabi
1. Install Oracle Java 8 Java SE 1.8.0_66
2. Install MongoDB v3.2.1 with default settings, without enabling a login password or
security certificate. Run sudo mongod --dbpath /home/<username>/db/data
3. Download scabi.tar.gz from the Download folder in Scabi's GitHub project
4. Extract scabi.tar.gz to a folder /home/<username>/scabi
5. Start Meta Server,
./start_meta.sh &
6. Start Compute Servers,
./start_compute.sh 5001 localhost 5000 1000 &
./start_compute.sh 5002 localhost 5000 1000 &
To start Compute Servers in other machines and ports, enter command as below,
./start_compute.sh <ComputeServer_Port> <MetaServer_HostName>
<MetaServer_Port> [<NoOfThreads> [debug]] &
7. Run example code inside the examples folder in /home/<username>/scabi,
cd examples
java -cp "../dependency-jars/*":"../*":. Example1
java -cp "../dependency-jars/*":"../*":. Example1_2
java -cp "../dependency-jars/*":"../*":. Example1_3
java -cp "../dependency-jars/*":"../*":. Example1_4
java -cp "../dependency-jars/*":"../*":. Example2
java -cp "../dependency-jars/*":"../*":. Example3
java -cp "../dependency-jars/*":"../*":. Example4
java -cp "../dependency-jars/*":"../*":. Example5
63. How to quickly run Scabi (continued)
1. Scabi Meta Server command line options
./start_meta.sh <No arguments> to use default settings
./start_meta.sh <MetaServer_Port> [debug]
./start_meta.sh <MetaServer_Port> <Database_HostName> <Database_Port> [debug]
2. Scabi Compute Server command line options
./start_compute.sh <ComputeServer_Port> <MetaServer_HostName>
<MetaServer_Port> [<NoOfThreads> [debug]]
Command line options
64. Scabi Performance Tuning
The following guidelines can help with performance tuning.
1. The Scabi Cluster can be scaled out horizontally by adding more compute hardware,
starting more Compute Servers, running Compute Servers with more threads per
Compute Server, or adding more Meta Servers, each with its own Cluster of
Compute Servers.
2. If the User has a limited number of compute hardware connected to the Cluster, the
User can start fewer Compute Servers, with more threads per Compute Server,
based on the memory size. Using JVM configuration, the thread stack size and the
minimum and maximum heap sizes can be set when starting the Compute Servers.
3. If the User has a high number of compute hardware connected to the Cluster, with
large memory sizes, then a high number of Compute Servers, with fewer threads
per Compute Server, can be started on each compute hardware.
4. Additional Meta Servers can be started and added to the Cluster, with each Meta
Server having its own cluster of Compute Servers.
65. How to quickly build Scabi from GitHub
1. Install Oracle Java 8 Java SE 1.8.0_66
2. Install Git
3. Install Maven
4. Create folder /home/<username>/scabi
5. cd to scabi folder
6. Run command
git clone https://www.github.com/dilshadmustafa/scabi.git
1. cd to DilshadDCS_Core folder in /home/<username>/scabi
2. Run command
mvn package
3. The file scabi_core.jar will be created
Initial Setup
Build Scabi Core scabi_core.jar
66. How to quickly build Scabi from GitHub (continued)
1. Copy scabi_core.jar file created in above step to folder DilshadDCS_MS
2. cd to DilshadDCS_MS folder in /home/<username>/scabi
3. Include scabi_core.jar in Maven java classpath before compiling with Maven
4. Run command
mvn package
5. The file scabi_meta.jar will be created
6. cd to target folder
7. Include scabi_core.jar in java classpath before running Meta Server
8. Run below command to run Meta Server with default settings (MongoDB should
be installed and started already)
java -jar scabi_meta.jar
Build Scabi Meta Server scabi_meta.jar
1. Copy scabi_core.jar file created in above step to folder DilshadDCS_CS
2. cd to DilshadDCS_CS folder in /home/<username>/scabi
3. Include scabi_core.jar in Maven java classpath before compiling with Maven
4. Run command
mvn package
5. The file scabi_compute.jar will be created
6. cd to target folder
7. Include scabi_core.jar in java classpath before running Compute Server
8. Run below command to run Compute Server, (Meta Server should be started
already)
java -jar scabi_compute.jar 5001 localhost 5000 1000
Build Scabi Compute Server scabi_compute.jar
67. InfiniBand Support
Scabi micro framework and its Cluster are written in pure Java. The micro framework, Meta
Servers, Compute Servers can be configured to use Sockets Direct Protocol (SDP) using
Java JVM configuration to enable it to run on InfiniBand or other RDMA networks.
The Java JVM settings can be configured to use IBM Java Sockets Over RDMA (JSOR),
RSockets to enable it to run on InfiniBand or other RDMA networks.
MongoDB does not natively support RDMA. It can be configured to run on IP Over
InfiniBand (IPoIB).
68. Scabi uses the following APIs/Libraries :-
1. MongoDB Driver API 3.2.1
2. MongoDB GridFS Driver API 3.2.1
3. RedHat JBoss RESTEasy framework 3.0.14.Final
4. RedHat JBoss JavaAssist API 3.20.0-GA
5. Jetty Web Server API 9.3.2.v20150730
6. Apache Http Client API 4.5.2
7. Apache Http Async Client API 4.1.1
8. Jackson Json API 2.7.4
9. BeanShell API 2.0b4
10. SLF4J Simple Logging API 1.7.16
APIs / Libraries used by Scabi
69. Scabi is currently at version v0.2 which is the initial version. It is developed and
tested in the following environment
1. Ubuntu 15.10 64-bit
2. Oracle Java 8 Java SE 1.8.0_66 64-bit
3. MongoDB 3.2.1 64-bit
4. Maven
It uses the following versions of APIs/Libraries:-
1. MongoDB Driver API 3.2.1
2. MongoDB GridFS Driver API 3.2.1
3. RedHat JBoss RESTEasy framework 3.0.14.Final
4. RedHat JBoss JavaAssist API 3.20.0-GA
5. Jetty Web Server API 9.3.2.v20150730
6. Apache Http Client API 4.5.2
7. Apache Http Async Client API 4.1.1
8. Jackson Json API 2.7.4
9. BeanShell API 2.0b4
10. SLF4J Simple Logging API 1.7.16
Scabi Test Environment
70. Please refer to the Java example code in the Download folder in the GitHub project
https://www.github.com/dilshadmustafa/scabi
Example 1
The examples use the DCompute class, which internally uses asynchronous non-blocking
network I/O and can submit a very large number of split jobs / Compute Units.
Complex and time-consuming computing examples: a program to check whether a specific
number is a Prime number. The program uses Scabi to automatically scale out horizontally
to thousands of Compute Servers running on hundreds of networked commodity machines.
Demonstrates executeClass(), executeObject(), executeJar(), executeCode() functionalities.
Example 1_2
(a) Example shows how to add additional Java libraries, jar files
(b) Example shows executeObject() method to submit a Compute Unit. The Compute Unit
will internally submit its own Compute Units / split jobs for execution in the Cluster
(c) Example shows how to add jar files and Java libraries to Compute Units submitted from
within a Compute Unit. CUs run inside Compute Servers, and jar file paths provided by the
User are not available inside Compute Servers. Use the addComputeUnitJars() method to
add all the jar files provided to this Compute Unit by the User.
Scabi Examples
Copyright (c) Dilshad Mustafa. All Rights Reserved.
71. Example 1_3
The examples use the DComputeSync class, which internally uses synchronous blocking
network I/O and can submit a large number of split jobs / Compute Units.
Complex and time-consuming computing examples: a program to check whether a specific
number is a Prime number. The program uses Scabi to automatically scale out horizontally
to thousands of Compute Servers running on hundreds of networked commodity machines.
Demonstrates executeClass(), executeObject(), executeJar(), executeCode() functionalities.
Example 1_4
(a) Example shows how to add additional Java libraries, jar files
(b) Example shows executeObject() method to submit a Compute Unit. The Compute Unit
will internally submit its own Compute Units / split jobs for execution in the Cluster
(c) Example shows how to add jar files and Java libraries to Compute Units submitted from
within a Compute Unit. CUs run inside Compute Servers, and jar file paths provided by the
User are not available inside Compute Servers. Use the addComputeUnitJars() method to
add all the jar files provided to this Compute Unit by the User.
Scabi Examples (continued)
72. Example 2 - Distributed Storage & Retrieval examples.
(a) put() operations for storing files into Scabi from local file system and from input streams
(b) copy() operations to copy files into another Scabi Namespace or to another file
(c) get() operations for retrieving files from Scabi to local file system or to output streams
Example 3 - Scabi Distributed Tables examples
Demonstrate CRUD operations.
(a) Create Table
(b) Check table exists
(c) Get existing table
(d) Insert data into table
(e) Update records in table
(f) Query data in table
(g) Directly embed MongoDB filters into queries
Example 4 - Scabi Distributed Tables examples (continued)
(a) Access underlying MongoCollection
(b) Map/Reduce example on the MongoCollection
Example 5 - Scabi Namespace Operations
(a) Check Namespace exists
(b) Create new Namespace
Scabi Examples (continued)
73. For any questions or clarifications, or if you would like to partner, participate or fund the
Scabi Project's development, please feel free to contact Dilshad Mustafa at
mdilshad2016@yahoo.com with a copy to mdilshad2016@rediffmail.com
74. Dilshad Mustafa is the creator and
programmer of the Scabi micro framework and
Cluster. He is also the author of the book “Tech
Job 9 to 9”. He is a Senior Software Architect
with 16+ years of experience in the Information
Technology industry, across various domains:
Banking, Retail, Materials & Supply Chain.
He completed his B.E. in Computer Science &
Engineering at Annamalai University, India,
and his M.Sc. in Communication & Network
Systems at Nanyang Technological University,
Singapore.
Dilshad Mustafa can be reached at
mdilshad2016@yahoo.com with a copy to
mdilshad2016@rediffmail.com