Scabi
A simple, lightweight framework for Cluster
Computing and Storage for BigData
processing in pure Java, created and
programmed by Dilshad Mustafa
Copyright © Dilshad Mustafa 2016. All Rights Reserved.
https://www.github.com/dilshadmustafa/scabi
TABLE OF CONTENTS
1. Scabi Overview
2. Scabi Data Driven Framework
3. Scabi Framework Constructs
4. Scabi Data Ring
5. Processing Huge Data Sets in Scabi
6. Single Hardware Vs Scabi Cluster
Performance
7. User Files in Scabifs
8. Map/Reduce In Scabi
9. BigData Processing In Cloud
10. Peta Scale In Cloud
11. How to quickly run Scabi
12. Example 1 – MapReduce, Median computing
examples
13. Example 2 – Scabifs examples
PART 1
TABLE OF CONTENTS
1. Scabi Compute Driven Framework Overview
2. Scabi Compute Driven Framework
3. Scabi Cluster
4. Scabi - Distributed Storage & Retrieval
5. Scabi Namespace
6. Submitting User Jobs, Programs in
Scabi
7. Single Hardware Vs Scabi Cluster
Performance
8. User Files in Scabi
9. Local Filesystem Vs Scabi Cluster
Performance
10. User Tables, Data in Scabi
11. Map/Reduce In Scabi
12. BigData Processing In Cloud
13. Peta Scale In Cloud
14. Scabi Namespace Operations
15. How to quickly run Scabi
16. Scabi Performance Tuning
17. How to quickly build Scabi from GitHub
18. InfiniBand Support
19. APIs / Libraries used by Scabi
20. Scabi Test Environment
21. Example 1 - Complex and time consuming
computing examples
22. Example 2 - Distributed Store & Retrieval
examples
23. Example 3 - CRUD examples
24. Example 4 – Map/Reduce example
25. Example 5 – Scabi Namespace examples
PART 2
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi is a simple, lightweight Cluster Computing & Storage micro framework for BigData processing, written purely in Java.
Scabi provides two frameworks for processing: (a) a Data-driven framework and (b) a Compute-driven framework. Both frameworks share the same underlying core. Part 1 of this presentation covers the Data-driven framework and Part 2 covers the Compute-driven framework.
(a) Data-driven framework
In the data-driven framework, Scabi processes partitions of huge datasets in parallel by loading these partitions into memory and executing User-defined operations on them (the partition data and its operations are together referred to as a Data Unit) in the Scabi Cluster.
The framework is highly fault tolerant and keeps the Data Units executing even when any number of systems in the Scabi Cluster fail at any time. Each Data Unit uses an in-memory, off-heap, unbounded-storage data structure and enables fast processing of huge data sets. This lets us run algorithms such as complex MapReduce operations, ensemble machine learning algorithms and iterative algorithms, and gives us the capability to process Petabytes to Exabytes+ across multiple datasets within minutes.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Data Units across the Scabi Cluster and executing them there. The Scabi Compute Services and Meta Services weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Services by networking commodity hardware.
Scabi Overview
(b) Compute-driven framework
In the compute-driven framework, Scabi processes User-defined computations, algorithms or jobs in parallel by splitting them into Compute Units and executing them in the Scabi Cluster.
The framework is highly fault tolerant and keeps the Compute Units executing even when any number of systems in the Scabi Cluster fail at any time. The framework takes care of the distributed computing and load balancing in the Scabi Cluster. This gives us the capability to perform complex and time-consuming computations by aggregating and combining the processing power of many individual systems.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Scabi Users' jobs and programs across the Scabi Cluster and executing them there. The Scabi Compute Servers and Meta Servers weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Servers by networking commodity hardware. This means Users do not need specialized computing hardware with thousands of CPUs or CPU cores, or special network hardware.
The Scabi framework provides a simple API to easily distribute storage and retrieval of User files and data by using Scabi Namespaces. The micro framework with the cluster provides high availability of User files and data by keeping versions of User files.
Scabi Overview
Copyright (c) Dilshad Mustafa. All Rights Reserved.
In the data-driven framework, Scabi processes partitions of huge datasets in parallel by loading these partitions into memory and executing User-defined operations on them (the partition data and its operations are together referred to as a Data Unit) in the Scabi Cluster.
The framework is highly fault tolerant and keeps the Data Units executing even when any number of systems in the Scabi Cluster fail at any time. Each Data Unit uses an in-memory, off-heap, unbounded-storage data structure and enables fast processing of huge data sets. This lets us run algorithms such as complex MapReduce operations, ensemble machine learning algorithms and iterative algorithms, and gives us the capability to process Petabytes to Exabytes+ across multiple datasets within minutes.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Data Units across the Scabi Cluster and executing them there. The Scabi Compute Services and Meta Services weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Services by networking commodity hardware.
Scabi Data Driven Framework
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Data Driven Framework (continued)
Figure 1.1: The Driver Code spreads Data Units (DU) across the Compute Services (#1 .. #n) running on each Compute Hardware (#1 .. #m). Each dataset (Data #1 .. Data #M) is stored in the Data Ring (Distributed Storage System) as Data Partitions (DP #1 .. #N). N is the total number of Data Units across all Compute Services and all Compute Hardware for a dataset; M is the total number of datasets; m, n and p are variable numbers.
Scabi Data Driven Framework (continued)
Figure 1.2: Structure of a Data Unit (DU). Each Data Unit holds Data Partitions (DP #1 .. #M), and each Data Partition is made up of Data Pages (DPE #1 .. #k) kept in memory (in-memory, off-heap), in a local cache (memory-mapped local files) and in unbounded storage backed by the Data Ring (Distributed Storage System). The most recently used (MRU) Data Pages are kept locally; the Data Page size is 64 MB (configurable) and the Time To Live (TTL) is 1000 ms (configurable). p and k are variable numbers; M is the total number of datasets.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Framework Constructs
Scabi's data-driven framework comprises four core constructs: Data, Data Unit, Data Partition and Data Ring.
Data
Data is the construct that orchestrates a data cluster of Data Units across the Compute Services in the Scabi Cluster. It handles all of the User's datasets. The User can give each dataset a string identifier, referred to as its Data Id.
Data Unit
Data Unit is the construct that represents a data partition from each of the User's datasets along with their User-defined set of operations. Data Units are executed in parallel in the Compute Services.
Data Partition
Data Partition is an in-memory, off-heap, unbounded-storage data structure that uses memory, a local cache and the distributed storage system (the Data Ring) for storing a portion of a dataset. A Data Partition has unbounded storage because its storage is not limited to any particular system's hard disk or storage; the storage is provided by the Data Ring. A Data Partition maintains a Most Recently Used (MRU) set of Data Pages locally (essentially 64 MB page files with a Time To Live (TTL) of 1000 ms), which are memory-mapped files, in-memory and off-heap, enabling faster in-memory processing with a small memory footprint.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Framework Constructs
Data Unit class
The Data Unit class is used to load data into each Data Unit and to tell the framework how many Data Units are to be created.
Data Ring
Data Ring is a distributed storage system that holds all the partition data of all the User’s
datasets.
This can be:
(a) A network or distributed file system (POSIX or non-POSIX; it need not be fully POSIX compliant). Examples: NFS, IBM GPFS, Oracle LustreFS
(b) A FUSE-mounted distributed file system. Examples: RedHat GlusterFS, Scality, Google GFS2, Apache HDFS, MapR FS, Ceph FS, IBM Cleversafe
(c) A non-file-system store. Example: Seaweed file system
(d) An S3 or Object Storage system. Examples: Minio, Cloudian, Riak
(e) Any other storage system with an HTTP, REST, S3 or any custom interface. Support for any storage system can be added by implementing the interface IStorageHandler.java
Copyright (c) Dilshad Mustafa. All Rights Reserved.
The Scabi micro framework with the Scabi Cluster enables high performance computing by spreading the Scabi Users' jobs and programs across the Scabi Cluster and executing them there. The Scabi Compute Servers and Meta Servers weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Servers by networking commodity hardware. This means Users do not need specialized computing hardware with thousands of CPUs or CPU cores, or special network hardware.
The Scabi framework provides a simple API to easily distribute storage and retrieval of User files and data by using Scabi Namespaces. The micro framework with the cluster provides high availability of User files and data by keeping versions of User files.
Scabi Compute Driven Framework Overview
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi provides a single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh).
User files of various sizes can be stored and retrieved using the Scabi Namespaces, just like in a shared or cluster file system.
Massively parallel Map/Reduce, Aggregations and Geospatial queries can be performed on the User's tables in various databases without actually moving the data across the network.
Programs running in the User's Client system as well as those running in the Scabi Cluster can access the User's files and tables in the distributed databases through the Scabi Namespace URL: scabi:<namespace>:<resource name>. Using Scabi URLs also eliminates the need to pass huge volumes of data around in the network and avoids saturating the network bandwidth.
Scabi Compute Driven Framework Overview (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
In Scabi, each User job or program is sliced into multiple split jobs. Each split job is known as a Compute Unit (CU). The total number of Compute Units is specified by the Scabi User. Each Compute Unit is executed separately in any of the Compute Servers available in the Scabi Cluster, and each Compute Server can execute multiple CUs concurrently.
There can be multiple Compute Servers running on the same as well as on different hardware. All Compute Servers are connected to a Meta Server, and all Meta Servers are connected to each other, forming a Scabi Cluster. The Scabi Cluster can be easily scaled out horizontally by adding more Compute Hardware, starting more Compute Servers, running Compute Servers with more threads per Compute Server, and adding Meta Servers, each with its own cluster of Compute Servers. Meta Servers are added by starting a new Meta Server and pointing it to an existing Meta Server, forming a mega cluster.
Scabi Compute Driven Framework
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Cluster
Figure shows a Scabi Cluster with m Compute Hardware, each running n Compute Servers, each executing p Compute Units (CU), all connected to one Meta Server.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Cluster (cont'd)
Figure shows a Scabi Cluster with multiple Compute Hardware (CH) running multiple Compute Servers (CS), each running multiple Compute Units (CU). The cluster scales out horizontally by adding more Compute Hardware, starting more Compute Servers and adding more Meta Servers.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi provides storage and retrieval for various types of User data: files, tables, unstructured
document data (Collections), properties and Java files (.class, .jar, .bsh).
Scabi maintains two versions of each User file at any time: the current version and the immediate previous version of each file will always be available in the system. After each completed file upload operation, the uploaded file is marked as the latest version, the last version (based on server timestamp) that already existed in the system prior to the upload is marked as the immediate previous version, and all other versions are removed from the system.
Scabi relies on MongoDB's Replica Sets to provide high availability for the User's data through MongoDB's replicating / secondary servers. The replication process provided by MongoDB is transparent to Scabi users and is enabled by configuring MongoDB directly.
To provide load balancing for the User's file and database operations and to scale out horizontally, Scabi relies on MongoDB's Sharding to provide high-performance access to the User's data. Scabi users can configure the MongoDB database directly to use Sharding.
Scabi - Distributed Storage & Retrieval
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi provides a single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh).
Each Scabi Namespace for User files, App-specific tables, unstructured document data, properties and Java files corresponds to a MongoDB database, as configured by the Scabi user while registering the namespace in the Meta Server. Scabi Namespaces can be registered to use the same or different MongoDB databases, which can be distributed and located anywhere on a network accessible by the Scabi Cluster and the User's Client system.
Programs running in the User's Client system as well as those running in the Scabi Cluster can access the User's resources stored in the distributed databases through the Scabi Namespace URL: scabi:<namespace>:<resource name>
Users as well as programs running in the Scabi Cluster can perform various operations, viz. registering a new namespace and read / write operations for the various types of User data: User files, tables, unstructured document data, properties and Java files (.class, .jar, .bsh).
Scabi Namespace
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Namespace (continued)
Figure shows one scenario: a Scabi Client writes User files and table data to DB-2 and DB-n and submits split jobs / Compute Units to various Compute Servers for execution. The CUs then process the User files and table data by accessing DB-2 and DB-n, writing results back to the databases or returning results to the Scabi Client. The Client either receives the results directly from the Compute Units or reads the results from User files and table data in DB-2 and DB-n. The Scabi Client and Compute Servers resolve the namespace URL scabi:<namespace>:<resource name> into a specific DB by contacting the Meta Server.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi
User programs use the Scabi Client API to split jobs or programs into Compute Units. Users extend the DComputeUnit class and implement the compute() method. Users can then use the DCompute class to submit the Compute Units to the Scabi Cluster for execution in the Compute Servers.
1. Example (a) for Splitting a User’s Job
Let's take a Prime number example. To check if a given number N is Prime, we need to divide N by all the previous Prime numbers, or by 2 and all odd numbers up to the square root of N, to check whether N is divisible. If N contains millions of digits, this becomes a time-consuming computation even for a single PC or a machine with thousands of cores or CPUs.
To give an idea for comparison, Java's long has a maximum of 19 digits and double has a maximum of 308 digits.
Job: check whether N is divisible by 2 or by any odd number up to the square root of N: 2, 3, 5, ..., sqrt(N)
To split this job into Compute Units, extend the DComputeUnit class and implement the compute() method. The User specifies the number of times the job has to be split, e.g. 100,000; then 100,000 DComputeUnit objects will be executed in various Compute Servers in the Scabi Cluster.
Submitting User jobs or programs for execution in the Scabi Cluster involves the following steps:
1. Split the User's job or program
2. Extend the DComputeUnit class and implement the compute() method
3. Use the DCompute class to submit the DComputeUnit class for execution in the Scabi Cluster
4. Retrieve the execution results
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
1. Example (a) for Splitting a User’s Job (continued)
Each DComputeUnit object checks the divisibility of N only over its own range of numbers: CU #1 (getCU() = 1) covers 2, 3, 5, ..., P1; CU #2 covers P1+2, P1+4, ..., P2; and so on up to CU #100,000, which covers P99999+2, P99999+4, ..., P100000. getTU() gives the Total Units, and Pi denotes the last number checked by CU #i.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
1. Example (b) for Splitting a User’s Job
A meteorological department's data contains geographical temperature variations from 1980 to 2015, automatically recorded by instrumentation devices every hour. The department needs the mean average of temperature variations on a per-month basis for its research purposes. The calculation becomes complex because it wants to apply a complex statistical formula. If the data contains millions of records to be processed, this becomes a time-consuming computation even for a single PC or a machine with thousands of cores or CPUs.
Job: calculate the mean average of temperature variations per month from 1980 to 2015 and apply the department's statistical formula.
To split this job into Compute Units, extend the DComputeUnit class and implement the compute() method. The User specifies the number of times the job has to be split, e.g. 36 to split by year; then 36 DComputeUnit objects will be executed in various Compute Servers in the Scabi Cluster.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
1. Example (b) for Splitting a User’s Job (continued)
Each DComputeUnit object covers the 12 months (1 .. 12) of one year: CU #1 (getCU() = 1) covers the year 1980, CU #2 covers 1980+(2-1) = 1981, and so on up to CU #36, which covers 1980+(36-1) = 2015.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
2. Extend the DComputeUnit Class and implement the compute() method
Example (a) MyFirstUnit Class
public class MyFirstUnit extends DComputeUnit {
    public String compute(DComputeContext context) {
        int totalUnits = context.getTU();
        int thisUnit = context.getCU();
        String result = "Hello from this unit CU #" + thisUnit;
        return result;
    }
}
The code above creates the class MyFirstUnit by extending DComputeUnit and implementing the compute() method. The context object is passed by the Scabi framework to each Compute Unit object running in the Compute Servers.
getTU() gives the total number of Compute Units (split jobs) as specified by the User, and getCU() gives the Compute Unit number of this particular Compute Unit object running in the Compute Server.
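As an illustration, the hedged sketch below (not from the deck) shows one way a unit can use getTU() and getCU() to carve out its own slice of a larger range, following the same (thisUnit - 1) pattern used later in Example (b); the totalWork value is an arbitrary assumption.

public class MyRangeUnit extends DComputeUnit {
    public String compute(DComputeContext context) {
        int totalUnits = context.getTU();       // total number of splits chosen by the User
        int thisUnit = context.getCU();         // 1-based number of this Compute Unit
        long totalWork = 1_000_000L;            // assumption: some overall range to divide
        long chunk = totalWork / totalUnits;
        long start = (thisUnit - 1) * chunk;    // this unit's slice of the range
        long end = (thisUnit == totalUnits) ? totalWork : thisUnit * chunk;
        return "CU #" + thisUnit + " covers [" + start + ", " + end + ")";
    }
}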
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
2. Extend the DComputeUnit Class and implement the compute() method (continued)
Example (b) MyPrimeCheckUnit Class
public class MyPrimeCheckUnit extends DComputeUnit {
public String compute(DComputeContext context) {
int totalUnits = context.getTU();
int thisUnit = context.getCU();
BigInteger number = new BigInteger(context.getInput().getString("NumberToCheck"));
// check if number is even number.
// If number >2 and divisible by 2, then number is not Prime, return false immediately
…
// Obtain square root of number
BigInteger sqrtof = sqrt(number);
// calculate chunkSize = sqrt(number) / totalUnits
BigInteger chunkSize = …;
// calculate starting number for division, start = (thisUnit – 1) * chunkSize + 1
// make start as odd number > 1 if not already
BigInteger start = …;
// calculate ending number for division, end = thisUnit * chunkSize
// make end as odd number if not already
BigInteger end = …;
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
2. Extend the DComputeUnit Class and implement the compute() method (continued)
Example (b) MyPrimeCheckUnit Class (continued)
// check if number is divisible by numbers from start to end
for (…) {
…
}
String result = …;
return result;
}
}
The above code is abbreviated to focus on explaining the concept and to save space. The abbreviated code is provided with comments and is self-explanatory.
We first calculate the chunk size, which is square root(N) / getTU() (Total Units).
We then calculate the start and end numbers to be used for the division check:
start = (thisUnit - 1) * chunk size + 1
end = thisUnit * chunk size
For more details, please refer to the Java code.
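The arithmetic above can be made concrete as a small, self-contained helper. This is a hedged sketch, not the deck's actual code: it uses BigInteger.sqrt(), which exists only from Java 9 onwards, whereas the deck's abbreviated code calls an unspecified sqrt(...) helper.

import java.math.BigInteger;

public final class PrimeChunkMath {
    // Returns true if 'number' has an odd divisor in this unit's chunk of [3, sqrt(number)].
    public static boolean hasDivisorInChunk(BigInteger number, int thisUnit, int totalUnits) {
        BigInteger two = BigInteger.valueOf(2);
        BigInteger sqrtOf = number.sqrt();                                   // trial divisors only go up to sqrt(N)
        BigInteger chunkSize = sqrtOf.divide(BigInteger.valueOf(totalUnits));
        BigInteger start = BigInteger.valueOf(thisUnit - 1L).multiply(chunkSize).add(BigInteger.ONE);
        if (!start.testBit(0)) start = start.add(BigInteger.ONE);            // make start odd
        if (start.compareTo(BigInteger.valueOf(3)) < 0) start = BigInteger.valueOf(3);
        BigInteger end = BigInteger.valueOf((long) thisUnit).multiply(chunkSize);
        for (BigInteger d = start; d.compareTo(end) <= 0; d = d.add(two)) {
            if (number.mod(d).signum() == 0) return true;                    // found a divisor, N is not prime
        }
        return false;
    }
}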
Copyright (c) Dilshad Mustafa. All Rights Reserved.
After the DComputeUnit class or object is created, Users can use the DCompute class to submit the Compute Units to the Scabi Cluster for execution in the Compute Servers.
The DCompute class uses asynchronous, non-blocking network I/O to submit the Compute Units to the Compute Servers for execution. It determines the optimal number of threads the User's Client system can handle, based on the Client system's memory and number of CPUs, as well as the number of threads sufficient to submit all the Compute Units. This class can be used to submit a very large number of Compute Units / split jobs to the Scabi Cluster for execution.
Users can also explicitly specify the number of threads to be created by using the maxThreads() method. The performance of executing the Compute Units in the Scabi Cluster is limited mostly by the number of Compute Hardware and Compute Servers available in the Scabi Cluster.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class
3. Submitting Compute Units for execution in the Cluster
Copyright (c) Dilshad Mustafa. All Rights Reserved.
To give a theoretical example, the following code submits 1 billion Compute Units or split jobs to check whether the input number is a Prime number. The number of Compute Hardware and Compute Servers running in the Scabi Cluster is the limiting factor in the performance of execution of the Compute Units.
The code below shows four different ways to do it:
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
The MyPrimeCheckUnit class extends the DComputeUnit class and is explained in the earlier slide, section "Extend the DComputeUnit Class and implement the compute() method", Example (b).
json is a Dson object containing the input number to check for Prime. It can contain
potentially millions of digits. To give an idea for comparison, Java’s long has a maximum
of 19 digits and double has a maximum of 308 digits.
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
DCompute c = new DCompute(meta);
c.executeClass(MyPrimeCheckUnit.class).split(1000000000).input(json).output(map).perform();
c.finish();
Method 1 – Submitting Compute Units with Class
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
DCompute c = new DCompute(meta);
c.addJar("MyPrimeCheckUnit.jar"); // Add all Java libraries / jar files like this
c.executeCode("import MyPrimeCheckUnit;" +
    "cu = new MyPrimeCheckUnit();" +
    "return cu.compute(context);");
c.split(1000000000).input(json).output(map).perform();
c.finish();
Method 3 – Submitting Compute Units with Java Source Code
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
MyPrimeCheckUnit cu = new MyPrimeCheckUnit();
DCompute c = new DCompute(meta);
c.executeObject(cu).split(1000000000).input(json).output(map).perform();
c.finish();
Method 2 – Submitting Compute Units with object reference
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
The Compute Units submitted through the executeClass(), executeObject(), executeCode() and executeJar() methods are all executed in the Compute Servers available in the Cluster.
The DCompute class provides various methods to do the following:
(1) Specify the number of times the job or program has to be split into Compute Units
(2) Specify the range, i.e. which set of Compute Units is to be submitted to the Scabi Cluster
(3) Supply input data to the Compute Units
(4) Specify where to store the output results of each Compute Unit after execution in the Scabi Cluster
(5) Explicitly specify and override the maximum number of threads to be created to submit the Compute Units to the Scabi Cluster
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
DCompute c = new DCompute(meta);
c.executeJar("MyPrimeCheckUnit.jar", "MyPrimeCheckUnit");
c.split(1000000000).input(json).output(map).perform();
c.finish();
Method 4 – Submitting Compute Units with Jar file
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
As the Compute Units submitted through the executeClass(), executeObject(), executeCode() and executeJar() methods are all executed in the Compute Servers available in the Cluster, the Compute Servers will not have the User's jar files for the classes used by the DComputeUnit object (cu in the accompanying example).
Use the addJar() method to add all the supporting jar files.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        MyPrimeCheckUnit pcu = new MyPrimeCheckUnit();
        return pcu.compute(context);
    }
};
DCompute c = new DCompute(meta);
// Add all Java libraries, jar files like this
c.addJar("MyPrimeCheckUnit.jar");
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();
Adding jar files, Java libraries to the Compute Units
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
The next example uses the executeObject() method to submit a Compute Unit that internally submits its own Compute Units / split jobs for execution in the Cluster.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        DCompute c2 = new DCompute(meta2);
        c2.executeClass(MyPrimeCheckUnit.class);
        c2.input(context.getInput()).split(1).output(out2);
        c2.perform();
        c2.finish();
        … … …
    }
};
DCompute c = new DCompute(meta);
// Add all Java libraries, jar files like this
c.addJar("MyPrimeCheckUnit.jar");
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();
Submitting Compute Units from within a Compute Unit
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
The CUs submitted through this CU run in any of the Compute Servers, and the jar file paths of the User's Client system will not be valid inside a Compute Server. Users can use the addComputeUnitJars() method to add all the jars that the User added to this CU (cu in the accompanying code) earlier when submitting it.
Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        DCompute c2 = new DCompute(meta2);
        c2.addComputeUnitJars();
        c2.executeCode("import MyPrimeCheckUnit;" +
            "cu = new MyPrimeCheckUnit();" +
            "return cu.compute(context);");
        c2.input(context.getInput()).split(1).output(out2);
        c2.perform();
        c2.finish();
        return "Submitted from within CU #" + context.getCU();  // compute() must return a String result
    }
};
DCompute c = new DCompute(meta);
c.addJar("MyPrimeCheckUnit.jar"); // Add all Java libraries / jar files like this
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();
Providing User’s jar files to Compute Units submitted from within a Compute Unit
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Submitting User Jobs, Programs in Scabi (continued)
Compute Units are just like any other Java program. We need to include the jar files in the program to start using the APIs of specific storage systems (like Amazon S3, Google Cloud Storage), other file systems, databases (JDBC, Oracle, DB2, Cassandra, CouchDB, Redis, etc.), other third-party Java libraries, Java Machine Learning libraries (J48/C4.5/C5, JavaBayes, Weka, etc.) and other external Java libraries.
To access these systems or other Java libraries from inside a Compute Unit, add the jar files using the .addJar() method before submitting the Compute Units for execution using the .perform() method of the DCompute class.
In each DCompute object, a maximum of Java Long.MAX_VALUE (2^63 - 1, or 9,223,372,036,854,775,807) splits can be created. By creating additional DCompute objects and by passing additional parameter values in the json input (passed to the .input() method of DCompute), such as "TotalDComputeObjects" and "ThisDComputeObject", a theoretically unlimited number of splits can be created and handled in the system, limited only by the hardware and memory used in the system.
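A hedged sketch of that idea follows (the key names come from the text above; the outer loop, the count of 4 and the reuse of 'meta' and 'map' from the earlier examples are illustrative assumptions):

int totalDComputeObjects = 4;
for (int i = 1; i <= totalDComputeObjects; i++) {
    Dson json = new Dson();
    json.add("NumberToCheck", "712430483480234234234241232143223447");
    json.add("TotalDComputeObjects", String.valueOf(totalDComputeObjects));  // tells each CU which outer slice it belongs to
    json.add("ThisDComputeObject", String.valueOf(i));
    DCompute c = new DCompute(meta);
    c.executeClass(MyPrimeCheckUnit.class).split(1000000000).input(json).output(map).perform();
    c.finish();
}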
Copyright (c) Dilshad Mustafa. All Rights Reserved.
DCompute Class (continued)
3. Submitting Compute Units for execution in the Cluster (continued)
After the finish() method is called on the DCompute object, the results will be available in the HashMap object supplied to the .output() method.
The map will contain the split number / Compute Unit number as the key, with the value being the result returned by each Compute Unit after execution in a Compute Server in the Cluster.
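A hedged sketch of reading the results (the exact generic types of the output map are not stated in the deck, so plain Object keys and values are assumed; 'meta' and 'json' are as in the earlier examples):

// needs java.util.HashMap and java.util.Map
HashMap<Object, Object> map = new HashMap<>();
DCompute c = new DCompute(meta);
c.executeClass(MyPrimeCheckUnit.class).split(100).input(json).output(map).perform();
c.finish();                                              // results are available in 'map' after finish()
for (Map.Entry<Object, Object> e : map.entrySet()) {
    System.out.println("CU #" + e.getKey() + " -> " + e.getValue());   // key = split / CU number, value = result
}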
Submitting User Jobs, Programs in Scabi (continued)
4. Retrieving Results After Execution
In the DCompute class, the Compute Units / split jobs are submitted for execution in parallel in the available Compute Servers in the Scabi Cluster.
The framework will try to execute the Compute Units in as many Compute Servers as the total number of splits specified by the User; this is limited only by the number of Compute Servers running in the Scabi Cluster.
Users can use the maxRetry() method to specify the maximum number of retries attempted in case of a network communication error with a Compute Server. In those cases, submission of the Compute Units is retried on other working Compute Servers in the Cluster.
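A minimal sketch of these overrides (only the method names maxThreads() and maxRetry() come from this deck; the integer arguments and exact call placement are assumptions):

DCompute c = new DCompute(meta);
c.maxThreads(64);   // explicitly cap the submission threads created on the Client side
c.maxRetry(3);      // on a network error, retry a Compute Unit on other working Compute Servers up to 3 times
c.executeClass(MyPrimeCheckUnit.class).split(1000000000).input(json).output(map).perform();
c.finish();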
3. Submitting Compute Units for execution in the Cluster (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Single Hardware Vs Scabi Cluster Performance
[Graph: execution time vs. number of threads / processes on a single hardware]
In a single-hardware scenario, the User has a system with some number N of CPUs/cores. Users get the maximum performance by running their program with the maximum number of threads or in multiple processes. If the User's system has 16 GB of memory and each thread consumes 1 MB of stack memory, the maximum number of threads that can be created on the User's client system is approximately 16,384. If the User's system has 4 CPUs, the system will then spend most of its time in thread context switching.
Single Hardware
The above graph shows that the maximum performance (minimum time in the graph) is achieved only at a certain number of threads. As the number of threads increases, the CPUs spend most of their time in thread context switching, and a high thread count actually becomes counter-productive to the performance of the system: the graph shows the performance dropping for very high numbers of threads.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Single Hardware Vs Scabi Cluster Performance
[Graph: execution time vs. m Compute Hardware × n Compute Servers]
Scabi Cluster Performance
In the previous single-hardware scenario, the User's system has 16 GB of memory and, assuming each thread consumes 1 MB of stack memory, the maximum number of threads that can be created on the User's client system is approximately 16,384.
Using a Scabi Cluster, the User can run his programs on multiple such systems. For example, if the User has m such systems, his programs can effectively run on m * 16,384 threads, allowing the User to scale out horizontally by adding more compute hardware, starting more compute servers, running compute servers with more threads per compute server, or adding more Meta Servers, each with its own cluster of Compute Servers.
The Compute Units submitted through the DCompute class run concurrently in the Compute Servers in the Scabi Cluster, as well as concurrently in multiple different threads within each Compute Server.
User file operations in Scabi are carried out using the DFile class. The DFile class provides methods to set the Namespace to access, a put() method to store files in Scabi and a get() method to retrieve files from Scabi.
Scabi maintains two versions of each User file at any time: the current version and the immediate previous version of each file will always be available in the system. After each completed file upload operation, the uploaded file is marked as the latest version, the last version (based on server timestamp) that already existed in the system prior to the upload is marked as the immediate previous version, and all other versions are removed from the system.
Scabi provides a single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh).
Each Scabi Namespace for User files, User App-specific tables, unstructured document data, properties and Java files corresponds to a MongoDB database, as configured by the Scabi user while registering the namespace in the Meta Server. Scabi Namespaces can be registered to use the same or different MongoDB databases, which can be distributed and located anywhere on a network accessible by the Scabi Cluster and the User's Client system.
User Files In Scabi (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Files In Scabi
Scabi connects to MongoDB databases for the various User file and User database operations. Each Scabi Namespace may correspond to a separate MongoDB database. Scabi's Meta Server also connects to MongoDB to read/write metadata about the Compute Servers as well as the Scabi Namespaces.
Scabi relies on MongoDB's Replica Sets to provide high availability for the User's data through MongoDB's replication / secondary servers. To provide load balancing for the User's file and database operations and to scale out horizontally, Scabi relies on MongoDB's Sharding.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
The following Scabi Namespace URLs refer to different files:
scabi:MyOrg.MyFiles:myfile.txt
scabi:MyOrg.MyFiles:/myfile.txt
scabi:MyOrg.MyFiles:./myfile.txt
scabi:MyOrg.MyFiles:/home/dilshad/myfile.txt
scabi:MyOrg.MyFiles:myorg.mydept.myfile.txt
scabi:MyOrg.MyFiles:myorg-mydept-myfile.txt
scabi:MyOrg.MyFiles:/usr/myfile.txt
scabi:MyOrg.Bangalore:myfile.txt
scabi:MyOrg.California:myfile.txt
After files are stored in Scabi using the put() method, the Scabi Namespace URL can be passed around between the Scabi Client system and the Compute Units to access input files for processing by the Compute Units. Compute Units can also write results to files in Scabi and return the Scabi Namespace URL of the file in the result if the result contains a huge volume of data.
Programs running in the User's Client system as well as those running in the Scabi Cluster can access the User's files stored in the distributed databases through the Scabi Namespace URL: scabi:<namespace>:<file name>
Users as well as programs running in the Scabi Cluster can perform various operations, viz. registering a new namespace and read / write operations using the files.
User Files In Scabi (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Figure shows a Client using the Scabi micro framework to do file operations in Scabi. The framework resolves namespaces by contacting the Meta Server, and DB-1 stores the latest version and the immediate previous version of each file.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Files In Scabi (continued)
Figure shows the flow of file operations between the Scabi Client and the Compute Units running in the Compute Servers: 1) the Client writes scabi:MyOrg.MyFiles:input.txt, 2) the Compute Units read scabi:MyOrg.MyFiles:input.txt, 3) the Compute Units write scabi:MyOrg.MyFiles:results.txt, and 4) the Client reads scabi:MyOrg.MyFiles:results.txt. The framework resolves the namespaces by contacting the Meta Server, and DB-1 stores the latest and the immediate previous version of each file.
Both the Scabi Client and the Compute Units running in the Compute Servers can read and write files in Scabi. After a file is stored in Scabi, only the Scabi Namespace URL of the file needs to be conveyed, instead of transferring the actual contents of the file between the Scabi Client and the Compute Units.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Files In Scabi (continued)
The DFile class is used to do all the file operations in Scabi. The Scabi micro framework resolves the namespaces in the Scabi Namespace URLs by contacting the Meta Server.
The DFile class internally uses a streaming mechanism to read and write files into Scabi, which means the entire file contents are not loaded into memory. For example, if the User's Client system has 2 GB of memory and a 6 GB file is written using the DFile class, the contents of the file are transferred through a 64 MB internal buffer; the entire 6 GB of file data is never loaded into memory. The DFile class is lightweight and does not add any overhead when transferring data. As an example, on a low-end system with 2 GB of RAM, a 6 GB file is transferred within 4 minutes.
The Scabi framework maintains two versions of each file: the latest version and the previous version. This ensures that, when transferring very large files, if there is a network error the system will still contain the previous version and the corrupted file will be discarded.
High availability of data is achieved by enabling MongoDB Replication of the primary and secondary servers of the underlying MongoDB server instances that correspond to each Scabi Namespace.
High loads can be handled by using MongoDB Sharding of the underlying MongoDB server instances that correspond to each Scabi Namespace.
User Files In Scabi (continued)
DFile Class
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Files In Scabi (continued)
DFile Class (continued)
DFile f = new DFile(meta);
f.put("scabi:MyOrg.MyFiles:myfile1.txt", “myfile1.txt");
FileInputStream fis = new FileInputStream(“myfile3.txt”);
f.put("scabi:MyOrg.MyFiles:myfile3.txt", fis);
fis.close();
To put a file from local file system into Scabi:
To put a file from an input stream into Scabi:
The following code examples demonstrate various file operations using DFile class:
f.copy("scabi:MyOrg.MyFiles:myfile4.txt", "scabi:MyOrg.MyFiles:myfile1.txt");
To copy a file already in Scabi into another Scabi Namespace or to another file:
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Files In Scabi (continued)
DFile Class (continued)
f.get("scabi:MyOrg.MyFiles:myfile1.txt", “fileout1.txt”);
FileOutputStream fos = new FileOutputStream(“fileout3.txt”);
f.get("scabi:MyOrg.MyFiles:myfile1.txt", fos);
fos.close();
To get a file already in Scabi to local file system:
To get a file already in Scabi to an output stream:
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Local Filesystem Vs Scabi Cluster Performance
[Graphs: read / write time vs. number of bytes on the local file system, compared with read / write time vs. number of bytes using m Scabi Namespaces × n MongoDB instances in a Scabi Cluster]
User Tables In Scabi
User table operations in Scabi are carried out using the Dao, DTable, DObject and DResultSet classes.
Dao class - provides methods to set the Namespace to access and a getTable() method to get a DTable.
DTable class - provides methods for select, insert, update and delete operations; can directly embed MongoDB queries to query or filter data; and can be used to get the underlying Mongo Collection, do Map/Reduce, use the Aggregation Framework (Aggregation Pipeline) and use Geospatial queries.
DResultSet class - contains the results returned by DTable methods.
The following Scabi Namespace URLs refer to different tables:
scabi:MyOrg.MyTables:mytable
scabi:MyOrg.MyTables:/mytable
scabi:MyOrg.MyTables:./mytable
scabi:MyOrg.MyTables:/home/dilshad/mytable
scabi:MyOrg.MyTables:myorg.mydept.mytable
scabi:MyOrg.MyTables:myorg-mydept-mytable
scabi:MyOrg.MyTables:/usr/mytable
scabi:MyOrg.Bangalore.Tables:mytable
scabi:MyOrg.California.Tables:mytable
Copyright (c) Dilshad Mustafa. All Rights Reserved.
After data is stored in Scabi using the DTable methods, the Scabi Namespace URL can be passed around between the Scabi Client system and the Compute Units to access the data for processing by the Compute Units. Compute Units can also write results to tables in Scabi and return the Scabi Namespace URL of the table in the result if the result contains a huge volume of data.
Programs running in the User's Client system as well as those running in the Scabi Cluster can access the User's tables stored in the distributed databases through the Scabi Namespace URL: scabi:<namespace>:<table name>
Users as well as programs running in the Scabi Cluster can perform various operations, viz. registering a new namespace and read / write operations using the tables.
User Tables In Scabi (continued)
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Figure shows a Client using the Scabi micro framework to do database operations in Scabi. Table data is stored in DB-1, and namespaces are resolved by the framework by contacting the Meta Server.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Tables In Scabi (continued)
Figure shows the flow of table operations between the Scabi Client and the Compute Units running in the Compute Servers: 1) the Client writes scabi:MyOrg.MyTables:emp_table, 2) the Compute Units read scabi:MyOrg.MyTables:emp_table, 3) the Compute Units write scabi:MyOrg.MyTables:emp_results, and 4) the Client reads scabi:MyOrg.MyTables:emp_results.
Both the Scabi Client and the Compute Units running in the Compute Servers can perform select, insert, update and delete operations on tables in Scabi. After table data is stored in Scabi, only the Scabi Namespace URL of the table needs to be conveyed, instead of transferring the actual contents of the table between the Scabi Client and the Compute Units.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Tables In Scabi (continued)
DTable table = dao.createTable("scabi:MyOrg.MyTables:Table1");
DDocument d = new DDocument();
d.append("EmployeeName", "Karthik").append("EmployeeNumber", "3000");
d.append("Age", 40);
table.insert(d);
DDocument d2 = new DDocument();
d2.put("Age", 45);
DDocument updateObj = new DDocument();
updateObj.put("$set", d2);
table.update(eq("EmployeeName", "Balaji"), updateObj);
To create a table in Scabi:
To insert data into a table in Scabi:
To update records in a table in Scabi:
The following code examples demonstrate various database operations using Dao class:
Dao Class (continued)
DTable table = dao.getTable("scabi:MyOrg.MyTables:Table1");
To get existing table in Scabi:
Copyright (c) Dilshad Mustafa. All Rights Reserved.
User Tables In Scabi (continued)
DResultSet result = table.find(or(eq("EmployeeNumber", "3003"), lt("Age", 40)));
while (result.hasNext()) {
DDocument d3 = result.next();
… … …
}
To query data in a table in Scabi:
Dao Class (continued)
MongoCollection<Document> c = table.getCollection();
To access underlying MongoCollection from DTable:
String map = "function() { for (var key in this) { emit(key, null); } }";
String reduce = "function(key, s) { if ("Age" == key) return true; else return false; }";
MapReduceIterable<Document> out = c.mapReduce(map, reduce);
for (Document o : out) {
System.out.println("Key name is : " + o.get("_id").toString());
System.out.println(o.toString());
}
Map/Reduce example directly on the MongoCollection:
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Map/Reduce In Scabi
MongoCollection<Document> c = table.getCollection();
To access underlying MongoCollection from DTable:
String map = "function() { for (var key in this) { emit(key, null); } }";
String reduce = "function(key, s) { if ("Age" == key) return true; else return false; }";
MapReduceIterable<Document> out = c.mapReduce(map, reduce);
for (Document o : out) {
System.out.println("Key name is : " + o.get("_id").toString());
System.out.println(o.toString());
}
Map/Reduce example directly on the MongoCollection:
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Map/Reduce functions can be directly executed on the Mongo Collection natively.
Map/Reduce In Scabi (continued)
MongoDB’s Map/Reduce functionality can be directly used/invoked from within each Compute
Unit. The following optimizations can be performed on Map/Reduce with MongoDB:
1) Map/Reduce optimized with sort on indexed fields
2) Incremental Map/Reduce (Map/Reduce with a query filter to read only newer records; see the sketch after this list)
3) Concurrent Map/Reduce (each Map/Reduce over specific ranges)
4) Map/Reduce over sharded collection
MongoDB’s Aggregation Framework functionality (Aggregation Pipeline) can be directly
used/invoked from within each Compute Unit. The following optimizations can be performed
on aggregate() with MongoDB:
1) Concurrently doing aggregate() (each aggregate() over specific split keys)
2) aggregate() over sharded collection
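As an example of optimization (2) above, the following is a hedged sketch of an incremental Map/Reduce using the MongoDB Java driver's filter() and sort() options; the 'lastModified' field name and the watermark handling are assumptions, and mapFn / reduceFn are JavaScript strings as in the earlier example.

import static com.mongodb.client.model.Filters.gt;
import java.util.Date;
import org.bson.Document;
import com.mongodb.client.MapReduceIterable;
import com.mongodb.client.MongoCollection;

public final class IncrementalMapReduce {
    public static void run(MongoCollection<Document> c, String mapFn, String reduceFn, Date lastRun) {
        MapReduceIterable<Document> out = c.mapReduce(mapFn, reduceFn)
                .filter(gt("lastModified", lastRun))          // only records newer than the last run take part
                .sort(new Document("lastModified", 1));       // sort on an indexed field
        for (Document o : out) {
            System.out.println(o.toJson());
        }
    }
}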
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Map/Reduce In Scabi (continued)
Alternatively, we can also programmatically read/write data from the Mongo Collection, or from any other data source, from each Compute Unit.
Compute Units are just like any other Java program. We need to include the jar files in the program to start using the APIs of specific storage systems (like Amazon S3, Google Cloud Storage), other file systems, databases (JDBC, Oracle, DB2, Cassandra, CouchDB, Redis, etc.), other third-party Java libraries, Java Machine Learning libraries (J48/C4.5/C5, JavaBayes, Weka, etc.) and other external Java libraries.
To access these systems from inside a Compute Unit, add the jar files using the .addJar() method before submitting the Compute Units for execution using the .perform() method of the DCompute class.
In each Compute Unit, load data directly from any data source (shared file system, SAN, NAS, the Alluxio in-memory distributed file system (formerly Tachyon), JDBC, Amazon S3, etc.), perform the Map/Reduce and Aggregations, and write the results back to the data source.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Map/Reduce In Scabi (continued)
For unstructured data, the DFile class can be used to store large data files and retrieve only specific portions/partitions of a file (e.g. a 64 MB partition) from each Compute Unit to process the data concurrently.
Alternatively, very large files can be split into several small files, stored using DFile, and processed concurrently from each Compute Unit.
For structured or semi-structured data, the Dao class can be used to do Map/Reduce, Aggregations and other data computations. By spreading the data across multiple databases, database instances or collections, or by using sharded databases and collections, massively parallel data computations can be implemented (discussed further in the section "Peta Scale With Cloud").
From each Compute Unit, read data over specific ranges or specific split keys from the database so that data computations can run concurrently in multiple Compute Units.
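A hedged sketch of such a range-split read inside a Compute Unit's compute() method follows ('table' is obtained via dao.getTable() as shown earlier; the numeric field name "Age", the overall key range and the and()/gte() filter helpers are illustrative assumptions alongside the or()/eq()/lt() helpers used in the earlier find() example):

int totalUnits = context.getTU();
int thisUnit = context.getCU();
int minKey = 0, maxKey = 1000;                          // assumption: overall range of the split key
int span = (maxKey - minKey) / totalUnits;
int lo = minKey + (thisUnit - 1) * span;                // this unit's slice of the key range
int hi = (thisUnit == totalUnits) ? maxKey : lo + span;
DResultSet rs = table.find(and(gte("Age", lo), lt("Age", hi)));
while (rs.hasNext()) {
    DDocument d = rs.next();
    // ... do this unit's share of the data computation here ...
}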
Copyright (c) Dilshad Mustafa. All Rights Reserved.
BigData solutions rely on data locality to handle data processing on large data sets. The data locality is achieved through local disk I/O (a large number of Direct Attached Storage devices attached to each node) and in-memory processing (large RAM in each node).
Both the Direct Attached Storage (DAS) architecture and large RAM sizes tend to raise costs quickly as data reaches Peta scale, after factoring in maintenance costs, backups of previous months' Peta-scale data sets, hardware upgrade cycles, etc. On Amazon AWS, for example, EBS with PIOPS and S3 are relatively cost effective compared with a huge number of DAS devices physically attached to each node across tens of thousands of nodes for storing Peta-scale data. EBS with PIOPS can provide near data locality.
If Map/Reduce is run incrementally and as a batch job, S3 provides a cost-effective solution for storing Petabytes of data (as in the Netflix Hadoop implementation, Pinterest, etc.).
In-memory distributed file systems like Alluxio (formerly Tachyon) can provide faster access and act as a distributed memory cache layer. From each Compute Unit we can directly read/write to the in-memory file system.
Peta-scale data can be divided and stored in multiple MongoDB instances or databases. From each Compute Unit we can directly read/write to a corresponding MongoDB instance or database.
BigData Processing In Cloud
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Peta Scale With Cloud
The Scabi micro framework can be used to implement solutions to many different kinds of problems, including Map/Reduce and Aggregation problems and parallel algorithms, and to run Machine Learning algorithms in parallel over huge data sets stored in heterogeneous data sources.
The following is one example of handling Peta scale. In this example, divide the Petabytes of data into multiple MongoDB databases, just to speed up the Map/Reduce data computations, and assign them to different Scabi Namespaces (DataSet1, ..., DataSetN) using the meta.namespaceRegister() method. Then submit multiple Compute Units to the Scabi Cluster.
In each Compute Unit, get a table from its assigned Scabi Namespace and access the Mongo Collection using MongoCollection c = table.getCollection().
Then do the Map/Reduce directly on the Mongo Collection natively using c.mapReduce(map, reduce) (refer to Example 5). This way the actual data is not moved around in the network.
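A hedged sketch of that per-Compute-Unit step (how the Dao object is constructed is not shown in this deck, and the namespace naming DataSet1..DataSetN, the table name and the JavaScript functions are illustrative assumptions):

int thisUnit = context.getCU();                          // 1..N, one Compute Unit per database
DTable table = dao.getTable("scabi:DataSet" + thisUnit + ":mytable");
MongoCollection<Document> coll = table.getCollection();
String map = "function() { emit(this.category, this.amount); }";
String reduce = "function(key, values) { return Array.sum(values); }";
MapReduceIterable<Document> out = coll.mapReduce(map, reduce);   // runs inside the database, data stays local
for (Document o : out) {
    System.out.println(o.toJson());                      // only small summaries travel back over the network
}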
Data can be spread based on any of the following arrangements:
1. Multiple different databases in different MongoDB instances
2. Multiple different collections in the same or different databases, in the same or different MongoDB instances
3. A sharded database and collection
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Peta Scale With Cloud (continued)
The figure shows one example of implementing massively parallel Map/Reduce and Aggregations: a grid of Compute Units is logically aligned to a corresponding grid of MongoDB instances / databases or collections.
[Figure: a grid of Compute Units (CU), each paired with its own MongoDB instance / database (DB), on which it invokes c.mapReduce() and c.aggregate()]
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Namespace Operations
Dson dson = new Dson();
dson.add("Namespace", "MyCompany-Tables");
dson.add("Type", DNamespace.APPTABLE);
dson.add("Host", "localhost");
dson.add("Port", "27017");
dson.add("UserID", "myuser");
dson.add("Pwd", "hello");
dson.add("SystemSpecificName", "MyCompanyDB");
dson.add("SystemType", "MongoDB");
if (false == meta.namespaceExists("MyCompany-Tables")) {
System.out.println("Register new namespace");
String uuid = meta.namespaceRegister(dson);
}
To create a new namespace:
Copyright (c) Dilshad Mustafa. All Rights Reserved.
How to quickly run Scabi
1. Install Oracle Java 8 Java SE 1.8.0_66
2. Install MongoDB v3.2.1 with default settings, without enabling Login password and
security certificate. Run sudo mongod --dbpath /home/<username>/db/data
3. Download scabi.tar.gz from Download folder in Scabi’s GitHub project
4. Extract scabi.tar.gz to a folder /home/<username>/scabi
5. Start Meta Server,
./start_meta.sh &
6. Start Compute Servers,
./start_compute.sh 5001 localhost 5000 1000 &
./start_compute.sh 5002 localhost 5000 1000 &
To start Compute Servers in other machines and ports, enter command as below,
./start_compute.sh <ComputeServer_Port> <MetaServer_HostName>
<MetaServer_Port> [<NoOfThreads> [debug]] &
7. Run example code inside the examples folder in /home/<username>/scabi,
cd examples
java -cp "../dependency-jars/*":"../*":. Example1
java -cp "../dependency-jars/*":"../*":. Example1_2
java -cp "../dependency-jars/*":"../*":. Example1_3
java -cp "../dependency-jars/*":"../*":. Example1_4
java -cp "../dependency-jars/*":"../*":. Example2
java -cp "../dependency-jars/*":"../*":. Example3
java -cp "../dependency-jars/*":"../*":. Example4
java -cp "../dependency-jars/*":"../*":. Example5
Copyright (c) Dilshad Mustafa. All Rights Reserved.
How to quickly run Scabi (continued)
1. Scabi Meta Server command line options
./start_meta.sh <No arguments> to use default settings
./start_meta.sh <MetaServer_Port> [debug]
./start_meta.sh <MetaServer_Port> <Database_HostName> <Database_Port> [debug]
2. Scabi Compute Server command line options
./start_compute.sh <ComputeServer_Port> <MetaServer_HostName>
<MetaServer_Port> [<NoOfThreads> [debug]]
Command line options
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Performance Tuning
The following guidelines can help with performance tuning.
1. The Scabi Cluster can be scaled out horizontally by adding more compute hardware, starting
more Compute Servers, running Compute Servers with more threads per Compute Server, or
adding more Meta Servers, each with its own cluster of Compute Servers.
2. If the User has a limited number of compute hardware connected to the Cluster, the User
can start fewer Compute Servers, with more threads per Compute Server, based on the
available memory. The minimum and maximum Thread Stack Size and Heap memory size can be
configured through JVM settings when starting the Compute Servers (see the sketch after this list).
3. If the User has a large number of compute hardware connected to the Cluster, each with a
large memory size, then more Compute Servers, with fewer threads per Compute Server, can
be started on each compute hardware.
4. Additional Meta Servers can be started and added to the Cluster, with each Meta Server
having its own cluster of Compute Servers.
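For illustration only (the values are placeholders and should be sized to the actual hardware), the thread stack size and heap limits can be set with standard JVM flags when a Compute Server is launched directly with java, reusing the command form shown in the build section:
java -Xss512k -Xms2g -Xmx8g -jar scabi_compute.jar 5001 localhost 5000 200 &
Here -Xss sets the per-thread stack size and -Xms / -Xmx set the minimum and maximum heap size. If the start_compute.sh script is used instead, the same flags would need to be passed through to its java invocation (whether the script forwards JVM options is an assumption here).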
Copyright (c) Dilshad Mustafa. All Rights Reserved.
How to quickly build Scabi from GitHub
Initial Setup
1. Install Oracle Java 8 (Java SE 1.8.0_66)
2. Install Git
3. Install Maven
4. Create the folder /home/<username>/scabi
5. cd to the scabi folder
6. Run the command
git clone https://www.github.com/dilshadmustafa/scabi.git
Build Scabi Core scabi_core.jar
1. cd to the DilshadDCS_Core folder in /home/<username>/scabi
2. Run the command
mvn package
3. The file scabi_core.jar will be created
Copyright (c) Dilshad Mustafa. All Rights Reserved.
How to quickly build Scabi from GitHub (continued)
Build Scabi Meta Server scabi_meta.jar
1. Copy the scabi_core.jar file created in the step above to the folder DilshadDCS_MS
2. cd to the DilshadDCS_MS folder in /home/<username>/scabi
3. Include scabi_core.jar in the Maven Java classpath before compiling with Maven
4. Run the command
mvn package
5. The file scabi_meta.jar will be created
6. cd to the target folder
7. Include scabi_core.jar in the Java classpath before running the Meta Server
8. Run the command below to run the Meta Server with default settings (MongoDB should
already be installed and started)
java -jar scabi_meta.jar
Build Scabi Compute Server scabi_compute.jar
1. Copy the scabi_core.jar file created in the step above to the folder DilshadDCS_CS
2. cd to the DilshadDCS_CS folder in /home/<username>/scabi
3. Include scabi_core.jar in the Maven Java classpath before compiling with Maven
4. Run the command
mvn package
5. The file scabi_compute.jar will be created
6. cd to the target folder
7. Include scabi_core.jar in the Java classpath before running the Compute Server
8. Run the command below to run the Compute Server (the Meta Server should already be
started)
java -jar scabi_compute.jar 5001 localhost 5000 1000
Copyright (c) Dilshad Mustafa. All Rights Reserved.
InfiniBand Support
The Scabi micro framework and its Cluster are written in pure Java. The micro framework, Meta
Servers and Compute Servers can be configured through JVM settings to use Sockets Direct
Protocol (SDP), enabling them to run on InfiniBand or other RDMA networks. The JVM settings
can likewise be configured to use IBM Java Sockets Over RDMA (JSOR) or RSockets.
MongoDB does not natively support RDMA; it can be configured to run over IP over
InfiniBand (IPoIB).
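As an illustrative sketch only (the file path, addresses and port ranges are placeholders), SDP in the Oracle JVM is typically enabled by pointing the JVM at an SDP configuration file when launching a server, for example:
java -Dcom.sun.sdp.conf=/home/<username>/scabi/sdp.conf -Dcom.sun.sdp.debug -jar scabi_compute.jar 5001 localhost 5000 1000
where sdp.conf lists bind/connect rules of the form:
# use SDP when binding to the local InfiniBand interface
bind 192.0.2.1 *
# use SDP when connecting to Meta Servers / Compute Servers on this subnet
connect 192.0.2.0/24 5000-5010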
Copyright (c) Dilshad Mustafa. All Rights Reserved.
APIs / Libraries used by Scabi
Scabi uses the following APIs/Libraries:
1. MongoDB Driver API 3.2.1
2. MongoDB GridFS Driver API 3.2.1
3. RedHat JBoss RESTEasy framework 3.0.14.Final
4. RedHat JBoss JavaAssist API 3.20.0-GA
5. Jetty Web Server API 9.3.2.v20150730
6. Apache Http Client API 4.5.2
7. Apache Http Async Client API 4.1.1
8. Jackson Json API 2.7.4
9. BeanShell API 2.0b4
10. SLF4J Simple Logging API 1.7.16
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Test Environment
Scabi is currently at version v0.2, its initial version. It is developed and tested in the
following environment:
1. Ubuntu 15.10 64-bit
2. Oracle Java 8 (Java SE 1.8.0_66) 64-bit
3. MongoDB 3.2.1 64-bit
4. Maven
It uses the versions of the APIs/Libraries listed in the previous section.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Examples
Please refer to the Java example code in the Download folder of the GitHub project
https://www.github.com/dilshadmustafa/scabi
Example 1
These examples use the DCompute class, which internally uses asynchronous non-blocking
network I/O and can submit a very large number of split jobs / Compute Units.
Complex and time-consuming computing examples: a program that checks whether a specific
number is prime. The program uses Scabi to automatically scale out horizontally to
thousands of Compute Servers running on hundreds of networked commodity machines.
Demonstrates the executeClass(), executeObject(), executeJar() and executeCode() functionalities.
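A minimal sketch of the kind of submission Example 1 demonstrates, following the DCompute / DComputeUnit pattern shown in the earlier slides (the meta connection object and the MyPrimeCheckUnit class are assumed to exist as in those slides; the split count is illustrative):
Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");
// result map as used in the earlier DCompute slides; exact key/value types follow the Scabi API
HashMap map = new HashMap();
// constructed as in the earlier slides (shown there as DCompute(meta))
DCompute c = new DCompute(meta);
c.executeClass(MyPrimeCheckUnit.class).split(100000).input(json).output(map).perform();
c.finish();
// after finish(), map holds one result per Compute Unit, keyed by its Compute Unit number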
Example 1_2
(a) Shows how to add additional Java libraries (jar files).
(b) Shows the executeObject() method to submit a Compute Unit; the Compute Unit will
internally submit its own Compute Units / split jobs for execution in the Cluster.
(c) Shows how to add jar files / Java libraries to Compute Units submitted from within a
Compute Unit. CUs run inside Compute Servers, so jar file paths provided by the User are not
available there; use the addComputeUnitJars() method to add all the jar files the User
provided to this Compute Unit.
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Scabi Examples (continued)
Example 1_3
These examples use the DComputeSync class, which internally uses synchronous blocking
network I/O and can submit a large number of split jobs / Compute Units.
Complex and time-consuming computing examples: a program that checks whether a specific
number is prime. The program uses Scabi to automatically scale out horizontally to
thousands of Compute Servers running on hundreds of networked commodity machines.
Demonstrates the executeClass(), executeObject(), executeJar() and executeCode() functionalities.
Example 1_4
(a) Shows how to add additional Java libraries (jar files).
(b) Shows the executeObject() method to submit a Compute Unit; the Compute Unit will
internally submit its own Compute Units / split jobs for execution in the Cluster.
(c) Shows how to add jar files / Java libraries to Compute Units submitted from within a
Compute Unit. CUs run inside Compute Servers, so jar file paths provided by the User are not
available there; use the addComputeUnitJars() method to add all the jar files the User
provided to this Compute Unit.
Scabi Examples (continued)
Example 2 - Distributed Storage & Retrieval examples
(a) put() operations for storing files into Scabi from the local file system and from input streams
(b) copy() operations to copy files into another Scabi Namespace or to another file
(c) get() operations for retrieving files from Scabi to the local file system or to output streams
Example 3 - Scabi Distributed Tables examples
Demonstrates CRUD operations:
(a) Create a table
(b) Check whether a table exists
(c) Get an existing table
(d) Insert data into a table
(e) Update records in a table
(f) Query data in a table
(g) Directly embed MongoDB filters into queries
Example 4 - Scabi Distributed Tables examples (continued)
(a) Access the underlying MongoCollection
(b) Map/Reduce example on the MongoCollection
Example 5 - Scabi Namespace Operations
(a) Check whether a Namespace exists
(b) Create a new Namespace
Copyright (c) Dilshad Mustafa. All Rights Reserved.
For any questions or clarifications, or if you would like to partner in, participate in, or fund
Scabi Project development, please feel free to contact Dilshad Mustafa at
mdilshad2016@yahoo.com with a copy to mdilshad2016@rediffmail.com
Copyright (c) Dilshad Mustafa. All Rights Reserved.
Dilshad Mustafa is the creator and programmer of the Scabi micro framework and Cluster. He
is also the author of the book “Tech Job 9 to 9”. He is a Senior Software Architect with 16+
years of experience in the Information Technology industry, across various domains: Banking,
Retail, and Materials & Supply Chain.
He completed his B.E. in Computer Science & Engineering from Annamalai University, India,
and his M.Sc. in Communication & Network Systems from Nanyang Technological
University, Singapore.
Dilshad Mustafa can be reached at mdilshad2016@yahoo.com with a copy to
mdilshad2016@rediffmail.com
Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 1. Scabi A simple, lightweight framework for Cluster Computing and Storage for BigData processing in pure Java, created and programmed by Dilshad Mustafa Copyright © Dilshad Mustafa 2016. All Rights Reserved. https://www.github.com/dilshadmustafa/scabi
  • 2. TABLE OF CONTENTS 1. Scabi Overview 2. Scabi Data Driven Framework 3. Scabi Framework Constructs 4. Scabi Data Ring 5. Processing Huge Data Sets in Scabi 6. Single Hardware Vs Scabi Cluster Performance 7. User Files in Scabifs 8. Map/Reduce In Scabi 9. BigData Processing In Cloud 10. Peta Scale In Cloud 11. How to quickly run Scabi 12. Example 1 – MapReduce, Median computing examples 13. Example 2 – Scabifs examples PART 1
  • 3. TABLE OF CONTENTS 1. Scabi Compute Driven Framework Overview 2. Scabi Compute Driven Framework 3. Scabi Cluster 4. Scabi - Distributed Storage & Retrieval 5. Scabi Namespace 6. Submitting User Jobs, Programs in Scabi 7. Single Hardware Vs Scabi Cluster Performance 8. User Files in Scabi 9. Local Filesystem Vs Scabi Cluster Performance 10. User Tables, Data in Scabi 11. Map/Reduce In Scabi 12. BigData Processing In Cloud 13. Peta Scale In Cloud 14. Scabi Namespace Operations 15. How to quickly run Scabi 16. Scabi Performance Tuning 17. How to quickly build Scabi from GitHub 18. InfiniBand Support 19. APIs / Libraries used by Scabi 20. Scabi Test Environment 21. Example 1 - Complex and time consuming computing examples 22. Example 2 - Distributed Store & Retrieval examples 23. Example 3 - CRUD examples 24. Example 4 – Map/Reduce example 25. Example 5 – Scabi Namespace examples PART 2
  • 4. Copyright (c) Dilshad Mustafa. All Rights Reserved. Scabi is a simple, light-weight Cluster Computing & Storage micro framework for BigData processing written purely in Java. Scabi provides two frameworks for processing (a) Data-driven framework and (b) Compute- driven framework. Both the frameworks basically share the same underlying core. Part 1 of this presentation covers Data-driven framework and Part 2 covers Compute-driven framework. (a) Data-driven framework In the data-driven framework, Scabi processes partitions of huge datasets parallely by loading these partitions into memory and executing User-defined operations on those partitions (partition data and its operations are together referred to as a Data Unit) in the Scabi Cluster. The framework is highly fault tolerant and manages executing the Data Units when any number of systems which are part of the Scabi Cluster can fail at any time. Data Unit makes use of in-memory, off-heap and unbounded storage data structure and enables fast processing of huge data sets. This enables us to perform algorithms like complex MapReduce operations, ensemble machine learning algorithms and iterative algorithms. This gives us the capability to process Petabytes to Exabytes+ of multiple datasets within minutes. The Scabi micro framework with Scabi Cluster enables high performance computing by spreading the Data Units and executing in the Scabi Cluster. The Scabi Compute Services and Meta Services weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Services by networking commodity hardware. Scabi Overview
  • 5. (b) Compute-driven framework In the compute-driven framework, Scabi processes User-defined computations or Algorithms or jobs parallely by splitting them into Compute Units and executing them in the Scabi Cluster. The framework is highly fault tolerant and manages executing the Compute Units when any number of systems which are part of the Scabi Cluster can fail at any time. The framework takes care of the distributed computing and load balancing in the Scabi Cluster. This gives us the capability to perform complex and time-consuming computations by aggregating and combining the processing power of many individual systems. The Scabi micro framework with Scabi Cluster enables high performance computing by spreading the Scabi Users's jobs and programs and executing in the Scabi Cluster. The Scabi Compute Servers and Meta Servers weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Servers by networking commodity hardware. This means Users do not need specialized computing hardware with thousands of CPUs or CPU cores or special network hardware. The Scabi framework provides simple API to easily distribute storage and retrieval of User files and data by using Scabi Namespaces. The micro framework with the cluster provides high availability of User files and data by keeping versions of User files. Scabi Overview Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 6. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 7. In the data-driven framework, Scabi processes partitions of huge datasets parallely by loading these partitions into memory and executing User-defined operations on those partitions (partition data and its operations are together referred to as a Data Unit) in the Scabi Cluster. The framework is highly fault tolerant and manages executing the Data Units when any number of systems which are part of the Scabi Cluster can fail at any time. Data Unit makes use of in-memory, off-heap and unbounded storage data structure and enables fast processing of huge data sets. This enables us to perform algorithms like complex MapReduce operations, ensemble machine learning algorithms and iterative algorithms. This gives us the capability to process Petabytes to Exabytes+ of multiple datasets within minutes. The Scabi micro framework with Scabi Cluster enables high performance computing by spreading the Data Units and executing in the Scabi Cluster. The Scabi Compute Services and Meta Services weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Services by networking commodity hardware. Scabi Data Driven Framework Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 8. Copyright (c) Dilshad Mustafa. All Rights Reserved. #m Compute Hardware #1 Compute Service #2 Compute Service #n Compute Service #1 DU #2 DU #p DU #1ComputeHardware #1ComputeService #2ComputeService #nComputeService #1 DU #2 DU #p DU #2ComputeHardware #1ComputeService #2ComputeService #nComputeService #1 DU #2 DU #p DU #3 Compute Hardware Data (#1) Data (#2) Data (#M ) #1 DP #2 DP #N DP N is total number of Data Units across all Compute Services across all Compute Hardware for a dataset M is total number of datasets Driver Code Data Ring Distributed Storage System Scabi Data Driven Framework (continued) m, n, p are any variable number DU DP Data Unit Data Partition Figure 1.1
  • 9. Data Partition DP #1 Data Page DPE #2 Data Page DPE #k Data Page DPE #p Data Unit DU #1 Data Partition DP #2 Data Partition DP #M Data Partition DP Memory In-memory, Off-heap Local Cache Memory-mapped, Local file #1 DPE #2 DPE #k DPE Unbounded storage Data Ring Distributed Storage System Most Recently Used (MRU) Of Data Pages (DPE) Data Page size = 64 MB (Configurable) Time To Live (TTL) = 1000 ms (Configurable) Data Ring Distributed Storage System Data Ring Distributed Storage System Scabi Data Driven Framework (continued) p, k are any variable number M is total number of datasets Copyright (c) Dilshad Mustafa. All Rights Reserved. Figure 1.2
  • 10. Scabi Framework Constructs Scabi’s data-driven framework comprises four core constructs: Data, Data Unit, Data Partition, Data Ring Data Data is the construct that orchestrates a data cluster of Data Units across Compute Services in the Scabi Cluster. It handles all the User’s multiple datasets. User can give a string identifier to each dataset referred to as Data Id. Data Unit Data Unit is the construct that represents a data partition from all the User’s datasets along with their User defined set of operations. Data Units are executed parallely in the Compute Services. Data Partition Data Partition is an in-memory, off-heap, unbounded storage data structure that uses memory, local cache, distributed storage system (Data Ring) for storing a portion of a data set. Data Partition has unbounded storage as its storage is not limited to any particular system’s hard disk or storage. The storage is provided by the Data Ring. Data Partition maintains a Most Recently Used (MRU) of Data Pages locally (basically 64 MB page files with Time To Live TTL of 1000 ms) which are memory-mapped files, in-memory and off-heap, enabling faster processing in memory with less memory foot print. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 11. Scabi Framework Constructs Data Unit class Data Unit class is used to load data into each Data Unit and tell the framework how many Data Units are to be created. Data Ring Data Ring is a distributed storage system that holds all the partition data of all the User’s datasets. This can be a (a) Network or distributed file system (POSIX or non-POSIX, need not be fully POSIX compliant). Examples are NFS, IBM GPFS, Oracle LustreFS (b) FUSE mounted distributed file system. Examples are RedHat GlusterFS, Scality, Google GFS2, Apache HDFS, MapR FS, Ceph FS, IBM Cleversafe (c) Non file system. Examples are Seaweed file system (d) S3 or Object Storage system. Examples are Minio, Cloudian, Riak (e) Any other storage system with HTTP or REST or S3 or any custom interface. Support for any storage system can be implemented by implementing the interface IStorageHandler.java Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 12. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 13. The Scabi micro framework with Scabi Cluster enables high performance computing by spreading the Scabi Users's jobs and programs and executing in the Scabi Cluster. The Scabi Compute Servers and Meta Servers weave together to form a highly scalable cluster of hundreds of thousands of Scabi Compute Servers by networking commodity hardware. This means Users do not need specialized computing hardware with thousands of CPUs or CPU cores or special network hardware. The Scabi framework provides simple API to easily distribute storage and retrieval of User files and data by using Scabi Namespaces. The micro framework with the cluster provides high availability of User files and data by keeping versions of User files. Scabi Compute Driven Framework Overview Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 14. Scabi provides single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh). User files of various sizes can be stored and retrieved using the Scabi Namespaces just like in a shared or cluster file system. Massively parallel Map/Reduce, Aggregations and Geospatial queries can be performed on the User’s tables in various databases without actually moving the data in the network. Programs running in the User’s Client system as well as those running in the Scabi Cluster can access the User’s files and tables in the distributed databases through the Scabi Namespace URL: scabi:<namespace>:<resource name>. Also using Scabi URLs eliminate the need to pass huge volumes of data around in the network, without saturating the network bandwidth. Scabi Compute Driven Framework Overview (continued) Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 15. In Scabi, each User job or program is sliced into multiple split jobs. Each split job is known as Compute Unit (CU). The total number of Compute Units is specified by the Scabi User. Each Compute Unit will be executed separately in any of the Compute Servers available in the Scabi Cluster. Also each Compute Server can execute multiple CU concurrently. There can be multiple Compute Servers running in the same as well as different hardware. All Compute Servers are connected to a Meta Server. All Meta Servers are connected to each other forming a Scabi Cluster. The Scabi Cluster can be easily scaled-out horizontally by adding more Compute Hardware, starting more Compute Servers, run Compute Servers with more number of threads per Compute Server and adding Meta Servers with its own Cluster of Compute Servers. Meta Servers are added by starting a new Meta Server and pointing it to an existing Meta Server, forming a mega cluster. Scabi Compute Driven Framework Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 16. Meta Server #m Compute Hardware #1 Compute Server #1 CU #2 CU #m CU #2 Compute Server #1 CU #2 CU #m CU #n Compute Server #1 CU #2 CU #p CU #2 Compute Hardware #1 Compute Server #1 CU #2 CU #m CU #2 Compute Server #1 CU #2 CU #m CU #n Compute Server #1 CU #2 CU #p CU #1 Compute Hardware #1 Compute Server #1 CU #2 CU #m CU #2 Compute Server #1 CU #2 CU #m CU #n Compute Server #1 CU #2 CU #p CU Scabi Cluster Figure shows a Scabi Cluster with m-Compute Hardware running n-Compute Servers each running p-Compute Units, connected to one Meta Server. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 17. Meta Server #1 Compute Hardware #1 Compute Server #1 CU #2 CU #m CU #2 Compute Server #1 CU #2 CU #m CU #n Compute Server #1 CU #2 CU #p CU Scabi Cluster (cont’d) Meta Server Meta Server Figure shows a Scabi Cluster with multiple Compute Hardware running multiple Compute Servers each running multiple Compute Units, scales out horizontally by adding more Compute Hardware, starting more Compute Servers and Meta Servers. #m CH #1 CS #2 CS #n CS #1 CH #1 CS #2 CS #n CS #k CH #1 CS #2 CS #n CS Copyright (c) Dilshad Mustafa. All Rights Reserved. CS CH Compute Hardware Compute Server
  • 18. Scabi provides storage and retrieval for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh). Scabi maintains two versions of each User file at any time. The current version and the immediate previous version of each file will be always available in the system. After each completed file upload operation, the specific uploaded file will be marked as latest and the last version (based on server timestamp) that already existed in the system prior to upload will be marked as immediate previous version. All other versions will be removed from the system. Scabi relies on MongoDB's Replica Sets to provide high availability for User’s data through MongoDB's replicating / secondary servers. The Replication process provided by MongoDB is transparent to Scabi users and is utilized by directly configuring MongoDB. For providing load balancing for various User’s file and database operations and to scale-out horizontally, Scabi relies on MongoDB's Sharding process to provide high performance access to User’s data. Scabi users can directly configure the MongoDB database to use Sharding. Scabi - Distributed Storage & Retrieval Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 19. Scabi provides single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh). Each of Scabi's Namespace for User files, App-specific tables, unstructured document data, properties and Java files, corresponds to a MongoDB database as configured by Scabi user while registering the namespace in Meta server. Scabi Namespaces can be registered to use same or different MongoDB databases which are distributed and located anywhere and connected to the network accessible by the Scabi Cluster and User’s Client system. Programs running in the User’s Client system as well as those running in the Scabi Cluster can access the User’s resources stored in the distributed databases through the Scabi namespace URL: scabi:<namespace>:<resource name> Users as well as programs running in the Scabi Cluster can perform various operations viz. registering new namespace, read / write operations for various types of User data: User files, tables, unstructured document data, properties and Java files (.class, .jar, .bsh). Scabi Namespace Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 20. Scabi Client DB-n Meta Server DB-1 DB-2 #1 Compute Hardware #1 Compute Server #1 CU #2 CU #m CU #2 Compute Server #1 CU #2 CU #m CU #n Compute Server #1 CU #2 CU #p CU #m CH #1 CS #2 CS #n CS Figure shows one scenario with a Scabi Client writing User files and table data to DB-2 and DB-n and submitting split jobs / Compute Units to various Compute Servers for execution. The CUs will then process the User files and table data by accessing DB-2 and DB-n and writing results back to DB or returning results back to Scabi Client. The Client either directly receives the results from the Compute Units or read results from User files and table data from DB-2 and DB-n. Scabi Client and Compute Servers resolve the namespace URL scabi:<namespace>:<resource name> into specific DB by contacting the Meta Server Scabi Namespace (continued) Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 21. Submitting User Jobs, Programs in Scabi User programs use Scabi Client API to split jobs or programs into Compute Units. Users extend the DComputeUnit class and implement the compute() method. Users can then use DCompute class to submit the Compute Units to the Scabi Cluster for execution in the Compute Servers. 1. Example (a) for Splitting a User’s Job Let’s take a Prime number example. To check if a given number N is Prime, we need to divide with all the previous Prime numbers or with 2 and all previous odd numbers till square root of N to check if N is divisible. If N contains millions of digits, this process will become a time consuming computing for a single PC or a computer hardware with thousands of cores or CPUs. To give an idea for comparison, Java’s long has a maximum of 19 digits and double has a maximum of 308 digits. NJob : Check if N is divisible by 2 and all odd numbers <= 2, 3, 5, …, N To split this job into Compute Units, extend the DComputeUnit class and implement the compute() method. User specifies the number of times the job has to be split for e.g. 100,000, then 100,000 DComputeUnit objects will be executed in the Scabi Cluster in various Compute Servers. Submitting User jobs or programs for execution in the Scabi Cluster involve the following steps:- 1. Splitting the User’s job or program 2. Extend the DComputeUnit class and implement the compute() method 3. Use DCompute class to submit the DComputeUnit class for execution in the Scabi Cluster 4. Retrieve the execution results Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 22. Submitting User Jobs, Programs in Scabi (continued) 1. Example (a) for Splitting a User’s Job (continued) DComputeUnit object, getCU() / CU number #1 Each DComputeUnit object below checks for division of N only for the following set of numbers shown below:- 2 3 5 P1 Start End DComputeUnit object, getCU() / CU number #2 P1+2 P1+4 P2 P99999+2DComputeUnit object, getCU() / CU number #100,000 P100000P99999+4 N getTU() Total Units X iWhere Pi = Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 23. Submitting User Jobs, Programs in Scabi (continued) 1. Example (b) for Splitting a User’s Job A meteorological department's data contains Geographical temperature variations from 1980 to 2015 automatically recorded by instrumentation devices each hour. The department needs to obtain mean- average of temperature variations per month basis for their research purposes. The calculation becomes complex as they want to apply complex statistical formula. If the data contains millions of records to be processed, this process will become a time consuming computing for a single PC or a computer hardware with thousands of cores or CPUs. Job : Calculate mean-average of temperature variations per month basis from 1980 to 2015 and apply the department’s statistical formula To split this job into Compute Units, extend the DComputeUnit class and implement the compute() method. User specifies the number of times the job has to be split for e.g. 36 to split by Year, then 36 DComputeUnit objects will be executed in the Scabi Cluster in various Compute Servers. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 24. Submitting User Jobs, Programs in Scabi (continued) 1. Example (b) for Splitting a User’s Job (continued) DComputeUnit object, getCU() / CU number #1 Each DComputeUnit object below checks for 12 months of an Year as shown below:- 1 2 3 12 Start End DComputeUnit object, getCU() / CU number #2 DComputeUnit object, getCU() / CU number #36 1980 Year 1980+(2-1)1 2 3 12 1980+(36-1)1 2 3 12 Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 25. Submitting User Jobs, Programs in Scabi (continued) 2. Extend the DComputeUnit Class and implement the compute() method Example (a) MyFirstUnit Class public class MyFirstUnit extends DComputeUnit { public String compute(DComputeContext context) { int totalUnits = context.getTU(); int thisUnit = context.getCU(); String result = “Hello from this unit CU #” + thisUnit); return result; } } The code above creates a class MyFirstUnit by extending DComputeUnit and implements the compute() method. context will be passed to each Compute Unit object running in the Compute Servers by the Scabi framework. getTU() will give the Total number of Compute Units or the split jobs as specified by the User, getCU() is the Compute Unit number of this particular Compute Unit object running in the Compute Server. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 26. Submitting User Jobs, Programs in Scabi (continued) 2. Extend the DComputeUnit Class and implement the compute() method (continued) Example (b) MyPrimeCheckUnit Class public class MyPrimeCheckUnit extends DComputeUnit { public String compute(DComputeContext context) { int totalUnits = context.getTU(); int thisUnit = context.getCU(); BigInteger number = new BigInteger(context.getInput().getString(“NumberToCheck”)); // check if number is even number. // If number >2 and divisible by 2, then number is not Prime, return false immediately … // Obtain square root of number BigInteger sqrtof = sqrt(number); // calculate chunkSize = sqrt(number) / totalUnits BigInteger chunkSize = …; // calculate starting number for division, start = (thisUnit – 1) * chunkSize + 1 // make start as odd number > 1 if not already BigInteger start = …; // calculate ending number for division, end = thisUnit * chunkSize // make end as odd number if not already BigInteger end = …; Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 27. Submitting User Jobs, Programs in Scabi (continued) 2. Extend the DComputeUnit Class and implement the compute() method (continued) Example (b) MyPrimeCheckUnit Class (continued) // check if number is divisible by numbers from start to end for (…) { … } String result = …; return result; } } The above code is abbreviated to focus on explaining the concept and saving space. The abbreviated code is provided with comments and is self-explanatory. We first calculate the chunk size, which is square root (N) / getTU() Total Units. We then calculate the start and end numbers to be used for division check. start = (thisUnit – 1) * chunk size + 1 end = thisUnit * chunk size For more details, please refer the Java code. Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 28. After DComputeUnit class or object is created, Users can then use DCompute class to submit the Compute Units to the Scabi Cluster for execution in the Compute Servers. DCompute class uses asynchronous non-blocking network I/O to submit the Compute Units to the Compute Servers for execution. It determines the optimal number of threads the User’s Client system can handle based on User’s Client system’s memory and number of CPUs as well as the number of threads sufficient enough to submit all the Compute Units. This class can be used to submit very large number of Compute Units / split jobs to the Scabi Cluster for execution. Users can also explicitly specify the number of threads to be created by using the maxThreads() method. The performance of execution of the Compute Units in the Scabi Cluster is limited mostly by the number of Compute Hardware and Compute Servers available in the Scabi Cluster. Submitting User Jobs, Programs in Scabi (continued) DCompute Class 3. Submitting Compute Units for execution in the Cluster Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 29. To give a theoretical example, the following code submits 1 billion Compute Units or split jobs to check if the input number is Prime number. The number of compute hardware and compute servers running in Scabi Cluster is the limiting factor in the performance of execution of the Compute Units. The code below shows four different ways to do it:- Submitting User Jobs, Programs in Scabi (continued) DCompute Class (continued) MyPrimeCheckUnit class extends DComputeUnit class and is explained in prior slide in section “Extend the DComputeUnit Class and implement the compute() method”, Example (b). json is a Dson object containing the input number to check for Prime. It can contain potentially millions of digits. To give an idea for comparison, Java’s long has a maximum of 19 digits and double has a maximum of 308 digits. Dson json = new Dson(); json.add(“NumberToCheck”, “712430483480234234234241232143223447”); DCompute c = DCompute(meta); c.executeClass(MyPrimeCheckUnit.class).split(1000000000).input(json).output(m ap).perform(); c.finish(); Method 1 – Submitting Compute Units with Class 3. Submitting Compute Units for execution in the Cluster (continued) Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 30. Submitting User Jobs, Programs in Scabi (continued) DCompute Class (continued) Dson json = new Dson(); json.add(“NumberToCheck”, “712430483480234234234241232143223447”); DCompute c = DCompute(meta); c.addJar(“MyPrimeCheckUnit.jar”); // Add all Java libraries / jar files like this c.executeCode(“import MyPrimeCheckUnit;” + “cu = new MyPrimeCheckUnit();” “return cu.compute(context);”); c.split(1000000000).input(json).output(map).perform(); c.finish(); Method 3 – Submitting Compute Units with Java Source Code Dson json = new Dson(); json.add(“NumberToCheck”, “712430483480234234234241232143223447”); MyPrimeCheckUnit cu = new MyPrimeCheckUnit(); DCompute c = DCompute(meta); c.executeObject(cu).split(1000000000).input(json).output(map).perform(); c.finish(); Method 2 – Submitting Compute Units with object reference 3. Submitting Compute Units for execution in the Cluster (continued) Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 31. The Compute Units submitted through executeClass(), executeObject(), executeCode() and executeJar() methods are all executed in the Compute Servers available in the Cluster. The DCompute class provides various methods to do the following: (1) Specify the number of times the job or program has to be split into Compute Units (2) Specify the range, which set of Compute Units has to be submitted to Scabi Cluster (3) Input data to the Compute Units (4) Where to store output results for each Compute Unit after execution in the Scabi Cluster (5) Explicitly specify and override the maximum number of threads to be created to submit the Compute Units to the Scabi Cluster Submitting User Jobs, Programs in Scabi (continued) DCompute Class (continued) Dson json = new Dson(); json.add(“NumberToCheck”, “712430483480234234234241232143223447”); DCompute c = DCompute(meta); c.executeJar(“MyPrimeCheckUnit.jar”, “MyPrimeCheckUnit”); c.split(1000000000).input(json).output(map).perform(); c.finish(); Method 4 – Submitting Compute Units with Jar file 3. Submitting Compute Units for execution in the Cluster (continued) Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 32. As the Compute Units submitted through executeClass(), executeObject(), executeCode() and executeJar() methods are all executed in the Compute Servers available in the Cluster, the Compute Servers will not have the User’s jar files for the classes used by the DComputeUnit object, cu in the above example. Use addJar() method to add all the supporting jar files. Submitting User Jobs, Programs in Scabi (continued) DCompute Class (continued) DComputeUnit cu = new DComputeUnit() { public String compute(DComputeContext context) { MyPrimeCheckUnit pcu = new MyPrimeCheckUnit(); return pcu.compute(context); } }; DCompute c = DCompute(meta); // Add all Java libraries, jar files like this c.addJar("MyPrimeCheckUnit.jar"); c.executeObject(cu).input(json).split(1).output(out); c.perform(); c.finish(); Adding jar files, Java libraries to the Compute Units 3. Submitting Compute Units for execution in the Cluster (continued) Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 33. Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
3. Submitting Compute Units for execution in the Cluster (continued)

Submitting Compute Units from within a Compute Unit

The example below uses the executeObject() method to submit a Compute Unit. That Compute Unit internally submits its own Compute Units / split jobs for execution in the Cluster.

DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        DCompute c2 = new DCompute(meta2);
        c2.executeClass(MyPrimeCheckUnit.class);
        c2.input(context.getInput()).split(1).output(out2);
        c2.perform();
        c2.finish();
        … … …
    }
};
DCompute c = new DCompute(meta);
// Add all Java libraries, jar files like this
c.addJar("MyPrimeCheckUnit.jar");
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 34. Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
3. Submitting Compute Units for execution in the Cluster (continued)

Providing User's jar files to Compute Units submitted from within a Compute Unit

The CUs submitted through this CU run in any of the Compute Servers, and the jar file paths on the User's Client system are not valid inside a Compute Server. Use the addComputeUnitJars() method to pass on all the jar files the User added to this CU (cu in the code below) when it was originally submitted.

DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        DCompute c2 = new DCompute(meta2);
        c2.addComputeUnitJars();
        c2.executeCode("import MyPrimeCheckUnit;" +
            "cu = new MyPrimeCheckUnit();" +
            "return cu.compute(context);");
        c2.input(context.getInput()).split(1).output(out2);
        c2.perform();
        c2.finish();
        return null; // return a result as needed
    }
};
DCompute c = new DCompute(meta);
c.addJar("MyPrimeCheckUnit.jar"); // Add all Java libraries / jar files like this
c.executeObject(cu).input(json).split(1).output(out);
c.perform();
c.finish();

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 35. Submitting User Jobs, Programs in Scabi (continued)
DCompute Class (continued)
3. Submitting Compute Units for execution in the Cluster (continued)

Compute Units are just like any other Java program. We need to include the jar files in the program to start using the API of specific storage systems (like Amazon S3, Google Cloud Storage), other file systems, databases (JDBC, Oracle, DB2, Cassandra, CouchDB, Redis, etc.), other third-party Java libraries, Java Machine Learning libraries (J48/C4.5/C5, JavaBayes, Weka, etc.) and other external Java libraries. To access these systems or other Java libraries from inside a Compute Unit, add the jar files using the .addJar() method before submitting the Compute Units for execution using the .perform() method of the DCompute class.

In each DCompute object, a maximum of Long.MAX_VALUE (2^63 - 1, or 9223372036854775807) splits can be created. By creating additional DCompute objects and passing additional parameter values in the json input (supplied through the .input() method of DCompute), such as "TotalDComputeObjects" and "ThisDComputeObject", a theoretically unlimited number of splits can be created and handled in the System, limited only by the hardware and memory used in the System (a sketch follows this slide).

Copyright (c) Dilshad Mustafa. All Rights Reserved.
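A minimal sketch of that multi-object pattern, assuming the result map is keyed by split number as described on the next slide; the "TotalDComputeObjects" / "ThisDComputeObject" fields are only conventions to be interpreted by the User's own compute() code, not parameters defined by Scabi.

import java.util.HashMap;
// Scabi imports omitted (package path not shown in this deck).

int totalObjects = 4;                          // illustrative value
HashMap[] results = new HashMap[totalObjects]; // one raw result map per DCompute object
for (int i = 0; i < totalObjects; i++) {
    Dson json = new Dson();
    json.add("NumberToCheck", "712430483480234234234241232143223447");
    json.add("TotalDComputeObjects", String.valueOf(totalObjects));
    json.add("ThisDComputeObject", String.valueOf(i));
    results[i] = new HashMap();
    DCompute c = new DCompute(meta);
    c.executeClass(MyPrimeCheckUnit.class)
     .split(1000000000)
     .input(json)
     .output(results[i])
     .perform();
    c.finish(); // for simplicity each batch is run to completion before the next one
}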
• 36. Submitting User Jobs, Programs in Scabi (continued)
3. Submitting Compute Units for execution in the Cluster (continued)

In the DCompute class, Compute Units / split jobs are submitted for execution in parallel in the available Compute Servers in the Scabi Cluster. The framework tries to execute the Compute Units in as many Compute Servers as the total number of splits specified by the User, limited only by the number of Compute Servers running in the Scabi Cluster. Users can use the maxRetry() method to specify the maximum number of retries attempted in case of a network communication error with a Compute Server; in those cases, the Compute Units are resubmitted to other working Compute Servers in the Cluster.

4. Retrieving Results After Execution

After the finish() method is called on the DCompute object, results are available in the HashMap object supplied to the .output() method. The map contains the split number / Compute Unit number as the key, with the result returned by each Compute Unit after execution in a Compute Server in the Cluster as the value (a short sketch follows this slide).

Copyright (c) Dilshad Mustafa. All Rights Reserved.
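A short sketch of reading those results, assuming (as described above) that the map is keyed by split number and that the values are the Strings returned by compute(); the key/value types are not spelled out in the deck, so the map is used in raw form here.

import java.util.HashMap;
// Scabi imports omitted (package path not shown in this deck).

Dson json = new Dson();
json.add("NumberToCheck", "712430483480234234234241232143223447");

HashMap map = new HashMap();
DCompute c = new DCompute(meta);
c.executeClass(MyPrimeCheckUnit.class).split(1000).input(json).output(map).perform();
c.finish();

// One entry per Compute Unit: key = split / Compute Unit number, value = returned result.
for (Object splitNumber : map.keySet()) {
    System.out.println("Split " + splitNumber + " -> " + map.get(splitNumber));
}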
• 37. Single Hardware Vs Scabi Cluster Performance

Single Hardware

[Graph: Time vs. No. of Threads / Processes on a single hardware system.]

In a single hardware scenario, the User has a system with N CPUs/cores. Users try to get the maximum performance by running their program with as many threads as possible, or by running multiple processes. If the User's system has 16 GB memory and each thread consumes roughly 1 MB of stack memory, the maximum number of threads that can be created on the User's client system is approximately 16,384. If the User's system has 4 CPUs, the system will then spend most of its time in thread context switching.

The graph shows that the maximum performance (minimum time in the graph) is achieved only at a certain number of threads. As the number of threads increases, the CPUs spend most of their time in thread context switching, and a very high number of threads actually becomes counter-productive for the performance of the system: the graph shows performance dropping again at very high thread counts.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 38. Single Hardware Vs Scabi Cluster Performance

Scabi Cluster Performance

[Graph: Time vs. m Compute Hardware x n Compute Servers for the Scabi Cluster.]

In the previous single-hardware scenario, the User's system has 16 GB memory and, assuming each thread consumes 1 MB of stack memory, the maximum number of threads that can be created on the User's client system is approximately 16,384. Using a Scabi Cluster, the User can run programs on multiple such systems. For example, with m such systems the programs can effectively run on m * 16,384 threads, allowing the User to scale out horizontally by adding more compute hardware, starting more Compute Servers, running Compute Servers with more threads per server, or adding more Meta Servers, each with its own Cluster of Compute Servers.

The Compute Units submitted through the DCompute class run concurrently in the Compute Servers in the Scabi Cluster, as well as concurrently in multiple threads within each Compute Server.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 39. User Files In Scabi (continued)

User file operations in Scabi are carried out using the DFile class. The DFile class provides methods to set the Namespace to access, a put() method to store files in Scabi and a get() method to retrieve files from Scabi.

Scabi maintains two versions of each User file at any time: the current version and the immediate previous version are always available in the system. After each completed file upload, the uploaded file is marked as the latest version and the last version (based on server timestamp) that already existed in the system prior to the upload is marked as the immediate previous version. All other versions are removed from the system.

Scabi provides a single, unified and uniform namespace for various types of User data: files, tables, unstructured document data (Collections), properties and Java files (.class, .jar, .bsh). Each Scabi Namespace for User files, User App-specific tables, unstructured document data, properties and Java files corresponds to a MongoDB database as configured by the Scabi user while registering the namespace in the Meta Server. Scabi Namespaces can be registered to use the same or different MongoDB databases, which can be distributed and located anywhere on a network accessible by the Scabi Cluster and the User's Client system.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 40. User Files In Scabi

Scabi connects to MongoDB databases for the various User file and User database operations. Each Scabi Namespace may correspond to a separate MongoDB database. Scabi's Meta Server also connects to MongoDB to read/write metadata about Compute Servers as well as Scabi Namespaces.

Scabi relies on MongoDB Replica Sets to provide high availability for the User's data through MongoDB's replication / secondary servers. To provide load balancing for the User's file and database operations and to scale out horizontally, Scabi relies on MongoDB Sharding.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 41. User Files In Scabi (continued)

Programs running in the User's Client system, as well as those running in the Scabi Cluster, can access the User's files stored in the distributed databases through the Scabi Namespace URL:

scabi:<namespace>:<file name>

The following Scabi Namespace URLs refer to different files:
scabi:MyOrg.MyFiles:myfile.txt
scabi:MyOrg.MyFiles:/myfile.txt
scabi:MyOrg.MyFiles:./myfile.txt
scabi:MyOrg.MyFiles:/home/dilshad/myfile.txt
scabi:MyOrg.MyFiles:myorg.mydept.myfile.txt
scabi:MyOrg.MyFiles:myorg-mydept-myfile.txt
scabi:MyOrg.MyFiles:/usr/myfile.txt
scabi:MyOrg.Bangalore:myfile.txt
scabi:MyOrg.California:myfile.txt

After files are stored in Scabi using the put() method, the Scabi Namespace URL can be passed around between the Scabi Client system and Compute Units to give the Compute Units access to input files for processing. Compute Units can also write results to files in Scabi and return the Scabi Namespace URL of the file in the result when the result contains a huge volume of data.

Users, as well as programs running in the Scabi Cluster, can perform various operations, viz. registering a new namespace and read / write operations on the files.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 42. User Files In Scabi (continued)

[Figure: Scabi Client using the Scabi Framework, which resolves Namespaces by contacting the Meta Server; DB-1 stores the latest version and the immediate previous version of each file.]

The figure shows a Client using the Scabi micro framework to do file operations in Scabi. Namespaces are resolved by the framework by contacting the Meta Server.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 43. User Files In Scabi (continued)

[Figure: Scabi Client, Meta Server (Resolve Namespace), DB-1 (stores the latest and immediate previous version of the file) and m Compute Hardware, each running Compute Servers with multiple CUs. Steps: 1) Client writes scabi:MyOrg.MyFiles:input.txt, 2) CUs read scabi:MyOrg.MyFiles:input.txt, 3) CUs write scabi:MyOrg.MyFiles:results.txt, 4) Client reads scabi:MyOrg.MyFiles:results.txt.]

Both the Scabi Client and the Compute Units running in Compute Servers can read and write files in Scabi. After a file is stored in Scabi, only the Scabi Namespace URL of the file needs to be conveyed, instead of transferring the actual contents of the file between the Scabi Client and the Compute Units.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 44. User Files In Scabi (continued)

DFile Class

The DFile class is used to do all the file operations in Scabi. The Scabi micro framework resolves the namespaces in the Scabi Namespace URLs by contacting the Meta Server.

The DFile class internally uses a streaming mechanism to read and write files in Scabi, which means the entire file contents are never loaded into memory. For example, if the User's Client system has 2 GB memory and a 6 GB file is written using the DFile class, the contents of the file are transferred through a 64 MB internal buffer; the entire 6 GB of file data is not loaded into memory. The DFile class is lightweight and does not add any significant overhead when transferring data: on a low-end system with 2 GB RAM, a 6 GB file is transferred within 4 minutes.

The Scabi framework maintains two versions of each file: the latest version and the previous version. This ensures that when transferring very large files, if there is a network error the system still contains the previous version and the corrupted file is discarded.

High availability of data is achieved by enabling MongoDB replication (primary and secondary servers) on the underlying MongoDB server instances that correspond to each Scabi Namespace. High loads can be handled by using MongoDB Sharding on the underlying MongoDB server instances that correspond to each Scabi Namespace.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 45. User Files In Scabi (continued)

DFile Class (continued)

The following code examples demonstrate various file operations using the DFile class.

To put a file from the local file system into Scabi:

DFile f = new DFile(meta);
f.put("scabi:MyOrg.MyFiles:myfile1.txt", "myfile1.txt");

To put a file from an input stream into Scabi:

FileInputStream fis = new FileInputStream("myfile3.txt");
f.put("scabi:MyOrg.MyFiles:myfile3.txt", fis);
fis.close();

To copy a file already in Scabi into another Scabi Namespace or to another file:

f.copy("scabi:MyOrg.MyFiles:myfile4.txt", "scabi:MyOrg.MyFiles:myfile1.txt");

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 46. User Files In Scabi (continued)

DFile Class (continued)

To get a file already in Scabi to the local file system:

f.get("scabi:MyOrg.MyFiles:myfile1.txt", "fileout1.txt");

To get a file already in Scabi to an output stream:

FileOutputStream fos = new FileOutputStream("fileout3.txt");
f.get("scabi:MyOrg.MyFiles:myfile1.txt", fos);
fos.close();

Copyright (c) Dilshad Mustafa. All Rights Reserved.
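Putting the pieces together, a minimal client-side sketch of the pattern described earlier: store an input file once, pass only its Scabi Namespace URL to the Compute Units, and fetch the results file afterwards. The Dson field names and the MyFileProcessingUnit class are hypothetical; only the DFile, Dson and DCompute calls themselves appear in this deck.

import java.util.HashMap;
// Scabi imports omitted (package path not shown in this deck).

DFile f = new DFile(meta);
f.put("scabi:MyOrg.MyFiles:input.txt", "input.txt");

Dson json = new Dson();
json.add("InputFileUrl", "scabi:MyOrg.MyFiles:input.txt");     // field name is illustrative
json.add("ResultFileUrl", "scabi:MyOrg.MyFiles:results.txt");  // field name is illustrative

HashMap out = new HashMap();
DCompute c = new DCompute(meta);
c.executeClass(MyFileProcessingUnit.class).split(100).input(json).output(out).perform(); // hypothetical unit
c.finish();

// Each Compute Unit is expected to read/write the files through DFile using the URLs
// it receives; once they are done, the client fetches the results file.
f.get("scabi:MyOrg.MyFiles:results.txt", "results.txt");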
• 47. Local Filesystem Vs Scabi Cluster Performance

[Graph 1 – Local File System: Time vs. No. of bytes read / written.]
[Graph 2 – Scabi Cluster: Time vs. No. of bytes read / written using m Scabi Namespaces x n MongoDB instances.]

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 48. User Tables In Scabi

User table operations in Scabi are carried out using the Dao, DTable, DObject and DResultSet classes.

Dao class – provides methods to set the Namespace to access and a getTable() method to get a DTable.
DTable class – provides select, insert, update and delete operations; can directly embed MongoDB queries to query or filter data; gives access to the underlying Mongo Collection; supports Map/Reduce, the Aggregation Framework (Aggregation Pipeline) and Geospatial queries.
DResultSet class – contains the results returned by DTable methods.

The following Scabi Namespace URLs refer to different tables:
scabi:MyOrg.MyTables:mytable
scabi:MyOrg.MyTables:/mytable
scabi:MyOrg.MyTables:./mytable
scabi:MyOrg.MyTables:/home/dilshad/mytable
scabi:MyOrg.MyTables:myorg.mydept.mytable
scabi:MyOrg.MyTables:myorg-mydept-mytable
scabi:MyOrg.MyTables:/usr/mytable
scabi:MyOrg.Bangalore.Tables:mytable
scabi:MyOrg.California.Tables:mytable

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 49. User Tables In Scabi (continued)

After data is stored in Scabi using the DTable methods, the Scabi Namespace URL can be passed around between the Scabi Client system and Compute Units to give the Compute Units access to the data for processing. Compute Units can also write results to tables in Scabi and return the Scabi Namespace URL of the table in the result when the result contains a huge volume of data.

Programs running in the User's Client system, as well as those running in the Scabi Cluster, can access the User's tables stored in the distributed databases through the Scabi Namespace URL:

scabi:<namespace>:<table name>

Users, as well as programs running in the Scabi Cluster, can perform various operations, viz. registering a new namespace and read / write operations on the tables.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 50. User Tables In Scabi (continued)

[Figure: Scabi Client using the Scabi Framework, which resolves Namespaces by contacting the Meta Server; DB-1 holds the table data.]

The figure shows a Client using the Scabi micro framework to do database operations in Scabi. Namespaces are resolved by the framework by contacting the Meta Server.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 51. User Tables In Scabi (continued)

[Figure: Scabi Client, Meta Server (Resolve Namespace), DB-1 (table data) and m Compute Hardware, each running Compute Servers with multiple CUs. Steps: 1) Client writes scabi:MyOrg.MyTables:emp_table, 2) CUs read scabi:MyOrg.MyTables:emp_table, 3) CUs write scabi:MyOrg.MyTables:emp_results, 4) Client reads scabi:MyOrg.MyTables:emp_results.]

Both the Scabi Client and the Compute Units running in Compute Servers can perform select, insert, update and delete operations on tables in Scabi. After table data is stored in Scabi, only the Scabi Namespace URL of the table needs to be conveyed, instead of transferring the actual contents of the table between the Scabi Client and the Compute Units.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 52. User Tables In Scabi (continued)

Dao Class (continued)

The following code examples demonstrate various database operations using the Dao class.

To create a table in Scabi:

DTable table = dao.createTable("scabi:MyOrg.MyTables:Table1");

To get an existing table in Scabi:

DTable table = dao.getTable("scabi:MyOrg.MyTables:Table1");

To insert data into a table in Scabi:

DDocument d = new DDocument();
d.append("EmployeeName", "Karthik").append("EmployeeNumber", "3000");
d.append("Age", 40);
table.insert(d);

To update records in a table in Scabi:

// eq() embeds a MongoDB filter directly into the update
DDocument d2 = new DDocument();
d2.put("Age", 45);
DDocument updateObj = new DDocument();
updateObj.put("$set", d2);
table.update(eq("EmployeeName", "Balaji"), updateObj);

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 53. User Tables In Scabi (continued)

Dao Class (continued)

To query data in a table in Scabi:

// or(), eq(), lt() embed MongoDB filters directly into the query
DResultSet result = table.find(or(eq("EmployeeNumber", "3003"), lt("Age", 40)));
while (result.hasNext()) {
    DDocument d3 = result.next();
    … … …
}

To access the underlying MongoCollection from a DTable:

MongoCollection<Document> c = table.getCollection();

Map/Reduce example directly on the MongoCollection:

String map = "function() { for (var key in this) { emit(key, null); } }";
String reduce = "function(key, s) { if ('Age' == key) return true; else return false; }";
MapReduceIterable<Document> out = c.mapReduce(map, reduce);
for (Document o : out) {
    System.out.println("Key name is : " + o.get("_id").toString());
    System.out.println(o.toString());
}

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 54. Map/Reduce In Scabi

Map/Reduce functions can be executed directly and natively on the Mongo Collection.

To access the underlying MongoCollection from a DTable:

MongoCollection<Document> c = table.getCollection();

Map/Reduce example directly on the MongoCollection:

String map = "function() { for (var key in this) { emit(key, null); } }";
String reduce = "function(key, s) { if ('Age' == key) return true; else return false; }";
MapReduceIterable<Document> out = c.mapReduce(map, reduce);
for (Document o : out) {
    System.out.println("Key name is : " + o.get("_id").toString());
    System.out.println(o.toString());
}

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 55. Map/Reduce In Scabi (continued)

MongoDB's Map/Reduce functionality can be used/invoked directly from within each Compute Unit. The following optimizations can be applied to Map/Reduce with MongoDB:
1) Map/Reduce optimized with sort on indexed fields
2) Incremental Map/Reduce (Map/Reduce with a query filter to read only newer records)
3) Concurrent Map/Reduce (each Map/Reduce over a specific range)
4) Map/Reduce over a sharded collection

MongoDB's Aggregation Framework (Aggregation Pipeline) can likewise be used/invoked directly from within each Compute Unit. The following optimizations can be applied to aggregate() with MongoDB (an illustrative aggregate() call follows this slide):
1) Running aggregate() concurrently (each aggregate() over specific split keys)
2) aggregate() over a sharded collection

Copyright (c) Dilshad Mustafa. All Rights Reserved.
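As an illustration of the Aggregation Pipeline on the collection obtained from table.getCollection(), here is a small sketch using the MongoDB Java driver's aggregation builders; the field names reuse the employee example from the earlier DTable slides and the grouping chosen here is purely illustrative.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;

// Count employees aged 40 or above, grouped by Age, directly on the Mongo Collection.
MongoCollection<Document> c = table.getCollection();
AggregateIterable<Document> out = c.aggregate(Arrays.asList(
        Aggregates.match(Filters.gte("Age", 40)),
        Aggregates.group("$Age", Accumulators.sum("count", 1))));
for (Document o : out) {
    System.out.println(o.toJson());
}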
• 56. Map/Reduce In Scabi (continued)

Alternatively, we can also programmatically read/write data from the Mongo Collection, or from any other data source, inside each Compute Unit. Compute Units are just like any other Java program: include the jar files in the program to start using the API of specific storage systems (like Amazon S3, Google Cloud Storage), other file systems, databases (JDBC, Oracle, DB2, Cassandra, CouchDB, Redis, etc.), other third-party Java libraries, Java Machine Learning libraries (J48/C4.5/C5, JavaBayes, Weka, etc.) and other external Java libraries. To access these systems from inside a Compute Unit, add the jar files using the .addJar() method before submitting the Compute Units for execution using the .perform() method of the DCompute class.

In each Compute Unit, load data directly from any data source (shared file system, SAN, NAS, the Alluxio in-memory distributed file system (formerly Tachyon), JDBC, Amazon S3, etc.), perform the Map/Reduce and Aggregations, and write the results back to the data source.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 57. Map/Reduce In Scabi (continued)

For unstructured data, the DFile class can be used to store large data files and retrieve only a specific portion/partition (e.g. a 64 MB partition) of the file in each Compute Unit, so the data can be processed concurrently. Alternatively, very large files can be split into several smaller files, stored using DFile and processed concurrently, one per Compute Unit (see the sketch after this slide).

For structured or semi-structured data, the Dao class can be used for Map/Reduce, Aggregations and other data computations. By spreading the data across multiple databases, database instances and collections, or by using sharded databases and collections, massively parallel data computations can be implemented (discussed further in the section "Peta Scale With Cloud"). In each Compute Unit, read data over a specific range or specific split keys from the database so that multiple Compute Units can run the data computations concurrently.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
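A minimal client-side sketch of that splitting approach, using only the DFile put() call shown earlier plus standard Java I/O. The file name, chunk naming scheme and 64 MB chunk size are illustrative, and a real implementation would split on record boundaries rather than at arbitrary byte offsets.

import java.io.FileInputStream;
import java.io.FileOutputStream;
// Scabi imports omitted; the surrounding method should declare throws IOException.

// Split a large local file into ~64 MB chunk files and store each chunk under its
// own Scabi Namespace URL, so each Compute Unit can fetch and process one chunk.
int chunkSize = 64 * 1024 * 1024;
byte[] buffer = new byte[chunkSize];
DFile f = new DFile(meta);
try (FileInputStream in = new FileInputStream("bigdata.csv")) {
    int part = 0;
    while (true) {
        int filled = 0, n;
        while (filled < chunkSize && (n = in.read(buffer, filled, chunkSize - filled)) > 0) {
            filled += n;
        }
        if (filled == 0) {
            break; // end of file
        }
        String chunkFile = "bigdata.part" + part + ".csv";
        try (FileOutputStream chunkOut = new FileOutputStream(chunkFile)) {
            chunkOut.write(buffer, 0, filled);
        }
        f.put("scabi:MyOrg.MyFiles:" + chunkFile, chunkFile); // put(url, localPath) as shown earlier
        part++;
    }
}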
• 58. BigData Processing In Cloud

BigData solutions rely on data locality to handle processing of large data sets. Data locality is achieved through local disk I/O (a large number of Direct Attached Storage devices attached to each node) and in-memory processing (large RAM in each node). Both the Direct Attached Storage (DAS) architecture and large RAM sizes tend to raise costs quickly as data reaches Peta scale, after factoring in maintenance costs, backups of previous months' Peta-scale data sets, the hardware upgrade cycle, etc.

For example, in Amazon AWS, EBS with PIOPS and S3 are relatively cost effective compared with a huge number of DAS storage devices physically attached to each node across tens of thousands of nodes for storing Peta-scale data. EBS with PIOPS can provide near data locality. If Map/Reduce is run incrementally and as a batch job, S3 provides a cost-effective way to store Petabytes of data (as in the Netflix Hadoop implementation, Pinterest, etc.).

In-memory distributed file systems like Alluxio (formerly Tachyon) can provide faster access and act as a distributed memory cache layer; each Compute Unit can read/write directly to the in-memory file system. Peta-scale data can also be divided and stored in multiple MongoDB instances or databases, with each Compute Unit reading/writing directly to its corresponding MongoDB instance or database.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 59. Peta Scale With Cloud

The Scabi micro framework can be used to implement solutions to many different kinds of problems, including Map/Reduce, Aggregation problems and parallel algorithms, and to run Machine Learning algorithms in parallel over huge data sets stored in heterogeneous data sources. Below is one example of handling Peta scale.

In this example, divide the Petabytes of data into multiple MongoDB databases, purely to speed up the Map/Reduce data computations, and assign them to different Scabi Namespaces (DataSet1, ..., DataSetN) using the meta.namespaceRegister() method. Then submit multiple Compute Units to the Scabi Cluster. In each Compute Unit, get a table from its assigned Scabi Namespace and access the Mongo Collection using MongoCollection<Document> c = table.getCollection(). Then do Map/Reduce directly and natively on the Mongo Collection using c.mapReduce(map, reduce) (refer to Example 5). This way the actual data is not moved around in the network (a rough sketch follows this slide).

Data can be spread in any of the following arrangements:
1. Multiple different databases in different MongoDB instances
2. Multiple different collections in the same or different databases, in the same or different MongoDB instances
3. A sharded database and collection

Copyright (c) Dilshad Mustafa. All Rights Reserved.
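To make the pattern concrete, here is a rough sketch of such a Compute Unit. It leans on several assumptions the deck does not spell out: how a Dao is constructed inside a Compute Server (a Dao(meta2) constructor is assumed here by analogy with DCompute(meta2)), how the assigned namespace is read from the input Dson (getString() is a hypothetical accessor), and the table and field names, which are purely illustrative.

// Rough sketch only; see the assumptions listed above.
// Needs org.bson.Document, com.mongodb.client.MongoCollection, com.mongodb.client.MapReduceIterable.
DComputeUnit cu = new DComputeUnit() {
    public String compute(DComputeContext context) {
        // Assumption: the client put an "AssignedNamespace" field (DataSet1..DataSetN)
        // into the Dson it passed to DCompute.input().
        Dson input = context.getInput();
        String namespace = input.getString("AssignedNamespace"); // hypothetical accessor

        Dao dao = new Dao(meta2);                                 // assumed constructor
        DTable table = dao.getTable("scabi:" + namespace + ":events"); // illustrative table name
        MongoCollection<Document> c = table.getCollection();

        // Map/Reduce runs natively on this namespace's own MongoDB database,
        // so the data never leaves its instance.
        String map = "function() { emit(this.region, 1); }";
        String reduce = "function(key, values) { return Array.sum(values); }";
        MapReduceIterable<Document> out = c.mapReduce(map, reduce);

        long keys = 0;
        for (Document doc : out) {
            keys++;
        }
        return "Processed " + keys + " keys from " + namespace;
    }
};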
• 60. Peta Scale With Cloud (continued)

[Figure: a grid of Compute Units, each logically aligned to a corresponding MongoDB instance / database or collection; every CU runs c.mapReduce() / c.aggregate() against its own DB.]

The figure shows one example of implementing massively parallel Map/Reduce and Aggregations: a grid of Compute Units is logically aligned to a corresponding grid of MongoDB instances / databases or collections.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 61. Scabi Namespace Operations

To create a new namespace:

Dson dson = new Dson();
dson.add("Namespace", "MyCompany-Tables");
dson.add("Type", DNamespace.APPTABLE);
dson.add("Host", "localhost");
dson.add("Port", "27017");
dson.add("UserID", "myuser");
dson.add("Pwd", "hello");
dson.add("SystemSpecificName", "MyCompanyDB");
dson.add("SystemType", "MongoDB");

if (false == meta.namespaceExists("MyCompany-Tables")) {
    System.out.println("Register new namespace");
    String uuid = meta.namespaceRegister(dson);
}

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 62. How to quickly run Scabi

1. Install Oracle Java 8 (Java SE 1.8.0_66)
2. Install MongoDB v3.2.1 with default settings, without enabling a login password or security certificate. Run:
   sudo mongod --dbpath /home/<username>/db/data
3. Download scabi.tar.gz from the Download folder in Scabi's GitHub project
4. Extract scabi.tar.gz to a folder /home/<username>/scabi
5. Start the Meta Server:
   ./start_meta.sh &
6. Start the Compute Servers:
   ./start_compute.sh 5001 localhost 5000 1000 &
   ./start_compute.sh 5002 localhost 5000 1000 &
   To start Compute Servers on other machines and ports, enter the command as below:
   ./start_compute.sh <ComputeServer_Port> <MetaServer_HostName> <MetaServer_Port> [<NoOfThreads> [debug]] &
7. Run the example code inside the examples folder in /home/<username>/scabi:
   cd examples
   java -cp "../dependency-jars/*":"../*":. Example1
   java -cp "../dependency-jars/*":"../*":. Example1_2
   java -cp "../dependency-jars/*":"../*":. Example1_3
   java -cp "../dependency-jars/*":"../*":. Example1_4
   java -cp "../dependency-jars/*":"../*":. Example2
   java -cp "../dependency-jars/*":"../*":. Example3
   java -cp "../dependency-jars/*":"../*":. Example4
   java -cp "../dependency-jars/*":"../*":. Example5

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 63. How to quickly run Scabi (continued)

Command line options

1. Scabi Meta Server command line options:
   ./start_meta.sh                  (no arguments, to use default settings)
   ./start_meta.sh <MetaServer_Port> [debug]
   ./start_meta.sh <MetaServer_Port> <Database_HostName> <Database_Port> [debug]
2. Scabi Compute Server command line options:
   ./start_compute.sh <ComputeServer_Port> <MetaServer_HostName> <MetaServer_Port> [<NoOfThreads> [debug]]

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 64. Scabi Performance Tuning

The following guidelines can help with performance tuning.

1. The Scabi Cluster can be scaled out horizontally by adding more compute hardware, starting more Compute Servers, running Compute Servers with more threads per server, or adding more Meta Servers, each with its own Cluster of Compute Servers.
2. If the User has a limited number of compute hardware connected to the Cluster, the User can start fewer Compute Servers with more threads per Compute Server, based on the memory size. Using JVM configuration, the thread stack size and the minimum and maximum heap size can be set when starting the Compute Servers (an illustrative command follows this slide).
3. If the User has a large number of compute hardware connected to the Cluster, each with a large memory size, then a large number of Compute Servers, with fewer threads per Compute Server, can be started on each compute hardware.
4. Additional Meta Servers can be started and added to the Cluster, with each Meta Server having its own cluster of Compute Servers.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
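As an illustration of point 2, the standard HotSpot options -Xss (thread stack size), -Xms and -Xmx (minimum/maximum heap size) can be added to the Compute Server launch command shown later in this deck; the values below are placeholders, not recommendations.

java -Xss512k -Xms2g -Xmx8g -jar scabi_compute.jar 5001 localhost 5000 1000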
• 65. How to quickly build Scabi from GitHub

Initial Setup
1. Install Oracle Java 8 (Java SE 1.8.0_66)
2. Install Git
3. Install Maven
4. Create folder /home/<username>/scabi
5. cd to the scabi folder
6. Run command: git clone https://www.github.com/dilshadmustafa/scabi.git

Build Scabi Core (scabi_core.jar)
1. cd to the DilshadDCS_Core folder in /home/<username>/scabi
2. Run command: mvn package
3. The file scabi_core.jar will be created

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 66. How to quickly build Scabi from GitHub (continued)

Build Scabi Meta Server (scabi_meta.jar)
1. Copy the scabi_core.jar file created in the step above to the folder DilshadDCS_MS
2. cd to the DilshadDCS_MS folder in /home/<username>/scabi
3. Include scabi_core.jar in the Maven Java classpath before compiling with Maven
4. Run command: mvn package
5. The file scabi_meta.jar will be created
6. cd to the target folder
7. Include scabi_core.jar in the Java classpath before running the Meta Server
8. Run the command below to run the Meta Server with default settings (MongoDB should already be installed and started):
   java -jar scabi_meta.jar

Build Scabi Compute Server (scabi_compute.jar)
1. Copy the scabi_core.jar file created in the step above to the folder DilshadDCS_CS
2. cd to the DilshadDCS_CS folder in /home/<username>/scabi
3. Include scabi_core.jar in the Maven Java classpath before compiling with Maven
4. Run command: mvn package
5. The file scabi_compute.jar will be created
6. cd to the target folder
7. Include scabi_core.jar in the Java classpath before running the Compute Server
8. Run the command below to run the Compute Server (the Meta Server should already be started):
   java -jar scabi_compute.jar 5001 localhost 5000 1000

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 67. InfiniBand Support

The Scabi micro framework and its Cluster are written in pure Java. The micro framework, Meta Servers and Compute Servers can be configured through Java JVM settings to use the Sockets Direct Protocol (SDP), enabling them to run on InfiniBand or other RDMA networks. The JVM settings can also be configured to use IBM Java Sockets Over RDMA (JSOR) or RSockets for the same purpose.

MongoDB does not natively support RDMA; it can be configured to run on IP over InfiniBand (IPoIB).

Copyright (c) Dilshad Mustafa. All Rights Reserved.
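For reference, Oracle's documented way of enabling SDP is to point the JVM at an SDP configuration file via -Dcom.sun.sdp.conf. The illustration below applies this to the Compute Server launch command from this deck; the addresses and port ranges are placeholders, not values prescribed by Scabi.

sdp.conf (illustrative rules: use SDP when binding to the local InfiniBand address and when connecting to the cluster's port range):
   bind 192.0.2.10 *
   connect 192.0.2.0/24 5000-5999

Launch command:
   java -Dcom.sun.sdp.conf=sdp.conf -jar scabi_compute.jar 5001 localhost 5000 1000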
• 68. APIs / Libraries used by Scabi

Scabi uses the following APIs / Libraries:
1. MongoDB Driver API 3.2.1
2. MongoDB GridFS Driver API 3.2.1
3. RedHat JBoss RESTEasy framework 3.0.14.Final
4. RedHat JBoss JavaAssist API 3.20.0-GA
5. Jetty Web Server API 9.3.2.v20150730
6. Apache Http Client API 4.5.2
7. Apache Http Async Client API 4.1.1
8. Jackson Json API 2.7.4
9. BeanShell API 2.0b4
10. SLF4J Simple Logging API 1.7.16

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 69. Scabi Test Environment

Scabi is currently at version v0.2, which is the initial version. It is developed and tested in the following environment:
1. Ubuntu 15.10 64-bit
2. Oracle Java 8 (Java SE 1.8.0_66) 64-bit
3. MongoDB 3.2.1 64-bit
4. Maven

It uses the following versions of APIs / Libraries:
1. MongoDB Driver API 3.2.1
2. MongoDB GridFS Driver API 3.2.1
3. RedHat JBoss RESTEasy framework 3.0.14.Final
4. RedHat JBoss JavaAssist API 3.20.0-GA
5. Jetty Web Server API 9.3.2.v20150730
6. Apache Http Client API 4.5.2
7. Apache Http Async Client API 4.1.1
8. Jackson Json API 2.7.4
9. BeanShell API 2.0b4
10. SLF4J Simple Logging API 1.7.16

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 70. Scabi Examples

Please refer to the Java example code in the Download folder of the GitHub project https://www.github.com/dilshadmustafa/scabi

Example 1
The examples use the DCompute class, which internally uses asynchronous non-blocking network I/O and can submit a very large number of split jobs / Compute Units. Complex and time-consuming computing examples: a program that checks whether a specific number is a Prime number. The program uses Scabi to automatically scale out horizontally to thousands of Compute Servers running on hundreds of networked commodity hardware systems. Demonstrates the executeClass(), executeObject(), executeJar() and executeCode() functionalities.

Example 1_2
(a) Shows how to add additional Java libraries / jar files.
(b) Shows the executeObject() method submitting a Compute Unit that internally submits its own Compute Units / split jobs for execution in the Cluster.
(c) Shows how to add jar files / Java libraries to Compute Units submitted from within a Compute Unit. CUs run inside Compute Servers, and the jar file paths provided by the User are not available inside the Compute Servers; use the addComputeUnitJars() method to add all the jar files provided to this Compute Unit by the User.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 71. Scabi Examples (continued)

Example 1_3
The examples use the DComputeSync class, which internally uses synchronous blocking network I/O and can submit a large number of split jobs / Compute Units. Complex and time-consuming computing examples: a program that checks whether a specific number is a Prime number. The program uses Scabi to automatically scale out horizontally to thousands of Compute Servers running on hundreds of networked commodity hardware systems. Demonstrates the executeClass(), executeObject(), executeJar() and executeCode() functionalities.

Example 1_4
(a) Shows how to add additional Java libraries / jar files.
(b) Shows the executeObject() method submitting a Compute Unit that internally submits its own Compute Units / split jobs for execution in the Cluster.
(c) Shows how to add jar files / Java libraries to Compute Units submitted from within a Compute Unit. CUs run inside Compute Servers, and the jar file paths provided by the User are not available inside the Compute Servers; use the addComputeUnitJars() method to add all the jar files provided to this Compute Unit by the User.

Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 72. Scabi Examples (continued)

Example 2 - Distributed Storage & Retrieval examples
(a) put() operations for storing files into Scabi from the local file system and from input streams
(b) copy() operations to copy files into another Scabi Namespace or to another file
(c) get() operations for retrieving files from Scabi to the local file system or to output streams

Example 3 - Scabi Distributed Tables examples: demonstrates CRUD operations
(a) Create a table
(b) Check whether a table exists
(c) Get an existing table
(d) Insert data into a table
(e) Update records in a table
(f) Query data in a table
(g) Directly embed MongoDB filters into queries

Example 4 - Scabi Distributed Tables examples (continued)
(a) Access the underlying MongoCollection
(b) Map/Reduce example on the MongoCollection

Example 5 - Scabi Namespace Operations
(a) Check whether a Namespace exists
(b) Create a new Namespace

Copyright (c) Dilshad Mustafa. All Rights Reserved.
  • 73. For any questions, clarifications or if you want to partner or participate or fund Scabi Project development, please feel free to contact Dilshad Mustafa at mdilshad2016@yahoo.com with a copy to mdilshad2016@rediffmail.com Copyright (c) Dilshad Mustafa. All Rights Reserved.
• 74. Dilshad Mustafa is the creator and programmer of the Scabi micro framework and Cluster. He is also the author of the book "Tech Job 9 to 9". He is a Senior Software Architect with 16+ years of experience in the Information Technology industry, across various domains: Banking, Retail, and Materials & Supply Chain. He completed his B.E. in Computer Science & Engineering from Annamalai University, India, and his M.Sc. in Communication & Network Systems from Nanyang Technological University, Singapore. Dilshad Mustafa can be reached at mdilshad2016@yahoo.com with a copy to mdilshad2016@rediffmail.com

Copyright (c) Dilshad Mustafa. All Rights Reserved.