The rise of “Big Data” on cloud computing: Review and open research issues
• Ibrahim Abaker Targio Hashem
• Ibrar Yaqoob
• Nor Badrul Anuar
• Salimah Mokhtar
• Abdullah Gani
• Samee Ullah Khan

Presented By
Kazi Mojammel Hossen
ID: B130305001
Minhazul Arefin
ID: B130305003

Outline
◎Introduction
◎Definition & Characteristics of Big Data
◎Cloud Computing
◎Relationship between cloud computing & big data
◎Case studies
◎Big data storage system
◎Hadoop background
◎Research challenges
◎Open research issues
◎Conclusion

Introduction
◎The continuous increase in the volume and detail of
data captured by organizations has produced an
overwhelming flow of data in either structured or
unstructured format.
◎Virtualization is a process of resource sharing and
isolation of underlying hardware to increase computer
resource utilization, efficiency, and scalability.
◎The goal of this study is to conduct a
comprehensive investigation of the status of big data
in cloud computing environments.

What is Big Data?
Big data is a term utilized
to refer to the increase in
the volume of data that are
difficult to store, process,
and analyze through
traditional database
technologies.

Characteristics of big data
◎Big data are characterized by three aspects:
i. data are numerous
ii. data cannot be categorized into regular
relational databases
iii. data are generated, captured, and
processed rapidly.

Volume
◎Processing Performance
◎Class Imbalance
◎Feature Engineering
◎Non-Linearity

Velocity
◎Data Availability
◎Real-Time Process/Streaming
◎Independent and Identically Distributed Random Variables

Variety
◎Data Locality
◎Data Heterogeneity
◎Dirty and Noisy Data

Veracity
◎Data Provenance
◎Data Uncertainty
◎Dirty and Noisy Data

Classification of Big Data
1. Data sources
◎Web & Social Media
◎Machine
◎Sensing
◎Transaction
◎IoT

2. Content Format
◎Structured
○ SQL Server
○ Oracle
○ Access, Excel
◎Semi-structured
○ Text Analytics
○ Blogs
○ Social Authority
○ Video
○ Audio
◎Unstructured
○ Weather data
○ Currency Conversion
○ Demographic
○ E-Commerce

3. Data Stores
◎Document-oriented
◎Column-oriented
◎Graph database
◎Key-value

4. Data Staging
◎Cleaning
◎Transform
◎Normalization
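The three staging steps named on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names (`clean`, `transform`, `normalize`, `stage`) and the sample readings are hypothetical, and min-max scaling is just one common choice of normalization, not a method the slides prescribe.

```python
def clean(records):
    """Cleaning: drop records that are missing or empty."""
    return [r for r in records if r is not None and str(r).strip() != ""]


def transform(records):
    """Transform: convert raw string readings into floats."""
    return [float(r) for r in records]


def normalize(values):
    """Normalization: min-max scale the values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def stage(records):
    """Run the three staging steps in order."""
    return normalize(transform(clean(records)))


# Hypothetical raw input with a missing value and an empty string.
raw = ["10", " 20 ", None, "", "30"]
staged = stage(raw)
print(staged)
```

Keeping each step as a separate function makes it easy to swap in a different cleaning rule or scaling method without touching the rest of the pipeline.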

5. Data Processing
◎Batch
○ Uses a MapReduce-based system
◎Real Time
○ Uses a scalable streaming system

What is Cloud Computing?
Cloud computing is a fast-
growing technology that
has established itself in
the next generation of IT
industry and business.

Cloud Service Model
◎The cloud service model typically consists of PaaS, SaaS,
and IaaS

Relationship between Cloud Computing & Big Data

Organization Case Studies from Vendors
1. SwiftKey
◎A language technology that aids touchscreen typing by providing personalized predictions and corrections
◎Collects and analyzes terabytes of data to create language models
◎Used Apache Hadoop together with Amazon Simple Storage Service (S3)

2. 343 Industries
◎Maker of Halo, a science fiction media franchise
◎The developers analyzed data to obtain insights into player preferences and online tournaments
◎Used the Windows Azure HDInsight service, which is based on the Apache Hadoop big data framework

3. redBus
◎Online travel agency
◎Unifies tens of thousands of bus schedules into a single booking operation
◎Implemented Google BigQuery to analyze large datasets on Google's data processing infrastructure

4. Nokia
◎A mobile communication company
◎Gathers and analyzes large amounts of data from mobile phones
◎Used Hadoop Distributed File System (HDFS)

5. Alacer
◎An online retailer
◎Experienced revenue leakage due to unreliable real-time notifications of service problems
◎Used big data algorithms to create a cloud monitoring system that delivers notifications

Case Studies from Scholarly/Academic Sources
◎Situation/context: Massively parallel DNA sequencing generates staggering amounts of data
○ Objective: Provide accurate and reproducible genomic results
○ Approach: Develop a Mercury analysis pipeline and deploy it in the Amazon Web Services cloud via the DNAnexus platform
○ Result: Established a powerful combination of a robust and fully validated software pipeline and a scalable computational resource
◎Situation/context: Conducting analyses on large social networks such as Twitter
○ Objective: To use cloud services as a possible solution for the analysis of large amounts of data
○ Approach: Use the PageRank algorithm on the Twitter user base to obtain user rankings
○ Result: Implemented a relatively cheap solution for data acquisition and analysis by using Amazon cloud infrastructure
◎Situation/context: To study the complex molecular interactions that regulate biological systems
○ Objective: To develop a Hadoop-based cloud computing application that processes sequences of microscopic images
○ Approach: Use the Hadoop cloud computing framework
○ Result: Allows users to submit data processing jobs in the cloud
◎Situation/context: Applications running in the cloud are likely to fail
○ Objective: Design a failure scenario
○ Approach: Create a series of failure scenarios on an Amazon cloud computing platform
○ Result: Helps to identify vulnerabilities in Hadoop applications running in the cloud
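
The Twitter case study above ranks users with PageRank. A minimal single-machine sketch of that algorithm follows, using power iteration. The `follows` graph, the user names, the damping factor of 0.85, and the iteration count are illustrative assumptions; the study itself ran on Amazon cloud infrastructure and its actual implementation is not shown in these slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each node to the list of nodes it links to.
    Every linked node must also appear as a key of graph.
    """
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling node (no outgoing links): spread its rank evenly.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical "who follows whom" graph; a follow counts as a link.
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": [],
}
ranks = pagerank(follows)
top_user = max(ranks, key=ranks.get)
print(top_user, ranks)
```

On a real social graph the same iteration would be distributed across a cluster, since the edge list does not fit on one machine.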

“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”
- Eric Schmidt, Executive Chairman, Google

Big data storage system
◎Traditional storage systems store data through structured RDBMS
◎A storage architecture needs to achieve availability and reliability
◎It also needs to store and manage large datasets
◎The organizational systems of data storage can be divided into three parts:
○ Disk arrays
○ Connection and network subsystems
○ Storage management software

Comparison of Storage Media
◎Hard drives
○ Specific use: Store data up to four terabytes
○ Advantages: Density; cost per bit of storage; speedy start-up
○ Disadvantages: Require special cooling; high read latency; produce more heat
◎Solid-state memory
○ Specific use: Store data up to two terabytes
○ Advantages: Fast access to data; fast movement of huge amounts of data; fast start-up time
○ Disadvantages: More expensive than hard drives
◎Object storage
○ Specific use: Store data as variable-size objects rather than fixed-size blocks
○ Advantages: Easy to find information; unique identifier to find data objects; ensures security
○ Disadvantages: Complexity in tracking indices
◎Optical storage
○ Specific use: Store data at different angles throughout the storage medium
○ Advantages: Least expensive; removable storage medium
○ Disadvantages: Complex; ability to produce multiple optical disks in a single unit is yet to be proven
◎Cloud storage
○ Specific use: Serve as a provisioning and storage model
○ Advantages: Useful for small organizations that do not have sufficient storage capacity; can store large amounts of data
○ Disadvantages: Less security

Hadoop
◎A free, Java-based
programming
framework that
supports the processing
of large data sets in a
distributed computing
environment
◎Implements the MapReduce
computation model
introduced by Google

HDFS (Hadoop Distributed File System)
◎A scalable distributed file system for applications
dealing with large data sets
○ Distributed: runs in a cluster
○ Scalable: 10K nodes, 100K files, 10 PB storage
◎Storage space is seamless for the whole cluster
◎Files broken into blocks
◎Typical block size: 128 MB.
◎Replication: Each block copied to multiple data nodes.
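
The block and replication figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is only arithmetic, not a call into any Hadoop API: the 128 MB block size comes from the slide, while the replication factor of 3 (the HDFS default) and the function name `hdfs_footprint` are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 128      # typical block size quoted on the slide
REPLICATION_FACTOR = 3   # assumption: the HDFS default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    # The last block may be partially filled, so round up.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Each block is copied to `replication` data nodes.
    block_replicas = blocks * replication
    # A partial block only occupies its actual size on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


blocks, block_replicas, raw_storage_mb = hdfs_footprint(1000)
print(blocks, block_replicas, raw_storage_mb)
```

Under these assumptions a 1000 MB file splits into 8 blocks, is stored as 24 block replicas, and consumes 3000 MB of raw cluster storage.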

What is MapReduce?
◎A programming model
◎A programming framework
◎Used to develop solutions that will
○ Process large amounts of data in a parallelized fashion
○ In clusters of computing nodes
◎Originally a closed-source implementation at Google
◎Hadoop: Open source implementation of the
algorithms described in the scientific papers

MapReduce
◎The model is broken down in 2 phases:
○ Map: Non-overlapping sets of input data (<key, value> records) are
assigned to different processes (mappers) that produce a set of
intermediate <key, value> results
○ Reduce: The output of the Map phase is fed to a typically smaller number of
processes (reducers) that aggregate the intermediate results into a smaller
number of <key, value> records.
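
The two phases above can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input splits are hypothetical, and the explicit `shuffle` step stands in for the grouping by key that a real framework performs between the Map and Reduce phases.

```python
from collections import defaultdict


def map_phase(split):
    """Mapper: emit an intermediate <word, 1> pair for every word in one split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: aggregate all values of one key into a single <word, count> record."""
    return key, sum(values)


def word_count(splits):
    intermediate = []
    for split in splits:
        # In a cluster, each non-overlapping split goes to a different mapper.
        intermediate.extend(map_phase(split))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical, non-overlapping input splits.
splits = [["big data on cloud"], ["cloud computing and big data"]]
counts = word_count(splits)
print(counts)
```

Because each mapper only sees its own split and each reducer only sees one key's values, both phases can run in parallel across the nodes of a cluster.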

Research Challenges
1. Scalability
◎Ability to handle increasing amounts of data in an appropriate manner
◎NoSQL databases store and retrieve large volumes of distributed data
◎Wang et al. proposed HaCube, a new scalable data cube analysis technique for big data clusters, to overcome the challenges of large-scale data

Research Challenges
2. Availability
◎Refers to the resources of the system being accessible on demand by an authorized individual
◎Mobile users need data within a short amount of time
◎Services must remain operational even in the case of a security breach

Research Challenges
3. Data Integrity
◎Preventing improper or unauthorized change or access
◎Must ensure the correctness of user data
◎Should provide a mechanism for the user to check whether the data is maintained
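
One simple way a user could check whether outsourced data is still maintained is to keep a cryptographic digest locally and compare it after retrieval. The sketch below uses SHA-256 from Python's standard `hashlib` module. It is an illustration rather than the mechanism the reviewed paper proposes, and the helper names (`digest`, `is_intact`) and the sample records are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """Check retrieved data against the digest recorded before upload."""
    return digest(data) == expected_digest


# Hypothetical record; the user keeps the digest locally before uploading.
original = b"customer_id,balance\n42,100.00\n"
stored_digest = digest(original)

# Later, the user retrieves the data from the cloud and verifies it.
retrieved_ok = original
retrieved_tampered = b"customer_id,balance\n42,900.00\n"

print(is_intact(retrieved_ok, stored_digest))
print(is_intact(retrieved_tampered, stored_digest))
```

This check only helps if the digest is stored somewhere the cloud provider cannot modify. It detects changes but does not prevent them or reveal who made them.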

Research Challenges
4. Transformation
◎Transforming data into a form suitable for analysis is an obstacle in the adoption of big data
◎Owing to the variety of data formats, big data can be transformed into an analysis workflow in two ways

Transforming big data for analysis.
◎Structured data are pre-processed before being stored
in relational databases to meet the constraints of
schema-on-write; they can then be retrieved for analysis
◎Unstructured data must first be stored in distributed
databases, such as HBase, before they are processed
for analysis
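
The two paths above can be contrasted in a few lines. In the sketch below, an in-memory SQLite table stands in for the relational database with schema-on-write, and a plain Python list of JSON strings stands in for a distributed store such as HBase. Both stand-ins, the `orders` table, and the `read_amounts` helper are assumptions for illustration only. "Schema-on-read" is a common name for the second path and is not a term used on the slide.

```python
import json
import sqlite3

# Path 1, schema-on-write: the schema is fixed first, and every row
# must satisfy it at the moment it is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (1, 19.99))
try:
    # A row without an amount violates the NOT NULL constraint.
    conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (2, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
structured_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Path 2, store first and process later: raw records are kept as-is,
# and structure is imposed only when they are read for analysis.
raw_store = [
    '{"id": 1, "amount": 19.99}',
    '{"id": 2, "note": "amount missing"}',
    '{"id": 3, "amount": "5"}',
]


def read_amounts(raw_records):
    """Parse raw records at read time and keep only usable amounts."""
    amounts = []
    for line in raw_records:
        record = json.loads(line)
        if "amount" in record:
            amounts.append(float(record["amount"]))
    return amounts


print(rejected, structured_total, read_amounts(raw_store))
```

The first path rejects malformed data early at the cost of up-front preprocessing. The second accepts everything and defers the cleaning work to analysis time.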

Research Challenges
5. Data quality
◎A data quality problem is defined as “any difficulty encountered along one or more quality dimensions that renders data completely or largely unfit for use”
◎High-quality data in the cloud is characterized by data consistency

Research Challenges
6. Heterogeneity
◎Variety is one of the major aspects of big data characterization
◎In a cloud environment, users can store data in structured, semi-structured, or unstructured formats
◎Structured data formats are appropriate for database systems
◎Semi-structured data formats are appropriate only to some extent
◎Unstructured data are inappropriate for such systems

Research Challenges
7. Privacy
◎Privacy concerns continue to hamper users who outsource their private data to cloud storage
◎Encryption is utilized by most researchers to ensure data privacy in the cloud

Research Challenges
8. Legal/regulatory issues
◎Specific laws and regulations must be established to preserve sensitive information
◎Monitoring of company staff communications is not legal
◎Electronic monitoring is permitted under special circumstances

Research Challenges
9. Governance
◎Design and operation of a management system to assure that data delivers value and is not a cost
◎Who can do what to the organization's data, and how
◎Ensuring standards are set and met
◎A strategic and high-level view across the organization

Open research issues
1. Data Staging
◎Heterogeneous nature of data
◎Data gathered from different sources in unstructured formats
◎Hadoop and MapReduce simplify the distributed processing of unstructured data formats

2. Distributed storage systems
◎Provide capacity to address massive amounts of data
◎Optimization of existing file systems
◎Store data in a manner that allows easy retrieval and migration

3. Data Analysis
◎Should obtain information from large amounts of data in limited time
◎Need better algorithms
◎Data sources may contain different formats, which makes interrogation for analysis a complex task

4. Data Security
◎Need policies that cover all user privacy concerns
◎Utilizing strong cryptography to encapsulate sensitive data
◎Need algorithms for secure key management and exchange

Future of Cloud Computing & Big Data
◎Stream computing
◎Dramatically improved forecasting and predictive
analysis across all scientific disciplines
◎The rise of the Social Graph
○ Battle lines are drawn
◎Individually tailored and personalized solutions, services and experiences
○ Medical diagnosis and treatment
○ Lifestyle management
○ Targeted marketing and advertising

Limitations of Cloud Computing & Big Data
◎Querying encrypted data is time consuming
◎Difficult to handle such a variety of data
◎Normally there is only one destination from which to secure data
◎Concerns remain about the safety and privacy of important data stored remotely
◎Data cannot be accessed without an internet connection

Conclusion
◎The size of data at present is huge and continues to
increase every day
◎Presented a review of the rise of big data in cloud
computing
◎Reviewed some of the challenges in big data
processing
◎The key issues in big data in clouds were highlighted
◎Researchers should collaborate to ensure the long-
term success of data management in a cloud
computing environment
