The rise of “Big Data” on
cloud computing: Review and
open research issues
• Ibrahim Abaker Targio Hashem
• Ibrar Yaqoob
• Nor Badrul Anuar
• Salimah Mokhtar
• Abdullah Gani
• Samee Ullah Khan
Presented By
Kazi Mojammel Hossen
ID: B130305001
Minhazul Arefin
ID: B130305003
Outline
◎Introduction
◎Definition & Characteristics of Big Data
◎Cloud Computing
◎Relationship between cloud computing & big data
◎Case studies
◎Big data storage system
◎Hadoop background
◎Research challenges
◎Open research issues
◎Conclusion
Introduction
◎The continuous increase in the volume and detail of
data captured by organizations has produced an
overwhelming flow of data in either structured or
unstructured format.
◎Virtualization is a process of resource sharing and
isolation of underlying hardware to increase computer
resource utilization, efficiency, and scalability.
◎The goal of this study is to conduct a
comprehensive investigation of the status of big data
in cloud computing environments.
What is Big Data?
Big data is a term utilized
to refer to the increase in
the volume of data that are
difficult to store, process,
and analyze through
traditional database
technologies.
Characteristics of big data
◎Big data are characterized by three aspects:
i. data are numerous
ii. data cannot be categorized into regular
relational databases
iii. data are generated, captured, and
processed rapidly.
Characteristics of big data
Volume
◎Processing Performance
◎Class Imbalance
◎Feature Engineering
◎Non-Linearity
Velocity
◎Data Availability
◎Real-Time
Process/Streaming
◎Independent and Identically
Distributed Random Variables
Variety
◎Data Locality
◎Data Heterogeneity
◎Dirty and Noisy Data
Veracity
◎Data Provenance
◎Data Uncertainty
◎Dirty and Noisy Data
Classification of Big Data
1. Data sources
◎Web & Social Media
◎Machine
◎Sensing
◎Transaction
◎IoT
Classification of Big Data
2. Content Format
◎Structured
○ SQL Server
○ Oracle
○ Access, Excel
◎Semi-structured
○ Text Analytics
○ Blogs
○ Social Authority
○ Video
○ Audio
◎Unstructured
○ Weather data
○ Currency Conversion
○ Demographic
○ E-Commerce
Classification of Big Data
3. Data Stores
◎Document-oriented
◎Column-oriented
◎Graph database
◎Key-value
Classification of Big Data
4. Data Staging
◎Cleaning
◎Transform
◎Normalization
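The three staging steps listed above can be sketched in a few lines of Python. This is only an illustration, not the pipeline from the paper: the record layout (`sensor`, `reading`), the function names, and the choice of min-max scaling for the normalization step are all assumptions made for the example.

```python
def clean(rows):
    """Cleaning: keep only rows whose reading can be parsed as a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"sensor": row["sensor"].strip(),
                            "reading": float(row["reading"])})
        except (KeyError, TypeError, ValueError):
            continue  # drop dirty or incomplete records
    return cleaned


def transform(rows):
    """Transform: bring sensor identifiers into one canonical (lowercase) form."""
    return [{"sensor": r["sensor"].lower(), "reading": r["reading"]} for r in rows]


def normalize(rows):
    """Normalization: min-max scale readings into the range [0, 1]."""
    if not rows:
        return []
    values = [r["reading"] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, "reading": (r["reading"] - lo) / span} for r in rows]


# Hypothetical raw input with two unusable records.
raw = [
    {"sensor": " A1 ", "reading": "10"},
    {"sensor": "b2", "reading": "n/a"},
    {"sensor": "C3", "reading": "30"},
    {"sensor": "d4", "reading": None},
    {"sensor": "E5", "reading": "20"},
]
staged = normalize(transform(clean(raw)))
for record in staged:
    print(record)
```

Keeping each step as its own function means a step can be swapped (for example, a different normalization) without touching the others.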
Classification of Big Data
5. Data Processing
◎Batch
○ Uses a MapReduce-based system
◎Real Time
○ Uses a scalable streaming system
What is Cloud Computing?
Cloud computing is a fast-
growing technology that
has established itself in
the next generation of IT
industry and business.
Cloud Service Model
◎The cloud service model typically consists of Platform as a
Service (PaaS), Software as a Service (SaaS), and
Infrastructure as a Service (IaaS)
Relationship between Cloud Computing & Big Data
Organization Case Studies from Vendors
1. SwiftKey
◎A language technology that aids
touchscreen typing by
providing personalized
predictions and corrections
◎Collects & analyzes terabytes
of data to create language
models
◎Used Apache Hadoop
running on Amazon Simple
Storage Service
Organization Case Studies from Vendors
2. 343 Industries
◎Maker of Halo, a science
fiction media franchise
◎The developers analyzed
data to obtain insights into
player preferences and
online tournaments
◎Used Windows Azure
HDInsight Service, which is
based on the Apache Hadoop big
data framework
Organization Case Studies from Vendors
3. redBus
◎Online travel agency
◎Unifying tens of
thousands of bus
schedules into a single
booking operation
◎Implemented
Google BigQuery to analyze
large datasets on Google's
data processing
infrastructure
Organization Case Studies from Vendors
4. Nokia
◎A mobile communication
company
◎Gathers and analyzes large
amounts of data from
mobile phones
◎Used Hadoop Distributed
File System (HDFS)
Organization Case Studies from Vendors
5. Alacer
◎An online retailer
◎Experiencing revenue
leakage from unreliable
real-time notifications of
service problems
◎Used big data
algorithms to create a
cloud monitoring
system that delivers
notifications
Case Studies from Scholarly/Academic Sources
◎Case 1
○ Situation/context: Massively parallel DNA sequencing generates staggering amounts of data
○ Objective: Provide accurate & reproducible genomic results
○ Approach: Develop a Mercury analysis pipeline and deploy it in the Amazon Web Services cloud via the DNAnexus platform
○ Result: Established a powerful combination of a robust and fully validated software pipeline and a scalable computational resource
◎Case 2
○ Situation/context: Conducting analyses on large social networks such as Twitter
○ Objective: To use cloud services as a possible solution for the analysis of large amounts of data
○ Approach: Use the PageRank algorithm on the Twitter user base to obtain user rankings
○ Result: Implemented a relatively cheap solution for data acquisition and analysis by using Amazon cloud infrastructure
◎Case 3
○ Situation/context: To study the complex molecular interactions that regulate biological systems
○ Objective: To develop a Hadoop-based cloud computing application that processes sequences of microscopic images
○ Approach: Use the Hadoop cloud computing framework
○ Result: Allows users to submit data processing jobs in the cloud
◎Case 4
○ Situation/context: Applications running on cloud computing platforms may fail
○ Objective: Design a failure scenario
○ Approach: Create a series of failure scenarios on an Amazon cloud computing platform
○ Result: Helps to identify vulnerabilities in Hadoop applications running in the cloud
“There were 5 exabytes of information created
between the dawn of civilization through 2003,
but that much information is now created
every 2 days.”
- Eric Schmidt,
Executive Chairman, Google
Big data storage system
◎Traditional storage systems store data through
structured RDBMS
◎A storage architecture needs to achieve availability &
reliability
◎It needs to store and manage large datasets
◎The organizational systems of data storage can be
divided into three parts:
○ Disk array
○ Connection and network subsystems
○ Storage management software
Comparison of Storage Media
◎Hard drives
○ Specific use: Store data up to four terabytes
○ Advantages: Density; cost per bit of storage; speedy start-up
○ Disadvantages: Require special cooling; high read latency; produce more heat
◎Solid state memory
○ Specific use: Store data up to two terabytes
○ Advantages: Fast access to data; fast movement of huge data; fast start-up time
○ Disadvantages: More expensive than hard drives
◎Object storage
○ Specific use: Store data as variable-size objects rather than fixed-size blocks
○ Advantages: Easy to find information; unique identifier to find data objects; ensures security
○ Disadvantages: Complexity in tracking indices
◎Optical storage
○ Specific use: Store data at different angles throughout the storage medium
○ Advantages: Least expensive; removable storage medium
○ Disadvantages: Complex; ability to produce multiple optical disks in a single unit is yet to be proven
◎Cloud storage
○ Specific use: Serve as a provisioning & storage model
○ Advantages: Useful for small organizations that do not have sufficient storage capacity; can store large amounts of data
○ Disadvantages: Less security
Hadoop
◎A free, Java-based
programming
framework that
supports the processing
of large data sets in a
distributed computing
environment
◎Implements MapReduce, the
computation model
introduced by Google
HDFS (Hadoop Distributed File System)
◎A scalable distributed file system for applications
dealing with large data sets
○ Distributed: runs in a cluster
○ Scalable: 10K nodes, 100K files, 10 PB storage
◎ Storage space is seamless for the whole cluster
◎ Files broken into blocks
◎ Typical block size: 128 MB.
◎ Replication: Each block copied to multiple data
nodes.
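The block and replication bullets above lend themselves to a small back-of-the-envelope sketch. The 128 MB block size comes from the slide; the replication factor of 3, the node names, and the round-robin placement are assumptions for illustration only (real HDFS placement is rack-aware and considerably more involved).

```python
import math

BLOCK_SIZE_MB = 128      # typical block size quoted on the slide
REPLICATION_FACTOR = 3   # assumption: a commonly used setting, not stated on the slide


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(n_blocks, data_nodes, replication=REPLICATION_FACTOR):
    """Toy round-robin placement: each block is copied to `replication` distinct nodes."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [data_nodes[(block + i) % len(data_nodes)] for i in range(replication)]
        for block in range(n_blocks)
    }


# Hypothetical 1000 MB file on a four-node cluster.
n_blocks = block_count(1000)
placement = place_blocks(n_blocks, ["dn1", "dn2", "dn3", "dn4"])
print(n_blocks, "blocks")
for block, nodes in placement.items():
    print(block, nodes)
```

Because every block lives on several nodes, losing a single data node in this toy layout never removes the only copy of a block.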
What is MapReduce?
◎A programming model
◎A programming framework
◎Used to develop solutions that will
○ Process large amounts of data in a parallelized fashion
○ In clusters of computing nodes
◎Originally a closed-source implementation at Google
◎Hadoop: Open source implementation of the
algorithms described in the scientific papers
MapReduce
◎The model is broken down in 2 phases:
○ Map: Non-overlapping sets of input data (<key, value> records) are
assigned to different processes (mappers) that produce a set of
intermediate <key, value> results
○ Reduce: The output of the Map phase is fed to a typically smaller number of
processes (reducers) that aggregate the intermediate results into a smaller
number of <key, value> records.
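The two phases above can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use the Hadoop API; the function names and the explicit `shuffle` step (grouping intermediate pairs by key between the two phases) are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit an intermediate <word, 1> pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key so each reducer sees one key at a time."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: aggregate all values for a key into a single <key, value> record."""
    return key, sum(values)


def run_job(records):
    # In a real cluster each record (or split) would go to a different mapper.
    intermediate = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(intermediate).items())


counts = run_job(["big data on cloud", "cloud computing and big data"])
print(counts)
```

Mappers never share state, which is what allows the Map phase to be spread across many nodes; only the grouped intermediate pairs need to move between machines.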
Research Challenges
1. Scalability
◎Ability to handle increasing
amounts of data in an
appropriate manner
◎NoSQL databases store and
retrieve large volumes of
distributed data.
◎Wang et al. proposed a new
scalable data cube analysis
technique called HaCube in
big data clusters to
overcome the challenges of
large-scale data.
Research Challenges
2. Availability
◎Refers to the resources of the
system accessible on
demand by an authorized
individual
◎Mobile users need data
within a short amount of
time
◎Services must remain
operational even in the case
of a security breach
Research Challenges
3. Data Integrity
◎Preventing improper or
unauthorized change or
access
◎Must ensure the
correctness of user data
◎Should provide a
mechanism for the user
to check whether the
data is maintained
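One simple way a user could check whether outsourced data is still intact is to keep a cryptographic digest locally and compare it after download. The sketch below uses SHA-256 from Python's standard library; it is an assumption-laden illustration, not a scheme proposed in the paper, and it only helps if the stored digest is kept somewhere the storage provider cannot modify.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """Compare the digest of retrieved data with the one recorded before upload."""
    return hmac.compare_digest(digest(data), expected_digest)


# Hypothetical workflow: record the digest before uploading, verify after download.
original = b"customer ledger, version 1"
recorded = digest(original)

print(is_intact(original, recorded))
print(is_intact(b"customer ledger, version 2", recorded))
```

A digest detects modification but does not say who changed the data or restore it; those concerns need access control and replication.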
Research Challenges
4. Transformation
◎Transforming data into a
form suitable for analysis
is an obstacle in the
adoption of big data
◎Owing to the variety of
data formats, big data can
be transformed into an
analysis workflow in two
ways
Transforming big data for analysis
◎Structured data are pre-processed before being stored
in relational databases to meet the constraints of
schema-on-write; they can then be retrieved for analysis
◎Unstructured data must first be stored in distributed
databases, such as HBase, before being processed
for analysis
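The difference between the two paths above can be sketched as follows. In the first, records are validated against a schema before they are stored (schema-on-write); in the second, raw records are stored untouched and interpreted only when they are analyzed. The schema, the field names, and the use of plain Python lists in place of a relational table or an HBase store are all assumptions for illustration.

```python
import json

# Hypothetical schema: field name -> type the value must be coerced to.
SCHEMA = {"user_id": int, "amount": float}


def write_structured(table, record):
    """Schema-on-write: coerce every field before storing; bad records are rejected."""
    row = {field: cast(record[field]) for field, cast in SCHEMA.items()}
    table.append(row)


def read_unstructured(store, field, cast):
    """Schema-on-read: raw strings were stored as-is; parse them at analysis time."""
    values = []
    for raw in store:
        try:
            values.append(cast(json.loads(raw)[field]))
        except (ValueError, KeyError, TypeError):
            continue  # skip records that do not fit the question being asked
    return values


table = []
write_structured(table, {"user_id": "7", "amount": "19.5"})

raw_store = ['{"amount": "3.5", "note": "ok"}', "not json", '{"other": 1}']
amounts = read_unstructured(raw_store, "amount", float)

print(table)
print(amounts)
```

The first path pays the validation cost once at load time; the second defers it to every query but accepts data whose shape is not known in advance.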
Research Challenges
5. Data quality
◎Defined as “any
difficulty encountered
along one or more
quality dimensions that
render data completely
or largely unfit for use”
◎High-quality data in the
cloud is characterized
by data consistency
Research Challenges
6. Heterogeneity
◎Variety is one of the major aspects
of big data characterization
◎In a cloud environment, users
can store data in structured,
semi-structured, or unstructured formats
◎Structured data formats are
appropriate for database
systems
◎Semi-structured data formats
are appropriate only to some
extent
◎Unstructured data formats are
inappropriate for database systems
Research Challenges
7. Privacy
◎Privacy concerns continue to hamper
users who outsource
their private data to
cloud storage
◎Encryption is utilized by
most researchers to
ensure data privacy in
the cloud
Research Challenges
8. Legal/regulatory issues
◎Specific laws &
regulations must be
established to preserve
sensitive information
◎Monitoring of company
staff communications is
not legal
◎Electronic monitoring is
permitted under special
circumstances
Research Challenges
9. Governance
◎Design and operation of a
management system to
assure that data delivers
value and is not a cost
◎Who can do what to the
organization's data and how
◎Ensuring standards are set
and met
◎A strategic & high-level view
across the organization
Open research issues
1. Data Staging
◎Heterogeneous nature of
data
◎Data gathered from
different sources in
unstructured format
◎Hadoop and MapReduce
simplify the distributed
processing of unstructured
data formats
Open research issues
2. Distributed storage systems
◎Provide capacity to
address massive
amounts of data
◎Optimization of existing
file systems
◎Store data in a manner
that allows it to be
retrieved and migrated
easily
Open research issues
3. Data Analysis
◎Should obtain
information from large
amounts of data in
limited time
◎Need better algorithms
◎Data sources may
contain different
formats, which makes
interrogation for
analysis a complex task
Open research issues
4. Data Security
◎Need policies that cover
all user privacy
◎Utilizing strong
cryptography to
encapsulate sensitive
data
◎Need algorithms for
secure key
management and
exchange
Future of Cloud Computing & Big Data
◎Stream computing
◎Dramatically improved forecasting and predictive
analysis across all scientific disciplines
◎The rise of the Social Graph
– Battle lines are drawn
◎ Individually tailored and personalized solutions,
services and experiences
– Medical diagnosis and treatment
– Lifestyle management
– Targeted marketing and advertising
Limitations of Cloud Computing & Big Data
◎Querying encrypted data is time consuming
◎Difficult to handle such a variety of data
◎Normally there is only one destination from which to
secure data
◎Concerns about the safety and privacy of
important data stored remotely
◎Unable to access data without an internet connection
Conclusion
◎The size of data at present is huge and continues to
increase every day
◎Presented a review on the rise of big data in cloud
computing
◎Reviewed some of the challenges in big data
processing
◎The key issues in big data in clouds were highlighted
◎Researchers should collaborate to ensure the long-
term success of data management in a cloud
computing environment
Thanks!
Any questions?