The rise of “Big Data” on cloud computing: Review and open research issues
Paper Link: https://www.researchgate.net/publication/264624667_The_rise_of_Big_Data_on_cloud_computing_Review_and_open_research_issues
Cloud computing is a powerful technology to perform massive-scale and complex computing. It eliminates the need to maintain expensive computing hardware, dedicated space, and software.
1. The rise of “Big Data” on cloud computing: Review and open research issues
• Ibrahim Abaker Targio Hashem
• Ibrar Yaqoob
• Nor Badrul Anuar
• Salimah Mokhtar
• Abdullah Gani
• Samee Ullah Khan
3. Outlines
◎ Introduction
◎ Definition & Characteristics of Big Data
◎ Cloud Computing
◎ Relationship between cloud computing & big data
◎ Case studies
◎ Big data storage system
◎ Hadoop background
◎ Research challenges
◎ Open research issues
◎ Conclusion
4. Introduction
◎ The continuous increase in the volume and detail of data captured by organizations has produced an overwhelming flow of data in either structured or unstructured format.
◎ Virtualization is a process of resource sharing and isolation of underlying hardware to increase computer resource utilization, efficiency, and scalability.
◎ The goal of this study is to conduct a comprehensive investigation of the status of big data in cloud computing environments.
5. What is Big Data?
Big data is a term utilized to refer to the increase in the volume of data that are difficult to store, process, and analyze through traditional database technologies.
6. Characteristics of big data
◎ Big data are characterized by three aspects:
  i. data are numerous
  ii. data cannot be categorized into regular relational databases
  iii. data are generated, captured, and processed rapidly
12. Classification of Big Data: 1. Data sources
◎ Web & Social Media
◎ Machine
◎ Sensing
◎ Transaction
◎ IoT
13. Classification of Big Data: 2. Content Format
◎ Structured
  ○ SQL Server
  ○ Oracle
  ○ Access, Excel
◎ Semi-structured
  ○ Text Analytics
  ○ Blogs
  ○ Social Authority
  ○ Video
  ○ Audio
◎ Unstructured
  ○ Weather data
  ○ Currency Conversion
  ○ Demographic
  ○ E-Commerce
14. Classification of Big Data: 3. Data Stores
◎ Document-oriented
◎ Column-oriented
◎ Graph database
◎ Key-value (illustrated in the sketch below)
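To make the key-value model concrete, here is a minimal sketch using Redis through the Jedis client; Redis is not named on the slide, and the connection details, key, and value below are invented for illustration. A value is written and read back purely by key, with no schema or query language involved.

```java
// Hedged key-value example using Redis via the Jedis client.
// Host, port, key, and value are illustrative only.
import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store a value under an application-defined key.
            jedis.set("user:42:last-login", "2014-08-01T10:15:00Z");
            // Retrieve it by key; no schema is imposed on the value.
            String lastLogin = jedis.get("user:42:last-login");
            System.out.println("last login = " + lastLogin);
        }
    }
}
```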
16. Classification of Big Data: 4. Data Preprocessing
◎ Batch
  ○ Uses MapReduce-based systems
◎ Real Time
  ○ Uses scalable streaming systems
17. What is Cloud Computing?
Cloud computing is a fast-growing technology that has established itself in the next generation of the IT industry and business.
20. Organization Case Studies from Vendors: 1. SwiftKey
◎ A language technology that aids touchscreen typing by providing personalized predictions and corrections
◎ Collects & analyzes terabytes of data to create a language model
◎ Used Apache Hadoop running on Amazon Simple Storage Service
21. Organization Case Studies from Vendors: 2. 343 Industries
◎ Maker of Halo, a science fiction media franchise
◎ The developers analyzed data to obtain insights into player preferences and online tournaments
◎ Used the Windows Azure HDInsight Service, which is based on the Apache Hadoop big data framework
22. Organization Case Studies from Vendors: 3. redBus
◎ Online travel agency
◎ Unifies tens of thousands of bus schedules into a single booking operation
◎ Implemented Google BigQuery to analyze large datasets on Google's data processing infrastructure
23. Organization Case Studies from Vendors: 4. Nokia
◎ A mobile communication company
◎ Gathers and analyzes large amounts of data from mobile phones
◎ Used the Hadoop Distributed File System (HDFS)
24. Organization Case Studies from Vendors: 5. Alacer
◎ An online retailer
◎ Was experiencing revenue leakage due to unreliable real-time notifications of service problems
◎ Used big data algorithms to create a cloud monitoring system that delivers notifications
25. Case Studies from Scholarly/Academic Sources
◎ Case 1
  ○ Situation/context: Massively parallel DNA sequencing generates staggering amounts of data
  ○ Objective: Provide accurate & reproducible genomic results
  ○ Approach: Develop a Mercury analysis pipeline and deploy it in the Amazon Web Services cloud via the DNAnexus platform
  ○ Result: Established a powerful combination of a robust and fully validated software pipeline and a scalable computational resource
◎ Case 2
  ○ Situation/context: Conducting analyses on large social networks such as Twitter
  ○ Objective: To use cloud services as a possible solution for the analysis of large amounts of data
  ○ Approach: Use the PageRank algorithm on the Twitter user base to obtain a user ranking
  ○ Result: Implemented a relatively cheap solution for data acquisition and analysis by using Amazon cloud infrastructure
◎ Case 3
  ○ Situation/context: To study the complex molecular interactions that regulate biological systems
  ○ Objective: To develop a Hadoop-based cloud computing application that processes sequences of microscopic images
  ○ Approach: Use the Hadoop cloud computing framework
  ○ Result: Allows users to submit data processing jobs in the cloud
◎ Case 4
  ○ Situation/context: Applications running on cloud computing may fail
  ○ Objective: Design a failure scenario
  ○ Approach: Create a series of failure scenarios on an Amazon cloud computing platform
  ○ Result: Helps to identify vulnerabilities in Hadoop applications running in the cloud
26. “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”
- Eric Schmidt, Executive Chairman, Google
27. Big data storage system
◎ Traditional storage systems store data through structured RDBMS
◎ A storage architecture needs to achieve availability & reliability
◎ Needs to store and manage large datasets
◎ The organizational systems of data storage can be divided into three parts:
  ○ Disk array
  ○ Connection and network subsystems
  ○ Storage management software
28. Comparison of Storage Media
◎ Hard drives
  ○ Specific use: Store data up to four terabytes
  ○ Advantages: Density; cost per bit of storage; speedy start-up
  ○ Disadvantages: Require special cooling; high read latency; produce more heat
◎ Solid state memory
  ○ Specific use: Store data up to two terabytes
  ○ Advantages: Fast access to data; fast movement of huge data; fast start-up time
  ○ Disadvantages: More expensive than hard drives
◎ Object storage
  ○ Specific use: Store data as variable-size objects rather than fixed-size blocks
  ○ Advantages: Easy to find information; unique identifier to find data objects; ensures security
  ○ Disadvantages: Complexity in tracking indices
◎ Optical storage
  ○ Specific use: Store data at different angles throughout the storage medium
  ○ Advantages: Least expensive; removable storage medium
  ○ Disadvantages: Complex; the ability to produce multiple optical disks in a single unit is yet to be proven
◎ Cloud storage
  ○ Specific use: Serves as a provisioning & storage model
  ○ Advantages: Useful for small organizations that do not have sufficient storage capacity; can store large amounts of data
  ○ Disadvantages: Less security
29. Hadoop
◎ A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment
◎ Builds on Google’s powerful MapReduce computation technology
30. HDFS (Hadoop Distributed File System)
◎ A scalable distributed file system for applications dealing with large data sets
  ○ Distributed: runs in a cluster
  ○ Scalable: 10K nodes, 100K files, 10 PB of storage
◎ Storage space is seamless across the whole cluster
◎ Files are broken into blocks
◎ Typical block size: 128 MB
◎ Replication: each block is copied to multiple data nodes (see the sketch below)
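As an illustration of the block and replication behaviour described above, the following is a minimal sketch against the HDFS Java API (org.apache.hadoop.fs). The file path is made up, and the reported block size and replication factor simply reflect whatever the local cluster configuration sets (128 MB and 3 by default).

```java
// Sketch: write a small file to HDFS and read back its block size and replication factor.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Each file is split into fixed-size blocks and every block is
        // replicated to several data nodes, as the slide describes.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  = " + status.getBlockSize());
        System.out.println("replication = " + status.getReplication());

        fs.close();
    }
}
```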
31. What is MapReduce?
◎ A programming model
◎ A programming framework
◎ Used to develop solutions that will
  ○ Process large amounts of data in a parallelized fashion
  ○ In clusters of computing nodes
◎ Originally a closed-source implementation at Google
◎ Hadoop: open-source implementation of the algorithms described in the scientific papers
32. MapReduce
◎ The model is broken down into 2 phases (a word-count sketch follows below):
  ○ Map: non-overlapping sets of input data (<key, value> records) are assigned to different processes (mappers) that produce a set of intermediate <key, value> results
  ○ Reduce: the data from the Map phase are fed to a typically smaller number of processes (reducers) that aggregate the input results into a smaller number of <key, value> records
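The canonical word-count job below shows both phases on the Hadoop (org.apache.hadoop.mapreduce) API: mappers emit intermediate <word, 1> pairs and reducers aggregate them into <word, total> records. It is a standard textbook example rather than code from the reviewed paper; input and output paths are taken from the command line, and the job would typically be packaged into a jar and launched with the hadoop jar command against HDFS directories.

```java
// Minimal Hadoop MapReduce word count.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper sees a non-overlapping slice of the input
    // and emits intermediate <word, 1> pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same key are aggregated into a
    // single <word, total> record.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```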
33. Research Challenges: 1. Scalability
◎ The ability to handle increasing amounts of data in an appropriate manner
◎ NoSQL databases store and retrieve large volumes of distributed data
◎ Wang et al. proposed a new scalable data cube analysis technique called HaCube in big data clusters to overcome the challenges of large-scale data
34. Research Challenges: 2. Availability
◎ Refers to the resources of the system being accessible on demand by an authorized individual
◎ Mobile users need data within a short amount of time
◎ Services must remain operational even in the case of a security breach
35. Research Challenges: 3. Data Integrity
◎ Preventing improper or unauthorized change or access
◎ Must ensure the correctness of user data
◎ Should provide a mechanism for the user to check whether the data is maintained (a checksum sketch follows below)
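One simple way to give users the kind of integrity check mentioned above is to keep a cryptographic digest of the data before upload and compare it against a digest of what the cloud later returns. The sketch below uses SHA-256 from the JDK; the sample record is invented, and a real system would also need to protect the stored digest itself.

```java
// Sketch of a client-side integrity check using SHA-256 digests.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class IntegrityCheck {

    static byte[] sha256(byte[] data) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "sensitive record".getBytes(StandardCharsets.UTF_8);
        byte[] expected = sha256(original);   // digest kept by the data owner before upload

        byte[] retrieved = original.clone();  // stand-in for the copy fetched back from the cloud
        boolean intact = MessageDigest.isEqual(expected, sha256(retrieved));
        System.out.println("data unchanged: " + intact);
    }
}
```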
36. Research Challenges: 4. Transformation
◎ Transforming data into a form suitable for analysis is an obstacle in the adoption of big data
◎ Owing to the variety of data formats, big data can be transformed into an analysis workflow in two ways
37. Transforming big data for analysis
◎ Structured data are pre-processed before being stored in relational databases to meet the constraints of schema-on-write; the data can then be retrieved for analysis
◎ Unstructured data must first be stored in distributed databases, such as HBase, before being processed for analysis (see the HBase sketch below)
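The slide names HBase as the landing store for unstructured data, so here is a hedged sketch of writing a raw record into HBase with the standard client API before any analysis job touches it. The table name, column family, row key, and JSON payload are invented for illustration.

```java
// Sketch: land a raw record in HBase prior to analysis.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RawEventLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table events = conn.getTable(TableName.valueOf("raw_events"))) {

            // Store the raw payload as-is; no schema is imposed at write time,
            // so a later analysis job decides how to parse it.
            Put put = new Put(Bytes.toBytes("device42-2014-08-01T10:15:00Z"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                          Bytes.toBytes("{\"temp\": 21.4, \"unit\": \"C\"}"));
            events.put(put);
        }
    }
}
```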
38. Research Challenges: 5. Data quality
◎ Defined as “any difficulty encountered along one or more quality dimensions that render data completely or largely unfit for use”
◎ High-quality data in the cloud is characterized by data consistency
39. Research Challenges: 6. Heterogeneity
◎ Variety is one of the major aspects of big data characterization
◎ In a cloud environment, users can store data in structured, semi-structured, or unstructured formats
◎ Structured data formats are appropriate for database systems
◎ Semi-structured data formats are appropriate only to some extent
◎ Unstructured data are inappropriate
40. Research Challenges: 7. Privacy
◎ Privacy concerns hamper users who would otherwise outsource their private data to cloud storage
◎ Encryption is utilized by most researchers to ensure data privacy in the cloud (see the sketch below)
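As a minimal sketch of the encryption approach mentioned above, the code below encrypts a record client-side with AES-GCM from the JDK before it would be uploaded, so only ciphertext reaches the cloud store. The sample plaintext is invented, and key management (generation, storage, and exchange) is deliberately left out, since that is exactly the open issue the deck raises later under data security.

```java
// Sketch: client-side AES-GCM encryption prior to cloud upload.
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class ClientSideEncryption {
    public static void main(String[] args) throws Exception {
        // Generate a 256-bit AES key; in practice this would live in a key store.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // A fresh 12-byte IV per message, as recommended for GCM.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("patient record #42".getBytes(StandardCharsets.UTF_8));

        // Only the ciphertext (plus the IV) would be sent to the cloud store.
        System.out.println("ciphertext bytes: " + ciphertext.length);
    }
}
```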
41. Research Challenges: 8. Legal/regulatory issues
◎ Specific laws & regulations must be established to preserve sensitive information
◎ Monitoring of company staff communications is not legal
◎ Electronic monitoring is permitted under special circumstances
42. Research Challenges: 9. Governance
◎ Design and operation of a management system to assure that data delivers value and is not a cost
◎ Who can do what to the organization’s data, and how
◎ Ensuring standards are set and met
◎ A strategic & high-level view across the organization
43. Open research issues: 1. Data Staging
◎ Heterogeneous nature of data
◎ Data are gathered from different sources in unstructured formats
◎ Hadoop and MapReduce simplify the distributed processing of unstructured data formats
44. Open research issues: 2. Distributed storage systems
◎ Provide the capacity to address massive amounts of data
◎ Optimization of existing file systems
◎ Store data in a manner that allows them to be retrieved and migrated easily
45. Open research issues: 3. Data Analysis
◎ Should obtain information from large amounts of data in limited time
◎ Need better algorithms
◎ Data sources may contain different formats, which makes interrogation for analysis a complex task
46. Open research issues: 4. Data Security
◎ Need policies that cover all aspects of user privacy
◎ Utilize strong cryptography to encapsulate sensitive data
◎ Need algorithms for secure key management and exchange
47. Future of Cloud Computing & Big Data
◎ Stream computing
◎ Dramatically improved forecasting and predictive analysis across all scientific disciplines
◎ The rise of the Social Graph
  ○ Battle lines are drawn
◎ Individually tailored and personalized solutions, services, and experiences
  ○ Medical diagnosis and treatment
  ○ Lifestyle management
  ○ Targeted marketing and advertising
48. Limitations of Cloud Computing & Big Data
◎ Querying encrypted data is time-consuming
◎ Difficult to handle such a variety of data
◎ Normally there is only one destination from which to secure data
◎ Concerns about the safety and privacy of important data stored remotely
◎ Unable to access data without an internet connection
49. Conclusion
◎ The size of data at present is huge and continues to increase every day
◎ Presented a review of the rise of big data in cloud computing
◎ Reviewed some of the challenges in big data processing
◎ The key issues of big data in clouds were highlighted
◎ Researchers should collaborate to ensure the long-term success of data management in a cloud computing environment