Big Data
Submitted in partial fulfillment of the requirements
for the award of the Degree of Bachelor of Technology in Information Technology
Submitted By
Mr. Prashant Maruti Navatre
(Registration No. 20130737)
Under the guidance of
Prof. S. S. Barphe
DEPARTMENT OF INFORMATION TECHNOLOGY
DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY,
LONERE, RAIGAD-MAHARASHTRA, INDIA-402103
2015-2016
DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY
LONERE, RAIGAD-MAHARASHTRA, INDIA-402103
Certificate
This is to certify that the project entitled "Big Data", submitted by Mr. Prashant Maruti Navatre, Registration No. 20130737, in partial fulfilment of the requirement for the award of the degree of Bachelor of Technology in Information Technology of Dr. Babasaheb Ambedkar Technological University, Lonere, is a bonafide work carried out during the academic year 2015-2016.
Prof. S. S. Barphe Dr. S. M. Jadhav
(Seminar Guide) (Head of Department)
Information Technology Information Technology
Examiners:
1).
2).
Date:
Place: Vidyavihar, Lonere-402103
Acknowledgment
I am pleased to present this seminar report entitled "Big Data". It is indeed a great pleasure
and a moment of immense satisfaction for me to express my sense of profound gratitude and
indebtedness towards my guide, Prof. S. S. Barphe, whose enthusiasm has been a source of inspiration
for me. I am extremely thankful for the guidance and untiring attention which he bestowed on
me right from the beginning. His valuable and timely suggestions at crucial stages and, above
all, his constant encouragement have made it possible for me to achieve this work. I would also
like to give my sincere thanks to Dr. S. M. Jadhav, Head of the Department of Information Technology, for
necessary help and for providing me the required facilities for the completion of this seminar report.
I would like to thank the entire teaching staff, who were directly or indirectly involved in the
various data collection and software assistance needed to bring forward this seminar report. I express
my deep sense of gratitude towards my parents for their sustained cooperation and wishes,
which have been a prime source of inspiration in taking this seminar work to its end without
any hurdles. Last but not the least, I would like to thank all my B.Tech. colleagues for their
co-operation and useful suggestions, and all those who have directly or indirectly helped me in
the completion of this seminar work.
Date Prashant Maruti Navatre
Place 20130737
ABSTRACT
Big data is a term for massive data sets having a large, varied and complex
structure, with difficulties in storing, analyzing and visualizing them for further
processes or results. The process of examining massive amounts of data
to reveal hidden patterns and unknown correlations is called big data analytics.
This information is useful to companies and organizations, helping them gain
richer and deeper insights and an advantage over the competition.
For this reason, big data implementations need to be analyzed and executed
as accurately as possible. This report presents an overview of big data's content,
scope, samples, methods, advantages and challenges, and discusses the privacy
concerns around it.

Every day, we create 2.5 quintillion bytes of data (one quintillion bytes = one billion
gigabytes), so much that 90% of the data in the world today has been
created in the last two years alone. This data comes from everywhere: sensors
used to gather climate information, posts to social media sites, digital pictures
and videos, purchase transaction records, and cell phone GPS signals, to name
a few. This data is Big Data.
Contents
1 Introduction 1
1.1 The General Concept Of Big Data . . . . . . . . . . . . . . . . . 2
1.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 The Need For Big Data . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Sources Of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Characteristics of Big Data 10
2.1 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Veracity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Storage, Selection and Processing of Big Data 16
3.1 Storage of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Key Requirements of Big Data . . . . . . . . . . . . . . . 18
3.2 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Processing of Big Data . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Batch Processing . . . . . . . . . . . . . . . . . . . . . . 23
3.3.2 Stream Processing . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Hadoop Ecosystem . . . . . . . . . . . . . . . . . . . . . 25
3.3.4 Map and Reduce . . . . . . . . . . . . . . . . . . . . . . 25
4 Big Data Analytics 27
4.1 Examples of Big Data Analytics . . . . . . . . . . . . . . . . . . 29
4.2 Benefits of Big Data Analytics . . . . . . . . . . . . . . . . . . . 30
5 Challenges in Big Data 31
5.1 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.1 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.2 Ineffectiveness . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . 35
5.5 Behavioral change . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6 Applications and Future of Big Data 38
6.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.1.1 Government . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 Cyber-Physical Models . . . . . . . . . . . . . . . . . . . 40
6.1.3 Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.4 Technology . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 The Future of Big Data . . . . . . . . . . . . . . . . . . . . . . 42
7 Conclusion 45
References 46
List of Figures
1.1 Visualization of daily Wikipedia edits created by IBM . . . . . . 2
1.2 Growth and Digitalization of Global and Information Storage
capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Sources of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Architecture of Big Data . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 1
Introduction
In recent years, the term Big Data has emerged to describe a new paradigm
for data applications. New technologies tend to emerge with a lot of hype,
but it can take some time to tell what is truly new and different. While big data has
been defined in a myriad of ways, the heart of the Big Data paradigm is that the data is
too big (volume), arrives too fast (velocity), changes too fast (variability), contains
too much noise (veracity), or is too diverse (variety) to be processed within a local computing
structure using traditional approaches and techniques. The technologies
being introduced to support this paradigm have a wide variety of interfaces,
making it difficult to construct tools and applications that integrate data from
multiple Big Data sources.
Analysis of data sets can find new correlations to spot business trends,
prevent diseases, combat crime, and so on. Scientists, business executives, practitioners
of media and advertising, and governments alike regularly meet difficulties
with large data sets in areas including Internet search, finance and business
informatics. Scientists encounter limitations in e-Science work, including meteorology,
genomics, connectomics, complex physics simulations, and biological
and environmental research.
Big Data is high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information processing that enable
enhanced insight, decision making, and process automation. Data sets grow
in size in part because they are increasingly being gathered by cheap and numerous
information-sensing mobile devices, aerial (remote sensing) platforms, software
logs, cameras, microphones, radio-frequency identification (RFID) readers, and
wireless sensor networks.
1.1 The General Concept Of Big Data
The term Big Data is an imprecise description of a rich and complicated
set of characteristics, practices, techniques, ethical issues, and outcomes, all
associated with data. Big Data originated in the physical sciences, with physics
and astronomy among the earliest adopters of many of the techniques now called Big Data.
Instruments like the Large Hadron Collider and the Square Kilometre Array
are massive collectors of exabytes of information, and the ability to collect such
massive amounts of data necessitated an increased capacity to manipulate and
analyze these data as well.
Figure 1.1: Visualization of daily Wikipedia edits created by IBM
1.2 Definition
Big data is a broad term for data sets so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture,
data curation, search, sharing, storage, transfer, visualization, and information
privacy. The term often refers simply to the use of predictive analytics or
certain other advanced methods to extract value from data, and seldom to a
particular size of data set. Accuracy in big data may lead to more confident
decision making. And better decisions can mean greater operational efficiency,
cost reduction and reduced risk.
1.3 History
Big data burst upon the scene in the first decade of the 21st century, and the
first organizations to embrace it were online and startup firms. Arguably, firms
like Google, eBay, LinkedIn, and Facebook were built around big data from
the beginning. They didn't have to reconcile or integrate big data with more
traditional sources of data and the analytics performed upon them, because
they didn't have those traditional forms. They didn't have to merge big data
technologies with their traditional IT infrastructures because those infrastructures
didn't exist. Big data could stand alone, big data analytics could be the
only focus of analytics, and big data technology architectures could be the only
architecture.
Consider, however, the position of large, well-established businesses. Big data
in those environments shouldn't be separate, but must be integrated with everything
else that's going on in the company. Analytics on big data have to coexist
with analytics on other types of data. Hadoop clusters have to do their work
alongside IBM mainframes. Data scientists must somehow get along and work
jointly with mere quantitative analysts. In order to understand this coexistence,
we interviewed 20 large organizations in the early months of 2013 about how big
data fit into their overall data and analytics environments. Overall, we found
the expected coexistence; in not a single one of these large organizations was
big data being managed separately from other types of data and analytics. The
integration was in fact leading to a new management perspective on analytics,
which we'll call Analytics 3.0. This report describes the overall context for
how organizations think about big data, the organizational structure and skills
required for it, etc., and concludes by describing the Analytics 3.0 era.
In a 2001 research report and related lectures, META Group (now Gartner)
analyst Doug Laney defined data growth challenges and opportunities as be-
ing three-dimensional, i.e. increasing volume (amount of data), velocity (speed
of data in and out), and variety (range of data types and sources). Gartner,
and now much of the industry, continue to use this "3Vs" model for describing
big data. In 2012, Gartner updated its definition as follows: "Big data is
high volume, high velocity, and/or high variety information assets that require
new forms of processing to enable enhanced decision making, insight discovery
and process optimization." Additionally, a fourth V, "Veracity", has been added by some
organizations to describe it.
1.4 The Need For Big Data
Like many new information technologies, big data can bring about dramatic
cost reductions, substantial improvements in the time required to perform a
computing task, or new product and service offerings. Like traditional analytics,
it can also support internal business decisions.

Figure 1.2: Growth and Digitalization of Global and Information Storage capacity

The technologies and concepts behind big data allow organizations to achieve a variety of objectives, but most
of the organizations we interviewed were focused on one or two. The chosen
objectives have implications not only for the outcome and financial benefits
from big data, but also for the process: who leads the initiative, where it fits within
the organization, and how to manage the project. As the world becomes more
connected via technology, the amount of data flowing into companies is growing
exponentially, and identifying value in that data becomes more difficult: as the
data haystack grows larger, the needle becomes more difficult to find. So Big
Data is really about finding the needles: gathering, sorting and analyzing the
flood of data to find the valuable information on which sound business decisions
are made. When applied to energy-related businesses, Big Data implications
vary by market segment; Big Data concerns for utilities are not the same as
those for energy trading organizations, but the necessity of solving those
problems can be equally pressing.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity
of the concept fosters a clearer distinction between big data and Business
Intelligence, regarding data and their use: Business Intelligence uses descriptive
statistics on data with high information density to measure things and detect trends,
whereas big data uses inductive statistics and concepts from nonlinear system
identification to infer laws (regressions, nonlinear relationships, and causal effects)
from large sets of data with low information density, in order to reveal relationships
and dependencies and to perform predictions of outcomes and behaviors.
The real issue is not that you are acquiring large amounts of data. It’s what
you do with the data that counts. The hopeful vision is that organizations will
be able to take data from any source, harness relevant data and analyze it to
find answers that enable:
1. cost reductions
2. time reductions
3. new product development and optimized offerings
4. smarter business decision making
1.5 Sources Of Big Data
The sources and formats of data continue to grow in variety and complexity.
A partial list of sources includes the public web; social media; mobile applica-
tions; federal, state and local records and databases; commercial databases that
aggregate individual data from a spectrum of commercial transactions and pub-
lic records; geospatial data; surveys; and traditional offline documents scanned
by optical character recognition into electronic form. The advent of more
Internet-enabled devices and sensors expands the capacity to collect data from
physical entities, including sensors and radio-frequency identification (RFID)
chips. Personal location data can come from GPS chips, cell-tower triangulation
of mobile devices, mapping of wireless networks, and in-person payments.
There are many different types of Big Data sources, e.g.:
1. Social media data
2. Personal data (e.g. data from tracking devices)
3. Sensor data
4. Transactional data
5. Enterprise data
There are different opinions on whether Enterprise data should be consid-
ered to be Big Data or not. Enterprise data are usually large in volume, they
are generated for a different purpose and arise organically through Enterprise
processes. Also the content of Enterprise data is usually not designed by re-
searchers. For these reasons, and because there is a great potential in using
Enterprise data, we will consider it to be in scope for this report. There are
a number of differences between Enterprise data and other types of Big Data
that are worth pointing out. The amount of control a researcher has and the
potential inferential power vary between different types of Big Data sources.
For example, a researcher will likely not have any control of data from different
social media platforms and it could be difficult to decipher a text from social
media. For Enterprise data, on the other hand, a statistical agency can form
partnerships with the owners of the data and influence the design of the data.
Enterprise data is more structured and well defined, and more is known about the data
than perhaps other Big Data sources.
Figure 1.3: Sources of Big Data
1.6 Architecture
In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework
for data storage and querying. Structured, semi-structured and/or unstructured
data is stored and distributed across multiple servers. Querying of data is done
in a modified C++ dialect called ECL, which uses an apply-schema-on-read method to create
the structure of stored data at query time. In 2004 LexisNexis acquired
Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. along with their high-speed parallel
processing platform. The two platforms were merged into HPCC Systems,
which was open-sourced under the Apache v2.0 License in 2011. Currently HPCC
and the Quantcast File System are the only publicly available platforms capable of
analyzing multiple exabytes of data.
In 2004, Google published a paper on a process called MapReduce that used
such an architecture. The MapReduce framework provides a parallel process-
ing model and associated implementation to process huge amounts of data.
With MapReduce, queries are split and distributed across parallel nodes and
processed in parallel (the Map step). The results are then gathered and delivered
(the Reduce step). The framework was very successful, so others wanted
to replicate the algorithm. Therefore, an implementation of the MapReduce
framework was adopted by an Apache open source project named Hadoop.

Figure 1.4: Architecture of Big Data
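To make the Map and Reduce steps concrete, the following is a minimal, self-contained Python sketch of the flow described above. It only illustrates the idea; it does not use the Hadoop API, and the sample input splits are invented for the example:

    from collections import defaultdict

    def map_step(split):
        # the Map step: emit (word, 1) for every word in one input split
        for line in split:
            for word in line.split():
                yield word, 1

    def reduce_step(word, counts):
        # the Reduce step: aggregate all counts gathered for one key
        return word, sum(counts)

    splits = [["big data is big"], ["data about data"]]   # two invented input splits

    groups = defaultdict(list)          # the shuffle: group intermediate pairs by key
    for split in splits:
        for word, count in map_step(split):
            groups[word].append(count)

    result = dict(reduce_step(w, c) for w, c in groups.items())
    print(result)                       # {'big': 2, 'data': 3, 'is': 1, 'about': 1}

In a real cluster the map calls run on different nodes, the grouping is performed by the framework's shuffle phase, and the reduce calls are again distributed across nodes.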
Recent studies show that the use of a multi-layer architecture is an option
for dealing with big data. A distributed parallel architecture distributes data
across multiple processing units, and parallel processing units provide data much
faster by improving processing speeds. This type of architecture inserts data
into a parallel DBMS, which implements the use of the MapReduce and Hadoop
frameworks. This type of framework looks to make the processing power transparent
to the end user by using a front-end application server.
Chapter 2
Characteristics of Big Data
2.1 Volume
The quantity of generated data is important in this context. The size of
the data determines the value and potential of the data under consideration,
and whether it can actually be considered big data or not. The name big data
itself contains a term related to size, and hence the characteristic. Many factors
contribute to the increase in data volume: transaction-based data stored
through the years, unstructured data streaming in from social media, and increasing
amounts of sensor and machine-to-machine data being collected. In the
past, excessive data volume was a storage issue. But with decreasing storage
costs, other issues emerge, including how to determine relevance within large
data volumes and how to use analytics to create value from relevant data. Volume
refers to the sheer amount of data available for analysis. This volume of data is
driven by the increasing number of data collection instruments (e.g., social media
tools, mobile applications, sensors) as well as the increased ability to store
and transfer those data with recent improvements in data storage and networking.
Traditionally, the data volume requirements for analytic and transactional
applications were in sub-terabyte territory. However, over the past decade, more
organizations in diverse industries have identified requirements for analytic data
volumes in the terabytes, petabytes, and beyond. Estimates produced by longitudinal
studies started in 2005 [8] show that the amount of data in the world
is doubling every two years. Should this trend continue, by 2020 there will
be 50 times the amount of data as there was in 2011. Other estimates
indicate that 90% of all data ever created was created in the past two years. The
sheer volume of data is colossal; the era of a trillion sensors is upon
us. This volume presents the most immediate challenge to conventional infor-
mation technology structures. It has stimulated new ways for scalable storage
across a collection of horizontally coupled resources, and a distributed approach
to querying. Briefly, the traditional relational model has been relaxed for the
persistence of newly prominent data types. These logical non-relational data
models, typically lumped together as NoSQL, can currently be classified as Big
Table, Name-Value, Document and Graphical models.
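As an illustration of the difference between these models, the same hypothetical customer record is shown below, first as flat name-value pairs and then as a single self-describing document; the field names and values are invented for the example:

    # Name-value (key-value) style: flat keys map to opaque values.
    key_value_view = {
        "customer:1001:name":  "A. Kumar",
        "customer:1001:city":  "Pune",
        "customer:1001:spend": "12500",
    }

    # Document style: one self-describing, possibly nested record per key.
    document_view = {
        "_id": "customer:1001",
        "name": "A. Kumar",
        "city": "Pune",
        "orders": [
            {"order_id": 1, "amount": 4500},
            {"order_id": 2, "amount": 8000},
        ],
    }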
2.2 Velocity
The term velocity in this context refers to the speed at which data is generated
and processed to meet the demands and challenges that lie ahead in the path of
growth and development. It covers both the speed at which data collection events
can occur and the pressure of managing large streams of real-time data. Across
the means of collecting social information, new information is being added to the
database at rates ranging from as slow as every hour or so to as fast as thousands
of events per second. Data is streaming in at unprecedented speed and must be
dealt with in a timely manner. RFID tags, sensors and smart metering are driving
the need to deal with torrents of data in near-real time. Reacting quickly enough to deal
with data velocity is a challenge for most organizations. Velocity is thus the
speed or rate at which data are created, stored, analysed and visualized.
Traditionally, most enterprises separated their transaction processing and analytics. Enterprise data
analytics were concerned with batch data extraction, processing, replication,
delivery, and other applications. But increasingly, organizations everywhere
have begun to emphasize the need for real-time, streaming, continuous data
discovery, extraction, processing, analysis, and access. In the big data era, data
are created in real-time or near real-time. With the availability of Internet
connected devices, wireless or wired, machines and devices can pass on their
data the moment it is created. Data flow rates are increasing with enormous
speed and variability, creating new challenges to enable real or near real-time
data usage. Traditionally this concept has been described as streaming data.
As such there are aspects of this that are not new, as companies such as those
in telecommunication have been sifting through high volume and velocity data
for years. The new horizontal scaling approaches do however add new big data
engineering options for efficiently handling this data.
2.3 Variety
Data today comes in all types of formats: structured, numeric data in traditional
databases; information created from line-of-business applications; unstructured
text documents, email, video, audio, stock ticker data and financial
transactions. Managing, merging and governing different varieties of data is
something many organizations still grapple with. Variety also describes the type
of content, an essential fact that data analysts must know; this helps those who
analyze the data to use it effectively to their advantage and thus uphold its
importance. Variety refers to the complexity of formats in
which Big Data can exist. Besides structured databases, there are large streams
of unstructured documents, images, email messages, video, links between de-
vices and other forms that create a heterogeneous set of data points. One effect
of this complexity is that structuring and tying data together becomes a ma-
jor effort, and therefore a central concern of Big Data analysis. Traditionally,
enterprise data implementations for analytics and transactions operated on a
single structured, row-based, relational domain of data. However, increasingly,
data applications are creating, consuming, processing, and analysing data in a
wide range of relational and non-relational formats including structured, un-
structured, semistructured, documents and so forth from diverse application
domains. Traditionally, a variety of data was handled through transforms or
pre-analytics to extract features that would allow integration with other data
through a relational model. Given the wider range of data formats, structures,
timescales and semantics that are desirable to use in analytics, the integration of
this data becomes more complex. This challenge arises as data to be integrated
could be text from social networks, image data, or a raw feed directly from a
sensor source. The Internet of Things is the term used to describe the ubiquity
of connected sensors, from RFID tags for location, to smartphones, to home
utility meters. The fusion of all of this streaming data will be a challenge for
developing a total situational awareness. Big Data Engineering has spawned
data storage models that are more efficient for unstructured data types than
a relational model, causing a derivative issue for the mechanisms to integrate
this data. It is possible that the data to be integrated for analytics may be
of such volume that it cannot be moved in order to integrate, or it may be
that some of the data are not under control of the organization creating the
data system. In either case, the variety of big data forces a range of new big
data engineering in order to efficiently and automatically integrate data that is
stored across multiple repositories and in multiple formats.
2.4 Variability
Variability is the inconsistency the data can show at times, which can hamper the
process of handling and managing the data effectively; it is a factor that can be a
problem for those who analyse the data. More precisely, variability refers to changes in data
rate, format/structure, semantics, and/or quality that impact the supported
application, analytic, or problem. Specifically, variability is a change in one
or more of the other Big Data characteristics. Impacts can include the need
to refactor architectures, interfaces, processing/algorithms, integration/fusion,
storage, applicability, or use of the data. In addition to the increasing velocities
and varieties of data, data flows can be highly inconsistent with periodic peaks.
Is something trending in social media? Daily, seasonal and event-triggered peak
data loads can be challenging to manage. Even more so with unstructured data
involved. The other characteristics directly affect the scope of the impact of a
change in one dimension. For example, in a system that deals with petabytes or
exabytes of data, refactoring the data architecture and performing the necessary
transformation to accommodate a change in structure from the source data may
not be feasible, even with the horizontal scaling typically associated with
big data architectures. In addition, the trend to integrate data from outside the
organization to obtain more refined analytic results combined with the rapid
evolution in technology means that enterprises must be able to adapt rapidly
to data variations.
2.5 Veracity
Veracity concerns the quality of captured data, which can vary greatly; accurate
analysis depends on the veracity of the source data. Veracity refers to the trustworthiness,
applicability, noise, bias, abnormality and other quality properties in the data.
Veracity is a challenge in combination with other Big Data characteristics, but is
essential to the value associated with or developed from the data for a specific
problem/application. Assessing, understanding, exploiting, and controlling
Veracity in Big Data cannot be addressed efficiently and sufficiently through-
out the data lifecycle using current technologies and techniques.
2.6 Complexity
Data management can be very complex, especially when large volumes of data
come from multiple sources. Data must be linked, connected, and correlated
so users can grasp the information the data is supposed to convey. Today’s
data comes from multiple sources. And it is still an undertaking to link, match,
cleanse and transform data across systems. However, it is necessary to connect
and correlate relationships, hierarchies and multiple data linkages, or the data
can quickly spiral out of control.
Chapter 3
Storage, Selection and Processing of Big
Data
3.1 Storage of Big Data
The explosive growth of data places stricter requirements on storage and
management. In this section, we focus on the storage of big data. Big data stor-
age refers to the storage and management of large-scale datasets while achieving
reliability and availability of data accessing. We will review important issues
including massive storage systems, distributed storage systems, and big data
storage mechanisms. On one hand, the storage infrastructure needs to pro-
vide information storage service with reliable storage space; on the other hand,
it must provide a powerful access interface for query and analysis of a large
amount of data.
Traditionally, as auxiliary equipment of servers, data storage devices were used
to store, manage, look up, and analyze data with structured RDBMSs. With
the sharp growth of data, storage devices are becoming increasingly
important, and many Internet companies pursue large storage capacity to stay
competitive. Therefore, there is a compelling need for research on data
storage. Various storage systems have emerged to meet
the demands of massive data. Existing massive storage technologies can be
classified as Direct Attached Storage (DAS) and network storage, while net-
work storage can be further classified into Network Attached Storage (NAS)
and Storage Area Network (SAN). In DAS, various hard disks are directly connected
to servers, and data management is server-centric, such that storage
devices are peripheral equipment, each of which takes a certain amount of I/O
resource and is managed by individual application software. For this reason,
DAS is only suitable for interconnecting servers at a small scale. However, due
to its low scalability, DAS will exhibit undesirable efficiency when the storage
capacity is increased, i.e., the upgradeability and expandability are greatly lim-
ited. Thus, DAS is mainly used in personal computers and small-sized servers.
Network storage utilizes the network to provide users with a unified interface
for data access and sharing. Network storage equipment includes special data
exchange equipment, disk arrays, tape libraries, and other storage media, as well
as special storage software. It is characterized by strong expandability. NAS
is actually auxiliary storage equipment on a network. It is directly connected
to a network through a hub or switch through TCP/IP protocols. In NAS,
data is transmitted in the form of files. Compared to DAS, the I/O burden at a
NAS server is reduced extensively since the server accesses a storage device in-
directly through a network. While NAS is network-oriented, SAN is especially
designed for data storage with a scalable and bandwidth intensive network,
e.g., a high-speed network with optical fiber connections. In SAN, data stor-
age management is relatively independent within a storage local area network,
where multipath based data switching among any internal nodes is utilized to
achieve a maximum degree of data sharing and data management.
3.1.1 Key Requirements of Big Data
At root, the key requirements of big data storage are that it can handle
very large amounts of data and keep scaling to keep up with growth, and that
it can provide the input/output operations per second (IOPS) necessary to
deliver data to analytics tools. The largest big data practitioners (Google, Facebook,
Apple, etc.) run what are known as hyperscale computing environments.
These comprise vast amounts of commodity servers with direct-attached storage
(DAS). Redundancy is at the level of the entire compute/storage unit, and if a
unit suffers an outage of any component it is replaced wholesale, having already
failed over to its mirror. Such environments run the likes of Hadoop, NoSQL
and Cassandra as analytics engines, and typically have PCIe flash storage alone
in the server or in addition to disk to cut storage latency to a minimum. There's
no shared storage in this type of configuration. Hyperscale computing environ-
ments have been the preserve of the largest web-based operations to date, but
it is highly probable that such compute/storage architectures will bleed down
into more mainstream enterprises in the coming years. The appetite for build-
ing hyperscale systems will depend on the ability of an enterprise to take on a
lot of in-house hardware building and maintenance and whether they can jus-
tify such systems to handle limited tasks alongside more traditional enterprise
environments that handle large amounts of applications on less specialised sys-
tems. But hyperscale is not the only way. Many enterprises, and even quite
small businesses, can take advantage of big data analytics. They will need
the ability to handle relatively large data sets and handle them quickly, but
may not need quite the same response times as those organisations that use
it to push adverts out to users within response times of a few seconds. So the key
type of big data storage system with the attributes required will often be scale-
out or clustered NAS. This is file access shared storage that can scale out to
meet capacity or increased compute requirements and uses parallel file systems
that are distributed across many storage nodes that can handle billions of files
without the kind of performance degradation that happens with ordinary file
systems as they grow. For some time, scale-out or clustered NAS was a distinct
product category, with specialised suppliers such as Isilon and BlueArc. But
a measure of the increasing importance of such systems is that both of these
have been bought relatively recently by big storage suppliers EMC and Hitachi
Data Systems, respectively. Meanwhile, clustered NAS has gone mainstream,
and the big change here was with NetApp incorporating true clustering and
petabyte/parallel file system capability into its Data ONTAP OS in its FAS
filers. The other storage format that is built for very large numbers of files is
object storage. This tackles the same challenge as scale-out NAS that tradi-
tional tree-like file systems become unwieldy when they contain large numbers
of files.
Object-based storage gets around this by giving each file a unique identifier
and indexing the data and its location. It's more like the DNS way of doing
things on the internet than the kind of file system we're used to. Object storage
systems can scale to very high capacity and large numbers of files in the bil-
lions, so are another option for enterprises that want to take advantage of big
data. Having said that, object storage is a less mature technology than scale-
out NAS. So, to sum up, big data storage needs to be able to handle capacity
and provide low latency for analytics work. You can choose to do it like the big
boys in hyperscale environments or adopt NAS or object storage in more tradi-
tional IT departments to do the job. Flash storage solutions, implemented at
the server level and with all-flash arrays, offer some interesting alternatives for
high-performance, low-latency storage, from a few terabytes to a hundred ter-
abytes or more in capacity. Object-based, scale-out architectures with erasure
coding can provide scalable storage systems that eschew traditional RAID and
replication methods to achieve new levels of efficiency and lower per-gigabyte
costs.
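The following toy Python sketch illustrates only the naming-and-lookup idea behind object storage described above: each object receives a unique identifier (here derived from a content hash, one common but not universal choice) and a flat index maps identifiers to data. Real object stores add replication or erasure coding, metadata and distribution across nodes, none of which is shown:

    import hashlib

    class TinyObjectStore:
        """Toy object store: a flat index from object id to bytes."""

        def __init__(self):
            self.index = {}                                  # object id -> data

        def put(self, data: bytes) -> str:
            object_id = hashlib.sha256(data).hexdigest()     # content-derived id
            self.index[object_id] = data
            return object_id

        def get(self, object_id: str) -> bytes:
            return self.index[object_id]

    store = TinyObjectStore()
    oid = store.put(b"sensor reading 42")
    print(oid[:12], store.get(oid))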
3.2 Selection of Big Data
Every organization seeking to make sense of big data must determine which
platforms and tools, in the sea of available options, will help them to meet their
business goals. Answering a few key questions can help guide IT leaders
to make the right data management choices for their organization's future
success. For organizations needing to store and process tens of terabytes of
data, using an open-source distributed file system is a mature choice due to its
predictable scalability over clustered hardware. Plus, it's the base platform for
many big data architectures already. However, if looking to run analytics in
online or real-time applications, consider hybrid architectures containing dis-
tributed file systems combined with distributed database management systems
(which have lower latency). Or look at large traditional relational systems to
get real-time access to data that has been through the heavy lifting processes
of a distributed file system. Many NoSQL databases require specific application
programming interfaces (APIs) in order to access the data. With this, you'll need to
consider the integration of visualization or other tools that will need access
to the data. If the tools being used with the big data platform need a SQL
interface, choose a tool that has maturity in that area. Of note, NoSQL and
big data platforms are evolving quickly and businesses just starting to build
custom applications on top of a big data platform may be able to build around
the sometimes raw data access frameworks. Alternatively, businesses with ex-
isting applications will need a more mature offering. If data requirements are
especially unstructured, or include streaming data sources such as social media
or video, businesses should look into data serialization technologies that allow
capture, storage and representation of such high-velocity data. How applications
consume data should also be taken into consideration. For instance,
some existing tools allow users to project different structures across the data
store, giving flexibility to store data in one way and access it in another. Yes,
being flexible in how data is presented to consuming applications is a bene-
fit, but the performance may not be good enough for high velocity data. To
overcome this performance challenge, you may need to integrate with a more
structured data store further downstream in your data architecture. If looking
to extend your current data architecture by integrating a big data platform into
an existing data warehouse, data integration tools can help. Many integration
vendors that support big data platforms also have specialized support for in-
tegrating with SQL data warehouses and data marts. Clearly, choosing a big
data solution isn't easy.
As companies of all sizes try to extract more from their existing data stores,
Big Data vendors are rushing in to provide a range of Big Data solutions, which
comprise everything from database technology to visualization tools. With
such a diverse selection of tools to choose from, buyers must carefully define
their goals in order to find the right tools to meet them. Before finding
the right tools, however, organizations must first ask themselves what business
problems they're trying to solve, and why. "Too many big data projects don't
start with problems to solve, but rather start with exploratory analytics," said
Chris Selland, VP of marketing and business development for HP Vertica. That's
okay to a point, but eventually these questions need to be asked and answered.
Companies have a lot of Big Data and many questions, but that doesn't result
in the CIO or CFO simply handing you a large amount of money to work with.
Figure 3.1: Selection of Big Data
3.3 Processing of Big Data
A variety of platforms have emerged to process big data, including advanced
SQL (sometimes called NewSQL) databases that adapt SQL to handle larger
volumes of structured data with greater speed, and NoSQL platforms that may
range from file systems to document or columnar data stores that typically
dispense with the need for modelling data. Most of the early implementations
of big data, especially with NoSQL platforms such as Hadoop, have focused
more on volume and variety, with results delivered through batch processing.
Behind the scenes, there is a growing range of use cases that also emphasise
speed. Some of them consist of new applications that take advantage not only
of powerful back-end data platforms, but also the growth in bandwidth and
mobility. Examples include mobile applications such as Waze that harness sen-
sory data from smartphones and GPS devices to provide real-time pictures of
traffic conditions. On the horizon there are opportunities for mobile carriers
to track caller behaviour in real time to target ads, location-based services, or
otherwise engage their customers. Conversely, existing applications
are being made more accurate, responsive and effective as smart sensors add
more data points, intelligence and adaptive control. These are as diverse as
optimising supply chain inventories, regulating public utility and infrastruc-
ture networks, or providing real-time alerts for homeland security. The list of
potential opportunities for fast processing of big data is limited only by the
imagination.
3.3.1 Batch Processing
Apache Hadoop is a distributed computing framework modeled after Google
MapReduce to process large amounts of data in parallel. For many Java developers,
the first thing that comes to mind when speaking about distributed computing
is EJB. EJB is de facto a component model with remoting capability, but it falls short
of the critical features of a distributed computing framework, which include
computational parallelization, work distribution, and tolerance to unreliable
hardware and software. Hadoop, on the other hand, has these merits built in.
ZooKeeper, modeled on Google Chubby, is a centralized service for maintaining
configuration information, naming, providing distributed synchronization,
and group services for the Hadoop cluster. The Hadoop Distributed File System
(HDFS), modeled on Google GFS, is the underlying file system of a Hadoop
cluster. HDFS works more efficiently with a few large data files than numer-
ous small files. A real-world Hadoop job typically takes minutes to hours to
complete; therefore Hadoop is not for real-time analytics, but rather for offline,
batch data processing. Recently, Hadoop has undergone a complete overhaul
for improved maintainability and manageability. Something called YARN (Yet
Another Resource Negotiator) is at the center of this change. One major ob-
jective of Hadoop YARN is to decouple Hadoop from the MapReduce paradigm to
accommodate other parallel computing models, such as MPI (Message Passing
Interface) and Spark.
In general, data flows from component to component in an enterprise ap-
plication. This is the case for application frameworks (EJB and Spring frame-
work), integration engines (Camel and Spring Integration), as well as ESB
(Enterprise Service Bus) products. Nevertheless, for the data-intensive pro-
cesses Hadoop deals with, it makes better sense to load a big data set once
and perform various analysis jobs locally to minimize IO and network cost, the
so-called ”Move-Code-To-Data” philosophy. When you load a big data file to
HDFS, the file is split into chunks (or file blocks) through a centralized Name
Node (master node) and resides on individual Data Nodes (slave nodes) in the
Hadoop cluster for parallel processing.
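The following is a conceptual sketch, not HDFS code: it mimics how a file of a given size is cut into fixed-size blocks and how a name-node-like table could record which data node holds each block. The 64 MB block size and the round-robin placement are simplifying assumptions, and real HDFS also replicates each block:

    BLOCK_SIZE = 64 * 1024 * 1024                  # 64 MB, an early HDFS default
    data_nodes = ["datanode-1", "datanode-2", "datanode-3"]

    def split_and_place(file_size_bytes):
        """Return a name-node-like table: block number -> data node."""
        placement = {}
        n_blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
        for block in range(n_blocks):
            placement[block] = data_nodes[block % len(data_nodes)]   # round-robin
        return placement

    # A 300 MB file becomes 5 blocks spread across the 3 data nodes.
    print(split_and_place(300 * 1024 * 1024))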
3.3.2 Stream Processing
Stream data processing is not intended to analyze a full big data set, nor
is it capable of storing that amount of data (The Storm-on-YARN project is
an exception). If you are asked to build a real-time, ad hoc analytics
system that operates on a complete big data set, you will need some mighty
tools. Twitter Storm is an open source, big data processing system intended
for distributed, real-time streaming processing. Storm implements a data flow
model in which data (time series facts) flows continuously through a topology
(a network of transformation entities). The slice of data being analyzed at any
moment in an aggregate function is specified by a sliding window, a concept in
CEP/ESP. A sliding window may be like ”last hour”, or ”last 24 hours”, which
is constantly shifting over time. Data can be fed to Storm through distributed
messaging queues like Kafka, Kestrel, and even regular JMS. Trident is an ab-
straction API of Storm that makes it easier to use. Like Twitter Storm, Apache
S4 is a product for distributed, scalable, continuous, stream data processing.
Note that the size of a sliding window cannot grow infinitely.
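The sliding-window idea can be illustrated with plain Python, independently of the Storm or Trident APIs; the window length and the sample events below are arbitrary:

    from collections import deque

    WINDOW_SECONDS = 3600              # a "last hour" window

    window = deque()                   # (timestamp, value) pairs still inside the window

    def observe(timestamp, value):
        window.append((timestamp, value))
        # evict events that have slid out of the window
        while window and window[0][0] <= timestamp - WINDOW_SECONDS:
            window.popleft()
        return sum(v for _, v in window)   # aggregate over the current window

    print(observe(0, 5))       # 5
    print(observe(1800, 7))    # 12  (both events fall inside the last hour)
    print(observe(4000, 1))    # 8   (the event at t=0 has been evicted)

In a real streaming engine the window is maintained by the framework and the aggregate function is supplied by the application; the eviction logic above is what keeps the window from growing without bound.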
3.3.3 Hadoop Ecosystem
The Hadoop API is often considered low-level, as it is not easy to program with.
The quickly growing Hadoop ecosystem offers a list of abstraction techniques,
which encapsulate and hide the programming complexity of Hadoop. Pig, Hive,
Cascading, Crunch, Scrunch, Scalding, Scoobi, and Cascalog all aim to provide
low cost entry to Hadoop programming. Pig, Crunch (Scrunch), and Cascading
are data-pipe based techniques. A data pipe is a multi-step process in which
transformation, splitting, merging, and joins may be conducted individually at
each step; it is similar to a workflow in a general workflow engine. Hive, on the
other hand, works like a data warehouse by offering a SQL-compatible interactive
shell. Programs or shell scripts developed on top of these techniques are compiled
to native Hadoop Map and Reduce classes behind the scenes to run in the cluster.
Given the simplified programming interfaces
in conjunction with libraries of reusable functions, development productivity is
greatly improved.
3.3.4 Map and Reduce
A centralized JobTracker process in the Hadoop cluster moves your code to
data. The code hereby includes a Map and a Reduce class. Put simply, a Map
class does the heavy-lifting job of data filtering, transformation, and splitting.
For better IO and network efficiency, a Mapper instance only processes the data
chunks co-located on the same data node, a concept termed data locality (or
data proximity). Mappers can run in parallel on all the available data nodes
in the cluster. The outputs of the Mappers from different nodes are shuffled
through a particular algorithm to the appropriate Reduce nodes. A Reduce class
by nature is an aggregator. The number of Reducer instances is configurable
by developers.
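As a concrete illustration of the Map and Reduce roles, below is a word-count sketch written in the style of Hadoop Streaming, where the mapper and reducer are ordinary scripts that read from standard input and the framework sorts the mapper output by key before it reaches the reducer. The file names mapper.py and reducer.py are only illustrative:

    # mapper.py -- emits one "word<TAB>1" line per input word
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- input arrives sorted by key, so counts for a word are adjacent
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The same pipeline can be tried on one machine with cat input.txt | python mapper.py | sort | python reducer.py, where the sort command plays the role of the shuffle between the Map and Reduce stages.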
Chapter 4
Big Data Analytics
Big data is now a reality: The volume, variety and velocity of data coming
into your organization continue to reach unprecedented levels. This phenom-
enal growth means that not only must you understand big data in order to
decipher the information that truly counts, but you also must understand the
possibilities of big data analytics. Big data analytics is the process of examin-
ing big data to uncover hidden patterns, unknown correlations and other useful
information that can be used to make better decisions. With big data analytics,
data scientists and others can analyze huge volumes of data that conventional
analytics and business intelligence solutions can’t touch. Consider that your or-
ganization could accumulate (if it hasn’t already) billions of rows of data with
hundreds of millions of data combinations in multiple data stores and abundant
formats. High-performance analytics is necessary to process that much data in
order to figure out what’s important and what isn’t. Enter big data analytics.
Why collect and store terabytes of data if you can’t analyze it in full context?
Or if you have to wait hours or days to get results? With new advances in
computing technology, there’s no need to avoid tackling even the most chal-
lenging business problems. For simpler and faster processing of only relevant
data, you can use high-performance analytics. Using high-performance data
mining, predictive analytics, text mining, forecasting and optimization on big
data enables you to continuously drive innovation and make the best possible
decisions. In addition, organizations are discovering that the unique properties
of machine learning are ideally suited to addressing their fast-paced big data
needs in new ways.
Big data can be analyzed with the software tools commonly used as part of
advanced analytics disciplines such as predictive analytics, data mining, text
analytics and statistical analysis. Mainstream BI software and data visualiza-
tion tools can also play a role in the analysis process. But the semi-structured
and unstructured data may not fit well in traditional data warehouses based on
relational databases. Furthermore, data warehouses may not be able to handle
the processing demands posed by sets of big data that need to be updated fre-
quently or even continually – for example, real-time data on the performance of
mobile applications or of oil and gas pipelines. As a result, many organizations
looking to collect, process and analyze big data have turned to a newer class of
technologies that includes Hadoop and related tools such as YARN, MapReduce,
Spark, Hive and Pig as well as NoSQL databases. Those technologies form the
core of an open source software framework that supports the processing of large
and diverse data sets across clustered systems.
In some cases, Hadoop clusters and NoSQL systems are being used as landing
pads and staging areas for data before it gets loaded into a data warehouse
for analysis, often in a summarized form that is more conducive to relational
structures. Increasingly though, big data vendors are pushing the concept of
a Hadoop data lake that serves as the central repository for an organization’s
incoming streams of raw data. In such architectures, subsets of the data can
then be filtered for analysis in data warehouses and analytical databases, or it
can be analyzed directly in Hadoop using batch query tools, stream processing
software and SQL on Hadoop technologies that run interactive, ad hoc queries
written in SQL. Potential pitfalls that can trip up organizations on big data
analytics initiatives include a lack of internal analytics skills and the high cost
of hiring experienced analytics professionals. The amount of information that’s
typically involved, and its variety, can also cause data management headaches,
including data quality and consistency issues. In addition, integrating Hadoop
systems and data warehouses can be a challenge, although various vendors now
offer software connectors between Hadoop and relational databases, as well as
other data integration tools with big data capabilities.
4.1 Examples of Big Data Analytics
As the technology that helps an organization to break down data silos and
analyze data improves, business can be transformed in all sorts of ways. Accord-
ing to Datamation, today’s advances in analyzing Big Data allow researchers to
decode human DNA in minutes, predict where terrorists plan to attack, deter-
mine which gene is most likely to be responsible for certain diseases and, of
course, which ads you are most likely to respond to on Facebook. The business
cases for leveraging Big Data are compelling. For instance, Netflix mined its
subscriber data to put the essential ingredients together for its recent hit House
of Cards, and subscriber data also prompted the company to bring Arrested
Development back from the dead.
Another example comes from one of the biggest mobile carriers in the world.
France’s Orange launched its Data for Development project by releasing sub-
scriber data for customers in the Ivory Coast. The 2.5 billion records, which
were made anonymous, included details on calls and text messages exchanged
between 5 million users. Researchers accessed the data and sent Orange propos-
als for how the data could serve as the foundation for development projects to
improve public health and safety. Proposed projects included one that showed
how to improve public safety by tracking cell phone data to map where people
went after emergencies; another showed how to use cellular data for disease
containment.
4.2 Benefits of Big Data Analytics
Enterprises are increasingly looking to find actionable insights into their data.
Many big data projects originate from the need to answer specific business
questions. With the right big data analytics platforms in place, an enterprise
can boost sales, increase efficiency, and improve operations, customer service
and risk management.
1. Webopedia parent company, QuinStreet, surveyed 540 enterprise decision-
makers involved in big data purchases to learn in which business areas companies
plan to use Big Data analytics to improve operations.
2. About half of all respondents said they were applying big data analytics
to improve customer retention, help with product development and gain a
competitive advantage.
3. The business area getting the most attention relates to increasing efficien-
cies and optimizing operations: 62 percent of respondents said that they
use big data analytics to improve speed and reduce complexity.
Chapter 5
Challenges in Big Data
5.1 Security
The biggest challenge for big data from a security point of view is the pro-
tection of users' privacy. Big data frequently contains huge amounts of personally
identifiable information, and therefore the privacy of users is a huge concern.
Because of the large amount of data stored, breaches affecting big data can have
more devastating consequences than the data breaches we normally see in the
press. This is because a big data security breach will potentially affect a much
larger number of people, with consequences not only from a reputational point
of view, but with enormous legal repercussions. When producing information
for big data, organizations have to ensure that they have the right balance
between utility of the data and privacy. Before the data is stored it should
be adequately anonymised, removing any unique identifier for a user. This in
itself can be a security challenge as removing unique identifiers might not be
enough to guarantee that the data will remain anonymous. The anonymized
data could be cross-referenced with other available data using de-anonymization
techniques. When storing the data, organizations will face the
problem of encryption. Data cannot simply be sent encrypted by the users if the cloud needs to perform operations over it. A solution for this is to use Fully Homomorphic Encryption (FHE), which allows the cloud to perform operations directly on the encrypted data, producing new encrypted results. When those results are decrypted they are the same as if the operations had been carried out on the plaintext data. The cloud is therefore able to compute over encrypted data without any knowledge of the underlying plaintext. While using big data, a significant challenge is
how to establish ownership of information. If the data is stored in the cloud, a trust boundary should be established between the data owners and the data storage owners. Adequate access control mechanisms will be key in protecting the data. Access control has traditionally been provided by operating systems or applications restricting access to the information, which typically exposes all the information if the system or application is hacked. A better approach is to protect the information using encryption that only allows decryption if the entity trying to access the information is authorised by an access control policy. An additional problem is that software commonly used to store big data, such as Hadoop, doesn't always come with user authentication by default. This makes the problem of access control worse, as a default installation would leave the information open to unauthenticated users. Big data solutions often rely on traditional firewalls or implementations at the application layer to restrict access to the information.
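To make the encrypted-computation idea discussed above more concrete, the sketch below uses the Paillier cryptosystem via the third-party python-paillier package (phe). Paillier is only partially homomorphic (it supports addition of ciphertexts, not arbitrary computation as FHE would), so this is a simplified stand-in for the FHE approach, and the package and values shown are assumptions for illustration only.

# Sketch only: additive homomorphic encryption with the 'phe' package
# (pip install phe). The cloud can sum ciphertexts without ever seeing
# the underlying values.
from phe import paillier

# Data owner generates a key pair and encrypts sensitive readings.
public_key, private_key = paillier.generate_paillier_keypair()
readings = [120, 135, 128, 142]                      # e.g. usage or billing values
encrypted = [public_key.encrypt(r) for r in readings]

# The cloud aggregates the ciphertexts without learning the plaintext.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the data owner, holding the private key, can decrypt the result.
total = private_key.decrypt(encrypted_total)
assert total == sum(readings)
print("decrypted total:", total)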
The main solution to ensuring that data remains protected is the adequate use of encryption. For example, Attribute-Based Encryption can help in providing fine-grained access control over encrypted data. Anonymizing the data is also important to ensure that privacy concerns are addressed: it should be ensured that all sensitive information is removed from the set of records collected.
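As a minimal illustration of the anonymization step described above, the sketch below drops direct identifiers and replaces the user ID with a salted hash before records are handed to an analytics platform. The field names and salt handling are hypothetical assumptions; real de-identification would also have to consider quasi-identifiers (such as birth date plus postcode) that enable re-identification.

import hashlib

SALT = b"rotate-and-store-this-secret-separately"   # assumed to be managed outside the data store
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def pseudonymize(user_id: str) -> str:
    # One-way, salted hash so the same user can still be linked across
    # records without storing the raw identifier.
    return hashlib.sha256(SALT + user_id.encode("utf-8")).hexdigest()

def anonymize(record: dict) -> dict:
    # Drop direct identifiers, keep analytic fields, pseudonymize the key.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["user_id"] = pseudonymize(record["user_id"])
    return cleaned

raw = {"user_id": "u-1001", "name": "A. Subscriber", "email": "a@example.com",
       "phone": "+91-0000000000", "cell_tower": "MH-042", "call_seconds": 74}
print(anonymize(raw))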
Real-time security monitoring is also a key security component for a big data
project. It is important that organizations monitor access to ensure that there
is no unauthorised access. It is also important that threat intelligence is in place
to ensure that more sophisticated attacks are detected and that the organiza-
tions can react to threats accordingly. If an adequate governance framework is
not applied to big data then the data collected could be misleading and cause
unexpected costs. The main problem from a governance point of view is that big data is a relatively new concept, and well-established procedures and policies for it have yet to be developed. The challenge with big data is that the unstructured nature of
the information makes it difficult to categorize, model and map the data when
it is captured and stored. The problem is made worse by the fact that the data
normally comes from external sources, often making it complicated to confirm
its accuracy. Data hackers have become more damaging in the era of big data
due to the availability of large volumes of publicly available data, the abil-
ity to store massive amounts of data on portable devices such as USB drives
and laptops, and the accessibility of simple tools to acquire and integrate dis-
parate data sources. According to the Open Security Foundation's DataLossDB
project (http://datalossdb.org), hacking accounts for 28% of all data breach
incidents, with theft accounting for an additional 24%, fraud accounting for
12%, and web-related loss accounting for 9% of all data loss incidents. More
than half (57%) of all data loss incidents involve external parties, but 10% in-
volve malicious actions on the part of internal parties, and an additional 20%
involve accidental actions by internal parties. Private businesses, hospitals, and
biomedical researchers are also making tremendous investments in the collec-
tion, storage, and analysis of large-scale data and private information.
5.2 Data Access
5.2.1 Inefficiency
Securing and controlling access to data is a very time-consuming process. Currently, even in companies that are aware that their data access protocols are an issue, the actual practices for securing data are inefficient, whether unstructured data is being secured manually or handled automatically. To secure data properly, enterprises need to assess where in the environment it resides, assess how much data loss results from the use of file servers and NAS devices, and even develop an inventory of what is available in their SharePoint deployments. All of this is very time consuming, as most administrators will no doubt agree.
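A first step toward the data inventory mentioned above can be automated. The sketch below walks a file share and reports large files that have not been accessed recently, so that they can be reviewed for ownership and access rights. The mount point and thresholds are illustrative assumptions, not recommendations.

import os
import time

ROOT = "/mnt/fileshare"        # hypothetical NAS mount point
STALE_DAYS = 365               # flag files untouched for a year
MIN_SIZE_MB = 100              # ignore small files

now = time.time()
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path)
        except OSError:
            continue                       # unreadable or recently removed file
        stale = (now - st.st_atime) > STALE_DAYS * 86400
        large = st.st_size > MIN_SIZE_MB * 1024 * 1024
        if stale and large:
            last_access = time.strftime("%Y-%m-%d", time.localtime(st.st_atime))
            print(f"{path}\t{st.st_size // (1024 * 1024)} MB\tlast access {last_access}")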
5.2.2 Ineffectiveness
After answering the question of who has access to the data, the next big question is whether they should have that level of access at all. Does IT know what level of access should be offered to employees, and should this decision be left in the hands of the IT department? The chances are that it shouldn't be allowed to decide, as this is ultimately a business decision. But, as in most companies, there is often no clear policy on who makes these decisions. In this situation there is also likely to be a lot of unstructured and orphaned data lying around with no one taking responsibility for it.
5.3 Data Cleaning
Data cleaning remains an important part of the process of ensuring data quality, and involves two main tasks. The first is to verify that the quantitative and qualitative (i.e. categorical) variables have been recorded as expected. The second involves removing outliers, which in the Big Data paradigm typically means the use of decision tree algorithms. But data cleaning itself is a subjective process (e.g. deciding which variables to consider) and not truly agnostic as would be desired, and is thus open to philosophical debate (Bollier, 2010).
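As one concrete, hedged example of the tree-based outlier removal mentioned above, the sketch below uses scikit-learn's IsolationForest, an ensemble of randomized trees. The synthetic data and the contamination rate are assumptions; the contamination parameter is precisely the kind of subjective choice the text warns about.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic example: mostly well-behaved measurements plus a few gross errors.
values = np.concatenate([rng.normal(50, 5, size=1000), [500.0, -300.0, 999.0]])
X = values.reshape(-1, 1)

# contamination is the analyst's guess at the outlier fraction -- a subjective choice.
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)     # -1 marks suspected outliers, 1 marks inliers

cleaned = values[labels == 1]
print(f"kept {cleaned.size} of {values.size} records")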
5.4 Data Representation
Related to the question of data provenance is the issue of understanding the
underlying population whose behavior has been captured. The large data sizes
may make the sampling rate irrelevant, but it doesn't necessarily make the data representative. Not everybody uses Twitter, Facebook or even Google searches. For example, ITU estimates suggest that Internet usage is still limited to only 40 per cent of the world population. In other words, more than four billion people globally are not yet using the Internet, and 90 per cent of them are from the developing world. Of the world's three billion Internet users, two-thirds are from developing countries. At the other end of the spectrum, even though mobile
cellular penetration is close to 100%, this does not mean that every person in
the world is using a mobile phone. Representativeness is of high relevance when considering how telecommunication data may be used
for monitoring and development. Whilst the promise in leveraging data from
mobile network operators for monitoring and development hinges on its large
coverage, nearing the actual population, it is still not the whole population.
Questions such as the extent of coverage of the poor, or the levels of gender
representation amongst telecom users are all valid questions. Whilst the regis-
tration information might provide answers, the reality is that the demographic
information on telecom subscribers for example is not always accurate. With
pre-paid subscriptions being the norm in the majority of the developing world,
demographic information contained in the mobile operator records is practically
useless, even with mandated registration.
The issue of sampling bias is best illustrated by the case of Street Bump,
a mobile app developed by Boston City Hall. Street Bump uses a phone's accelerometer to detect potholes and notify City Hall while app users drive around Boston. The app, however, introduces a selection bias, since it is skewed towards the demographics of its users, who often hail from affluent areas with greater smartphone ownership (Harford, 2014). Hence the "Big" in
Big Data does not automatically mean that issues such as measurement bias
and methodology, internal and external data validity, and inter-dependencies
among data can be ignored. These are foundational issues not just for small
data but also for Big Data (Boyd and Crawford, 2012).
5.5 Behavioral change
Digitized online behavior can be subject to self-censorship and the creation of multiple personas, further muddying the waters. Thus studying the data exhaust of people may not always give us insights into real-world dynamics. This may be less of an issue with TGD, where in essence the data artifact is itself a byproduct of another activity. Telecom network Big Data, which mostly falls under this category, may be less susceptible to self-censorship and persona development. But this does not exclude the possibility either: it is not inconceivable that users may avoid using their mobiles, or even turn them off, in areas where they do not wish their digital footprint to be left behind.
In a way, Big Data analyses of behavioral data are subject to a form of the
Heisenberg Uncertainty principle: as soon as the basic process of an analysis
is known, there may be concerted efforts to exhibit different behavior and/or
actions to change the outcomes (Bollier, 2010).
For example, the famous Google PageRank algorithm has spawned an entire industry of organizations that claim to enhance the page ranks of websites. Search Engine Optimization (SEO) is now an established practice when developing websites. Change in behavior could also partly account for the declining veracity of Google Flu Trends. Researchers found that influenza-like-illness rates as
exhibited by Google searches did not necessarily correlate with actual influenza
virus infections (Ortiz et al., 2011). Recent research has shown that after 2009 (when it failed to catch the non-seasonal influenza outbreak of that year), infrequent updates have not improved the results. In fact, Google Flu Trends has
persistently overestimated flu prevalence since 2009 (Lazer, Kennedy, King, and
Vespignani, 2014). Google Flu Trends does not and cannot know what factors
contributed to the strong correlations found in their initial work. The point is
that the underlying real world actions of the population that turned to Google
for its health queries and which contributed to the original correlations discov-
ered by GFT, may have in fact changed over time, diminishing the robustness of the original algorithm. For example, the hoopla surrounding GFT could have even created rebound effects, with more and more people turning to Google for their broader health questions and thereby introducing additional search terms (due to different cultural norms and/or ground conditions), which can collectively introduce biases that GFT has not been able to account for. Such possible
problems could have been caught and resolved had the GFT method been more
transparent.
Chapter 6
Applications and Future of Big Data
6.1 Applications
Big data has increased the demand for information management specialists, so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data
management and analytics. In 2010, this industry was worth more than $100
billion and was growing at almost 10 percent a year: about twice as fast as
the software business as a whole. Developed economies increasingly use data-
intensive technologies. There are 4.6 billion mobile-phone subscriptions world-
wide, and between 1 billion and 2 billion people accessing the internet. Between
1990 and 2005, more than 1 billion people worldwide entered the middle class,
which means more people become more literate, which in turn leads to infor-
mation growth. The world’s effective capacity to exchange information through
telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993,
2.2 exabytes in 2000, 65 exabytes in 2007, and predictions put the amount of in-
ternet traffic at 667 exabytes annually by 2014. According to one estimate, one
third of the globally stored information is in the form of alphanumeric text and
still image data, which is the format most useful for most big data applications.
This also shows the potential of yet unused data (i.e. in the form of video and audio content). While many vendors offer off-the-shelf solutions for Big Data,
experts recommend the development of in-house solutions custom-tailored to
solve the company’s problem at hand if the company has sufficient technical
capabilities.
6.1.1 Government
The use and adoption of Big Data within governmental processes is beneficial
and allows efficiencies in terms of cost, productivity, and innovation. That
said, this process does not come without its flaws. Data analysis often requires
multiple parts of government (central and local) to work in collaboration and
create new and innovative processes to deliver the desired outcome. Below are some leading examples within the governmental Big Data space.
United States of America
In 2012, the Obama administration announced the Big Data Research and
Development Initiative, to explore how big data could be used to address im-
portant problems faced by the government. The initiative is composed of 84
different big data programs spread across six departments. Big data analysis
played a large role in Barack Obama’s successful 2012 re-election campaign.
The United States Federal Government owns six of the ten most powerful su-
percomputers in the world. The Utah Data Center is a data center currently
being constructed by the United States National Security Agency. When fin-
ished, the facility will be able to handle a large amount of information collected
by the NSA over the Internet. The exact amount of storage space is unknown,
but more recent sources claim it will be on the order of a few exabytes.
India
Big data analysis was in part responsible for the BJP and its allies winning the 2014 Indian General Election. The Indian Government utilises numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.
6.1.2 Cyber-Physical Models
Current prognostics and health management (PHM) implementations mostly utilize data gathered during actual usage, whereas analytical algorithms can perform more accurately when more information from throughout the machine's lifecycle, such as system configuration, physical knowledge and working principles, is included.
cally integrate, manage and analyze machinery or process data during different
stages of machine life cycle to handle data/information more efficiently and
further achieve better transparency of machine health condition for manufac-
turing industry. With such motivation, a cyber-physical (coupled) model scheme
has been developed. The coupled model is a digital twin of the real machine
that operates in the cloud platform and simulates the health condition with an
integrated knowledge from both data driven analytical algorithms as well as
other available physical knowledge. It can also be described as a 5S systematic
approach consisting of Sensing, Storage, Synchronization, Synthesis and Ser-
vice. The coupled model first constructs a digital image from the early design
stage. System information and physical knowledge are logged during product
design, based on which a simulation model is built as a reference for future
analysis. Initial parameters may be statistically generalized and they can be
tuned using data from testing or the manufacturing process using parameter
estimation. After that step, the simulation model can be considered a mirrored
image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the increased connectivity
offered by cloud computing technology, the coupled model also provides better
accessibility of machine condition for factory managers in cases where physical
access to actual equipment or machine data is limited.
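A heavily simplified sketch of the coupled-model idea is given below: a nominal degradation parameter from the design stage is re-estimated by least squares from logged sensor data, after which the tuned model can track and project machine health. The linear wear model, class name and numbers are illustrative assumptions, not part of any real PHM system.

import numpy as np

class CoupledModel:
    """Toy digital twin: health(t) = 1.0 - wear_rate * t."""

    def __init__(self, nominal_wear_rate: float):
        self.wear_rate = nominal_wear_rate        # initial value from the design stage

    def tune(self, hours: np.ndarray, health: np.ndarray) -> None:
        # Parameter estimation: least-squares fit of the wear rate to field data.
        slope, _intercept = np.polyfit(hours, health, deg=1)
        self.wear_rate = -slope

    def predict_health(self, hours: float) -> float:
        return max(0.0, 1.0 - self.wear_rate * hours)

# Design-stage guess, then tuning with (noisy) measurements from the real machine.
twin = CoupledModel(nominal_wear_rate=1e-4)
hours = np.array([0, 500, 1000, 1500, 2000], dtype=float)
health = np.array([1.00, 0.93, 0.85, 0.79, 0.70])
twin.tune(hours, health)
print(f"estimated wear rate: {twin.wear_rate:.2e} per hour")
print(f"projected health at 5000 h: {twin.predict_health(5000):.2f}")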
6.1.3 Healthcare
Big data analytics has helped healthcare improve by providing personalized
medicine and prescriptive analytics, clinical risk intervention and predictive
analytics, waste and care variability reduction, automated external and internal
reporting of patient data, standardized medical terms and patient registries and
fragmented point solutions.
6.1.4 Technology
1. eBay.com uses two data warehouses of 7.5 PB and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and
merchandising.
2. Amazon.com handles millions of back-end operations every day, as well
as queries from more than half a million third-party sellers. The core
technology that keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB,
18.5 TB, and 24.7 TB.
3. Facebook handles 50 billion photos from its user base.
4. As of August 2012, Google was handling roughly 100 billion searches per
month.
5. Oracle NoSQL Database has been tested past the 1M ops/sec mark with
8 shards and proceeded to hit 1.2M ops/sec with 10 shards.
6.2 The Future of Big Data
Those who feel that today's big data is just a continuation of past information trends are as wrong as if they were to claim that a stone tablet is essentially the same as a tablet computer, or an abacus similar to a supercomputer. Today, we have more information than ever. But the importance of all
that information extends beyond simply being able to do more, or know more,
than we already do. The quantitative shift leads to a qualitative shift. Hav-
ing more data allows us to do new things that weren't possible before. In other words: more is not just more. More is new. More is better. More is different. Of
course, there are still limits on what we can obtain from or do with data. But
most of our assumptions about the cost of collecting and the difficulty of pro-
cessing data need to be overhauled. No area of human endeavor or industrial
sector will be immune from the incredible shakeup that's about to occur as big data plows through society, politics, and business. People shape their tools, and their tools shape them. This new world of data, and how companies can harness
it, bumps up against two areas of public policy and regulation. The first is
employment.
Big data will bring about great things in society. We like to think that
technology leads to job creation, even if it comes after a temporary period of
disruption. That was certainly true during the Industrial Revolution. To be
sure, it was a devastating time of dislocation, but it eventually led to better
livelihoods. Yet this optimistic outlook ignores the fact that some industries
simply never recover from change. When tractors and automobiles replaced
horse-drawn plows and carriages, the need for horses in the economy basically ended. The upheavals of the Industrial Revolution created political change and gave rise to new economic philosophies and political movements. It's not much
of an intellectual stretch to predict that new political philosophies and social
movements will arise around big data, robots, computers, and the Internet, and
the effect of these technologies on the economy and representative democracy.
Recent debates over income inequality and the Occupy movement seem to point
in that direction.
Big data will change business, and business will change society. The hope,
of course, is that the benefits will outweigh the drawbacks, but that is mostly
a hope. The big-data world is still very new, and, as a society, we're not very good at handling all the data that we can collect now. We also can't foresee the
future. Technology will continue to surprise us, just as it would an ancient man
with an abacus looking upon an iPhone. What is certain is that more will not be more: it will be different. Clearly, Big Data is still in its beginnings, and much more remains to be discovered. For most companies it is currently just a fashionable keyword: it has great potential, but not many truly know what it is all about. A clear sign that there is more to big data than is currently shown on the market is that the big software companies either do not have, or do not present, their Big Data solutions, and those that do, like Google, do not use them in a commercial way.
Companies need to decide what kind of strategy to use to implement Big Data. They could use a more revolutionary approach and move all the data to the new Big Data environment, so that all reporting, modeling and interrogation is executed using the new business intelligence built on Big Data. [1] This approach is already used by many analytics-driven organizations that put all their data in the Hadoop environment and build business intelligence solutions on top of it. Another approach is the evolutionary approach: Big Data becomes an input to the current BI platform. The data is accumulated and analyzed using structured and unstructured tools, and the results are sent to the data warehouse. Standard modeling and reporting tools then have access to social media sentiments, usage records, and other processed Big Data items. [1] One of the issues with the evolutionary approach is that even though it gains most of the capabilities of the Big Data environment, it also inherits most of the problems of the classic business intelligence solution, and in some cases it can create a bottleneck between the information coming from Big Data and the analytical power of the traditional BI or data warehouse solution.
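To illustrate the revolutionary approach described above, where reporting runs directly against data kept in the Hadoop environment, the sketch below uses PySpark to aggregate social media sentiment stored on HDFS. The paths, schema and field names are hypothetical assumptions rather than part of the report.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bi-on-hadoop-sketch").getOrCreate()

# Assumed layout: raw JSON events landed on HDFS by an ingestion job.
events = spark.read.json("hdfs:///data/social_media/2016/*.json")

daily_sentiment = (events
                   .filter(F.col("product").isNotNull())
                   .groupBy("event_date", "product")
                   .agg(F.avg("sentiment_score").alias("avg_sentiment"),
                        F.count("*").alias("mentions")))

# Reporting tools read the aggregate directly from the Big Data environment.
daily_sentiment.write.mode("overwrite").parquet("hdfs:///warehouse/daily_sentiment")
spark.stop()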
Chapter 7
Conclusion
The availability of Big Data, low-cost commodity hardware, and new infor-
mation management and analytic software have produced a unique moment in
the history of data analysis. The convergence of these trends means that we
have the capabilities required to analyze astonishing data sets quickly and cost-
effectively for the first time in history. These capabilities are neither theoretical
nor trivial. They represent a genuine leap forward and a clear opportunity to
realize enormous gains in terms of efficiency, productivity, revenue, and prof-
itability.
The Age of Big Data is here, and these are truly revolutionary times if both
business and technology professionals continue to work together and deliver on
the promise.
Bibliography
[1] "Introduction", www.wikipedia.com
[2] "Characteristics of Big Data", www.studymafia.org
[3] "Big Data Analytics", www.computerweekly.com
[4] "Storage of Big Data", www.computerweekly.com
46

More Related Content

What's hot

SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETSSELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
Светла Иванова
 
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
Alan McSweeney
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
IJECEIAES
 
Big Data visualization
Big Data visualizationBig Data visualization
Big Data visualization
Shilpa Soi
 
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET Journal
 
Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...
IJMTST Journal
 

What's hot (6)

SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETSSELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
 
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
 
Big Data visualization
Big Data visualizationBig Data visualization
Big Data visualization
 
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
 
Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...
 

Viewers also liked

Virtual reality check phase ii version 2.0
Virtual reality check   phase ii version 2.0Virtual reality check   phase ii version 2.0
Virtual reality check phase ii version 2.0guestf144706
 
Virtual reality
Virtual realityVirtual reality
Virtual reality
Amit Sinha
 
Autonomic computing seminar documentation
Autonomic computing seminar documentationAutonomic computing seminar documentation
Autonomic computing seminar documentation
Georgekutty Francis
 
What is big data?
What is big data?What is big data?
What is big data?
David Wellman
 
Big data ppt
Big data pptBig data ppt
Big data ppt
IDBI Bank Ltd.
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
Bernard Marr
 

Viewers also liked (8)

Virtual reality check phase ii version 2.0
Virtual reality check   phase ii version 2.0Virtual reality check   phase ii version 2.0
Virtual reality check phase ii version 2.0
 
Virtual reality
Virtual realityVirtual reality
Virtual reality
 
VIRTUAL REALITY
VIRTUAL REALITYVIRTUAL REALITY
VIRTUAL REALITY
 
Autonomic computing seminar documentation
Autonomic computing seminar documentationAutonomic computing seminar documentation
Autonomic computing seminar documentation
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 

Similar to Big data

Big Data Social Network Analysis
Big Data Social Network AnalysisBig Data Social Network Analysis
Big Data Social Network Analysis
Chamin Nalinda Loku Gam Hewage
 
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
IJCSIS Research Publications
 
Big Data - Insights & Challenges
Big Data - Insights & ChallengesBig Data - Insights & Challenges
Big Data - Insights & Challenges
Rupen Momaya
 
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf6510.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
Med labbi
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
BizuayehuDesalegn
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurt Portelli
 
Al-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectAl-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectLeila Al-Mqbali
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
Shital Kat
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - Report
Nandu B Rajan
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisOktay Bahceci
 
BIG DATA-Seminar Report
BIG DATA-Seminar ReportBIG DATA-Seminar Report
BIG DATA-Seminar Report
josnapv
 
Big Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in BetweenBig Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in Between
xband
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
Sridhar Mamella
 
Big data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightBig data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightJyrki Määttä
 
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESpredictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
EMERSON EDUARDO RODRIGUES
 
Secure and Smart IoT using Blockchain and AI
Secure and Smart  IoT using Blockchain and AISecure and Smart  IoT using Blockchain and AI
Secure and Smart IoT using Blockchain and AI
Ahmed Banafa
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
vinoth raja
 
Data mining tutorial
Data mining tutorialData mining tutorial
Data mining tutorial
grinu
 
Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.
KeyFruit Inc.
 

Similar to Big data (20)

Big Data Social Network Analysis
Big Data Social Network AnalysisBig Data Social Network Analysis
Big Data Social Network Analysis
 
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
 
Big Data - Insights & Challenges
Big Data - Insights & ChallengesBig Data - Insights & Challenges
Big Data - Insights & Challenges
 
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf6510.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertation
 
Al-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectAl-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research Project
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - Report
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_Analysis
 
BIG DATA-Seminar Report
BIG DATA-Seminar ReportBIG DATA-Seminar Report
BIG DATA-Seminar Report
 
Big Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in BetweenBig Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in Between
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
Big data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightBig data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sight
 
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESpredictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
 
Secure and Smart IoT using Blockchain and AI
Secure and Smart  IoT using Blockchain and AISecure and Smart  IoT using Blockchain and AI
Secure and Smart IoT using Blockchain and AI
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
 
Data mining tutorial
Data mining tutorialData mining tutorial
Data mining tutorial
 
Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.
 

Recently uploaded

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Big data

  • 1. Big Data Submitted in partial fulfillment of the requirements for the award of the Degree of Information Technology Submitted By Mr.Prashant Maruti Navatre (Regisration No.20130737) the guidance of Prof.S.S.Barphe DEPARMENT OF INFORMATION TECHNOLOGY DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY, LONERE, RAIGAD-MAHARASHTRA, INDIA-400103 2015-2016
  • 2. DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY LONERE, RAIGAD-MAHARASHTRA, INDIA-400103 Certificate This is to certify that the project entitled “Big Data” is submitted by Mr.Prashant Maruti Navatre , Registration No. 20130737 for the partial fulfilment of the requirement for the award of the degree of Bachelor of Technology in INFORMATION TECHNOLOGY of the Dr. Babasaheb Ambedkar Technological University, Lonere is a bonafide work carried out during the academic year 2015-2016. Prof.S.S.Barphe Dr.S.M.Jadhav (Seminar Guide) (Head Of Deparment) Information Technology Information Technology Examiners: 1). 2). Date: Place:Vidyavihar Lonere-402103
  • 3. Acknowledgment I am pleased to present this seminar report entitled “Big Data”. It is indeed a great pleasure and a moment of immense satisfaction for me to express my sense of profound gratitude and indebtedness towards my guide Prof. S.S.Baprhe whose enthusiasm are the source of inspiration for me. I am extremely thankful for the guidance and untiring attention, which he bestowed on me right from the beginning. Her valuable and timely suggestions at crucial stages and above all his constant encouragement have made it possible for me to achieve this work. I would also like to give my sincere thanks to S.M. JADHAV Head of INFORMATION TECHNOLOGY for necessary help and providing me the required facilities for completion of this seminar report. I would like to thank the entire Teaching staffs who are directly or indirectly involved in the various data collection and software assistance to bring forward this seminar report. I express my deep sense of gratitude towards my parents for their sustained cooperation and wishes, which have been a prime source of inspiration to take this seminar work to its end without any hurdles.Last but not the least, I would like to thank all my B.Tech. colleagues for their co-operation and useful suggestion and all those who have directly or indirectly helped me in completion of this seminar work. Date Prashant Maruti Navatre Place 20130737
  • 4. ABSTRACT Big data is a term for massive data sets having large, more varied and complex structure with the difficulties of storing, analyzing and visualizing for further processes or results. The process of research into massive amounts of data to reveal hidden patterns and secret correlations named as big data analytics. These useful informations for companies or organizations with the help of gain- ing richer and deeper insights and getting an advantage over the competition. For this reason, big data implementations need to be analyzed and executed as accurately as possible. This paper presents an overview of big data’s con- tent, scope, samples, methods, advantages and challenges and discusses privacy concern on it. Every day, we create 2.5 quintillion bytes(one quintillion bytes = one billion gigabytes). of data so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is Big Data. I
  • 5. Contents 1 Introduction 1 1.1 The General Concept Of Big Data . . . . . . . . . . . . . . . . . 2 1.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 The Need For Big Data . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Sources Of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Characteristics of Big Data 10 2.1 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5 Veracity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.6 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Storage,Selection and Processing of Big Data 16 3.1 Storage of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Key Requirements of Big Data . . . . . . . . . . . . . . . 18 3.2 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Processing of Big Data . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.1 Batch Processing . . . . . . . . . . . . . . . . . . . . . . 23 II
  • 6. CONTENTS CONTENTS 3.3.2 Stream Processing . . . . . . . . . . . . . . . . . . . . . 24 3.3.3 Hadoop Ecosystem . . . . . . . . . . . . . . . . . . . . . 25 3.3.4 Map and Reduce . . . . . . . . . . . . . . . . . . . . . . 25 4 Big Data Analytics 27 4.1 Examples of Big Data Analytics . . . . . . . . . . . . . . . . . . 29 4.2 Benefits of Big Data Analytics . . . . . . . . . . . . . . . . . . . 30 5 Challenges in Big Data 31 5.1 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.2.1 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.2.2 Ineffectiveness . . . . . . . . . . . . . . . . . . . . . . . . 34 5.3 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.4 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . 35 5.5 Behavioral change . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6 Applications and Future of Big Data 38 6.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.1.1 Government . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.1.2 Cyber-Physical Models . . . . . . . . . . . . . . . . . . . 40 6.1.3 Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.1.4 Technology . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.2 The Future of Big Data . . . . . . . . . . . . . . . . . . . . . . 42 7 Conclusion 45 References 46 III
  • 7. List of Figures 1.1 Visualization of daily Wikipedia edits created by IBM . . . . . . 2 1.2 Growth and Digitalization of Global and Information Storage capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Sources of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Architecture of Big Data . . . . . . . . . . . . . . . . . . . . . . 9 3.1 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 22 IV
  • 8. Chapter 1 Introduction In recent years , the term Big Data has emerged to describe a new paradigm for data applications. New Technologies tend to emerge with a lot of hype , but it can take some time to tell what is new and different. While big data has been defined in a myriad of ways, the heart of the Big Data paradigm is that too big (volume),arrives too fast(velocity),changes too fast(variability),conatins too much(vearcity),or is too diverse(variety) to be proceesed within a local com- puting structure using traditional approaches and techniques. The technologies being introduced to support this paradigm have a wide variety of interfaces making it difficult to construct tools and applications that integrate data from multiple Big Data sources. Analysis of data sets can find new correlations, to ”spot business trends, prevent diseases, combat crime and so on. Scientists, business executives, prac- titioners of media and advertising and governments alike regularly meet difficul- ties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including me- teorology, genomics, connectomics, complex physics simulations, and biological and environmental research. 1
  • 9. Big Data Dept.of Information Technology Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that en- able enhanced insight, decision making, and process automation.Data sets grow in size in part because they are increasingly being gathered by cheap and nu- merous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. 1.1 The General Concept Of Big Data The term Big Data is an imprecise description of a rich and complicated set of characteristics, practices, techniques, ethical issues, and outcomes all associated with data.Big Data originated in the physical sciences, with physics and astronomy early to adopt of many of the techniques now called Big Data. Instruments like the Large Hadron Collider and the Square Kilometer Array are massive collectors of exabytes of information, and the ability to collect such massive amounts of data necessitated an increased capacity to manipulate and analyze these data as well. Figure 1.1: Visualization of daily Wikipedia edits created by IBM DR.B.A.T.UNIVERSITY 2
  • 10. Big Data Dept.of Information Technology 1.2 Definition Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reduction and reduced risk. 1.3 History Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didnt have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didnt have those traditional forms. They didnt have to merge big data technologies with their traditional IT infrastructures because those infrastruc- tures didnt exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture. Consider, however, the position of large, well-established businesses. Big data in those environments shouldnt be separate, but must be integrated with every- thing else thats going on in the company.Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work DR.B.A.T.UNIVERSITY 3
  • 11. Big Data Dept.of Information Technology alongside IBM mainframes.Data scientists must somehow get along and work jointly with mere quantitative analysts.In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fit in to their overall data and analytics environments. Overall, we found the expected co-existence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which well call Analytics 3.0. In this paper well describe the overall context for how organizations think about big data, the organizational structure and skills required for itetc. Well conclude by describing the Analytics 3.0 era. In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as be- ing three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).Gartner, and now much of the industry, continue to use this ”3Vs” model for describ- ing big data.In 2012, Gartner updated its definition as follows: ”Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” Additionally, a new V ”Veracity” is added by some organizations to describe it. 1.4 The Need For Big Data Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings. Like traditional analytics, it can also support internal business decisions. The technologies and concepts DR.B.A.T.UNIVERSITY 4
  • 12. Big Data Dept.of Information Technology Figure 1.2: Growth and Digitalization of Global Information Storage Capacity behind big data allow organizations to achieve a variety of objectives, but most of the organizations we interviewed were focused on one or two. The chosen objectives have implications not only for the outcome and financial benefits from big data, but also for the process: who leads the initiative, where it fits within the organization, and how to manage the project. As the world becomes more connected via technology, the amount of data flowing into companies is growing exponentially and identifying value in that data becomes more difficult; as the data haystack grows larger, the needle becomes more difficult to find. So Big Data is really about finding the needles: gathering, sorting and analyzing the flood of data to find the valuable information on which sound business decisions are made. When applied to energy-related businesses, Big Data implications vary by market segment; Big Data concerns for utilities are not the same as those for energy trading organizations, yet the necessity for solving those problems can be equally pressing. While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a more sound distinction between big data and Business DR.B.A.T.UNIVERSITY 5
  • 13. Big Data Dept.of Information Technology Intelligence, regarding data and their use: Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends etc. Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density in order to reveal relationships and dependencies and to perform predictions of outcomes and behaviors. The real issue is not that you are acquiring large amounts of data. It’s what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1. cost reductions 2. time reductions 3. new product development and optimized offerings 4. smarter business decision making 1.5 Sources Of Big Data The sources and formats of data continue to grow in variety and complexity. A partial list of sources includes the public web; social media; mobile applications; federal, state and local records and databases; commercial databases that aggregate individual data from a spectrum of commercial transactions and public records; geospatial data; surveys; and traditional offline documents scanned by optical character recognition into electronic form. The advent of more Internet-enabled devices and sensors expands the capacity to collect data from DR.B.A.T.UNIVERSITY 6
  • 14. Big Data Dept.of Information Technology physical entities, including sensors and radio-frequency identifica-tion (RFID) chips. Personal location data can come from GPS chips, cell-tower triangula- tion of mobile devices, mapping of wireless networks, and in-person payments There are many different types of Big Data sources e.g.: 1. Social media data 2. Personal data (e.g. data from tracking devices) 3. Sensor data 4. Transactional data 5. Enterprise data There are different opinions on whether Enterprise data should be consid- ered to be Big Data or not. Enterprise data are usually large in volume, they are generated for a different purpose and arise organically through Enterprise processes. Also the content of Enterprise data is usually not designed by re- searchers. For these reasons, and because there is a great potential in using Enterprise data, we will consider it to be in scope for this report. There are a number of differences between Enterprise data and other types of Big Data that are worth pointing out.The amount of control a researcher has and the potential inferential power vary between different types of Big Data sources. For example, a researcher will likely not have any control of data from different social media platforms and it could be difficult to decipher a text from social media. For Enterprise data on the other hand, a statistical agency can form partnership with owners of the data and influence the design of the data. En- terprise data is more structured, well defined and more is known about the data than perhaps other Big Data sources. DR.B.A.T.UNIVERSITY 7
  • 15. Big Data Dept.of Information Technology Figure 1.3: Sources of Big Data 1.6 Architecture In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying. Structured, semi-structured and/or unstructured data is stored and distributed across multiple servers. Querying of the data is done using a modified C++ dialect called ECL, which applies a schema-on-read method to create the structure of the stored data at query time. In 2004 LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. and their high-speed parallel processing platform. The two platforms were merged into HPCC Systems, which was open sourced in 2011 under the Apache v2.0 License. Currently HPCC and the Quantcast File System are the only publicly available platforms capable of analyzing multiple exabytes of data. In 2004, Google published a paper on a process called MapReduce that used such an architecture. The MapReduce framework provides a parallel processing model and associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and DR.B.A.T.UNIVERSITY 8
  • 16. Big Data Dept.of Information Technology Figure 1.4: Architecture of Big Data processed in parallel (the Map step). The results are then gathered and deliv- ered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open source project named Hadoop. Recent studies show that the use of a multiple layer architecture is an option for dealing with big data. The Distributed Parallel architecture distributes data across multiple processing units and parallel processing units provide data much faster, by improving processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power trans- parent to the end user by using a front end application server. DR.B.A.T.UNIVERSITY 9
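The Map and Reduce flow just described (split the input, map each split in parallel, shuffle the intermediate results by key, then reduce) can be illustrated without any Hadoop cluster at all. Below is a minimal, self-contained Python sketch of the same idea applied to word counting; the toy documents, the four-process pool and the word-count task are assumptions made for the example, not part of the architecture above.

from collections import defaultdict
from multiprocessing import Pool

def map_phase(document):
    # Map step: emit (key, value) pairs for one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_outputs):
    # Shuffle step: group all values emitted for the same key.
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_phase(item):
    # Reduce step: aggregate the grouped values for one key.
    key, values = item
    return key, sum(values)

if __name__ == "__main__":
    documents = [                      # toy input splits (illustrative)
        "big data needs parallel processing",
        "mapreduce splits work across parallel nodes",
        "results are gathered in the reduce step",
    ]
    with Pool(processes=4) as pool:
        mapped = pool.map(map_phase, documents)                          # Map in parallel
        counts = dict(pool.map(reduce_phase, shuffle(mapped).items()))   # Shuffle, then Reduce
    print(counts["parallel"])  # prints 2

On a real cluster the same three phases run distributed: map tasks execute on the nodes that hold the file blocks, and the shuffle moves the intermediate pairs over the network before the reduce tasks aggregate them.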
  • 17. Chapter 2 Characteristics of Big Data 2.1 Volume The quantity of generated data is important in this context. The size of the data determines the value and potential of the data under consideration, and whether it can actually be considered big data or not. The name big data itself contains a term related to size, and hence the characteristic. Many fac- tors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increas- ing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data. This refers to the sheer amount of data available for analysis.This volume of data is driven by the increasing number of data collection instruments (e.g., social me- dia tools, mobile applications, sensors) as well as the increased ability to store and transfer those data with recent improvements in data storage and network- ing.Traditionally, the data volume requirements for analytic and transactional applications were in sub-terabyte territory.However, over the past decade, more organizations in diverse industries have identified requirements for analytic data 10
  • 18. Big Data Dept.of Information Technology volumes in the terabytes, petabytes,and beyond. Estimates produced by lon- gitudinal studies started in 2005[8] show that the amount of data in the world is doubling every two years. Should this trend continue, by 2020, there will be 50 times the amount of data as there had been in 2011. Other estimates indicate that 90 % of all data ever created, was created in the past 2 years.The sheer volume of the data are colossal - the era of a trillion sensors is upon us. This volume presents the most immediate challenge to conventional infor- mation technology structures. It has stimulated new ways for scalable storage across a collection of horizontally coupled resources, and a distributed approach to querying. Briefly, the traditional relational model has been relaxed for the persistence of newly prominent data types. These logical non-relational data models, typically lumped together as NoSQL, can currently be classified as Big Table, Name-Value, Document and Graphical models. 2.2 Velocity The term velocity in the context refers to the speed of generation of data or how fast the data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development.This refers to both the speed at which these data collection events can occur, and the pres- sure of managing large streams of real-time data.Across the means of collecting social information, new information is being added to the database at rates ranging from as slow as every hour or so, to as fast as thousands of events per second. In this context, the speed at which the data is generated and processed to meet the demands and the challenges that lie in the path of growth and de- velopment.Data is streaming in at unprecedented speed and must be dealt with in a timely manner.RFID tags,sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal DR.B.A.T.UNIVERSITY 11
  • 19. Big Data Dept.of Information Technology with data velocity is a challenge for most organizations The type of content, and an essential fact that data analysts must know.This helps people who are associated with and analyze the data to effectively use the data to their advan- tage and thus uphold its importance.The Velocity is the speed/rate at which the data are created, stored, analysed and visualized. Traditionally, most en- terprises separated their transaction processing and analytics. Enterprise data analytics were concerned with batch data extraction, processing, replication, delivery, and other applications. But increasingly, organizations everywhere have begun to emphasize the need for real-time, streaming, continuous data discovery, extraction, processing, analysis, and access. In the big data era, data are created in real-time or near real-time. With the availability of Internet connected devices, wireless or wired, machines and devices can pass-on their data the moment it is created. Data Flow rates are increasing with enormous speeds and variability, creating new challenges to enable real or near real-time data usage. Traditionally this concept has been described as streaming data. As such there are aspects of this that are not new, as companies such as those in telecommunication have been sifting through high volume and velocity data for years. The new horizontal scaling approaches do however add new big data engineering options for efficiently handling this data. 2.3 Variety Data today comes in all types of formats. Structured, numeric data in tra- ditional databases. Information created from line-of-business applications. Un- structured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with. The type of content, and an essential fact that data analysts must know. This helps people who are asso- DR.B.A.T.UNIVERSITY 12
  • 20. Big Data Dept.of Information Technology ciated with and analyze the data to effectively use the data to their advantage and thus uphold its importance.Variety refers to the complexity of formats in which Big Data can exist. Besides structured databases, there are large streams of unstructured documents, images, email messages, video, links between de- vices and other forms that create a heterogeneous set of data points. One effect of this complexity is that structuring and tying data together becomes a ma- jor effort, and therefore a central concern of Big Data analysis. Traditionally, enterprise data implementations for analytics and transactions operated on a single structured, row-based, relational domain of data. However, increasingly, data applications are creating, consuming, processing, and analysing data in a wide range of relational and non-relational formats including structured, un- structured, semistructured, documents and so forth from diverse application domains.Traditionally, a variety of data was handled through transforms or pre-analytics to extract features that would allow integration with other data through a relational model. Given the wider range of data formats, structures, timescales and semantics that are desirous to use in analytics, the integration of this data becomes more complex. This challenge arises as data to be integrated could be text from social networks, image data, or a raw feed directly from a sensor source. The Internet of Things is the term used to describe the ubiquity of connected sensors, from RFID tags for location, to smartphones, to home utility meters. The fusion of all of this streaming data will be a challenge for developing a total situational awareness. Big Data Engineering has spawned data storage models that are more efficient for unstructured data types than a relational model, causing a derivative issue for the mechanisms to integrate this data. It is possible that the data to be integrated for analytics may be of such volume that it cannot be moved in order to integrate, or it may be that some of the data are not under control of the organization creating the DR.B.A.T.UNIVERSITY 13
  • 21. Big Data Dept.of Information Technology data system. In either case, the variety of big data forces a range of new big data engineering in order to efficiently and automatically integrate data that is stored across multiple repositories and in multiple formats. 2.4 Variability The inconsistency the data can show at times-which can hamper the process of handling and managing the data effectively This is a factor which can be a problem for those who analyse the data. This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively. Variability refers to changes in data rate, format/structure, semantics, and/or quality that impact the supported application, analytic, or problem. Specifically, variability is a change in one or more of the other Big Data characteristics. Impacts can include the need to refactor architectures, interfaces, processing/algorithms, integration/fusion, storage, applicability, or use of the data. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved.The other characteristics directly affect the scope of the impact for a change in one dimension. For, example in a system that deals with petabytes or exabytes of data refactoring the data architecture and performing the necessary transformation to accommodate a change in structure from the source data may not even be feasible even with the horizontal scaling typically associated with big data architectures. In addition, the trend to integrate data from outside the organization to obtain more refined analytic results combined with the rapid evolution in technology means that enterprises must be able to adapt rapidly to data variations. DR.B.A.T.UNIVERSITY 14
  • 22. Big Data Dept.of Information Technology 2.5 Veracity The quality of captured data, which can vary greatly.Accurate analysis de- pends on the veracity of source data. Veracity refers to the trustworthiness, applicability, noise, bias, abnormality and other quality properties in the data. Veracity is a challenge in combination with other Big Data characteristics, but is essential to the value associated with or developed from the data for a specific problem/application. Assessment, understanding, exploiting, and controlling Veracity in Big Data cannot be addressed efficiently and sufficiently through- out the data lifecycle using current technologies and techniques. 2.6 Complexity Data management can be very complex, especially when large volumes of data come from multiple sources. Data must be linked, connected, and correlated so users can grasp the information the data is supposed to convey. Today’s data comes from multiple sources. And it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. DR.B.A.T.UNIVERSITY 15
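To make the variety discussed in Section 2.3 concrete, the sketch below shows one and the same purchase event in three of the data shapes named in this chapter: a flat relational-style row, a simple name-value pair, and a nested document of the kind a document store would hold. The field names and values are invented for illustration.

import json

# One purchase event in three shapes (invented example values).
row = ("U1001", "2015-11-02", 499.00, "Pune")            # relational row: fixed columns
kv_pair = ("purchase:U1001:2015-11-02", "499.00|Pune")   # name-value pair: opaque value under one key
document = {                                             # document model: nested and self-describing
    "user": "U1001",
    "date": "2015-11-02",
    "amount": 499.00,
    "items": [{"sku": "B-17", "qty": 2}],
    "channel": {"type": "mobile", "os": "Android"},
}

print(json.dumps(document, indent=2))  # documents serialize naturally to JSON

Integrating such heterogeneous shapes is exactly the effort Section 2.3 flags: the row assumes a fixed schema, while the document's structure may differ from one record to the next.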
  • 23. Chapter 3 Storage,Selection and Processing of Big Data 3.1 Storage of Big Data The explosive growth of data has more strict requirements on storage and management. In this section, we focus on the storage of big data. Big data stor- age refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to pro- vide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data. Traditionally, as auxiliary equipment of server, data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage device is becoming increasingly more important, and many Internet companies pursue big capacity of storage to be competitive. Therefore, there is a compelling need for research on data stor- age.Storage system for massive data.Various storage systems emerge to meet 16
  • 24. Big Data Dept.of Information Technology the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while net- work storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN). In DAS, various harddisks are directly con- nected with servers, and data management is server-centric, such that storage devices are peripheral equipments, each of which takes a certain amount of I/O resource and is managed by an individual application software. For this reason, DAS is only suitable to interconnect servers with a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly lim- ited. Thus, DAS is mainly used in personal computers and small-sized servers. Network storage is to utilize network to provide users with a union interface for data access and sharing. Network storage equipment includes special data exchange equipments, disk array, tap library, and other storage media, as well as special storage software. It is characterized with strong expandability. NAS is actually an auxillary storage equipment of a network. It is directly connected to a network through a hub or switch through TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively since the server accesses a storage device in- directly through a network. While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data stor- age management is relatively independent within a storage local area network, where multipath based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management. DR.B.A.T.UNIVERSITY 17
  • 25. Big Data Dept.of Information Technology 3.1.1 Key Requirements of Big Data At root, the key requirements of big data storage are that it can handle very large amounts of data and keep scaling to keep up with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools The largest big data practitioners Google, Face- book, Apple, etc run what are known as hyperscale computing environments. These comprise vast amounts of commodity servers with direct-attached storage (DAS). Redundancy is at the level of the entire compute/storage unit, and if a unit suffers an outage of any component it is replaced wholesale, having already failed over to its mirror. Such environments run the likes of Hadoop, NoSQL and Cassandra as analytics engines, and typically have PCIe flash storage alone in the server or in addition to disk to cut storage latency to a minimum. Theres no shared storage in this type of configuration. Hyperscale computing environ- ments have been the preserve of the largest web-based operations to date, but it is highly probable that such compute/storage architectures will bleed down into more mainstream enterprises in the coming years. The appetite for build- ing hyperscale systems will depend on the ability of an enterprise to take on a lot of in-house hardware building and maintenance and whether they can jus- tify such systems to handle limited tasks alongside more traditional enterprise environments that handle large amounts of applications on less specialised sys- tems. But hyperscale is not the only way. Many enterprises, and even quite small businesses, can take advantage of big data analytics. They will need the ability to handle relatively large data sets and handle them quickly, but may not need quite the same response times as those organisations that use it push adverts out to users over response times of a few seconds. So the key type of big data storage system with the attributes required will often be scale- out or clustered NAS. This is file access shared storage that can scale out to DR.B.A.T.UNIVERSITY 18
  • 26. Big Data Dept.of Information Technology meet capacity or increased compute requirements and uses parallel file systems that are distributed across many storage nodes that can handle billions of files without the kind of performance degradation that happens with ordinary file systems as they grow. For some time, scale-out or clustered NAS was a distinct product category, with specialised suppliers such as Isilon and BlueArc. But a measure of the increasing importance of such systems is that both of these have been bought relatively recently by big storage suppliers EMC and Hitachi Data Systems, respectively. Meanwhile, clustered NAS has gone mainstream, and the big change here was with NetApp incorporating true clustering and petabyte/parallel file system capability into its Data ONTAP OS in its FAS filers. The other storage format that is built for very large numbers of files is object storage. This tackles the same challenge as scale-out NAS that tradi- tional tree-like file systems become unwieldy when they contain large numbers of files. Object-based storage gets around this by giving each file a unique identifier and indexing the data and its location. Its more like the DNS way of doing things on the internet than the kind of file system were used to.Object storage systems can scale to very high capacity and large numbers of files in the bil- lions, so are another option for enterprises that want to take advantage of big data. Having said that, object storage is a less mature technology than scale- out NAS. So, to sum up, big data storage needs to be able to handle capacity and provide low latency for analytics work. You can choose to do it like the big boys in hyperscale environments or adopt NAS or object storage in more tradi- tional IT departments to do the job. Flash storage solutions, implemented at the server level and with all-flash arrays, offer some interesting alternatives for high-performance, low-latency storage, from a few terabytes to a hundred ter- DR.B.A.T.UNIVERSITY 19
  • 27. Big Data Dept.of Information Technology abytes or more in capacity. Object-based, scale-out architectures with erasure coding can provide scalable storage systems that eschew traditional RAID and replication methods to achieve new levels of efficiency and lower per-gigabyte costs. 3.2 Selection of Big Data Every organization seeking to make sense of big data must determine which platforms and tools, in the sea of available options, will help them to meet their business goals.Answering the following eight questions can help guide IT lead- ers to make the right data management choices for their organizations future success. For organizations needing to store and process tens of terabytes of data, using an open-source distributed file system is a mature choice due to its predictable scalability over clustered hardware. Plus, its the base platform for many big data architectures already. However, if looking to run analytics in online or real-time applications, consider hybrid architectures containing dis- tributed file systems combined with distributed database management systems (which have lower latency. Or look at large traditional relational systems to get real-time access to data that has been through the heavy lifting processes of a distributed file system. Many NoSQL databases require specific applica- tion interfaces (APIs) in order to access the data. With this, youll need to consider the integration of visualization or other tools that will need access to the data. If the tools being used with the big data platform need a SQL interface, choose a tool that has maturity in that area. Of note, NoSQL and big data platforms are evolving quickly and businesses just starting to build custom applications on top of a big data platform may be able to build around the sometimes raw data access frameworks. Alternatively, businesses with ex- isting applications will need a more mature offering. If data requirements are DR.B.A.T.UNIVERSITY 20
  • 28. Big Data Dept.of Information Technology especially unstructured, or include streaming data sources such as social media or video, businesses should look into data serialization technologies that allow capture, storage and representation of such high-velocity data. How applications consume data should also be taken into consideration. For instance, some existing tools allow users to project different structures across the data store, giving flexibility to store data in one way and access it in another. Yes, being flexible in how data is presented to consuming applications is a benefit, but the performance may not be good enough for high-velocity data. To overcome this performance challenge, you may need to integrate with a more structured data store further downstream in your data architecture. If looking to extend your current data architecture by integrating a big data platform into an existing data warehouse, data integration tools can help. Many integration vendors that support big data platforms also have specialized support for integrating with SQL data warehouses and data marts. Clearly, choosing a big data solution isn't easy. As companies of all sizes try to extract more from their existing data stores, Big Data vendors are rushing in to provide a range of Big Data solutions, which comprise everything from database technology to visualization tools. With such a diverse selection of tools to choose from, buyers must carefully define their goals in order to find the right tools to meet them. Before finding the right tools, however, organizations must first ask themselves what business problems they're trying to solve – and why. "Too many big data projects don't start with problems to solve, but rather start with exploratory analytics," said Chris Selland, VP of marketing and business development for HP Vertica. "That's okay to a point, but eventually these questions need to be asked and answered." Companies have a lot of Big Data and many questions, but that doesn't result DR.B.A.T.UNIVERSITY 21
  • 29. Big Data Dept.of Information Technology in the CIO or CFO simply handing you a large amount of money to work with. Figure 3.1: Selection of Big Data 3.3 Processing of Big Data A variety of platforms have emerged to process big data,including advanced SQL (sometimes called NewSQL) databases that adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms that may range from file systems to document or columnar data stores that typically dispense with the need for modelling data. Most of the early implementations of big data, especially with NoSQL platforms such as Hadoop, have focused more on volume and variety, with results delivered through batch processing. Behind the scenes, there is a growing range of use cases that also emphasise speed. Some of them consist of new applications that take advantage not only of powerful back-end data platforms, but also the growth in bandwidth and mobility. Examples include mobile applications such as Waze that harness sen- sory data from smartphones and GPS devices to provide real-time pictures of traffic conditions. On the horizon there are opportunities for mobile carriers DR.B.A.T.UNIVERSITY 22
  • 30. Big Data Dept.of Information Technology to track caller behaviour in real time to target ads, location-based services, or otherwise engage their customers, as well as Conversely, existing applications are being made more accurate, responsive and effective as smart sensors add more data points, intelligence and adaptive control. These are as diverse as optimising supply chain inventories, regulating public utility and infrastruc- ture networks, or providing real-time alerts for homeland security. The list of potential opportunities for fast processing of big data is limited only by the imagination. 3.3.1 Batch Processing Apache Hadoop is a distributed computing framework modeled after Google MapReduce to process large amounts of data in parallel. Once in a while, the first thing that comes to my mind when speaking about distributed computing is EJB. EJB is de facto a component model with remoting capability but short of the critical features being a distributed computing framework, that include computational parallelization, work distribution, and tolerance to unreliable hardware and software. Hadoop on the other hand has these merits built-in. ZooKeeper modeled on Google Chubby is a centralized service for maintain- ing configuration information, naming, providing distributed synchronization, and group services for the Hadoop cluster. Hadoop Distributed File System (HFDS) modeled on Google GFS is the underlying file system of a Hadoop cluster. HDFS works more efficiently with a few large data files than numer- ous small files. A real-world Hadoop job typically takes minutes to hours to complete, therefore Hadoop is not for real-time analytics, but rather for offline, batch data processing. Recently, Hadoop has undergone a complete overhaul for improved maintainability and manageability. Something called YARN (Yet Another Resource Negotiator) is at the center of this change. One major ob- DR.B.A.T.UNIVERSITY 23
  • 31. Big Data Dept.of Information Technology jective of Hadoop YARN is to decouple Hadoop from MapReduce paradigm to accommodate other parallel computing models, such as MPI (Message Passing Interface and Spark. In general, data flows from components to components in an enterprise ap- plication. This is the case for application frameworks (EJB and Spring frame- work), integration engines (Camel and Spring Integration), as well as ESB (Enterprise Service Bus) products. Nevertheless, for the data-intensive pro- cesses Hadoop deals with, it makes better sense to load a big data set once and perform various analysis jobs locally to minimize IO and network cost, the so-called ”Move-Code-To-Data” philosophy. When you load a big data file to HDFS, the file is split into chunks (or file blocks) through a centralized Name Node (master node) and resides on individual Data Nodes (slave nodes) in the Hadoop cluster for parallel processing. 3.3.2 Stream Processing Stream data processing is not intended to analyze a full big data set, nor is it capable of storing that amount of data (The Storm-on-YARN project is an exception). While you may be asked to build a real-time ad-hoc analytics system that operates on a complete big data set, you really need some mighty tools. Twitter Storm is an open source, big-data processing system intended for distributed, real-time streaming processing. Storm implements a data flow model in which data (time series facts) flows continuously through a topology (a network of transformation entities). The slice of data being analyzed at any moment in an aggregate function is specified by a sliding window, a concept in CEP/ESP. A sliding window may be like ”last hour”, or ”last 24 hours”, which is constantly shifting over time. Data can be fed to Storm through distributed DR.B.A.T.UNIVERSITY 24
  • 32. Big Data Dept.of Information Technology messaging queues like Kafka, Kestrel, and even regular JMS. Trident is an ab- straction API of Storm that makes it easier to use. Like Twitter Storm, Apache S4 is a product for distributed, scalable, continuous, stream data processing. Note, the size of a sliding window cannot grow infinitely. 3.3.3 Hadoop Ecosystem Hadoop API is often considered low level, as it is not easy to program with. The quickly growing Hadoop ecosystem offers a list of abstraction techniques, which encapsulate and hide the programming complexity of Hadoop. Pig, Hive, Cascading, Crunch, Scrunch, Scalding, Scoobi, and Cascalog all aim to provide low cost entry to Hadoop programming. Pig, Crunch (Scrunch), and Cascading are data-pipe based techniques. A data pipe is a multi-stepped process, in which transformation, splitting, merging, and join may be conducted individually at each step. Thinking about a work flow in a general work flow engine, a data pipe is similar. Hive on the other hand works like a data warehouse by offering a SQL compatible interactive shell. Programs or shell scripts developed on top of these techniques are compiled to native Hadoop Map and Reduce classes behind the scene to run in the cluster. Given the simplified programming interfaces in conjunction with libraries of reusable functions, development productivity is greatly improved. 3.3.4 Map and Reduce A centralized JobTracker process in the Hadoop cluster moves your code to data. The code hereby includes a Map and a Reduce class. Put simply, a Map class does the heavy-lifting job of data filtering, transformation, and splitting. For better IO and network efficiency, a Mapper instance only processes the data chunks co-located on the same data node, a concept termed data locality (or DR.B.A.T.UNIVERSITY 25
  • 33. Big Data Dept.of Information Technology data proximity). Mappers can run in parallel on all the available data nodes in the cluster. The outputs of the Mappers from different nodes are shuffled through a particular algorithm to the appropriate Reduce nodes. A Reduce class by nature is an aggregator. The number of Reducer instances is configurable to developers. DR.B.A.T.UNIVERSITY 26
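One common low-level way to write the Map and Reduce classes described above without Java is Hadoop Streaming, which lets any executable play the Mapper and Reducer roles by reading standard input and writing tab-separated key/value lines. The single-file job below is a hedged sketch of that pattern for word counting; the file name and the local test pipeline are assumptions, not the report's own setup.

#!/usr/bin/env python
# wordcount.py -- a hypothetical Hadoop Streaming job in one file.
# Local test: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def mapper():
    # Mapper role: split each input line and emit (word, 1) pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Reducer role: aggregate counts; input arrives grouped by key
    # (Hadoop's shuffle, or the local `sort`, guarantees the grouping).
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

On a cluster such a script is typically submitted through the hadoop-streaming jar, supplied as both the -mapper and -reducer commands; the JobTracker then moves the code to the Data Nodes holding the splits, exactly as described above.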
  • 34. Chapter 4 Big Data Analytics Big data is now a reality: The volume, variety and velocity of data coming into your organization continue to reach unprecedented levels. This phenom- enal growth means that not only must you understand big data in order to decipher the information that truly counts, but you also must understand the possibilities of big data analytics. Big data analytics is the process of examin- ing big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions can’t touch. Consider that your or- ganization could accumulate (if it hasn’t already) billions of rows of data with hundreds of millions of data combinations in multiple data stores and abundant formats. High-performance analytics is necessary to process that much data in order to figure out what’s important and what isn’t. Enter big data analytics. Why collect and store terabytes of data if you can’t analyze it in full context? Or if you have to wait hours or days to get results? With new advances in computing technology, there’s no need to avoid tackling even the most chal- lenging business problems. For simpler and faster processing of only relevant data, you can use high-performance analytics. Using high-performance data mining, predictive analytics, text mining, forecasting and optimization on big 27
  • 35. Big Data Dept.of Information Technology data enables you to continuously drive innovation and make the best possible decisions. In addition, organizations are discovering that the unique properties of machine learning are ideally suited to addressing their fast-paced big data needs in new ways. Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics andstatistical analysis. Mainstream BI software and data visualiza- tion tools can also play a role in the analysis process. But the semi-structured and unstructured data may not fit well in traditional data warehouses based on relational databases. Furthermore, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated fre- quently or even continually – for example, real-time data on the performance of mobile applications or of oil and gas pipelines. As a result, many organizations looking to collect, process and analyze big data have turned to a newer class of technologies that includes Hadoop and related tools such as YARN,MapReduce, Spark, Hive and Pig as well as NoSQL databases. Those technologies form the core of an open source software framework that supports the processing of large and diverse data sets across clustered systems. In some cases, Hadoop clusters and NoSQL systems are being used as landing pads and staging areas for data before it gets loaded into a data warehouse for analysis, often in a summarized form that is more conducive to relational structures. Increasingly though, big data vendors are pushing the concept of a Hadoop data lake that serves as the central repository for an organization’s incoming streams of raw data. In such architectures, subsets of the data can then be filtered for analysis in data warehouses and analytical databases, or it can be analyzed directly in Hadoop using batch query tools, stream processing DR.B.A.T.UNIVERSITY 28
  • 36. Big Data Dept.of Information Technology software and SQL on Hadoop technologies that run interactive, ad hoc queries written in SQL. Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of internal analytics skills and the high cost of hiring experienced analytics professionals. The amount of information that’s typically involved, and its variety, can also cause data management headaches, including data quality and consistency issues. In addition, integrating Hadoop systems and data warehouses can be a challenge, although various vendors now offer software connectors between Hadoop and relational databases, as well as other data integration tools with big data capabilities. 4.1 Examples of Big Data Analytics As the technology that helps an organization to break down data silos and analyze data improves, business can be transformed in all sorts of ways. Accord- ing to Datamation, today’s advances in analyzing Big Data allow researchers to decode human DNA in minutes, predict where terrorists plan to attack, deter- mine which gene is mostly likely to be responsible for certain diseases and, of course, which ads you are most likely to respond to on Facebook. The business cases for leveraging Big Data are compelling. For instance, Netflix mined its subscriber data to put the essential ingredients together for its recent hit House of Cards, and subscriber data also prompted the company to bring Arrested Development back from the dead. Another example comes from one of the biggest mobile carriers in the world. France’s Orange launched its Data for Development project by releasing sub- scriber data for customers in the Ivory Coast. The 2.5 billion records, which were made anonymous, included details on calls and text messages exchanged between 5 million users. Researchers accessed the data and sent Orange propos- DR.B.A.T.UNIVERSITY 29
  • 37. Big Data Dept.of Information Technology als for how the data could serve as the foundation for development projects to improve public health and safety. Proposed projects included one that showed how to improve public safety by tracking cell phone data to map where people went after emergencies; another showed how to use cellular data for disease containment. 4.2 Benefits of Big Data Analytics Enterprises are increasingly looking to find actionable insights into their data. Many big data projects originate from the need to answer specific business questions. With the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management. 1. Webopedia parent company, QuinStreet, surveyed 540 enterprise decision- makers involved in big data purchases to learn which business areas com- panies plan to use Big Data analytics to improve operations. 2. About half of all respondents said they were applying big data analytics to improve customer retention, help with product development and gain a competitive advantage. 3. The business area getting the most attention relates to increasing efficien- cies and optimizing operations.62 percent of respondents said that they use big data analytics to improve speed and reduce complexity. DR.B.A.T.UNIVERSITY 30
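As a small, concrete illustration of the aggregation work this chapter describes, the sketch below scans a large transaction log one row at a time and keeps only running totals, so the full data set never needs to fit in memory. The file name, the column names and the revenue-by-region question are assumptions made for the example; at real scale the same query would normally be pushed down to Hive, Spark or an analytical database rather than run in a single Python process.

import csv
from collections import Counter

def revenue_by_region(path):
    # Stream the file row by row; memory use stays constant
    # no matter how large the log grows.
    totals = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return totals

if __name__ == "__main__":
    totals = revenue_by_region("transactions.csv")   # hypothetical input file
    for region, amount in totals.most_common(3):     # three largest regions
        print(region, round(amount, 2))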
  • 38. Chapter 5 Challenges in Big Data 5.1 Security The biggest challenge for big data from a security point of view is the protection of users' privacy. Big data frequently contains huge amounts of personally identifiable information, and therefore the privacy of users is a huge concern. Because of the large amount of data stored, breaches affecting big data can have more devastating consequences than the data breaches we normally see in the press. This is because a big data security breach will potentially affect a much larger number of people, with consequences not only from a reputational point of view, but with enormous legal repercussions. When producing information for big data, organizations have to ensure that they have the right balance between utility of the data and privacy. Before the data is stored it should be adequately anonymised, removing any unique identifier for a user. This in itself can be a security challenge, as removing unique identifiers might not be enough to guarantee that the data will remain anonymous: the anonymized data could be cross-referenced with other available data using de-anonymization techniques. When storing the data, organizations will face the problem of encryption. Data cannot be sent encrypted by the users if the cloud needs to perform operations over the data. A solution for this is to use 31
  • 39. Big Data Dept.of Information Technology Fully Homomorphic Encryption (FHE), which allows data stored in the cloud to perform operations over the encrypted data so that new encrypted data will be created. When the data is decrypted the results will be the same as if the operations were carried out over plain text data. Therefore, the cloud will be able to perform operations over encrypted data without knowledge of the underlying plain text data. While using big data a significant challenge is how to establish ownership of information. If the data is stored in the cloud a trust boundary should be establish between the data owners and the data storage owners. Adequate access control mechanisms will be key in protecting the data. Access control has traditionally been provided by operating systems or applications restricting access to the information, which typically exposes all the information if the system or application is hacked. A better approach is to protect the information using encryption that only allows decryption if the entity trying to access the information is authorised by an access control pol- icy. An additional problem is that software commonly used to store big data, such as Hadoop, doesnt always come with user authentication by default. This makes the problem of access control worse, as a default installation would leave the information open to unauthenticated users. Big data solutions often rely on traditional firewalls or implementations at the application layer to restrict access to the information. The main solution to ensuring that data remains protected is the adequate use of encryption. For example, Attribute-Based Encryption can help in pro- viding fine-grained access control of encrypted data.Anonymizing the data is also important to ensure that privacy concerns are addressed. It should be en- sured that all sensitive information is removed from the set of records collected. Real-time security monitoring is also a key security component for a big data DR.B.A.T.UNIVERSITY 32
  • 40. Big Data Dept.of Information Technology project. It is important that organizations monitor access to ensure that there is no unauthorised access. It is also important that threat intelligence is in place to ensure that more sophisticated attacks are detected and that the organizations can react to threats accordingly. If an adequate governance framework is not applied to big data, then the data collected could be misleading and cause unexpected costs. The main problem from a governance point of view is that big data is a relatively new concept, and therefore few established procedures and policies exist. The challenge with big data is that the unstructured nature of the information makes it difficult to categorize, model and map the data when it is captured and stored. The problem is made worse by the fact that the data normally comes from external sources, often making it complicated to confirm its accuracy. Data hackers have become more damaging in the era of big data due to the availability of large volumes of publicly available data, the ability to store massive amounts of data on portable devices such as USB drives and laptops, and the accessibility of simple tools to acquire and integrate disparate data sources. According to the Open Security Foundation's DataLossDB project (http://datalossdb.org), hacking accounts for 28% of all data breach incidents, with theft accounting for an additional 24%, fraud accounting for 12%, and web-related loss accounting for 9% of all data loss incidents. More than half (57%) of all data loss incidents involve external parties, but 10% involve malicious actions on the part of internal parties, and an additional 20% involve accidental actions by internal parties. Private businesses, hospitals, and biomedical researchers are also making tremendous investments in the collection, storage, and analysis of large-scale data and private information. DR.B.A.T.UNIVERSITY 33
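A minimal sketch of the anonymisation step discussed in Section 5.1: direct identifiers are dropped and the user key is replaced by a salted one-way hash before the record is stored. The field names and the secret salt are illustrative placeholders; as the section itself warns, this pseudonymisation alone does not guarantee anonymity, because the remaining attributes can still be cross-referenced with other data sets.

import hashlib
import hmac

SECRET_SALT = b"replace-with-a-secret-key"           # illustrative placeholder, keep out of the data store
DIRECT_IDENTIFIERS = {"name", "phone", "email"}       # assumed identifier fields

def pseudonymise(record):
    # Drop direct identifiers and replace the user id with a keyed hash,
    # so the same user still links across records without exposing the id.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    digest = hmac.new(SECRET_SALT, record["user_id"].encode(), hashlib.sha256)
    cleaned["user_id"] = digest.hexdigest()[:16]
    return cleaned

if __name__ == "__main__":
    raw = {"user_id": "U1001", "name": "A. Sharma", "phone": "98220xxxxx",
           "city": "Pune", "purchase": 499.00}
    print(pseudonymise(raw))  # identifiers removed, user_id replaced by a hash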
  • 41. Big Data Dept.of Information Technology 5.2 Data Access 5.2.1 Inefficiency Securing and controlling access to data is a very time-consuming process. Currently, for most companies that are even aware that their data access pro- tocols are an issue, actual practices in securing that are inefficient, whether you are manually securing unstructured data, or whether it is data that is be- ing dealt with automatically. To secure it properly, enterprises need to assess where in the environment your data resides, assess how much data loss there is from the use of file servers and NAS devices, and even develop an inventory of whats available in your SharePoint deployments. All very time consuming, youll no doubt agree. 5.2.2 Ineffectiveness After answering the question as to who has access to your data, the next big question is whether they should have a given level of access or not. Does IT know what level of access should be offered to employees, and should this decision be left in the hands of the IT department at all .The chances are that they shouldnt be allowed decide, as this is ultimately a business decision. But, as in the case of most companies, there is no clear policy on who makes the decisions. Chances are in this situation theres also going to be a lot of unstructured and orphaned data lying around with no one to take responsibly for it. 5.3 Data Cleaning Data cleaning remains an important part of the process to ensure data qual- ity. The first is to verify that the quantitative and qualitative (i.e. categorical) DR.B.A.T.UNIVERSITY 34
  • 42. Big Data Dept.of Information Technology variables have been recorded as expected. The second involves removing outliers, which in the Big Data paradigm means the use of decision tree algorithms. But data cleaning itself is a subjective process (e.g. deciding which variables to consider) and not truly agnostic as would be desired, and thus open to philosophical debate (Bollier, 2010). 5.4 Data Representation Related to the question of data provenance is the issue of understanding the underlying population whose behavior has been captured. The large data sizes may make the sampling rate irrelevant, but they do not necessarily make the data representative. Everybody does not use Twitter, Facebook or even Google searches. For example, ITU estimates suggest that Internet usage is still limited to only 40 per cent of the world population. In other words, more than four billion people globally are not yet using the Internet, and 90 per cent of them are from the developing world. Of the world's three billion Internet users, two-thirds are from the developing countries. At the other end of the spectrum, even though mobile cellular penetration is close to 100%, this does not mean that every person in the world is using a mobile phone. This issue of representativeness is of high relevance when considering how telecommunication data may be used for monitoring and development. Whilst the promise in leveraging data from mobile network operators for monitoring and development hinges on its large coverage, nearing the actual population, it is still not the whole population. Questions such as the extent of coverage of the poor, or the levels of gender representation amongst telecom users, are all valid questions. Whilst the registration information might provide answers, the reality is that the demographic information on telecom subscribers for example is not always accurate. With pre-paid subscriptions being the norm in the majority of the developing world, DR.B.A.T.UNIVERSITY 35
  • 43. Big Data Dept.of Information Technology demographic information contained in the mobile operator records is practically useless, even with mandated registration. The issue of sampling bias is best illustrated by the case of Street Bump, a mobile app developed by Boston City Hall. Street Bump uses a phone's accelerometer to detect potholes and notify City Hall while the app's users drive around Boston. The app, however, introduces a selection bias, since it is biased towards the demographics of the app users, who often hail from affluent areas with greater smartphone ownership (Harford, 2014). Hence the Big in Big Data does not automatically mean that issues such as measurement bias and methodology, internal and external data validity, and inter-dependencies among data can be ignored. These are foundational issues not just for small data but also for Big Data (Boyd and Crawford, 2012). 5.5 Behavioral change For that matter, digitized online behavior can be subject to self-censorship and the creation of multiple personas, further muddying the waters. Thus studying the data exhaust of people may not always give us insights into real-world dynamics. This may be less of an issue with TGD, where in essence the data artifact is itself a byproduct of another activity. Telecom network Big Data, which mostly falls under this category, may be less susceptible to self-censorship and persona development. But that does not exclude the possibility either. It is not inconceivable that users may not use their mobiles, or may even turn them off, in areas where they do not wish their digital footprint to be left behind. In a way, Big Data analyses of behavioral data are subject to a form of the Heisenberg Uncertainty principle: as soon as the basic process of an analysis is known, there may be concerted efforts to exhibit different behavior and/or DR.B.A.T.UNIVERSITY 36
  • 44. Big Data Dept.of Information Technology actions to change the outcomes (Bollier, 2010). For example the famous Google page rank algorithm has spawned an entire industry of organizations that claim to enhance page ranks for websites. Search Engine Optimization (SEO) is now an established practice when developing websites.Change in behavior could also partly attribute to the declining verac- ity of Google Flu Trends. Researchers found that influenza-like-illness rates as exhibited by Google searches did not necessarily correlate with actual influenza virus infections (Ortiz et al., 2011). Recent research has shown that after 2009 (when it failed to catch the non-seasonal influenza outbreak of 2009), infre- quent updates, have not improved the results. In fact Google Flu Trends has persistently overestimated flu prevalence since 2009 (Lazer, Kennedy, King, and Vespignani, 2014). Google Flu Trends does not and cannot know what factors contributed to the strong correlations found in their initial work. The point is that the underlying real world actions of the population that turned to Google for its health queries and which contributed to the original correlations discov- ered by GFT, may have in-fact changed over time, diminishing the robustness of the original algorithm. For example the hoopla surrounding GFT could have even created rebound effects, with more and more people turning to Google for their broader health questions and thereby introducing additional search terms (due to different cultural norms and/or ground conditions), which can collec- tively introduce biases that GFT has not been able account for. Such possible problems could have been caught and resolved had the GFT method been more transparent. DR.B.A.T.UNIVERSITY 37
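The Google Flu Trends episode above is, at its core, a correlation that decayed once the underlying search behaviour changed. The toy numbers below are entirely invented, but they show the mechanism: a ratio calibrated in one period systematically overestimates in a later period in which people issue more searches per actual case.

# Synthetic illustration of behavioural drift (all numbers invented).
period1 = [(100, 10), (200, 20), (300, 30)]   # (search volume, actual cases) used for calibration
period2 = [(300, 10), (600, 20), (900, 30)]   # later period: three times as many searches per case

ratio = sum(c for _, c in period1) / sum(s for s, _ in period1)   # cases per search, fitted on period 1

for searches, actual in period2:
    predicted = ratio * searches
    print("predicted %.0f vs actual %d" % (predicted, actual))    # overestimates by a factor of three

Nothing inside the search data warns the model that the search-to-case relationship has shifted; only fresh ground truth of the kind the cited studies used reveals the drift.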
Chapter 6

Applications and Future of Big Data

6.1 Applications

Big data has increased the demand for information management specialists: Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole. Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007, and predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one third of the globally stored information is in the form of alphanumeric text and still-image data, which is the format most useful for most big data applications. This also shows the potential of yet-unused data (i.e. in the form of video and
audio content). While many vendors offer off-the-shelf solutions for Big Data, experts recommend the development of in-house solutions custom-tailored to the company's problem at hand, provided the company has sufficient technical capabilities.

6.1.1 Government

The use and adoption of Big Data within governmental processes is beneficial and allows efficiencies in terms of cost, productivity, and innovation. That said, this process does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome. Below are some leading examples within the governmental Big Data space.

United States of America

In 2012, the Obama administration announced the Big Data Research and Development Initiative to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. The United States Federal Government owns six of the ten most powerful supercomputers in the world. The Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of the information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes.
India

Big data analysis was partly responsible for helping the BJP and its allies win the Indian General Election of 2014. The Indian Government utilises numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.

6.1.2 Cyber-Physical Models

Current PHM (prognostics and health management) implementations mostly utilize data from actual usage, while analytical algorithms can perform more accurately when more information from throughout the machine's lifecycle, such as system configuration, physical knowledge and working principles, is included. There is a need to systematically integrate, manage and analyze machinery or process data during the different stages of the machine life cycle in order to handle data and information more efficiently and to achieve better transparency of machine health condition for the manufacturing industry. With this motivation, a cyber-physical (coupled) model scheme has been developed. The coupled model is a digital twin of the real machine that operates on the cloud platform and simulates the health condition with integrated knowledge from both data-driven analytical algorithms and other available physical knowledge. It can also be described as a 5S systematic approach consisting of Sensing, Storage, Synchronization, Synthesis and Service. The coupled model first constructs a digital image from the early design stage. System information and physical knowledge are logged during product design, and based on these a simulation model is built as a reference for future analysis. Initial parameters may be statistically generalized, and they can be tuned using data from testing or from the manufacturing process using parameter estimation. After that step, the simulation model can be considered a mirrored image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the increased connectivity offered by cloud computing technology, the coupled model also provides better accessibility of machine condition for factory managers in cases where physical access to actual equipment or machine data is limited.
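The parameter-estimation step described above can be sketched in a few lines. The following is a minimal, hypothetical illustration rather than an actual PHM implementation: the class and parameter names, the one-parameter linear wear model, and the test measurements are all invented for the example.

```python
# Hypothetical sketch of the coupled-model idea: a simulation model seeded from
# design-stage parameters, tuned against test measurements, then used to track
# machine condition. Names, wear model, and data are illustrative assumptions.
import numpy as np


class CoupledModel:
    def __init__(self, nominal_wear_rate: float):
        # Initial parameter statistically generalized from the design stage.
        self.wear_rate = nominal_wear_rate

    def simulate_degradation(self, hours: np.ndarray) -> np.ndarray:
        # Reference behavior: degradation grows linearly with operating hours.
        return self.wear_rate * hours

    def tune(self, hours: np.ndarray, measured_degradation: np.ndarray) -> None:
        # Parameter estimation from test data: least-squares fit of the wear rate.
        self.wear_rate = float(np.dot(hours, measured_degradation) / np.dot(hours, hours))

    def health_index(self, hours_in_service: float, failure_threshold: float) -> float:
        # Remaining health fraction under the tuned model (1.0 = new, 0.0 = at threshold).
        predicted = self.wear_rate * hours_in_service
        return max(0.0, 1.0 - predicted / failure_threshold)


# Design-stage estimate, then tuning with (synthetic) bench-test measurements.
twin = CoupledModel(nominal_wear_rate=0.010)
test_hours = np.array([100.0, 200.0, 300.0, 400.0])
measured = np.array([1.4, 2.9, 4.1, 5.6])
twin.tune(test_hours, measured)

print("tuned wear rate:", round(twin.wear_rate, 4))
print("health after 2000 h:", round(twin.health_index(2000.0, failure_threshold=40.0), 2))
```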
6.1.3 Healthcare

Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, reduction of waste and care variability, automated external and internal reporting of patient data, standardized medical terms and patient registries, and fragmented point solutions.

6.1.4 Technology

1. eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster, for search, consumer recommendations, and merchandising.

2. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 Amazon had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.

3. Facebook handles 50 billion photos from its user base.

4. As of August 2012, Google was handling roughly 100 billion searches per month.

5. Oracle NoSQL Database has been tested to pass the 1M ops/sec mark with 8 shards and proceeded to hit 1.2M ops/sec with 10 shards.
6.2 The Future of Big Data

Those who feel that today's big data is just a continuation of past information trends are as wrong as if they were to claim that a stone tablet is essentially the same as a tablet computer, or an abacus similar to a supercomputer. Today, we have more information than ever. But the importance of all that information extends beyond simply being able to do more, or know more, than we already do. The quantitative shift leads to a qualitative shift. Having more data allows us to do new things that weren't possible before. In other words: more is not just more. More is new. More is better. More is different. Of course, there are still limits on what we can obtain from or do with data, but most of our assumptions about the cost of collecting and the difficulty of processing data need to be overhauled. No area of human endeavor or industrial sector will be immune from the incredible shakeup that is about to occur as big data plows through society, politics, and business. People shape their tools, and their tools shape them. This new world of data, and how companies can harness it, bumps up against two areas of public policy and regulation. The first is employment. Big data will bring about great things in society. We like to think that technology leads to job creation, even if it comes after a temporary period of disruption. That was certainly true during the Industrial Revolution. To be sure, it was a devastating time of dislocation, but it eventually led to better livelihoods. Yet this optimistic outlook ignores the fact that some industries simply never recover from change. When tractors and automobiles replaced horse-drawn plows and carriages, the need for horses in the economy basically ended. The upheavals of the Industrial Revolution created political change and gave rise to new economic philosophies and political movements. It is not much
of an intellectual stretch to predict that new political philosophies and social movements will arise around big data, robots, computers, and the Internet, and the effect of these technologies on the economy and representative democracy. Recent debates over income inequality and the Occupy movement seem to point in that direction. Big data will change business, and business will change society. The hope, of course, is that the benefits will outweigh the drawbacks, but that is mostly a hope. The big-data world is still very new, and, as a society, we are not yet very good at handling all the data that we can now collect. We also cannot foresee the future. Technology will continue to surprise us, just as it would an ancient man with an abacus looking upon an iPhone. What is certain is that more will not be more: it will be different. Clearly Big Data is in its beginnings, and there is much more to be discovered. For most companies it is currently just a fashionable keyword: it has great potential, but not many truly know what it is all about. A clear sign that there is more to big data than is currently shown on the market is that the big software companies either do not have, or do not present, their own Big Data solutions, and those that do, like Google, do not use them in a commercial way. Companies need to decide what kind of strategy to use when implementing Big Data. They could take a more revolutionary approach and move all their data to the new Big Data environment, so that all reporting, modeling and interrogation is executed using the new business intelligence based on Big Data [1]. This approach is already used by many analytics-driven organizations that put all their data in a Hadoop environment and build business intelligence solutions on top of it. Another approach is the evolutionary one: Big Data becomes an input to the current BI platform. The data is accumulated and analyzed using structured and unstructured tools, and the results are sent to the data
warehouse. Standard modeling and reporting tools then have access to social media sentiment, usage records, and other processed Big Data items [1]. One issue with the evolutionary approach is that, even though it gains most of the capabilities of the Big Data environment, it also inherits most of the problems of the classic business intelligence solution, and in some cases it can create a bottleneck between the information coming from Big Data and the analytical capacity of the traditional BI or data warehouse solution. A minimal sketch of this evolutionary pattern is given below.
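As a rough, hypothetical illustration of the evolutionary approach only (the event fields, product names, and the use of SQLite as a stand-in for a real data warehouse are all assumptions made for the example), raw semi-structured events are reduced to small aggregates that conventional BI and reporting tools can then query:

```python
# Hypothetical sketch: semi-structured "Big Data" events are aggregated and the
# compact results are loaded into a conventional warehouse table (SQLite stands
# in for the data warehouse). All names and data are illustrative.
import json
import sqlite3
from collections import defaultdict

# Raw events as they might land in a Hadoop or object-store staging area.
raw_events = [
    json.dumps({"product": "phone-x", "text": "love it", "sentiment": 0.9}),
    json.dumps({"product": "phone-x", "text": "battery is poor", "sentiment": -0.4}),
    json.dumps({"product": "tablet-y", "text": "works fine", "sentiment": 0.5}),
]

# "Big Data" processing step: aggregate sentiment per product.
totals = defaultdict(lambda: [0.0, 0])
for line in raw_events:
    event = json.loads(line)
    totals[event["product"]][0] += event["sentiment"]
    totals[event["product"]][1] += 1

# Load the small, structured result into the warehouse for standard BI tools.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sentiment (product TEXT, avg_sentiment REAL, mentions INTEGER)")
conn.executemany(
    "INSERT INTO product_sentiment VALUES (?, ?, ?)",
    [(product, total / count, count) for product, (total, count) in totals.items()],
)

for row in conn.execute("SELECT * FROM product_sentiment ORDER BY product"):
    print(row)
```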
Chapter 7

Conclusion

The availability of Big Data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that, for the first time in history, we have the capabilities required to analyze astonishing data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and profitability. The Age of Big Data is here, and these will be truly revolutionary times if business and technology professionals continue to work together and deliver on the promise.
Bibliography

[1] "Introduction", www.wikipedia.com
[2] "Characteristics of Big Data", www.studymafia.org
[3] "Big Data Analytics", www.computerweekly.com
[4] "Storage of Big Data", www.computerweekly.com