Big Data
Submitted in partial fulfillment of the requirements
for the award of the Degree of Bachelor of Technology in Information Technology
Submitted By
Mr. Prashant Maruti Navatre
(Registration No. 20130737)
Under the guidance of
Prof. S. S. Barphe
DEPARTMENT OF INFORMATION TECHNOLOGY
DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY,
LONERE, RAIGAD-MAHARASHTRA, INDIA-402103
2015-2016
DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY
LONERE, RAIGAD-MAHARASHTRA, INDIA-402103
Certificate
This is to certify that the project entitled "Big Data", submitted by Mr. Prashant Maruti Navatre, Registration No. 20130737, in partial fulfilment of the requirement for the award of the degree of Bachelor of Technology in Information Technology of Dr. Babasaheb Ambedkar Technological University, Lonere, is a bonafide work carried out during the academic year 2015-2016.
Prof. S. S. Barphe Dr. S. M. Jadhav
(Seminar Guide) (Head of Department)
Information Technology Information Technology
Examiners:
1).
2).
Date:
Place: Vidyavihar, Lonere-402103
Acknowledgment
I am pleased to present this seminar report entitled "Big Data". It is indeed a great pleasure
and a moment of immense satisfaction for me to express my sense of profound gratitude and
indebtedness towards my guide, Prof. S. S. Barphe, whose enthusiasm has been a source of inspiration
for me. I am extremely thankful for the guidance and untiring attention which he bestowed on
me right from the beginning. His valuable and timely suggestions at crucial stages and, above
all, his constant encouragement have made it possible for me to achieve this work. I would also
like to give my sincere thanks to Dr. S. M. Jadhav, Head of the Department of Information Technology, for
necessary help and for providing me the required facilities for the completion of this seminar report.
I would like to thank the entire teaching staff, who were directly or indirectly involved in the
various data collection and software assistance needed to bring forward this seminar report. I express
my deep sense of gratitude towards my parents for their sustained cooperation and wishes,
which have been a prime source of inspiration in taking this seminar work to its end without
any hurdles. Last but not the least, I would like to thank all my B.Tech. colleagues for their
co-operation and useful suggestions, and all those who have directly or indirectly helped me in
the completion of this seminar work.
Date Prashant Maruti Navatre
Place 20130737
ABSTRACT
Big data is a term for massive data sets having a large, varied and complex
structure, with difficulties in storing, analyzing and visualizing them for further
processes or results. The process of examining massive amounts of data
to reveal hidden patterns and unknown correlations is called big data analytics.
This information is useful to companies and organizations, helping them gain
richer and deeper insights and an advantage over the competition.
For this reason, big data implementations need to be analyzed and executed
as accurately as possible. This report presents an overview of big data's content,
scope, samples, methods, advantages and challenges, and discusses the privacy
concerns around it.

Every day, we create 2.5 quintillion bytes of data (one quintillion bytes = one billion
gigabytes), so much that 90% of the data in the world today has been
created in the last two years alone. This data comes from everywhere: sensors
used to gather climate information, posts to social media sites, digital pictures
and videos, purchase transaction records, and cell phone GPS signals, to name
a few. This data is Big Data.
Contents
1 Introduction 1
1.1 The General Concept Of Big Data . . . . . . . . . . . . . . . . . 2
1.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 The Need For Big Data . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Sources Of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Characteristics of Big Data 10
2.1 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Veracity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Storage, Selection and Processing of Big Data 16
3.1 Storage of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Key Requirements of Big Data . . . . . . . . . . . . . . . 18
3.2 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Processing of Big Data . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Batch Processing . . . . . . . . . . . . . . . . . . . . . . 23
3.3.2 Stream Processing . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Hadoop Ecosystem . . . . . . . . . . . . . . . . . . . . . 25
3.3.4 Map and Reduce . . . . . . . . . . . . . . . . . . . . . . 25
4 Big Data Analytics 27
4.1 Examples of Big Data Analytics . . . . . . . . . . . . . . . . . . 29
4.2 Benefits of Big Data Analytics . . . . . . . . . . . . . . . . . . . 30
5 Challenges in Big Data 31
5.1 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.1 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.2 Ineffectiveness . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . 35
5.5 Behavioral change . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6 Applications and Future of Big Data 38
6.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.1.1 Government . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 Cyber-Physical Models . . . . . . . . . . . . . . . . . . . 40
6.1.3 Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.4 Technology . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 The Future of Big Data . . . . . . . . . . . . . . . . . . . . . . 42
7 Conclusion 45
References 46
List of Figures
1.1 Visualization of daily Wikipedia edits created by IBM . . . . . . 2
1.2 Growth and Digitalization of Global and Information Storage
capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Sources of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Architecture of Big Data . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 1
Introduction
In recent years, the term Big Data has emerged to describe a new paradigm
for data applications. New technologies tend to emerge with a lot of hype,
but it can take some time to tell what is truly new and different. While big data has
been defined in a myriad of ways, the heart of the Big Data paradigm is that the data is
too big (volume), arrives too fast (velocity), changes too fast (variability), contains
too much noise (veracity), or is too diverse (variety) to be processed within a local computing
structure using traditional approaches and techniques. The technologies
being introduced to support this paradigm have a wide variety of interfaces,
making it difficult to construct tools and applications that integrate data from
multiple Big Data sources.
Analysis of data sets can find new correlations to spot business trends,
prevent diseases, combat crime, and so on. Scientists, business executives, practitioners
of media and advertising, and governments alike regularly meet difficulties
with large data sets in areas including Internet search, finance and business
informatics. Scientists encounter limitations in e-Science work, including meteorology,
genomics, connectomics, complex physics simulations, and biological
and environmental research.
Big Data is high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information processing that enable
enhanced insight, decision making, and process automation. Data sets grow
in size in part because they are increasingly being gathered by cheap and numerous
information-sensing mobile devices, aerial (remote sensing) platforms, software
logs, cameras, microphones, radio-frequency identification (RFID) readers, and
wireless sensor networks.
1.1 The General Concept Of Big Data
The term Big Data is an imprecise description of a rich and complicated
set of characteristics, practices, techniques, ethical issues, and outcomes, all
associated with data. Big Data originated in the physical sciences, with physics
and astronomy among the earliest adopters of many of the techniques now called Big Data.
Instruments like the Large Hadron Collider and the Square Kilometre Array
are massive collectors of exabytes of information, and the ability to collect such
massive amounts of data necessitated an increased capacity to manipulate and
analyze these data as well.
Figure 1.1: Visualization of daily Wikipedia edits created by IBM
1.2 Definition
Big data is a broad term for data sets so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture,
data curation, search, sharing, storage, transfer, visualization, and information
privacy. The term often refers simply to the use of predictive analytics or
certain other advanced methods to extract value from data, and seldom to a
particular size of data set. Accuracy in big data may lead to more confident
decision making. And better decisions can mean greater operational efficiency,
cost reduction and reduced risk.
1.3 History
Big data burst upon the scene in the first decade of the 21st century, and the
first organizations to embrace it were online and startup firms. Arguably, firms
like Google, eBay, LinkedIn, and Facebook were built around big data from
the beginning. They didn't have to reconcile or integrate big data with more
traditional sources of data and the analytics performed upon them, because
they didn't have those traditional forms. They didn't have to merge big data
technologies with their traditional IT infrastructures because those infrastructures
didn't exist. Big data could stand alone, big data analytics could be the
only focus of analytics, and big data technology architectures could be the only
architecture.
Consider, however, the position of large, well-established businesses. Big data
in those environments shouldn't be separate, but must be integrated with everything
else that's going on in the company. Analytics on big data have to coexist
with analytics on other types of data. Hadoop clusters have to do their work
alongside IBM mainframes. Data scientists must somehow get along and work
jointly with mere quantitative analysts. In order to understand this coexistence,
we interviewed 20 large organizations in the early months of 2013 about how big
data fit into their overall data and analytics environments. Overall, we found
the expected coexistence; in not a single one of these large organizations was
big data being managed separately from other types of data and analytics. The
integration was in fact leading to a new management perspective on analytics,
which we'll call Analytics 3.0. This report describes the overall context for
how organizations think about big data, the organizational structure and skills
required for it, etc., and concludes by describing the Analytics 3.0 era.
In a 2001 research report and related lectures, META Group (now Gartner)
analyst Doug Laney defined data growth challenges and opportunities as be-
ing three-dimensional, i.e. increasing volume (amount of data), velocity (speed
of data in and out), and variety (range of data types and sources). Gartner,
and now much of the industry, continue to use this "3Vs" model for describing
big data. In 2012, Gartner updated its definition as follows: "Big data is
high volume, high velocity, and/or high variety information assets that require
new forms of processing to enable enhanced decision making, insight discovery
and process optimization." Additionally, a fourth V, "Veracity", has been added by some
organizations to describe it.
1.4 The Need For Big Data
Like many new information technologies, big data can bring about dramatic
cost reductions, substantial improvements in the time required to perform a
computing task, or new product and service offerings. Like traditional analytics,
it can also support internal business decisions.

Figure 1.2: Growth and Digitalization of Global and Information Storage capacity

The technologies and concepts behind big data allow organizations to achieve a variety of objectives, but most
of the organizations we interviewed were focused on one or two. The chosen
objectives have implications not only for the outcome and financial benefits
from big data, but also for the process: who leads the initiative, where it fits within
the organization, and how to manage the project. As the world becomes more
connected via technology, the amount of data flowing into companies is growing
exponentially, and identifying value in that data becomes more difficult: as the
data haystack grows larger, the needle becomes more difficult to find. So Big
Data is really about finding the needles: gathering, sorting and analyzing the
flood of data to find the valuable information on which sound business decisions
are made. When applied to energy-related businesses, Big Data implications
vary by market segment; Big Data concerns for utilities are not the same as
those for energy trading organizations, but the necessity of solving those
problems can be equally pressing.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity
of the concept fosters a clearer distinction between big data and Business
Intelligence, regarding data and their use: Business Intelligence uses descriptive
statistics on data with high information density to measure things and detect trends,
whereas big data uses inductive statistics and concepts from nonlinear system
identification to infer laws (regressions, nonlinear relationships, and causal effects)
from large sets of data with low information density, in order to reveal relationships
and dependencies and to perform predictions of outcomes and behaviors.
The real issue is not that you are acquiring large amounts of data. It’s what
you do with the data that counts. The hopeful vision is that organizations will
be able to take data from any source, harness relevant data and analyze it to
find answers that enable:
1. cost reductions
2. time reductions
3. new product development and optimized offerings
4. smarter business decision making
1.5 Sources Of Big Data
The sources and formats of data continue to grow in variety and complexity.
A partial list of sources includes the public web; social media; mobile applica-
tions; federal, state and local records and databases; commercial databases that
aggregate individual data from a spectrum of commercial transactions and pub-
lic records; geospatial data; surveys; and traditional offline documents scanned
by optical character recognition into electronic form. The advent of more
Internet-enabled devices and sensors expands the capacity to collect data from
physical entities, including sensors and radio-frequency identification (RFID)
chips. Personal location data can come from GPS chips, cell-tower triangulation
of mobile devices, mapping of wireless networks, and in-person payments.
There are many different types of Big Data sources, e.g.:
1. Social media data
2. Personal data (e.g. data from tracking devices)
3. Sensor data
4. Transactional data
5. Enterprise data
There are different opinions on whether Enterprise data should be consid-
ered to be Big Data or not. Enterprise data are usually large in volume, they
are generated for a different purpose and arise organically through Enterprise
processes. Also the content of Enterprise data is usually not designed by re-
searchers. For these reasons, and because there is a great potential in using
Enterprise data, we will consider it to be in scope for this report. There are
a number of differences between Enterprise data and other types of Big Data
that are worth pointing out. The amount of control a researcher has and the
potential inferential power vary between different types of Big Data sources.
For example, a researcher will likely not have any control of data from different
social media platforms and it could be difficult to decipher a text from social
media. For Enterprise data, on the other hand, a statistical agency can form
partnerships with the owners of the data and influence the design of the data.
Enterprise data is more structured and well defined, and more is known about the data
than perhaps other Big Data sources.
Figure 1.3: Sources of Big Data
1.6 Architecture
In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework
for data storage and querying. Structured, semi-structured and/or unstructured
data is stored and distributed across multiple servers. Querying of data is done
in a modified C++ dialect called ECL, which uses an apply-schema-on-read method to create
the structure of stored data at query time. In 2004 LexisNexis acquired
Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. along with their high-speed parallel
processing platform. The two platforms were merged into HPCC Systems,
which was open-sourced under the Apache v2.0 License in 2011. Currently HPCC
and the Quantcast File System are the only publicly available platforms capable of
analyzing multiple exabytes of data.
In 2004, Google published a paper on a process called MapReduce that used
such an architecture. The MapReduce framework provides a parallel process-
ing model and associated implementation to process huge amounts of data.
With MapReduce, queries are split and distributed across parallel nodes and
processed in parallel (the Map step). The results are then gathered and delivered
(the Reduce step). The framework was very successful, so others wanted
to replicate the algorithm. Therefore, an implementation of the MapReduce
framework was adopted by an Apache open source project named Hadoop.

Figure 1.4: Architecture of Big Data
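To make the Map and Reduce steps concrete, the following is a minimal, self-contained Python sketch of the flow described above. It only illustrates the idea; it does not use the Hadoop API, and the sample input splits are invented for the example:

    from collections import defaultdict

    def map_step(split):
        # the Map step: emit (word, 1) for every word in one input split
        for line in split:
            for word in line.split():
                yield word, 1

    def reduce_step(word, counts):
        # the Reduce step: aggregate all counts gathered for one key
        return word, sum(counts)

    splits = [["big data is big"], ["data about data"]]   # two invented input splits

    groups = defaultdict(list)          # the shuffle: group intermediate pairs by key
    for split in splits:
        for word, count in map_step(split):
            groups[word].append(count)

    result = dict(reduce_step(w, c) for w, c in groups.items())
    print(result)                       # {'big': 2, 'data': 3, 'is': 1, 'about': 1}

In a real cluster the map calls run on different nodes, the grouping is performed by the framework's shuffle phase, and the reduce calls are again distributed across nodes.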
Recent studies show that the use of a multi-layer architecture is an option
for dealing with big data. A distributed parallel architecture distributes data
across multiple processing units, and parallel processing units provide data much
faster by improving processing speeds. This type of architecture inserts data
into a parallel DBMS, which implements the use of the MapReduce and Hadoop
frameworks. This type of framework looks to make the processing power transparent
to the end user by using a front-end application server.
Chapter 2
Characteristics of Big Data
2.1 Volume
The quantity of generated data is important in this context. The size of
the data determines the value and potential of the data under consideration,
and whether it can actually be considered big data or not. The name big data
itself contains a term related to size, and hence the characteristic. Many factors
contribute to the increase in data volume: transaction-based data stored
through the years, unstructured data streaming in from social media, and increasing
amounts of sensor and machine-to-machine data being collected. In the
past, excessive data volume was a storage issue. But with decreasing storage
costs, other issues emerge, including how to determine relevance within large
data volumes and how to use analytics to create value from relevant data. Volume
refers to the sheer amount of data available for analysis. This volume of data is
driven by the increasing number of data collection instruments (e.g., social media
tools, mobile applications, sensors) as well as the increased ability to store
and transfer those data with recent improvements in data storage and networking.
Traditionally, the data volume requirements for analytic and transactional
applications were in sub-terabyte territory. However, over the past decade, more
organizations in diverse industries have identified requirements for analytic data
volumes in the terabytes, petabytes, and beyond. Estimates produced by longitudinal
studies started in 2005 [8] show that the amount of data in the world
is doubling every two years. Should this trend continue, by 2020 there will
be 50 times the amount of data as there was in 2011. Other estimates
indicate that 90% of all data ever created was created in the past two years. The
sheer volume of data is colossal; the era of a trillion sensors is upon
us. This volume presents the most immediate challenge to conventional infor-
mation technology structures. It has stimulated new ways for scalable storage
across a collection of horizontally coupled resources, and a distributed approach
to querying. Briefly, the traditional relational model has been relaxed for the
persistence of newly prominent data types. These logical non-relational data
models, typically lumped together as NoSQL, can currently be classified as Big
Table, Name-Value, Document and Graphical models.
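As an illustration of the difference between these models, the same hypothetical customer record is shown below, first as flat name-value pairs and then as a single self-describing document; the field names and values are invented for the example:

    # Name-value (key-value) style: flat keys map to opaque values.
    key_value_view = {
        "customer:1001:name":  "A. Kumar",
        "customer:1001:city":  "Pune",
        "customer:1001:spend": "12500",
    }

    # Document style: one self-describing, possibly nested record per key.
    document_view = {
        "_id": "customer:1001",
        "name": "A. Kumar",
        "city": "Pune",
        "orders": [
            {"order_id": 1, "amount": 4500},
            {"order_id": 2, "amount": 8000},
        ],
    }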
2.2 Velocity
The term velocity in this context refers to the speed at which data is generated
and processed to meet the demands and challenges that lie ahead in the path of
growth and development. It covers both the speed at which data collection events
can occur and the pressure of managing large streams of real-time data. Across
the means of collecting social information, new information is being added to the
database at rates ranging from as slow as every hour or so to as fast as thousands
of events per second. Data is streaming in at unprecedented speed and must be
dealt with in a timely manner. RFID tags, sensors and smart metering are driving
the need to deal with torrents of data in near-real time. Reacting quickly enough to deal
with data velocity is a challenge for most organizations. Velocity is thus the
speed or rate at which data are created, stored, analysed and visualized.
Traditionally, most enterprises separated their transaction processing and analytics. Enterprise data
analytics were concerned with batch data extraction, processing, replication,
delivery, and other applications. But increasingly, organizations everywhere
have begun to emphasize the need for real-time, streaming, continuous data
discovery, extraction, processing, analysis, and access. In the big data era, data
are created in real-time or near real-time. With the availability of Internet
connected devices, wireless or wired, machines and devices can pass on their
data the moment it is created. Data flow rates are increasing with enormous
speed and variability, creating new challenges to enable real or near real-time
data usage. Traditionally this concept has been described as streaming data.
As such there are aspects of this that are not new, as companies such as those
in telecommunication have been sifting through high volume and velocity data
for years. The new horizontal scaling approaches do however add new big data
engineering options for efficiently handling this data.
2.3 Variety
Data today comes in all types of formats: structured, numeric data in traditional
databases; information created from line-of-business applications; unstructured
text documents, email, video, audio, stock ticker data and financial
transactions. Managing, merging and governing different varieties of data is
something many organizations still grapple with. Variety also describes the type
of content, an essential fact that data analysts must know; this helps those who
analyze the data to use it effectively to their advantage and thus uphold its
importance. Variety refers to the complexity of formats in
which Big Data can exist. Besides structured databases, there are large streams
of unstructured documents, images, email messages, video, links between de-
vices and other forms that create a heterogeneous set of data points. One effect
of this complexity is that structuring and tying data together becomes a ma-
jor effort, and therefore a central concern of Big Data analysis. Traditionally,
enterprise data implementations for analytics and transactions operated on a
single structured, row-based, relational domain of data. However, increasingly,
data applications are creating, consuming, processing, and analysing data in a
wide range of relational and non-relational formats including structured, un-
structured, semistructured, documents and so forth from diverse application
domains. Traditionally, a variety of data was handled through transforms or
pre-analytics to extract features that would allow integration with other data
through a relational model. Given the wider range of data formats, structures,
timescales and semantics that are desirable to use in analytics, the integration of
this data becomes more complex. This challenge arises as data to be integrated
could be text from social networks, image data, or a raw feed directly from a
sensor source. The Internet of Things is the term used to describe the ubiquity
of connected sensors, from RFID tags for location, to smartphones, to home
utility meters. The fusion of all of this streaming data will be a challenge for
developing a total situational awareness. Big Data Engineering has spawned
data storage models that are more efficient for unstructured data types than
a relational model, causing a derivative issue for the mechanisms to integrate
this data. It is possible that the data to be integrated for analytics may be
of such volume that it cannot be moved in order to integrate, or it may be
that some of the data are not under control of the organization creating the
data system. In either case, the variety of big data forces a range of new big
data engineering in order to efficiently and automatically integrate data that is
stored across multiple repositories and in multiple formats.
2.4 Variability
Variability is the inconsistency the data can show at times, which can hamper the
process of handling and managing the data effectively; it is a factor that can be a
problem for those who analyse the data. More precisely, variability refers to changes in data
rate, format/structure, semantics, and/or quality that impact the supported
application, analytic, or problem. Specifically, variability is a change in one
or more of the other Big Data characteristics. Impacts can include the need
to refactor architectures, interfaces, processing/algorithms, integration/fusion,
storage, applicability, or use of the data. In addition to the increasing velocities
and varieties of data, data flows can be highly inconsistent with periodic peaks.
Is something trending in social media? Daily, seasonal and event-triggered peak
data loads can be challenging to manage. Even more so with unstructured data
involved. The other characteristics directly affect the scope of the impact of a
change in one dimension. For example, in a system that deals with petabytes or
exabytes of data, refactoring the data architecture and performing the necessary
transformation to accommodate a change in structure from the source data may
not be feasible, even with the horizontal scaling typically associated with
big data architectures. In addition, the trend to integrate data from outside the
organization to obtain more refined analytic results combined with the rapid
evolution in technology means that enterprises must be able to adapt rapidly
to data variations.
2.5 Veracity
Veracity concerns the quality of captured data, which can vary greatly; accurate
analysis depends on the veracity of the source data. Veracity refers to the trustworthiness,
applicability, noise, bias, abnormality and other quality properties in the data.
Veracity is a challenge in combination with other Big Data characteristics, but is
essential to the value associated with or developed from the data for a specific
problem/application. Assessing, understanding, exploiting, and controlling
Veracity in Big Data cannot be addressed efficiently and sufficiently through-
out the data lifecycle using current technologies and techniques.
2.6 Complexity
Data management can be very complex, especially when large volumes of data
come from multiple sources. Data must be linked, connected, and correlated
so users can grasp the information the data is supposed to convey. Today’s
data comes from multiple sources. And it is still an undertaking to link, match,
cleanse and transform data across systems. However, it is necessary to connect
and correlate relationships, hierarchies and multiple data linkages, or the data
can quickly spiral out of control.
Chapter 3
Storage, Selection and Processing of Big
Data
3.1 Storage of Big Data
The explosive growth of data places stricter requirements on storage and
management. In this section, we focus on the storage of big data. Big data stor-
age refers to the storage and management of large-scale datasets while achieving
reliability and availability of data accessing. We will review important issues
including massive storage systems, distributed storage systems, and big data
storage mechanisms. On one hand, the storage infrastructure needs to pro-
vide information storage service with reliable storage space; on the other hand,
it must provide a powerful access interface for query and analysis of a large
amount of data.
Traditionally, as auxiliary equipment of servers, data storage devices were used
to store, manage, look up, and analyze data with structured RDBMSs. With
the sharp growth of data, storage devices are becoming increasingly
important, and many Internet companies pursue large storage capacity to stay
competitive. Therefore, there is a compelling need for research on data
storage. Various storage systems have emerged to meet
the demands of massive data. Existing massive storage technologies can be
classified as Direct Attached Storage (DAS) and network storage, while net-
work storage can be further classified into Network Attached Storage (NAS)
and Storage Area Network (SAN). In DAS, various hard disks are directly connected
to servers, and data management is server-centric, such that storage
devices are peripheral equipment, each of which takes a certain amount of I/O
resource and is managed by individual application software. For this reason,
DAS is only suitable for interconnecting servers at a small scale. However, due
to its low scalability, DAS will exhibit undesirable efficiency when the storage
capacity is increased, i.e., the upgradeability and expandability are greatly lim-
ited. Thus, DAS is mainly used in personal computers and small-sized servers.
Network storage utilizes the network to provide users with a unified interface
for data access and sharing. Network storage equipment includes special data
exchange equipment, disk arrays, tape libraries, and other storage media, as well
as special storage software. It is characterized by strong expandability. NAS
is actually auxiliary storage equipment on a network. It is directly connected
to a network through a hub or switch through TCP/IP protocols. In NAS,
data is transmitted in the form of files. Compared to DAS, the I/O burden at a
NAS server is reduced extensively since the server accesses a storage device in-
directly through a network. While NAS is network-oriented, SAN is especially
designed for data storage with a scalable and bandwidth intensive network,
e.g., a high-speed network with optical fiber connections. In SAN, data stor-
age management is relatively independent within a storage local area network,
where multipath based data switching among any internal nodes is utilized to
achieve a maximum degree of data sharing and data management.
3.1.1 Key Requirements of Big Data
At root, the key requirements of big data storage are that it can handle
very large amounts of data and keep scaling to keep up with growth, and that
it can provide the input/output operations per second (IOPS) necessary to
deliver data to analytics tools. The largest big data practitioners (Google, Facebook,
Apple, etc.) run what are known as hyperscale computing environments.
These comprise vast amounts of commodity servers with direct-attached storage
(DAS). Redundancy is at the level of the entire compute/storage unit, and if a
unit suffers an outage of any component it is replaced wholesale, having already
failed over to its mirror. Such environments run the likes of Hadoop, NoSQL
and Cassandra as analytics engines, and typically have PCIe flash storage alone
in the server or in addition to disk to cut storage latency to a minimum. There's
no shared storage in this type of configuration. Hyperscale computing environ-
ments have been the preserve of the largest web-based operations to date, but
it is highly probable that such compute/storage architectures will bleed down
into more mainstream enterprises in the coming years. The appetite for build-
ing hyperscale systems will depend on the ability of an enterprise to take on a
lot of in-house hardware building and maintenance and whether they can jus-
tify such systems to handle limited tasks alongside more traditional enterprise
environments that handle large amounts of applications on less specialised sys-
tems. But hyperscale is not the only way. Many enterprises, and even quite
small businesses, can take advantage of big data analytics. They will need
the ability to handle relatively large data sets and handle them quickly, but
may not need quite the same response times as those organisations that use
it to push adverts out to users within response times of a few seconds. So the key
type of big data storage system with the attributes required will often be scale-
out or clustered NAS. This is file access shared storage that can scale out to
meet capacity or increased compute requirements and uses parallel file systems
that are distributed across many storage nodes that can handle billions of files
without the kind of performance degradation that happens with ordinary file
systems as they grow. For some time, scale-out or clustered NAS was a distinct
product category, with specialised suppliers such as Isilon and BlueArc. But
a measure of the increasing importance of such systems is that both of these
have been bought relatively recently by big storage suppliers EMC and Hitachi
Data Systems, respectively. Meanwhile, clustered NAS has gone mainstream,
and the big change here was with NetApp incorporating true clustering and
petabyte/parallel file system capability into its Data ONTAP OS in its FAS
filers. The other storage format that is built for very large numbers of files is
object storage. This tackles the same challenge as scale-out NAS that tradi-
tional tree-like file systems become unwieldy when they contain large numbers
of files.
Object-based storage gets around this by giving each file a unique identifier
and indexing the data and its location. It's more like the DNS way of doing
things on the internet than the kind of file system we're used to. Object storage
systems can scale to very high capacity and large numbers of files in the bil-
lions, so are another option for enterprises that want to take advantage of big
data. Having said that, object storage is a less mature technology than scale-
out NAS. So, to sum up, big data storage needs to be able to handle capacity
and provide low latency for analytics work. You can choose to do it like the big
boys in hyperscale environments or adopt NAS or object storage in more tradi-
tional IT departments to do the job. Flash storage solutions, implemented at
the server level and with all-flash arrays, offer some interesting alternatives for
high-performance, low-latency storage, from a few terabytes to a hundred ter-
abytes or more in capacity. Object-based, scale-out architectures with erasure
coding can provide scalable storage systems that eschew traditional RAID and
replication methods to achieve new levels of efficiency and lower per-gigabyte
costs.
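The following toy Python sketch illustrates only the naming-and-lookup idea behind object storage described above: each object receives a unique identifier (here derived from a content hash, one common but not universal choice) and a flat index maps identifiers to data. Real object stores add replication or erasure coding, metadata and distribution across nodes, none of which is shown:

    import hashlib

    class TinyObjectStore:
        """Toy object store: a flat index from object id to bytes."""

        def __init__(self):
            self.index = {}                                  # object id -> data

        def put(self, data: bytes) -> str:
            object_id = hashlib.sha256(data).hexdigest()     # content-derived id
            self.index[object_id] = data
            return object_id

        def get(self, object_id: str) -> bytes:
            return self.index[object_id]

    store = TinyObjectStore()
    oid = store.put(b"sensor reading 42")
    print(oid[:12], store.get(oid))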
3.2 Selection of Big Data
Every organization seeking to make sense of big data must determine which
platforms and tools, in the sea of available options, will help them to meet their
business goals. Answering a few key questions can help guide IT leaders
to make the right data management choices for their organization's future
success. For organizations needing to store and process tens of terabytes of
data, using an open-source distributed file system is a mature choice due to its
predictable scalability over clustered hardware. Plus, it's the base platform for
many big data architectures already. However, if looking to run analytics in
online or real-time applications, consider hybrid architectures containing dis-
tributed file systems combined with distributed database management systems
(which have lower latency). Or look at large traditional relational systems to
get real-time access to data that has been through the heavy lifting processes
of a distributed file system. Many NoSQL databases require specific application
programming interfaces (APIs) in order to access the data. With this, you'll need to
consider the integration of visualization or other tools that will need access
to the data. If the tools being used with the big data platform need a SQL
interface, choose a tool that has maturity in that area. Of note, NoSQL and
big data platforms are evolving quickly and businesses just starting to build
custom applications on top of a big data platform may be able to build around
the sometimes raw data access frameworks. Alternatively, businesses with ex-
isting applications will need a more mature offering. If data requirements are
especially unstructured, or include streaming data sources such as social media
or video, businesses should look into data serialization technologies that allow
capture, storage and representation of such high-velocity data. How applications
consume data should also be taken into consideration. For instance,
some existing tools allow users to project different structures across the data
store, giving flexibility to store data in one way and access it in another. Yes,
being flexible in how data is presented to consuming applications is a bene-
fit, but the performance may not be good enough for high velocity data. To
overcome this performance challenge, you may need to integrate with a more
structured data store further downstream in your data architecture. If looking
to extend your current data architecture by integrating a big data platform into
an existing data warehouse, data integration tools can help. Many integration
vendors that support big data platforms also have specialized support for in-
tegrating with SQL data warehouses and data marts. Clearly, choosing a big
data solution isn't easy.
As companies of all sizes try to extract more from their existing data stores,
Big Data vendors are rushing in to provide a range of Big Data solutions, which
comprise everything from database technology to visualization tools. With
such a diverse selection of tools to choose from, buyers must carefully define
their goals in order to find the right tools to meet them. Before finding
the right tools, however, organizations must first ask themselves what business
problems they're trying to solve, and why. "Too many big data projects don't
start with problems to solve, but rather start with exploratory analytics," said
Chris Selland, VP of marketing and business development for HP Vertica. That's
okay to a point, but eventually these questions need to be asked and answered.
Companies have a lot of Big Data and many questions, but that doesn't result
in the CIO or CFO simply handing you a large amount of money to work with.
Figure 3.1: Selection of Big Data
3.3 Processing of Big Data
A variety of platforms have emerged to process big data, including advanced
SQL (sometimes called NewSQL) databases that adapt SQL to handle larger
volumes of structured data with greater speed, and NoSQL platforms that may
range from file systems to document or columnar data stores that typically
dispense with the need for modelling data. Most of the early implementations
of big data, especially with NoSQL platforms such as Hadoop, have focused
more on volume and variety, with results delivered through batch processing.
Behind the scenes, there is a growing range of use cases that also emphasise
speed. Some of them consist of new applications that take advantage not only
of powerful back-end data platforms, but also the growth in bandwidth and
mobility. Examples include mobile applications such as Waze that harness sen-
sory data from smartphones and GPS devices to provide real-time pictures of
traffic conditions. On the horizon there are opportunities for mobile carriers
to track caller behaviour in real time to target ads, location-based services, or
otherwise engage their customers. Conversely, existing applications
are being made more accurate, responsive and effective as smart sensors add
more data points, intelligence and adaptive control. These are as diverse as
optimising supply chain inventories, regulating public utility and infrastruc-
ture networks, or providing real-time alerts for homeland security. The list of
potential opportunities for fast processing of big data is limited only by the
imagination.
3.3.1 Batch Processing
Apache Hadoop is a distributed computing framework modeled after Google
MapReduce to process large amounts of data in parallel. For many Java developers,
the first thing that comes to mind when speaking about distributed computing
is EJB. EJB is de facto a component model with remoting capability, but it falls short
of the critical features of a distributed computing framework, which include
computational parallelization, work distribution, and tolerance to unreliable
hardware and software. Hadoop, on the other hand, has these merits built in.
ZooKeeper, modeled on Google Chubby, is a centralized service for maintaining
configuration information, naming, providing distributed synchronization,
and group services for the Hadoop cluster. The Hadoop Distributed File System
(HDFS), modeled on Google GFS, is the underlying file system of a Hadoop
cluster. HDFS works more efficiently with a few large data files than numer-
ous small files. A real-world Hadoop job typically takes minutes to hours to
complete; therefore Hadoop is not for real-time analytics, but rather for offline,
batch data processing. Recently, Hadoop has undergone a complete overhaul
for improved maintainability and manageability. Something called YARN (Yet
Another Resource Negotiator) is at the center of this change. One major ob-
jective of Hadoop YARN is to decouple Hadoop from the MapReduce paradigm to
accommodate other parallel computing models, such as MPI (Message Passing
Interface) and Spark.
In general, data flows from component to component in an enterprise ap-
plication. This is the case for application frameworks (EJB and Spring frame-
work), integration engines (Camel and Spring Integration), as well as ESB
(Enterprise Service Bus) products. Nevertheless, for the data-intensive pro-
cesses Hadoop deals with, it makes better sense to load a big data set once
and perform various analysis jobs locally to minimize IO and network cost, the
so-called ”Move-Code-To-Data” philosophy. When you load a big data file to
HDFS, the file is split into chunks (or file blocks) through a centralized Name
Node (master node) and resides on individual Data Nodes (slave nodes) in the
Hadoop cluster for parallel processing.
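The following is a conceptual sketch, not HDFS code: it mimics how a file of a given size is cut into fixed-size blocks and how a name-node-like table could record which data node holds each block. The 64 MB block size and the round-robin placement are simplifying assumptions, and real HDFS also replicates each block:

    BLOCK_SIZE = 64 * 1024 * 1024                  # 64 MB, an early HDFS default
    data_nodes = ["datanode-1", "datanode-2", "datanode-3"]

    def split_and_place(file_size_bytes):
        """Return a name-node-like table: block number -> data node."""
        placement = {}
        n_blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
        for block in range(n_blocks):
            placement[block] = data_nodes[block % len(data_nodes)]   # round-robin
        return placement

    # A 300 MB file becomes 5 blocks spread across the 3 data nodes.
    print(split_and_place(300 * 1024 * 1024))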
3.3.2 Stream Processing
Stream data processing is not intended to analyze a full big data set, nor
is it capable of storing that amount of data (The Storm-on-YARN project is
an exception). If you are asked to build a real-time, ad hoc analytics
system that operates on a complete big data set, you will need some mighty
tools. Twitter Storm is an open source, big data processing system intended
for distributed, real-time streaming processing. Storm implements a data flow
model in which data (time series facts) flows continuously through a topology
(a network of transformation entities). The slice of data being analyzed at any
moment in an aggregate function is specified by a sliding window, a concept in
CEP/ESP. A sliding window may be like ”last hour”, or ”last 24 hours”, which
is constantly shifting over time. Data can be fed to Storm through distributed
messaging queues like Kafka, Kestrel, and even regular JMS. Trident is an ab-
straction API of Storm that makes it easier to use. Like Twitter Storm, Apache
S4 is a product for distributed, scalable, continuous, stream data processing.
Note that the size of a sliding window cannot grow infinitely.
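The sliding-window idea can be illustrated with plain Python, independently of the Storm or Trident APIs; the window length and the sample events below are arbitrary:

    from collections import deque

    WINDOW_SECONDS = 3600              # a "last hour" window

    window = deque()                   # (timestamp, value) pairs still inside the window

    def observe(timestamp, value):
        window.append((timestamp, value))
        # evict events that have slid out of the window
        while window and window[0][0] <= timestamp - WINDOW_SECONDS:
            window.popleft()
        return sum(v for _, v in window)   # aggregate over the current window

    print(observe(0, 5))       # 5
    print(observe(1800, 7))    # 12  (both events fall inside the last hour)
    print(observe(4000, 1))    # 8   (the event at t=0 has been evicted)

In a real streaming engine the window is maintained by the framework and the aggregate function is supplied by the application; the eviction logic above is what keeps the window from growing without bound.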
3.3.3 Hadoop Ecosystem
The Hadoop API is often considered low-level, as it is not easy to program with.
The quickly growing Hadoop ecosystem offers a list of abstraction techniques,
which encapsulate and hide the programming complexity of Hadoop. Pig, Hive,
Cascading, Crunch, Scrunch, Scalding, Scoobi, and Cascalog all aim to provide
low cost entry to Hadoop programming. Pig, Crunch (Scrunch), and Cascading
are data-pipe based techniques. A data pipe is a multi-step process in which
transformation, splitting, merging, and joins may be conducted individually at
each step; it is similar to a workflow in a general workflow engine. Hive, on the
other hand, works like a data warehouse by offering a SQL-compatible interactive
shell. Programs or shell scripts developed on top of these techniques are compiled
to native Hadoop Map and Reduce classes behind the scenes to run in the cluster.
Given the simplified programming interfaces
in conjunction with libraries of reusable functions, development productivity is
greatly improved.
3.3.4 Map and Reduce
A centralized JobTracker process in the Hadoop cluster moves your code to
data. The code hereby includes a Map and a Reduce class. Put simply, a Map
class does the heavy-lifting job of data filtering, transformation, and splitting.
For better IO and network efficiency, a Mapper instance only processes the data
chunks co-located on the same data node, a concept termed data locality (or
data proximity). Mappers can run in parallel on all the available data nodes
in the cluster. The outputs of the Mappers from different nodes are shuffled
through a particular algorithm to the appropriate Reduce nodes. A Reduce class
by nature is an aggregator. The number of Reducer instances is configurable
by developers.
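As a concrete illustration of the Map and Reduce roles, below is a word-count sketch written in the style of Hadoop Streaming, where the mapper and reducer are ordinary scripts that read from standard input and the framework sorts the mapper output by key before it reaches the reducer. The file names mapper.py and reducer.py are only illustrative:

    # mapper.py -- emits one "word<TAB>1" line per input word
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- input arrives sorted by key, so counts for a word are adjacent
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The same pipeline can be tried on one machine with cat input.txt | python mapper.py | sort | python reducer.py, where the sort command plays the role of the shuffle between the Map and Reduce stages.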
Chapter 4
Big Data Analytics
Big data is now a reality: The volume, variety and velocity of data coming
into your organization continue to reach unprecedented levels. This phenom-
enal growth means that not only must you understand big data in order to
decipher the information that truly counts, but you also must understand the
possibilities of big data analytics. Big data analytics is the process of examin-
ing big data to uncover hidden patterns, unknown correlations and other useful
information that can be used to make better decisions. With big data analytics,
data scientists and others can analyze huge volumes of data that conventional
analytics and business intelligence solutions can’t touch. Consider that your or-
ganization could accumulate (if it hasn’t already) billions of rows of data with
hundreds of millions of data combinations in multiple data stores and abundant
formats. High-performance analytics is necessary to process that much data in
order to figure out what’s important and what isn’t. Enter big data analytics.
Why collect and store terabytes of data if you can’t analyze it in full context?
Or if you have to wait hours or days to get results? With new advances in
computing technology, there’s no need to avoid tackling even the most chal-
lenging business problems. For simpler and faster processing of only relevant
data, you can use high-performance analytics. Using high-performance data
mining, predictive analytics, text mining, forecasting and optimization on big
data enables you to continuously drive innovation and make the best possible
decisions. In addition, organizations are discovering that the unique properties
of machine learning are ideally suited to addressing their fast-paced big data
needs in new ways.
Big data can be analyzed with the software tools commonly used as part of
advanced analytics disciplines such as predictive analytics, data mining, text
analytics and statistical analysis. Mainstream BI software and data visualiza-
tion tools can also play a role in the analysis process. But the semi-structured
and unstructured data may not fit well in traditional data warehouses based on
relational databases. Furthermore, data warehouses may not be able to handle
the processing demands posed by sets of big data that need to be updated fre-
quently or even continually – for example, real-time data on the performance of
mobile applications or of oil and gas pipelines. As a result, many organizations
looking to collect, process and analyze big data have turned to a newer class of
technologies that includes Hadoop and related tools such as YARN, MapReduce,
Spark, Hive and Pig as well as NoSQL databases. Those technologies form the
core of an open source software framework that supports the processing of large
and diverse data sets across clustered systems.
In some cases, Hadoop clusters and NoSQL systems are being used as landing
pads and staging areas for data before it gets loaded into a data warehouse
for analysis, often in a summarized form that is more conducive to relational
structures. Increasingly though, big data vendors are pushing the concept of
a Hadoop data lake that serves as the central repository for an organization’s
incoming streams of raw data. In such architectures, subsets of the data can
then be filtered for analysis in data warehouses and analytical databases, or it
can be analyzed directly in Hadoop using batch query tools, stream processing
software and SQL on Hadoop technologies that run interactive, ad hoc queries
written in SQL. Potential pitfalls that can trip up organizations on big data
analytics initiatives include a lack of internal analytics skills and the high cost
of hiring experienced analytics professionals. The amount of information that’s
typically involved, and its variety, can also cause data management headaches,
including data quality and consistency issues. In addition, integrating Hadoop
systems and data warehouses can be a challenge, although various vendors now
offer software connectors between Hadoop and relational databases, as well as
other data integration tools with big data capabilities.
4.1 Examples of Big Data Analytics
As the technology that helps an organization to break down data silos and
analyze data improves, business can be transformed in all sorts of ways. Accord-
ing to Datamation, today’s advances in analyzing Big Data allow researchers to
decode human DNA in minutes, predict where terrorists plan to attack, deter-
mine which gene is most likely to be responsible for certain diseases and, of
course, which ads you are most likely to respond to on Facebook. The business
cases for leveraging Big Data are compelling. For instance, Netflix mined its
subscriber data to put the essential ingredients together for its recent hit House
of Cards, and subscriber data also prompted the company to bring Arrested
Development back from the dead.
Another example comes from one of the biggest mobile carriers in the world.
France’s Orange launched its Data for Development project by releasing sub-
scriber data for customers in the Ivory Coast. The 2.5 billion records, which
were made anonymous, included details on calls and text messages exchanged
between 5 million users. Researchers accessed the data and sent Orange propos-
als for how the data could serve as the foundation for development projects to
improve public health and safety. Proposed projects included one that showed
how to improve public safety by tracking cell phone data to map where people
went after emergencies; another showed how to use cellular data for disease
containment.
4.2 Benefits of Big Data Analytics
Enterprises are increasingly looking to find actionable insights into their data.
Many big data projects originate from the need to answer specific business
questions. With the right big data analytics platforms in place, an enterprise
can boost sales, increase efficiency, and improve operations, customer service
and risk management.
1. Webopedia parent company, QuinStreet, surveyed 540 enterprise decision-
makers involved in big data purchases to learn in which business areas companies
plan to use Big Data analytics to improve operations.
2. About half of all respondents said they were applying big data analytics
to improve customer retention, help with product development and gain a
competitive advantage.
3. The business area getting the most attention relates to increasing efficien-
cies and optimizing operations: 62 percent of respondents said that they
use big data analytics to improve speed and reduce complexity.
Chapter 5
Challenges in Big Data
5.1 Security
The biggest challenge for big data from a security point of view is the pro-
tection of users' privacy. Big data frequently contains huge amounts of personally
identifiable information, and therefore the privacy of users is a huge concern.
Because of the large amount of data stored, breaches affecting big data can have
more devastating consequences than the data breaches we normally see in the
press. This is because a big data security breach will potentially affect a much
larger number of people, with consequences not only from a reputational point
of view, but with enormous legal repercussions. When producing information
for big data, organizations have to ensure that they have the right balance
between utility of the data and privacy. Before the data is stored it should
be adequately anonymised, removing any unique identifier for a user. This in
itself can be a security challenge as removing unique identifiers might not be
enough to guarantee that the data will remain anonymous. The anonymized
data could be cross-referenced with other available data using de-anonymization
techniques. When storing the data, organizations will face the
problem of encryption. Data cannot simply be sent encrypted by the users if the cloud needs to perform operations over it. A solution for this is to use Fully Homomorphic Encryption (FHE), which allows the cloud to perform operations directly on the encrypted data, producing new encrypted results. When those results are decrypted they are the same as if the operations had been carried out on the plaintext data. The cloud is therefore able to compute over encrypted data without any knowledge of the underlying plaintext. While using big data, a significant challenge is
how to establish ownership of information. If the data is stored in the cloud, a trust boundary should be established between the data owners and the data storage owners. Adequate access control mechanisms will be key in protecting the data. Access control has traditionally been provided by operating systems or applications restricting access to the information, which typically exposes all the information if the system or application is hacked. A better approach is to protect the information using encryption that only allows decryption if the entity trying to access the information is authorised by an access control policy. An additional problem is that software commonly used to store big data, such as Hadoop, doesn't always come with user authentication by default. This makes the problem of access control worse, as a default installation would leave the information open to unauthenticated users. Big data solutions often rely on traditional firewalls or implementations at the application layer to restrict access to the information.
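To make the encrypted-computation idea discussed above more concrete, the sketch below uses the Paillier cryptosystem via the third-party python-paillier package (phe). Paillier is only partially homomorphic (it supports addition of ciphertexts, not arbitrary computation as FHE would), so this is a simplified stand-in for the FHE approach, and the package and values shown are assumptions for illustration only.

# Sketch only: additive homomorphic encryption with the 'phe' package
# (pip install phe). The cloud can sum ciphertexts without ever seeing
# the underlying values.
from phe import paillier

# Data owner generates a key pair and encrypts sensitive readings.
public_key, private_key = paillier.generate_paillier_keypair()
readings = [120, 135, 128, 142]                      # e.g. usage or billing values
encrypted = [public_key.encrypt(r) for r in readings]

# The cloud aggregates the ciphertexts without learning the plaintext.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the data owner, holding the private key, can decrypt the result.
total = private_key.decrypt(encrypted_total)
assert total == sum(readings)
print("decrypted total:", total)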
The main solution to ensuring that data remains protected is the adequate use of encryption. For example, Attribute-Based Encryption can help in providing fine-grained access control over encrypted data. Anonymizing the data is also important to ensure that privacy concerns are addressed: it should be ensured that all sensitive information is removed from the set of records collected.
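As a minimal illustration of the anonymization step described above, the sketch below drops direct identifiers and replaces the user ID with a salted hash before records are handed to an analytics platform. The field names and salt handling are hypothetical assumptions; real de-identification would also have to consider quasi-identifiers (such as birth date plus postcode) that enable re-identification.

import hashlib

SALT = b"rotate-and-store-this-secret-separately"   # assumed to be managed outside the data store
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def pseudonymize(user_id: str) -> str:
    # One-way, salted hash so the same user can still be linked across
    # records without storing the raw identifier.
    return hashlib.sha256(SALT + user_id.encode("utf-8")).hexdigest()

def anonymize(record: dict) -> dict:
    # Drop direct identifiers, keep analytic fields, pseudonymize the key.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["user_id"] = pseudonymize(record["user_id"])
    return cleaned

raw = {"user_id": "u-1001", "name": "A. Subscriber", "email": "a@example.com",
       "phone": "+91-0000000000", "cell_tower": "MH-042", "call_seconds": 74}
print(anonymize(raw))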
Real-time security monitoring is also a key security component for a big data
project. It is important that organizations monitor access to ensure that there
is no unauthorised access. It is also important that threat intelligence is in place
to ensure that more sophisticated attacks are detected and that the organiza-
tions can react to threats accordingly. If an adequate governance framework is
not applied to big data then the data collected could be misleading and cause
unexpected costs. The main problem from a governance point of view is that big data is a relatively new concept, and well-established procedures and policies for it have yet to be developed. The challenge with big data is that the unstructured nature of
the information makes it difficult to categorize, model and map the data when
it is captured and stored. The problem is made worse by the fact that the data
normally comes from external sources, often making it complicated to confirm
its accuracy. Data hackers have become more damaging in the era of big data
due to the availability of large volumes of publicly available data, the abil-
ity to store massive amounts of data on portable devices such as USB drives
and laptops, and the accessibility of simple tools to acquire and integrate dis-
parate data sources. According to the Open Security Foundation's DataLossDB
project (http://datalossdb.org), hacking accounts for 28% of all data breach
incidents, with theft accounting for an additional 24%, fraud accounting for
12%, and web-related loss accounting for 9% of all data loss incidents. More
than half (57%) of all data loss incidents involve external parties, but 10% in-
volve malicious actions on the part of internal parties, and an additional 20%
involve accidental actions by internal parties. Private businesses, hospitals, and
biomedical researchers are also making tremendous investments in the collec-
tion, storage, and analysis of large-scale data and private information.
5.2 Data Access
5.2.1 Inefficiency
Securing and controlling access to data is a very time-consuming process. Currently, even in companies that are aware that their data access protocols are an issue, the actual practices for securing data are inefficient, whether unstructured data is being secured manually or handled automatically. To secure data properly, enterprises need to assess where in the environment it resides, assess how much data loss results from the use of file servers and NAS devices, and even develop an inventory of what is available in their SharePoint deployments. All of this is very time consuming, as most administrators will no doubt agree.
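A first step toward the data inventory mentioned above can be automated. The sketch below walks a file share and reports large files that have not been accessed recently, so that they can be reviewed for ownership and access rights. The mount point and thresholds are illustrative assumptions, not recommendations.

import os
import time

ROOT = "/mnt/fileshare"        # hypothetical NAS mount point
STALE_DAYS = 365               # flag files untouched for a year
MIN_SIZE_MB = 100              # ignore small files

now = time.time()
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path)
        except OSError:
            continue                       # unreadable or recently removed file
        stale = (now - st.st_atime) > STALE_DAYS * 86400
        large = st.st_size > MIN_SIZE_MB * 1024 * 1024
        if stale and large:
            last_access = time.strftime("%Y-%m-%d", time.localtime(st.st_atime))
            print(f"{path}\t{st.st_size // (1024 * 1024)} MB\tlast access {last_access}")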
5.2.2 Ineffectiveness
After answering the question of who has access to the data, the next big question is whether they should have that level of access at all. Does IT know what level of access should be offered to employees, and should this decision be left in the hands of the IT department? The chances are that it shouldn't be allowed to decide, as this is ultimately a business decision. But, as in most companies, there is often no clear policy on who makes these decisions. In this situation there is also likely to be a lot of unstructured and orphaned data lying around with no one taking responsibility for it.
5.3 Data Cleaning
Data cleaning remains an important part of the process of ensuring data quality, and involves two main tasks. The first is to verify that the quantitative and qualitative (i.e. categorical) variables have been recorded as expected. The second involves removing outliers, which in the Big Data paradigm typically means the use of decision tree algorithms. But data cleaning itself is a subjective process (e.g. deciding which variables to consider) and not truly agnostic as would be desired, and is thus open to philosophical debate (Bollier, 2010).
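As one concrete, hedged example of the tree-based outlier removal mentioned above, the sketch below uses scikit-learn's IsolationForest, an ensemble of randomized trees. The synthetic data and the contamination rate are assumptions; the contamination parameter is precisely the kind of subjective choice the text warns about.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic example: mostly well-behaved measurements plus a few gross errors.
values = np.concatenate([rng.normal(50, 5, size=1000), [500.0, -300.0, 999.0]])
X = values.reshape(-1, 1)

# contamination is the analyst's guess at the outlier fraction -- a subjective choice.
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)     # -1 marks suspected outliers, 1 marks inliers

cleaned = values[labels == 1]
print(f"kept {cleaned.size} of {values.size} records")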
5.4 Data Representation
Related to the question of data provenance is the issue of understanding the
underlying population whose behavior has been captured. The large data sizes
may make the sampling rate irrelevant, but it doesn't necessarily make the data representative. Not everybody uses Twitter, Facebook or even Google searches. For example, ITU estimates suggest that Internet usage is still limited to only 40 per cent of the world population. In other words, more than four billion people globally are not yet using the Internet, and 90 per cent of them are from the developing world. Of the world's three billion Internet users, two-thirds are from developing countries. At the other end of the spectrum, even though mobile
cellular penetration is close to 100%, this does not mean that every person in
the world is using a mobile phone. Representativeness is of high relevance when considering how telecommunication data may be used
for monitoring and development. Whilst the promise in leveraging data from
mobile network operators for monitoring and development hinges on its large
coverage, nearing the actual population, it is still not the whole population.
Questions such as the extent of coverage of the poor, or the levels of gender
representation amongst telecom users are all valid questions. Whilst the regis-
tration information might provide answers, the reality is that the demographic
information on telecom subscribers for example is not always accurate. With
pre-paid subscriptions being the norm in the majority of the developing world,
demographic information contained in the mobile operator records is practically
useless, even with mandated registration.
The issue of sampling bias is best illustrated by the case of Street Bump,
a mobile app developed by Boston City Hall. Street Bump uses a phone's accelerometer to detect potholes and notify City Hall while app users drive around Boston. The app, however, introduces a selection bias, since it is skewed towards the demographics of its users, who often hail from affluent areas with greater smartphone ownership (Harford, 2014). Hence the "Big" in
Big Data does not automatically mean that issues such as measurement bias
and methodology, internal and external data validity, and inter-dependencies
among data can be ignored. These are foundational issues not just for small
data but also for Big Data (Boyd and Crawford, 2012).
5.5 Behavioral change
Digitized online behavior can be subject to self-censorship and the creation of multiple personas, further muddying the waters. Thus studying the data exhaust of people may not always give us insights into real-world dynamics. This may be less of an issue with TGD, where in essence the data artifact is itself a byproduct of another activity. Telecom network Big Data, which mostly falls under this category, may be less susceptible to self-censorship and persona development. But this does not exclude the possibility either: it is not inconceivable that users may avoid using their mobiles, or even turn them off, in areas where they do not wish their digital footprint to be left behind.
In a way, Big Data analyses of behavioral data are subject to a form of the
Heisenberg Uncertainty principle: as soon as the basic process of an analysis
is known, there may be concerted efforts to exhibit different behavior and/or
actions to change the outcomes (Bollier, 2010).
For example, the famous Google PageRank algorithm has spawned an entire industry of organizations that claim to enhance the page ranks of websites. Search Engine Optimization (SEO) is now an established practice when developing websites. Change in behavior could also partly account for the declining veracity of Google Flu Trends. Researchers found that influenza-like-illness rates as
exhibited by Google searches did not necessarily correlate with actual influenza
virus infections (Ortiz et al., 2011). Recent research has shown that after 2009 (when it failed to catch the non-seasonal influenza outbreak of that year), infrequent updates have not improved the results. In fact, Google Flu Trends has
persistently overestimated flu prevalence since 2009 (Lazer, Kennedy, King, and
Vespignani, 2014). Google Flu Trends does not and cannot know what factors
contributed to the strong correlations found in their initial work. The point is
that the underlying real world actions of the population that turned to Google
for its health queries and which contributed to the original correlations discov-
ered by GFT, may have in fact changed over time, diminishing the robustness of the original algorithm. For example, the hoopla surrounding GFT could have even created rebound effects, with more and more people turning to Google for their broader health questions and thereby introducing additional search terms (due to different cultural norms and/or ground conditions), which can collectively introduce biases that GFT has not been able to account for. Such possible
problems could have been caught and resolved had the GFT method been more
transparent.
Chapter 6
Applications and Future of Big Data
6.1 Applications
Big data has increased the demand for information management specialists, so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data
management and analytics. In 2010, this industry was worth more than $100
billion and was growing at almost 10 percent a year: about twice as fast as
the software business as a whole. Developed economies increasingly use data-
intensive technologies. There are 4.6 billion mobile-phone subscriptions world-
wide, and between 1 billion and 2 billion people accessing the internet. Between
1990 and 2005, more than 1 billion people worldwide entered the middle class,
which means more people become more literate, which in turn leads to infor-
mation growth. The world’s effective capacity to exchange information through
telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993,
2.2 exabytes in 2000, 65 exabytes in 2007, and predictions put the amount of in-
ternet traffic at 667 exabytes annually by 2014. According to one estimate, one
third of the globally stored information is in the form of alphanumeric text and
still image data, which is the format most useful for most big data applications.
This also shows the potential of yet unused data (i.e. in the form of video and audio content). While many vendors offer off-the-shelf solutions for Big Data,
experts recommend the development of in-house solutions custom-tailored to
solve the company’s problem at hand if the company has sufficient technical
capabilities.
6.1.1 Government
The use and adoption of Big Data within governmental processes is beneficial
and allows efficiencies in terms of cost, productivity, and innovation. That
said, this process does not come without its flaws. Data analysis often requires
multiple parts of government (central and local) to work in collaboration and
create new and innovative processes to deliver the desired outcome. Below are some leading examples within the governmental Big Data space.
United States of America
In 2012, the Obama administration announced the Big Data Research and
Development Initiative, to explore how big data could be used to address im-
portant problems faced by the government. The initiative is composed of 84
different big data programs spread across six departments. Big data analysis
played a large role in Barack Obama’s successful 2012 re-election campaign.
The United States Federal Government owns six of the ten most powerful su-
percomputers in the world. The Utah Data Center is a data center currently
being constructed by the United States National Security Agency. When fin-
ished, the facility will be able to handle a large amount of information collected
by the NSA over the Internet. The exact amount of storage space is unknown,
but more recent sources claim it will be on the order of a few exabytes.
India
Big data analysis was in part responsible for the BJP and its allies winning the 2014 Indian General Election. The Indian Government utilises numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.
6.1.2 Cyber-Physical Models
Current prognostics and health management (PHM) implementations mostly utilize data gathered during actual usage, whereas analytical algorithms can perform more accurately when more information from throughout the machine's lifecycle, such as system configuration, physical knowledge and working principles, is included.
cally integrate, manage and analyze machinery or process data during different
stages of machine life cycle to handle data/information more efficiently and
further achieve better transparency of machine health condition for manufac-
turing industry. With such motivation, a cyber-physical (coupled) model scheme
has been developed. The coupled model is a digital twin of the real machine
that operates in the cloud platform and simulates the health condition with an
integrated knowledge from both data driven analytical algorithms as well as
other available physical knowledge. It can also be described as a 5S systematic
approach consisting of Sensing, Storage, Synchronization, Synthesis and Ser-
vice. The coupled model first constructs a digital image from the early design
stage. System information and physical knowledge are logged during product
design, based on which a simulation model is built as a reference for future
analysis. Initial parameters may be statistically generalized and they can be
tuned using data from testing or the manufacturing process using parameter
estimation. After that step, the simulation model can be considered a mirrored
image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the increased connectivity
offered by cloud computing technology, the coupled model also provides better
accessibility of machine condition for factory managers in cases where physical
access to actual equipment or machine data is limited.
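A heavily simplified sketch of the coupled-model idea is given below: a nominal degradation parameter from the design stage is re-estimated by least squares from logged sensor data, after which the tuned model can track and project machine health. The linear wear model, class name and numbers are illustrative assumptions, not part of any real PHM system.

import numpy as np

class CoupledModel:
    """Toy digital twin: health(t) = 1.0 - wear_rate * t."""

    def __init__(self, nominal_wear_rate: float):
        self.wear_rate = nominal_wear_rate        # initial value from the design stage

    def tune(self, hours: np.ndarray, health: np.ndarray) -> None:
        # Parameter estimation: least-squares fit of the wear rate to field data.
        slope, _intercept = np.polyfit(hours, health, deg=1)
        self.wear_rate = -slope

    def predict_health(self, hours: float) -> float:
        return max(0.0, 1.0 - self.wear_rate * hours)

# Design-stage guess, then tuning with (noisy) measurements from the real machine.
twin = CoupledModel(nominal_wear_rate=1e-4)
hours = np.array([0, 500, 1000, 1500, 2000], dtype=float)
health = np.array([1.00, 0.93, 0.85, 0.79, 0.70])
twin.tune(hours, health)
print(f"estimated wear rate: {twin.wear_rate:.2e} per hour")
print(f"projected health at 5000 h: {twin.predict_health(5000):.2f}")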
6.1.3 Healthcare
Big data analytics has helped healthcare improve by providing personalized
medicine and prescriptive analytics, clinical risk intervention and predictive
analytics, waste and care variability reduction, automated external and internal
reporting of patient data, standardized medical terms and patient registries and
fragmented point solutions.
6.1.4 Technology
1. eBay.com uses two data warehouses of 7.5 PB and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and
merchandising.
2. Amazon.com handles millions of back-end operations every day, as well
as queries from more than half a million third-party sellers. The core
technology that keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB,
18.5 TB, and 24.7 TB.
3. Facebook handles 50 billion photos from its user base.
4. As of August 2012, Google was handling roughly 100 billion searches per
month.
5. Oracle NoSQL Database has been tested past the 1M ops/sec mark with
8 shards and proceeded to hit 1.2M ops/sec with 10 shards.
6.2 The Future of Big Data
Those who feel that today's big data is just a continuation of past information trends are as wrong as if they were to claim that a stone tablet is essentially the same as a tablet computer, or an abacus similar to a supercomputer. Today, we have more information than ever. But the importance of all
that information extends beyond simply being able to do more, or know more,
than we already do. The quantitative shift leads to a qualitative shift. Hav-
ing more data allows us to do new things that weren't possible before. In other words: more is not just more. More is new. More is better. More is different. Of
course, there are still limits on what we can obtain from or do with data. But
most of our assumptions about the cost of collecting and the difficulty of pro-
cessing data need to be overhauled. No area of human endeavor or industrial
sector will be immune from the incredible shakeup that's about to occur as big data plows through society, politics, and business. People shape their tools, and their tools shape them. This new world of data, and how companies can harness
it, bumps up against two areas of public policy and regulation. The first is
employment.
Big data will bring about great things in society. We like to think that
technology leads to job creation, even if it comes after a temporary period of
disruption. That was certainly true during the Industrial Revolution. To be
sure, it was a devastating time of dislocation, but it eventually led to better
livelihoods. Yet this optimistic outlook ignores the fact that some industries
simply never recover from change. When tractors and automobiles replaced
horse-drawn plows and carriages, the need for horses in the economy basically ended. The upheavals of the Industrial Revolution created political change and gave rise to new economic philosophies and political movements. It's not much
of an intellectual stretch to predict that new political philosophies and social
movements will arise around big data, robots, computers, and the Internet, and
the effect of these technologies on the economy and representative democracy.
Recent debates over income inequality and the Occupy movement seem to point
in that direction.
Big data will change business, and business will change society. The hope,
of course, is that the benefits will outweigh the drawbacks, but that is mostly
a hope. The big-data world is still very new, and, as a society, we're not very good at handling all the data that we can collect now. We also can't foresee the
future. Technology will continue to surprise us, just as it would an ancient man
with an abacus looking upon an iPhone. What is certain is that more will not be more: it will be different. Clearly, Big Data is still in its beginnings, and much more remains to be discovered. For most companies it is currently just a fashionable keyword: it has great potential, but not many truly know what it is all about. A clear sign that there is more to big data than is currently shown on the market is that the big software companies either do not have, or do not present, their Big Data solutions, and those that do, like Google, do not use them in a commercial way.
Companies need to decide what kind of strategy to use to implement Big Data. They could use a more revolutionary approach and move all the data to the new Big Data environment, so that all reporting, modeling and interrogation is executed using the new business intelligence built on Big Data. [1] This approach is already used by many analytics-driven organizations that put all their data in the Hadoop environment and build business intelligence solutions on top of it. Another approach is the evolutionary approach: Big Data becomes an input to the current BI platform. The data is accumulated and analyzed using structured and unstructured tools, and the results are sent to the data warehouse. Standard modeling and reporting tools then have access to social media sentiments, usage records, and other processed Big Data items. [1] One of the issues with the evolutionary approach is that even though it gains most of the capabilities of the Big Data environment, it also inherits most of the problems of the classic business intelligence solution, and in some cases it can create a bottleneck between the information coming from Big Data and the analytical power of the traditional BI or data warehouse solution.
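To illustrate the revolutionary approach described above, where reporting runs directly against data kept in the Hadoop environment, the sketch below uses PySpark to aggregate social media sentiment stored on HDFS. The paths, schema and field names are hypothetical assumptions rather than part of the report.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bi-on-hadoop-sketch").getOrCreate()

# Assumed layout: raw JSON events landed on HDFS by an ingestion job.
events = spark.read.json("hdfs:///data/social_media/2016/*.json")

daily_sentiment = (events
                   .filter(F.col("product").isNotNull())
                   .groupBy("event_date", "product")
                   .agg(F.avg("sentiment_score").alias("avg_sentiment"),
                        F.count("*").alias("mentions")))

# Reporting tools read the aggregate directly from the Big Data environment.
daily_sentiment.write.mode("overwrite").parquet("hdfs:///warehouse/daily_sentiment")
spark.stop()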
Chapter 7
Conclusion
The availability of Big Data, low-cost commodity hardware, and new infor-
mation management and analytic software have produced a unique moment in
the history of data analysis. The convergence of these trends means that we
have the capabilities required to analyze astonishing data sets quickly and cost-
effectively for the first time in history. These capabilities are neither theoretical
nor trivial. They represent a genuine leap forward and a clear opportunity to
realize enormous gains in terms of efficiency, productivity, revenue, and prof-
itability.
The Age of Big Data is here, and these are truly revolutionary times if both
business and technology professionals continue to work together and deliver on
the promise.
Bibliography
[1] "Introduction", www.wikipedia.com
[2] "Characteristics of Big Data", www.studymafia.org
[3] "Big Data Analytics", www.computerweekly.com
[4] "Storage of Big Data", www.computerweekly.com
46

More Related Content

What's hot

SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETSSELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
Светла Иванова
 
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
Alan McSweeney
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
IJECEIAES
 
Big Data visualization
Big Data visualizationBig Data visualization
Big Data visualization
Shilpa Soi
 
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET Journal
 
Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...
IJMTST Journal
 

What's hot (6)

SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETSSELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
SELF-STUDY MATERIAL FOR THE USERS OF EUROSTAT MICRODATA SETS
 
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
GDPR - Context, Principles, Implementation, Operation, Data Governance, Data ...
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
 
Big Data visualization
Big Data visualizationBig Data visualization
Big Data visualization
 
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
IRJET- Magnetic Resonance Imaging (MRI) – Digital Transformation Journey Util...
 
Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...Identical Users in Different Social Media Provides Uniform Network Structure ...
Identical Users in Different Social Media Provides Uniform Network Structure ...
 

Viewers also liked

Virtual reality check phase ii version 2.0
Virtual reality check   phase ii version 2.0Virtual reality check   phase ii version 2.0
Virtual reality check phase ii version 2.0guestf144706
 
Virtual reality
Virtual realityVirtual reality
Virtual reality
Amit Sinha
 
Autonomic computing seminar documentation
Autonomic computing seminar documentationAutonomic computing seminar documentation
Autonomic computing seminar documentation
Georgekutty Francis
 
What is big data?
What is big data?What is big data?
What is big data?
David Wellman
 
Big data ppt
Big data pptBig data ppt
Big data ppt
IDBI Bank Ltd.
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
Bernard Marr
 

Viewers also liked (8)

Virtual reality check phase ii version 2.0
Virtual reality check   phase ii version 2.0Virtual reality check   phase ii version 2.0
Virtual reality check phase ii version 2.0
 
Virtual reality
Virtual realityVirtual reality
Virtual reality
 
VIRTUAL REALITY
VIRTUAL REALITYVIRTUAL REALITY
VIRTUAL REALITY
 
Autonomic computing seminar documentation
Autonomic computing seminar documentationAutonomic computing seminar documentation
Autonomic computing seminar documentation
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 

Similar to Big data

Big Data Social Network Analysis
Big Data Social Network AnalysisBig Data Social Network Analysis
Big Data Social Network Analysis
Chamin Nalinda Loku Gam Hewage
 
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
IJCSIS Research Publications
 
Big Data - Insights & Challenges
Big Data - Insights & ChallengesBig Data - Insights & Challenges
Big Data - Insights & Challenges
Rupen Momaya
 
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf6510.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
Med labbi
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
BizuayehuDesalegn
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurt Portelli
 
Al-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectAl-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectLeila Al-Mqbali
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
Shital Kat
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - Report
Nandu B Rajan
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisOktay Bahceci
 
BIG DATA-Seminar Report
BIG DATA-Seminar ReportBIG DATA-Seminar Report
BIG DATA-Seminar Report
josnapv
 
Big Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in BetweenBig Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in Between
xband
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
Sridhar Mamella
 
Big data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightBig data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightJyrki Määttä
 
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESpredictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
EMERSON EDUARDO RODRIGUES
 
Secure and Smart IoT using Blockchain and AI
Secure and Smart  IoT using Blockchain and AISecure and Smart  IoT using Blockchain and AI
Secure and Smart IoT using Blockchain and AI
Ahmed Banafa
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
vinoth raja
 
Data mining tutorial
Data mining tutorialData mining tutorial
Data mining tutorial
grinu
 
Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.
KeyFruit Inc.
 

Similar to Big data (20)

Big Data Social Network Analysis
Big Data Social Network AnalysisBig Data Social Network Analysis
Big Data Social Network Analysis
 
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
Botnet Detection and Prevention in Software Defined Networks (SDN) using DNS ...
 
Big Data - Insights & Challenges
Big Data - Insights & ChallengesBig Data - Insights & Challenges
Big Data - Insights & Challenges
 
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf6510.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertation
 
Al-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectAl-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research Project
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - Report
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_Analysis
 
BIG DATA-Seminar Report
BIG DATA-Seminar ReportBIG DATA-Seminar Report
BIG DATA-Seminar Report
 
Big Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in BetweenBig Data, Little Data, and Everything in Between
Big Data, Little Data, and Everything in Between
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
Big data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightBig data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sight
 
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESpredictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
 
Secure and Smart IoT using Blockchain and AI
Secure and Smart  IoT using Blockchain and AISecure and Smart  IoT using Blockchain and AI
Secure and Smart IoT using Blockchain and AI
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
 
Data mining tutorial
Data mining tutorialData mining tutorial
Data mining tutorial
 
Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.Digital Asset Management Whitepaper by KeyFruit Inc.
Digital Asset Management Whitepaper by KeyFruit Inc.
 

Recently uploaded

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Big data

  • 1. Big Data Submitted in partial fulfillment of the requirements for the award of the Degree of Information Technology Submitted By Mr.Prashant Maruti Navatre (Regisration No.20130737) the guidance of Prof.S.S.Barphe DEPARMENT OF INFORMATION TECHNOLOGY DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY, LONERE, RAIGAD-MAHARASHTRA, INDIA-400103 2015-2016
  • 2. DR.BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY LONERE, RAIGAD-MAHARASHTRA, INDIA-400103 Certificate This is to certify that the project entitled “Big Data” is submitted by Mr.Prashant Maruti Navatre , Registration No. 20130737 for the partial fulfilment of the requirement for the award of the degree of Bachelor of Technology in INFORMATION TECHNOLOGY of the Dr. Babasaheb Ambedkar Technological University, Lonere is a bonafide work carried out during the academic year 2015-2016. Prof.S.S.Barphe Dr.S.M.Jadhav (Seminar Guide) (Head Of Deparment) Information Technology Information Technology Examiners: 1). 2). Date: Place:Vidyavihar Lonere-402103
  • 3. Acknowledgment I am pleased to present this seminar report entitled “Big Data”. It is indeed a great pleasure and a moment of immense satisfaction for me to express my sense of profound gratitude and indebtedness towards my guide Prof. S.S.Baprhe whose enthusiasm are the source of inspiration for me. I am extremely thankful for the guidance and untiring attention, which he bestowed on me right from the beginning. Her valuable and timely suggestions at crucial stages and above all his constant encouragement have made it possible for me to achieve this work. I would also like to give my sincere thanks to S.M. JADHAV Head of INFORMATION TECHNOLOGY for necessary help and providing me the required facilities for completion of this seminar report. I would like to thank the entire Teaching staffs who are directly or indirectly involved in the various data collection and software assistance to bring forward this seminar report. I express my deep sense of gratitude towards my parents for their sustained cooperation and wishes, which have been a prime source of inspiration to take this seminar work to its end without any hurdles.Last but not the least, I would like to thank all my B.Tech. colleagues for their co-operation and useful suggestion and all those who have directly or indirectly helped me in completion of this seminar work. Date Prashant Maruti Navatre Place 20130737
  • 4. ABSTRACT Big data is a term for massive data sets having large, more varied and complex structure with the difficulties of storing, analyzing and visualizing for further processes or results. The process of research into massive amounts of data to reveal hidden patterns and secret correlations named as big data analytics. These useful informations for companies or organizations with the help of gain- ing richer and deeper insights and getting an advantage over the competition. For this reason, big data implementations need to be analyzed and executed as accurately as possible. This paper presents an overview of big data’s con- tent, scope, samples, methods, advantages and challenges and discusses privacy concern on it. Every day, we create 2.5 quintillion bytes(one quintillion bytes = one billion gigabytes). of data so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is Big Data. I
  • 5. Contents 1 Introduction 1 1.1 The General Concept Of Big Data . . . . . . . . . . . . . . . . . 2 1.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 The Need For Big Data . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Sources Of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Characteristics of Big Data 10 2.1 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5 Veracity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.6 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Storage,Selection and Processing of Big Data 16 3.1 Storage of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Key Requirements of Big Data . . . . . . . . . . . . . . . 18 3.2 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Processing of Big Data . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.1 Batch Processing . . . . . . . . . . . . . . . . . . . . . . 23 II
  • 6. CONTENTS CONTENTS 3.3.2 Stream Processing . . . . . . . . . . . . . . . . . . . . . 24 3.3.3 Hadoop Ecosystem . . . . . . . . . . . . . . . . . . . . . 25 3.3.4 Map and Reduce . . . . . . . . . . . . . . . . . . . . . . 25 4 Big Data Analytics 27 4.1 Examples of Big Data Analytics . . . . . . . . . . . . . . . . . . 29 4.2 Benefits of Big Data Analytics . . . . . . . . . . . . . . . . . . . 30 5 Challenges in Big Data 31 5.1 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.2.1 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.2.2 Ineffectiveness . . . . . . . . . . . . . . . . . . . . . . . . 34 5.3 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.4 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . 35 5.5 Behavioral change . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6 Applications and Future of Big Data 38 6.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.1.1 Government . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.1.2 Cyber-Physical Models . . . . . . . . . . . . . . . . . . . 40 6.1.3 Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.1.4 Technology . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.2 The Future of Big Data . . . . . . . . . . . . . . . . . . . . . . 42 7 Conclusion 45 References 46 III
  • 7. List of Figures 1.1 Visualization of daily Wikipedia edits created by IBM . . . . . . 2 1.2 Growth and Digitalization of Global and Information Storage capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Sources of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Architecture of Big Data . . . . . . . . . . . . . . . . . . . . . . 9 3.1 Selection of Big Data . . . . . . . . . . . . . . . . . . . . . . . . 22 IV
  • 8. Chapter 1 Introduction In recent years , the term Big Data has emerged to describe a new paradigm for data applications. New Technologies tend to emerge with a lot of hype , but it can take some time to tell what is new and different. While big data has been defined in a myriad of ways, the heart of the Big Data paradigm is that too big (volume),arrives too fast(velocity),changes too fast(variability),conatins too much(vearcity),or is too diverse(variety) to be proceesed within a local com- puting structure using traditional approaches and techniques. The technologies being introduced to support this paradigm have a wide variety of interfaces making it difficult to construct tools and applications that integrate data from multiple Big Data sources. Analysis of data sets can find new correlations, to ”spot business trends, prevent diseases, combat crime and so on. Scientists, business executives, prac- titioners of media and advertising and governments alike regularly meet difficul- ties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including me- teorology, genomics, connectomics, complex physics simulations, and biological and environmental research. 1
  • 9. Big Data Dept.of Information Technology Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that en- able enhanced insight, decision making, and process automation.Data sets grow in size in part because they are increasingly being gathered by cheap and nu- merous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. 1.1 The General Concept Of Big Data The term Big Data is an imprecise description of a rich and complicated set of characteristics, practices, techniques, ethical issues, and outcomes all associated with data.Big Data originated in the physical sciences, with physics and astronomy early to adopt of many of the techniques now called Big Data. Instruments like the Large Hadron Collider and the Square Kilometer Array are massive collectors of exabytes of information, and the ability to collect such massive amounts of data necessitated an increased capacity to manipulate and analyze these data as well. Figure 1.1: Visualization of daily Wikipedia edits created by IBM DR.B.A.T.UNIVERSITY 2
  • 10. Big Data Dept.of Information Technology 1.2 Definition Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reduction and reduced risk. 1.3 History Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didnt have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didnt have those traditional forms. They didnt have to merge big data technologies with their traditional IT infrastructures because those infrastruc- tures didnt exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture. Consider, however, the position of large, well-established businesses. Big data in those environments shouldnt be separate, but must be integrated with every- thing else thats going on in the company.Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work DR.B.A.T.UNIVERSITY 3
  • 11. Big Data Dept.of Information Technology alongside IBM mainframes.Data scientists must somehow get along and work jointly with mere quantitative analysts.In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fit in to their overall data and analytics environments. Overall, we found the expected co-existence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which well call Analytics 3.0. In this paper well describe the overall context for how organizations think about big data, the organizational structure and skills required for itetc. Well conclude by describing the Analytics 3.0 era. In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as be- ing three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).Gartner, and now much of the industry, continue to use this ”3Vs” model for describ- ing big data.In 2012, Gartner updated its definition as follows: ”Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” Additionally, a new V ”Veracity” is added by some organizations to describe it. 1.4 The Need For Big Data Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings. Like traditional analytics, it can also support internal business decisions. The technologies and concepts DR.B.A.T.UNIVERSITY 4
  • 12. Big Data Dept.of Information Technology Figure 1.2: Growth and Digitalization of Global Information Storage Capacity behind big data allow organizations to achieve a variety of objectives, but most of the organizations we interviewed were focused on one or two. The chosen objectives have implications not only for the outcome and financial benefits from big data, but also for the process: who leads the initiative, where it fits within the organization, and how to manage the project. As the world becomes more connected via technology, the amount of data flowing into companies is growing exponentially and identifying value in that data becomes more difficult; as the data haystack grows larger, the needle becomes more difficult to find. So Big Data is really about finding the needles: gathering, sorting and analyzing the flood of data to find the valuable information on which sound business decisions are made. When applied to energy-related businesses, Big Data implications vary by market segment; Big Data concerns for utilities are not the same as those for energy trading organizations, yet the necessity for solving those problems can be equally pressing. While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a more sound distinction between big data and Business DR.B.A.T.UNIVERSITY 5
  • 13. Big Data Dept.of Information Technology Intelligence, regarding data and their use: Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends etc. Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density in order to reveal relationships and dependencies and to perform predictions of outcomes and behaviors. The real issue is not that you are acquiring large amounts of data. It’s what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1. cost reductions 2. time reductions 3. new product development and optimized offerings 4. smarter business decision making 1.5 Sources Of Big Data The sources and formats of data continue to grow in variety and complexity. A partial list of sources includes the public web; social media; mobile applications; federal, state and local records and databases; commercial databases that aggregate individual data from a spectrum of commercial transactions and public records; geospatial data; surveys; and traditional offline documents scanned by optical character recognition into electronic form. The advent of more Internet-enabled devices and sensors expands the capacity to collect data from DR.B.A.T.UNIVERSITY 6
  • 14. Big Data Dept.of Information Technology physical entities, including sensors and radio-frequency identifica-tion (RFID) chips. Personal location data can come from GPS chips, cell-tower triangula- tion of mobile devices, mapping of wireless networks, and in-person payments There are many different types of Big Data sources e.g.: 1. Social media data 2. Personal data (e.g. data from tracking devices) 3. Sensor data 4. Transactional data 5. Enterprise data There are different opinions on whether Enterprise data should be consid- ered to be Big Data or not. Enterprise data are usually large in volume, they are generated for a different purpose and arise organically through Enterprise processes. Also the content of Enterprise data is usually not designed by re- searchers. For these reasons, and because there is a great potential in using Enterprise data, we will consider it to be in scope for this report. There are a number of differences between Enterprise data and other types of Big Data that are worth pointing out.The amount of control a researcher has and the potential inferential power vary between different types of Big Data sources. For example, a researcher will likely not have any control of data from different social media platforms and it could be difficult to decipher a text from social media. For Enterprise data on the other hand, a statistical agency can form partnership with owners of the data and influence the design of the data. En- terprise data is more structured, well defined and more is known about the data than perhaps other Big Data sources. DR.B.A.T.UNIVERSITY 7
  • 15. Big Data Dept.of Information Technology Figure 1.3: Sources of Big Data 1.6 Architecture In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying. Structured, semi-structured and/or unstructured data is stored and distributed across multiple servers. Querying of the data is done using a modified C++ dialect called ECL, which applies a schema-on-read method to create the structure of the stored data at query time. In 2004 LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. and their high-speed parallel processing platform. The two platforms were merged into HPCC Systems, which was open sourced in 2011 under the Apache v2.0 License. Currently HPCC and the Quantcast File System are the only publicly available platforms capable of analyzing multiple exabytes of data. In 2004, Google published a paper on a process called MapReduce that used such an architecture. The MapReduce framework provides a parallel processing model and associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and DR.B.A.T.UNIVERSITY 8
  • 16. Big Data Dept.of Information Technology Figure 1.4: Architecture of Big Data processed in parallel (the Map step). The results are then gathered and deliv- ered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open source project named Hadoop. Recent studies show that the use of a multiple layer architecture is an option for dealing with big data. The Distributed Parallel architecture distributes data across multiple processing units and parallel processing units provide data much faster, by improving processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power trans- parent to the end user by using a front end application server. DR.B.A.T.UNIVERSITY 9
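The Map and Reduce flow just described (split the input, map each split in parallel, shuffle the intermediate results by key, then reduce) can be illustrated without any Hadoop cluster at all. Below is a minimal, self-contained Python sketch of the same idea applied to word counting; the toy documents, the four-process pool and the word-count task are assumptions made for the example, not part of the architecture above.

from collections import defaultdict
from multiprocessing import Pool

def map_phase(document):
    # Map step: emit (key, value) pairs for one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_outputs):
    # Shuffle step: group all values emitted for the same key.
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_phase(item):
    # Reduce step: aggregate the grouped values for one key.
    key, values = item
    return key, sum(values)

if __name__ == "__main__":
    documents = [                      # toy input splits (illustrative)
        "big data needs parallel processing",
        "mapreduce splits work across parallel nodes",
        "results are gathered in the reduce step",
    ]
    with Pool(processes=4) as pool:
        mapped = pool.map(map_phase, documents)                          # Map in parallel
        counts = dict(pool.map(reduce_phase, shuffle(mapped).items()))   # Shuffle, then Reduce
    print(counts["parallel"])  # prints 2

On a real cluster the same three phases run distributed: map tasks execute on the nodes that hold the file blocks, and the shuffle moves the intermediate pairs over the network before the reduce tasks aggregate them.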
  • 17. Chapter 2 Characteristics of Big Data 2.1 Volume The quantity of generated data is important in this context. The size of the data determines the value and potential of the data under consideration, and whether it can actually be considered big data or not. The name big data itself contains a term related to size, and hence the characteristic. Many fac- tors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increas- ing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data. This refers to the sheer amount of data available for analysis.This volume of data is driven by the increasing number of data collection instruments (e.g., social me- dia tools, mobile applications, sensors) as well as the increased ability to store and transfer those data with recent improvements in data storage and network- ing.Traditionally, the data volume requirements for analytic and transactional applications were in sub-terabyte territory.However, over the past decade, more organizations in diverse industries have identified requirements for analytic data 10
  • 18. Big Data Dept.of Information Technology volumes in the terabytes, petabytes,and beyond. Estimates produced by lon- gitudinal studies started in 2005[8] show that the amount of data in the world is doubling every two years. Should this trend continue, by 2020, there will be 50 times the amount of data as there had been in 2011. Other estimates indicate that 90 % of all data ever created, was created in the past 2 years.The sheer volume of the data are colossal - the era of a trillion sensors is upon us. This volume presents the most immediate challenge to conventional infor- mation technology structures. It has stimulated new ways for scalable storage across a collection of horizontally coupled resources, and a distributed approach to querying. Briefly, the traditional relational model has been relaxed for the persistence of newly prominent data types. These logical non-relational data models, typically lumped together as NoSQL, can currently be classified as Big Table, Name-Value, Document and Graphical models. 2.2 Velocity The term velocity in the context refers to the speed of generation of data or how fast the data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development.This refers to both the speed at which these data collection events can occur, and the pres- sure of managing large streams of real-time data.Across the means of collecting social information, new information is being added to the database at rates ranging from as slow as every hour or so, to as fast as thousands of events per second. In this context, the speed at which the data is generated and processed to meet the demands and the challenges that lie in the path of growth and de- velopment.Data is streaming in at unprecedented speed and must be dealt with in a timely manner.RFID tags,sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal DR.B.A.T.UNIVERSITY 11
  • 19. Big Data Dept.of Information Technology with data velocity is a challenge for most organizations The type of content, and an essential fact that data analysts must know.This helps people who are associated with and analyze the data to effectively use the data to their advan- tage and thus uphold its importance.The Velocity is the speed/rate at which the data are created, stored, analysed and visualized. Traditionally, most en- terprises separated their transaction processing and analytics. Enterprise data analytics were concerned with batch data extraction, processing, replication, delivery, and other applications. But increasingly, organizations everywhere have begun to emphasize the need for real-time, streaming, continuous data discovery, extraction, processing, analysis, and access. In the big data era, data are created in real-time or near real-time. With the availability of Internet connected devices, wireless or wired, machines and devices can pass-on their data the moment it is created. Data Flow rates are increasing with enormous speeds and variability, creating new challenges to enable real or near real-time data usage. Traditionally this concept has been described as streaming data. As such there are aspects of this that are not new, as companies such as those in telecommunication have been sifting through high volume and velocity data for years. The new horizontal scaling approaches do however add new big data engineering options for efficiently handling this data. 2.3 Variety Data today comes in all types of formats. Structured, numeric data in tra- ditional databases. Information created from line-of-business applications. Un- structured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with. The type of content, and an essential fact that data analysts must know. This helps people who are asso- DR.B.A.T.UNIVERSITY 12
  • 20. Big Data Dept.of Information Technology ciated with and analyze the data to effectively use the data to their advantage and thus uphold its importance.Variety refers to the complexity of formats in which Big Data can exist. Besides structured databases, there are large streams of unstructured documents, images, email messages, video, links between de- vices and other forms that create a heterogeneous set of data points. One effect of this complexity is that structuring and tying data together becomes a ma- jor effort, and therefore a central concern of Big Data analysis. Traditionally, enterprise data implementations for analytics and transactions operated on a single structured, row-based, relational domain of data. However, increasingly, data applications are creating, consuming, processing, and analysing data in a wide range of relational and non-relational formats including structured, un- structured, semistructured, documents and so forth from diverse application domains.Traditionally, a variety of data was handled through transforms or pre-analytics to extract features that would allow integration with other data through a relational model. Given the wider range of data formats, structures, timescales and semantics that are desirous to use in analytics, the integration of this data becomes more complex. This challenge arises as data to be integrated could be text from social networks, image data, or a raw feed directly from a sensor source. The Internet of Things is the term used to describe the ubiquity of connected sensors, from RFID tags for location, to smartphones, to home utility meters. The fusion of all of this streaming data will be a challenge for developing a total situational awareness. Big Data Engineering has spawned data storage models that are more efficient for unstructured data types than a relational model, causing a derivative issue for the mechanisms to integrate this data. It is possible that the data to be integrated for analytics may be of such volume that it cannot be moved in order to integrate, or it may be that some of the data are not under control of the organization creating the DR.B.A.T.UNIVERSITY 13
  • 21. Big Data Dept.of Information Technology data system. In either case, the variety of big data forces a range of new big data engineering in order to efficiently and automatically integrate data that is stored across multiple repositories and in multiple formats. 2.4 Variability The inconsistency the data can show at times-which can hamper the process of handling and managing the data effectively This is a factor which can be a problem for those who analyse the data. This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively. Variability refers to changes in data rate, format/structure, semantics, and/or quality that impact the supported application, analytic, or problem. Specifically, variability is a change in one or more of the other Big Data characteristics. Impacts can include the need to refactor architectures, interfaces, processing/algorithms, integration/fusion, storage, applicability, or use of the data. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved.The other characteristics directly affect the scope of the impact for a change in one dimension. For, example in a system that deals with petabytes or exabytes of data refactoring the data architecture and performing the necessary transformation to accommodate a change in structure from the source data may not even be feasible even with the horizontal scaling typically associated with big data architectures. In addition, the trend to integrate data from outside the organization to obtain more refined analytic results combined with the rapid evolution in technology means that enterprises must be able to adapt rapidly to data variations. DR.B.A.T.UNIVERSITY 14
  • 22. Big Data Dept.of Information Technology 2.5 Veracity The quality of captured data, which can vary greatly.Accurate analysis de- pends on the veracity of source data. Veracity refers to the trustworthiness, applicability, noise, bias, abnormality and other quality properties in the data. Veracity is a challenge in combination with other Big Data characteristics, but is essential to the value associated with or developed from the data for a specific problem/application. Assessment, understanding, exploiting, and controlling Veracity in Big Data cannot be addressed efficiently and sufficiently through- out the data lifecycle using current technologies and techniques. 2.6 Complexity Data management can be very complex, especially when large volumes of data come from multiple sources. Data must be linked, connected, and correlated so users can grasp the information the data is supposed to convey. Today’s data comes from multiple sources. And it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. DR.B.A.T.UNIVERSITY 15
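To make the variety discussed in Section 2.3 concrete, the sketch below shows one and the same purchase event in three of the data shapes named in this chapter: a flat relational-style row, a simple name-value pair, and a nested document of the kind a document store would hold. The field names and values are invented for illustration.

import json

# One purchase event in three shapes (invented example values).
row = ("U1001", "2015-11-02", 499.00, "Pune")            # relational row: fixed columns
kv_pair = ("purchase:U1001:2015-11-02", "499.00|Pune")   # name-value pair: opaque value under one key
document = {                                             # document model: nested and self-describing
    "user": "U1001",
    "date": "2015-11-02",
    "amount": 499.00,
    "items": [{"sku": "B-17", "qty": 2}],
    "channel": {"type": "mobile", "os": "Android"},
}

print(json.dumps(document, indent=2))  # documents serialize naturally to JSON

Integrating such heterogeneous shapes is exactly the effort Section 2.3 flags: the row assumes a fixed schema, while the document's structure may differ from one record to the next.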
  • 23. Chapter 3 Storage,Selection and Processing of Big Data 3.1 Storage of Big Data The explosive growth of data has more strict requirements on storage and management. In this section, we focus on the storage of big data. Big data stor- age refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to pro- vide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data. Traditionally, as auxiliary equipment of server, data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage device is becoming increasingly more important, and many Internet companies pursue big capacity of storage to be competitive. Therefore, there is a compelling need for research on data stor- age.Storage system for massive data.Various storage systems emerge to meet 16
  • 24. Big Data Dept.of Information Technology the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while net- work storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN). In DAS, various harddisks are directly con- nected with servers, and data management is server-centric, such that storage devices are peripheral equipments, each of which takes a certain amount of I/O resource and is managed by an individual application software. For this reason, DAS is only suitable to interconnect servers with a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly lim- ited. Thus, DAS is mainly used in personal computers and small-sized servers. Network storage is to utilize network to provide users with a union interface for data access and sharing. Network storage equipment includes special data exchange equipments, disk array, tap library, and other storage media, as well as special storage software. It is characterized with strong expandability. NAS is actually an auxillary storage equipment of a network. It is directly connected to a network through a hub or switch through TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively since the server accesses a storage device in- directly through a network. While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data stor- age management is relatively independent within a storage local area network, where multipath based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management. DR.B.A.T.UNIVERSITY 17
  • 25. Big Data Dept.of Information Technology 3.1.1 Key Requirements of Big Data At root, the key requirements of big data storage are that it can handle very large amounts of data and keep scaling to keep up with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools The largest big data practitioners Google, Face- book, Apple, etc run what are known as hyperscale computing environments. These comprise vast amounts of commodity servers with direct-attached storage (DAS). Redundancy is at the level of the entire compute/storage unit, and if a unit suffers an outage of any component it is replaced wholesale, having already failed over to its mirror. Such environments run the likes of Hadoop, NoSQL and Cassandra as analytics engines, and typically have PCIe flash storage alone in the server or in addition to disk to cut storage latency to a minimum. Theres no shared storage in this type of configuration. Hyperscale computing environ- ments have been the preserve of the largest web-based operations to date, but it is highly probable that such compute/storage architectures will bleed down into more mainstream enterprises in the coming years. The appetite for build- ing hyperscale systems will depend on the ability of an enterprise to take on a lot of in-house hardware building and maintenance and whether they can jus- tify such systems to handle limited tasks alongside more traditional enterprise environments that handle large amounts of applications on less specialised sys- tems. But hyperscale is not the only way. Many enterprises, and even quite small businesses, can take advantage of big data analytics. They will need the ability to handle relatively large data sets and handle them quickly, but may not need quite the same response times as those organisations that use it push adverts out to users over response times of a few seconds. So the key type of big data storage system with the attributes required will often be scale- out or clustered NAS. This is file access shared storage that can scale out to DR.B.A.T.UNIVERSITY 18
  • 26. Big Data Dept.of Information Technology meet capacity or increased compute requirements and uses parallel file systems that are distributed across many storage nodes that can handle billions of files without the kind of performance degradation that happens with ordinary file systems as they grow. For some time, scale-out or clustered NAS was a distinct product category, with specialised suppliers such as Isilon and BlueArc. But a measure of the increasing importance of such systems is that both of these have been bought relatively recently by big storage suppliers EMC and Hitachi Data Systems, respectively. Meanwhile, clustered NAS has gone mainstream, and the big change here was with NetApp incorporating true clustering and petabyte/parallel file system capability into its Data ONTAP OS in its FAS filers. The other storage format that is built for very large numbers of files is object storage. This tackles the same challenge as scale-out NAS that tradi- tional tree-like file systems become unwieldy when they contain large numbers of files. Object-based storage gets around this by giving each file a unique identifier and indexing the data and its location. Its more like the DNS way of doing things on the internet than the kind of file system were used to.Object storage systems can scale to very high capacity and large numbers of files in the bil- lions, so are another option for enterprises that want to take advantage of big data. Having said that, object storage is a less mature technology than scale- out NAS. So, to sum up, big data storage needs to be able to handle capacity and provide low latency for analytics work. You can choose to do it like the big boys in hyperscale environments or adopt NAS or object storage in more tradi- tional IT departments to do the job. Flash storage solutions, implemented at the server level and with all-flash arrays, offer some interesting alternatives for high-performance, low-latency storage, from a few terabytes to a hundred ter- DR.B.A.T.UNIVERSITY 19
  • 27. Big Data Dept.of Information Technology abytes or more in capacity. Object-based, scale-out architectures with erasure coding can provide scalable storage systems that eschew traditional RAID and replication methods to achieve new levels of efficiency and lower per-gigabyte costs. 3.2 Selection of Big Data Every organization seeking to make sense of big data must determine which platforms and tools, in the sea of available options, will help them to meet their business goals.Answering the following eight questions can help guide IT lead- ers to make the right data management choices for their organizations future success. For organizations needing to store and process tens of terabytes of data, using an open-source distributed file system is a mature choice due to its predictable scalability over clustered hardware. Plus, its the base platform for many big data architectures already. However, if looking to run analytics in online or real-time applications, consider hybrid architectures containing dis- tributed file systems combined with distributed database management systems (which have lower latency. Or look at large traditional relational systems to get real-time access to data that has been through the heavy lifting processes of a distributed file system. Many NoSQL databases require specific applica- tion interfaces (APIs) in order to access the data. With this, youll need to consider the integration of visualization or other tools that will need access to the data. If the tools being used with the big data platform need a SQL interface, choose a tool that has maturity in that area. Of note, NoSQL and big data platforms are evolving quickly and businesses just starting to build custom applications on top of a big data platform may be able to build around the sometimes raw data access frameworks. Alternatively, businesses with ex- isting applications will need a more mature offering. If data requirements are DR.B.A.T.UNIVERSITY 20
  • 28. Big Data Dept.of Information Technology especially unstructured, or include streaming data sources such as social media or video, businesses should look into data serialization technologies that allow capture, storage and representation of such high-velocity data. How applications consume data should also be taken into consideration. For instance, some existing tools allow users to project different structures across the data store, giving flexibility to store data in one way and access it in another. Yes, being flexible in how data is presented to consuming applications is a benefit, but the performance may not be good enough for high-velocity data. To overcome this performance challenge, you may need to integrate with a more structured data store further downstream in your data architecture. If looking to extend your current data architecture by integrating a big data platform into an existing data warehouse, data integration tools can help. Many integration vendors that support big data platforms also have specialized support for integrating with SQL data warehouses and data marts. Clearly, choosing a big data solution isn't easy. As companies of all sizes try to extract more from their existing data stores, Big Data vendors are rushing in to provide a range of Big Data solutions, which comprise everything from database technology to visualization tools. With such a diverse selection of tools to choose from, buyers must carefully define their goals in order to find the right tools to meet them. Before finding the right tools, however, organizations must first ask themselves what business problems they're trying to solve – and why. "Too many big data projects don't start with problems to solve, but rather start with exploratory analytics," said Chris Selland, VP of marketing and business development for HP Vertica. "That's okay to a point, but eventually these questions need to be asked and answered." Companies have a lot of Big Data and many questions, but that doesn't result DR.B.A.T.UNIVERSITY 21
  • 29. Big Data Dept.of Information Technology in the CIO or CFO simply handing you a large amount of money to work with. Figure 3.1: Selection of Big Data 3.3 Processing of Big Data A variety of platforms have emerged to process big data,including advanced SQL (sometimes called NewSQL) databases that adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms that may range from file systems to document or columnar data stores that typically dispense with the need for modelling data. Most of the early implementations of big data, especially with NoSQL platforms such as Hadoop, have focused more on volume and variety, with results delivered through batch processing. Behind the scenes, there is a growing range of use cases that also emphasise speed. Some of them consist of new applications that take advantage not only of powerful back-end data platforms, but also the growth in bandwidth and mobility. Examples include mobile applications such as Waze that harness sen- sory data from smartphones and GPS devices to provide real-time pictures of traffic conditions. On the horizon there are opportunities for mobile carriers DR.B.A.T.UNIVERSITY 22
  • 30. Big Data Dept.of Information Technology to track caller behaviour in real time to target ads, location-based services, or otherwise engage their customers, as well as Conversely, existing applications are being made more accurate, responsive and effective as smart sensors add more data points, intelligence and adaptive control. These are as diverse as optimising supply chain inventories, regulating public utility and infrastruc- ture networks, or providing real-time alerts for homeland security. The list of potential opportunities for fast processing of big data is limited only by the imagination. 3.3.1 Batch Processing Apache Hadoop is a distributed computing framework modeled after Google MapReduce to process large amounts of data in parallel. Once in a while, the first thing that comes to my mind when speaking about distributed computing is EJB. EJB is de facto a component model with remoting capability but short of the critical features being a distributed computing framework, that include computational parallelization, work distribution, and tolerance to unreliable hardware and software. Hadoop on the other hand has these merits built-in. ZooKeeper modeled on Google Chubby is a centralized service for maintain- ing configuration information, naming, providing distributed synchronization, and group services for the Hadoop cluster. Hadoop Distributed File System (HFDS) modeled on Google GFS is the underlying file system of a Hadoop cluster. HDFS works more efficiently with a few large data files than numer- ous small files. A real-world Hadoop job typically takes minutes to hours to complete, therefore Hadoop is not for real-time analytics, but rather for offline, batch data processing. Recently, Hadoop has undergone a complete overhaul for improved maintainability and manageability. Something called YARN (Yet Another Resource Negotiator) is at the center of this change. One major ob- DR.B.A.T.UNIVERSITY 23
  • 31. Big Data Dept.of Information Technology jective of Hadoop YARN is to decouple Hadoop from MapReduce paradigm to accommodate other parallel computing models, such as MPI (Message Passing Interface and Spark. In general, data flows from components to components in an enterprise ap- plication. This is the case for application frameworks (EJB and Spring frame- work), integration engines (Camel and Spring Integration), as well as ESB (Enterprise Service Bus) products. Nevertheless, for the data-intensive pro- cesses Hadoop deals with, it makes better sense to load a big data set once and perform various analysis jobs locally to minimize IO and network cost, the so-called ”Move-Code-To-Data” philosophy. When you load a big data file to HDFS, the file is split into chunks (or file blocks) through a centralized Name Node (master node) and resides on individual Data Nodes (slave nodes) in the Hadoop cluster for parallel processing. 3.3.2 Stream Processing Stream data processing is not intended to analyze a full big data set, nor is it capable of storing that amount of data (The Storm-on-YARN project is an exception). While you may be asked to build a real-time ad-hoc analytics system that operates on a complete big data set, you really need some mighty tools. Twitter Storm is an open source, big-data processing system intended for distributed, real-time streaming processing. Storm implements a data flow model in which data (time series facts) flows continuously through a topology (a network of transformation entities). The slice of data being analyzed at any moment in an aggregate function is specified by a sliding window, a concept in CEP/ESP. A sliding window may be like ”last hour”, or ”last 24 hours”, which is constantly shifting over time. Data can be fed to Storm through distributed DR.B.A.T.UNIVERSITY 24
  • 32. Big Data Dept.of Information Technology messaging queues like Kafka, Kestrel, and even regular JMS. Trident is an ab- straction API of Storm that makes it easier to use. Like Twitter Storm, Apache S4 is a product for distributed, scalable, continuous, stream data processing. Note, the size of a sliding window cannot grow infinitely. 3.3.3 Hadoop Ecosystem Hadoop API is often considered low level, as it is not easy to program with. The quickly growing Hadoop ecosystem offers a list of abstraction techniques, which encapsulate and hide the programming complexity of Hadoop. Pig, Hive, Cascading, Crunch, Scrunch, Scalding, Scoobi, and Cascalog all aim to provide low cost entry to Hadoop programming. Pig, Crunch (Scrunch), and Cascading are data-pipe based techniques. A data pipe is a multi-stepped process, in which transformation, splitting, merging, and join may be conducted individually at each step. Thinking about a work flow in a general work flow engine, a data pipe is similar. Hive on the other hand works like a data warehouse by offering a SQL compatible interactive shell. Programs or shell scripts developed on top of these techniques are compiled to native Hadoop Map and Reduce classes behind the scene to run in the cluster. Given the simplified programming interfaces in conjunction with libraries of reusable functions, development productivity is greatly improved. 3.3.4 Map and Reduce A centralized JobTracker process in the Hadoop cluster moves your code to data. The code hereby includes a Map and a Reduce class. Put simply, a Map class does the heavy-lifting job of data filtering, transformation, and splitting. For better IO and network efficiency, a Mapper instance only processes the data chunks co-located on the same data node, a concept termed data locality (or DR.B.A.T.UNIVERSITY 25
  • 33. Big Data Dept.of Information Technology data proximity). Mappers can run in parallel on all the available data nodes in the cluster. The outputs of the Mappers from different nodes are shuffled through a particular algorithm to the appropriate Reduce nodes. A Reduce class by nature is an aggregator. The number of Reducer instances is configurable to developers. DR.B.A.T.UNIVERSITY 26
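One common low-level way to write the Map and Reduce classes described above without Java is Hadoop Streaming, which lets any executable play the Mapper and Reducer roles by reading standard input and writing tab-separated key/value lines. The single-file job below is a hedged sketch of that pattern for word counting; the file name and the local test pipeline are assumptions, not the report's own setup.

#!/usr/bin/env python
# wordcount.py -- a hypothetical Hadoop Streaming job in one file.
# Local test: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def mapper():
    # Mapper role: split each input line and emit (word, 1) pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Reducer role: aggregate counts; input arrives grouped by key
    # (Hadoop's shuffle, or the local `sort`, guarantees the grouping).
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

On a cluster such a script is typically submitted through the hadoop-streaming jar, supplied as both the -mapper and -reducer commands; the JobTracker then moves the code to the Data Nodes holding the splits, exactly as described above.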
  • 34. Chapter 4 Big Data Analytics Big data is now a reality: The volume, variety and velocity of data coming into your organization continue to reach unprecedented levels. This phenom- enal growth means that not only must you understand big data in order to decipher the information that truly counts, but you also must understand the possibilities of big data analytics. Big data analytics is the process of examin- ing big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions can’t touch. Consider that your or- ganization could accumulate (if it hasn’t already) billions of rows of data with hundreds of millions of data combinations in multiple data stores and abundant formats. High-performance analytics is necessary to process that much data in order to figure out what’s important and what isn’t. Enter big data analytics. Why collect and store terabytes of data if you can’t analyze it in full context? Or if you have to wait hours or days to get results? With new advances in computing technology, there’s no need to avoid tackling even the most chal- lenging business problems. For simpler and faster processing of only relevant data, you can use high-performance analytics. Using high-performance data mining, predictive analytics, text mining, forecasting and optimization on big 27
  • 35. Big Data Dept.of Information Technology data enables you to continuously drive innovation and make the best possible decisions. In addition, organizations are discovering that the unique properties of machine learning are ideally suited to addressing their fast-paced big data needs in new ways. Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics andstatistical analysis. Mainstream BI software and data visualiza- tion tools can also play a role in the analysis process. But the semi-structured and unstructured data may not fit well in traditional data warehouses based on relational databases. Furthermore, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated fre- quently or even continually – for example, real-time data on the performance of mobile applications or of oil and gas pipelines. As a result, many organizations looking to collect, process and analyze big data have turned to a newer class of technologies that includes Hadoop and related tools such as YARN,MapReduce, Spark, Hive and Pig as well as NoSQL databases. Those technologies form the core of an open source software framework that supports the processing of large and diverse data sets across clustered systems. In some cases, Hadoop clusters and NoSQL systems are being used as landing pads and staging areas for data before it gets loaded into a data warehouse for analysis, often in a summarized form that is more conducive to relational structures. Increasingly though, big data vendors are pushing the concept of a Hadoop data lake that serves as the central repository for an organization’s incoming streams of raw data. In such architectures, subsets of the data can then be filtered for analysis in data warehouses and analytical databases, or it can be analyzed directly in Hadoop using batch query tools, stream processing DR.B.A.T.UNIVERSITY 28
  • 36. Big Data Dept.of Information Technology software and SQL on Hadoop technologies that run interactive, ad hoc queries written in SQL. Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of internal analytics skills and the high cost of hiring experienced analytics professionals. The amount of information that’s typically involved, and its variety, can also cause data management headaches, including data quality and consistency issues. In addition, integrating Hadoop systems and data warehouses can be a challenge, although various vendors now offer software connectors between Hadoop and relational databases, as well as other data integration tools with big data capabilities. 4.1 Examples of Big Data Analytics As the technology that helps an organization to break down data silos and analyze data improves, business can be transformed in all sorts of ways. Accord- ing to Datamation, today’s advances in analyzing Big Data allow researchers to decode human DNA in minutes, predict where terrorists plan to attack, deter- mine which gene is mostly likely to be responsible for certain diseases and, of course, which ads you are most likely to respond to on Facebook. The business cases for leveraging Big Data are compelling. For instance, Netflix mined its subscriber data to put the essential ingredients together for its recent hit House of Cards, and subscriber data also prompted the company to bring Arrested Development back from the dead. Another example comes from one of the biggest mobile carriers in the world. France’s Orange launched its Data for Development project by releasing sub- scriber data for customers in the Ivory Coast. The 2.5 billion records, which were made anonymous, included details on calls and text messages exchanged between 5 million users. Researchers accessed the data and sent Orange propos- DR.B.A.T.UNIVERSITY 29
  • 37. Big Data Dept.of Information Technology als for how the data could serve as the foundation for development projects to improve public health and safety. Proposed projects included one that showed how to improve public safety by tracking cell phone data to map where people went after emergencies; another showed how to use cellular data for disease containment. 4.2 Benefits of Big Data Analytics Enterprises are increasingly looking to find actionable insights into their data. Many big data projects originate from the need to answer specific business questions. With the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management. 1. Webopedia parent company, QuinStreet, surveyed 540 enterprise decision- makers involved in big data purchases to learn which business areas com- panies plan to use Big Data analytics to improve operations. 2. About half of all respondents said they were applying big data analytics to improve customer retention, help with product development and gain a competitive advantage. 3. The business area getting the most attention relates to increasing efficien- cies and optimizing operations.62 percent of respondents said that they use big data analytics to improve speed and reduce complexity. DR.B.A.T.UNIVERSITY 30
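As a small, concrete illustration of the aggregation work this chapter describes, the sketch below scans a large transaction log one row at a time and keeps only running totals, so the full data set never needs to fit in memory. The file name, the column names and the revenue-by-region question are assumptions made for the example; at real scale the same query would normally be pushed down to Hive, Spark or an analytical database rather than run in a single Python process.

import csv
from collections import Counter

def revenue_by_region(path):
    # Stream the file row by row; memory use stays constant
    # no matter how large the log grows.
    totals = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return totals

if __name__ == "__main__":
    totals = revenue_by_region("transactions.csv")   # hypothetical input file
    for region, amount in totals.most_common(3):     # three largest regions
        print(region, round(amount, 2))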
  • 38. Chapter 5 Challenges in Big Data 5.1 Security The biggest challenge for big data from a security point of view is the protection of users' privacy. Big data frequently contains huge amounts of personally identifiable information, and therefore the privacy of users is a huge concern. Because of the large amount of data stored, breaches affecting big data can have more devastating consequences than the data breaches we normally see in the press. This is because a big data security breach will potentially affect a much larger number of people, with consequences not only from a reputational point of view, but with enormous legal repercussions. When producing information for big data, organizations have to ensure that they have the right balance between utility of the data and privacy. Before the data is stored it should be adequately anonymised, removing any unique identifier for a user. This in itself can be a security challenge, as removing unique identifiers might not be enough to guarantee that the data will remain anonymous: the anonymized data could be cross-referenced with other available data using de-anonymization techniques. When storing the data, organizations will face the problem of encryption. Data cannot be sent encrypted by the users if the cloud needs to perform operations over the data. A solution for this is to use 31
  • 39. Big Data Dept.of Information Technology Fully Homomorphic Encryption (FHE), which allows data stored in the cloud to perform operations over the encrypted data so that new encrypted data will be created. When the data is decrypted the results will be the same as if the operations were carried out over plain text data. Therefore, the cloud will be able to perform operations over encrypted data without knowledge of the underlying plain text data. While using big data a significant challenge is how to establish ownership of information. If the data is stored in the cloud a trust boundary should be establish between the data owners and the data storage owners. Adequate access control mechanisms will be key in protecting the data. Access control has traditionally been provided by operating systems or applications restricting access to the information, which typically exposes all the information if the system or application is hacked. A better approach is to protect the information using encryption that only allows decryption if the entity trying to access the information is authorised by an access control pol- icy. An additional problem is that software commonly used to store big data, such as Hadoop, doesnt always come with user authentication by default. This makes the problem of access control worse, as a default installation would leave the information open to unauthenticated users. Big data solutions often rely on traditional firewalls or implementations at the application layer to restrict access to the information. The main solution to ensuring that data remains protected is the adequate use of encryption. For example, Attribute-Based Encryption can help in pro- viding fine-grained access control of encrypted data.Anonymizing the data is also important to ensure that privacy concerns are addressed. It should be en- sured that all sensitive information is removed from the set of records collected. Real-time security monitoring is also a key security component for a big data DR.B.A.T.UNIVERSITY 32
  • 40. Big Data Dept.of Information Technology project. It is important that organizations monitor access to ensure that there is no unauthorised access. It is also important that threat intelligence is in place to ensure that more sophisticated attacks are detected and that the organizations can react to threats accordingly. If an adequate governance framework is not applied to big data, then the data collected could be misleading and cause unexpected costs. The main problem from a governance point of view is that big data is a relatively new concept, and therefore few established procedures and policies exist. The challenge with big data is that the unstructured nature of the information makes it difficult to categorize, model and map the data when it is captured and stored. The problem is made worse by the fact that the data normally comes from external sources, often making it complicated to confirm its accuracy. Data hackers have become more damaging in the era of big data due to the availability of large volumes of publicly available data, the ability to store massive amounts of data on portable devices such as USB drives and laptops, and the accessibility of simple tools to acquire and integrate disparate data sources. According to the Open Security Foundation's DataLossDB project (http://datalossdb.org), hacking accounts for 28% of all data breach incidents, with theft accounting for an additional 24%, fraud accounting for 12%, and web-related loss accounting for 9% of all data loss incidents. More than half (57%) of all data loss incidents involve external parties, but 10% involve malicious actions on the part of internal parties, and an additional 20% involve accidental actions by internal parties. Private businesses, hospitals, and biomedical researchers are also making tremendous investments in the collection, storage, and analysis of large-scale data and private information. DR.B.A.T.UNIVERSITY 33
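A minimal sketch of the anonymisation step discussed in Section 5.1: direct identifiers are dropped and the user key is replaced by a salted one-way hash before the record is stored. The field names and the secret salt are illustrative placeholders; as the section itself warns, this pseudonymisation alone does not guarantee anonymity, because the remaining attributes can still be cross-referenced with other data sets.

import hashlib
import hmac

SECRET_SALT = b"replace-with-a-secret-key"           # illustrative placeholder, keep out of the data store
DIRECT_IDENTIFIERS = {"name", "phone", "email"}       # assumed identifier fields

def pseudonymise(record):
    # Drop direct identifiers and replace the user id with a keyed hash,
    # so the same user still links across records without exposing the id.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    digest = hmac.new(SECRET_SALT, record["user_id"].encode(), hashlib.sha256)
    cleaned["user_id"] = digest.hexdigest()[:16]
    return cleaned

if __name__ == "__main__":
    raw = {"user_id": "U1001", "name": "A. Sharma", "phone": "98220xxxxx",
           "city": "Pune", "purchase": 499.00}
    print(pseudonymise(raw))  # identifiers removed, user_id replaced by a hash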
  • 41. Big Data Dept.of Information Technology 5.2 Data Access 5.2.1 Inefficiency Securing and controlling access to data is a very time-consuming process. Currently, for most companies that are even aware that their data access pro- tocols are an issue, actual practices in securing that are inefficient, whether you are manually securing unstructured data, or whether it is data that is be- ing dealt with automatically. To secure it properly, enterprises need to assess where in the environment your data resides, assess how much data loss there is from the use of file servers and NAS devices, and even develop an inventory of whats available in your SharePoint deployments. All very time consuming, youll no doubt agree. 5.2.2 Ineffectiveness After answering the question as to who has access to your data, the next big question is whether they should have a given level of access or not. Does IT know what level of access should be offered to employees, and should this decision be left in the hands of the IT department at all .The chances are that they shouldnt be allowed decide, as this is ultimately a business decision. But, as in the case of most companies, there is no clear policy on who makes the decisions. Chances are in this situation theres also going to be a lot of unstructured and orphaned data lying around with no one to take responsibly for it. 5.3 Data Cleaning Data cleaning remains an important part of the process to ensure data qual- ity. The first is to verify that the quantitative and qualitative (i.e. categorical) DR.B.A.T.UNIVERSITY 34
  • 42. Big Data Dept.of Information Technology variables have been recorded as expected. The second involves removing outliers, which in the Big Data paradigm means the use of decision tree algorithms. But data cleaning itself is a subjective process (e.g. deciding which variables to consider) and not truly agnostic as would be desired, and thus open to philosophical debate (Bollier, 2010). 5.4 Data Representation Related to the question of data provenance is the issue of understanding the underlying population whose behavior has been captured. The large data sizes may make the sampling rate irrelevant, but they do not necessarily make the data representative. Everybody does not use Twitter, Facebook or even Google searches. For example, ITU estimates suggest that Internet usage is still limited to only 40 per cent of the world population. In other words, more than four billion people globally are not yet using the Internet, and 90 per cent of them are from the developing world. Of the world's three billion Internet users, two-thirds are from the developing countries. At the other end of the spectrum, even though mobile cellular penetration is close to 100%, this does not mean that every person in the world is using a mobile phone. This issue of representativeness is of high relevance when considering how telecommunication data may be used for monitoring and development. Whilst the promise in leveraging data from mobile network operators for monitoring and development hinges on its large coverage, nearing the actual population, it is still not the whole population. Questions such as the extent of coverage of the poor, or the levels of gender representation amongst telecom users, are all valid questions. Whilst the registration information might provide answers, the reality is that the demographic information on telecom subscribers for example is not always accurate. With pre-paid subscriptions being the norm in the majority of the developing world, DR.B.A.T.UNIVERSITY 35
  • 43. Big Data Dept.of Information Technology demographic information contained in the mobile operator records is practically useless, even with mandated registration. The issue of sampling bias is best illustrated by the case of Street Bump, a mobile app developed by Boston City Hall. Street Bump uses a phone's accelerometer to detect potholes and notify City Hall while the app's users drive around Boston. The app, however, introduces a selection bias, since it is biased towards the demographics of the app users, who often hail from affluent areas with greater smartphone ownership (Harford, 2014). Hence the Big in Big Data does not automatically mean that issues such as measurement bias and methodology, internal and external data validity, and inter-dependencies among data can be ignored. These are foundational issues not just for small data but also for Big Data (Boyd and Crawford, 2012). 5.5 Behavioral change For that matter, digitized online behavior can be subject to self-censorship and the creation of multiple personas, further muddying the waters. Thus studying the data exhaust of people may not always give us insights into real-world dynamics. This may be less of an issue with TGD, where in essence the data artifact is itself a byproduct of another activity. Telecom network Big Data, which mostly falls under this category, may be less susceptible to self-censorship and persona development. But that does not exclude the possibility either. It is not inconceivable that users may not use their mobiles, or may even turn them off, in areas where they do not wish their digital footprint to be left behind. In a way, Big Data analyses of behavioral data are subject to a form of the Heisenberg Uncertainty principle: as soon as the basic process of an analysis is known, there may be concerted efforts to exhibit different behavior and/or DR.B.A.T.UNIVERSITY 36
  • 44. Big Data Dept.of Information Technology actions to change the outcomes (Bollier, 2010). For example the famous Google page rank algorithm has spawned an entire industry of organizations that claim to enhance page ranks for websites. Search Engine Optimization (SEO) is now an established practice when developing websites.Change in behavior could also partly attribute to the declining verac- ity of Google Flu Trends. Researchers found that influenza-like-illness rates as exhibited by Google searches did not necessarily correlate with actual influenza virus infections (Ortiz et al., 2011). Recent research has shown that after 2009 (when it failed to catch the non-seasonal influenza outbreak of 2009), infre- quent updates, have not improved the results. In fact Google Flu Trends has persistently overestimated flu prevalence since 2009 (Lazer, Kennedy, King, and Vespignani, 2014). Google Flu Trends does not and cannot know what factors contributed to the strong correlations found in their initial work. The point is that the underlying real world actions of the population that turned to Google for its health queries and which contributed to the original correlations discov- ered by GFT, may have in-fact changed over time, diminishing the robustness of the original algorithm. For example the hoopla surrounding GFT could have even created rebound effects, with more and more people turning to Google for their broader health questions and thereby introducing additional search terms (due to different cultural norms and/or ground conditions), which can collec- tively introduce biases that GFT has not been able account for. Such possible problems could have been caught and resolved had the GFT method been more transparent. DR.B.A.T.UNIVERSITY 37
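The Google Flu Trends episode above is, at its core, a correlation that decayed once the underlying search behaviour changed. The toy numbers below are entirely invented, but they show the mechanism: a ratio calibrated in one period systematically overestimates in a later period in which people issue more searches per actual case.

# Synthetic illustration of behavioural drift (all numbers invented).
period1 = [(100, 10), (200, 20), (300, 30)]   # (search volume, actual cases) used for calibration
period2 = [(300, 10), (600, 20), (900, 30)]   # later period: three times as many searches per case

ratio = sum(c for _, c in period1) / sum(s for s, _ in period1)   # cases per search, fitted on period 1

for searches, actual in period2:
    predicted = ratio * searches
    print("predicted %.0f vs actual %d" % (predicted, actual))    # overestimates by a factor of three

Nothing inside the search data warns the model that the search-to-case relationship has shifted; only fresh ground truth of the kind the cited studies used reveals the drift.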
Chapter 6

Applications and Future of Big Data

6.1 Applications

Big data has increased the demand for information management specialists: Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole. Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007, and predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one third of the globally stored information is in the form of alphanumeric text and still-image data, which is the format most useful for most big data applications. This also shows the potential of yet-unused data (i.e. in the form of video and
audio content). While many vendors offer off-the-shelf solutions for Big Data, experts recommend the development of in-house solutions custom-tailored to the company's problem at hand, provided the company has sufficient technical capabilities.

6.1.1 Government

The use and adoption of Big Data within governmental processes is beneficial and allows efficiencies in terms of cost, productivity, and innovation. That said, this process does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome. Below are some leading examples within the governmental Big Data space.

United States of America

In 2012, the Obama administration announced the Big Data Research and Development Initiative to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. The United States Federal Government owns six of the ten most powerful supercomputers in the world. The Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of the information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes.
India

Big data analysis was partly responsible for helping the BJP and its allies win the Indian General Election of 2014. The Indian Government utilises numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.

6.1.2 Cyber-Physical Models

Current PHM (prognostics and health management) implementations mostly utilize data from actual usage, while analytical algorithms can perform more accurately when more information from throughout the machine's lifecycle, such as system configuration, physical knowledge and working principles, is included. There is a need to systematically integrate, manage and analyze machinery or process data during the different stages of the machine life cycle in order to handle data and information more efficiently and to achieve better transparency of machine health condition for the manufacturing industry. With this motivation, a cyber-physical (coupled) model scheme has been developed. The coupled model is a digital twin of the real machine that operates on the cloud platform and simulates the health condition with integrated knowledge from both data-driven analytical algorithms and other available physical knowledge. It can also be described as a 5S systematic approach consisting of Sensing, Storage, Synchronization, Synthesis and Service. The coupled model first constructs a digital image from the early design stage. System information and physical knowledge are logged during product design, and based on these a simulation model is built as a reference for future analysis. Initial parameters may be statistically generalized, and they can be tuned using data from testing or from the manufacturing process using parameter estimation. After that step, the simulation model can be considered a mirrored image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the increased connectivity offered by cloud computing technology, the coupled model also provides better accessibility of machine condition for factory managers in cases where physical access to actual equipment or machine data is limited.
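The parameter-estimation step described above can be sketched in a few lines. The following is a minimal, hypothetical illustration rather than an actual PHM implementation: the class and parameter names, the one-parameter linear wear model, and the test measurements are all invented for the example.

```python
# Hypothetical sketch of the coupled-model idea: a simulation model seeded from
# design-stage parameters, tuned against test measurements, then used to track
# machine condition. Names, wear model, and data are illustrative assumptions.
import numpy as np


class CoupledModel:
    def __init__(self, nominal_wear_rate: float):
        # Initial parameter statistically generalized from the design stage.
        self.wear_rate = nominal_wear_rate

    def simulate_degradation(self, hours: np.ndarray) -> np.ndarray:
        # Reference behavior: degradation grows linearly with operating hours.
        return self.wear_rate * hours

    def tune(self, hours: np.ndarray, measured_degradation: np.ndarray) -> None:
        # Parameter estimation from test data: least-squares fit of the wear rate.
        self.wear_rate = float(np.dot(hours, measured_degradation) / np.dot(hours, hours))

    def health_index(self, hours_in_service: float, failure_threshold: float) -> float:
        # Remaining health fraction under the tuned model (1.0 = new, 0.0 = at threshold).
        predicted = self.wear_rate * hours_in_service
        return max(0.0, 1.0 - predicted / failure_threshold)


# Design-stage estimate, then tuning with (synthetic) bench-test measurements.
twin = CoupledModel(nominal_wear_rate=0.010)
test_hours = np.array([100.0, 200.0, 300.0, 400.0])
measured = np.array([1.4, 2.9, 4.1, 5.6])
twin.tune(test_hours, measured)

print("tuned wear rate:", round(twin.wear_rate, 4))
print("health after 2000 h:", round(twin.health_index(2000.0, failure_threshold=40.0), 2))
```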
6.1.3 Healthcare

Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, reduction of waste and care variability, automated external and internal reporting of patient data, standardized medical terms and patient registries, and fragmented point solutions.

6.1.4 Technology

1. eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster, for search, consumer recommendations, and merchandising.

2. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 Amazon had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.

3. Facebook handles 50 billion photos from its user base.

4. As of August 2012, Google was handling roughly 100 billion searches per month.

5. Oracle NoSQL Database has been tested to pass the 1M ops/sec mark with 8 shards and proceeded to hit 1.2M ops/sec with 10 shards.
6.2 The Future of Big Data

Those who feel that today's big data is just a continuation of past information trends are as wrong as if they were to claim that a stone tablet is essentially the same as a tablet computer, or an abacus similar to a supercomputer. Today, we have more information than ever. But the importance of all that information extends beyond simply being able to do more, or know more, than we already do. The quantitative shift leads to a qualitative shift. Having more data allows us to do new things that weren't possible before. In other words: more is not just more. More is new. More is better. More is different. Of course, there are still limits on what we can obtain from or do with data, but most of our assumptions about the cost of collecting and the difficulty of processing data need to be overhauled. No area of human endeavor or industrial sector will be immune from the incredible shakeup that is about to occur as big data plows through society, politics, and business. People shape their tools, and their tools shape them. This new world of data, and how companies can harness it, bumps up against two areas of public policy and regulation. The first is employment. Big data will bring about great things in society. We like to think that technology leads to job creation, even if it comes after a temporary period of disruption. That was certainly true during the Industrial Revolution. To be sure, it was a devastating time of dislocation, but it eventually led to better livelihoods. Yet this optimistic outlook ignores the fact that some industries simply never recover from change. When tractors and automobiles replaced horse-drawn plows and carriages, the need for horses in the economy basically ended. The upheavals of the Industrial Revolution created political change and gave rise to new economic philosophies and political movements. It is not much
of an intellectual stretch to predict that new political philosophies and social movements will arise around big data, robots, computers, and the Internet, and the effect of these technologies on the economy and representative democracy. Recent debates over income inequality and the Occupy movement seem to point in that direction. Big data will change business, and business will change society. The hope, of course, is that the benefits will outweigh the drawbacks, but that is mostly a hope. The big-data world is still very new, and, as a society, we are not yet very good at handling all the data that we can now collect. We also cannot foresee the future. Technology will continue to surprise us, just as it would an ancient man with an abacus looking upon an iPhone. What is certain is that more will not be more: it will be different. Clearly Big Data is in its beginnings, and there is much more to be discovered. For most companies it is currently just a fashionable keyword: it has great potential, but not many truly know what it is all about. A clear sign that there is more to big data than is currently shown on the market is that the big software companies either do not have, or do not present, their own Big Data solutions, and those that do, like Google, do not use them in a commercial way. Companies need to decide what kind of strategy to use when implementing Big Data. They could take a more revolutionary approach and move all their data to the new Big Data environment, so that all reporting, modeling and interrogation is executed using the new business intelligence based on Big Data [1]. This approach is already used by many analytics-driven organizations that put all their data in a Hadoop environment and build business intelligence solutions on top of it. Another approach is the evolutionary one: Big Data becomes an input to the current BI platform. The data is accumulated and analyzed using structured and unstructured tools, and the results are sent to the data
warehouse. Standard modeling and reporting tools then have access to social media sentiment, usage records, and other processed Big Data items [1]. One issue with the evolutionary approach is that, even though it gains most of the capabilities of the Big Data environment, it also inherits most of the problems of the classic business intelligence solution, and in some cases it can create a bottleneck between the information coming from Big Data and the analytical capacity of the traditional BI or data warehouse solution. A minimal sketch of this evolutionary pattern is given below.
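As a rough, hypothetical illustration of the evolutionary approach only (the event fields, product names, and the use of SQLite as a stand-in for a real data warehouse are all assumptions made for the example), raw semi-structured events are reduced to small aggregates that conventional BI and reporting tools can then query:

```python
# Hypothetical sketch: semi-structured "Big Data" events are aggregated and the
# compact results are loaded into a conventional warehouse table (SQLite stands
# in for the data warehouse). All names and data are illustrative.
import json
import sqlite3
from collections import defaultdict

# Raw events as they might land in a Hadoop or object-store staging area.
raw_events = [
    json.dumps({"product": "phone-x", "text": "love it", "sentiment": 0.9}),
    json.dumps({"product": "phone-x", "text": "battery is poor", "sentiment": -0.4}),
    json.dumps({"product": "tablet-y", "text": "works fine", "sentiment": 0.5}),
]

# "Big Data" processing step: aggregate sentiment per product.
totals = defaultdict(lambda: [0.0, 0])
for line in raw_events:
    event = json.loads(line)
    totals[event["product"]][0] += event["sentiment"]
    totals[event["product"]][1] += 1

# Load the small, structured result into the warehouse for standard BI tools.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sentiment (product TEXT, avg_sentiment REAL, mentions INTEGER)")
conn.executemany(
    "INSERT INTO product_sentiment VALUES (?, ?, ?)",
    [(product, total / count, count) for product, (total, count) in totals.items()],
)

for row in conn.execute("SELECT * FROM product_sentiment ORDER BY product"):
    print(row)
```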
Chapter 7

Conclusion

The availability of Big Data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that, for the first time in history, we have the capabilities required to analyze astonishing data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and profitability. The Age of Big Data is here, and these will be truly revolutionary times if business and technology professionals continue to work together and deliver on the promise.
Bibliography

[1] "Introduction", www.wikipedia.com
[2] "Characteristics of Big Data", www.studymafia.org
[3] "Big Data Analytics", www.computerweekly.com
[4] "Storage of Big Data", www.computerweekly.com