ACKNOWLEDGEMENT
I take this opportunity to express my gratitude to all those people who supported me, directly and indirectly, during the completion of this Seminar.
I pay special thanks to my seminar guide, Er. Meenakshi Mittal, who was very supportive and compassionate throughout the preparation of this seminar report. A lot of gratitude also goes to my friends, who contributed effectively towards the preparation and refinement of my Seminar undertaken at CUP, Bathinda. They all inspired and encouraged me to carry out this seminar report to the best of my ability. Particular thanks to all authors whose work has been referenced.
Finally, I would like to acknowledge my profound gratitude to the Almighty. It would not have been possible to accomplish this without those blessings.

Harsh Kishore Mishra
CUPBMTech-CSSETCST2013-1401
Centre for Computer Science and Technology
School of Engineering and Technology
Central University of Punjab, Bathinda

LIST OF TABLES
Table 1: Tools available for Big Data
Table 2: Various Big Data Projects

LIST OF FIGURES
Figure 1 : Example of Big Data Architecture
Figure 2 : Explosion in size of Data
Figure 3 : Hadoop High Level Architecture
Figure 4 : HDFS Architecture

1. INTRODUCTION
Big data refers to amounts of data so large that new technologies and architectures are required to capture and analyze them and so extract value from them. New sources of big data include location-specific data arising from traffic management and from the tracking of personal devices such as smartphones. Big Data has emerged because we live in a society that makes increasing use of data-intensive technologies. At such large sizes it becomes very difficult to perform effective analysis using existing traditional techniques. Since Big data is a recent technology that can bring huge benefits to business organizations, the various challenges and issues involved in adopting it need to be understood. The Big Data concept denotes datasets that continue to grow so much that they become difficult to manage using existing database management concepts and tools. The difficulties can relate to data capture, storage, search, sharing, analytics, visualization and more.

Figure 1 : Example of Big Data Architecture (Aveksa Inc., 2013)

Big data, owing to its various properties such as volume, velocity, variety, variability, value and complexity, poses many challenges. The challenges faced in large data management include scalability, unstructured data, accessibility, real-time analytics, fault tolerance and many more. In addition to variations in the amount of data stored in different sectors, the types of data generated and stored (i.e., encoded video, images, audio, or text/numeric information) also differ markedly from industry to industry.

2. BIG DATA CHARACTERISTICS

Data Volume: The "Big" in Big data itself refers to volume. At present the data in existence is measured in petabytes (10^15 bytes) and is expected to grow to zettabytes (10^21 bytes) in the near future. Data volume measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it.

Data Velocity: Velocity in Big data deals with the speed of the data coming from various sources. This characteristic is not limited to the speed of incoming data but also covers the speed at which the data flows and is aggregated.

Data Variety: Data variety is a measure of the richness of the data representation: text, images, video, audio, etc. The data being produced is not of a single category, as it includes not only traditional data but also semi-structured data from various sources such as web pages, web log files, social media sites, e-mail and documents.

Data Value: Data value measures the usefulness of data in making decisions. Data science is exploratory and useful for getting to know the data, but "analytic science" encompasses the predictive power of big data. Users can run queries against the stored data, deduce important results from the filtered data obtained, and rank those results along the dimensions they require. Such reports help businesses find trends and adjust their strategies accordingly.
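
As a small illustration of extracting value by querying and ranking stored data, the hypothetical sketch below filters a toy sales table and ranks the survivors along a derived dimension; the column names, figures and threshold are invented for the example.

```python
import pandas as pd

# Hypothetical transaction summary; in practice this would be the filtered
# output of a query over a much larger data store.
sales = pd.DataFrame({
    "region":  ["North", "South", "East", "West"],
    "revenue": [120_000, 95_000, 143_000, 88_000],
    "orders":  [1_400, 1_100, 1_650, 900],
})

# Query the stored data, then rank the results along a dimension of interest.
top = (sales[sales["revenue"] > 90_000]
       .assign(revenue_per_order=lambda d: d["revenue"] / d["orders"])
       .sort_values("revenue_per_order", ascending=False))
print(top)
```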

Complexity: Complexity measures the degree of interconnectedness (possibly very large) and interdependence in big data structures, such that a small change (or combination of small changes) in one or a few elements can yield very large changes, or a small change can ripple across or cascade through the system and substantially affect its behavior, or produce no change at all (Katal, Wazid, & Goudar, 2013).

3. ISSUES IN BIG DATA
The issues in Big Data are conceptual points that an organization should understand in order to implement the technology effectively. Big Data issues should not be confused with problems, but they are important to know and crucial to handle.

Figure 2 : Explosion in size of Data (Hewlett-Packard Development Company, 2012)

3.1 Issues related to the Characteristics

Data Volume: As data volume increases, the value of individual data records decreases in proportion to age, type, richness and quantity, among other factors. Existing social networking sites produce data on the order of terabytes every day, and this amount of data is very difficult to handle using existing traditional systems.
Data Velocity: Traditional systems are not capable of performing analytics on data that is constantly in motion. E-commerce has rapidly increased the speed and richness of the data used for different business transactions (for example, web-site clicks). Data velocity management is much more than a bandwidth issue.
Data Variety: All this data is heterogeneous, consisting of raw, structured, semi-structured and even unstructured data, which is difficult to handle with existing traditional analytic systems. From an analytic perspective, variety is probably the biggest obstacle to effectively using large volumes of data. Incompatible data formats, non-aligned data structures and inconsistent data semantics represent significant challenges that can lead to analytic sprawl.
Data Value: The data stored by different organizations is used by them for data analytics. This produces a gap between business leaders and IT professionals: the main concern of business leaders is adding value to their business and earning more profit, whereas IT leaders have to concern themselves with the technicalities of storage and processing.
Data Complexity: One current difficulty of big data is that working with it using relational databases and desktop statistics/visualization packages is impractical, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers. It is quite an undertaking to link, match, cleanse and transform data coming from various source systems, as the sketch below illustrates. It is also necessary to connect and correlate relationships, hierarchies and multiple data linkages, or the data can quickly spiral out of control (Katal, Wazid, & Goudar, 2013).
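
A toy sketch of the link-match-cleanse-transform step, assuming two hypothetical sources that describe the same customers with differently formatted keys; all names and fields are invented for the example.

```python
import pandas as pd

# Two hypothetical sources describing the same customers in different formats.
crm = pd.DataFrame({"name": ["Alice Smith", "BOB JONES"],
                    "joined": ["2012-03-01", "2013-07-15"]})
web = pd.DataFrame({"customer": ["alice smith", "bob jones"],
                    "clicks": [1520, 877]})

# Cleanse: normalize the matching key on both sides before linking.
crm["key"] = crm["name"].str.strip().str.lower()
web["key"] = web["customer"].str.strip().str.lower()

# Match and transform: join the sources on the normalized key.
linked = crm.merge(web, on="key")[["name", "joined", "clicks"]]
print(linked)
```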

3.2 Storage and Transport Issues

The quantity of data has exploded each time we have invented a new storage medium. What is different about the most recent data explosion, driven mainly by social media, is that no new storage medium has appeared. Moreover, data is now created by everyone and everything (from mobile devices to supercomputers), not just, as heretofore, by professionals such as scientists, journalists and writers.
Current disk technology limits are about 4 terabytes (10^12 bytes) per disk, so 1 exabyte (10^18 bytes) would require on the order of 250,000 disks. Even if an exabyte of data could be processed on a single computer system, that system would be unable to directly attach the requisite number of disks, and access to that data would overwhelm current communication networks. Assuming that a 1 gigabit per second network has an effective sustainable transfer rate of 80%, the sustainable bandwidth is about 100 megabytes per second; transferring an exabyte at that rate would take roughly 10^10 seconds, i.e. on the order of 300 years of sustained transfer. It would take far longer to transmit the data from a collection or storage point to a processing point than to actually process it. To handle this issue, the data should be processed "in place" and only the resulting information transmitted. In other words, "bring the code to the data", unlike the traditional method of "bring the data to the code" (Kaisler, Armour, Espinosa, & Money, 2013).
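
A back-of-envelope sketch of the figures above, assuming the stated 4 TB disk capacity and an 80%-effective 1 gigabit per second link; the constants are the ones used in the text and can be adjusted.

```python
# Back-of-envelope storage and transport estimates for 1 exabyte of data.
EXABYTE = 10**18            # bytes
DISK = 4 * 10**12           # bytes per disk (~4 TB)
LINK = 10**9 / 8            # 1 Gbit/s expressed in bytes per second
EFFICIENCY = 0.80           # effective sustainable transfer rate

disks_needed = EXABYTE / DISK
transfer_seconds = EXABYTE / (LINK * EFFICIENCY)

print(f"Disks needed:  {disks_needed:,.0f}")                      # ~250,000
print(f"Transfer time: {transfer_seconds / 3600:,.0f} hours "
      f"(~{transfer_seconds / (3600 * 24 * 365):,.0f} years)")    # ~10^10 s
```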

3.3 Data Management Issues
Data management will, perhaps, be the most difficult problem to address with big data. Resolving issues of access, utilization, updating, governance and reference (in publications) has proven to be a major stumbling block. The sources of the data are varied: by size, by format and by method of collection. Individuals contribute digital data in media comfortable to them, such as documents, drawings, pictures, sound and video recordings, models, software behaviors and user interface designs, with or without adequate metadata describing what, when, where, who, why and how it was collected, along with its provenance. Unlike manual data collection, where rigorous protocols are often followed to ensure accuracy and validity, digital data collection is much more relaxed. Given the volume, it is impractical to validate every data item, so new approaches to data qualification and validation are needed. The richness of digital data representation prohibits a single bespoke methodology for data collection. To summarize, there is no perfect big data management solution yet. This represents an important gap in the research literature on big data that needs to be filled.

3.4 Processing Issues
Assume that an exabyte of data needs to be processed in its entirety. For simplicity, assume the data is chunked into blocks of 8 words, so 1 exabyte = 1K petabytes. Assuming a processor expends 100 instructions on one block at 5 gigahertz, the time required to process one block end-to-end would be 20 nanoseconds, and processing 1K petabytes would require a total end-to-end processing time of roughly 635 years.
Thus, effective processing of exabytes of data will require extensive parallel processing and new analytics algorithms in order to provide timely and actionable information (Kaisler, Armour, Espinosa, & Money, 2013).
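
A sketch of the arithmetic behind this estimate and of why parallelism is unavoidable. It assumes one 20 ns block operation per byte, which reproduces the roughly 635-year figure quoted above; the 100,000-node cluster is an arbitrary illustration.

```python
# Serial vs. parallel processing estimate for 1 exabyte of data.
EXABYTE = 10**18                 # bytes
NS_PER_BLOCK = 20e-9             # 100 instructions at 5 GHz = 20 ns
BLOCKS = EXABYTE                 # one 20 ns operation per byte reproduces
                                 # the ~635-year figure quoted above

serial_seconds = BLOCKS * NS_PER_BLOCK
print(f"Serial:   {serial_seconds / (3600 * 24 * 365):,.0f} years")   # ~634

# The same workload spread across a (hypothetical) 100,000-node cluster.
NODES = 100_000
print(f"Parallel: {serial_seconds / NODES / (3600 * 24):,.1f} days")  # ~2.3
```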

4. CHALLENGES IN BIG DATA
The challenges in Big Data are the real implementation hurdles, and they require immediate attention. Any implementation that does not handle these challenges may lead to failure of the technology deployment and to unpleasant results.

4.1 Privacy and Security
Privacy and security is the most sensitive and important challenge with Big data; it has conceptual, technical as well as legal significance.
• The personal information of a person (e.g. in the database of a merchant or a social networking website), when combined with external large data sets, leads to the inference of new facts about that person. It is quite possible that such facts are secret, and the person might not want the data owner, or any other person, to know them.
• Information about people is collected and used in order to add value to the business of an organization. This is done by creating insights into their lives that they themselves are unaware of.
• Another important consequence is social stratification: a data-literate person can take advantage of Big data predictive analysis, while the underprivileged can be easily identified and treated worse.
• Big Data used by law enforcement increases the chances that certain tagged people will suffer adverse consequences, without the ability to fight back or even the knowledge that they are being discriminated against.

4.2 Data Access and Sharing of Information

If the data in a company's information systems is to be used to make accurate and timely decisions, it must be available in an accurate, complete and timely manner. This makes the data management and governance process somewhat complex, adding the need to make data open and available to government agencies in a standardized manner with standardized APIs, metadata and formats, thus leading to better decision making, business intelligence and productivity improvements.
Expecting companies to share data with one another is awkward because of the need to get an edge in business: sharing data about their clients and operations threatens the culture of secrecy and competitiveness.

4.3 Analytical Challenges

The main challenging questions are:
• What if data volume gets so large and varied that it is not known how to deal with it?
• Does all data need to be stored?
• Does all data need to be analyzed?
• How can the data points that are really important be identified?
• How can the data be used to best advantage?

Big data brings with it some huge analytical challenges. Analyzing this huge amount of data, which can be unstructured, semi-structured or structured, requires a large number of advanced skills. Moreover, the type of analysis needed depends highly on the results to be obtained, i.e. on the decisions to be made. Analysis can follow one of two approaches: either incorporate massive data volumes in the analysis, or determine upfront which Big data is relevant.

4.4 Human Resources and Manpower
Since Big data is in its youth as an emerging technology, it needs to attract organizations and young people with diverse new skill sets. These skills should not be limited to technical ones but should also extend to research, analytical, interpretive and creative ones. Such skills need to be developed in individuals, which requires organizations to hold training programs. Moreover, universities need to introduce curricula on Big data to produce employees skilled in this area.

4.5 Technical Challenges
4.5.1 Fault Tolerance: With the arrival of new technologies like cloud computing and Big data, it is always intended that whenever a failure occurs, the damage done stays within an acceptable threshold rather than the whole task having to begin again from scratch. Fault-tolerant computing is extremely hard, involving intricate algorithms; it is simply not possible to devise absolutely foolproof, 100% reliable fault-tolerant machines or software. The main task is therefore to reduce the probability of failure to an "acceptable" level. Unfortunately, the more we strive to reduce this probability, the higher the cost.
Two methods that help increase fault tolerance in Big data are:
• First, divide the whole computation into tasks and assign these tasks to different nodes for computation.
• Second, assign one node the work of observing whether these nodes are working properly; if something goes wrong, that particular task is restarted.
Sometimes, however, the whole computation cannot be divided into such independent tasks, as some tasks may be recursive in nature: the output of the previous computation is the input to the next one. Restarting the whole computation then becomes a cumbersome process. This can be avoided by applying checkpoints, which record the state of the system at certain intervals of time, as sketched below; in case of any failure, the computation can restart from the last checkpoint maintained.
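
A minimal illustration of the checkpoint idea, assuming a simple iterative computation where each step depends on the previous result; the checkpoint file name and interval are arbitrary choices for the sketch.

```python
import json
import os

CHECKPOINT = "state.json"   # arbitrary checkpoint file name
INTERVAL = 100              # checkpoint every 100 iterations

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"i": 0, "acc": 0}

def save_checkpoint(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

state = load_checkpoint()
while state["i"] < 10_000:
    # Each step consumes the previous result, so tasks are not independent.
    state["acc"] += state["i"]
    state["i"] += 1
    if state["i"] % INTERVAL == 0:
        save_checkpoint(state)   # after a failure, work restarts from here

print(state["acc"])
```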
4.5.2 Scalability: The scalability issue of Big data has led towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters. This requires a high level of resource sharing, which is expensive and brings with it various challenges, such as how to run and execute the various jobs so that the goal of each workload is met cost-effectively. It also requires dealing with system failures in an efficient manner, as these occur more frequently when operating on large clusters. These factors combined raise the question of how to express programs, even complex machine learning tasks. There has also been a huge shift in the technologies being used: hard disk drives (HDDs) are being replaced by solid state drives and phase-change memory, which do not have the same performance characteristics for sequential and random data transfer. Thus, what kind of storage devices to use is again a big question for data storage.
4.5.3 Quality of Data: Collecting a huge amount of data and storing it comes at a cost. More data, if used for decision making or for predictive analysis in business, will certainly lead to better results; business leaders will always want more and more data stored, whereas IT leaders will weigh all the technical aspects before storing it all. Big data basically focuses on storing quality data, rather than very large amounts of irrelevant data, so that better results and conclusions can be drawn. This leads to further questions: how can it be ensured which data is relevant, how much data is enough for decision making, and is the stored data accurate enough to draw conclusions from? (Katal, Wazid, & Goudar, 2013).
4.5.4 Heterogeneous Data: Unstructured data covers almost every kind of data being produced, from social media interactions to recorded meetings, PDF documents, fax transfers, emails and more. Working with unstructured data is cumbersome and, of course, costly too; converting all this unstructured data into structured data is not feasible either. Structured data is organized in a highly mechanized and manageable way and integrates well with databases, whereas unstructured data is completely raw and unorganized.
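
As a toy illustration of turning unstructured text into structured records, the sketch below parses free-form log lines with a regular expression; the log format and field names are invented for the example.

```python
import re

# Hypothetical unstructured log lines mixing free text with useful fields.
logs = [
    "2013-11-14 09:15:02 user=alice action=login ok",
    "2013-11-14 09:17:45 user=bob action=upload size=5MB",
]

PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:]+) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)")

# Extract a structured record from each raw line; skip lines that don't match.
records = [m.groupdict() for line in logs if (m := PATTERN.search(line))]
print(records)
```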

5. BIG DATA TECHNOLOGIES AND PROJECTS
Big data requires exceptional technologies to efficiently process large quantities of data
within tolerable elapsed times. Technologies being applied to big data include massively
parallel processing (MPP) databases, data mining grids, distributed file systems, distributed
databases, cloud computing platforms, the Internet, and scalable storage systems. Real or
near-real time information delivery is one of the defining characteristics of Big Data
Analytics. A wide variety of techniques and technologies have been developed and adapted
to aggregate, manipulate, analyze, and visualize big data. These techniques and
technologies draw from several fields including statistics, computer science, applied
mathematics, and economics. This means that an organization that intends to derive value
from big data has to adopt a flexible, multidisciplinary approach.
When an enterprise can leverage all the information available in its large data rather than just a subset of it, it gains a powerful advantage over market competitors. Big Data can help organizations gain insights and make better decisions, and it presents an opportunity to create unprecedented business advantage and better service delivery. The average end user accesses myriad websites and employs a growing number of operating systems and apps daily, using a variety of mobile and desktop devices. This translates into an overwhelming and ever-increasing volume, velocity, and variety of data generated, shared, and propagated. Successful protection relies on the right combination of methodologies, human insight, an expert understanding of the threat landscape, and the efficient processing of big data to create actionable intelligence.

5.1 Big Data Technologies
An International Data Corporation (IDC) study predicts that overall data will grow 50-fold by 2020, driven in large part by more embedded systems such as sensors in clothing, in medical devices and in structures like buildings and bridges. The study also determined that unstructured information, such as files, email and video, will account for 90% of all data created over the next decade, while the number of IT professionals available to manage all that data will grow only 1.5-fold from today's levels (World's data will grow by 50X in next decade, IDC study predicts, 2011). The digital universe is 1.8 trillion gigabytes in size, stored in 500 quadrillion files, and its size more than doubles every two years. Comparing the digital universe with our physical universe, there are nearly as many bits of information in the digital universe as there are stars in our physical universe (The 2011 Digital Universe Study: Extracting Value from Chaos, 2011).
Given a very large heterogeneous data set, a major challenge is to figure out what data one has and how to analyze it. Hybrid data sets combined from many other data sets hold more surprises than immediate answers. Analyzing them will require adapting and integrating multiple analytic techniques to "see around corners", i.e., to realize that new knowledge is likely to emerge in a non-linear way. It is not clear that statistical analysis methods, as Ayres argues, are or can be the whole answer.
A Big Data platform should give a solution designed specifically with the needs of the enterprise in mind. The following are the basic features of a Big Data platform offering (Singh & Singh, 2012):
• Comprehensive: it should offer a broad platform that addresses all three dimensions of the Big Data challenge: Volume, Variety and Velocity.
• Enterprise-ready: it should include performance, security, usability and reliability features.
• Integrated: it should simplify and accelerate the introduction of Big Data technology to the enterprise, and enable integration with the information supply chain, including databases, data warehouses and business intelligence applications.
• Open source based: it should be open source technology with enterprise-class functionality and integration.
• Low-latency reads and updates
• Robust and fault-tolerant
• Scalable
• Extensible
• Allows ad hoc queries
• Minimal maintenance
The most widely used such technique is Hadoop, which is open source.

Hadoop
Hadoop is an open source project hosted by the Apache Software Foundation. It consists of many small subprojects that belong to the category of infrastructure for distributed computing. Hadoop mainly consists of:
• a file system (the Hadoop Distributed File System, HDFS), and
• a programming paradigm (MapReduce).

The other subprojects provide complementary services or build on the core to add higher-level abstractions.

Figure 3 : Hadoop High Level Architecture (Patel, Birla, & Nair, 2013)

There exist many problems in dealing with the storage of large amounts of data. Though the storage capacities of drives have increased massively, the rate of reading data from them has not shown comparable improvement; reading takes a large amount of time, and writing is slower still. This time can be reduced by reading from multiple disks at once. Using only one hundredth of each disk may seem wasteful, but if there are one hundred datasets, each of one terabyte, providing shared access to them in this way is a workable solution. Using many pieces of hardware, however, increases the chance of failure. This can be avoided by replication, i.e. keeping redundant copies of the same data on different devices so that in case of failure a copy of the data is still available.

The main problem is that of combining the data being read from different devices. Many methods are available in distributed computing to handle this problem, but it remains quite challenging. All the problems discussed are handled by Hadoop: the problem of failure is handled by the Hadoop Distributed File System, and the problem of combining data is handled by the MapReduce programming paradigm, which reduces the problem of disk reads and writes by providing a programming model that expresses computation over keys and values. Hadoop thus provides a reliable shared storage and analysis system: storage is provided by HDFS and analysis by MapReduce (Katal, Wazid, & Goudar, 2013). A minimal word-count example in the MapReduce style is sketched below.
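
A minimal word-count sketch in the MapReduce style: the mapper emits (word, 1) pairs and the reducer sums the counts per key. It runs locally to illustrate the model; under Hadoop the same map and reduce logic would be distributed across the cluster (for instance via Hadoop Streaming), which is assumed rather than shown here.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (key, value) pair for every word."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

results = dict(reducer(w, c) for w, c in groups.items())
print(results)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```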

Figure 4 : HDFS Architecture (Patel, Birla, & Nair, 2013)

The following are some key Big Data tools available in the market for enterprise use:

Table 1: Tools available for Big Data

IBM InfoSphere BigInsights: open source Apache Hadoop combined with IBM BigSheets; it analyzes data in its native format without imposing any schema or structure and enables fast ad hoc analysis.
Kognitio WX2 Analytical Platform: a fast and scalable in-memory analytic database platform.
ParAccel Analytic Platform: a columnar, massively parallel processing (MPP) analytic database platform, with strong features for query optimization and compilation, compression, and network interconnect.
SAND Analytic Platform: a columnar analytic database platform that achieves linear data scalability through massively parallel processing (MPP).

5.2 Big Data Projects
The table below describes some of the projects that are using Big Data effectively.

Table 2: Various Big Data Projects

Big Sciences:
1. The Large Hadron Collider (LHC) is the world's largest and highest-energy particle accelerator, built to allow physicists to test the predictions of different theories of particle physics and high-energy physics. The data flow in its experiments amounts to 25 petabytes (as of 2012) before replication and reaches up to 200 petabytes after replication.
2. The Sloan Digital Sky Survey is a multi-filter imaging and spectroscopic redshift survey using a 2.5-m wide-angle optical telescope at Apache Point Observatory in New Mexico, United States. It continues at a rate of about 200 GB per night and holds more than 140 terabytes of information.

Private Sector:
1. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
2. Walmart is estimated to store more than 2.5 petabytes of data in order to handle more than 1 million customer transactions every hour.

Governments:
1. The Obama administration's Big Data initiative is a large project in which the government is trying to find uses of big data that ease its tasks and reduce the problems it faces. It includes 84 different Big data programs spread across 6 different departments.
2. The Comprehensive National Cybersecurity Initiative led to a data center, the Utah Data Center (a United States NSA and Director of National Intelligence initiative), intended to store data on the scale of yottabytes. Its main task is to support cyber security.

International Development:
Information and Communication Technologies for Development (ICT4D) uses Information and Communication Technologies (ICTs) for socio-economic development, human rights and international development. Big data can make important contributions to international development.
6. BIG DATA ADVANTAGES AND GOOD PRACTICES
Big Data has numerous advantages for society, science and technology; what matters is how it is put to use for human beings. Some of the advantages (Marr, 2013) are described below:

• Understanding and Targeting Customers
This is one of the biggest and most publicized areas of big data use today. Here, big data is
used to better understand customers and their behaviors and preferences. Companies are
keen to expand their traditional data sets with social media data, browser logs as well as text
analytics and sensor data to get a more complete picture of their customers. The big
objective, in many cases, is to create predictive models.

• Understanding and Optimizing Business Process
Big data is also increasingly used to optimize business processes. Retailers are able to
optimize their stock based on predictions generated from social media data, web search
trends and weather forecasts. One particular business process that is seeing a lot of big data
analytics is supply chain or delivery route optimization. HR business processes are also
being improved using big data analytics. This includes the optimization of talent acquisition
– Moneyball style, as well as the measurement of company culture and staff engagement
using big data tools.

• Improving Science and Research
Science and research is currently being transformed by the new possibilities big data brings.
Take, for example, CERN, the European particle physics laboratory near Geneva, with its Large Hadron Collider, the world's largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe (how it started and how it works) generate huge amounts of data. The CERN data
centre has 65,000 processors to analyse its 30 petabytes of data. However, it uses the
computing powers of thousands of computers distributed across 150 data centres worldwide
to analyse the data. Such computing powers can be leveraged to transform so many other
areas of science and research.

• Improving Healthcare and Public Health
The computing power of big data analytics enables us to decode entire DNA strings in
minutes and will allow us to find new cures and better understand and predict disease
patterns. Just think of what happens when all the individual data from smart watches and
wearable devices can be aggregated and applied to millions of people and their various diseases.
The clinical trials of the future won’t be limited by small sample sizes but could potentially
include everyone.

• Optimizing Machine and Device Performance
Big data analytics help machines and devices become smarter and more autonomous. For
example, big data tools are used to operate Google's self-driving car: a Toyota Prius fitted with cameras, GPS, powerful computers and sensors that let it drive safely on the road without human intervention. Big data tools are also used to optimize
energy grids using data from smart meters. We can even use big data tools to optimize the
performance of computers and data warehouses.

• Financial Trading
High Frequency Trading (HFT) is an area where big data finds a lot of use today. Here, big
data algorithms are used to make trading decisions. Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.

• Improving Security and Law Enforcement
Big data is applied heavily in improving security and enabling law enforcement. Revelations show that the National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots (and maybe spy on us). Others use big data techniques to detect and prevent cyber-attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data to detect fraudulent transactions.
Other advantages and uses include improving sports performance, improving and optimizing cities and countries, personal quantification and performance optimization.

• Good Practices for Big Data
• Creating dimensions of all the data being stored is a good practice for Big data analytics; the data needs to be divided into dimensions and facts.
• All dimensions should have durable surrogate keys, meaning keys that cannot be changed by any business rule and that are assigned in sequence or generated by a hashing algorithm that ensures uniqueness (a sketch follows this list).
• Expect to integrate structured and unstructured data, since every kind of data is part of Big data and needs to be analyzed together.
• Generality of the technology is needed to deal with different formats of data; building the technology around key-value pairs works well.
• When analyzed data sets include identifying information about individuals or organizations, privacy is an issue whose importance, particularly to consumers, grows as the value of Big data becomes more apparent.
• Data quality needs to be better. Tasks like filtering, cleansing, pruning, conforming, matching, joining, and diagnosing should be applied at the earliest possible touch points.
• There should be certain limits on the scalability of the data stored.
• Business leaders and IT leaders should work together to yield more business value from the data. Collecting, storing and analyzing data comes at a cost; business leaders will push for it, but IT leaders have to weigh technological limitations, staff restrictions and so on. Decisions should be revised to ensure that the organization is considering the right data to produce insights at any given point in time.
• Investment in data quality and metadata is also important, as it reduces processing time.
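
A minimal sketch of the durable surrogate key practice from the list above, deriving a stable key by hashing the natural key; the source systems and key format are invented for illustration.

```python
import hashlib

def surrogate_key(*natural_key_parts):
    """Derive a durable surrogate key by hashing the natural key.

    The key stays stable as long as the natural key does, independently
    of any business rule applied to the record's other attributes.
    """
    raw = "|".join(str(p) for p in natural_key_parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Hypothetical dimension rows keyed by source system and source-local id.
print(surrogate_key("crm", 1042))      # stable 16-hex-character key
print(surrogate_key("web", "alice"))
```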

7. CONCLUSION
This seminar report has covered some of the important issues that an organization needs to analyze while estimating the significance of implementing Big Data technology, along with some direct challenges to the infrastructure of the technology.
The commercial impacts of Big data have the potential to generate significant productivity growth for a number of vertical sectors. They should also create healthy demand for talented individuals capable of helping organizations make sense of this growing volume of raw data. In short, Big Data presents an opportunity to create unprecedented business advantages and better service delivery.
A regulatory framework for big data is essential. That framework must be constructed with a clear understanding of the ravages that have been wrought on personal interests by the reduction of information to data, its centralization, and its expropriation. The biggest gap, however, is the lack of skilled managers able to make decisions based on analysis, a shortfall estimated at a factor of ten. Growing talent and building teams that make analytics-based decisions is the key to realizing the value of Big Data.

REFERENCES
Aveksa Inc. (2013). Ensuring "Big Data" Security with Identity and Access Management. Waltham, MA: Aveksa.
Hewlett-Packard Development Company. (2012). Big Security for Big Data. Hewlett-Packard Development Company, L.P.
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. International Conference on System Sciences (pp. 995-1004). Hawaii: IEEE Computer Society.
Katal, A., Wazid, M., & Goudar, R. H. (2013). Big Data: Issues, Challenges, Tools and Good Practices. IEEE, 404-409.
Marr, B. (2013, November 13). The Awesome Ways Big Data is used Today to Change Our World. Retrieved November 14, 2013, from LinkedIn: https://www.linkedin.com/today/post/article/20131113065157-64875646-the-awesome-ways-big-data-is-used-today-to-change-our-world
Patel, A. B., Birla, M., & Nair, U. (2013). Addressing Big Data Problem Using Hadoop and Map Reduce. Gujarat: Nirma University.
Singh, S., & Singh, N. (2012). Big Data Analytics. International Conference on Communication, Information & Computing Technology (ICCICT) (pp. 1-4). Mumbai: IEEE.
The 2011 Digital Universe Study: Extracting Value from Chaos. (2011, November 30). Retrieved from EMC: http://www.emc.com/collateral/demos/microsites/emc-digital-universe2011/index.htm
World's data will grow by 50X in next decade, IDC study predicts. (2011, June 28). Retrieved from Computer World: http://www.computerworld.com/s/article/9217988/World_s_data_will_grow_by_50X_in_next_decade_IDC_study_predicts


More Related Content

What's hot (20)

Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Big Data
Big DataBig Data
Big Data
 
Ppt for Application of big data
Ppt for Application of big dataPpt for Application of big data
Ppt for Application of big data
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Big data
Big dataBig data
Big data
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Applications of Big Data
Applications of Big DataApplications of Big Data
Applications of Big Data
 
Data mining
Data miningData mining
Data mining
 
Big data
Big dataBig data
Big data
 
Data Science
Data ScienceData Science
Data Science
 
Fraud and Risk in Big Data
Fraud and Risk in Big DataFraud and Risk in Big Data
Fraud and Risk in Big Data
 
Overview of Big data(ppt)
Overview of Big data(ppt)Overview of Big data(ppt)
Overview of Big data(ppt)
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
data mining
data miningdata mining
data mining
 

Similar to Big Data: Issues and Challenges

big data Big Things
big data Big Thingsbig data Big Things
big data Big Thingspateelhs
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)NikitaRajbhoj
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAudrey Britton
 
How do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfHow do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfSoumodeep Nanee Kundu
 
Unit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdfUnit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdfRanjeet Bhalshankar
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET Journal
 
Big data ppt
Big data pptBig data ppt
Big data pptYash Raj
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfrajsharma159890
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.saranya270513
 
Big data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docxBig data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docxhartrobert670
 
research publish journal
research publish journalresearch publish journal
research publish journalrikaseorika
 
research publish journal
research publish journalresearch publish journal
research publish journalrikaseorika
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewIJERA Editor
 

Similar to Big Data: Issues and Challenges (20)

big data Big Things
big data Big Thingsbig data Big Things
big data Big Things
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)
 
Research paper on big data and hadoop
Research paper on big data and hadoopResearch paper on big data and hadoop
Research paper on big data and hadoop
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data Analytics
 
How do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfHow do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdf
 
1
11
1
 
Unit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdfUnit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdf
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
Big data
Big dataBig data
Big data
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Big Data Analysis
Big Data AnalysisBig Data Analysis
Big Data Analysis
 
Big data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docxBig data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docx
 
Big Data
Big DataBig Data
Big Data
 
Big Data.pdf
Big Data.pdfBig Data.pdf
Big Data.pdf
 
research publish journal
research publish journalresearch publish journal
research publish journal
 
research publish journal
research publish journalresearch publish journal
research publish journal
 
Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –Review
 

More from Harsh Kishore Mishra

More from Harsh Kishore Mishra (12)

Wormhole attack
Wormhole attackWormhole attack
Wormhole attack
 
Intellectual Property Rights : Indian Perspective
Intellectual Property Rights : Indian PerspectiveIntellectual Property Rights : Indian Perspective
Intellectual Property Rights : Indian Perspective
 
IEEE 802.11ac Standard
IEEE 802.11ac StandardIEEE 802.11ac Standard
IEEE 802.11ac Standard
 
BYOD: Bring Your Own Device Implementation and Security Issues
BYOD: Bring Your Own Device Implementation and Security IssuesBYOD: Bring Your Own Device Implementation and Security Issues
BYOD: Bring Your Own Device Implementation and Security Issues
 
BYOD: Implementation and Security Issues
BYOD: Implementation and Security IssuesBYOD: Implementation and Security Issues
BYOD: Implementation and Security Issues
 
Role of MicroRNA in Phosphorus Defficiency
Role of MicroRNA in Phosphorus DefficiencyRole of MicroRNA in Phosphorus Defficiency
Role of MicroRNA in Phosphorus Defficiency
 
Windows 8: inside what and how
Windows 8: inside what and howWindows 8: inside what and how
Windows 8: inside what and how
 
Windows 7 Versions Features
Windows 7 Versions FeaturesWindows 7 Versions Features
Windows 7 Versions Features
 
Software Testing and UML Lab
Software Testing and UML LabSoftware Testing and UML Lab
Software Testing and UML Lab
 
Network security
Network securityNetwork security
Network security
 
Intellectual Property Rights
Intellectual Property RightsIntellectual Property Rights
Intellectual Property Rights
 
Windows 8 CP
Windows 8 CPWindows 8 CP
Windows 8 CP
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 

Big Data: Issues and Challenges

2. BIG DATA CHARACTERISTICS

Data Volume: The "Big" in Big Data itself refers to volume. At present the data in existence is on the order of petabytes (10^15) and is expected to grow to zettabytes (10^21) in the near future. Data volume measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it.

Data Velocity: Velocity in Big Data deals with the speed of the data coming from various sources. This characteristic is not limited to the speed of incoming data but also covers the speed at which the data flows and is aggregated.

Data Variety: Data variety is a measure of the richness of the data representation: text, images, video, audio, etc. The data being produced is not of a single category; it includes not only traditional data but also semi-structured data from sources such as web pages, web log files, social media sites, e-mail and documents.

Data Value: Data value measures the usefulness of data in making decisions. Data science is exploratory and useful in getting to know the data, but "analytic science" encompasses the predictive power of big data. Users can run queries against the stored data, deduce important results from the filtered data obtained, and rank those results along the dimensions they require. These reports help people find business trends and change their strategies accordingly.

Complexity: Complexity measures the degree of interconnectedness (possibly very large) and interdependence in big data structures, such that a small change (or combination of small changes) in one or a few elements can yield very large changes, or a small change can ripple across or cascade through the system and substantially affect its behavior, or there may be no change at all (Katal, Wazid, & Goudar, 2013).
3. ISSUES IN BIG DATA

The issues in Big Data are conceptual points that should be understood by an organization in order to implement the technology effectively. Big Data issues should not be confused with problems, but they are important to know and crucial to handle.

Figure 2 : Explosion in size of Data (Hewlett-Packard Development Company, 2012)

3.1 Issues related to the Characteristics

Data Volume: As data volume increases, the value of different data records decreases in proportion to age, type, richness and quantity, among other factors. The existing social networking sites are themselves producing data on the order of terabytes every day, and this amount of data is difficult to handle using existing traditional systems.

Data Velocity: Traditional systems are not capable of performing analytics on data which is constantly in motion. E-commerce has rapidly increased the speed and richness of data used for different business transactions (for example, web-site clicks). Data velocity management is much more than a bandwidth issue.

Data Variety: This data is highly varied, consisting of raw, structured, semi-structured and even unstructured data, which is difficult to handle with the existing traditional analytic systems. From an analytic perspective, it is probably the biggest obstacle to
effectively using large volumes of data. Incompatible data formats, non-aligned data structures and inconsistent data semantics represent significant challenges that can lead to analytic sprawl.

Data Value: The data stored by different organizations is used by them for data analytics, and this produces a gap between business leaders and IT professionals: the main concern of business leaders is adding value to their business and increasing profit, while IT leaders have to concern themselves with the technicalities of storage and processing.

Data Complexity: A current difficulty of big data is that it cannot be worked on with relational databases and desktop statistics/visualization packages; it requires massively parallel software running on tens, hundreds, or even thousands of servers. It is quite an undertaking to link, match, cleanse and transform data across systems coming from various sources. It is also necessary to connect and correlate relationships, hierarchies and multiple data linkages, or data can quickly spiral out of control (Katal, Wazid, & Goudar, 2013).

3.2 Storage and Transport Issues

The quantity of data has exploded each time we have invented a new storage medium. What is different about the most recent data explosion, mainly due to social media, is that there has been no new storage medium. Moreover, data is being created by everyone and everything (from mobile devices to supercomputers), not just, as heretofore, by professionals such as scientists, journalists and writers. Current disk technology limits are about 4 terabytes (10^12) per disk, so 1 Exabyte (10^18) would require 250,000 disks. Even if an Exabyte of data could be processed on a single computer system, the system would be unable to directly attach the requisite number of disks, and access to that data would overwhelm current communication networks. Assuming that a 1 gigabit per second network has an effective sustainable transfer rate of 80%, the sustainable bandwidth is about 100 megabytes per second. At that rate, transferring a petabyte would take about 2,800 hours, and an Exabyte a thousand times longer, assuming a sustained transfer could be maintained. It would take longer to transmit the data from a collection or storage point to a processing point than to actually process it. To handle this issue, the data should be processed "in place" and only the resulting information transmitted. In other words, "bring the code to the data", unlike the traditional method of "bring the data to the code" (Kaisler, Armour, Espinosa, & Money, 2013).
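As a rough sanity check on these figures, the short sketch below (plain Python; the link speed and utilization are the stated assumptions, not measured values) reproduces the back-of-the-envelope transfer estimate:

    # Back-of-the-envelope transfer-time estimate (illustrative assumptions).
    LINK_BITS_PER_SEC = 1e9      # assumed 1 gigabit/s network link
    UTILIZATION = 0.8            # assumed 80% sustainable utilization
    bytes_per_sec = LINK_BITS_PER_SEC / 8 * UTILIZATION   # ~100 MB/s

    PETABYTE, EXABYTE = 1e15, 1e18
    print(f"1 PB: ~{PETABYTE / bytes_per_sec / 3600:,.0f} hours")  # ~2,800 hours
    print(f"1 EB: ~{EXABYTE / bytes_per_sec / 3600:,.0f} hours")   # ~1,000x longer

At 100 megabytes per second sustained, a single petabyte occupies the link for roughly four months; moving an Exabyte this way is simply impractical, which is what motivates processing the data in place.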
3.3 Data Management Issues

Data management will perhaps be the most difficult problem to address with big data. Resolving issues of access, utilization, updating, governance and reference (in publications) has proven to be a major stumbling block. The sources of the data are varied: by size, by format and by method of collection. Individuals contribute digital data in media comfortable to them, such as documents, drawings, pictures, sound and video recordings, models, software behaviors and user interface designs, with or without adequate metadata describing what, when, where, who, why and how it was collected, and its provenance. Unlike manual data collection, where rigorous protocols are often followed to ensure accuracy and validity, digital data collection is much more relaxed. Given the volume, it is impractical to validate every data item, so new approaches to data qualification and validation are needed. The richness of digital data representation prohibits a personalized methodology for data collection. To summarize, there is no perfect big data management solution yet. This represents an important gap in the research literature on big data that needs to be filled.

3.4 Processing Issues

Assume that an Exabyte of data needs to be processed in its entirety. For simplicity, assume the data is chunked into blocks of 8 words, so 1 Exabyte = 1K petabytes. Assuming a processor expends 100 instructions on one block at 5 gigahertz, the end-to-end processing time for one block would be 20 nanoseconds, and processing 1K petabytes would require a total end-to-end processing time of roughly 635 years. Effective processing of an Exabyte of data will therefore require extensive parallel processing and new analytics algorithms in order to provide timely and actionable information (Kaisler, Armour, Espinosa, & Money, 2013).
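The sketch below redoes this estimate with explicit assumptions (64-byte blocks, 20 nanoseconds per block; the cluster size is likewise assumed). Depending on the block-size accounting, the serial figure lands anywhere from years to centuries (the cited 635-year figure follows the paper's own accounting), but either way the conclusion holds that only massive parallelism makes the job tractable:

    # Order-of-magnitude processing estimate (illustrative assumptions).
    EXABYTE_BYTES = 1e18
    BLOCK_BYTES = 64                 # assumed block: 8 words of 8 bytes
    INSTR_PER_BLOCK = 100            # instructions spent per block
    CLOCK_HZ = 5e9                   # 5 GHz processor

    secs_per_block = INSTR_PER_BLOCK / CLOCK_HZ          # 20 ns per block
    serial_secs = (EXABYTE_BYTES / BLOCK_BYTES) * secs_per_block
    print(f"serial: ~{serial_secs / 3.15e7:.0f} years")  # ~10 years

    NODES = 10_000                   # assumed cluster size
    print(f"{NODES} nodes: ~{serial_secs / NODES / 3600:.0f} hours")  # ~9 hours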
4. CHALLENGES IN BIG DATA

The challenges in Big Data are the real implementation hurdles which require immediate attention. Any implementation that does not handle these challenges may lead to failure of the technology implementation and some unpleasant results.

4.1 Privacy and Security

This is the most important challenge with Big Data; it is sensitive and has conceptual, technical as well as legal significance.

• The personal information of a person (e.g. in the database of a merchant or a social networking website), when combined with external large data sets, leads to the inference of new facts about that person, and it is possible that such facts are private and the person might not want the data owner, or anyone else, to know them.
• Information regarding people is collected and used in order to add value to the business of the organization. This is done by creating insights into their lives that they are unaware of.
• Another important consequence is social stratification, where a literate person can take advantage of Big Data predictive analysis, while the underprivileged can be easily identified and treated worse.
• Big Data used by law enforcement increases the chances that certain tagged people will suffer adverse consequences, without the ability to fight back or even the knowledge that they are being discriminated against.

4.2 Data Access and Sharing of Information

If the data in a company's information systems is to be used to make accurate and timely decisions, it must be available in an accurate, complete and timely manner. This makes the data management and governance process a bit complex, adding the necessity to make data open and available to government agencies in a standardized manner, with standardized APIs, metadata and formats, thus leading to better decision making, business intelligence and productivity improvements. Expecting data sharing between companies is awkward because of the need to keep an edge in business: sharing data about their clients and operations threatens the culture of secrecy and competitiveness.
4.3 Analytical Challenges

The main challenging questions are:
• What if data volume gets so large and varied that it is not known how to deal with it?
• Does all data need to be stored?
• Does all data need to be analyzed?
• How can one find out which data points are really important?
• How can the data be used to best advantage?

Big Data brings with it some huge analytical challenges. Analyzing this huge amount of data, which can be unstructured, semi-structured or structured, requires a large number of advanced skills. Moreover, the type of analysis needed depends highly on the results to be obtained, i.e., on decision making. The analysis can be done using one of two techniques: either incorporate massive data volumes in the analysis, or determine upfront which Big Data is relevant.

4.4 Human Resources and Manpower

Since Big Data is in its youth as an emerging technology, it needs to attract organizations and young people with diverse new skill sets. These skills should not be limited to technical ones but should also extend to research, analytical, interpretive and creative ones. Such skills need to be developed in individuals, which requires training programs to be held by organizations. Moreover, universities need to introduce curricula on Big Data to produce skilled employees with this expertise.

4.5 Technical Challenges

4.5.1 Fault Tolerance: With the coming of new technologies like cloud computing and Big Data, it is always intended that whenever a failure occurs the damage done should be within an acceptable threshold, rather than the whole task having to begin again from scratch. Fault-tolerant computing is extremely hard, involving intricate algorithms, and it is simply not possible to devise absolutely foolproof, 100% reliable fault-tolerant machines or software. Thus the main task is to reduce the probability of failure to an "acceptable" level; unfortunately, the more we strive to reduce this probability, the higher the cost. Two methods which seem to increase the fault tolerance of Big Data systems are:
• First, divide the whole computation into tasks and assign these tasks to different nodes for computation.
• Second, assign one node the work of observing that these nodes are working properly; if something goes wrong, that particular task is restarted.

Sometimes, however, the whole computation cannot be divided into such independent tasks. Some tasks may be recursive in nature, where the output of the previous computation is the input to the next, and restarting the whole computation then becomes a cumbersome process. This can be avoided by applying checkpoints, which save the state of the system at certain intervals of time; in case of any failure, the computation can restart from the last checkpoint maintained, as in the sketch below.
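A minimal sketch of the checkpointing idea, assuming a simple iterative job whose state fits in a small file (the file name and state layout are hypothetical, not taken from any particular framework):

    import json, os

    CHECKPOINT = "state.json"        # hypothetical checkpoint file

    def load_checkpoint():
        # Resume from the last saved state if one exists.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)
        return {"step": 0, "total": 0}

    def save_checkpoint(state):
        with open(CHECKPOINT, "w") as f:
            json.dump(state, f)

    state = load_checkpoint()
    for step in range(state["step"], 1_000_000):
        state["total"] += step       # output of each step feeds the next
        state["step"] = step + 1
        if step % 10_000 == 0:       # checkpoint at fixed intervals
            save_checkpoint(state)
    save_checkpoint(state)

If the process dies mid-run, a restart replays from the last checkpoint rather than from step zero, which is exactly the trade-off described above: some periodic I/O cost in exchange for bounded recomputation.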
4.5.2 Scalability: The scalability issue of Big Data has led towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters. This requires a high level of resource sharing, which is expensive and brings with it various challenges, such as how to run and execute various jobs so that the goal of each workload can be met cost-effectively. It also requires dealing with system failures in an efficient manner, and these occur more frequently when operating on large clusters. These factors combined raise the concern of how to express programs, even complex machine learning tasks. There has also been a huge shift in the technologies being used: hard disk drives (HDDs) are being replaced by solid state drives and phase-change memory, which do not show the same performance difference between sequential and random data transfer. Thus, what kinds of storage devices to use is again a big question for data storage.

4.5.3 Quality of Data: Collecting and storing a huge amount of data comes at a cost. More data, if used for decision making or for predictive analysis in business, will definitely lead to better results. Business leaders will always want more and more data stored, whereas IT leaders will take all technical aspects into account before storing all the data. Big Data basically focuses on storing quality data rather than very large amounts of irrelevant data, so that better results and conclusions can be drawn. This leads to further questions, such as how it can be ensured that data is relevant, how much data is enough for decision making, and whether the stored data is accurate enough to draw conclusions from (Katal, Wazid, & Goudar, 2013).

4.5.4 Heterogeneous Data: Unstructured data represents almost every kind of data being produced, from social media interactions to recorded meetings, PDF documents, fax transfers, e-mails and more.
Working with unstructured data is cumbersome and, of course, costly too. Converting all this unstructured data into structured data is not feasible either. Structured data is organized in a highly mechanized and manageable way and integrates well with databases, but unstructured data is completely raw and unorganized.

5. BIG DATA TECHNOLOGIES AND PROJECTS

Big Data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to Big Data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems. Real or near-real-time information delivery is one of the defining characteristics of Big Data analytics.

A wide variety of techniques and technologies have been developed and adapted to aggregate, manipulate, analyze and visualize Big Data. These techniques and technologies draw from several fields including statistics, computer science, applied mathematics and economics. This means that an organization intending to derive value from Big Data has to adopt a flexible, multidisciplinary approach. When an enterprise can leverage all the information available in its large data rather than just a subset, it has a powerful advantage over market competitors. Big Data can help to gain insights and make better decisions, and it presents an opportunity to create unprecedented business advantage and better service delivery.

The average end user accesses myriad websites and employs a growing number of operating systems and apps daily, utilizing a variety of mobile and desktop devices. This translates to an overwhelming and ever-increasing volume, velocity and variety of data generated, shared and propagated. Successful protection relies on the right combination of methodologies, human insight, an expert understanding of the threat landscape, and the efficient processing of Big Data to create actionable intelligence.

5.1 Big Data Technologies

The International Data Corporation (IDC) study predicts that overall data will grow by 50 times by 2020, driven in large part by more embedded systems such as sensors in clothing, medical devices and structures like buildings and bridges. The study also determined that unstructured information, such as files, email and video, will account for 90% of all data created over the next decade. But the number of IT professionals available to manage all
that data will only grow by 1.5 times today's levels (World's data will grow by 50X in next decade, IDC study predicts, 2011). The digital universe is 1.8 trillion gigabytes in size, stored in 500 quadrillion files, and its size more than doubles every two years. Comparing the digital universe with our physical universe, there are nearly as many bits of information in the digital universe as there are stars in our physical universe (The 2011 Digital Universe Study: Extracting Value from Chaos, 2011).

Given a very large heterogeneous data set, a major challenge is to figure out what data one has and how to analyze it. Hybrid data sets combined from many other data sets hold more surprises than immediate answers. Analyzing such data will require adapting and integrating multiple analytic techniques to "see around corners", e.g., to realize that new knowledge is likely to emerge in a non-linear way. It is not clear that statistical analysis methods, as Ayres argues, are or can be the whole answer.

A Big Data platform should give a solution designed specifically with the needs of the enterprise in mind. The following are the basic features of a Big Data platform offering (Singh & Singh, 2012):
• Comprehensive: it should offer a broad platform and address all three dimensions of the Big Data challenge: volume, variety and velocity.
• Enterprise-ready: it should include performance, security, usability and reliability features.
• Integrated: it should simplify and accelerate the introduction of Big Data technology to the enterprise, and enable integration with the information supply chain, including databases, data warehouses and business intelligence applications.
• Open source based: it should be open source technology with enterprise-class functionality and integration.
• Low-latency reads and updates
• Robust and fault-tolerant
• Scalability
• Extensible
• Allows ad hoc queries
• Minimal maintenance

The mainly used technology is Hadoop, which is open source.
Hadoop

Hadoop is an open source project hosted by the Apache Software Foundation. It consists of many small subprojects which belong to the category of infrastructure for distributed computing. Hadoop mainly consists of:
• File System (the Hadoop Distributed File System)
• Programming Paradigm (MapReduce)

Figure 3 : Hadoop High Level Architecture (Patel, Birla, & Nair, 2013)

The other subprojects provide complementary services or build on the core to add higher-level abstractions. There exist many problems in dealing with the storage of large amounts of data. Though the storage capacities of drives have increased massively, the rate of reading data from them has not improved comparably; the reading process takes a large amount of time, and writing is slower still. This time can be reduced by reading from multiple disks at once. Using only one hundredth of each disk may seem wasteful, but if there are one hundred datasets, each of one terabyte, providing shared access to them is also a solution. Using many pieces of hardware, however, increases the chance of failure. This can be avoided by replication, i.e., creating redundant copies of the same data on different devices so that in case of failure a copy of the data is still available.
The main remaining problem is combining the data being read from the different devices. Many methods are available in distributed computing to handle this problem, but it is still quite challenging. All the problems discussed are handled well by Hadoop: the problem of failure is handled by the Hadoop Distributed File System, and the problem of combining data is handled by the MapReduce programming paradigm. MapReduce reduces the problem of disk reads and writes by providing a programming model that expresses computations over keys and values. Hadoop thus provides a reliable shared storage and analysis system: the storage is provided by HDFS and the analysis by MapReduce (Katal, Wazid, & Goudar, 2013).
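To make the model concrete, here is a minimal word count written as a map phase and a reduce phase in plain Python. A real Hadoop job would express the same two functions (typically in Java) against HDFS input, so this standalone version is only an illustrative sketch:

    from collections import defaultdict

    def map_phase(line):
        # Map: emit one (word, 1) pair per word in the input line.
        return [(word.lower(), 1) for word in line.split()]

    def reduce_phase(word, counts):
        # Reduce: combine all values that share a key.
        return word, sum(counts)

    lines = ["big data needs new tools", "big data keeps growing"]

    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            groups[word].append(count)

    print(dict(reduce_phase(w, c) for w, c in groups.items()))
    # {'big': 2, 'data': 2, 'needs': 1, ...}

Because the map calls are independent, the framework can run them on whichever nodes already hold the data blocks, which is the "bring the code to the data" principle discussed earlier.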
Figure 4 : HDFS Architecture (Patel, Birla, & Nair, 2013)

The following are the key Big Data tools available in the market for enterprise use:

Table 1: Tools available for Big Data
• IBM InfoSphere BigInsights: open source Apache Hadoop with IBM BigSheets, able to analyze data in its native format without imposing any schema or structure, enabling fast ad hoc analysis.
• WX2 Kognitio Analytical Platform: a fast and scalable in-memory analytic database platform.
• ParAccel Analytic Platform: a columnar, massively parallel processing (MPP) analytic database platform, with strong features for query optimization and compilation, compression and network interconnect.
• SAND Analytic Platform: a columnar analytic database platform that achieves linear data scalability through massively parallel processing (MPP).

5.2 Big Data Projects

The table below describes some of the projects which are using Big Data effectively.

Table 2: Various Big Data Projects

Big Sciences:
• The Large Hadron Collider (LHC) is the world's largest and highest-energy particle accelerator, built to allow physicists to test the predictions of different theories of particle physics and high-energy physics. The data flow from its experiments amounts to 25 petabytes (as of 2012) before replication and reaches up to 200 petabytes after replication.
• The Sloan Digital Sky Survey is a multi-filter imaging and spectroscopic redshift survey using a 2.5 m wide-angle optical telescope at Apache Point Observatory in New Mexico, United States. It continues at a rate of about 200 GB per night and has gathered more than 140 terabytes of information.

Private Sector:
• Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB and 24.7 TB.
• Walmart is estimated to store more than 2.5 petabytes of data in order to handle more than 1 million customer transactions every hour.
Government:
• The Obama administration's Big Data initiative explores uses of big data that ease government tasks and reduce the problems faced. It includes 84 different Big Data programs spread across six government departments.
• Under the Comprehensive National Cybersecurity Initiative (a United States NSA and Director of National Intelligence initiative), the Utah Data Center was established to store data on the scale of yottabytes; its main task is to support cyber security.

International Development:
• Information and Communication Technologies for Development (ICT4D) uses information and communication technologies (ICTs) for socio-economic development, human rights and international development. Big Data can make important contributions to international development.
6. BIG DATA ADVANTAGES AND GOOD PRACTICES

Big Data has numerous advantages for society, science and technology; what matters is how it is used for human beings. Some of the advantages (Marr, 2013) are described below:

• Understanding and Targeting Customers
This is one of the biggest and most publicized areas of Big Data use today. Here, Big Data is used to better understand customers and their behaviors and preferences. Companies are keen to expand their traditional data sets with social media data, browser logs, text analytics and sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models.

• Understanding and Optimizing Business Processes
Big Data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends and weather forecasts. One particular business process that is seeing a lot of Big Data analytics is supply chain or delivery route optimization. HR business processes are also being improved using Big Data analytics, including the optimization of talent acquisition, Moneyball style, as well as the measurement of company culture and staff engagement using Big Data tools.

• Improving Science and Research
Science and research are currently being transformed by the new possibilities Big Data brings. Take, for example, CERN, the Swiss nuclear physics lab with its Large Hadron Collider, the world's largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe, how it started and how it works, generate huge amounts of data. The CERN data centre has 65,000 processors to analyse its 30 petabytes of data, yet it also uses the computing power of thousands of computers distributed across 150 data centres worldwide to analyse the data. Such computing power can be leveraged to transform many other areas of science and research.
• Improving Healthcare and Public Health
The computing power of Big Data analytics enables us to decode entire DNA strings in minutes, and it will allow us to find new cures and better understand and predict disease patterns. Just think of what happens when all the individual data from smart watches and wearable devices can be aggregated and applied to millions of people and their various diseases. The clinical trials of the future won't be limited by small sample sizes but could potentially include everyone.

• Optimizing Machine and Device Performance
Big Data analytics help machines and devices become smarter and more autonomous. For example, Big Data tools are used to operate Google's self-driving car: a Toyota Prius fitted with cameras, GPS, and powerful computers and sensors to drive safely on the road without human intervention. Big Data tools are also used to optimize energy grids using data from smart meters. We can even use Big Data tools to optimize the performance of computers and data warehouses.

• Financial Trading
High-frequency trading (HFT) is an area where Big Data finds a lot of use today. Here, Big Data algorithms are used to make trading decisions. Today the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.

• Improving Security and Law Enforcement
Big Data is applied heavily in improving security and enabling law enforcement. The revelations are that the National Security Agency (NSA) in the U.S. uses Big Data analytics to foil terrorist plots (and maybe spy on us). Others use Big Data techniques to detect and prevent cyber-attacks. Police forces use Big Data tools to catch criminals and even predict criminal activity, and credit card companies use Big Data to detect fraudulent transactions.

Other advantages and uses include improving sports performance, improving and optimizing cities and countries, and personal quantification and performance optimization.
Good Practices for Big Data

• Creating dimensions of all the data being stored is a good practice for Big Data analytics; the data needs to be divided into dimensions and facts.
• All dimensions should have durable surrogate keys, meaning keys that cannot be changed by any business rule and that are assigned in sequence or generated by a hashing algorithm ensuring uniqueness (see the sketch after this list).
• Expect to integrate structured and unstructured data, since all kinds of data are part of Big Data and need to be analyzed together.
• Generality of the technology is needed to deal with different formats of data; building technology around key-value pairs works well.
• When analyzed data sets include identifying information about individuals or organizations, privacy becomes an issue whose importance, particularly to consumers, grows as the value of Big Data becomes more apparent.
• Data quality needs to be better. Tasks like filtering, cleansing, pruning, conforming, matching, joining and diagnosing should be applied at the earliest touch points possible.
• There should be certain limits on the scalability of the data stored.
• Business leaders and IT leaders should work together to yield more business value from the data. Collecting, storing and analyzing data comes at a cost: business leaders will push for it, but IT leaders have to consider technological limitations, staff restrictions and other constraints. The decisions taken should be revisited to ensure that the organization is considering the right data to produce insights at any given point in time.
• Investment in data quality and metadata is also important, as it reduces processing time.
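As an illustration of the durable-surrogate-key practice above, the sketch below derives a stable key by hashing a record's natural key; the field names are hypothetical, and SHA-256 is just one reasonable choice of hash:

    import hashlib

    def surrogate_key(natural_key: str) -> str:
        # Hash the natural key: the same input always yields the same
        # surrogate, independent of any business rule that may change.
        return hashlib.sha256(natural_key.encode("utf-8")).hexdigest()[:16]

    # Hypothetical customer dimension record.
    record = {"customer_id": "C-1001", "country": "IN"}
    record["customer_sk"] = surrogate_key(record["customer_id"])
    print(record)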
7. CONCLUSION

This seminar report has covered some of the important issues that an organization needs to analyze when estimating the significance of implementing Big Data technology, along with some direct challenges to the infrastructure of the technology. The commercial impacts of Big Data have the potential to generate significant productivity growth for a number of vertical sectors. They should also create healthy demand for talented individuals who are capable of helping organizations make sense of this growing volume of raw data. In short, Big Data presents an opportunity to create unprecedented business advantages and better service delivery. A regulatory framework for Big Data is essential; that framework must be constructed with a clear understanding of the ravages that have been wrought on personal interests by the reduction of information to data, its centralization, and its expropriation. But the biggest gap, by a factor of ten, is the lack of skilled managers able to make decisions based on analysis. Growing talent and building teams that make analytics-based decisions is the key to realizing the value of Big Data.
REFERENCES

Aveksa Inc. (2013). Ensuring "Big Data" Security with Identity and Access Management. Waltham, MA: Aveksa.

Hewlett-Packard Development Company, L.P. (2012). Big Security for Big Data. Hewlett-Packard Development Company.

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. International Conference on System Sciences (pp. 995-1004). Hawaii: IEEE Computer Society.

Katal, A., Wazid, M., & Goudar, R. H. (2013). Big Data: Issues, Challenges, Tools and Good Practices. IEEE, 404-409.

Marr, B. (2013, November 13). The Awesome Ways Big Data is used Today to Change Our World. Retrieved November 14, 2013, from LinkedIn: https://www.linkedin.com/today/post/article/20131113065157-64875646-the-awesome-ways-big-data-is-used-today-to-change-our-world

Patel, A. B., Birla, M., & Nair, U. (2013). Addressing Big Data Problem Using Hadoop and Map Reduce. Gujarat: Nirma University.

Singh, S., & Singh, N. (2012). Big Data Analytics. International Conference on Communication, Information & Computing Technology (ICCICT) (pp. 1-4). Mumbai: IEEE.

The 2011 Digital Universe Study: Extracting Value from Chaos. (2011, November 30). Retrieved from EMC: http://www.emc.com/collateral/demos/microsites/emc-digital-universe2011/index.htm

World's data will grow by 50X in next decade, IDC study predicts. (2011, June 28). Retrieved from Computer World: http://www.computerworld.com/s/article/9217988/World_s_data_will_grow_by_50X_in_next_decade_IDC_study_predicts