This document discusses big data issues, challenges, tools, and good practices. It defines big data as large amounts of data from various sources that require new technologies to extract value. Common big data properties include volume, velocity, variety, and value. Hadoop is presented as a key tool for big data, using a distributed file system and the MapReduce framework to process large datasets in parallel across clusters of servers. Good practices for big data include creating data dimensions, integrating structured and unstructured data, and improving data quality.
1. BIG DATA: ISSUES, CHALLENGES, TOOLS AND GOOD PRACTICES
2. MOTIVATION
• Data stores are growing by 50% each year, and that rate of increase is accelerating [8]
• The type of data is also changing: over 80% of it will be unstructured data, which does not work well with relational databases [8]
• The main difficulty is that data volume is increasing rapidly in comparison to computing resources
3. DEFINING BIG DATA
• Big data is defined as a large amount of data which requires new technologies and architectures so that it becomes possible to extract value from it through capture and analysis.
• It is a recent, emerging technology that can bring huge benefits to business organizations.
4. PROPERTIES OF BIG DATA
• Variety: Data being produced is not only traditional but also semi-structured, from various sources.
• Volume: Data is expected to grow to zettabytes in the near future.
• Velocity: The speed at which data arrives from various sources.
• Variability: Considers the inconsistencies of the data flow.
• Complexity: It is difficult to link, match, cleanse, and transform data coming from various systems and sources.
• Value: Queries can be run against the stored data to deduce important results.
6. RELATED WORK
• Collaborative research on methodologies for big data analysis and design [1]
• Databases required for big data [2]
• Architectural considerations for big data [3]
• Concept of big data with market solutions [4]
• Scientific Data Infrastructure (SDI) generic architectural model [5]
• How big data analytics differs from traditional analytics [6]
• Analysis of social media sites like Facebook, Flickr, and Google+ [7]
7. IMPORTANCE OF BIG DATA
• Log storage in IT industries
– IT industries store large amounts of data as logs to deal with problems that occur rarely.
– Big data analytics is applied to these logs to pinpoint the points of failure.
– Traditional systems are not able to handle such logs.
• Sensor data
– The massive volume of sensor data is also a big challenge for big data.
8. • Risk analysis
– It is important for financial institutions to model data in order to calculate risk.
– A lot of potentially useful data is underutilized because of its volume; it should be integrated to determine risk patterns more accurately.
• Social media
– The largest use of big data is for social media and customer sentiment.
– Keeping an eye on what customers are saying is a form of feedback.
– That customer feedback can then be used to make decisions and add value to the business.
9. BIG DATA CHALLENGES AND ISSUES
• Privacy and security
– The most important issue with big data, with conceptual, technical, and legal significance.
– A person's personal information, when combined with external large data sets, can lead to the inference of new private facts about that person.
– Big data used by law enforcement increases the chances that certain tagged people will suffer adverse consequences.
10. • Data access and sharing of information
– If data is to be used to make accurate decisions in time, it must be available in an accurate, complete, and timely manner.
• Storage and processing issues
– Many companies are struggling to store the large amounts of data they produce.
• Outsourcing storage to the cloud may seem like an option, but long upload times and constant updates to the data preclude it.
– Processing a large amount of data also takes a lot of time.
11. • Analytical challenges
– What if data volume gets so large that we don't know how to deal with it?
– Does all data need to be stored?
– Does all data need to be analyzed?
– Which data points are really important?
– How can data be used to best advantage?
• Skill requirement: Being a new and emerging technology, big data needs to attract organizations and young people with diverse new skill sets.
12. • Technical challenges
– Fault tolerance
– Scalability
– Quality of data
– Heterogeneous data
13. TOOLS AND TECHNIQUES AVAILABLE
• Hadoop is an open source project hosted by the Apache Software Foundation for managing big data
• Hadoop consists of two main components:
– The Hadoop Distributed File System (HDFS), a distributed file system that stores data on multiple separate servers (each with its own processor(s))
– MapReduce, the framework that understands and assigns work to the nodes in a cluster [9]
14. ADVANTAGES OF HADOOP
• Hadoop provides the following advantages [9]:
– Data read/write performance is increased by distributing the data across the cluster, allowing each processor to work in parallel
– It is scalable: new nodes can be added as needed without making changes to the existing system
– It is cost-effective because it brings parallel computing to commodity servers
15. ADVANTAGES OF HADOOP…
– It is flexible: it can absorb any type of data, structured or not, from any number of sources
– It is fault tolerant: it handles failures intrinsically by always storing multiple copies of the data and automatically loading a copy when a fault is detected
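The fault-tolerance idea above can be sketched in a few lines of Python. This is a toy illustration, not Hadoop code: the block size is ignored, and the replication factor of 3, the node names, and the round-robin placement are all assumptions made for the sketch. It shows why losing one node loses no data when every block has copies elsewhere.

```python
# Toy sketch of HDFS-style block replication (not real Hadoop code).
# A file is split into blocks; each block is copied to 3 distinct nodes,
# so the file survives the failure of any single node.
from itertools import cycle

def place_blocks(blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin."""
    ring = cycle(range(len(nodes)))
    placement = {}
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replicas)]
    return placement

def readable_blocks(placement, failed_node):
    """Blocks still readable after one node fails."""
    return [b for b, locs in placement.items()
            if any(n != failed_node for n in locs)]

nodes = ["node-1", "node-2", "node-3", "node-4"]
blocks = ["blk_0", "blk_1", "blk_2"]
placement = place_blocks(blocks, nodes)

# Every block is still readable even if node-1 goes down.
print(readable_blocks(placement, "node-1"))
# → ['blk_0', 'blk_1', 'blk_2']
```

With 3 replicas on 4 nodes, every block survives any single-node failure; real HDFS additionally re-replicates lost copies in the background.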
16. HADOOP
• How do you use Hadoop?
– The developer writes a program that conforms to the MapReduce programming model
– The developer specifies the format of the data to be processed in their program
17. HADOOP
• How does MapReduce work? [10]
– Each Hadoop program performs two tasks:
• Map breaks all of the data down into key/value pairs
• Reduce takes the output from the map step as input and combines those key/value pairs into a smaller set of key/value pairs
18. MAP REDUCE - EXAMPLE
• MapReduce example [10]: Assume you have five files, and each file contains two columns that represent a city and the corresponding temperature recorded in that city on various measurement days
– Toronto, 20; New York, 22; Rome, 32; Toronto, 4; Rome, 33; New York, 18
• We want to find the maximum temperature for each city across all of the data files
• We create five map tasks, where each mapper works on one of the five files; the mapper task goes through the data and returns the maximum temperature for each city
– This results in: (Toronto, 20) (New York, 22) (Rome, 33)
19. MAP REDUCE – EXAMPLE…
• Let's assume the other four mapper tasks (working on the other four files, not shown here) produced the following intermediate results:
– (Toronto, 18) (New York, 32) (Rome, 37); (Toronto, 32) (New York, 33) (Rome, 38); (Toronto, 22) (New York, 20) (Rome, 31); (Toronto, 31) (New York, 19) (Rome, 30)
• All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city, producing the final result set:
– (Toronto, 32) (New York, 33) (Rome, 38)
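The flow in the temperature example can be simulated in plain Python. This is an illustrative sketch only, not Hadoop code (real jobs are typically written in Java or run via Hadoop Streaming); the five in-memory lists stand in for the five input files described in the example.

```python
# Simulate the MapReduce flow from the slides: each "mapper" emits the
# max temperature per city for one file; the "reducer" combines them.
from collections import defaultdict

# Five input files, each a list of (city, temperature) records.
files = [
    [("Toronto", 20), ("New York", 22), ("Rome", 32),
     ("Toronto", 4), ("Rome", 33), ("New York", 18)],
    [("Toronto", 18), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("New York", 19), ("Rome", 30)],
]

def mapper(records):
    """Map task: emit (city, max temperature) pairs for one file."""
    local_max = {}
    for city, temp in records:
        local_max[city] = max(temp, local_max.get(city, temp))
    return local_max.items()

def reducer(intermediate):
    """Reduce task: combine all mappers' pairs into one max per city."""
    result = defaultdict(lambda: float("-inf"))
    for city, temp in intermediate:
        result[city] = max(result[city], temp)
    return dict(result)

intermediate = [pair for f in files for pair in mapper(f)]
print(reducer(intermediate))
# → {'Toronto': 32, 'New York': 33, 'Rome': 38}
```

The mapper's local maximum for the first file matches the slide's (Toronto, 20) (New York, 22) (Rome, 33), and the reducer produces the final result set shown above.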
20. BIG DATA – GOOD PRACTICES
• Creating dimensions of all the data being stored is good practice.
• All dimensions should have durable surrogate keys that cannot be changed and are unique.
• Expect to integrate structured and unstructured data.
• Generality of technology is needed; building it around key/value pairs works.
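The durable-surrogate-key practice can be sketched as follows. The table layout, key scheme, and attribute names here are illustrative assumptions, not a prescribed design: each dimension row gets a meaningless, never-reused integer key, so references to the row stay valid even when its attributes change.

```python
# Sketch of a dimension table identified by durable surrogate keys
# rather than by mutable natural keys (illustrative, not a real design).
from itertools import count

class Dimension:
    def __init__(self):
        self._next_key = count(1)   # surrogate keys are never reused
        self._by_natural = {}       # natural key -> surrogate key
        self.rows = {}              # surrogate key -> attributes

    def upsert(self, natural_key, **attrs):
        """Insert or update a row; the surrogate key never changes."""
        key = self._by_natural.get(natural_key)
        if key is None:
            key = next(self._next_key)
            self._by_natural[natural_key] = key
        self.rows[key] = {"natural_key": natural_key, **attrs}
        return key

customers = Dimension()
k1 = customers.upsert("alice@example.com", city="Toronto")
k2 = customers.upsert("bob@example.com", city="Rome")
# Updating Alice's attributes keeps her surrogate key stable.
assert customers.upsert("alice@example.com", city="New York") == k1
```

Because facts elsewhere reference the surrogate key `k1` rather than the email address, attribute updates never invalidate existing references.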
21. BIG DATA – GOOD PRACTICES…
• As the value of big data becomes more apparent, privacy concerns grow.
• Data quality needs to be better.
• There are limits on the scalability of records.
• Business and IT leaders should work together to create more value from data.
• Investment in data quality and metadata reduces processing time.
22. CONCLUSIONS
• Big data is a new concept; this work surveyed its importance and existing projects.
• Many challenges and issues exist which need to be addressed.
• Big data will help businesses grow.
• Hadoop is a key tool for managing big data.
23. REFERENCES
• [1] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, William Money, "Big Data: Issues and Challenges Moving Forward", IEEE, 46th Hawaii International Conference on System Sciences, 2013.
• [2] Sam Madden, "From Databases to Big Data", IEEE Internet Computing, May-June 2012.
• [3] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach", IEEE Aerospace Conference, 2012.
• [4] Sachchidanand Singh, Nirmala Singh, "Big Data Analytics", IEEE International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.
• [5] Yuri Demchenko, Zhiming Zhao, Paola Grosso, Adianto Wibisono, Cees de Laat, "Addressing Big Data Challenges for Scientific Data Infrastructure", IEEE 4th International Conference on Cloud Computing Technology and Science, 2012.
24. REFERENCES…
• [6] Martin Courtney, "The Larging-up of Big Data", IEEE Engineering & Technology, September 2012.
• [7] Matthew Smith, Christian Szongott, Benjamin Henne, Gabriele von Voigt, "Big Data Privacy Issues in Public Social Media", IEEE 6th International Conference on Digital Ecosystems Technologies (DEST), 18-20 June 2012.
• [8] Why Every Database Must Be Broken Soon. https://blogs.vmware.com/vfabric/2013/03/why-every-database-must-be-broken-soon.html
• [9] What is Hadoop? http://www-01.ibm.com/software/data/infosphere/hadoop/
• [10] What is MapReduce? http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce