This document discusses issues, opportunities, and challenges related to big data. It provides an overview of big data characteristics like volume, variety, velocity, and veracity. It also describes Hadoop and HDFS for distributed storage and processing of big data. The document outlines issues in big data like storage, management, and processing challenges due to scale. Opportunities in big data analytics are also presented. Finally, challenges like heterogeneity, scale, timeliness, and ownership are discussed along with approaches like Hadoop, Spark, NoSQL databases, and Presto for tackling big data problems.
1. Issues, Opportunities and Challenges in Big Data
To Be Presented
By
Zaharaddeen Karami Lawal
Department Of Computer Science And Engineering.
Jodhpur National University.
2. Content
Introduction
Characteristics of Big Data.
Hadoop and HDFS.
MapReduce and Its Components.
Issues in Big Data.
Opportunities with Big Data.
Tackling Big Data Challenges.
Conclusion.
References.
3. Introduction
The concept of big data has been endemic within computer science since the
earliest days of computing. “Big Data” originally meant the volume of data
that could not be processed (efficiently) by traditional database methods
and tools.
In broad terms, big data can be described as data sets so large or
complex that traditional data processing applications cannot handle
them, especially when the data is unstructured or semi-structured.
Big data size: ranges from terabytes (10^12 bytes) to petabytes (10^15 bytes).
Around 2.5 quintillion (2.5 × 10^18) bytes of new data are created every day.
90% of the world’s data has been created in just the last 2 to 3 years.
4. Characteristics
Big data can be described by the following characteristics:
1. Volume – The quantity of data that is generated.
2. Variety - The type and nature of the data (structured, semi-structured, or
unstructured), which data analysts need to know in order to use it effectively.
3. Velocity - The speed at which data is generated and processed to meet
demand.
4. Veracity - The quality of the data being captured can vary greatly, which
makes the data uncertain. The accuracy of any analysis depends on the
veracity of the source data.
5. Techniques for Big Data Mining
Hadoop and HDFS
Hadoop is a scalable, open source, fault-tolerant virtual grid operating
system architecture for data storage and processing. It runs on commodity
hardware and uses HDFS, a fault-tolerant, high-bandwidth clustered
storage architecture. It runs MapReduce for distributed data processing and
works with both structured and unstructured data.
MapReduce
MapReduce is a programming model for processing large-scale datasets
in computer clusters. The MapReduce programming model consists of two
functions, map() and reduce().
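The two functions can be illustrated with a word count run in a single Python process. This is only a sketch of the programming model, not Hadoop's API: the names map_fn, shuffle, reduce_fn and run_job are hypothetical, and the shuffle step stands in for the grouping by key that a real framework performs across the cluster between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """reduce(): combine all values that were emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_fn(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = run_job(["big data is big", "data is everywhere"])
print(counts)
```

Because each call to map_fn sees only one line and each call to reduce_fn sees only one key, both phases could in principle be spread over many machines, which is what makes the model suitable for clusters.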
7. Issues in Big Data
1. Storage and Transport Issues
The quantity of data has exploded each time we have invented a new storage
medium. What is different about the most recent explosion – due largely to social
media – is that there has been no new storage medium. Moreover, data is being
created by everyone and everything (e.g., Facebook, Twitter, WhatsApp).
2. Management Issues
Management will, perhaps, be the most difficult problem to address with big data.
This problem first surfaced a decade ago in the UK eScience initiatives where data
was distributed geographically and “owned” and “managed” by multiple entities.
3. Processing Issues
Assume that an exabyte of data (1 exabyte = 1K petabytes) needs to be processed
in its entirety. For simplicity, assume the data is chunked into blocks of 8 words.
Assuming a processor running at 5 gigahertz expends 100 instructions on one block,
processing a single block would take 20 nanoseconds. Processing the full 1K
petabytes sequentially would then require a total end-to-end processing time of
roughly 635 years.
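The arithmetic behind this estimate can be checked with a few lines of Python. The sketch below assumes one instruction per clock cycle and a 365-day year, neither of which is stated on the slide. Under those assumptions, a total of roughly 635 years corresponds to about 10^18 block operations executed one after another; larger blocks would mean fewer operations and a proportionally shorter time. The workers parameter is a hypothetical addition that models perfectly even parallelism, which real clusters only approximate.

```python
CLOCK_HZ = 5e9                 # 5 gigahertz, one instruction per cycle (assumed)
INSTRUCTIONS_PER_BLOCK = 100   # instructions spent on each block
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_per_block():
    """Time to process one block on one processor."""
    return INSTRUCTIONS_PER_BLOCK / CLOCK_HZ


def processing_years(num_blocks, workers=1):
    """Total time in years, assuming the work divides evenly among workers."""
    return num_blocks * seconds_per_block() / workers / SECONDS_PER_YEAR


per_block_ns = seconds_per_block() * 1e9
years_for_1e18 = processing_years(1e18)
print(f"{per_block_ns:.0f} ns per block")
print(f"{years_for_1e18:.0f} years for 1e18 blocks on one processor")
print(f"{processing_years(1e18, workers=1000):.2f} years on 1000 processors")
```

The last line shows why distributing the work matters: under the idealized assumption of even division, a thousand processors bring the same job down to well under a year.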
8. Big Data Opportunities
Manageability - when data can grow in a single file system namespace, the
manageability of the system increases significantly: a single data
administrator can manage a petabyte or more of storage, versus 50 or
100 terabytes on a scale-up system.
Elimination of stovepipes - since these systems scale linearly and do not
have the bottlenecks that scale-up systems create, all data is kept in a single
file system in a single grid, eliminating the stovepipes introduced by the
multiple arrays and file systems otherwise required.
Just-in-time scalability - as storage needs grow, an appropriate number of
nodes can be added at the time they are needed.
Increased utilization rates - since the data servers in these scale-out
systems can address the entire pool of storage, there is no stranded capacity.
9. Big Data Challenges
• Heterogeneity and Incompleteness
The difficulties of big data analysis derive from its large scale as well as
the presence of mixed data based on different patterns or rules
(heterogeneous mixture data) in the collected and stored data. In the case of
complicated heterogeneous mixture data, the data has several patterns and
rules and the properties of the patterns vary greatly. Data can be both
structured and unstructured. Nowadays, 80% of the data generated by
organizations is unstructured.
• Scale and complexity
Managing large and rapidly increasing volumes of data is a challenging
issue. Traditional software tools are not enough for managing the increasing
volumes of data. Data analysis, organization, retrieval and modeling are
also challenges due to scalability and complexity of data that needs to be
analyzed.
10. Big Data Challenges (Continued)
• Timeliness
As the size of the data sets to be processed increases, it will take more
time to analyze. In some situations, results of the analysis are required
immediately. For example, if a fraudulent credit card transaction is
suspected, it should ideally be flagged before the transaction is completed,
potentially preventing the transaction from taking place at all. Likewise, any
delay in a stock exchange can cause huge losses.
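The credit card example implies that each transaction must be judged as it arrives, using only a small amount of state, rather than in a later batch job. A minimal sketch of that idea follows. The FraudScreen class, its ratio threshold and its min_history parameter are hypothetical illustrations, not a real fraud model: a transaction is blocked when it exceeds a multiple of the card's running average.

```python
from collections import defaultdict


class FraudScreen:
    """Toy inline check: keeps a running mean per card (hypothetical rule)."""

    def __init__(self, ratio=5.0, min_history=3):
        self.ratio = ratio              # block if amount > ratio * running mean
        self.min_history = min_history  # approvals needed before checking
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def authorize(self, card, amount):
        """Decide in constant time, before the transaction completes."""
        seen = self._count[card]
        if seen >= self.min_history:
            mean = self._total[card] / seen
            if amount > self.ratio * mean:
                return False  # blocked; not added to the card's history
        self._count[card] += 1
        self._total[card] += amount
        return True


screen = FraudScreen()
stream = [("c1", 20.0), ("c1", 25.0), ("c1", 15.0), ("c1", 900.0), ("c1", 30.0)]
decisions = [screen.authorize(card, amount) for card, amount in stream]
print(decisions)
```

The decision cost per transaction does not grow with the amount of history, which is the property the timeliness requirement demands.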
• Data Ownership
Data ownership presents a critical and ongoing challenge, particularly in
the social media arena. While petabytes of social media data reside on the
servers of Facebook, MySpace, and Twitter, the data is not really owned by
them (although they may argue that it is, because of residency). Certainly,
the “owners” of the pages or accounts believe they own the data. This
dichotomy will have to be resolved in court.
11. Big Data Challenges (Continued)
Other challenges are:
Availability
Inconsistency
Performance
Privacy
Security
Infrastructure faults
Extreme Data Distribution
Dynamic Design Challenges
12. Tackling Big Data Challenges
Hadoop
Apache Hadoop and HDFS are widely used for storing and managing big data.
Analyzing big data is a challenging task, as it involves large distributed file systems
that must be fault-tolerant, flexible, and scalable.
Spark
Spark's ability to handle advanced data processing tasks, such as real-time stream
processing and machine learning, is well ahead of that of Hadoop.
NoSQL Databases
They do not rely on the relational model and do not use the SQL language;
they tend to run on cluster architectures;
they do not have a fixed schema, allowing any data to be stored in any record.
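The absence of a fixed schema can be illustrated with a toy in-memory store. The DocumentStore class below is hypothetical and does not reflect the API of any particular NoSQL product; it only shows that records with different fields can sit side by side and still be queried without SQL.

```python
class DocumentStore:
    """Toy key-document store with no fixed schema (illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        """Store any dictionary under a key; no field list is enforced."""
        self._docs[key] = dict(document)

    def get(self, key):
        """Return the document for a key, or None if it is absent."""
        return self._docs.get(key)

    def find(self, field, value):
        """Return keys of documents that have the field with this value."""
        return [k for k, doc in self._docs.items() if doc.get(field) == value]


store = DocumentStore()
# Each record carries its own set of fields.
store.put("u1", {"name": "Asha", "city": "Jodhpur"})
store.put("u2", {"name": "Bello", "followers": 120, "tags": ["hadoop", "spark"]})
store.put("e1", {"type": "click", "city": "Jodhpur"})

matches = store.find("city", "Jodhpur")
print(matches)
```

In a relational table, adding the followers or tags field would require a schema change; here the second record simply includes them.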
Presto
An efficient big data system developed by data engineers at the social
networking site Facebook.
An open source distributed SQL query engine for running interactive analytical
queries against data sources of all sizes ranging from gigabytes to petabytes.
14. Conclusion
The concept of big data and the HDFS component were introduced.
The issues regarding the storage, processing and management of big data
were discussed.
The opportunities and challenges facing big data from data storage
and analytics perspectives were also discussed.
The measures that can be taken to overcome those challenges were also
discussed.
15. References
[1] S. Kaisler, F. Armour and J. A. Espinosa, "Big Data: Issues and Challenges Moving Forward," in 46th
Hawaii International Conference on System Sciences, Hawaii, 2013.
[2] "www.studymafia.org," [Online].
[3] D. J. S. Kiran, M. Sravanthi, K. Preethi and M. Anusha, "Recent Issues and Challenges on Big Data in
Cloud Computing," International Journal of Computer Science and Technology (IJCST), vol. 6, no. 2,
April - June 2015.
[4] American Institute of Physics (AIP), College Park, MD, 2010. [Online]. Available:
http://www.aip.org/fyi/2010/.
[5] D. Borthakur, The Hadoop Distributed File System Architecture and Design, 2007.
[6] A. Johnson, H. P.H, V. Paul and M. S. P.N, "Big Data Processing Using Hadoop MapReduce,"
International Journal of Computer Science and Information Technologies (IJCSIT), vol. 6 (1), pp.
127-132, 2015.
[7] "Hadoop," [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy.
[8] C. Ordonez, Algorithms and Optimizations for Big Data Analytics, University of Houston, USA.:
Cubes, Tech Talks.