Issues, Opportunities and Challenges
in Big Data
To Be Presented
By
Zaharaddeen Karami Lawal
Department of Computer Science and Engineering,
Jodhpur National University.
Content
Introduction
Characteristics of Big Data.
Hadoop and HDFS.
MapReduce and Its Components.
Issues in Big Data.
Opportunities with Big Data.
Tackling Big Data Challenges.
Conclusion.
References.
Introduction
The concept of big data has been endemic within computer science since the
earliest days of computing. “Big Data” originally meant the volume of data
that could not be processed (efficiently) by traditional database methods
and tools.
In broad terms, Big Data can be described as data sets so large or complex
that they cannot be handled by traditional data processing applications,
especially unstructured or semi-structured data.
 Big Data size: ranges from terabytes (10^12 bytes) to petabytes (10^15 bytes).
 Around 2.5 quintillion (10^18) bytes of new data is created every day.
 90% of the world’s data has been created in just the last 2 to 3 years.
Characteristics
Big data can be described by the following characteristics:
1. Volume – The quantity of data that is generated.
2. Variety - The range of data types and sources; knowing which category
data belongs to is essential for analysts to handle it properly.
3. Velocity - The term ‘velocity’ in the context refers to the speed of
generation of data or how fast the data is generated and processed to meet
the demands and the challenges which lie ahead in the path of growth and
development.
4. Veracity - The quality of the data being captured can vary greatly.
Accuracy of analysis depends on the veracity of the source data, i.e., the
uncertainty of the data.
Techniques for Big Data Mining
Hadoop and HDFS
Hadoop is a scalable, open-source, fault-tolerant virtual-grid operating
system architecture for data storage and processing. It runs on commodity
hardware and uses HDFS, a fault-tolerant, high-bandwidth clustered storage
architecture. It runs MapReduce for distributed data processing and works
with both structured and unstructured data.
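The replication idea behind HDFS can be sketched in a few lines. This is a toy single-process illustration, not the real HDFS placement policy (which uses 128 MB blocks, a default replication factor of 3, and rack-aware placement); the node names, block size, and round-robin assignment here are illustrative assumptions:

```python
# Toy sketch of HDFS-style block storage: split a file into fixed-size
# blocks and replicate each block across several distinct nodes, so that
# losing any single node loses no data. Sizes and names are illustrative.
from itertools import cycle

BLOCK_SIZE = 4          # bytes per block (real HDFS default: 128 MB)
REPLICATION = 3         # copies of each block (HDFS default: 3)
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Chunk the raw bytes into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    node_cycle = cycle(range(len(nodes)))
    for idx, _ in enumerate(blocks):
        start = next(node_cycle)
        placement[idx] = [nodes[(start + r) % len(nodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data world")
placement = place_blocks(blocks, NODES)
# Every block now lives on 3 distinct nodes, so any one node can fail
# without data loss -- the property that lets HDFS run on commodity hardware.
```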
Map Reduce
MapReduce is a programming model for processing large-scale datasets
in computer clusters. The MapReduce programming model consists of two
functions, map() and reduce().
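The two functions can be illustrated with the canonical MapReduce example, word count. This is a single-process sketch of the programming model, not a distributed Hadoop job:

```python
# Minimal single-process sketch of the MapReduce programming model,
# using word count as the canonical example.
from collections import defaultdict

def map_fn(line: str):
    """map(): emit intermediate (key, value) pairs -- here, (word, 1)."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """reduce(): combine all values for one key -- here, sum the counts."""
    return (key, sum(values))

def mapreduce(lines):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = mapreduce(["big data big ideas", "data at scale"])
# counts == {"big": 2, "data": 2, "ideas": 1, "at": 1, "scale": 1}
```

In a real Hadoop job the map calls run in parallel across cluster nodes and the shuffle moves intermediate pairs over the network, but the map()/reduce() contract is the same.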
Hadoop Ecosystem
Issues in Big Data
1. Storage and Transport Issues
The quantity of data has exploded each time we have invented a new storage
medium. What is different about the most recent explosion – due largely to social
media – is that there has been no new storage medium. Moreover, data is being
created by everyone and everything (e.g., Facebook, Twitter, WhatsApp).
2. Management Issues
Management will, perhaps, be the most difficult problem to address with big data.
This problem first surfaced a decade ago in the UK eScience initiatives where data
was distributed geographically and “owned” and “managed” by multiple entities.
3. Processing Issues
Assume that an exabyte of data needs to be processed in its entirety. For
simplicity, assume the data is chunked into blocks of 8 words, so 1 exabyte = 1K
petabytes. Assuming a processor expends 100 instructions on one block at 5
gigahertz, the time required for end-to-end processing would be 20 nanoseconds. To
process 1K petabytes would require a total end-to-end processing time of roughly
635 years.
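The per-block figure is easy to verify with simple arithmetic (a sketch only; the 635-year total is quoted from the Kaisler et al. paper and depends on its own serial-processing assumptions, such as the word size):

```python
# Back-of-envelope check of the per-block processing time:
# 100 instructions per block on a 5 GHz processor.
instructions_per_block = 100
clock_hz = 5e9                        # 5 gigahertz = 5e9 cycles/second
seconds_per_block = instructions_per_block / clock_hz
nanoseconds_per_block = seconds_per_block * 1e9   # -> 20 ns per block

# Number of 8-word blocks in one exabyte, assuming 8-byte words
# (so 64 bytes per block) -- on the order of 10^16 blocks:
blocks_in_exabyte = 1e18 / 64
# Whatever the exact word-size assumption, serial processing takes years
# to centuries -- which is precisely what motivates massively parallel
# frameworks such as MapReduce.
```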
Big Data Opportunities
Manageability - when data can grow in a single file system namespace, the
manageability of the system increases significantly: a single data
administrator can now manage a petabyte or more of storage, versus 50 or
100 terabytes on a scale-up system.
Elimination of stovepipes - since these systems scale linearly and do not
have the bottlenecks that scale-up systems create, all data is kept in a
single file system in a single grid, eliminating the stovepipes introduced
by the multiple arrays and file systems otherwise required.
Just-in-time scalability - as storage needs grow, an appropriate number of
nodes can be added at the time they are needed.
Increased utilization rates - since the data servers in these scale-out
systems can address the entire pool of storage, there is no stranded
capacity.
Big Data Challenges
• Heterogeneity and Incompleteness
The difficulties of big data analysis derive from its large scale as well as
the presence of mixed data based on different patterns or rules
(heterogeneous mixture data) in the collected and stored data. In the case of
complicated heterogeneous mixture data, the data has several patterns and
rules and the properties of the patterns vary greatly. Data can be both
structured and unstructured; nowadays, around 80% of the data generated by
organizations is unstructured.
• Scale and complexity
Managing large and rapidly increasing volumes of data is a challenging
issue. Traditional software tools are not enough for managing the increasing
volumes of data. Data analysis, organization, retrieval and modeling are
also challenges due to scalability and complexity of data that needs to be
analyzed.
Continued
• Timeliness
As the size of the data sets to be processed increases, it will take more
time to analyze. In some situations, results of the analysis are required
immediately. For example, if a fraudulent credit card transaction is
suspected, it should ideally be flagged before the transaction is completed
by preventing the transaction from taking place at all. Any delay in stock
exchange processing can similarly cause huge losses.
• Data Ownership
Data ownership presents a critical and ongoing challenge, particularly in
the social media arena. While petabytes of social media data reside on the
servers of Facebook, MySpace, and Twitter, the data is not really owned by
them (although they may argue that it is, because of residency). Certainly, the
“owners” of the pages or accounts believe they own the data. This
dichotomy will have to be resolved in court.
Continued
Other challenges include:
 Availability
 Inconsistency
 Performance
 Privacy
 Security
 Infrastructure faults
 Extreme Data Distribution
 Dynamic Design Challenges
Tackling Big Data Challenges
Hadoop
Hadoop and HDFS by Apache are widely used for storing and managing Big Data.
Analyzing Big Data is a challenging task as it involves large distributed file systems
which should be fault tolerant, flexible and scalable.
Spark
Spark’s ability to handle advanced data processing tasks, such as real-time
stream processing and machine learning, is well ahead of Hadoop’s.
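The stream-processing idea can be illustrated with a toy sliding-window aggregate over an unbounded event stream. This is a pure-Python, single-process sketch of the concept, not the actual Spark Streaming API (which runs the same kind of windowed computation over distributed micro-batches):

```python
# Toy sliding-window stream aggregation, illustrating the kind of
# real-time computation Spark Streaming performs at cluster scale.
from collections import deque

class SlidingWindowSum:
    """Maintain the sum of the last `window` events as they arrive."""
    def __init__(self, window: int):
        self.window = window
        self.events = deque()
        self.total = 0.0

    def push(self, value: float) -> float:
        """Ingest one event and return the current window sum."""
        self.events.append(value)
        self.total += value
        if len(self.events) > self.window:
            self.total -= self.events.popleft()  # evict the oldest event
        return self.total

w = SlidingWindowSum(window=3)
results = [w.push(v) for v in [1, 2, 3, 4, 5]]
# results == [1, 3, 6, 9, 12]: each output reflects only the 3 most
# recent events, computed incrementally as the stream arrives.
```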
NoSQL Databases
 they do not rely on the relational model and do not use the SQL language;
 they tend to run on cluster architectures;
 they do not have a fixed schema, allowing data of any shape to be stored
in a record.
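The schemaless point can be shown with a toy in-memory document store (an illustrative sketch only; real NoSQL systems such as MongoDB or Cassandra add persistence, indexing, and cluster distribution on top of this idea):

```python
# Toy document store illustrating the "no fixed schema" property of
# NoSQL databases: every record is a free-form dict, and records in the
# same collection need not share any fields.
collection = []

def insert(record: dict):
    """Store a record of any shape -- no schema is enforced."""
    collection.append(record)

def find(**criteria):
    """Return records matching all given field/value pairs."""
    return [r for r in collection
            if all(r.get(k) == v for k, v in criteria.items())]

# Records with completely different shapes coexist in one collection:
insert({"type": "tweet", "user": "alice", "text": "big data!"})
insert({"type": "sensor", "id": 42, "readings": [1.5, 1.7, 1.6]})
insert({"type": "tweet", "user": "bob", "text": "hello", "geo": (26.3, 73.0)})

tweets = find(type="tweet")   # matches 2 of the 3 records
```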
Presto
 An efficient Big Data system developed by data engineers at the social
networking site Facebook.
 An open source distributed SQL query engine for running interactive analytical
queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto Architecture
Conclusion
The concept of Big Data and the HDFS component was introduced.
The issues regarding the storage, processing and management of Big Data
were discussed.
The opportunities and challenges facing Big Data from data storage
and analytics perspectives were also discussed.
The measures that can be taken to overcome those challenges were also
discussed.
References
[1] S. Kaisler, F. Armour and J. A. Espinosa, "Big Data: Issues and Challenges Moving Forward," in 46th
Hawaii International Conference on System Sciences, Hawaii, 2013.
[2] www.studymafia.org. [Online].
[3] D. J. S. Kiran, M. Sravanthi, K. Preethi and M. Anusha, "Recent Issues and Challenges on Big Data in
Cloud Computing," International Journal of Computer Science And Technology (IJCST), vol. 6, no. 2,
April-June 2015.
[4] American Institute of Physics (AIP), College Park, MD, 2010. [Online]. Available:
http://www.aip.org/fyi/2010/.
[5] D. Borthakur, The Hadoop Distributed File System Architecture and Design, 2007.
[6] A. Johnson, H. P.H, V. Paul and M. S. P.N, "Big Data Processing Using Hadoop MapReduce,"
International Journal of Computer Science and Information Technologies (IJCSIT), vol. 6, no. 1, pp.
127-132, 2015.
[7] "Hadoop PoweredBy." [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy.
[8] C. Ordonez, Algorithms and Optimizations for Big Data Analytics, University of Houston, USA.:
Cubes, Tech Talks.
