Overview of Big Data & Apache Hadoop
Presented by: Sunny
Objectives
• What is Big Data
• Big Data Challenges
• Sources of Big Data challenges
• Categories of 'Big Data'
• Characteristics of 'Big Data'
• Live Example
• Introduction to Hadoop
Big Data - Definition and Concepts
• Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools
• Traditionally, “Big Data” = massive volumes of data
– E.g., volume of data at CERN, NASA, Google, …
• Where does Big Data come from?
– Everywhere! Web logs, GPS systems, sensor networks,
social networks, Internet-based text documents, Internet
search indexes, call detail records, astronomy, atmospheric
science, biology, nuclear physics, biochemical experiments,
medical records, scientific research, military surveillance,
multimedia archives, …
Data Explosion
2.5 billion gigabytes of data were generated every day in 2012.
40,000 search queries are made every second.
300 hours of video are uploaded every minute.
31.25 million messages are sent and 2.77 million videos are viewed.
By 2020, all data will pass through the cloud.
Technology Insights 6.1
The Data Size Is Getting Big, Bigger, …
• Hadron Collider - 1 PB/sec
• Boeing jet - 20 TB/hr
• Facebook - 500 TB/day
• YouTube - 1 TB/4 min
• The proposed Square Kilometer Array telescope (the world's
proposed biggest telescope) - 1 EB/day
Characteristics Of 'Big Data'
Big Data Challenges
The challenges include:
• Capture
• Storage
• Search
• Sharing
• Transfer
• Analysis
• Visualization
Categories Of 'Big Data'
Big data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and
processed in a fixed format is termed
'structured' data.
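As a minimal sketch of what a fixed format looks like in practice, the example below stores rows in a relational table using Python's built-in sqlite3 module. The table name `employee`, its columns, and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the format: every row has exactly these typed columns.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (id, name, salary) VALUES (?, ?, ?)",
    [(1, "Asha", 52000.0), (2, "Ravi", 61000.0)],
)

# Because the format is known in advance, the data can be queried directly.
rows = conn.execute(
    "SELECT name FROM employee WHERE salary > ? ORDER BY id", (55000,)
).fetchall()
print(rows)
conn.close()
```

Because every row follows the same schema, the query needs no per-record checks for missing or unexpected fields.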
Semi-structured Data
• In semi-structured data, entities belonging to the
same class may have different attributes even though
they are grouped together.
• Example:
A Word document is generally considered to be unstructured
data. However, you can add metadata tags in the form of
keywords and other metadata that represent the document
content and make it easier for that document to be found when
people search for those terms; the data is then semi-structured.
Nevertheless, the document still lacks the complex
organization of a database, so it falls short of being fully
structured data.
Semi-structured
• Semi-structured data can contain both
forms of data: structured and unstructured.
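A minimal sketch of semi-structured data, assuming JSON as the carrier format: the two records below belong to the same class (documents) yet carry different attributes. The field names (`title`, `keywords`, `author`) and the helper `find_by_keyword` are hypothetical and chosen only for illustration.

```python
import json

# Two records of the same class; each has its own set of attributes.
raw = """
[
  {"title": "Q1 report", "keywords": ["finance", "quarterly"]},
  {"title": "Meeting notes", "author": "A. Writer"}
]
"""

documents = json.loads(raw)


def find_by_keyword(docs, keyword):
    """Return the titles of documents tagged with the given keyword."""
    # .get() tolerates records that have no "keywords" attribute at all.
    return [d["title"] for d in docs if keyword in d.get("keywords", [])]


print(find_by_keyword(documents, "finance"))
```

The metadata tags make the first document searchable, while the second remains valid even though it has no tags; no fixed schema is enforced.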
Unstructured
• Any data whose form or structure is unknown
is classified as unstructured data.
• Examples: Word documents, PDFs, text, media logs.
Live Example
A bank manager is assigned the task of finding the best
location to set up an ATM.
Distributed System
• A model in which components located on
networked computers communicate with one another.
• A distributed system uses multiple
machines for a single job.
How does a distributed system work?
1 machine
Data = 1 terabyte
Processing time: 45 min.
100 machines
Data = 1 terabyte
Processing time: 47 sec.
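The arithmetic behind these two figures can be checked with a short sketch. The numbers come from the slide; the variable names and the notion of "efficiency" (observed speedup divided by machine count) are additions for illustration.

```python
# Figures taken from the slide: 1 TB processed on 1 machine vs. 100 machines.
single_machine_seconds = 45 * 60   # 45 minutes
cluster_seconds = 47               # 47 seconds
machines = 100

# Ideal (perfectly linear) time if the work split with zero overhead.
ideal_seconds = single_machine_seconds / machines

# Observed speedup and parallel efficiency.
speedup = single_machine_seconds / cluster_seconds
efficiency = speedup / machines

print(f"ideal time: {ideal_seconds:.0f} s")
print(f"speedup:    {speedup:.1f}x")
print(f"efficiency: {efficiency:.0%}")
```

With perfect scaling the job would take 27 seconds; the quoted 47 seconds corresponds to a speedup of roughly 57x rather than 100x, so adding machines helps a great deal but not linearly.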
Challenges of Distributed Systems
1. Multiple computers are used.
2. High chance of system failure.
3. Limited bandwidth.
4. Complex programming.
The solution to all of these is Hadoop.
Big Data Technologies
• MapReduce …
• Hadoop …
• Hive
• Pig
• HBase
• Flume
• Oozie
• Ambari
• Avro
• Mahout, Sqoop, HCatalog, …
Hadoop
• Doug Cutting is the creator of Hadoop.
• The name Hadoop has no meaning; it is a
made-up name coined by his kid.
Hadoop
• Hadoop is a framework that allows for
distributed processing of large data sets across
clusters of commodity computers using simple
programming models.
• Hadoop is an open-source, Java-based
programming framework that supports the
processing and storage of extremely large data
sets in a distributed computing environment.
Why Hadoop
1. Runs a number of applications involving
petabytes of data.
2. Has a distributed file system, called HDFS,
which enables fast data transfer among
nodes or servers.
Big Data Technologies Hadoop
• Hadoop Technical Components
– Hadoop Distributed File System (HDFS)
– NameNode (primary facilitator)
– Secondary NameNode (periodically checkpoints the NameNode's
metadata; despite the name, it is not a hot standby)
– JobTracker
– Slave Nodes (the grunts of any Hadoop cluster)
– Additionally, the Hadoop ecosystem is made up of a
number of complementary sub-projects: NoSQL
(Cassandra, HBase), DW (Hive), …
• NoSQL = not only SQL
Hadoop Characteristics
1. Economical - ordinary computers can be used for data
processing.
2. Reliable - stores copies of the data on different machines
and is resistant to hardware failure.
3. Scalable - a Hadoop cluster can be extended simply by adding
nodes to the cluster.
4. Flexible - can store a lot of data now and decide how to use it later.
Hadoop Core Components
Big Data Technologies
MapReduce
How does MapReduce work?
[Figure: Raw Data, Map Function, Reduce Function]
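The flow from raw data through a map function to a reduce function can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model, not Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are assumptions made for the sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


raw_data = ["big data big cluster", "data node"]

# Each line could be mapped on a different machine; here it runs in a loop.
mapped = [pair for line in raw_data for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both phases can be spread across many machines, which is what a framework such as Hadoop automates.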
Top 10 Big Data Vendors with Primary Focus on Hadoop
[Figure: bar chart, axis scale $0 to $70]
Stream Analytics Applications
• e-Commerce
• Telecommunication
• Law Enforcement and Cyber Security
• Power Industry
• Financial Services
• Health Services
• Government
Thank You
