Big Data and Hadoop
Presented By: Maulik Lakhani
Session 1 – Introduction to Big Data and Hadoop
 About Me
 Outline
Introduction to Big Data
Traditional Data Processing
Introduction to Hadoop
Hadoop Architecture
Hadoop and RDBMS
Hadoop Distributions
 What is Big Data?
Collection of large datasets that cannot be processed using traditional
computing techniques.
Not a single technique or tool, but a complete subject that spans various tools,
techniques and frameworks.
It’s not the amount of data that’s important. It’s what organisations do with
the data that matters.
Big data can be analysed for insights that lead to better decisions and
strategic business moves.
 Traditional Data Processing
For storage, programmers rely on a database from a vendor of their choice, such as
Oracle or IBM.
An enterprise has a single computer to store and process its data.
The user interacts with the application, which in turn handles data storage and
analysis.
 Characteristics - Vs of Big Data
• Volume: Terabytes of data, transactions, files
• Velocity: Periodic, near-time, real-time
• Variety: Structured, unstructured, semi-structured
• Veracity: Quality of the data
 Characteristics - Vs of Big Data
Volume - Size of data
• Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of 300 times from 2005.
Velocity
• There are 1.03 billion Daily Active Users (Facebook) on Mobile as of now, which is
an increase of 22% year-over-year.
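The Volume figures on this slide can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where 1 zettabyte equals 1,000 exabytes, and derives the 2005 baseline implied by the quoted 300-fold increase. The variable and function names are illustrative only.

```python
# Assumption: decimal (SI) units, so 1 zettabyte = 1,000 exabytes.
EXABYTES_PER_ZETTABYTE = 1_000


def zettabytes_to_exabytes(zb: float) -> float:
    """Convert zettabytes to exabytes."""
    return zb * EXABYTES_PER_ZETTABYTE


projected_2020_zb = 40   # figure quoted on the slide
growth_factor = 300      # "an increase of 300 times from 2005", also from the slide

projected_2020_eb = zettabytes_to_exabytes(projected_2020_zb)
implied_2005_eb = projected_2020_eb / growth_factor

print(f"2020 projection: {projected_2020_eb:,.0f} EB")
print(f"Implied 2005 volume: {implied_2005_eb:,.1f} EB")
```

Under these assumptions the projection works out to 40,000 EB, matching the slide, and the implied 2005 volume is roughly 133 EB.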
 Characteristics - Vs of Big Data
Variety
• Data can be structured, semi-structured or unstructured, in the form of images,
audio, video and sensor data.
• Variety of data creates problems in capturing, storing, mining and analyzing the
data.
Veracity – data uncertainty, data inconsistency and incompleteness
• Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they
use to make decisions.
 Characteristics - Vs of Big Data
Veracity
• Poor data quality costs the US economy around $3.1 trillion a year.
Value
• Data should add to the benefits of the organization. Is the organization working on
Big Data achieving a high ROI?
 Ms of Big Data
Make Me More Money
 What Comes Under Big Data?
• Black Box Data: Voices of the flight crew, performance information of the aircraft.
• Social Media Data: Posts and views by millions of people across the globe.
• Stock Exchange Data: Information about the ‘buy’ and ‘sell’ decisions made on the
stock exchange.
• Power Grid Data: Electricity consumed by a particular node with respect to a base
station.
• Transport Data: Shipping / freight data.
 Examples of Big Data
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
• More than 230 million tweets are created every day.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon handles 15 million pieces of customer clickstream data per day to recommend
products.
• 294 billion emails are sent every day. Email services analyze this data to detect spam.
• Modern cars have close to 100 sensors.
 Applications of Big Data
• Smarter Healthcare (EHR): Predict the patient’s deteriorating condition in advance.
• Telecom: Reduce data packet loss. Offer personalized plans.
• Retail: Recommendation engines - Suggestion based on the browsing history of the
consumer.
• Traffic control: Managing traffic better via effective use of data and sensors.
• Manufacturing: Reduce component defects, improve product quality, increase
efficiency, and save time and money.
• Search Quality: Personalised search results based on previous searches.
AlphaGo is a narrow AI developed by Alphabet Inc. to play the board game Go.
AlphaGo uses a tree search algorithm to find its moves based on previously learned
knowledge.
It acquires that knowledge through machine learning, specifically an artificial neural
network trained extensively on both human and computer play.
In March 2016, for the first time, it beat the professional Go player Lee Sedol in a
five-game match.
Google Photos
Google implements different forms of machine learning into the Photos service,
particularly in recognition of photo contents.
People: The Photos app collects all the photos containing faces. It doesn’t identify these
people, but just groups them for quick access.
Places: This feature relies on landmarks. It can correctly identify well-known places like
the Taj Mahal.
Things: This feature aggregates pictures of things like flowers, cars, sky, birthdays and
cats. There are many more categories, including screenshots, posters and castles.
Not everything in the garden is rosy!
 Challenges with Big Data
• Data Quality: Messy, inconsistent and incomplete data. Dirty data costs companies in
the United States $600 billion each year.
• Discovery: Analyzing petabytes of data using powerful algorithms to find patterns and
insights is very difficult.
• Storage: The more data an organization has, the more complex the problem of
managing it.
• Lack of Talent: There is a shortage of developers, data scientists and analysts who
also have sufficient domain knowledge.
 Summary of Big Data
Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
Examples of Big Data generation include stock exchanges, social media sites, jet
engines, etc.
Big Data can be 1) Structured, 2) Unstructured, 3) Semi-structured
Volume, Variety, Velocity, and Variability are a few characteristics of Big Data
 Introduction to Hadoop
Hadoop is an open source framework from Apache.
Used to store, process and analyze data that is very large in
volume.
Hadoop is written in Java and is not an OLAP (online analytical
processing) system.
Used for batch/offline processing.
 Modules of Hadoop
Hadoop Distributed File System (HDFS): Files are broken into blocks
and stored on nodes across the distributed architecture.
Yet Another Resource Negotiator (YARN): Used for job scheduling and cluster
management.
MapReduce: A framework that helps programs perform parallel computation
on data using key-value pairs.
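To make the idea of breaking files into blocks concrete, here is a minimal sketch in plain Python. It assumes a block size of 128 MB, which is the default in Hadoop 2.x and later but is configurable per cluster; the function name `split_into_blocks` is hypothetical and is not part of any Hadoop API.

```python
from typing import List

# Assumption: 128 MB blocks (default in Hadoop 2.x and later; configurable).
BLOCK_SIZE_MB = 128


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block a file of the given size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


blocks = split_into_blocks(300)
print(blocks)  # [128, 128, 44]
```

A 300 MB file therefore becomes three blocks, and each block can be stored on a different node.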
 Modules of Hadoop
MapReduce: The Map task takes input data and converts it into a data set that
can be computed as key-value pairs. The output of the Map task is consumed by the
Reduce task, and the output of the reducer gives the desired result.
Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
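The Map and Reduce steps described above can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop's Java API: the function names `map_task`, `shuffle` and `reduce_task` are illustrative, and the grouping step that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_task(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one line of input into (word, 1) key-value pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as happens between the Map and Reduce phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_task(line))
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


counts = word_count(["big data and hadoop", "hadoop stores big data"])
print(counts)
```

In a real cluster the map tasks would run in parallel on different nodes, each over its own block of the input; here they run one after another to keep the sketch self-contained.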
 Hadoop Architecture
NameNode
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is crucial, as there is only one.
• Keeps a transaction log for file deletes, adds, etc. Transactions cover only
metadata, not whole blocks or file streams.
• Handles creation of more replica blocks when necessary after a DataNode
failure.
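The last point, creating more replicas after a DataNode failure, can be sketched as a bookkeeping problem over the metadata. The example below is a simplified illustration, not NameNode code: the block IDs, node names and the function `under_replicated` are hypothetical, and a target of 3 replicas is assumed (the usual HDFS default).

```python
from typing import Dict, Set

# Assumption: a target of 3 replicas per block (the usual HDFS default).
REPLICATION_FACTOR = 3


def under_replicated(
    block_locations: Dict[str, Set[str]],
    live_nodes: Set[str],
    target: int = REPLICATION_FACTOR,
) -> Dict[str, int]:
    """Map each under-replicated block ID to the number of extra replicas it needs."""
    missing: Dict[str, int] = {}
    for block_id, nodes in block_locations.items():
        live_replicas = len(nodes & live_nodes)  # replicas on nodes that are still up
        if live_replicas < target:
            missing[block_id] = target - live_replicas
    return missing


# Hypothetical metadata: which DataNodes hold which block.
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn2", "dn3", "dn4"}  # dn1 has failed

print(under_replicated(block_locations, live))  # {'blk_1': 1}
```

Only `blk_1` had a copy on the failed node, so it is the only block that needs one more replica.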
 Hadoop Architecture
DataNode
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
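The placement rule in the last bullet (two replicas in the local rack, one elsewhere) can be sketched as follows. This follows the policy exactly as stated on the slide; actual Hadoop releases may place replicas differently, and a production policy would consider more factors than this sketch does. The rack and node names and the function `place_replicas` are hypothetical.

```python
from typing import Dict, List


def place_replicas(racks: Dict[str, List[str]], local_rack: str) -> List[str]:
    """Pick three DataNodes: two in the local rack and one in another rack."""
    local_nodes = racks[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("local rack needs at least two DataNodes")
    remote_racks = [
        name for name in sorted(racks) if name != local_rack and racks[name]
    ]
    if not remote_racks:
        raise ValueError("need at least one other non-empty rack")
    # Deterministic choice (first nodes, first remote rack) keeps the sketch
    # easy to follow; this is a simplification, not how a real cluster chooses.
    return local_nodes[:2] + [racks[remote_racks[0]][0]]


racks = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(place_replicas(racks, "rack1"))  # ['dn1', 'dn2', 'dn4']
```

Keeping one copy outside the local rack means the block survives even if the whole rack becomes unavailable.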
 Hadoop Architecture
Job Tracker
The Job Tracker accepts MapReduce jobs from clients and uses the NameNode to locate
the data to process. In response, the NameNode provides metadata to the Job
Tracker.
Task Tracker
It works as a slave node for the Job Tracker. It receives a task and code from the Job
Tracker and applies that code to the file. This process can also be called a Mapper.
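The interaction between the Job Tracker and the Task Trackers can be sketched as a small scheduling loop. This is a simplified illustration under two assumptions that the slide does not spell out: the metadata returned by the NameNode is taken to be the set of nodes holding each block, and the scheduler is assumed to prefer a Task Tracker on a node that already holds the block. All names, including `assign_map_tasks`, are hypothetical.

```python
from typing import Dict, List, Set


def assign_map_tasks(
    block_locations: Dict[str, Set[str]], trackers: List[str]
) -> Dict[str, str]:
    """Assign one map task per block, preferring a tracker that holds the block."""
    load = {tracker: 0 for tracker in trackers}
    assignment: Dict[str, str] = {}
    for block_id in sorted(block_locations):
        local = [t for t in trackers if t in block_locations[block_id]]
        candidates = local or trackers  # fall back to any tracker if none is local
        chosen = min(candidates, key=lambda t: load[t])  # least-loaded candidate
        assignment[block_id] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical metadata: block ID -> nodes that hold a replica.
metadata = {
    "blk_1": {"node1", "node2"},
    "blk_2": {"node2", "node3"},
    "blk_3": {"node1"},
}
plan = assign_map_tasks(metadata, ["node1", "node2", "node3"])
print(plan)  # {'blk_1': 'node1', 'blk_2': 'node2', 'blk_3': 'node1'}
```

Every task in this plan runs on a node that already stores its block, so the code moves to the data rather than the data to the code.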
 Hadoop vs RDBMS
Until recently many applications utilized RDBMS for batch processing – Oracle,
Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn’t fully replace relational products; many architectures would
benefit from both Hadoop and relational products.
 Hadoop vs RDBMS
RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
Hadoop clusters can scale-out to 100s of machines and to petabytes of storage.
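A rough capacity estimate shows why scale-out reaches petabytes with hundreds of machines. The numbers below are purely illustrative assumptions, not vendor figures: 1 PB of data (taken as 1,000 TB), 3 replicas of every block, and 24 TB of usable disk per node. The function `nodes_needed` is hypothetical.

```python
import math


def nodes_needed(data_tb: float, disk_per_node_tb: float, replication: int = 3) -> int:
    """Estimate how many nodes are needed to hold the data plus its replicas."""
    raw_tb = data_tb * replication  # every block is stored `replication` times
    return math.ceil(raw_tb / disk_per_node_tb)


# Assumed, illustrative numbers: 1 PB of data, 24 TB of usable disk per node.
print(nodes_needed(data_tb=1000, disk_per_node_tb=24))  # 125
```

Under these assumptions a petabyte fits on about 125 commodity nodes, and more capacity is added by adding nodes rather than by buying a larger machine.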
 Comparison to RDBMS
Hadoop was not designed for real-time or low latency queries
Products that do provide low latency queries such as HBase have limited query
functionality
Hadoop performs best for offline batch processing on large amounts of data
RDBMS is best for online transactions and low-latency queries
Hadoop is designed to stream large files and large amounts of data
RDBMS works best with small records
 Hadoop Distributions
Let’s say you go download Hadoop’s HDFS and MapReduce.
At first it works great, but then you decide to start using HBase.
No problem, just download HBase and point it to your existing HDFS.
But you find that HBase can only work with a previous version of HDFS.
You go downgrade HDFS and everything still works great.
 Hadoop Distributions
Hadoop Distributions aim to resolve version incompatibilities.
A distribution vendor will do the following:
1. Integration-test a set of Hadoop products.
2. Optionally provide additional scripts to execute Hadoop.
3. Package Hadoop products in various installation formats, such as Linux
packages and tarballs.
 Distribution Vendors
Cloudera Hadoop Distribution
MapR Hadoop Distribution
Amazon Web Services Elastic MapReduce Hadoop Distribution
Hortonworks Hadoop Distribution
IBM InfoSphere BigInsights Hadoop Distribution
Microsoft Azure's HDInsight Cloud based Hadoop Distribution
 Cloudera Distribution for Hadoop
Most popular distribution
Cloudera has taken the lead in providing a Hadoop distribution
CDH is provided in various formats
Linux packages, virtual machine images, and tarballs
Amazon Web Services Elastic MapReduce Hadoop Distribution.
Integrates HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig,
Sqoop, Whirr, ZooKeeper, Flume
 Q & A
Big Data and Hadoop

More Related Content

What's hot

Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
Yukti Kaura
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
SahilRaina21
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
RojaT4
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
Cloudera, Inc.
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
RojaT4
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
TejashBansal2
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
Sonal Tiwari
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
Abbas Maazallahi
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
Muhammad Rifqi
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
Maryan Faryna
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
RojaT4
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 

What's hot (20)

Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 

Similar to Big Data and Hadoop

Big Data
Big DataBig Data
Big Data
Kirubaburi R
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
6535ANURAGANURAG
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
Abhishek Roy
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
Dr.K.Sreenivas Rao
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
Umair Shafique
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
Vishwajeet Jadeja
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
Sysfore Technologies
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
Bhavya Gulati
 
Big Data
Big DataBig Data
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
Kunal Khanna
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
Shivanee garg
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
almaraniabwmalk
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
Sarvesh Meena
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
Dux Chandegra
 
Big Data
Big DataBig Data
Big Data
Neha Mehta
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Sri Kanth
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
Md. Hasan Basri (Angel)
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
nandhiniarumugam619
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
nallagangus
 

Similar to Big Data and Hadoop (20)

Big Data
Big DataBig Data
Big Data
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Big Data
Big DataBig Data
Big Data
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
 
Big Data
Big DataBig Data
Big Data
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 

Big Data and Hadoop

  • 1. Big Data and Hadoop Presented By: Maulik Lakhani Session 1 – Introduction to Big Data and Hadoop
  • 3.  Outline Introduction to Big Data Traditional Data Processing Introduction to Hadoop HadoopArchitecture Hadoop and RDBMS Hadoop Distributions
  • 4.  What is Big Data? Collection of large datasets that cannot be processed using traditional computing techniques. Not a single technique or a tool, rather a complete subject. Various tools, techniques and frameworks. It’s not the amount of data that’s important. It’s what organisations do with the data that matters. Big data can be analysed for insights that lead to better decisions and strategic business moves.
  • 5.  Traditional Data Processing For storage purpose, the programmers will take the help of their choice of database vendors such as Oracle, IBM and others. An enterprise will have a computer to store and process big data. The user interacts with the application, which in turn handles the part of data storage and analysis.
  • 6.  Characteristics - Vs of Big Data Volume: terabytes of data, transactions, files. Velocity: periodic, near-time, real-time. Variety: structured, unstructured, semi-structured. Veracity: quality of the data.
  • 7.  Characteristics - Vs of Big Data Volume - Size of data • Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of 300 times from 2005. Velocity • There are 1.03 billion Daily Active Users (Facebook) on Mobile as of now, which is an increase of 22% year-over-year.
  • 8.  Characteristics - Vs of Big Data Variety • Data can be structured, semi-structured or unstructured, in the form of images, audio, video and sensor data. • Variety of data creates problems in capturing, storing, mining and analyzing the data. Veracity: data uncertainty, inconsistency and incompleteness • Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they use to make decisions.
  • 9.  Characteristics - Vs of Big Data Veracity • Poor data quality costs the US economy around $3.1 trillion a year. Value • Big Data should add to the benefits of the organization. Is the organization working on Big Data achieving a high ROI?
  • 10.  Ms of Big Data: Make Me More Money
  • 11.  What Comes Under Big Data? • Black Box Data: Voices of the flight crew, performance information of the aircraft. • Social Media Data: Posts and views by millions of people across the globe. • Stock Exchange Data: information about the ‘buy’ and ‘sell’ decisions made on the stock exchange. • Power Grid Data: Electricity consumed by a particular node with respect to a base station. • Transport Data: Shipping / freight data.
  • 12.  Examples of Big Data • Walmart handles more than 1 million customer transactions every hour. • Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data. • 230+ million tweets are created every day. • YouTube users upload 48 hours of new video every minute of the day. • Amazon handles 15 million customer clickstream records per day to recommend products. • 294 billion emails are sent every day. Email services analyze this data to detect spam. • Modern cars have close to 100 sensors.
  • 13.  Applications of Big Data • Smarter Healthcare (EHR): Predict the patient’s deteriorating condition in advance. • Telecom: Reduce data packet loss. Offer personalized plans. • Retail: Recommendation engines - Suggestion based on the browsing history of the consumer. • Traffic control: managing traffic better via effective use of data and sensors. • Manufacturing: reduce component defects, improve product quality, increase efficiency, and save time and money. • Search Quality: Personalised search results based on previous searches.
  • 14. AlphaGo is a narrow AI developed by DeepMind, a subsidiary of Alphabet Inc., to play the board game Go. AlphaGo uses a tree search algorithm to find its moves based on previously learned knowledge. It gains that knowledge through machine learning, specifically an artificial neural network trained extensively on both human and computer play. In March 2016, it beat Lee Sedol, a 9-dan professional Go player, in a five-game match, the first time a computer Go program had beaten a 9-dan professional without a handicap.
  • 15. Google Photos: Google implements different forms of machine learning in the Photos service, particularly for recognizing photo contents. People: The Photos app collects all the photos containing faces. It doesn’t identify these people, but just collects them for quick access. Places: This relies on landmarks. It can correctly identify well-known places like the Taj Mahal. Things: This feature aggregates pictures of things like flowers, cars, sky, birthdays and cats. There are many more categories, including screenshots, posters and castles. Not everything in the garden is rosy!
  • 16.  Challenges with Big Data • Data Quality: messy, inconsistent and incomplete data. Dirty data costs companies in the United States $600 billion each year. • Discovery: Analyzing petabytes of data using powerful algorithms to find patterns and insights is very difficult. • Storage: The more data an organization has, the more complex the problem of managing it. • Lack of Talent: A shortage of developers, data scientists and analysts who also have sufficient domain knowledge.
  • 17.  Summary of Big Data Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. Examples of Big Data generation include stock exchanges, social media sites, jet engines, etc. Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured. Volume, Variety, Velocity, and Variability are a few characteristics of Big Data.
  • 18.  Introduction to Hadoop Hadoop is an open source framework from Apache. It is used to store, process and analyze data that is very large in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing.
  • 20.  Modules of Hadoop Hadoop Distributed File System (HDFS): Files are broken into blocks and stored on nodes across the distributed architecture. Yet Another Resource Negotiator (YARN): Used for job scheduling and cluster management. MapReduce: A framework that helps programs perform parallel computation on data using key-value pairs.
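The block-splitting idea behind HDFS can be sketched in a few lines. This is an illustration only, not HDFS code: the 128 MB block size is an assumed value (the real size is configurable), and `split_into_blocks` is a hypothetical helper name.

```python
# Assumed block size for illustration; the real HDFS block size is configurable.
BLOCK_SIZE = 128 * 1024 * 1024  # bytes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 300 MB file needs two full blocks plus one partial block.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Each (offset, length) pair could then be stored on a different node, which is what lets a single large file spread across the cluster.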
  • 21.  Modules of Hadoop MapReduce: The Map task takes input data and converts it into a data set of key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
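The map and reduce steps can be illustrated with a word count. This is a single-process sketch in plain Python, not the Hadoop MapReduce API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data and hadoop", "hadoop stores big data"])
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network; the sketch only shows the data flow.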
  • 26.  Hadoop Architecture NameNode • Stores metadata for the files, like the directory structure of a typical FS. • The server holding the NameNode instance is quite crucial, as there is only one. • Keeps a transaction log for file deletes, adds, etc. Transactions cover only metadata, not whole blocks or file streams. • Handles creation of more replica blocks when necessary after a DataNode failure.
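The re-replication duty of the NameNode can be sketched as a lookup over its block metadata. This is a simplified model, not NameNode code: the replication target of 3, the block and node names, and the helper `under_replicated` are all assumptions made for illustration.

```python
REPLICATION_FACTOR = 3  # assumed target number of replicas per block


def under_replicated(block_locations, live_nodes, target=REPLICATION_FACTOR):
    """Return {block_id: missing_replicas} for blocks below the target."""
    result = {}
    for block_id, nodes in block_locations.items():
        # Only replicas on nodes that are still alive count.
        live = [node for node in nodes if node in live_nodes]
        if len(live) < target:
            result[block_id] = target - len(live)
    return result


# Hypothetical metadata: which DataNodes hold which block.
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn2", "dn3", "dn4"}  # dn1 has failed

needs_copy = under_replicated(block_locations, live_nodes)
```

Here only `blk_1` lost a replica, so it is the only block that would need one more copy on a surviving node.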
  • 27.  Hadoop Architecture DataNode • Stores the actual data in HDFS • Can run on any underlying filesystem (ext3/4, NTFS, etc.) • Notifies the NameNode of which blocks it has • With a replication factor of 3, the NameNode directs replica placement so that two replicas sit on one rack and the third on a different rack
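Spreading three replicas over two racks can be sketched as follows. This follows the slide's description (two replicas on the local rack, one elsewhere) and is not the actual HDFS placement policy, which has more rules and may order the racks differently; the topology, node names and `place_replicas` helper are hypothetical.

```python
def place_replicas(topology, local_rack):
    """Pick three nodes: two on the local rack, one on another rack."""
    local_nodes = topology[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes on the local rack")
    # Consider other racks in a stable order and skip empty ones.
    remote_racks = [rack for rack in sorted(topology)
                    if rack != local_rack and topology[rack]]
    if not remote_racks:
        raise ValueError("need at least one other rack with a node")
    return [local_nodes[0], local_nodes[1], topology[remote_racks[0]][0]]


# Hypothetical cluster layout: rack name -> DataNodes on that rack.
topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(topology, "rack1")
```

Keeping one replica on a second rack means the block survives the loss of an entire rack, while the two same-rack replicas keep most replication traffic local.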
  • 28.  Hadoop Architecture Job Tracker: The role of the Job Tracker is to accept MapReduce jobs from the client and process the data with the help of the NameNode. In response, the NameNode provides metadata to the Job Tracker. Task Tracker: It works as a slave node for the Job Tracker. It receives a task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
  • 29.  Hadoop vs RDBMS Until recently, many applications used an RDBMS for batch processing: Oracle, Sybase, MySQL, Microsoft SQL Server, etc. Hadoop doesn’t fully replace relational products; many architectures would benefit from both Hadoop and relational products.
  • 30.  Hadoop vs RDBMS RDBMS products scale up • Expensive to scale for larger installations • Hits a ceiling when storage reaches 100s of terabytes Hadoop clusters can scale-out to 100s of machines and to petabytes of storage.
  • 31.  Comparison to RDBMS Hadoop was not designed for real-time or low-latency queries. Products that do provide low-latency queries, such as HBase, have limited query functionality. Hadoop performs best for offline batch processing on large amounts of data. RDBMS is best for online transactions and low-latency queries. Hadoop is designed to stream large files and large amounts of data. RDBMS works best with small records.
  • 32.  Hadoop Distributions Let’s say you download Hadoop’s HDFS and MapReduce. At first it works great, but then you decide to start using HBase. No problem, just download HBase and point it to your existing HDFS. But you find that HBase can only work with a previous version of HDFS. You downgrade HDFS and everything still works great.
  • 33.  Hadoop Distributions Hadoop distributions aim to resolve version incompatibilities. A distribution vendor will do the following: 1. Integration-test a set of Hadoop products. 2. Provide additional scripts to execute Hadoop. 3. Package Hadoop products in various installation formats: Linux packages, tarballs.
  • 35.  Distribution Vendors: Cloudera Hadoop Distribution; MapR Hadoop Distribution; Amazon Web Services Elastic MapReduce Hadoop Distribution; Hortonworks Hadoop Distribution; IBM InfoSphere BigInsights Hadoop Distribution; Microsoft Azure's HDInsight cloud-based Hadoop Distribution
  • 36.  Cloudera Distribution for Hadoop (CDH) Most popular distribution. Cloudera has taken the lead on providing a Hadoop distribution. CDH is provided in various formats: Linux packages, virtual machine images, and tarballs. Amazon Web Services Elastic MapReduce Hadoop Distribution. Integrates HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, ZooKeeper, Flume.
  • 37.  References • McAfee, A. (2012). Big Data: The Management Revolution. Harvard Business Review. • Zettaset. (2010). What is Big Data and Why Do Organizations Need It? Retrieved from Zettaset Corporate: http://www.zettaset.com/index.php/info-center/what-is-big-data
  • 38.  Q & A