HADOOP OVERVIEW
By Sunitha Flowerhill
(Masters in Computer Applications-MCA)
Data, Business Intelligence and Hadoop Architect
AGENDA
EVOLUTION AND EXPANSION OF BUSINESS DATA PROCESSING
MOTIVATION BEHIND HADOOP
HADOOP ARCHITECTURE
HADOOP TECHNOLOGIES AND USAGES
DATA WRANGLING ON HADOOP
BUSINESS INTELLIGENCE AND ANALYTICS ON HADOOP
EVOLUTION – STAGE 1
 1970S – PUNCHED CARDS AND PUNCHED TAPE
 COBOL AND JOB CONTROL LANGUAGE
 ISAM AND C-ISAM FILES – FLAT FILES WITH INDEXES
 WINCHESTER HARD DISKS WHICH LOOKED LIKE DRUMS
 EXAMPLE SYSTEM – PDP-11 BY DIGITAL EQUIPMENT CORPORATION (DEC)
 DRAWBACK – VERY SLOW, LOW CAPACITY
EVOLUTION – STAGE 2
 1980S – MINICOMPUTERS ARRIVED
 UNIX OPERATING SYSTEM (DEVELOPED AT BELL LABS FROM 1969,
WITH THE BSD VARIANT FROM UC BERKELEY), WHICH STILL RUNS IN
MANY FORMS LIKE HP-UX AND AIX; THE UNIX-LIKE LINUX IS THE
MAJOR OPERATING SYSTEM ON WHICH HADOOP RUNS
 RELATIONAL DATABASE SYSTEMS LIKE UNIFY, INFORMIX,
SYBASE AND DB2
 LAN BASED NETWORKED PCS – NOVELL NETWARE, DBASE,
FOXPRO – PC/MS-DOS/LAN BASED DATABASES
 SQL – STRUCTURED QUERY LANGUAGE, WHICH IS STILL
HEAVILY USED IN HADOOP AS HIVEQL, SPARK SQL ETC.
 STURDY AND FAULT TOLERANT
 DRAWBACK: LIMITED PROCESSING POWER AND GREEN
SCREEN! NOT MUCH OF A GRAPHICAL EXPERIENCE
EVOLUTION – STAGE 3
 CLIENT SERVER ARCHITECTURE – 2 TIER – PC BASED THICK CLIENT
FRONT END FOR PROCESSING DATA AT THE USER END AND A LAN
OR UNIX BASED SERVER FOR THE DATABASE SERVERSIDE
PROCESSING
 GRAPHICAL USER INTERFACE (GUI) FOR THE USER
 MORE PROCESSING POWER AT THE SERVER SIDE
 CONNECTION BETWEEN CLIENT AND SERVER USING OPEN DATABASE
CONNECTIVITY (ODBC) OR CALL LEVEL INTERFACE (CLI) –
USING DYNAMIC LINK LIBRARIES (DLLS)
 CLASSIFIED AS DISTRIBUTED SYSTEMS
 DATA STORAGE AND RECOVERY MECHANISMS SUCH AS
MIRRORING, REPLICATION, BLADING ETC WERE POSSIBLE AT THE
SERVER LEVEL
 DRAWBACK : LOW AVAILABILITY, FAILURES, LOTS OF
TROUBLESHOOTING
 “You know you have a distributed system when
the crash of a computer you’ve never
heard of stops you from getting any work
done.” -Leslie Lamport – distributed system computer scientist
EVOLUTION – STAGE 4
 3 TIER ARCHITECTURE – THIN CLIENT, APPLICATION
MIDDLEWARE AND SERVERS FOR DATABASE STORAGE
 THIN APPLICATION CLIENT OR WEB BASED CLIENT, WHICH ONLY
SERVES AS DATA DELIVERY, WITH MINIMAL PROCESSING AT
CLIENT END
 INTRODUCTION OF MIDDLEWARE SUCH AS TUXEDO, WEB
SERVICES, JAVA BEANS – MOST OF BUSINESS LOGIC RESIDES
HERE
 USES PACKET TECHNOLOGY FOR EFFICIENT TRANSPORTATION
AND RECOVERY
 USES DIFFERENT INTERNET PROTOCOLS FOR SECURITY AND
EFFICIENT TRANSPORTATION OF DATA BETWEEN THIN CLIENT
AND SERVER
 MORE GEOGRAPHICALLY DISTRIBUTED SERVERS, MIDDLEWARE
SERVERS, CLUSTER COMPUTING, CHEAP HARDWARE
 LOTS OF DATA CAPTURED ACROSS THE INTERNET, FROM SELF
SERVICE APPLICATIONS, USERS, MOBILE APPLICATIONS
THAT BRINGS US TO THE MOTIVATION BEHIND
HADOOP
 CHEAP CLUSTERED HARDWARE AVAILABLE NOW
 WE CAN RUN A HADOOP CLUSTER WITH ALL THE LAPTOPS
IN THIS CLASS CONNECTED TOGETHER AS NODES OF THE
CLUSTER
 HARDWARE FAILURE IS COMMON, SO DATA IS HEAVILY
REPLICATED
 MASSIVELY PARALLEL PROCESSING (MPP) – USAGE OF MULTIPLE
CPUS FOR A SINGLE TASK – SPARK ENGINE IS A GOOD
EXAMPLE OF MPP.
 VARIOUS ANALYSES CAN BE DONE ON LARGE DATASETS:
FORECASTING, PREDICTIONS, DIRECTIONS FOR BUSINESS
 ANALYTICS BASED INTELLIGENCE RATHER THAN PURE
PRODUCTION BASED MIS REPORTS
 SELLING OF THE DATASETS – HUGE BUSINESS
 AND MANY MORE…….
HADOOP
 WE ARE DEALING WITH TERABYTES OF DATA HERE IN CLUSTERED
COMPUTING
 APACHE TOP LEVEL PROJECT, OPEN SOURCE IMPLEMENTATION,
FOR RELIABLE, SCALABLE, DISTRIBUTED COMPUTING AND STORAGE.
 DISTRIBUTED BY HORTONWORKS AND CLOUDERA
 FLEXIBLE AND HIGHLY-AVAILABLE ARCHITECTURE FOR LARGE
SCALE COMPUTATION AND DATA PROCESSING ON A NETWORK
OF COMMODITY HARDWARE.
 STORAGE AND PROCESSING OF LARGE AND RAPIDLY GROWING
DATA.
 STRUCTURED AND UNSTRUCTURED DATA
 HIGH SCALABILITY AND AVAILABILITY
 FAULT TOLERANCE
 NOW INFRASTRUCTURE MAINTENANCE IS AVAILABLE AT LOW COST
BY CLOUD COMPANIES LIKE AWS, GOOGLE, GAIA, MS AZURE ETC
BASIC ARCHITECTURE
 MAIN NODES OF CLUSTER ARE WHERE MOST
OF THE COMPUTATIONAL POWER AND
STORAGE OF THE SYSTEM LIES
 MAIN NODES RUN TASKTRACKER TO ACCEPT
AND REPLY TO MAPREDUCE TASKS, AND
ALSO RUN DATANODE TO STORE NEEDED
BLOCKS AS CLOSE AS POSSIBLE
 CENTRAL CONTROL NODE RUNS NAMENODE
TO KEEP TRACK OF HDFS DIRECTORIES &
FILES, AND JOBTRACKER TO DISPATCH
COMPUTE TASKS TO TASKTRACKER
 HADOOP IS WRITTEN IN JAVA; IT ALSO
SUPPORTS PYTHON AND RUBY (E.G. VIA
HADOOP STREAMING), OTHER ENGINES LIKE
SPARK, AND LANGUAGES LIKE SCALA
HADOOP DISTRIBUTED FILESYSTEM
(HDFS) ARCHITECTURE
 TAILORED TO THE NEEDS OF MAPREDUCE
 TARGETED TOWARDS MANY READS OF
FILESTREAMS
 WRITES ARE MORE COSTLY – TIME, EFFORT –
SO WRITE ONCE – READ MANY PREFERRED
 HIGH DEGREE OF DATA REPLICATION (3X BY
DEFAULT)
 LARGE BLOCKSIZE (128 MB)
 LOCATION AWARENESS OF DATA NODES IN
NETWORK (RACK-AWARE STORAGE)
Cluster of machines running
Hadoop at Yahoo! (Source: Yahoo!)
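
The block size and replication figures above translate into simple storage arithmetic. The following is a minimal sketch in Python, not Hadoop code: the function name `hdfs_footprint` is hypothetical, and the constants are simply the 128 MB and 3x values quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 128  # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return block count, replica count and raw storage for one file.

    Hypothetical helper for illustration; it is not part of any Hadoop API.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION,
        # A partly filled last block only uses the space it holds, so the
        # raw usage is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


print(hdfs_footprint(1024))
```

Under these assumptions a 1024 MB file is split into 8 blocks, stored as 24 block replicas, and occupies 3072 MB of raw disk across the cluster.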
ARCHITECTURE - NAMENODE
 STORES METADATA FOR THE FILES, LIKE THE
DIRECTORY STRUCTURE OF A TYPICAL FS
 THE SERVER HOLDING THE NAMENODE
INSTANCE IS QUITE CRUCIAL, AS THERE IS
ONLY ONE ACTIVE NAMENODE. THE SECONDARY
NAMENODE CHECKPOINTS THE METADATA; IT IS
NOT A HOT STANDBY
 TRANSACTION LOG FOR FILE DELETES/ADDS,
ETC. DOES NOT USE TRANSACTIONS FOR
WHOLE BLOCKS OR FILE-STREAMS, ONLY
METADATA
 HANDLES CREATION OF MORE REPLICA
BLOCKS WHEN NECESSARY AFTER A DATA
NODE FAILURE
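
The re-replication idea in the last bullet can be sketched with a toy block map. This is an illustrative Python model only, assuming a target of 3 replicas; the function `handle_datanode_failure` and the node and block names are hypothetical, and the real NameNode logic is considerably more involved.

```python
REPLICATION = 3  # target replica count, as on the HDFS slide


def handle_datanode_failure(block_map, live_nodes, failed_node):
    """Drop the failed node and top up every under-replicated block.

    block_map maps a block id to the set of data nodes holding a replica.
    """
    survivors = [n for n in live_nodes if n != failed_node]
    new_map = {}
    for block_id, holders in block_map.items():
        holders = set(holders) - {failed_node}
        # Only nodes that do not already hold the block can take a new replica.
        candidates = [n for n in survivors if n not in holders]
        while len(holders) < REPLICATION and candidates:
            holders.add(candidates.pop(0))
        new_map[block_id] = holders
    return new_map


block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
after = handle_datanode_failure(block_map, ["dn1", "dn2", "dn3", "dn4"], "dn3")
print(after)
```

After the simulated loss of `dn3`, both blocks are back at three replicas, spread over the three surviving nodes.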
ARCHITECTURE - DATANODE:
 STORES THE ACTUAL DATA IN HDFS
 CAN RUN ON ANY UNDERLYING
FILESYSTEM (EXT 3/4, NTFS, ETC.)
 NOTIFIES NAMENODE OF WHAT BLOCKS
IT HAS
 WITH 3X REPLICATION, THE DEFAULT POLICY
PLACES 1 REPLICA IN THE LOCAL RACK AND 2
ON DIFFERENT NODES OF ONE REMOTE RACK
ARCHITECTURE – JOBTRACKER AND TASKTRACKER
 THE JOB TRACKER SPLITS A JOB INTO
TASKS AND DISTRIBUTES THEM TO
THE TASK TRACKERS ON THE
MACHINES. IT MAKES SURE THAT
EACH TASK IS COMPLETED, AND IF
A TASK FAILS ON ANY NODE, IT
REASSIGNS THAT TASK TO
ANOTHER TASK TRACKER.
 THE TASK TRACKERS (PROJECT
MANAGER IN OUR ANALOGY) ON
DIFFERENT MACHINES
ARE COORDINATED BY THE JOB
TRACKER
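
The job tracker and task tracker roles can be illustrated with a single-process word count. This is a minimal Python sketch of the MapReduce idea, not Hadoop's API: `run_job`, `map_words`, `reduce_counts` and the `flaky_split` parameter are hypothetical names, and the failure is simulated.

```python
from collections import defaultdict


def map_words(line):
    """Map task: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce task: sum all the counts collected for one word."""
    return word, sum(counts)


def run_job(splits, flaky_split=-1):
    """Toy 'job tracker': one map task per split, rerun a task that fails."""
    failed_once = set()
    intermediate = []
    for task_id, split in enumerate(splits):
        while True:
            try:
                if task_id == flaky_split and task_id not in failed_once:
                    failed_once.add(task_id)
                    raise RuntimeError("simulated task tracker failure")
                intermediate.extend(map_words(split))
                break
            except RuntimeError:
                continue  # reassign the same task and try again
    # Shuffle: group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for word, one in intermediate:
        grouped[word].append(one)
    return dict(reduce_counts(w, c) for w, c in grouped.items())


result = run_job(["Hadoop stores data", "Hadoop processes data"], flaky_split=1)
print(result)
```

The second map task fails once and is rerun, so the final counts are the same as for a run without failures.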
ARCHITECTURE – YARN (YET ANOTHER
RESOURCE NEGOTIATOR)
 YARN ARCHITECTURE CAN BE A
LITTLE CONFUSING..
 HADOOP 2.0 INTRODUCED YARN
(YET ANOTHER RESOURCE
NEGOTIATOR) AS HADOOP MOVED
FROM MAP REDUCE TO MORE
GENERIC MODEL, WITH ABILITY TO
SUPPORT APACHE SPARK AND
OTHER REAL TIME ENGINES.
 IT'S BASICALLY LIKE MULTI-THREADING –
MORE INSTANCES OF AN
APPLICATION MANAGED BY A
MASTER-MANAGER
 EXPAND THIS IDEA TO A CLUSTER.
EACH APPLICATION HAS A
CORRESPONDING APPLICATION
MASTER, WHICH RUNS AND MANAGES
ITS TASKS OR WORKERS. THE
APPLICATION MASTER REQUESTS
RESOURCES FROM THE RESOURCE
MANAGER, WHICH ALLOCATES THEM
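
The request and allocate relationship between an application master and the resource manager can be sketched as follows. This is a toy Python model under stated assumptions: the classes, the first-fit allocation and the memory-only resource model are illustrative choices, not YARN's actual scheduler or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResourceManager:
    """Tracks free memory per node and hands out containers first-fit."""
    free_memory_mb: Dict[str, int]

    def allocate(self, memory_mb: int) -> Optional[str]:
        for node, free in self.free_memory_mb.items():
            if free >= memory_mb:
                self.free_memory_mb[node] = free - memory_mb
                return node  # the container is placed on this node
        return None  # no node has enough free memory


@dataclass
class ApplicationMaster:
    """One per application; asks the resource manager for containers."""
    app_id: str
    rm: ResourceManager
    containers: List[str] = field(default_factory=list)

    def request_containers(self, count: int, memory_mb: int) -> int:
        for _ in range(count):
            node = self.rm.allocate(memory_mb)
            if node is None:
                break
            self.containers.append(node)
        return len(self.containers)


rm = ResourceManager({"node1": 4096, "node2": 2048})
spark_am = ApplicationMaster("spark-app", rm)
mr_am = ApplicationMaster("mapreduce-app", rm)
print(spark_am.request_containers(3, 1024), spark_am.containers)
print(mr_am.request_containers(4, 1024), mr_am.containers)
```

Two applications share the same cluster here: the second one asks for four containers but receives only three, because the resource manager has no free memory left.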
TECHNOLOGIES ON HADOOP
 ECOSYSTEM – WHERE ALL TOOLS RESIDE TOGETHER,
LIKE A POND ECOSYSTEM
 DATA PONDS, DATA LAKES AND DATA RESERVOIRS -
WHICH ARE REPLACING TRADITIONAL DATA
WAREHOUSES
 EFFICIENT BUSINESS INTELLIGENCE THROUGH PREDICTION
AND FORECASTING
 ALGORITHMS FOR MACHINE LEARNING AND DEEP
LEARNING
 WEB NOTEBOOKS, E.G. ZEPPELIN
 DATABASES AND SQL – NOSQL (NON-RELATIONAL)
DATABASES SUCH AS CASSANDRA AND HBASE; SQL
ENGINES SUCH AS HIVEQL AND SPARK SQL
TECHNOLOGIES ON HADOOP
 OPEN APIS FOR OPERATING ON DOCUMENTS – OPEN
JSON
 STREAM PROCESSING – DATA STREAMING – SPARK
STREAMING, APACHE STORM, REAL-TIME, EVENT
BASED – EX: FACEBOOK LIVE, REAL TIME DATA
STREAMING FOR DATA LAKES
 MESSAGING PLATFORMS – APACHE KAFKA – USED BY
LINKEDIN FOR MESSAGING, ANALYTICS, WITHOUT
HAVING TO PERFORM ANY KIND OF DATA MOVEMENT
EX: GROUPME, FACEBOOK MESSENGER
 GLOBAL RESOURCE MANAGEMENT - THE ABILITY TO
PRIORITIZE THE RESOURCES (CPU, MEMORY,
BANDWIDTH) OF AN APPLICATION. BUSINESSES CAN
GREATLY INCREASE THEIR MOMENTUM WHEN THEY
ARE ABLE TO USE THEIR ASSETS FOR CRITICAL
PROJECTS
DATA PREPARATION,
WRANGLING, ANALYSIS ON HADOOP
 VARIOUS ALGORITHMS FOR
 METADATA EXTRACTION
 FORMAT CONVERSION
 MDM IDENTIFICATION
 CROSS LINKING AMONG VARIOUS DATA
 CENTRALIZED INDEXING, TAGS, BUSINESS
METADATA, TECHNICAL METADATA
 TEXTUAL PATTERN RECOGNITION
 MOST OF THESE TOOLS ARE
SELF SERVICE ONES
 DATA INTEGRATION
BUSINESS INTELLIGENCE ON HADOOP
 SEARCH ENGINE TOOLS FOR OFFICE DATA
DIGGING OR MINING, WITH RANKED RESULTS
AND SUGGESTIONS. EXAMPLE – ELASTICSEARCH
 CUBING TOOLS – PREPARE DATA, COMPUTE
COMPLEX CALCULATIONS AND KEEP FOR
CONSUMPTION/REPORTING. EX: ATSCALE,
TRIFACTA
 STATISTICAL TOOLS – JMP AND SAS
 GEOSPATIAL TOOLS AND ACCESSORIES – EX:
ESRI SPATIAL FRAMEWORK
 TARGET MARKETING – EX: ELECTION
SOLICITING TO TARGET AUDIENCE OVER
SOCIAL MEDIA
 DECENTRALIZED ANALYTICS – ANALYSIS
DIVIDED ACROSS MULTIPLE LOCATIONS AND
MULTIPLE TALENTS, THEN CONVERGED
INTO GOOD RESULTS