Haden Pereira
Data Engineer, Applications Work Group @ EMC
5+ Years Experience in the Big Data Space
Quick Survey
• How many programmers/developers?
• How many SQL developers?
• How many application developers (Java, C#, etc.)?
• How many system administrators (database, Tomcat, etc.)?
• How many of you have heard of Hadoop?
• How many of you have hands-on experience with Hadoop?
• How many of you have worked with any of the NoSQL tools (Cassandra, MongoDB, Elasticsearch)?
Hadoop
A quick walkthrough
What is Hadoop?
Hadoop is an open-source framework for storing and processing data at large scale.
Why Hadoop?
• Traditionally, data processing was done on large single systems.
• Whenever better performance was needed, the old computer was replaced with a more powerful one.
• Scaling up was expensive.
• Scaling was also limited to the maximum resources available in a single system.
How does Hadoop Scale?
• “Scale out” rather than “scale up”.
• If the data set or the processing requirement grows, add one more server.
• Eliminates the strategy of growing computing capacity by throwing more expensive hardware at the problem.
Core Components of Hadoop
Hadoop v1 - HDFS & Map/Reduce
Hadoop v2 - HDFS & YARN
HDFS
Distributed: Data volumes are growing faster than the capacity of a single storage disk, so a cluster of disks distributed over a network is necessary.
Scalable: Extends to handle growing data requirements.
Fault-Tolerant: Uses replication to protect against the increased probability of failure that comes with a large number of disks.
[Diagram sequence: six servers (Server 1 to Server 6) with 1 TB of storage each, for a total capacity of 6 TB. File.txt (300 MB) is split into three 100 MB blocks, F1, F2 and F3. Each block is stored as three replicas (F1-R1 to F3-R3) spread across the servers. The final frame shows one additional copy labelled F3-R2.]
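The block-and-replica idea in the HDFS diagrams can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the function names are hypothetical, the round-robin placement is an assumption made for illustration (real HDFS placement is rack-aware), and the 100 MB block size follows the slides rather than the HDFS default (128 MB in Hadoop 2).

```python
import math


def split_into_blocks(file_size_mb, block_size_mb):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    if file_size_mb <= 0:
        return []
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * (n_blocks - 1)
    # The last block holds whatever is left over.
    sizes.append(file_size_mb - block_size_mb * (n_blocks - 1))
    return sizes


def place_replicas(n_blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers (round-robin).

    Round-robin is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    start = 0
    for b in range(n_blocks):
        holders = [servers[(start + r) % len(servers)] for r in range(replication)]
        placement[f"F{b + 1}"] = holders
        start += replication
    return placement


def surviving_replicas(placement, failed_servers):
    """Return, per block, the servers that still hold a copy after failures."""
    return {
        block: [s for s in holders if s not in failed_servers]
        for block, holders in placement.items()
    }


servers = [f"Server {i}" for i in range(1, 7)]
blocks = split_into_blocks(300, 100)
placement = place_replicas(len(blocks), servers)
after_failure = surviving_replicas(placement, {"Server 1"})
```

With six servers and three replicas per block, losing any single server still leaves at least two copies of every block, which is the property the Fault-Tolerant point above relies on.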
Map Reduce
Framework for writing applications that process large amounts of
structured and unstructured data in parallel, across a cluster of
thousands of machines, in a reliable and fault-tolerant manner.
Example: File.txt (300 MB), a CSV file with around 1 million lines:
….. , ….. , ….. , ….. , ….. , ….. , ….. , 654 , INR
….. , ….. , ….. , ….. , ….. , ….. , ….. , 432 , AED
….. , ….. , ….. , ….. , ….. , ….. , ….. , 573 , USD
….. , ….. , ….. , ….. , ….. , ….. , ….. , 948 , EUR
….. , ….. , ….. , ….. , ….. , ….. , ….. , 392 , GBP
Processing time as the file is split into smaller pieces:
• 1 file of 300 MB: 1 hour to process
• 2 files of 150 MB: 1/2 hour to process each
• 4 files of 75 MB: 1/4 hour to process each
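The split-and-process idea can be sketched as a small map and reduce job in Python. This is only an illustration under stated assumptions: the slides do not say what the processing is, so summing the amount per currency code is a made-up job, the function names are hypothetical, the sample rows are invented, and a thread pool on one machine stands in for the servers of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, n_chunks):
    """Cut the list of lines into at most `n_chunks` pieces of similar size."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(chunk):
    """Map step: total the amount per currency within one chunk."""
    totals = Counter()
    for line in chunk:
        fields = [field.strip() for field in line.split(",")]
        # Assumption: the last two columns are amount and currency code.
        amount, currency = int(fields[-2]), fields[-1]
        totals[currency] += amount
    return totals


def reduce_totals(partials):
    """Reduce step: merge the per-chunk totals into one result."""
    result = Counter()
    for partial in partials:
        result.update(partial)
    return result


def run_job(lines, n_chunks=4):
    """Process the chunks concurrently, then combine the partial results."""
    chunks = split_lines(lines, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_totals(partials)


# Invented sample rows in the same shape as the slide (amount, currency last).
sample = [
    "a , 654 , INR",
    "b , 432 , AED",
    "c , 100 , INR",
    "d , 573 , USD",
]
totals = run_job(sample, n_chunks=2)
```

The result does not depend on how many chunks the input is cut into, which is what allows the work to be spread over more servers without changing the answer.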
[Diagram sequence: the same six 1 TB servers holding the replicas F1-R1 to F3-R3. The final frame adds the labels P1-R1, P2-R1 and P3-R1 next to the stored blocks.]
Map Reduce
• Handles tasks in case of server failures
• Distributes tasks evenly
• Tries to run tasks on the same server where the data block resides
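The three points above can be illustrated with a toy task scheduler. This is a sketch under assumptions, not Hadoop's scheduler: the function name and server labels are hypothetical, and "least loaded server first" is a simple stand-in for even distribution.

```python
def assign_tasks(block_locations, live_servers):
    """Pick one server per block, preferring servers that hold the block.

    block_locations: dict mapping block name -> servers holding a replica.
    live_servers: servers currently available to run tasks.
    """
    if not live_servers:
        raise ValueError("no live servers to run tasks on")
    load = {server: 0 for server in live_servers}
    assignment = {}
    for block, holders in block_locations.items():
        # Data locality: prefer live servers that already store the block.
        local = [s for s in holders if s in load]
        # Fall back to any live server if every replica holder is down.
        candidates = local or list(load)
        # Even distribution: take the candidate with the fewest tasks so far.
        chosen = min(candidates, key=lambda s: load[s])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


locations = {
    "F1": ["S1", "S2", "S3"],
    "F2": ["S4", "S5", "S6"],
    "F3": ["S1", "S2", "S3"],
}
all_up = assign_tasks(locations, ["S1", "S2", "S3", "S4", "S5", "S6"])
s1_down = assign_tasks(locations, ["S2", "S3", "S4", "S5", "S6"])
```

When S1 is down, the tasks for F1 and F3 still land on servers that hold a replica, so no data has to be copied before the work can start.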
YARN
Multi-tenancy - YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set.
Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop.
Scalability - Data center processing power continues to expand rapidly. YARN’s ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
Compatibility - Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
Hadoop Ecosystem
Pig (scripting): Platform for analyzing large data sets. It comprises a high-level language (Pig Latin) that is translated to MapReduce jobs, which cuts down on the amount of code to write. Ideal for extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data.
Hive (SQL): Provides data warehouse infrastructure, enabling data summarization, ad hoc queries and analysis of large data sets. The query language, HiveQL (HQL), is similar to SQL.
HCatalog (SQL): Table and storage management layer that provides users of Pig, MapReduce and Hive with a relational view of data in HDFS. Provides REST APIs so that external systems can access these tables' metadata.
Ambari: Provides an open operational framework for provisioning, managing and monitoring Hadoop clusters.
ZooKeeper: Provides a distributed configuration service, a synchronization service and a naming registry for distributed systems.
Oozie: Enables Hadoop administrators to build complex data transformations out of multiple component tasks, giving greater control over complex jobs and making it easier to schedule repetitions of those jobs.
Tez: Leverages the MapReduce paradigm to enable the creation and execution of more complex Directed Acyclic Graphs (DAGs) of tasks. Tez eliminates unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, speeding up data processing across both small-scale/low-latency and large-scale/high-throughput workloads.
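What running a DAG of tasks means can be sketched with Python's standard library. This does not use the Tez API: the runner function and the task names are hypothetical. It only shows tasks executing in dependency order, with intermediate results handed over in memory instead of being written out between steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks: dict mapping task name -> function taking a dict of upstream results.
    dependencies: dict mapping task name -> set of upstream task names.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results, order


# Hypothetical five-step job: one load, one filter, two branches, one join.
tasks = {
    "load": lambda up: [3, 1, 2],
    "filter": lambda up: [x for x in up["load"] if x > 1],
    "sort": lambda up: sorted(up["filter"]),
    "count": lambda up: len(up["filter"]),
    "report": lambda up: (up["sort"], up["count"]),
}
dependencies = {
    "load": set(),
    "filter": {"load"},
    "sort": {"filter"},
    "count": {"filter"},
    "report": {"sort", "count"},
}
results, order = run_dag(tasks, dependencies)
```

The "sort" and "count" branches both read the output of "filter" directly, which is the kind of shape that a single map step followed by a single reduce step cannot express without chaining several jobs.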
Spark: A fast, general-purpose in-memory processing engine that can be deployed on YARN and can read data from and write data to HDFS.
Sqoop: Tool designed to transfer data between Hadoop and relational database servers.
HBase (NoSQL): Non-relational database that provides random real-time access to data in very large tables. HBase provides transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
Flume: Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
Intro to hadoop

 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Intro to hadoop

  • 1. Haden Pereira Data Engineer , Applications Work Group @ EMC 5+ Years Experience in the Big Data Space
  • 2. Quick Survey How many Programmers/Developers ?
  • 3. Quick Survey How many SQL Developers?
  • 4. Quick Survey How many Application Developers (Java,C#,etc)
  • 5. Quick Survey How many System Administrators (Database, Tomcat etc)
  • 6. Quick Survey How many of you have heard of Hadoop
  • 7. Quick Survey How many of you have hands on Experience in Hadoop ?
  • 8. Quick Survey How many of you have worked with any of the NoSQL tools. Cassandra, MongoDB, Elasticsearch
  • 10. What is Hadoop? Hadoop is an open source framework for large-scale data storage and processing.
  • 11. Why Hadoop? • Traditional data processing was done on large single systems. • Whenever better performance was needed, the old computer was replaced with a more powerful one. • Scaling up was expensive. • Scaling was also limited to the maximum resources available in a single system.
  • 12. How does Hadoop Scale? • “Scale Out” rather than “Scale Up” • If the data set or the data processing requirement grows, add one more server. • Eliminates the strategy of growing computing capacity by throwing more expensive hardware at the problem.
  • 13.
  • 14. Core Components of Hadoop Hadoop v1 - HDFS & Map/Reduce Hadoop v2 - HDFS & YARN
  • 15. HDFS Distributed: Data volumes are growing faster than the capacity of a single storage disk, so a cluster of disks distributed over a network is necessary. Scalable: Extends to handle growing data requirements. Fault-Tolerant: Uses replication to protect against the higher probability of failure that comes with a large number of disks.
  • 16. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB Total Capacity 6 TB
  • 17. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB
  • 18. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F1 F1 100MB 100MB 100MB
  • 19. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F2 F3 100MB 100MB 100MB F1 F2 F3
  • 20. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F2 F3 100MB 100MB 100MB F1-R1 F2-R1 F3-R1
  • 21. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F2 F3 100MB 100MB 100MB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2
  • 22. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F2 F3 100MB 100MB 100MB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2 F1-R3 F2-R3 F3-R3
  • 23. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F2 F3 100MB 100MB 100MB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2 F1-R3 F2-R3 F3-R3
  • 24. HDFS Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB File.txt 300 MB F1 F2 F3 100MB 100MB 100MB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2 F1-R3 F2-R3 F3-R3 F3-R2
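The HDFS slides above (16 to 24) cut a 300 MB file into three 100 MB blocks and keep three replicas of each block across six servers. A minimal sketch of that idea follows, assuming a simple round-robin placement. The function names and the placement policy are hypothetical and chosen for illustration; real HDFS uses a rack-aware policy, and its default block size is larger than the 100 MB used on the slides.

```python
from itertools import cycle

BLOCK_SIZE_MB = 100   # value used on the slides; real HDFS defaults differ
REPLICATION = 3       # three replicas per block, as on the slides
SERVERS = [f"Server {i}" for i in range(1, 7)]


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of the given size is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, servers=SERVERS, replication=REPLICATION):
    """Assign each block's replicas to distinct servers, round-robin.

    Hypothetical placement policy for illustration only; it is not
    HDFS's rack-aware algorithm.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    ring = cycle(servers)
    placement = {}
    for block in range(1, num_blocks + 1):
        placement[f"F{block}"] = [next(ring) for _ in range(replication)]
    return placement


def readable_after_failure(placement, failed_server):
    """A file stays readable if every block has a replica on a live server."""
    return all(
        any(server != failed_server for server in replicas)
        for replicas in placement.values()
    )


blocks = split_into_blocks(300)
placement = place_replicas(len(blocks))
for name, replicas in placement.items():
    print(name, "->", ", ".join(replicas))
```

Because every block lives on three different servers, losing any single server still leaves two copies of each block, which is the fault-tolerance property the slides illustrate.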
  • 25. Map Reduce Framework for writing applications that process large amounts of structured and unstructured data in parallel, across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
  • 26. Map Reduce File.txt 300 MB ….. , ….. , ….. , ….. , ….. , ….. , ….. , 654 , INR ….. , ….. , ….. , ….. , ….. , ….. , ….. , 432 , AED ….. , ….. , ….. , ….. , ….. , ….. , ….. , 573 , USD ….. , ….. , ….. , ….. , ….. , ….. , ….. , 948 , EUR ….. , ….. , ….. , ….. , ….. , ….. , ….. , 392 , GBP CSV file with around 1 million lines
  • 27. Map Reduce File.txt 300 MB 1 Hour to process 300 MB File
  • 28. Map Reduce File.txt 150 MB 1/2 Hour to process 150 MB File File.txt 150 MB 1/2 Hour to process 150 MB File
  • 29. Map Reduce File.txt 75 MB 1/4 Hour to process 75 MB File File.txt 75 MB 1/4 Hour to process 75 MB File File.txt 75 MB 1/4 Hour to process 75MB File 1/4 Hour to process 75 MB File File.txt 75 MB
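The timing slides above (27 to 29) assume that processing time shrinks in proportion to the number of pieces processed in parallel. A small sketch of that arithmetic follows; the function name is hypothetical, and the model is deliberately idealized: it ignores scheduling, data transfer, and the cost of combining partial results.

```python
def ideal_processing_hours(file_size_mb, pieces, hours_per_300_mb=1.0):
    """Idealized wall-clock time when a file is split into equal pieces
    that are all processed at the same time, one piece per worker."""
    rate_mb_per_hour = 300 / hours_per_300_mb   # single-worker speed from slide 27
    piece_size_mb = file_size_mb / pieces
    return piece_size_mb / rate_mb_per_hour


for pieces in (1, 2, 4):
    hours = ideal_processing_hours(300, pieces)
    print(f"{pieces} piece(s): {hours} hour(s)")
```

In practice the speedup is less than linear, since coordination and merging take time of their own.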
  • 30. Map Reduce Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2 F1-R3 F2-R3 F3-R3
  • 31. Map Reduce Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2 F1-R3 F2-R3 F3-R3
  • 32. Map Reduce Server 1 1 TB Server 3 1 TB Server 2 1 TB Server 5 1 TB Server 4 1 TB Server 6 1 TB F1-R1 F2-R1 F3-R1 F1-R2 F2-R2 F3-R2 F1-R3 F2-R3 F3-R3 P1-R1 P2-R1 P3-R1
  • 33. Map Reduce • Handles tasks in case of server failures • Distributes tasks evenly • Tries to run tasks on the same server where the data block resides
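The Map Reduce slides above (25 to 33) can be made concrete with a local simulation of the map, shuffle, and reduce phases. The slides do not say what is computed from the CSV file, so this sketch assumes the job totals the amount column per currency. The function names and the sample rows are made up for illustration (only the amounts and currency codes echo slide 26), and everything runs in one Python process rather than on a cluster.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (currency, amount) from one CSV line.

    Assumes the last two columns are amount and currency, as on slide 26.
    """
    fields = [field.strip() for field in line.split(",")]
    return fields[-1], int(fields[-2])


def map_block(lines):
    """Run the map step over every non-empty line of one block."""
    return [map_line(line) for line in lines if line.strip()]


def shuffle(mapped_blocks):
    """Shuffle step: group the values from all blocks by key."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: total the amounts for one currency."""
    return key, sum(values)


# Three blocks stand in for F1, F2 and F3; the rows are invented sample data.
blocks = [
    ["a, 654, INR", "b, 432, AED"],
    ["c, 573, USD", "d, 100, INR"],
    ["e, 948, EUR", "f, 392, GBP"],
]

mapped = [map_block(block) for block in blocks]   # one map task per block
totals = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

Each call to `map_block` depends only on its own block, which is what lets a cluster run the map tasks on the servers that already hold those blocks.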
  • 34. YARN Multi-tenancy - YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization over the more static Map Reduce rules used in early versions of Hadoop. Scalability - Data center processing power continues to expand rapidly. YARN’s Resource Manager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data. Compatibility - Existing Map Reduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
  • 35. YARN
  • 36. Hadoop Ecosystem Pig (scripting): Platform for analyzing large data sets. It consists of a high-level language (Pig Latin) that is translated to Map Reduce, which cuts down on the amount of code to write. Ideal for extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data. Hive (SQL): Provides data warehouse infrastructure, enabling data summarization, ad-hoc query and analysis of large data sets. The query language, HiveQL (HQL), is similar to SQL. HCatalog (SQL): Table and storage management layer that provides users of Pig, MapReduce and Hive with a relational view of data in HDFS. Provides REST APIs so that external systems can access these tables' metadata.
  • 37. Hadoop Ecosystem Ambari : Provides an open operational framework for provisioning, managing and monitoring Hadoop clusters. Zookeeper : Provides a distributed configuration service, a synchronization service and a naming registry for distributed systems. Oozie : Enables Hadoop administrators to build complex data transformations out of multiple component tasks, giving greater control over complex jobs and making it easier to schedule repetitions of those jobs.
  • 38. Hadoop Ecosystem Tez : Leverages the MapReduce paradigm to enable the creation and execution of more complex Directed Acyclic Graphs (DAG) of tasks. Tez eliminates unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, speeding up data processing across both small-scale/low-latency and large-scale/high-throughput workloads. Spark : Fast, general-purpose in-memory processing engine that uses YARN as a framework for deployment and can read/write data from HDFS.
  • 39. Hadoop Ecosystem Sqoop : Tool designed to transfer data between Hadoop and relational database servers HBase (NoSQL). Non-relational database that provides random real-time access to data in very large tables. HBase provides transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. Flume : Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS