SlideShare a Scribd company logo
Hadoop In Action
When
Where
Tuesday 06-12-2016
06:00 PM -08:00 PM
Badir Program for
Technology Incubators
#DataRiyadh DataGeeks DataGeeksarabia
Enough taking about Big data and Hadoop and let’s see how Hadoop works in action.
We will locate a real dataset, ingest it to our cluster, connect it to a database, apply some queries and data
transformations on it , save our result and show it via BI tool.
presented by
Mahmoud Yassin
Hadoop:
-Hadoop quick
definition.
-Why Hadoop?
-Hadoop ecosystem.
-Tools to be used.
Practical part:
-What’s the current setup?
-Ambari look.
-Current installed systems.
-Use case high-level
description.
-Steps to develop the use
case?
Use case:
-Locating the data.
-Ingest the data into the HDFS
-See how the files got created in
HDFS
-Feed other data from DB.
-Data querying via Hive and
MapReduce
-Hive table creation.
-Running transudation job via Pig.
-Check the Hive metastore.
-Connect BI to Hadoop.
-Sqoop basic commands
-End to End look solution.
By Mahmoud Yassin
Hadoop Hands On session
Agenda:
Hadoop:
-Hadoop quick definition.
-Why Hadoop?
-Hadoop ecosystem.
-Tools to be used.
Practical part:
-What’s the current setup?
-Ambari look.
-Current installed systems.
-Use case high-level description.
-Steps to develop the use case?
Use case:
-Locating the data.
-Ingest the data into the HDFS
-See how the files got created in HDFS
-Feed other data from DB.
-Data querying via Hive and MapReduce
-Hive table creation.
-Running transudation job via Pig.
-Check the Hive metastore.
- Connect BI to Hadoop.
What is Hadoop
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models.
Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for any
kind of data, enormous processing power and the ability to handle virtually limitless
concurrent tasks or jobs.
#DataRiyadh
Why Hadoop is important ?
Ability to store and process huge amounts of
any kind of data, quickly.
With data volumes and varieties constantly
increasing, especially from social media and the
Internet of Things (IoT), that's a key
consideration.
Computing power. Hadoop's distributed computing model processes big data fast. The
more computing nodes you use, the more processing power you have.
Fault tolerance. Data and application processing are protected against hardware failure.
If a node goes down, jobs are automatically redirected to other nodes to make sure the
distributed computing does not fail. Multiple copies of all data are stored automatically.
Why Hadoop is important ?
Flexibility. Unlike traditional relational databases, you
don’t have to preprocess data before storing it. You
can store as much data as you want and decide how
to use it later. That includes unstructured data like
text, images and videos.
Low cost. The open-source framework is free and uses commodity hardware
to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by
adding nodes. Little administration is required.
Scalability
Horizontal scaling means that you scale by adding more
machines into your pool of resources
Vertical scaling means that you scale by adding more
power (CPU, RAM) to an existing machine #DataRiyadh
Hadoop ecosystem
Cluster monitoring, provisioning and management
#DataRiyadh
Hadoop | Data Ingestion
Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured data stores such as relational databases.
#DataRiyadh
Hadoop | Data Storage Layer
Hadoop Distributed File System (HDFS) offers a way to store large files across
multiple machines. Hadoop and HDFS was derived from Google File System
(GFS) paper.
#DataRiyadh
Hadoop | Data Storage Layer
#DataRiyadh
Hadoop | Data Processing Layer
MapReduce is the heart of Hadoop. It is this programming paradigm that
allows for massive scalability across hundreds or thousands of servers in a
Hadoop cluster with a parallel, distributed algorithm.
#DataRiyadh
Hadoop | Data Processing Layer
Hadoop | Data Processing Layer
A scripting SQL based language and execution environment for creating complex
MapReduce transformations. Functions are written in Pig Latin (the language)
and translated into executable MapReduce jobs. Pig also allows the user to
create extended functions (UDFs) using Java.
#DataRiyadh
Hadoop | Data Querying Layer
A distributed data warehouse built on top of HDFS to manage and organize
large amounts of data. Hive provides a query language based on SQL semantic
(HiveQL) which is translated by the runtime engine to MapReduce jobs for
querying the data.
#DataRiyadh
Hadoop | Management Layer
intuitive, easy-to-use Hadoop management web UI. Apache Ambari was
donated by Hortonworks team. It's a powerful and nice interface for Hadoop
and other typical applications from the Hadoop ecosystem.
Hadoop | Management Layer
is an open-source Web interface that supports Apache Hadoop and its
ecosystem, licensed under the Apache v2 license
Big data existing solutions:
Current Setup
Current Setup
is a subsidiary of Dell Technologies, that provides cloud and virtualization
software and services.
http://www.vmware.com/
Current Setup
The VM make it easy to quickly get hands-on with CDH for testing, demo, and
self-learning purposes, and include Cloudera Manager for managing your
cluster. Cloudera QuickStart VM also includes a tutorial, sample data, and scripts
for getting started.
http://www.cloudera.com/downloads/quickstart_vms/5-8.html
Inside the VM:
Our RDBMS Hadoop Storage
Use Case
The case:
Data Sources
HDFS :
A platform for manipulating data stored in HDFS via a high-level
language called Pig Latin. It does data extractions, transformations
and loading, and basic analysis in patch mode
File
Video
RDBMS
A platform for manipulating data
stored in HDFS via a high-level
language called Pig Latin. It does
data extractions, transformations
and loading, and basic analysis in
patch mode
A data warehousing and SQL-like
query language that presents data
in the form of tables. Hive
programming is similar to database
programming.
open source massively parallel
processing (MPP) SQL query
engine for data stored in a
computer cluster running Apache
Hadoop.
Connect BI tools
to the Hadoop
cluster
Cloudera CDH cluster
Basic Linux Commands
cat [filename] Display file’s contents to the standard output
device
(usually your monitor).
cd /directorypath Change to directory.
chmod [options] mode filename Change a file’s permissions.
clear Clear a command line screen/window for a
fresh start.
cp [options] source destination Copy files and directories.
ls [options] List directory contents.
mkdir [options] directory Create a new directory.
mv [options] source destination Rename or move file(s) or directories.
pwd Display the pathname for the current
directory.touch filename Create an empty file with the specified name.
who [options] Display who is logged on.
Demo
Questions

More Related Content

What's hot

Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
Vishwajeet Jadeja
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
Yukti Kaura
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
Amrit Chhetri
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
KCC Software Ltd. & Easylearning.guru
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
Gigaom
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
Cloudera, Inc.
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
Maryan Faryna
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
Roman Nikitchenko
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Amir Shaikh
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
Sonal Tiwari
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
Avkash Chauhan
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
Arvind Kalyan
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
Md. Hasan Basri (Angel)
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Haluan Irsad
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
Savvycom Savvycom
 

What's hot (20)

Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 

Viewers also liked

MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataSteven Francia
 
Workshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSWorkshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWS
Amazon Web Services
 
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel AvivBig Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Amazon Web Services
 
AWS VPC best practices 2016 by Bogdan Naydenov
AWS VPC best practices 2016 by Bogdan NaydenovAWS VPC best practices 2016 by Bogdan Naydenov
AWS VPC best practices 2016 by Bogdan Naydenov
Bogdan Naydenov
 
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
Amazon Web Services
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
Amazon Web Services
 
(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive
Amazon Web Services
 
AWS Security Best Practices (March 2017)
AWS Security Best Practices (March 2017)AWS Security Best Practices (March 2017)
AWS Security Best Practices (March 2017)
Julien SIMON
 
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
Amazon Web Services
 
Amazon Elasticache Deep Dive - March 2017 AWS Online Tech Talks
Amazon Elasticache Deep Dive - March 2017 AWS Online Tech TalksAmazon Elasticache Deep Dive - March 2017 AWS Online Tech Talks
Amazon Elasticache Deep Dive - March 2017 AWS Online Tech Talks
Amazon Web Services
 
Best Practices with IoT Security - February Online Tech Talks
Best Practices with IoT Security - February Online Tech TalksBest Practices with IoT Security - February Online Tech Talks
Best Practices with IoT Security - February Online Tech Talks
Amazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
Deep Dive: Amazon Virtual Private Cloud (March 2017)
Deep Dive: Amazon Virtual Private Cloud (March 2017)Deep Dive: Amazon Virtual Private Cloud (March 2017)
Deep Dive: Amazon Virtual Private Cloud (March 2017)
Julien SIMON
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
Amazon Web Services
 

Viewers also liked (14)

MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous Data
 
Workshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSWorkshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWS
 
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel AvivBig Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
 
AWS VPC best practices 2016 by Bogdan Naydenov
AWS VPC best practices 2016 by Bogdan NaydenovAWS VPC best practices 2016 by Bogdan Naydenov
AWS VPC best practices 2016 by Bogdan Naydenov
 
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive
 
AWS Security Best Practices (March 2017)
AWS Security Best Practices (March 2017)AWS Security Best Practices (March 2017)
AWS Security Best Practices (March 2017)
 
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
 
Amazon Elasticache Deep Dive - March 2017 AWS Online Tech Talks
Amazon Elasticache Deep Dive - March 2017 AWS Online Tech TalksAmazon Elasticache Deep Dive - March 2017 AWS Online Tech Talks
Amazon Elasticache Deep Dive - March 2017 AWS Online Tech Talks
 
Best Practices with IoT Security - February Online Tech Talks
Best Practices with IoT Security - February Online Tech TalksBest Practices with IoT Security - February Online Tech Talks
Best Practices with IoT Security - February Online Tech Talks
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Deep Dive: Amazon Virtual Private Cloud (March 2017)
Deep Dive: Amazon Virtual Private Cloud (March 2017)Deep Dive: Amazon Virtual Private Cloud (March 2017)
Deep Dive: Amazon Virtual Private Cloud (March 2017)
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 

Similar to Hadoop in action

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
Ranjith Sekar
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop Technology
Rahul Sharma
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
Humoyun Ahmedov
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
Thanh Nguyen
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
Thanh Nguyen
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
Anthony Thomas
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
Hitendra Kumar
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
Asis Mohanty
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
amrutupre
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
MarianJRuben
 
Hadoop presentation
Hadoop presentationHadoop presentation
Hadoop presentation
Chandra Sekhar Saripaka
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product page
Janu Jahnavi
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
Nikita Sure
 
What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?
tommychauhan
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
Harshdeep Kaur
 
Hadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersHadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, Providers
Mrigendra Sharma
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
Laxmi Rauth
 
HDFS
HDFSHDFS
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
Mohanasundaram Ponnusamy
 
Hadoop
HadoopHadoop
Hadoop
Oded Rotter
 

Similar to Hadoop in action (20)

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop Technology
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Hadoop presentation
Hadoop presentationHadoop presentation
Hadoop presentation
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product page
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersHadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, Providers
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
HDFS
HDFSHDFS
HDFS
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 

Recently uploaded (20)

SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 

Hadoop in action

  • 1. Hadoop In Action When Where Tuesday 06-12-2016 06:00 PM -08:00 PM Badir Program for Technology Incubators #DataRiyadh DataGeeks DataGeeksarabia Enough taking about Big data and Hadoop and let’s see how Hadoop works in action. We will locate a real dataset, ingest it to our cluster, connect it to a database, apply some queries and data transformations on it , save our result and show it via BI tool. presented by Mahmoud Yassin Hadoop: -Hadoop quick definition. -Why Hadoop? -Hadoop ecosystem. -Tools to be used. Practical part: -What’s the current setup? -Ambari look. -Current installed systems. -Use case high-level description. -Steps to develop the use case? Use case: -Locating the data. -Ingest the data into the HDFS -See how the files got created in HDFS -Feed other data from DB. -Data querying via Hive and MapReduce -Hive table creation. -Running transudation job via Pig. -Check the Hive metastore. -Connect BI to Hadoop. -Sqoop basic commands -End to End look solution.
  • 2. By Mahmoud Yassin Hadoop Hands On session
  • 3. Agenda: Hadoop: -Hadoop quick definition. -Why Hadoop? -Hadoop ecosystem. -Tools to be used. Practical part: -What’s the current setup? -Ambari look. -Current installed systems. -Use case high-level description. -Steps to develop the use case? Use case: -Locating the data. -Ingest the data into the HDFS -See how the files got created in HDFS -Feed other data from DB. -Data querying via Hive and MapReduce -Hive table creation. -Running transudation job via Pig. -Check the Hive metastore. - Connect BI to Hadoop.
  • 4. What is Hadoop The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. #DataRiyadh
  • 5. Why Hadoop is important ? Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration. Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have. Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
  • 6. Why Hadoop is important ? Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos. Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data. Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required. Scalability Horizontal scaling means that you scale by adding more machines into your pool of resources Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine #DataRiyadh
  • 7. Hadoop ecosystem Cluster monitoring, provisioning and management #DataRiyadh
  • 8. Hadoop | Data Ingestion Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. #DataRiyadh
  • 9. Hadoop | Data Storage Layer Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. #DataRiyadh
  • 10. Hadoop | Data Storage Layer #DataRiyadh
  • 11. Hadoop | Data Processing Layer MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster with a parallel, distributed algorithm. #DataRiyadh
  • 12. Hadoop | Data Processing Layer
  • 13. Hadoop | Data Processing Layer A scripting SQL based language and execution environment for creating complex MapReduce transformations. Functions are written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows the user to create extended functions (UDFs) using Java. #DataRiyadh
  • 14. Hadoop | Data Querying Layer A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantic (HiveQL) which is translated by the runtime engine to MapReduce jobs for querying the data. #DataRiyadh
  • 15. Hadoop | Management Layer intuitive, easy-to-use Hadoop management web UI. Apache Ambari was donated by Hortonworks team. It's a powerful and nice interface for Hadoop and other typical applications from the Hadoop ecosystem.
  • 16. Hadoop | Management Layer is an open-source Web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license
  • 17. Big data existing solutions:
  • 19. Current Setup is a subsidiary of Dell Technologies, that provides cloud and virtualization software and services. http://www.vmware.com/
  • 20. Current Setup The VM make it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes, and include Cloudera Manager for managing your cluster. Cloudera QuickStart VM also includes a tutorial, sample data, and scripts for getting started. http://www.cloudera.com/downloads/quickstart_vms/5-8.html
  • 21. Inside the VM: Our RDBMS Hadoop Storage
  • 23. The case: Data Sources HDFS : A platform for manipulating data stored in HDFS via a high-level language called Pig Latin. It does data extractions, transformations and loading, and basic analysis in patch mode File Video RDBMS A platform for manipulating data stored in HDFS via a high-level language called Pig Latin. It does data extractions, transformations and loading, and basic analysis in patch mode A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Connect BI tools to the Hadoop cluster Cloudera CDH cluster
  • 24. Basic Linux Commands cat [filename] Display file’s contents to the standard output device (usually your monitor). cd /directorypath Change to directory. chmod [options] mode filename Change a file’s permissions. clear Clear a command line screen/window for a fresh start. cp [options] source destination Copy files and directories. ls [options] List directory contents. mkdir [options] directory Create a new directory. mv [options] source destination Rename or move file(s) or directories. pwd Display the pathname for the current directory.touch filename Create an empty file with the specified name. who [options] Display who is logged on.
  • 25. Demo

Editor's Notes

  1. http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
  2. http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf