Day1_23Aug.txt
Introductions
Myself:
Participants: experience, background, why Hadoop Admin
Murali: Msc-Comp, Sys admin
James: MBA, MCA; 2 yrs experience in SAP FICO and BW, and 2-3 yrs as a
Business Analyst. Not working at present; looking to get
into a new domain.
Vimal: working as an Oracle DBA in HCL Tech. Total exp: 5.7 years.
Preetam: My basic qualification is B.Com. I have 5+ years of experience as
a sys admin; looking to get into the big data
domain.
Vaibhav: My basic qualification is BCA and I have 5 years of experience in
the system and network field. Currently I am
working as an IT support engineer.
Vineeth: Completed M.C.A; 4 years of experience in IT. Working as an
Oracle DBA consultant; started my career in Wipro, currently with
CSG International, Bengaluru.
Prem: B.Tech (Computer Science), 9 years of experience as an Informatica
admin. Working in Barcelona, Spain; currently in
India on vacation. Planning to get into the big data domain as part of
extending my skills.
Mayur: Currently working with TCS as a WebSphere (WAS) admin, and
familiar with Linux. Hadoop is
one of the great emerging technologies, and since I already work as an
admin, I am interested in Hadoop admin.
Gurpreet: I have 7.6 yrs of experience in IT. Currently I am working on
SQL, PL/SQL, and Unix shell scripting in BFSI. I want to
change and upgrade my technology space; that is why Hadoop.
Samadhan: I am working with SDK Infotech as a developer; my qualification
is MBA (IT).
------------------>
Roles in Hadoop Space
Admin
Developer
Analyst
------------------> Course
1) Big Data - Hadoop
2) Storage: HDFS
3) Processing: MapReduce - Programming part of hadoop - Unstructured
4) Installation: Linux / Unix --> Ubuntu OS --> Pseudo Cluster
32 bit: preetam, samadhan, vaibhav
5) Gen1 & Gen2 of Hadoop
6) Multi Node Cluster
7) Eco-Systems: installation + some basic examples
hive - SQL engine to hadoop --> Structured Data
pig - Scripting solution --> Structured & Semi-Structured
sqoop: import - export of data between RDBMS and HDFS (a sample command follows this list)
hbase: columnar type no-sql
oozie: scheduling tool for hadoop
8) Distribution of Hadoop
a) Cloudera
b) Hortonworks
c) MapR
d) Pivotal
e) IBM BigInsights
9) Hadoop Reference Architecture
10) MapReduce --> Job Scheduling aspects of hadoop
11) Performance Tuning
12) Management & Monitoring aspects
13) Security
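As a preview of the sqoop item in 7) above, a typical import from an RDBMS into HDFS looks roughly like this (the host, database, credentials, and table name are made up for illustration; we will set this up properly when we install the eco-systems):

    # copy the txns table from a (hypothetical) MySQL database into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost/salesdb \
      --username dbuser --password dbpass \
      --table txns \
      --target-dir /user/cloudera/txns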
---->
3 types of Data
a) Structured --> like rows and cols: RDBMS, csv
b) Semi-Structured --> like xml: blogs, emails, chat, comments --> text
based but not in rows and cols; documents
c) Un-structured --> images, videos, audio
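To make the three types concrete, here are small made-up samples of each:

    Structured:       101,23-Aug,450.00,Bangalore             (a csv row - fixed columns)
    Semi-Structured:  <txn id="101"><amt>450.00</amt></txn>   (xml - tagged text, no fixed columns)
    Un-structured:    photo.jpg, song.mp3, movie.mp4          (binary - no text structure at all)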
Reading Material
a) Hadoop Definitive Guide --> Tom White
b) Hadoop in Action
c) Hadoop Operations
----------------------------------------------->
1) Most commonly used Linux Commands (a few examples follow this list)
http://www.thegeekstuff.com/2010/11/50-linux-commands/
http://searchenterpriselinux.techtarget.com/tutorial/77-useful-Linux-commands-and-utilities
https://www.getfilecloud.com/blog/2014/01/25-linux-commands-for-system-administrators/
2) Java:
https://www.youtube.com/watch?v=Hl-zzrqQoSE&list=PLFE2CE09D83EE3E28
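A few commands from the Linux lists above that a Hadoop admin ends up using constantly (illustrative, not exhaustive; file and user names are just examples):

    df -h                       # disk space per filesystem, human readable
    free -m                     # RAM and swap usage, in MB
    top                         # live per-process CPU / memory view
    ps -ef | grep java          # find running Java processes (hadoop daemons are Java)
    tail -f /var/log/messages   # follow a log file as it grows
    chmod 755 start-all.sh      # change file permissions
    chown hdfs:hadoop /data     # change file ownership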
------------------------------------------------->
What is Big Data? --> Term for Storage and Analysis of Data
Traditional Systems: File Systems [MainFrame ], RDBMS, DWH - Data
Warehouse / MPP - Massively Parallel
Processing
a) Storage was Centralized & not distributed
b) Processing was also Centralized & not distributed
Problems
a) Taking a long time to analyse the data: DWH [Teradata, Netezza,
Datastage] + ETL - Extract, Transform,
Load [Informatica] --> a couple of hours is common.
b) Works only on Structured Data.
c) Very high TCO. Total Cost of Ownership
------------------------------->
What is Hadoop:
Framework for storage and analysis --
Doug Cutting -- founder of Hadoop
2002: Doug Cutting + Mike Cafarella --> creating an open-source search
engine: Nutch --> the history of Hadoop is in
Chapter 1 of the Definitive Guide.
2003: Sanjay Ghemawat --> wrote a white paper on how Google indexes
and stores their web pages. GFS white
paper. Infrastructure paper
research.google.com/archive/gfs-sosp2003.pdf
2004: Jeff Dean --> white paper on the programming framework that
Google uses --> MapReduce
research.google.com/archive/mapreduce-osdi04.pdf
2006: Doug gave his software to the ASF [Apache Software Foundation]
--> Vanilla Hadoop
Video: Sanjay Ghemawat:
https://www.youtube.com/watch?v=NXCIItzkn3E
------------------------------------------------------------------------->
Attributes of Big Data
a) Volume
b) Variety
c) Velocity
d) Value / Veracity [Validity]
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Data mix by type, 2015 vs 2017 (projected); overall data volume growth ~200%:

                  2015    2017
Structured         85%     30%
Semi-Structured    10%     25%
Un-Structured       5%     45%
search: boston bombing big data
http://business.time.com/2013/08/06/big-data-is-my-copilot-auto-insurers-push-devices-that-track-driving-habits/
-------------------------------------------------------------------------------------------------->
txns dataset: 1 Million rows [10 lacs]
36GB
2 - 4 GB

                          RDBMS         DWH/MPP       MainFrame   Hadoop
A] Load the data          ~10 min       12 - 15 min   20 min      ~10 - 20 sec
B] select txno, amount
   from table             ~20 - 30 min  10 - 15 min   30 min      ~50 - 80 sec
   (10 million rows)                                              ~120 - 150 sec
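For reference, query B] above would be written like this in hive once the txns data is loaded (table and column names follow the class dataset; a sketch only - hive is covered properly in the eco-systems section):

    hive> SELECT txno, amount FROM txns;
    -- hive compiles this into a MapReduce job, which is where the
    -- ~50 - 80 sec figure in the table comes from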
---------------------------------------------------->
hadoop version installed is 0.20, in a folder called /usr/lib/hadoop-0.20
hadoop --> lists the options for the hadoop command. If this works, then
hadoop is running fine.
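For example (output abbreviated; the exact text varies by release):

    $ hadoop version
    Hadoop 0.20.x ...                 <-- confirms the install responds
    $ hadoop
    Usage: hadoop [--config confdir] COMMAND
    where COMMAND is one of: fs, jar, version, namenode, datanode, ...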
To check the services in hadoop, we will log in as
sudo su --> logging in as a super user - root
password --> cloudera
jps --> will list 6 services
NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
HMaster
jps --> Java Process Status
The IDs that you see on the left are the process IDs.
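A sample session on the Cloudera VM (the process IDs shown are illustrative - yours will differ, and jps also lists itself):

    $ sudo su
    Password: cloudera
    # jps
    2186 NameNode
    2281 DataNode
    2370 SecondaryNameNode
    2452 JobTracker
    2544 TaskTracker
    2633 HMaster
    2712 Jps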