Day1_23Aug.txt
Introductions
Myself:
Participants: experience, background, why Hadoop Admin
Murali: Msc-Comp, Sys admin
James: MBA, MCA; 2 yrs experience in SAP FICO and BW, and 2-3 yrs as a
Business Analyst. Not working at present; looking to get
into a new domain.
Vimal: working as an Oracle DBA in HCL Tech. Total exp: 5.7 years.
Preetam: My basic qualification is B.Com. I have 5+ years of experience as
a sys admin; looking to get into the big data
domain.
Vaibhav: My basic qualification is BCA and I have 5 years of experience in
the system and network field. Currently I am
working as an IT support engineer.
Vineeth: Completed M.C.A; 4 years of experience in IT. Working as an
Oracle DBA consultant; started my career in Wipro, currently with
CSG International, Bengaluru.
Prem: B.Tech (Computer Science), 9 years of experience as an Informatica
admin. Working in Barcelona, Spain; currently in
India on vacation. Planning to get into the big data domain as part of
extending my skills.
Mayur: Currently working with TCS as a WebSphere (WAS) admin, and
familiar with Linux. Hadoop is
one of the great emerging technologies, and since I already work as an
admin, I am interested in Hadoop admin.
Gurpreet: I have 7.6 yrs of experience in IT. Currently I am working on
SQL, PL/SQL, and Unix shell scripting in BFSI. I want to
change and upgrade my technology space; that is why Hadoop.
Samadhan: I am working with SDK Infotech as a developer; my qualification
is MBA (IT).
------------------>
Roles in Hadoop Space
Admin
Developer
Analyst
------------------> Course
1) Big Data - Hadoop
2) Storage: HDFS
3) Processing: MapReduce - Programming part of hadoop - Unstructured
4) Installation: Linux / Unix --> Ubuntu OS --> Pseudo Cluster
32 bit: preetam, samadhan, vaibhav
5) Gen1 & Gen2 of Hadoop
6) Multi Node Cluster
7) Eco-Systems: installation + some basic examples
hive - SQL engine to hadoop --> Structured Data
pig - Scripting solution --> Structured & Semi-Structured
sqoop: import - export of data between RDBMS and HDFS (a sample command follows this list)
hbase: columnar type no-sql
oozie: scheduling tool for hadoop
8) Distribution of Hadoop
a) Cloudera
b) Hortonworks
c) MapR
d) Pivotal
e) IBM BigInsights
9) Hadoop Reference Architecture
10) MapReduce --> Job Scheduling aspects of hadoop
11) Performance Tuning
12) Management & Monitoring aspects
13) Security
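As a preview of the sqoop item in 7) above, a typical import from an RDBMS into HDFS looks roughly like this (the host, database, credentials, and table name are made up for illustration; we will set this up properly when we install the eco-systems):

    # copy the txns table from a (hypothetical) MySQL database into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost/salesdb \
      --username dbuser --password dbpass \
      --table txns \
      --target-dir /user/cloudera/txns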
---->
3 types of Data
a) Structured --> like rows and cols: RDBMS, csv
b) Semi-Structured --> like xml: blogs, emails, chat, comments --> text
based but not in rows and cols; documents
c) Un-structured --> images, videos, audio
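To make the three types concrete, here are small made-up samples of each:

    Structured:       101,23-Aug,450.00,Bangalore             (a csv row - fixed columns)
    Semi-Structured:  <txn id="101"><amt>450.00</amt></txn>   (xml - tagged text, no fixed columns)
    Un-structured:    photo.jpg, song.mp3, movie.mp4          (binary - no text structure at all)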
Reading Material
a) Hadoop Definitive Guide --> Tom White
b) Hadoop in Action
c) Hadoop Operations
----------------------------------------------->
1) Most commonly used Linux Commands (a few examples follow this list)
http://www.thegeekstuff.com/2010/11/50-linux-commands/
http://searchenterpriselinux.techtarget.com/tutorial/77-useful-Linux-commands-and-utilities
https://www.getfilecloud.com/blog/2014/01/25-linux-commands-for-system-administrators/
2) Java:
https://www.youtube.com/watch?v=Hl-zzrqQoSE&list=PLFE2CE09D83EE3E28
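A few commands from the Linux lists above that a Hadoop admin ends up using constantly (illustrative, not exhaustive; file and user names are just examples):

    df -h                       # disk space per filesystem, human readable
    free -m                     # RAM and swap usage, in MB
    top                         # live per-process CPU / memory view
    ps -ef | grep java          # find running Java processes (hadoop daemons are Java)
    tail -f /var/log/messages   # follow a log file as it grows
    chmod 755 start-all.sh      # change file permissions
    chown hdfs:hadoop /data     # change file ownership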
------------------------------------------------->
What is Big Data? --> Term for Storage and Analysis of Data
Traditional Systems: File Systems [MainFrame ], RDBMS, DWH - Data
Warehouse / MPP - Massively Parallel
Processing
a) Storage was Centralized & not distributed
b) Processing was also Centralized & not distributed
Problems
a) Taking a long time to analyse the data: DWH [Teradata, Netezza,
Datastage] + ETL - Extract, Transform,
Load [Informatica] --> a couple of hours is common.
b) Works only on Structured Data.
c) Very high TCO. Total Cost of Ownership
------------------------------->
What is Hadoop:
Framework for storage and analysis --
Doug Cutting -- founder of Hadoop
2002: Doug Cutting + Mike Cafarella --> creating an open-source search
engine: Nutch --> the history of Hadoop is in
Chapter 1 of the Definitive Guide.
2003: Sanjay Ghemawat --> wrote a white paper on how Google indexes
and stores their web pages. GFS white
paper. Infrastructure paper
research.google.com/archive/gfs-sosp2003.pdf
2004: Jeff Dean --> white paper on the programming framework that
Google uses --> MapReduce
research.google.com/archive/mapreduce-osdi04.pdf
2006: Doug gave his software to the ASF [Apache Software Foundation]
--> Vanilla Hadoop
Video: Sanjay Ghemawat:
https://www.youtube.com/watch?v=NXCIItzkn3E
------------------------------------------------------------------------->
Attributes of Big Data
a) Volume
b) Variety
c) Velocity
d) Value / Veracity [Validity]
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Data mix by type, 2015 vs 2017 (projected); overall data volume growth ~200%:

                  2015    2017
Structured         85%     30%
Semi-Structured    10%     25%
Un-Structured       5%     45%
search: boston bombing big data
http://business.time.com/2013/08/06/big-data-is-my-copilot-auto-insurers-push-devices-that-track-driving-habits/
-------------------------------------------------------------------------------------------------->
txns dataset: 1 Million rows [10 lacs]
36GB
2 - 4 GB

                          RDBMS         DWH/MPP       MainFrame   Hadoop
A] Load the data          ~10 min       12 - 15 min   20 min      ~10 - 20 sec
B] select txno, amount
   from table             ~20 - 30 min  10 - 15 min   30 min      ~50 - 80 sec
   (10 million rows)                                              ~120 - 150 sec
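For reference, query B] above would be written like this in hive once the txns data is loaded (table and column names follow the class dataset; a sketch only - hive is covered properly in the eco-systems section):

    hive> SELECT txno, amount FROM txns;
    -- hive compiles this into a MapReduce job, which is where the
    -- ~50 - 80 sec figure in the table comes from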
---------------------------------------------------->
hadoop version installed is 0.20, in a folder called /usr/lib/hadoop-0.20
hadoop --> lists the options for the hadoop command. If this works, then
hadoop is running fine.
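For example (output abbreviated; the exact text varies by release):

    $ hadoop version
    Hadoop 0.20.x ...                 <-- confirms the install responds
    $ hadoop
    Usage: hadoop [--config confdir] COMMAND
    where COMMAND is one of: fs, jar, version, namenode, datanode, ...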
To check the services in hadoop, we will log in as
sudo su --> logging in as a super user - root
password --> cloudera
jps --> will list 6 services
NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
HMaster
jps --> Java Process Status
The IDs that you see on the left are the process IDs.
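A sample session on the Cloudera VM (the process IDs shown are illustrative - yours will differ, and jps also lists itself):

    $ sudo su
    Password: cloudera
    # jps
    2186 NameNode
    2281 DataNode
    2370 SecondaryNameNode
    2452 JobTracker
    2544 TaskTracker
    2633 HMaster
    2712 Jps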