Data Science Bootcamp, Day 2
Presented By:
Chetan Khatri, Volunteer Teaching assistant, Data Science Lab.
Guidance By:
Prof. Devji D. Chhanga, University of Kachchh.

Agenda
Understanding Git.
Understanding Apache Maven.
Hello World Java Program with Apache Maven.
Understanding of Hadoop Administrative Commands.
WordCount Hadoop Program on Hadoop Cluster with Maven.

Git with Github
● GitHub: A hosting service for Git repositories where you can store your source code
and collaborate on it with your team members.
● Installation: sudo apt-get install git
● Steps TODO:
1. Create Repository
2. Clone - Copy a repository (your own or someone else's) to your machine.
3. Commit - Record your changes in the local repository, ready to be submitted to GitHub.

Let’s have a Demo with Git
● Create a repository on GitHub named hadoopdemo
● Cloning Repository: git clone https://github.com/dskskv/hadoopdemo.git
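● Configure Git with your name and email (recorded in every commit):
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
● Add an individual file: git add README.md
● Add all new and changed files: git add .
(sudo is not needed for git commands; using it leaves root-owned files in the repository.)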

Let’s have a Demo with Git (Cont.)
● Commit the added files with a message:
git commit -m "Comment anything"
● Push the committed changes to the GitHub repository:
git push origin master
● Pull the latest code from the repository:
git pull https://github.com/dskskv/hadoopdemo.git
Git Branches:
Branches are independent lines of development within one repository. Teams often keep separate
branches for the development, test, and production phases of software development.
By convention, the master branch holds the latest stable code.
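
The Git steps above can be tried end to end without a GitHub account. The following is a minimal sketch, assuming git is installed: a local bare repository stands in for the GitHub remote, and the directory names, file names, and the "development" branch are made up for illustration.

```shell
#!/bin/sh
set -e

# Scratch area (hypothetical); a local bare repository plays the role of GitHub.
demo_dir=$(mktemp -d)
git init --quiet --bare "$demo_dir/remote.git"

# Clone: copy the (still empty) repository into a working directory.
git clone --quiet "$demo_dir/remote.git" "$demo_dir/work"
cd "$demo_dir/work"

# Identity for this repository only (placeholder values).
git config user.email "you@example.com"
git config user.name "Your Name"

# Add and commit a single file, then make sure the branch is named master.
echo "Hadoop demo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M master

# Push the commit to the remote.
git push --quiet origin master

# Work on a separate branch without touching master.
git checkout --quiet -b development
echo "work in progress" > notes.txt
git add .
git commit --quiet -m "Add notes on development branch"

# Return to master and pull the latest code from the remote.
git checkout --quiet master
git pull --quiet origin master

master_commits=$(git rev-list --count master)
dev_commits=$(git rev-list --count development)
echo "master: $master_commits commit(s), development: $dev_commits commit(s)"
```

With a real GitHub repository, only the clone URL changes; the add, commit, push, and pull commands stay the same.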

Understanding Apache Maven
Apache Maven is a build tool for Java. It lets you declare other artifacts (JAR files
written by someone else) as dependencies and builds your own JAR file against them.
Bundling those dependencies inside your JAR requires a plugin such as maven-shade-plugin.
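Maven workflow used in this session:
1. Create Maven Project
2. Update Maven Project (after changing dependencies in pom.xml)
3. Write Java Code
4. Maven Clean (removes the previous build output)
5. Maven Build (builds your JAR file)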
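
The workflow above can also be followed from the command line. The sketch below only creates the standard Maven directory layout with a minimal pom.xml and a Hello World class; it does not invoke Maven itself. The groupId com.example, the artifactId hellomaven, and the Java 1.8 compiler setting are assumptions chosen for illustration, not values from the session.

```shell
#!/bin/sh
set -e

# Hypothetical project location and coordinates.
project_dir=$(mktemp -d)/hellomaven
mkdir -p "$project_dir/src/main/java/com/example"

# Minimal pom.xml: coordinates plus the Java level to compile for.
cat > "$project_dir/pom.xml" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>hellomaven</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>
EOF

# Java source goes under src/main/java, following the package name.
cat > "$project_dir/src/main/java/com/example/HelloWorld.java" <<'EOF'
package com.example;

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}
EOF

echo "Project created at $project_dir"
echo "Next step: cd $project_dir && mvn clean package"
```

Running mvn clean package in that directory (with Maven and a JDK installed) should produce target/hellomaven-0.0.1-SNAPSHOT.jar, named after the artifactId and version in pom.xml.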

Understanding Hadoop Administrative Commands
1. Clone the CCCS936 repository from GitHub
git clone https://github.com/dskskv/CCCS936.git
2. Start Hadoop Cluster
sbin/start-dfs.sh
sbin/start-yarn.sh
3. Check Hadoop Version
hadoop version
4. Check all the options under hadoop command
hadoop
5. Create a directory named "dskskv" in HDFS
hadoop fs -mkdir /dskskv

6. List the contents of the dskskv directory in HDFS
hadoop fs -ls /dskskv
7. Create a text file on the local file system
gedit inputfile.txt
8. Copy the text file into HDFS
hadoop fs -put inputfile.txt /dskskv
9. Print the contents of the text file stored in HDFS
hadoop fs -cat /dskskv/inputfile.txt

10. "hadoop dfs" is deprecated; "hdfs dfs" performs the same operations ("hadoop fs" also still works).
hdfs dfs -mkdir /chetan
hdfs dfs -put inputfile.txt /chetan
hdfs dfs -cat /chetan/inputfile.txt
11. Delete a file from HDFS
hadoop fs -rm /dskskv/inputfile.txt
12. Delete a directory and its contents from HDFS
hadoop fs -rm -r /dskskv

WordCount Hadoop Program on Hadoop Cluster with Maven
1) Log in as the Hadoop user:
su hduser
2) Start the Hadoop daemon services
sbin/start-dfs.sh
sbin/start-yarn.sh
3) Check whether all daemon services are up
jps
4) Create a directory in HDFS. Note: whichever local directory you are working from in the console,
the Hadoop user must have privileges to access it.
hadoop fs -mkdir /input
5) Transfer the text file to HDFS
hadoop fs -put inputfile.txt /input

WordCount Hadoop Program on Hadoop Cluster with Maven (Cont.)
6) Check whether the file was transferred successfully
hadoop fs -ls /input
7) Run the Hadoop job, passing the executable JAR of the Hadoop program, the input directory that
holds the text file, and the output directory where the processed data should be stored.
The output directory must not already exist.
hadoop jar WordCountDSKSKV-0.0.1-SNAPSHOT.jar /input /output
8) Check that the output directory contains the processed files
hadoop fs -ls /output
9) Read the output of the Hadoop job
hadoop fs -cat /output/part-r-00000
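
To sanity-check the job output, the same counts can be computed locally with standard Unix tools. This is a rough sketch, not the WordCount program from the session: it assumes the job splits words on whitespace and writes one "word, tab, count" line per word sorted by key, as the standard Hadoop WordCount example does. The sample input text is made up.

```shell
#!/bin/sh
set -e

# Hypothetical sample input; replace with your own inputfile.txt.
workdir=$(mktemp -d)
printf 'hello hadoop\nhello maven\n' > "$workdir/inputfile.txt"

# "Map": put every whitespace-separated word on its own line.
# "Shuffle": sort so identical words sit next to each other.
# "Reduce": count each run of identical words, then print word<TAB>count.
LC_ALL=C tr -s '[:space:]' '\n' < "$workdir/inputfile.txt" \
  | grep -v '^$' \
  | LC_ALL=C sort \
  | uniq -c \
  | awk '{ print $2 "\t" $1 }' > "$workdir/local_counts.txt"

cat "$workdir/local_counts.txt"
```

If the job follows the same tokenization, the lines printed here should match the contents of /output/part-r-00000 for the same input file.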
