This document provides instructions for configuring a single-node Hadoop deployment on Ubuntu. It describes installing Java, adding a dedicated Hadoop user, configuring SSH for key-based authentication, disabling IPv6, installing Hadoop, updating environment variables, and editing the Hadoop configuration files core-site.xml, mapred-site.xml, and hdfs-site.xml. Key steps include setting JAVA_HOME, configuring HDFS directories and ports, and pointing hadoop.tmp.dir at the local /app/hadoop/tmp directory.
Session 03 - Hadoop Installation and Basic Commands
In this session you will learn:
Hadoop Installation and Commands
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
A tutorial presentation based on hadoop.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as Teaching Assistant of Cloud Computing course of Dr. Amir H. Payberah in spring semester 2015.
3. OS
•Install the Ubuntu Server (Maverick Meerkat) operating system that is available for download from the Ubuntu releases site.
•Some important points to remember while installing the OS
–Ensure that the SSH server is selected to be installed
–Enter the proxy details needed for systems to connect to the internet from within your network
–Create a user on each installation
•Preferably with the same password on each node
4. Prerequisites
•Supported Platforms
–GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
–Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.
•Required Software
–Required software for Linux and Windows include:
•Java™ 1.6.x, preferably from Sun, must be installed (the following slides install the Sun JDK 6).
•ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
•Additional requirements for Windows include:
–Cygwin - required for shell support in addition to the required software above.
•Installing Software
–If your cluster doesn't have the requisite software you will need to install it.
–For example on Ubuntu Linux:
•$ sudo apt-get install ssh
$ sudo apt-get install rsync
•On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the packages:
–openssh - in the Net category
5. Install Sun’s java JDK
•Install Sun’s java JDK on each node in the cluster
•Add the canonical partner repository to your list of apt repositories.
•You can do this by adding the line below to your /etc/apt/sources.list file:
deb http://archive.canonical.com/ maverick partner
•Update the source list
–sudo apt-get update
•Install sun-java6-jdk
–sudo apt-get install sun-java6-jdk
•Select Sun’s java as the default on the machine
–sudo update-java-alternatives -s java-6-sun
•Verify the installation running the command
–java -version
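Putting the steps together, a minimal sketch of the whole sequence (assuming the Canonical partner repository and the sun-java6-jdk package are still available for your Ubuntu release):
$ echo "deb http://archive.canonical.com/ maverick partner" | sudo tee -a /etc/apt/sources.list
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
$ sudo update-java-alternatives -s java-6-sun
$ java -version   # should report a Sun Java 1.6.x runtime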
6. Adding a dedicated Hadoop system user
•Use a dedicated Hadoop user account for running Hadoop.
•While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).
•This will add the user hduser and the group hadoop to your local machine:
–$ sudo addgroup hadoop
–$ sudo adduser --ingroup hadoop hduser
7. Configuring SSH
•Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
•For a single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous slide.
•Have SSH up and running on your machine and configured it to allow SSH public key authentication. http://ubuntuguide.org/
•Generate an SSH key for the hduser user.
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
8. Configuring SSH
•Second, you have to enable SSH access to your local machine with this newly created key.
–hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
•The final step is to test the SSH setup by connecting to your local machine with the hduser user.
•The step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file.
•If you have any special SSH configuration for your local machine, like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$
9. Disabling IPv6
•One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses.
•To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:
•You have to reboot your machine in order to make the changes take effect.
•You can check whether IPv6 is enabled on your machine with the following command:
•You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
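A minimal sketch of applying and verifying the change (assuming the three sysctl lines above have been appended to /etc/sysctl.conf; on most systems running sudo sysctl -p applies them without a reboot):
$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
A value of 1 means IPv6 is disabled; 0 means it is still enabled.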
10. Hadoop Installation
•You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice.
•Say /usr/local/hadoop.
•Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
•Rename (or symlink) hadoop-xxxxx to hadoop
$ cd /usr/local
$ sudo tar xzf hadoop-xxxx.tar.gz
$ sudo mv hadoop-xxxxx hadoop
$ sudo chown -R hduser:hadoop hadoop
11. Update $HOME/.bashrc
•Add the following lines to the end of the $HOME/.bashrc file of user hduser.
•If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
12. Update $HOME/.bashrc
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead() {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
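To pick up the new environment and confirm that the Hadoop binaries are reachable, a quick check (a sketch, assuming Hadoop was extracted to /usr/local/hadoop as above):
hduser@ubuntu:~$ source $HOME/.bashrc
hduser@ubuntu:~$ hadoop version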
13. Configuration files
•The $HADOOP_HOME/conf directory contains some configuration files for Hadoop. These are:
•hadoop-env.sh - This file contains some environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used, etc. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop (here, the Sun JDK 6).
•slaves - This file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will run. By default this contains the single entry localhost.
•hdfs-site.xml - This file contains site-specific settings for the HDFS daemons, such as the replication factor and the directories used by the NameNode and DataNodes. Settings in this file override the HDFS defaults.
•mapred-site.xml - This file contains site-specific settings for the Hadoop Map/Reduce daemons and jobs. The file is empty by default. Putting configuration properties in this file will override Map/Reduce settings in the hadoop-default.xml file. Use this file to tailor the behavior of Map/Reduce on your site.
•core-site.xml - This file contains site-specific settings for all Hadoop daemons and Map/Reduce jobs. This file is empty by default. Settings in this file override those in hadoop-default.xml and mapred-default.xml. This file should contain settings that must be respected by all servers and clients in a Hadoop installation, for instance, the location of the namenode and the jobtracker.
14. Configuration : Single node
•hadoop-env.sh :
–The only required environment variable we have to configure for Hadoop in this case is JAVA_HOME.
–Open conf/hadoop-env.sh in the editor of your choice
–set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory
–export JAVA_HOME=/usr/lib/jvm/java-6-sun
•conf/*-site.xml
–We configure following:
–core-site.xml
•hadoop.tmp.dir
•fs.default.name
–mapred-site.xml
•mapred.job.tracker
–hdfs-site.xml
•dfs.replication
15. Configure HDFS
•We will configure the directory where Hadoop will store its data files, the network ports it listens to, etc.
•Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
•You can leave the settings below "as is" with the exception of the hadoop.tmp.dir variable, which you have to change to the directory of your choice.
•We will use the directory /app/hadoop/tmp
•Hadoop's default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don't be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
16. conf/core-site.xml
<!--In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
17. conf/mapred-site.xml
<!--In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
18. conf/hdfs-site.xml
<!--In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
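Note that in the actual files each set of <property> blocks must sit inside a <configuration> root element. A minimal core-site.xml as it would look on disk (the other two files follow the same pattern):
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>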
19. Formatting the HDFS and Starting
•To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
•hadoop namenode -format
•Run start-all.sh : This will start up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine
•Run stop-all.sh to stop all processes
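One quick way to confirm that all daemons came up (a sketch, assuming the JDK's jps tool is on your PATH):
hduser@ubuntu:~$ jps
On a working single-node setup the output should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself).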
20. Download example input data
•Create a directory inside /home/…/gutenberg
•Download:
–The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson http://www.gutenberg.org/ebooks/20417.txt.utf-8
–The Notebooks of Leonardo Da Vinci http://www.gutenberg.org/cache/epub/5000/pg5000.txt
–Ulysses by James Joyce http://www.gutenberg.org/cache/epub/4300/pg4300.txt
•Copy local example data to HDFS
–hdfs dfs -copyFromLocal gutenberg gutenberg
•Check
–hadoop dfs -ls gutenberg
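A sketch of fetching the three books into a local gutenberg directory and copying it to HDFS (the saved file names are illustrative):
$ mkdir -p $HOME/gutenberg
$ cd $HOME/gutenberg
$ wget -O pg20417.txt http://www.gutenberg.org/ebooks/20417.txt.utf-8
$ wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
$ cd $HOME
$ hdfs dfs -copyFromLocal gutenberg gutenberg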
21. Run the MapReduce job
•Now, we run the WordCount example job
•hadoop jar /usr/lib/hadoop/hadoop-xxxx-example.jar wordcount gutenberg gutenberg-out
•This command will
–read all the files in the HDFS directory gutenberg (i.e. /user/hduser/gutenberg for the hduser user),
–process it, and
–store the result in the HDFS directory gutenberg-out
•Check if the result is successfully stored in the HDFS directory gutenberg-out
–hdfs dfs -ls gutenberg-out
•Retrieve the job result from HDFS
–hdfs dfs -cat gutenberg-out/part-r-00000
•Better:
–hdfs dfs -cat gutenberg-out/part-r-00000 | sort -nk2,2 -r | less
22. Hadoop Web Interfaces
•Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
•http://localhost:50030/ - web UI for MapReduce job tracker(s)
•http://localhost:50060/ - web UI for task tracker(s)
•http://localhost:50070/ - web UI for HDFS name node(s)
23. Cluster setup
•Basic idea
[Diagram] What we have done so far: Box 1 - Single Node Cluster (Master); Box 2 - Single Node Cluster (Master). Target: Master and Slave connected through a Gateway and Switch on the LAN; use Bitvise Tunnelier SSH port forwarding.
24. Calling by name
•Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make
•one Ubuntu box the ”master” (which will also act as a slave) and
•the other Ubuntu box a ”slave”.
•We will call the designated master machine just the master from now on and the slave-only machine the slave.
•We will also give the two machines these respective hostnames in their networking setup, most notably in /etc/hosts.
•If the hostnames of your machines are different (e.g. node01) then you must adapt the settings as appropriate.
25. Networking
•connect both machines via a single hub or switch and configure the network interfaces to use a common network such as 192.168.0.x/24.
•To make it simple,
•we will assign the IP address 192.168.0.1 to the master machine and
•192.168.0.2 to the slave machine.
•Update /etc/hosts on both machines with the following lines:
# /etc/hosts (for master AND slave)
192.168.0.1 master
192.168.0.2 slave
26. SSH access
•The hduser user on the master (aka hduser@master) must be able to connect a) to its own user account on the master - i.e. ssh master in this context and not necessarily ssh localhost - and b) to the hduser user account on the slave (aka hduser@slave) via a password-less SSH login.
•you just have to add the hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in this user's $HOME/.ssh/authorized_keys).
•ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
•Verify that password-less access from the master works: ssh hduser@slave and ssh hduser@master
28. Naming again
•The master node will run the “master” daemons for each layer:
–NameNode for the HDFS storage layer, and
–JobTracker for the MapReduce processing layer
•Both machines will run the "slave" daemons:
–DataNode for the HDFS layer, and
–TaskTracker for the MapReduce processing layer
•The “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.
•Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.
•These are the actual “master nodes”.
•The rest of the machines in the cluster act as both DataNode and TaskTracker.
•These are the slaves or “worker nodes”.
29. conf/masters (master only)
•The conf/masters file defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster.
•In our case, this is just the master machine.
•The primary NameNode and the JobTracker will always be the machines on which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively
•The primary NameNode and the JobTracker will be started on the same machine if you run bin/start-all.sh
•On master, update conf/masters so that it looks like this:
master
30. conf/slaves (master only)
•This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be run.
•We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.
•On master, update conf/slaves so that it looks like this:
master
slave
•If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).
31. conf/*-site.xml (all machines)
•You have to change the configuration files
•conf/core-site.xml,
•conf/mapred-site.xml and
•conf/hdfs-site.xml
•on ALL machines:
–fs.default.name : The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. Set as hdfs://master:54310
–mapred.job.tracker: The host and port that the MapReduce job tracker runs at. Set as master:54311
–dfs.replication: Default block replication. Set as 2
–mapred.local.dir: Determines where temporary MapReduce data is written. It also may be a list of directories.
–mapred.map.tasks: As a rule of thumb, use 10x the number of slaves (i.e., number of TaskTrackers).
–mapred.reduce.tasks: As a rule of thumb, use 2x the number of slave processors (i.e., number of TaskTrackers).
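Expressed as the corresponding property blocks (a sketch; each block goes inside the <configuration> element of the file named in the comment, on every machine):
<!-- In: conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>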
32. Formatting the HDFS and Starting
•To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the NameNode), run the command
•hdfs namenode -format
•Starting the multi-node cluster
–Starting the cluster is done in two steps.
–First, the HDFS daemons are started: start-dfs.sh
•NameNode daemon is started on master, and
•DataNode daemons are started on all slaves (here: master and slave)
–Second, the MapReduce daemons are started: start-mapred.sh
•JobTracker is started on master, and
•TaskTracker daemons are started on all slaves (here: master and slave)
•Run stop-mapred.sh followed by stop-dfs.sh to stop the cluster
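After starting, a quick verification sketch using jps; the expected process lists follow the roles described on the "Naming again" slide:
hduser@master:~$ jps
(expect NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
hduser@slave:~$ jps
(expect DataNode, TaskTracker)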
33. End of session
Day 1: Hadoop Deployment and Configuration - Single machine and a cluster
Run the PiEstimator example: hadoop jar /usr/lib/hadoop/hadoop-xxxxx-example.jar pi 2 100000