Instant hadoop of your own


Why is everyone interested in Big Data and Hadoop?
Why should you use Hadoop?
Read this and you, too, can quickly and easily become the proud owner of a Hadoop kit of your own, using Cloudera Free Edition.

************************NOTE**********************

This presentation is still being edited and new slides added every day. Stay tuned...

****************************************************

Published in: Technology, Education
Transcript of "Instant hadoop of your own"

  1. Instant Hadoop of your Own
     Created by Jack Bezalel, Senior IT Architect
     As part of the CTE Mentorship Program, CA Technologies
  2. What’s Hadoop all about?
     • Opportunity: We have access to amazingly valuable data (Social Media, Mobile, …)
     • Problem: Data is seldom structured
     • Relational databases and data warehouses MUST have structured data, so they are off the list
     • Hadoop = fast, reliable analysis of both structured data and complex data
  3. What’s in Hadoop?
     • Reliable data storage using the Hadoop Distributed File System (HDFS)
     • High-performance parallel data processing using a technique called MapReduce
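As a rough analogy for MapReduce (a sketch only, not how Hadoop is invoked), a word count can be written as a shell pipeline: one stage emits records, sorting groups identical keys, and a final stage counts each group. The function name `wordcount` and the sample input are made up for illustration.

```shell
# Count words the MapReduce way, using a pipeline as a stand-in for a cluster.
wordcount() {
  tr -s '[:space:]' '\n' |        # map: emit one word per line
    grep -v '^$' |                # drop empty records
    sort |                        # shuffle: bring identical keys together
    uniq -c |                     # reduce: count each group of equal keys
    awk '{ print $2 "\t" $1 }'    # format as word<TAB>count
}

# Hypothetical sample input: two short log-like lines.
printf 'error warn error\ninfo error\n' | wordcount
```

On a real cluster the map and reduce stages run in parallel on many servers and the sort becomes a distributed shuffle; the shape of the computation is the same.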
  4. How does it scale so well?
     • Hadoop runs on a collection of commodity, shared-nothing servers
     • You can add or remove servers in a Hadoop cluster at will
     • The system detects and compensates for hardware or system problems on any server (self-healing)
  5. Who uses Hadoop?
     • Much of its early development and large-scale deployment happened at Yahoo and Facebook
     • Hadoop is now widely used in
       – Finance
       – Technology
       – Telecom
       – Media and entertainment
       – Government
       – Research institutions and other markets with significant data
  6. Why did we use Cloudera’s Hadoop kit?
     • Cloudera is an active contributor to the Hadoop project
     • Provides an enterprise-ready, commercial distribution of Hadoop
     • The Cloudera distribution saves time by bundling and testing the most popular Hadoop-related projects in a single, easier-to-use package
  7. The solution we tested is provided by Cloudera Free Edition
     • Automates the installation and configuration of CDH3 on an entire cluster (up to 50 nodes)
     • Requires only root SSH access to your cluster’s machines
     • Download here: https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Download
  8. Cloudera Manager Free Edition consists of:
     • A small self-executing Cloudera Manager installation program
     • Server and other packages in preparation for cluster host installation
     • Cloudera Manager wizard for automating CDH3 installation and configuration on the cluster
     • Cloudera Manager monitoring and configuration of the cluster after installation is completed
  9. What does Cloudera Include - Flume
     • Flume: a reliable data mover
     • The primary use case
       – A logging system
       – Gathers a set of log files on every machine
       – Aggregates them to a centralized persistent store (such as HDFS)
  10. What does Cloudera Include - Sqoop
      • Sqoop: a tool that imports and exports data between relational databases and Hadoop clusters
      • Imports into HDFS using JDBC
      • Generates Java classes that enable users to interpret the table’s schema
  11. What does Cloudera Include - Hue
      • Hue: a GUI to work with CDH
      • Web application
  12. What does Cloudera Include - Pig
      • Pig: analyzes large amounts of data
      • Uses Pig’s query language, called Pig Latin
      • Queries run distributed on a Hadoop cluster
  13. What does Cloudera Include - Hive
      • Hive: a powerful data warehousing application
      • Enables access to your data using Hive QL
      • Hive QL = a language that is similar to SQL
  14. What does Cloudera Include - HBase
      • HBase: large-scale tabular storage
      • Uses HDFS
      • Cloudera recommends installing HBase in standalone mode before you try to run it on a whole cluster
  15. What does Cloudera Include - ZooKeeper
      • ZooKeeper: a service that provides coordination between distributed processes
  16. What does Cloudera Include - Oozie
      • Oozie: a server-based workflow engine
      • Runs workflow jobs with actions that execute Hadoop jobs
      • A command-line client is also available for remote management
  17. What does Cloudera Include – 3 last strangely named tools…
      • Whirr: provides a fast way to run cloud services
      • Snappy: a compression/decompression library
      • Mahout: a machine-learning tool. By enabling you to build machine-learning libraries that are scalable to "reasonably large" datasets, it aims to make building intelligent applications easier and faster
  18. Setup Walkthrough
      • Use Red Hat Enterprise Linux 5.5 or later (CentOS and others are supported as well; we used RHEL 5.7)
      • 64-bit only
      • 3 VMs used:
        – Cloudera Manager
        – 2 nodes to deploy Hadoop on
  19. About the Cloudera Manager Free Edition Installation Program
      • Automatically installs the package repositories for Cloudera Manager and the Oracle JDK
      • Installs the Cloudera Manager Server
      • Installs and configures an embedded PostgreSQL database
  20. Download the CDH3 (Cloudera) Manager
      • http://archive.cloudera.com/cloudera-manager/installer/latest/cloudera-manager-installer.bin
  21. Set your proxy in yum.conf, if you have one
      • Add these lines to /etc/yum.conf on your first Red Hat Hadoop node (example values shown):
          proxy=http://proxy.corp.com:80
          proxy_username=username
          proxy_password=password
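One way to add the proxy lines without duplicating them on a second run is sketched below. The function name `add_yum_proxy` is hypothetical, and the sketch assumes the target file contains only a `[main]` section, so that appended lines land inside it. It runs against a scratch file here; point it at /etc/yum.conf (as root) only after checking the result.

```shell
# Append proxy settings to a yum configuration file unless one is already set.
# Usage: add_yum_proxy FILE URL USERNAME PASSWORD   (hypothetical helper)
add_yum_proxy() {
  conf=$1
  if grep -q '^proxy=' "$conf"; then
    echo "proxy already configured in $conf" >&2
    return 0
  fi
  {
    echo "proxy=$2"
    echo "proxy_username=$3"
    echo "proxy_password=$4"
  } >> "$conf"
}

# Try it on a scratch file first, using the example values from the slide.
scratch=$(mktemp)
printf '[main]\ngpgcheck=1\n' > "$scratch"
add_yum_proxy "$scratch" http://proxy.corp.com:80 username password
cat "$scratch"
```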
  22. Let the show begin!
      • Make sure SELinux is disabled, or this won’t work!
        – View the file /etc/sysconfig/selinux
        – Make sure you have this line: SELINUX=disabled
        – You will need to reboot if you changed the SELINUX setting
      • Launch the Cloudera Manager installation:
          sudo chmod u+x ./cloudera-manager-installer.bin
          sudo ./cloudera-manager-installer.bin
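The SELinux check can be scripted before launching the installer. This is a minimal sketch: the function name `selinux_disabled` is made up, and it only inspects the configuration file, so it cannot tell whether the machine has been rebooted since the setting was changed.

```shell
# Succeed if the given SELinux config file sets SELINUX=disabled.
# Usage: selinux_disabled [FILE]   (defaults to /etc/sysconfig/selinux)
selinux_disabled() {
  grep -Eq '^[[:space:]]*SELINUX=disabled[[:space:]]*$' \
    "${1:-/etc/sysconfig/selinux}" 2>/dev/null
}

if selinux_disabled; then
  echo "SELinux is disabled in the config file; OK to launch the installer"
else
  echo "Set SELINUX=disabled in /etc/sysconfig/selinux and reboot first"
fi
```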
  23. This one is Easy…
  24. And this one as well…
  25. What do you think about this one?
  26. And yet another one…
  27. It will soon be over :)
  28. And it starts rolling
  29. Why it is important to avoid cleaning up your presentation…
  30. OOPsss!
  31. Here is why… (PostgreSQL missing…)
  32. After getting “Installation Failed” I got this as well… then it exited to the OS shell
  33. Installing PostgreSQL
      • rpm -ivh postgresql-8.1.23-1.el5_7.2.x86_64.rpm (CLIENT – not a must)
      • rpm -ivh postgresql-server-8.1.23-1.el5_7.2.x86_64.rpm
  34. Re-run installation
      • ./cloudera-manager-installer.bin
  35. Looking better now…
  36. Hooray!
  37. Continue Setup via the web…
  38. Welcome…
  39. You have to give something now… No such thing as free gifts :)
  40. Now enter your 2 or more Hadoop Node names
  41. Give it some credentials…
  42. Cool!
  43. Here goes nothing…
  44. Here is why it failed on the nodes…
  45. Installing what’s missing on both nodes
      • rpm -ivh cyrus-sasl-gssapi-2.1.22-5.el5_4.3.x86_64.rpm
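Both failures in this walkthrough came from a missing package, so a quick check before running the installer may save a retry. The sketch below lists which of the given packages are not installed. The function name `missing_packages` and the `PKG_QUERY` override are assumptions for illustration; `rpm -q` is the standard query on Red Hat systems.

```shell
# Print each package from the argument list that is not installed.
# PKG_QUERY is the command used to test for a package ("rpm -q" by default).
missing_packages() {
  for pkg in "$@"; do
    ${PKG_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1 || echo "$pkg"
  done
}

# On the Cloudera Manager host:
missing_packages postgresql-server
# On each Hadoop node:
missing_packages cyrus-sasl-gssapi
```

No output means nothing is missing; any name printed needs an `rpm -ivh` of the matching package first.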
  46. Do it, do it again
  47. This bogus issue was resolved by a simple retry. It looks like the step fails due to Internet access issues and does not report them accurately.
  48. Yeah!
  49. What’s on the Menu?
  50. Files and Folders… (Used the defaults; both nodes had the same directory structure)
  51. All systems are GO!
  52. Here is our glorious Hadoop Cluster
  53. Including all the services
  54. How to start Hadooping – using its GUI option (HUE)
      • Download the HUE user guide right here: https://ccp.cloudera.com/display/CDH4B2/Hue+2.0+User+Guide
  55. Syslog Action Time: Mapping and Analyzing Syslog
  56. Give me some GUI Hue! Use hostname:8088
  57. Wait a Minute…
      • Expect undocumented issues unless you do the following:
        – HUE requires a special user (let’s say “admin”)
        – Tell HUE about it the first time you use it
        – Add the user to the Unix system as well
        – Add the user to the groups “hive” and “hadoop”
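The Unix side of the HUE user setup could look like the sketch below. It only prints the commands so they can be reviewed before anything changes; the function name `hue_user_commands` is made up, and it assumes the “hive” and “hadoop” groups already exist on the host (the CDH packages normally create them).

```shell
# Print the commands that create a Unix account for HUE and add it to the
# "hive" and "hadoop" groups. Nothing is executed by this function.
hue_user_commands() {
  user=${1:-admin}
  echo "useradd $user"
  echo "usermod -a -G hive,hadoop $user"
}

hue_user_commands admin
```

Once the output looks right, it can be piped to a root shell, for example `hue_user_commands admin | sudo sh`.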
  58. Starting the Data Import from File
  59. Ready, Set, GO!
  60. This results in a new “Query”
  61. Let’s load it!
  62. Use this directory
  63. Done!
  64. Let’s hit the road!
  65. And we have a new table created!
  66. Upload the data
  67. Create a Select QUERY from our new table and Execute it
  68. Monitor the log report as the query is executed
  69. What a wonderful output! :)