Nordic Perl Workshop 2013

Playing with Hadoop
Søren Lund (slu)
slu@369.dk

  1. Nordic Perl Workshop 2013
     Playing with Hadoop
     Søren Lund (slu) slu@369.dk
  2. DISCLAIMER
     - I have no experience with Hadoop in a real-world project
     - The installation notes I present are not necessarily suitable for production
     - The example scripts have not been used on real (big) data
     - Hence the title: Playing with Hadoop
  3. About Hadoop (and Big Data)
  4. The Problem (it's not new)
     - We have (access to) more and more data
     - Processing this data takes longer and longer
     - Not enough memory
     - Running out of disk space
     - Our trusty old server can't keep up!
  5. Scaling up
     - Upgrade hardware: bigger and faster
     - Redundancy: power supply, RAID, hot-swap
     - Expensive to keep scaling up
     - Our software will run without modifications
  6. Scaling out
     - Add more (commodity) servers
     - Redundancy is replaced by replication
     - You can keep on scaling out, it's cheap
     - How do we enable our software to run across multiple servers?
  7. Google solved this
     - Google published two papers:
       - Google File System (GFS), 2003 http://research.google.com/archive/gfs.html
       - MapReduce, 2004 http://research.google.com/archive/mapreduce.html
     - GFS and MapReduce provided a platform for processing huge amounts of data in an efficient way
  8. Hadoop was born
     - Doug Cutting read the Google papers
     - Based on those, he created Hadoop (named after his son's toy elephant)
     - It is an implementation of GFS/MapReduce (Open Source / Apache License)
     - Written in Java and deployed on Linux
     - First part of Lucene, now an Apache project
     - https://hadoop.apache.org/
  9. Hadoop Components
     - Hadoop Common – utilities to control the rest
     - HDFS – Hadoop Distributed File System
     - YARN – Yet Another Resource Negotiator
     - MapReduce – YARN-based parallel processing
     - This enables us to write software that can handle Big Data by scaling out
  10. Big Data isn't just big
      - Huge amounts of data (volume)
      - Unstructured data (form)
      - Highly dynamic data (burst/change rate)
      - Big Data is actually hard-to-handle (with traditional tools/methods) data
  11. Examples of Big Data
      - Log files, e.g.
        - web server access logs
        - application logs
      - Internet feeds
        - Twitter, Facebook, etc.
        - RSS
      - Images (face recognition, tagging)
  12. Installing Hadoop
  13. Needed to run Hadoop
      - You need the following to run Hadoop:
        - Java JDK
        - Linux server
        - Hadoop tarball
      - I'm using the following:
        - JDK 1.6.24 64 bit
        - Ubuntu 12.04 LTS 64 bit
        - Hadoop 1.0.4
      - Could not get JDK7 + Hadoop 2.2 to work
  14. Install Java
  15. Setup Java home and path
  16. Add hadoop user
  17. Create SSH key for hadoop user
  18. Accept SSH key
  19. Install Hadoop and add to path
  20. Disable IPv6
  21. Reboot and check installation
  22. Running an example job
  23. Calculate Pi
  24. Estimated value of Pi
  25. Three modes of operation
      - Pi was calculated in Local standalone mode
        - it is the default mode (i.e. no configuration needed)
        - all components of Hadoop run in a single JVM
      - Pseudo-distributed mode
        - components communicate using sockets
        - a separate JVM is spawned for each component
        - it is a mini-cluster on a single host
      - Fully distributed mode
        - components are spread across multiple machines
  26. Create base directory for HDFS
  27. Set JAVA_HOME
  28. Edit core-site.xml
  29. Edit hdfs-site.xml
  30. Edit mapred-site.xml
  31. Log out and log on as hadoop
  32. Format HDFS
  33. Start HDFS
  34. Start Map Reduce
  35. Create home directory & test data
  36. Running Word Count
  37. First let's try the example jar
  38. Inspect the result
  39. Compile and run our own jar: https://gist.github.com/soren/7213273
  40. Inspect result
  41. Run improved version: https://gist.github.com/soren/7213453
  42. Inspect (improved) result
  43. Hadoop MapReduce
      - A reducer will get all values associated with a given key
      - A precursor job can be used to normalize data
      - Combiners can be used to perform early sorting of map output before it is sent to the reducer
  44. Perl MapReduce
  45. Playing with MapReduce
      - We don't need Hadoop to play with MapReduce
      - Instead we can emulate Hadoop using two scripts:
        - wc_mapper.pl – a Word Count Mapper
        - wc_reducer.pl – a Word Count Reducer
      - We connect them using a pipe (|)
      - Very Unix-like!
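The two gists linked on the next slide aren't reproduced here; the following is a minimal self-contained sketch of what such a word-count mapper/reducer pair does, with both halves combined into one file so it can run in-process (the function names are my own, not from the gists):

```perl
#!/usr/bin/env perl
# Emulating Hadoop with a shell pipe:
#   cat input.txt | ./wc_mapper.pl | sort | ./wc_reducer.pl
# Here both steps are plain functions instead of separate scripts.
use strict;
use warnings;

# Mapper: emit one "word<TAB>1" line per word.
sub wc_map {
    my ($line) = @_;
    return map { "$_\t1" } split /\s+/, lc $line;
}

# Reducer: sum the 1s for each word.
sub wc_reduce {
    my (@pairs) = @_;
    my %count;
    for my $pair (@pairs) {
        my ($word, $n) = split /\t/, $pair;
        $count{$word} += $n;
    }
    return %count;
}

my @mapped = map { wc_map($_) } ("to be or not to be");
my %counts = wc_reduce(sort @mapped);   # sort plays the role of `sort` in the pipe
print "$_\t$counts{$_}\n" for sort keys %counts;   # be 2, not 1, or 1, to 2
```

The `sort` between the two stages is what Hadoop's shuffle phase does for real jobs: it brings all lines for the same key together before the reducer sees them.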
  46. Run MapReduce without Hadoop
      - https://gist.github.com/soren/7596270
      - https://gist.github.com/soren/7596285
  47. Hadoop's Streaming interface
      - Enables you to write jobs in any programming language, e.g. Perl
      - Input from STDIN
      - Output to STDOUT
      - Key/Value pairs separated by TAB
      - Reducers will get values one-by-one
      - Not to be confused with Hadoop Pipes, which provides a native C++ interface to Hadoop
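The "values one-by-one" point is worth seeing in code: because Streaming delivers sorted "key&lt;TAB&gt;value" lines, a reducer only needs to watch for the key changing. A sketch with an invented in-memory input standing in for STDIN:

```perl
#!/usr/bin/env perl
# Streaming-style reducer: input arrives as sorted "key<TAB>value" lines,
# so a key change means the previous key is complete and can be emitted.
use strict;
use warnings;

my @output;
sub emit { push @output, "$_[0]\t$_[1]" }   # stand-in for print "$key\t$sum\n"

my ($current, $sum) = (undef, 0);
# Sorted lines as Hadoop would deliver them on STDIN (illustrative data).
for my $line ("be\t1", "be\t1", "not\t1", "or\t1", "to\t1", "to\t1") {
    my ($key, $value) = split /\t/, $line;
    if (defined $current && $key ne $current) {
        emit($current, $sum);   # previous key is done
        $sum = 0;
    }
    $current = $key;
    $sum += $value;
}
emit($current, $sum) if defined $current;   # flush the last key

print "$_\n" for @output;   # be 2 / not 1 / or 1 / to 2
```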
  48. Run Perl Word Count
      - https://gist.github.com/soren/7596270
      - https://gist.github.com/soren/7596285
  49. Inspect result
  50. Hadoop::Streaming
      - Perl interface to Hadoop's Streaming interface
      - Implemented in Moose
      - You can now implement your MapReduce as:
        - a class with a map() and reduce() method
        - a mapper script
        - a reducer script
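A word-count class along these lines might look roughly as below; the role names and the value-iterator methods follow my reading of the Hadoop::Streaming synopsis and should be checked against the CPAN documentation before use:

```perl
package WordCount::Mapper;
use Moose;
with 'Hadoop::Streaming::Mapper';

sub map {
    my ($self, $line) = @_;
    $self->emit($_ => 1) for split /\s+/, $line;
}

package WordCount::Reducer;
use Moose;
with 'Hadoop::Streaming::Reducer';

sub reduce {
    my ($self, $key, $values) = @_;
    my $count = 0;
    $count += $values->next while $values->has_next;
    $self->emit($key => $count);
}
```

The mapper and reducer scripts submitted to Hadoop then just load the class and call its run method on STDIN.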
  51. Installing Hadoop::Streaming
      - Btw, Perl was already installed on the server ;-)
      - But we want to install Hadoop::Streaming
      - I also had to install local::lib to make it work
      - All you have to do is: sudo cpan local::lib Hadoop::Streaming
      - Nice and easy
  52. Run Hadoop::Streaming job
      - https://gist.github.com/soren/7596451
      - https://gist.github.com/soren/7600134
      - https://gist.github.com/soren/7600144
  53. Inspect result
  54. Some final notes and loose ends
  55. The Web User Interface
      - HDFS: http://localhost:8070/
      - MapReduce: http://localhost:8030/
      - File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfo
      - Note: this is with port forwarding in VirtualBox
        - 50030 → 8030, 50070 → 8070, 50075 → 8075
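The port forwarding above can be configured from the host with VBoxManage; the VM name "hadoop-vm" is hypothetical, and the rule format is name,protocol,host-ip,host-port,guest-ip,guest-port:

```shell
# One NAT port-forwarding rule per web UI port (VM must be powered off)
VBoxManage modifyvm "hadoop-vm" --natpf1 "jobtracker,tcp,,8030,,50030"
VBoxManage modifyvm "hadoop-vm" --natpf1 "namenode,tcp,,8070,,50070"
VBoxManage modifyvm "hadoop-vm" --natpf1 "datanode,tcp,,8075,,50075"
```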
  56. Joins in Hadoop
      - It's possible to implement joins in MapReduce:
        - Reduce-joins – simple
        - Map-joins – less data to transfer
      - Do you need joins?
      - Maybe your data has structure → SQL?
        - Try Hive (HiveQL)
        - Or Pig (Pig Latin)
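A reduce-join works because records from both inputs that share a key end up at the same reducer. A minimal sketch in plain Perl, with invented sample data and tags ("name:"/"order:") standing in for the two input sources:

```perl
#!/usr/bin/env perl
# Reduce-side join sketch: mappers from both datasets emit
# "key<TAB>tag:payload"; the reducer sees all tagged records for a
# key together and can pair them up.
use strict;
use warnings;

# Combined "map" output from two sources, keyed by user id.
my @mapped = (
    "u1\tname:Alice",
    "u2\tname:Bob",
    "u1\torder:book",
    "u2\torder:lamp",
    "u1\torder:pen",
);

# "Reduce": group tagged records by key (the shuffle would do this).
my %by_key;
for my $line (sort @mapped) {
    my ($key, $tagged) = split /\t/, $line;
    push @{ $by_key{$key} }, $tagged;
}

# Join the name record with each order record for the same key.
my @joined;
for my $key (sort keys %by_key) {
    my ($name) = map { /^name:(.*)/  ? $1 : () } @{ $by_key{$key} };
    my @orders = map { /^order:(.*)/ ? $1 : () } @{ $by_key{$key} };
    push @joined, "$name bought $_" for @orders;
}
print "$_\n" for @joined;
```

The "less data to transfer" advantage of map-joins comes from skipping this shuffle entirely by loading the smaller dataset into memory on each mapper.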
  57. Hadoop in the Cloud
      - Elastic MapReduce (EMR): http://aws.amazon.com/elasticmapreduce/
      - Essentially Hadoop in the Cloud
      - Built on EC2 and S3
      - You can upload JARs or scripts
  58. There's more
      - Distributions:
        - Cloudera Distribution for Hadoop (CDH) http://www.cloudera.com/
        - Hortonworks Data Platform (HDP) http://hortonworks.com/
      - HBase, Hive, Pig and other related projects https://hadoop.apache.org/
      - But, a basic Hadoop setup is a good start
        - and a nice place to just play with Hadoop
  59. I like big data and I can not lie
      Oh, my God, Becky, look at the data, it's so big
      It looks like one of those Hadoop guys setups
      Who understands those Hadoop guys
      They only map/reduce it because it is on a distributed file system
      I mean the data, it's just so big
      I can't believe it's so huge
      It's just out there, I mean, it's gross
      Look, it's just so blah
  60. The End
      Questions?
      Slides will be available at http://www.slideshare.net/slu/
      Find me on Twitter https://twitter.com/slu