Playing with Hadoop (NPW2013)

Transcript

  • 1. Nordic Perl Workshop 2013: Playing with Hadoop. Søren Lund (slu), slu@369.dk
  • 2. DISCLAIMER: I have no experience with Hadoop in a real-world project. The installation notes I present are not necessarily suitable for production. The example scripts have not been used on real (big) data. Hence the title: Playing with Hadoop.
  • 3. About Hadoop (and Big Data)
  • 4. The Problem (it's not new). We have (access to) more and more data. Processing this data takes longer and longer. Not enough memory. Running out of disk space. Our trusty old server can't keep up!
  • 5. Scaling up. Upgrade hardware: bigger and faster. Redundancy: power supply, RAID, hot-swap. Expensive to keep scaling up. Our software will run without modifications.
  • 6. Scaling out. Add more (commodity) servers. Redundancy is replaced by replication. You can keep on scaling out; it's cheap. But how do we enable our software to run across multiple servers?
  • 7. Google solved this. Google published two papers: the Google File System (GFS) in 2003, http://research.google.com/archive/gfs.html, and MapReduce in 2004, http://research.google.com/archive/mapreduce.html. Together, GFS and MapReduce provided a platform for processing huge amounts of data in an efficient way.
  • 8. Hadoop was born. Doug Cutting read the Google papers and, based on those, created Hadoop (named after his son's toy elephant). It is an implementation of GFS/MapReduce (open source, Apache License), written in Java and deployed on Linux. First part of Lucene, now an Apache project: https://hadoop.apache.org/
  • 9. Hadoop Components. Hadoop Common: utilities to control the rest. HDFS: Hadoop Distributed File System. YARN: Yet Another Resource Negotiator. MapReduce: YARN-based parallel processing. This enables us to write software that can handle Big Data by scaling out.
  • 10. Big Data isn't just big. Huge amounts of data (volume). Unstructured data (form). Highly dynamic data (burst/change rate). Big Data is really any data that is hard to handle with traditional tools and methods.
  • 11. Examples of Big Data. Log files, e.g. web server access logs and application logs. Internet feeds: Twitter, Facebook, etc., and RSS. Images (face recognition, tagging).
  • 12. Installing Hadoop
  • 13. Needed to run Hadoop. You need a Linux server, a Java JDK, and a Hadoop tarball. I'm using Ubuntu 12.04 LTS 64-bit, JDK 1.6.24 64-bit, and Hadoop 1.0.4. I could not get JDK 7 + Hadoop 2.2 to work.
  • 14. Install Java
  • 15. Setup Java home and path
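    The transcript carries only the step titles for slides 14-15, so here is a plausible reconstruction for Ubuntu 12.04, not the author's exact commands (the talk used a manually installed JDK 1.6; the OpenJDK package and path below are assumptions):

        # Install a JDK (OpenJDK shown; the talk used JDK 1.6.24)
        sudo apt-get update
        sudo apt-get install openjdk-6-jdk

        # Point JAVA_HOME at the JDK and put it on the PATH, e.g. in ~/.bashrc
        export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
        export PATH=$JAVA_HOME/bin:$PATH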
  • 16. Add hadoop user
  • 17. Create SSH key for hadoop user
  • 18. Accept SSH key
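    Slides 16-18 likewise show only titles. Hadoop 1.x starts its daemons over SSH, so the hadoop user needs passwordless SSH to localhost; a typical way to set that up:

        # Add a dedicated hadoop user
        sudo adduser hadoop

        # As the hadoop user: create a passphrase-less key and authorize it
        ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
        cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
        chmod 600 ~/.ssh/authorized_keys

        # Connect once so the host key gets accepted
        ssh localhost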
  • 19. Install Hadoop and add to path
  • 20. Disable IPv6
  • 21. Reboot and check installation
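    A sketch of slides 19-21 (install locations are assumptions; the sysctl lines disable IPv6, which Hadoop 1.x is known to trip over when its daemons bind to IPv6 addresses):

        # Unpack the Hadoop tarball and put its bin directory on the PATH
        sudo tar xzf hadoop-1.0.4.tar.gz -C /usr/local
        export HADOOP_HOME=/usr/local/hadoop-1.0.4
        export PATH=$HADOOP_HOME/bin:$PATH

        # Disable IPv6: append to /etc/sysctl.conf
        net.ipv6.conf.all.disable_ipv6 = 1
        net.ipv6.conf.default.disable_ipv6 = 1
        net.ipv6.conf.lo.disable_ipv6 = 1

        # After a reboot, check the installation
        hadoop version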
  • 22. Running an example job
  • 23. Calculate Pi
  • 24. Estimated value of Pi
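    The pi estimator ships in the examples jar of the 1.0.4 release; run in local standalone mode it needs no configuration at all. The sample counts below are illustrative (the arguments are the number of maps and the number of samples per map):

        hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 10 100
        # ... job output, ending in a line like:
        # Estimated value of Pi is 3.14800000000000000000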
  • 25. Three modes of operation. Pi was calculated in local standalone mode: it is the default mode (i.e. no configuration needed), and all components of Hadoop run in a single JVM. Pseudo-distributed mode: a separate JVM is spawned for each component, and the components communicate using sockets; it is a mini-cluster on a single host. Fully distributed mode: components are spread across multiple machines.
  • 26. Create base directory for HDFS
  • 27. Set JAVA_HOME
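    Slides 26-27, reconstructed (the base directory is a guess; JAVA_HOME for the daemons is set in conf/hadoop-env.sh):

        # Create a base directory for HDFS data, owned by the hadoop user
        sudo mkdir -p /var/hadoop
        sudo chown hadoop:hadoop /var/hadoop

        # In $HADOOP_HOME/conf/hadoop-env.sh:
        export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64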
  • 28. Edit core-site.xml
  • 29. Edit hdfs-site.xml
  • 30. Edit mapred-site.xml
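    For a pseudo-distributed Hadoop 1.x setup each of the three files usually carries just one or two properties. A minimal sketch; the ports and tmp dir are the conventional choices, not necessarily the ones used in the talk:

        <!-- conf/core-site.xml -->
        <configuration>
          <property>
            <name>hadoop.tmp.dir</name>
            <value>/var/hadoop</value>
          </property>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>

        <!-- conf/hdfs-site.xml: one replica is enough on a single host -->
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>

        <!-- conf/mapred-site.xml -->
        <configuration>
          <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
          </property>
        </configuration>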
  • 31. Log out and log on as hadoop
  • 32. Format HDFS
  • 33. Start HDFS
  • 34. Start MapReduce
  • 35. Create home directory & test data
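    Slides 32-35 with the standard Hadoop 1.x commands (the test-data file name is made up for illustration):

        # As the hadoop user: format HDFS (once!), then start the daemons
        hadoop namenode -format
        start-dfs.sh
        start-mapred.sh

        # Create a home directory in HDFS and upload some test data
        hadoop fs -mkdir /user/hadoop
        hadoop fs -put input.txt input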
  • 36. Running Word Count
  • 37. First let's try the example jar
  • 38. Inspect the result
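    Word count also ships in the examples jar; slides 37-38 presumably ran something like this (directory names assumed):

        # Run the bundled WordCount on the uploaded test data
        hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar wordcount input output

        # Inspect the result: one "word<TAB>count" line per distinct word
        hadoop fs -cat output/part-*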
  • 39. Compile and run our own jar https://gist.github.com/soren/7213273
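    The WordCount source is in the gist; compiling it against Hadoop 1.0.4 would look roughly like this (class and jar names are placeholders, not taken from the gist):

        # Compile against the Hadoop core jar and package the classes
        mkdir classes
        javac -classpath $HADOOP_HOME/hadoop-core-1.0.4.jar -d classes WordCount.java
        jar cf wordcount.jar -C classes .

        # Run it on the same input, writing to a fresh output directory
        hadoop jar wordcount.jar WordCount input output2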
  • 40. Inspect result
  • 41. Run improved version https://gist.github.com/soren/7213453
  • 42. Inspect (improved) result
  • 43. Hadoop MapReduce. A reducer will get all values associated with a given key. A precursor job can be used to normalize data. Combiners can be used to perform early sorting of map output before it is sent to the reducer.
  • 44. Perl MapReduce
  • 45. Playing with MapReduce. We don't need Hadoop to play with MapReduce. Instead we can emulate Hadoop using two scripts, wc_mapper.pl (a Word Count mapper) and wc_reducer.pl (a Word Count reducer), connected with a pipe (|). Very Unix-like! A sketch of the two scripts follows below.
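    The two scripts are in the gists on the next slide; a minimal version of what they could look like follows. The reducer relies on its input arriving sorted by key, exactly the guarantee Hadoop's shuffle gives it:

        #!/usr/bin/env perl
        # wc_mapper.pl - emit "word<TAB>1" for every word on STDIN
        use strict;
        use warnings;

        while (my $line = <STDIN>) {
            chomp $line;
            print "$_\t1\n" for grep { length } split /\s+/, $line;
        }

        #!/usr/bin/env perl
        # wc_reducer.pl - sum the counts of consecutive identical keys
        use strict;
        use warnings;

        my ($current, $count) = (undef, 0);
        while (my $line = <STDIN>) {
            chomp $line;
            my ($word, $n) = split /\t/, $line;
            if (defined $current && $word ne $current) {
                print "$current\t$count\n";
                $count = 0;
            }
            $current = $word;
            $count += $n;
        }
        print "$current\t$count\n" if defined $current;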
  • 46. Run MapReduce without Hadoop https://gist.github.com/soren/7596270 https://gist.github.com/soren/7596285
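    The pipeline itself, with sort standing in for Hadoop's shuffle phase (that is what delivers each key's values to the reducer consecutively):

        cat input.txt | ./wc_mapper.pl | sort | ./wc_reducer.pl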
  • 47. Hadoop's Streaming interface. Enables you to write jobs in any programming language, e.g. Perl. Input comes from STDIN, output goes to STDOUT, and key/value pairs are separated by TAB. Reducers get values one by one. Not to be confused with Hadoop Pipes, which provides a native C++ interface to Hadoop.
  • 48. Run Perl Word Count https://gist.github.com/soren/7596270 https://gist.github.com/soren/7596285
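    The same two scripts run on Hadoop via the streaming contrib jar; a sketch for the 1.0.4 layout (HDFS paths assumed; -file ships the scripts out to the task nodes):

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
            -input input -output perlout \
            -mapper wc_mapper.pl -reducer wc_reducer.pl \
            -file wc_mapper.pl -file wc_reducer.pl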
  • 49. Inspect result
  • 50. Hadoop::Streaming. A Perl interface to Hadoop's Streaming interface, implemented with Moose. You can now implement your MapReduce job as a class with a map() and a reduce() method, plus a mapper script and a reducer script.
  • 51. Installing Hadoop::Streaming. By the way, Perl was already installed on the server ;-) But we want to install Hadoop::Streaming, and I also had to install local::lib to make it work. All you have to do is: sudo cpan local::lib Hadoop::Streaming. Nice and easy.
  • 52. Run Hadoop::Streaming job https://gist.github.com/soren/7596451 https://gist.github.com/soren/7600134 https://gist.github.com/soren/7600144
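    The actual job is in the gists above. Going by Hadoop::Streaming's documented interface, a word-count pair might look roughly like this (package names are made up for illustration):

        package WordCount::Mapper;
        use Moose;
        with 'Hadoop::Streaming::Mapper';

        # Called once per input line; emit() writes key/value pairs
        sub map {
            my ($self, $line) = @_;
            $self->emit($_ => 1) for grep { length } split /\s+/, $line;
        }

        package WordCount::Reducer;
        use Moose;
        with 'Hadoop::Streaming::Reducer';

        # Called once per key with an iterator over that key's values
        sub reduce {
            my ($self, $key, $values) = @_;
            my $count = 0;
            $count += $values->next while $values->has_next;
            $self->emit($key => $count);
        }

        # The mapper script then just does WordCount::Mapper->run,
        # and the reducer script WordCount::Reducer->run.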
  • 53. Inspect result
  • 54. Some final notes and loose ends
  • 55. The Web User Interface. HDFS: http://localhost:8070/. MapReduce: http://localhost:8030/. File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfo. Note: these are with port forwarding in VirtualBox: 50030 → 8030, 50070 → 8070, 50075 → 8075.
  • 56. Joins in Hadoop. It's possible to implement joins in MapReduce: reduce-joins are simple, map-joins transfer less data. But do you need joins? Maybe your data has structure → SQL? Try Hive (HiveQL) or Pig (Pig Latin).
  • 57. Hadoop in the Cloud. Elastic MapReduce (EMR), http://aws.amazon.com/elasticmapreduce/, is essentially Hadoop in the Cloud, built on EC2 and S3. You can upload JARs or scripts.
  • 58. There's more. Distributions: Cloudera Distribution for Hadoop (CDH), http://www.cloudera.com/, and Hortonworks Data Platform (HDP), http://hortonworks.com/. HBase, Hive, Pig and other related projects: https://hadoop.apache.org/. But a basic Hadoop setup is a good start, and a nice place to just play with Hadoop.
  • 59. I like big data and I can not lie. Oh, my God, Becky, look at the data, it's so big. It looks like one of those Hadoop guys' setups. Who understands those Hadoop guys? They only map/reduce it because it is on a distributed file system. I mean the data, it's just so big. I can't believe it's so huge. It's just out there, I mean, it's gross. Look, it's just so blah.
  • 60. The End. Questions? Slides will be available at http://www.slideshare.net/slu/. Find me on Twitter: https://twitter.com/slu
