Intro to Hadoop



We give an overview of Hadoop, HDFS, and MapReduce, then present scenarios for Hadoop usage with Java code and touch on some of the more useful features of, and projects under, the Hadoop umbrella.

Published in: Technology
  • This is my colleague Jacob.
    We're looking to scale up to 50-100 nodes in the next year.

  • Mention namenodes and datanodes
  • ERIC


  • 2 sets of input, 2 sets of output

  • ERIC -
    Hadoop can support other filesystems, like Amazon S3, but we are going to focus on the filesystem that is part of Hadoop Common.
  • There are two types of nodes in HDFS, namenodes and datanodes. You’ll see this master-worker pattern often in Hadoop, as many subprojects have similar structures.
    A block is the minimum amount of data that can be read or written.
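As a back-of-the-envelope illustration of how a file breaks into blocks, here is a plain-Java sketch with no Hadoop dependency (the 64 MB figure is the classic default block size of this era; the class and method names are made up for illustration):

```java
// Illustrative sketch: how HDFS splits a file into fixed-size blocks.
// Assumes the classic 64 MB default block size.
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Number of blocks needed to store a file of the given length.
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes == 0) return 0;
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB, BLOCK_SIZE) + " blocks"); // 16 blocks
    }
}
```

Note that, unlike a disk filesystem, a file smaller than one block does not occupy a full block's worth of storage in HDFS.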

  • If your data is split up into a bunch of small files, it is best to combine it or use CombineFileInputFormat for your Hadoop job.
  • By default, HDFS replicates each block to three servers. It is rack-aware, so where it can, it will place two replicas on one rack and the third on a separate rack.
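The placement rule can be sketched as a toy model (plain Java; the rack names and the `placeReplicas` helper are invented for illustration, and real HDFS placement also considers things like the writer's own node and datanode load):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the default rack-aware placement for 3 replicas:
// two replicas share one rack, the third sits on a different rack.
public class RackAwarePlacement {
    // Given the racks available, pick a rack for each of 3 replicas.
    static List<String> placeReplicas(List<String> racks) {
        List<String> placement = new ArrayList<>();
        String first = racks.get(0);
        // With only one rack available, all replicas land on it.
        String second = racks.size() > 1 ? racks.get(1) : first;
        placement.add(first);   // replica 1
        placement.add(second);  // replica 2
        placement.add(second);  // replica 3 shares a rack with replica 2
        return placement;
    }

    public static void main(String[] args) {
        System.out.println(placeReplicas(List.of("rack1", "rack2", "rack3")));
    }
}
```

The payoff of this layout: losing any single rack still leaves at least one live replica.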

  • If you need low-latency access to your Big Data, you will have to use a NoSQL solution such as HBase. It is not baked into Hadoop.
    All writes to HDFS are made by a single writer, appending to the end of the file.
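That single-writer, append-only constraint can be modeled in miniature (illustrative plain Java, not the HDFS client API; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the HDFS write constraint: one writer at a time,
// and writes only ever append to the end of the file.
public class AppendOnlyFile {
    private final List<String> chunks = new ArrayList<>();
    private boolean writerAttached = false;

    // Only a single writer may hold the file open for writing.
    public synchronized void open() {
        if (writerAttached) throw new IllegalStateException("file already has a writer");
        writerAttached = true;
    }

    // No random writes: data can only be added at the end.
    public synchronized void append(String data) {
        if (!writerAttached) throw new IllegalStateException("open() first");
        chunks.add(data);
    }

    public synchronized void close() { writerAttached = false; }

    public synchronized String contents() { return String.join("", chunks); }
}
```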
  • Show CDH3 VM
  • First things first: you need to make your data available to Hadoop for processing by putting it into HDFS.
    Let’s do this live and create a directory for all of Fred’s baby pictures. That should give us a nice multi-terabyte dataset, eh?

  • Since this is, after all, a Java user group, I see it only fitting that we actually look at some Java code.
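To set up the demo, here is the shape of the classic word-count job as a stdlib-only simulation (a sketch of the mapper/reducer contract only; a real job would extend the `org.apache.hadoop.mapreduce` `Mapper` and `Reducer` classes, which this deliberately avoids so it runs anywhere):

```java
import java.util.*;

// Stdlib-only sketch of the MapReduce contract behind word count:
// map emits (word, 1) pairs, the framework groups values by key,
// and reduce sums each group.
public class WordCountSim {
    // "map" phase: one input line -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // shuffle + "reduce" phase: group values by key, then sum each group
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello hadoop", "hello world")));
    }
}
```

The real framework does the grouping step for you, across machines; the mapper and reducer bodies are all you write.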

  • We use Cloudera’s Distribution of Hadoop, since their distribution contains many patches that you would otherwise only get by building Hadoop from source. They also guarantee that the patches they apply will work.
    You can download a VMware image or other packages to get you started.
  • We don’t have to use Java to interact with Hadoop! Any executable that can read from stdin and write to stdout can be used with Hadoop Streaming.
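For instance, a streaming mapper is just a program that reads lines from stdin and emits tab-separated key/value pairs on stdout; sketched here in Java for consistency, though a shell or Python script is more typical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of a Hadoop Streaming mapper: any executable that reads
// records from stdin and writes "key<TAB>value" lines to stdout
// can fill this role.
public class StreamingMapper {
    // Turn one input line into "word<TAB>1" output lines.
    static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.append(word).append('\t').append(1).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.print(mapLine(line));
        }
    }
}
```

A program like this would be handed to the hadoop-streaming jar via its `-mapper` option, with a matching `-reducer` doing the aggregation.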
  • You can also use C++ with Hadoop Pipes, but I’m not going to run an example as I am not very familiar with this feature.

  • Subprojects isn’t quite the right term for them, but I’m too lazy to come up with something better. These are all Apache top-level projects (with the exception of Flume).
  • Hadoop is not a single project but an ecosystem of related top-level Apache projects. I’d like to show you some examples of how these projects tie into Hadoop Common.

  • I know that pig!
  • Using HDFS for storage means that you don’t need the separate data-storage solution that some other tools require.
    Pig has an interactive shell called Grunt.

  • ZooKeeper allows you to safely handle partial failures, which are intrinsic to distributed applications.
  • A comparison of NoSQL solutions isn’t within the scope of this talk, so I’m just going to go over solutions that are technically part of Hadoop.

    Use this when you need real-time random read/write capabilities to your Big Data. We currently use HBase to serve up data to web pages much faster than a traditional RDBMS could with the amount of data we have.

  • We use Flume for our DNS and Apache logs. You can load them into HBase or Hive instead.

  • In summary
  • plug survey
