Editor's Notes

  1. This is my colleague Jacob.
  2. JACOB Looking to scale up to 50-100 nodes in the next year
  3. Mention namenodes and datanodes
  4. ERIC
  5. JACOB
  6. 2 sets of input, 2 sets of output
  7. ERIC - Hadoop can support other filesystems, like Amazon S3, but we are going to focus on the filesystem that is part of Hadoop Common.
  8. There are two types of nodes in HDFS, Namenodes and Datanodes. You’ll see this master-worker pattern often in Hadoop, as many subprojects have similar structures. A block is the minimum amount of data that can be read or written.
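The block model can be sketched with a quick back-of-the-envelope calculation. This is a plain-Python illustration, not the HDFS API; the 64 MB figure is the HDFS default block size of this era and is an assumption here, not something from the slides:

```python
import math

# Back-of-the-envelope sketch of the HDFS block model. 64 MB was the
# default block size of this era; the real value is cluster
# configuration, so treat it as an assumption.
BLOCK_SIZE = 64 * 1024 * 1024

def num_blocks(file_size_bytes):
    """How many HDFS blocks a file of this size occupies."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 200 MB file spans 4 blocks; unlike many disk filesystems, the
# final partial block does not pin a full 64 MB on the datanode.
print(num_blocks(200 * 1024 * 1024))  # 4
```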
  9. If your data is split up into a bunch of small files, it is best to combine it or use CombineFileInputFormat for your Hadoop job.
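For the “combine it yourself” option, here is a minimal sketch (plain Python, throwaway temp files) of concatenating many tiny inputs into one file before loading it into HDFS, so a job reads a few big inputs instead of thousands of small ones:

```python
import os, tempfile

# Sketch: pre-combine many small files into one large file before
# loading it into HDFS. (The alternative, CombineFileInputFormat,
# does the grouping at job time instead.)
def combine_files(paths, out_path):
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                out.write(src.read())

# demo with throwaway files in a temp directory
tmp = tempfile.mkdtemp()
parts = []
for i in range(3):
    p = os.path.join(tmp, "part-%d.txt" % i)
    with open(p, "w") as f:
        f.write("record %d\n" % i)
    parts.append(p)

combined = os.path.join(tmp, "combined.txt")
combine_files(parts, combined)
print(open(combined).read())  # record 0 / record 1 / record 2
```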
  10. By default, HDFS replicates each block to three datanodes. It is rack-aware, so it will put two replicas on one rack and the third on a separate rack where it can.
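The rack-aware placement can be sketched as follows. The rack and node names are invented for illustration, and the real policy lives in the namenode; this just shows the shape of the decision:

```python
# Sketch of rack-aware placement for three replicas: one copy on the
# writer's rack, two on a single remote rack, so losing an entire
# rack never loses every copy. Rack names are invented.
def place_replicas(local_rack, racks):
    """Return (rack, replica_count) pairs for a 3-replica write.

    racks maps rack name -> list of datanodes; at least two racks
    are assumed to exist.
    """
    remote = next(r for r in racks if r != local_rack)
    return [(local_rack, 1), (remote, 2)]

racks = {"rack-a": ["dn1", "dn2"], "rack-b": ["dn3", "dn4"]}
print(place_replicas("rack-a", racks))  # [('rack-a', 1), ('rack-b', 2)]
```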
  11. If your data is split up into a bunch of small files, it is best to combine it or use CombineFileInputFormat for your Hadoop job.
  12. If you need low-latency access to your Big Data, you will have to use a NoSQL solution such as HBase; it is not baked into Hadoop. All writes to HDFS are made by a single writer, appending to the end of the file.
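The single-writer, append-only model can be pictured with a local-filesystem analogy. This is ordinary Python file I/O, not the HDFS API:

```python
import os, tempfile

# Local-filesystem analogy (ordinary Python I/O, not the HDFS API):
# an HDFS file behaves like a log held open by a single writer,
# where every write lands at the end of the file.
path = os.path.join(tempfile.mkdtemp(), "events.log")
with open(path, "a") as writer:
    writer.write("first\n")
with open(path, "a") as writer:
    writer.write("second\n")  # always appended, never inserted mid-file
with open(path) as f:
    contents = f.read()
print(contents)  # first\nsecond\n
```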
  13. Show CDH3 VM
  14. First things first: you need to make your data available to Hadoop for processing by putting it into HDFS. Let’s do this live and create a directory for all of Fred’s baby pictures. That should give us a nice multi-terabyte dataset, eh?
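The live demo boils down to two shell commands, `hadoop fs -mkdir` and `hadoop fs -put`. A sketch that builds them as argument lists (the `/user/fred/baby-pictures` path and file name are made up; actually running these requires a machine with the Hadoop client installed, e.g. via `subprocess.run`):

```python
# Sketch of the two shell commands for the demo, built as argument
# lists suitable for subprocess.run. The HDFS path is made up for
# illustration.
def hdfs_mkdir(hdfs_path):
    return ["hadoop", "fs", "-mkdir", hdfs_path]

def hdfs_put(local_path, hdfs_path):
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

print(hdfs_mkdir("/user/fred/baby-pictures"))
print(hdfs_put("pic-0001.jpg", "/user/fred/baby-pictures/"))
```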
  15. Since this is, after all, a Java user group, I see it only fitting that we actually look at some Java code.
  16. We use Cloudera’s Distribution of Hadoop, since their distribution contains many patches that you would otherwise only get by building Hadoop from source. They also guarantee that the patches they apply will work. You can download a VMware image or other materials to get you started.
  17. We don’t have to use Java to interact with Hadoop! Any executable that can read from stdin and write to stdout can be used with Hadoop Streaming.
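Streaming’s contract is just tab-separated key/value lines on stdin and stdout. A word-count sketch in Python — chained in-process here for illustration; on a real cluster you would hand the mapper and reducer scripts to the hadoop-streaming jar:

```python
from itertools import groupby

# Sketch of a Hadoop Streaming-style word count. The mapper reads raw
# lines and emits "key<TAB>value" lines; Hadoop sorts them by key; the
# reducer reads the sorted lines and emits one line per key. We chain
# the pieces in-process to show the contract.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reducer(sorted_pairs):
    word_of = lambda kv: kv.split("\t")[0]
    for word, group in groupby(sorted_pairs, key=word_of):
        total = sum(int(kv.split("\t")[1]) for kv in group)
        yield word + "\t" + str(total)

data = ["the quick fox", "the lazy dog"]
result = list(reducer(sorted(mapper(data))))
print(result)  # ['dog\t1', 'fox\t1', 'lazy\t1', 'quick\t1', 'the\t2']
```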
  18. You can also use C++ with Hadoop Pipes, but I’m not going to run an example as I am not very familiar with this feature.
  19. Subprojects isn’t the right term for them, but I’m too lazy to come up with something better. These are all Apache top-level projects (with the exception of Flume).
  20. Hadoop is not a single project but an ecosystem of related top-level Apache projects. I’d like to show you some examples of how these projects tie into Hadoop Common.
  21. I know that pig!
  22. Pig uses HDFS for storage, so you don’t need the separate storage solution that some other tools require. It has an interactive shell called Grunt.
  23. ZooKeeper allows you to safely handle partial failures, which are intrinsic to distributed applications.
  24. A comparison of NoSQL solutions isn’t within the scope of this talk, so I’m just going to go over solutions that are technically part of Hadoop. Use HBase when you need real-time random read/write access to your Big Data. We currently use HBase to serve data to web pages much faster than a traditional RDBMS could with the amount of data we have.
  25. We use this for our DNS and Apache logs. You can load into HBase or Hive instead.
  26. In summary
  27. plug survey