Big data week presentation

This is the presentation that I gave for Big Data week

Usage Rights

© All Rights Reserved

  • Thanks for having me here today as part of Big Data week. For a lot of people, Hadoop is big data. Today, I’m here to share my experience as a Hadoop user. I use Hadoop every day at LinkedIn because it helps me get my work done. Ask audience: Who uses Hadoop now? Who is thinking about it? Who sort of knows what Hadoop is for, but isn’t sure how it helps them?
  • Hadoop can help you if you have a gigantic amount of data. You can do things with Hadoop that are hard to do with any other off-the-shelf tool. But Hadoop can be a handful.
  • I’m hoping that you leave here today knowing what Hadoop is.
  • Open source. Java based. A network of commodity servers. Map/reduce.
  • The biggest users are mostly web companies: Amazon builds their search indices on Hadoop. Facebook processes all their usage logs on Hadoop. (They also store photos with HBase.) I bet they do other things as well. Twitter uses Hadoop for data analysis. Yahoo uses Hadoop for many things, including a lot of their advertising models. eBay and Netflix use Hadoop as well. And a lot more people are using Hadoop for some tasks.
  • The source code for Hadoop is freely available and easy to modify. But that doesn’t mean it’s cheap and easy to run. It takes a lot of operational expertise to set up and run a system with hundreds or thousands of computers. Every big Hadoop shop has a team of developers and operations people who keep the system running. We’ve modified the Hadoop scheduler, added extra code for debugging, and fixed quite a few bugs.
  • I have become very good at reading Java stack traces.
  • Hadoop was designed to run on commodity servers. It doesn’t need servers with super-fast processors, huge amounts of memory, solid state disks, or any other exotic features. But that doesn’t mean you should just run down to Fry’s and buy the cheapest computers you can find. Cheap computers fail more often. You need to find a good balance between cost and reliability. By the way, Hadoop runs really well on cloud services.
  • Even really good quality computers fail, and Hadoop was designed to deal with that problem. If the probability of a machine failing on a given day is 1/1000, you’re going to see failures when you have thousands of computers. As a user, you don’t usually have to worry too much about how Hadoop runs your jobs. But sometimes, understanding what Hadoop is doing under the hood can help you make sense of what the system is up to.
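The failure math in that note is easy to check for yourself. This is a back-of-the-envelope sketch, not anything Hadoop-specific: assuming a 1/1000 daily failure rate per machine and a 1,000-machine cluster, the chance that at least one machine fails on a given day works out like this.

```java
public class FailureOdds {
    public static void main(String[] args) {
        double p = 1.0 / 1000.0; // chance that one machine fails on a given day
        int n = 1000;            // machines in the cluster

        // P(at least one failure today) = 1 - P(no machine fails)
        double atLeastOne = 1.0 - Math.pow(1.0 - p, n);
        System.out.printf("P(at least one failure) = %.2f%n", atLeastOne); // ~0.63
    }
}
```

So even though each machine is 99.9% reliable, a cluster this size loses a machine on roughly two days out of three, which is why Hadoop treats failure as routine rather than exceptional.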
  • Let’s talk about each of these things. Hadoop is great for doing all the data munging that you do at the start of a data project.
  • Mentally, this is my hierarchy of tools. As your data gets bigger, it takes more work to use each tool, so I try not to overshoot. [should add in databases, python tools in the middle of R and hadoop] But sometimes, you have to upgrade. For example, suppose that it takes 25 hours to analyze 24 hours of data on your desktop…
  • As we said before, for your problem to fit, it should meet four criteria… one of them is that it has to work with map/reduce. To help explain map/reduce, we’re going to use map/reduce here to do some work. [ask for volunteers]
  • The key is used to group data together and to route it to the right reducer.
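To make the grouping-and-routing idea concrete, here is a toy, in-memory sketch of the map / shuffle / reduce flow in plain Java — no Hadoop involved. The word-count example is an illustrative assumption, but routing each key to a reducer by hashing it is the standard default behavior.

```java
import java.util.*;

// A toy, in-memory sketch of map / shuffle / reduce -- not real Hadoop.
public class MiniMapReduce {
    static Map<String, Integer> run(String[] lines, int numReducers) {
        // Map phase: emit (word, 1); the key's hash picks which "reducer"
        // (partition) the pair is routed to.
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) partitions.add(new TreeMap<>());
        for (String line : lines) {
            for (String word : line.split(" ")) {
                int p = Math.abs(word.hashCode()) % numReducers;
                partitions.get(p).computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce phase: each partition sees all the values for its keys,
        // already grouped together, and just sums them.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<Integer>> partition : partitions) {
            for (Map.Entry<String, List<Integer>> e : partition.entrySet()) {
                int sum = 0;
                for (int one : e.getValue()) sum += one;
                counts.put(e.getKey(), sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {"to be or not to be", "to do is to be"};
        System.out.println(run(lines, 2)); // {be=3, do=1, is=1, not=1, or=1, to=4}
    }
}
```

The important property is the one in the note above: because the routing is a pure function of the key, every record with the same key lands on the same reducer, so the reducer can do its work with only local data.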
  • At LinkedIn, we have hundreds of users on our Hadoop system running dozens of jobs. It’s pretty busy in the middle of the day.Unlike some other tools (like Oracle), Hadoop won’t start working on your problem until earlier jobs finish. It’s a very efficient way to use resources, but it could mean that you have to wait around for a long time.
  • So far, we’ve talked about who uses Hadoop and how Hadoop works. I’d like to show an example of what you see as a Hadoop user: how you write programs for Hadoop. In practice, you might have many input files from many different web servers. Or maybe one giant file. Either way, Hadoop can split up those files to divide the processing work across the cluster.
  • Most Java map/reduce jobs have three parts: a mapper, a reducer, and a job file. I’m going to walk through all three of them here.
  • Here is part of the Java Map/Reduce job for doing this calculation. At this point, it should be clear why we didn’t make this a hands-on session. I’m not going to explain everything that’s going on here, but I’ll point out a few pieces of how this works.
  • All the records for a given key are handled by a single reducer. In this case, that means that all the records for each date will be sent to a single reducer, so all we have to do is count those records.
  • Lastly, you connect everything together with a job file and run it.
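As a rough sketch of those three parts, here is a plain-Java stand-in — not the real Hadoop API — for the date-counting job described above: a "mapper" that pulls the date key out of each log line, a "reducer" that counts the records for one date, and a "driver" that wires them together. The log-line format is made up for illustration.

```java
import java.util.*;

// A plain-Java stand-in for the three parts of a Hadoop job:
// mapper, reducer, and the driver/job file that connects them.
public class DateCountJob {
    // "Mapper": emit the date as the key for each log line.
    static String mapKey(String logLine) {
        return logLine.split(" ")[0]; // assume each line starts with the date
    }

    // "Reducer": all the records for one date arrive together; just count them.
    static int reduce(List<String> recordsForOneDate) {
        return recordsForOneDate.size();
    }

    // "Driver": group mapper output by key, then hand each group to the reducer.
    static Map<String, Integer> run(List<String> logLines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : logLines) {
            grouped.computeIfAbsent(mapKey(line), k -> new ArrayList<>()).add(line);
        }
        Map<String, Integer> hitsPerDate = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            hitsPerDate.put(e.getKey(), reduce(e.getValue()));
        }
        return hitsPerDate;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "2012-04-23 GET /home", "2012-04-23 GET /jobs", "2012-04-24 GET /home");
        System.out.println(run(log)); // {2012-04-23=2, 2012-04-24=1}
    }
}
```

In real Hadoop, the grouping step in the middle is the shuffle, and the driver is the job file that names the mapper class, the reducer class, and the input and output paths.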
  • I’ve probably scared off a lot of people in this room by showing the Java Map/Reduce code. Luckily, there are some simpler ways to solve the problem.
  • One of the coolest things about Cascading is that you can use it from other JVM languages: Jython, JRuby, Clojure, and Scala.
  • You don’t need a lot of software, and you can run it from your workstation.
  • Hive is great, but it takes some work to set it up. It’s great for working with unstructured data… The big disadvantage of Hive is that every operation is a full table scan. With a database like Oracle, data is stored with indexes, so you can quickly look up single values. Hive is good for large calculations, bad for lookups. Another issue with Hive is that it’s not as mature as most databases. You can easily see a Java stack trace.
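The full-scan-versus-index point can be shown in miniature. This is plain Java with toy data, not Hive or Oracle: answering a single-key question by reading every row, versus looking it up through an index built on the key.

```java
import java.util.*;

// Full scan vs. indexed lookup, in miniature (toy data, for illustration).
public class ScanVsIndex {
    // Hive-style: touch every row, even to answer a single-key question.
    static String fullScan(List<String[]> table, String key) {
        for (String[] row : table) {               // O(rows) per query
            if (row[0].equals(key)) return row[1];
        }
        return null;
    }

    public static void main(String[] args) {
        List<String[]> table = Arrays.asList(
            new String[]{"alice", "engineer"},
            new String[]{"bob", "designer"},
            new String[]{"carol", "analyst"});

        // Database-style: pay once to build an index on the key, and then
        // single-value lookups no longer need to read the whole table.
        Map<String, String> index = new HashMap<>();
        for (String[] row : table) index.put(row[0], row[1]);

        System.out.println(fullScan(table, "carol")); // analyst
        System.out.println(index.get("carol"));       // analyst
    }
}
```

Both queries return the same answer; the difference is that the scan cost grows with the table while the indexed lookup does not — which is why Hive suits large batch calculations rather than point lookups.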
  • Wrap up, then more things to know