
Big data week presentation


This is the presentation that I gave for Big Data week



  1. Joseph Adler, April 24, 2012
  2. Don’t use Hadoop. (Unless you have to.)
  3. • What is Hadoop?
     • Why do people use Hadoop?
     • How does it work?
     • When should you consider Hadoop?
  4. What is Hadoop?
     Apache Hadoop is an open source, Java-based system for processing data on a network of commodity servers using a map/reduce paradigm.
  5. How do people use Hadoop?
     A few examples from the Apache site:
     – Amazon search
     – Facebook log storage and reporting
     – LinkedIn’s People You May Know
     – Twitter data analysis
     – Yahoo! uses it for ad targeting
     A search on LinkedIn shows people at financial services, biotech, oil and gas exploration, retail, and other industries are using Hadoop.
  6. Where did Hadoop come from?
     • Hadoop was created by Doug Cutting. It’s named after his son’s toy elephant.
     • Hadoop was written to support Nutch, an open source web search engine. Hadoop was spun out in 2006.
     • Yahoo! invested in Hadoop, bringing it to “web scale” by 2008.
  7. Hadoop is open source
     • Hadoop is an open source project (Apache license)
       – You can download and install it freely
       – You can also compile your own custom version of Hadoop
     • There are three subprojects
  8. Hadoop is written for Java
     • The good news: Hadoop runs on a JVM
       – You can run Hadoop on your workstation (for testing), on a private cluster, or in a cloud
       – You can write Hadoop jobs in Java, or in Scala, JRuby, Jython, Clojure, or any other JVM language
       – You can use other Java libraries
     • The bad news: Hadoop was originally written by and for Java programmers.
       – You can do basic work without knowing Java, but you will quickly get stuck if you can’t write code.
  9. Hadoop runs on a network of servers
  10. Hadoop runs on commodity servers
      • Doesn’t require very fast, very big, or very reliable servers
      • Works better on good quality servers connected through a fast network
      • Hadoop is fault tolerant: multiple copies of data, protection against failed jobs
  11. When should you consider Hadoop?
      • The problem is big
      • The problem fits the map/reduce model
      • You don’t need to compute in real time
      • You have a technical team
  12. Picking the right tool for the job
      [Chart: problem sizes on a log scale from 1 to 1,000,000,000,000, matched to tools: calculator, spreadsheet, numerical software, parallel systems, and an open question beyond that]
  13. Man/Reduce
      • I need 7 volunteers:
        – 4 mappers
        – 3 reducers
      • We’re going to show how map/reduce works by sorting and counting some notes.
  14. What is Map/Reduce?
      • You compute things in two phases
        – The map step
          • Reads the input data
          • Transforms the data
          • Tags each datum with a key and sends it to the right reducer
        – The reduce step
          • Collects all the data for each key
          • Does some work on the data by key
          • Outputs the results
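The two phases above can be sketched in plain Java. This is a toy, single-machine sketch with hypothetical class and method names, not the Hadoop API; real Hadoop spreads the same two phases across many servers.

```java
import java.util.*;

public class MiniMapReduce {

    // Map step: read each input line, tag it with a key (here, the first field).
    static List<Map.Entry<String, String>> map(List<String> lines) {
        List<Map.Entry<String, String>> tagged = new ArrayList<>();
        for (String line : lines) {
            String key = line.split(",")[0];
            tagged.add(new AbstractMap.SimpleEntry<>(key, line));
        }
        return tagged;
    }

    // Shuffle + reduce step: collect the records for each key, then count them.
    static Map<String, Long> reduce(List<Map.Entry<String, String>> tagged) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, String> e : tagged) {
            counts.merge(e.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a,x", "b,y", "a,z");
        System.out.println(reduce(map(input))); // {a=2, b=1}
    }
}
```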
  15. Map/Reduce is over 100 years old
      • Hollerith machines from the 1890 census
  16. Good fits for Map/Reduce
      • Aggregating unstructured data to enter into a database (ETL)
      • Creating email messages
      • Processing log files and creating reports
  17. Problems that don’t perfectly fit
      • Logistic regression
      • Matrix operations
      • Social graph calculations
  18. Batch computation
      Hadoop is a shared system that allocates resources to jobs from a queue. It’s not a real-time system.
  19. Coding example
      Suppose that we had some log files with events by date (say, page views). Let’s count the number of events by day!
      Sample data:
        1335300359000,Home Page, Joe
        1335300359027,Login,
        1335300359031,Home Page, Romy
        1335300369123,Settings, Joe
        …
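The heart of this example is turning the leading epoch-millisecond timestamp into a day-level key. A minimal plain-Java sketch of that step (hypothetical helper, standard library only, assuming UTC and a "yyyy-MM-dd" key format):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayKey {

    // Parse the first comma-separated field as epoch milliseconds
    // and format it as a day-level key.
    static String dayOf(String logLine) {
        long millis = Long.parseLong(logLine.split(",")[0]);
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        System.out.println(dayOf("1335300359000,Home Page, Joe")); // 2012-04-24
    }
}
```

Grouping on this key and counting the lines in each group gives the per-day event counts the slides build next.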
  20. A Java example
      • Mappers will
        – Read the input files
        – Extract the timestamp
        – Round to the nearest day
        – Set the output key to the day
      • Reducers will
        – Iterate through records by day, counting records
        – Output the count for each day
  21. A Java example (Mapper)

      public class exampleMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          String[] values = line.split(",");
          long timeStampLong = Long.parseLong(values[0]);
          DateTime timeStamp = new DateTime(timeStampLong);
          // The slide dropped this initializer; a day-level Joda-Time pattern fits the intent.
          DateTimeFormatter dateFormat = DateTimeFormat.forPattern("yyyy-MM-dd");
          output.collect(new Text(dateFormat.print(timeStamp)), new Text(line));
        }
      }
  22. A Java example (Reducer)

      public class exampleReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, LongWritable> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, LongWritable> output,
                           Reporter reporter) throws IOException {
          long count = 0;
          // The slide omitted values.next(), which would loop forever.
          while (values.hasNext()) {
            values.next();
            count++;
          }
          output.collect(key, new LongWritable(count));
        }
      }
  23. A Java example (job file)

      public class exampleJob extends Configured implements Tool {
        @Override
        public int run(String[] arg0) throws Exception {
          JobConf conf = new JobConf(getConf(), getClass());
          conf.setJobName("Count events by date");
          conf.setInputFormat(TextInputFormat.class);
          TextInputFormat.addInputPath(conf, new Path(arg0[0]));
          conf.setOutputFormat(TextOutputFormat.class);
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(LongWritable.class);
          TextOutputFormat.setOutputPath(conf, new Path(arg0[1]));
          conf.setMapperClass(exampleMapper.class);
          conf.setReducerClass(exampleReducer.class);
          JobClient.runJob(conf);
          return 0;
        }
      }
  24. Tools that make it easier to use Hadoop:
      – Hive
      – Pig
      – Cascading
  25. Cascading
      • Tool for constructing Hadoop workflows in Java
      • Example:

        Scheme pvScheme = new TextLine(new Fields("timestamp", …));
        Tap source = new Hfs(pvScheme, inpath);
        Scheme countScheme = new TextLine(new Fields("date", "count"));
        Tap sink = new Hfs(countScheme, outpath);
        Pipe assembly = new Pipe("pagesByDate");
        Function function = new DateFormatter(new Fields("date"), "yyyy/MM/dd");
        assembly = new Each(assembly, new Fields("timestamp"), function);
        assembly = new GroupBy(assembly, new Fields("date"));
        Aggregator count = new Count(new Fields("count"));
        assembly = new Every(assembly, count);
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);
        FlowConnector flowConnector = new FlowConnector(properties);
        Flow flow = flowConnector.connect("pagesByDate", source, sink, assembly);
        flow.complete();
  26. Pig
      • Tool to write SQL-like queries against Hadoop
      • Example:

        define TODATE org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
        %declare now `date "+%s000"`;
        page_views = LOAD 'PAGEVIEWS' USING PigStorage()
            AS (timestamp:long, page:chararray, user:chararray);
        last_week = FILTER page_views BY timestamp > $now - 86400000 * 7;
        truncated = FOREACH last_week GENERATE *, TODATE(timestamp) as date;
        grouped = GROUP truncated BY date;
        counted = FOREACH grouped GENERATE group as date, COUNT_STAR(truncated) as N;
        sorted = ORDER counted BY date;
        STORE sorted INTO 'results' USING PigStorage();
  27. Hive
      • Tool from Facebook that lets you write SQL queries against Hadoop
      • Example code:

        SELECT TO_DATE(timestamp), COUNT(*)
        FROM PAGEVIEWS
        WHERE timestamp > (unix_timestamp() - 86400 * 7) * 1000
        GROUP BY TO_DATE(timestamp)
        ORDER BY TO_DATE(timestamp);
  28. Some important related projects
      • HBase
      • NextGen Hadoop (0.23)
      • ZooKeeper
      • Mahout
      • Giraph
  29. What to do next
      • Watch training videos at
      • Get Hadoop (including the code!) at
      • Get commercial support from or
      • Run it in the cloud with Amazon Elastic MapReduce: