
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time

Flink Forward 2015


  1. Beyond MapReduce: Scientific Data Processing in Real-time. Chris Hillman, October 13th 2015, chillman@dundee.ac.uk
  2. Proteomics. Genome: ~21,000 genes. Proteome: 1,000,000+ proteins.
  3. Mass Spectrometry. Each experiment produces a 7 GB XML file containing 40,000 scans and 600,000,000 data points in approximately 100 minutes. Data processing can take over 24 hours:
     • Pick 2D peaks
     • De-isotope
     • Pick 3D peaks
     • Match weights to known peptides
  4. Mass Spectrometry. The new lab has 12 machines. That's a lot of data and a lot of data processing.
  5. Parallel Computing
  6. Parallel Processing. Amdahl's Law: the serial portion is fixed, so it bounds the achievable speedup. Gustafson's Law: the size of the problem is not fixed and grows with the available resources. Gunther's Universal Scalability Law: contention and coherency costs limit linear scalability. (The standard formulas are given below.)
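For reference, these are the standard closed forms of the three laws; they are not spelled out on the slide. N is the number of processors, p the parallelisable fraction of the work, s = 1 - p the serial fraction, and sigma and kappa are Gunther's contention and coherency coefficients.

    S_{\mathrm{Amdahl}}(N)    = \frac{1}{(1 - p) + p/N}
    S_{\mathrm{Gustafson}}(N) = s + pN = N - s(N - 1)
    C_{\mathrm{USL}}(N)       = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}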
  7. Working Environment
  8. Parallel Algorithm. 2D peak picking fits well into a Map task (a decoding sketch follows below):
     • Read the scan into memory
     • Decode the base64 float array
     • Peak pick and detect isotopic envelopes
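As a rough illustration of the decoding step, here is a minimal Java sketch of unpacking one base64-encoded peak array. It assumes uncompressed 64-bit little-endian floats in a single binary block; real mzML/mzXML data may use 32-bit precision, network byte order or zlib compression, so this is a simplification rather than the author's actual code.

     import java.nio.ByteBuffer;
     import java.nio.ByteOrder;
     import java.util.Base64;

     public class ScanDecoder {
         // Decode one base64 block into alternating m/z and intensity values.
         public static double[] decodePeaks(String base64Block) {
             byte[] raw = Base64.getDecoder().decode(base64Block);
             ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
             double[] values = new double[raw.length / 8];
             for (int i = 0; i < values.length; i++) {
                 values[i] = buf.getDouble();
             }
             return values;
         }
     }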
  9. Parallel Algorithm. 3D peak picking fits well into a Reduce task:
     • Receive partitions of 2D peaks
     • Detect 3D peaks and isotopic envelopes
     • Output peak mass and intensity
  10. Issues. XML is not a good format for parallel processing: a large XML document cannot simply be split at arbitrary byte offsets across mappers without breaking its structure, so the scans have to be extracted or the data transformed first.
  11. Issues. [Chart; axis tick values omitted.]
  12. Issues. Data shuffle and skew on the cluster. [Histogram; axis values omitted.] (A partitioner sketch follows below.)
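The job on slide 17 sets a custom MZPartitioner, presumably so that neighbouring m/z values land on the same reducer for 3D peak detection, which is also where the skew comes from: some m/z ranges are far denser than others. Below is a hypothetical sketch of such a range-based Hadoop partitioner; the bin width and the interpretation of the key are assumptions, not the author's implementation.

     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Partitioner;

     public class MZRangePartitioner extends Partitioner<IntWritable, Text> {
         // Assumed upper bound of the integer m/z bin key space.
         private static final int MAX_MZ_BIN = 2000;

         @Override
         public int getPartition(IntWritable key, Text value, int numPartitions) {
             // Route contiguous ranges of m/z bins to the same reducer instead of hashing.
             int bin = Math.min(Math.max(key.get(), 0), MAX_MZ_BIN);
             return (int) ((long) bin * numPartitions / (MAX_MZ_BIN + 1));
         }
     }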
  13. Results
  14. MapReduce: Map, Shuffle, Reduce.
  15. MapReduce. Transforming the XML and writing the modified data to:
     • HDFS
     • HBase
     • Cassandra
     Executing the MapReduce code reading from the above. The batch process shows the potential to speed up the current process by scaling the size of the cluster running it.
  16. Flink. Experiences so far: very easy to install, very easy to understand, good documentation, very easy to adapt the current code. I like it!
  17. MR Job

     public class PeakPick extends Configured implements Tool {

         // Tool entry point: configure and submit the MapReduce job.
         @Override
         public int run(String[] args) throws Exception {
             Job job = new Job(getConf(), "peakpick");
             job.setJarByClass(PeakPick.class);
             job.setNumReduceTasks(104);
             job.setOutputKeyClass(IntWritable.class);
             job.setOutputValueClass(Text.class);
             job.setMapperClass(MapHDFS.class);
             job.setReducerClass(ReduceHDFS.class);
             job.setInputFormatClass(TextInputFormat.class);
             job.setOutputFormatClass(TextOutputFormat.class);
             job.setPartitionerClass(MZPartitioner.class);
             FileInputFormat.setInputPaths(job, new Path(args[1]));
             FileOutputFormat.setOutputPath(job, new Path(args[2]));
             job.waitForCompletion(true);
             return job.isSuccessful() ? 0 : 1;
         }

         public static void main(String[] args) throws Exception {
             int res = ToolRunner.run(new Configuration(), new PeakPick(), args);
             System.exit(res);
         }
     }
  18. MR Read

     public class MapHDFS extends Mapper<LongWritable, Text, IntWritable, Text> {
         public void map(LongWritable key, Text value, Context context) {
             String inputLine = value.toString();
             tempStr = inputLine.split("\t");   // tab-separated scan record
             scNumber = tempStr[1];
             ……
             intensityString = tempStr[8];
         }
     }

     public class MapCassandra extends Mapper<ByteBuffer, SortedMap<ByteBuffer, Cell>, Text, IntWritable> {
         public void map(ByteBuffer key, SortedMap<ByteBuffer, Cell> columns, Context context) {
             scNumber = String.valueOf(key.getInt());
             for (Cell cell : columns.values()) {
                 String name = ByteBufferUtil.string(cell.name().toByteBuffer());
                 if (name.contains("scan"))    scNumber = String.valueOf(ByteBufferUtil.toInt(cell.value()));
                 if (name.contains("mslvl"))   scLevel = String.valueOf(ByteBufferUtil.toInt(cell.value()));
                 if (name.contains("rettime")) RT = String.valueOf(ByteBufferUtil.toDouble(cell.value()));
             }
         }
     }
  19. MR Write

     public class MapHDFS extends Mapper<LongWritable, Text, IntWritable, Text> {
         public void map(LongWritable key, Text value, Context context) {
             …………
             for (int i = 0; i < outputPoints.size(); i++) {
                 mzStringOut = scNumber + "\t" + scLevel + "\t" + RT + "\t"
                         + Integer.toString(outputPoints.get(i).getCurveID()) + "\t"
                         + Double.toString(outputPoints.get(i).getWpm());
                 context.write(new IntWritable(outputPoints.get(i).getKey()), new Text(mzStringOut));
             }
         }
     }

     public class ReduceHDFS extends Reducer<IntWritable, Text, IntWritable, Text> {
         public void reduce(IntWritable key, Iterable<Text> values, Context context) {
             …………
             for (int k = 0; k < MonoISO.size(); k++) {
                 outText = MonoISO.get(k).getCharge() + "\t" + MonoISO.get(k).getWpm() + "\t"
                         + MonoISO.get(k).getSumI() + "\t" + MonoISO.get(k).getWpRT();
                 context.write(new IntWritable(0), new Text(outText));
             }
         }
     }
  20. Flink Job

     public class PeakPickFlink_MR {
         public static void main(String[] args) throws Exception {
             final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
             Job job = Job.getInstance();

             // Set up the Hadoop input format
             HadoopInputFormat<LongWritable, Text> hadoopIF = new HadoopInputFormat<LongWritable, Text>(
                     new TextInputFormat(), LongWritable.class, Text.class, job);
             TextInputFormat.addInputPath(job, new Path(args[0]));

             // Read HDFS data
             DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopIF);
  21. Flink Job

             // Use the Hadoop Mapper as a Flink MapFunction
             DataSet<Tuple2<IntWritable, Text>> result = text
                     .flatMap(new HadoopMapFunction<LongWritable, Text, IntWritable, Text>(new MapHDFS()))
                     .groupBy(0)
                     // Use the Hadoop Reducer
                     .reduceGroup(new HadoopReduceFunction<IntWritable, Text, IntWritable, Text>(new ReduceHDFS()));
  22. Flink Job

             // Set up the Hadoop TextOutputFormat
             HadoopOutputFormat<IntWritable, Text> hadoopOF = new HadoopOutputFormat<IntWritable, Text>(
                     new TextOutputFormat<IntWritable, Text>(), job);

             // Write results back to HDFS
             hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
             TextOutputFormat.setOutputPath(job, new Path(args[1]));

             // Emit data using the Hadoop TextOutputFormat
             result.output(hadoopOF).setParallelism(1);

             // Execute
             env.execute("Hadoop PeakPick");
         }
     }
  23. Interim Results
                           Hadoop      Flink
     Mapper only           12m 25s     4m 50s
     Mapper and Reducer    28m 32s     10m 20s
     Existing code: 35m 22s
  24. Near real-time? Still not fast enough to be called near real-time. Processing x scans per second: if the cluster were big enough, then maybe… But the mass spectrometer takes 100 minutes to complete its processing for one experiment (40,000 scans, roughly 6-7 scans per second), so in fact there is more than enough time to process the data if the results are streamed and processed as they are produced…
  25. Streaming the data. Simulate streaming data using an existing data file and Kafka. Ingest the data using the Flink Streaming API and process the scans using the existing mapper code. Existing data file → Kafka → Flink Streaming API. (A sketch of the ingest side follows below.)
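A minimal sketch of that ingest path, written against the current Flink DataStream API and Kafka connector rather than whichever 2015-era classes the author used; the topic name, broker address and group id are placeholders, and the existing mapper logic would be plugged in where the comment indicates.

     import java.util.Properties;

     import org.apache.flink.api.common.serialization.SimpleStringSchema;
     import org.apache.flink.streaming.api.datastream.DataStream;
     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
     import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

     public class PeakPickStreamJob {
         public static void main(String[] args) throws Exception {
             StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

             Properties props = new Properties();
             props.setProperty("bootstrap.servers", "localhost:9092");
             props.setProperty("group.id", "peakpick");

             // One Kafka message per scan, replayed from an existing data file.
             DataStream<String> scans = env.addSource(
                     new FlinkKafkaConsumer<>("scans", new SimpleStringSchema(), props));

             // The existing 2D peak-picking mapper logic would be applied here,
             // e.g. scans.flatMap(...); for this sketch we simply echo the scans.
             scans.print();

             env.execute("PeakPick streaming");
         }
     }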
  26. Streaming the results. A peptide elutes over a period of time, which means the data from many scans needs to be compared at the same time. A safe window over which to measure the quantity of a peptide is 10 seconds.
  27. Interim Results. Overlapping 10-second windows capture the 3D peaks from the 2D scans.
  28. Interim Results. Processing the entire scan within the 10-second window means that the overlapping window and the de-duplication step are no longer needed. (See the windowing sketch below.)
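To make the two window shapes on slides 27 and 28 concrete, here is a self-contained sketch using Flink's window assigners. Plain tab-separated strings stand in for the 2D peak records, and the socket source and key field are placeholders; only the window definitions correspond to what the slides describe.

     import org.apache.flink.streaming.api.datastream.DataStream;
     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
     import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
     import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
     import org.apache.flink.streaming.api.windowing.time.Time;

     public class WindowShapes {
         public static void main(String[] args) throws Exception {
             StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

             // Stand-in for the stream of 2D peaks: tab-separated lines keyed by an m/z bin.
             DataStream<String> peaks = env.socketTextStream("localhost", 9999);

             // Slide 27: overlapping 10 s windows sliding every 5 s. Peptides that straddle
             // a window boundary are still captured, at the cost of a de-duplication step.
             peaks.keyBy(line -> line.split("\t")[0])
                  .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                  .reduce((a, b) -> a + "\n" + b)
                  .print();

             // Slide 28: plain 10 s tumbling windows. Once each whole scan is processed
             // inside a single window, the overlap and the de-duplication step disappear.
             peaks.keyBy(line -> line.split("\t")[0])
                  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                  .reduce((a, b) -> a + "\n" + b)
                  .print();

             env.execute("window shapes");
         }
     }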
  29. Stream Processing. All this means that the data will be fully pre-processed just over 10 seconds after the mass spectrometer completes the experiment. Near real-time?
  30. To Do list
     • Complete a stable working system
     • Contrast with Spark and Storm
     • Hook up previous research on database lookup to create a complete system
     • Pay for some EC2 system time to complete testing
     • Write a thesis…
  31. Questions? chillman@dundee.ac.uk  @chillax7  http://www.thequeensarmskensington.co.uk
