
15 minute presentation about Thesis


  1. Too much Data! Sven Meys. Saturday 9 February 13
  2. Topic: On-demand Information Extraction from Remote Sensing Images with MapReduce
  3. Contents • Context • Literature study • Planning
  4. Context • VITO • Remote Sensing • Problem statement • Research questions
  5. 16%, 700, €103 million, 84%; Government, Private
  6. Energy • Industrial Innovation • Quality of Environment. Units: Environmental Analysis & Technology • Separation & Conversion Technology • Transition Energy & Environment • Material Technology • Remote Sensing • Environmental Modelling • Environmental Health • Energy Technology
  7. Context • VITO • Remote Sensing • Problem statement • Research questions
  8. [Image only]
  9. [Image only]
  10. Remote Sensing
  11. 1 km² per pixel • 0.5 billion pixels • 1.2 GB
  12. RS applications
  13. Time Series: 01-01-2001 to 01-01-2012 • Algorithm: NDVI • Output: Mean • SUBMIT
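
The request on slide 13 (NDVI over a time series, output as a mean) can be sketched for a single pixel as follows. This is a minimal illustration, not the presenter's implementation: it assumes the standard NDVI definition (NIR - RED) / (NIR + RED), and the function names and reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """NDVI for one pixel: (NIR - RED) / (NIR + RED).

    Returns 0.0 when both bands are zero to avoid dividing by zero
    (a convention chosen for this sketch).
    """
    total = nir + red
    return (nir - red) / total if total else 0.0


def mean_ndvi(series):
    """Mean NDVI of one pixel over time.

    `series` is a non-empty list of (nir, red) reflectance pairs,
    one pair per acquisition date.
    """
    values = [ndvi(nir, red) for nir, red in series]
    return sum(values) / len(values)


# Hypothetical reflectances for one pixel on two dates.
samples = [(0.5, 0.1), (0.6, 0.2)]
print(mean_ndvi(samples))
```

Because each pixel is computed independently of its neighbours, this kind of per-pixel time-series request splits naturally across parallel workers.
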
  14. Context • VITO • Remote Sensing • Problem statement • Research questions
  15. Problem statement • Better images • Better sensors • More information • More expensive storage • More data • Data transport • Expensive supercomputers • More computation • Parallel processing
  16. Objectives • Fast enough • Affordable • Scalable. File system + software framework
  17. Research questions • How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel? • Which algorithms can be used with this storage technique and MapReduce?
  18. Contents • Context • Literature study • Planning
  19. Literature study • Interesting projects • HDFS • MapReduce • Implementations • Distributions • Current literature
  20. Interesting projects • NASA (12) • Center for Climate Simulation • Square Kilometer Array: 700 TB/sec • Open Cloud Consortium (13) • Project Matsu: Elastic Clouds for Disaster Relief • Large Hadron Collider (14) • 20 PB/year
  21. HDFS • Distributed file system • Based on the Google File System (1) • Large blocks (128 MiB) • Commodity hardware • Failure is the norm • Read & append (1). [Figure: nodes 1, 2, ..., n]
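
To make the block size concrete, the sketch below counts how many 128 MiB blocks the 1.2 GB image from slide 11 would occupy. It is only a back-of-the-envelope calculation under stated assumptions: "1.2 GB" is read as 1.2 × 10⁹ bytes, and the replication factor of 3 is the usual HDFS default rather than a figure taken from the slides.

```python
BLOCK_SIZE = 128 * 1024 ** 2  # 128 MiB, the block size named on slide 21


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size_bytes + block_size - 1) // block_size


image_bytes = 1_200_000_000  # assumption: 1.2 GB read as decimal gigabytes
replication = 3              # assumption: common HDFS default, not from the slides

blocks = block_count(image_bytes)
print(blocks)                # 9
print(blocks * replication)  # 27 block replicas spread over the cluster
```
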
  22. HDFS. A DFS usually accounts for transparent file replication and fault [...] data locality for processing tasks. A DFS does this by subdividing [...] these blocks within a cluster of computers. Figure 2 shows the distribution [...] of a file (left) subdivided into three blocks. Figure 2: File blocks, distribution and replication in a distributed file system
  23. HDFS ([...]onsult GmbH). Figure 4: Block assembly for data retrieval from the distributed file system
  24. HDFS. [...]rates how the file system handles node failure by automated recov[...]. HDFS further uses checksums to verify block integrity. As long as th[...] accessible copy of a block, it can automatically re-replicate to return [...]tion rate. Figure 3: Automatic repair in case of cluster node failure by additional replication
  25. HDFS - Overview • Scalable • Fast read/write • Robust • Factor of 10 cheaper (2)
  26. MapReduce
  27. MapReduce - WordCount
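
The WordCount slide is image-only in this transcript, so the following is a minimal single-process sketch of the idea rather than the slide's content. It imitates the map, shuffle and reduce steps in plain Python; it does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["too much data", "much data"]))
# {'too': 1, 'much': 2, 'data': 2}
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves data between machines; here everything happens in one process.
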
  28. MapReduce - Overview • Based on Google MapReduce (3) • Data locality • Key/value pairs • Very fast • A different way of thinking
  29. Implementations • Hadoop: support +, extensions +, community +++, target ANY • Stratosphere: support -, extensions -, community +/-, target EDU • HPCC: support +, extensions ?, community -, target BI • Apache Software Foundation • Others: outdated, commercial, little support (4-6)
  30. Distributions (8) • Hortonworks (7) • Cloudera: Cloudera Manager (9) • Web interface • 1-click install (yeah right...) • Interesting licence model
  31. General • Mostly text processing • For small images (10) • Little detail • Commercial (11)
  32. Contents • Context • Literature study • Planning
  33. Planning. Timeline: literature • phase 1 • phase 2 • phase 3 • phase 4. Dates: 01/09 • 01/02 • 15/03 • 20/05. Markers: internship report • today • submit master's thesis
  34. Phase 1 - Done. Cluster diagram: Sven (192.168.10.248): TT, DN • Master (192.168.10.245): JT, NN • Bruno (192.168.10.246): TT, DN • Tim (192.168.10.247): TT, DN • Patrick (192.168.10.249): TT, DN. Legend: JT = Job Tracker • NN = Name Node • TT = Task Tracker • DN = Data Node • node icons: RedHat 6.2 workstation (three nodes are labelled Workstation) or RedHat 6.2 virtual machine
  35. Phase 2 • Simple algorithm • Rotating an image • Standard I/O • HDFS
  36. Phase 3 • More complexity: MapReduce • Spatial: convolution mask, ROI • Temporal/spectral: multiple images
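
As an illustration of the spatial case named in phase 3, the sketch below applies a convolution mask to a small image in plain Python. It is an assumption-laden example, not the thesis code: the mask must be square with odd size, borders are handled by clamping to the nearest edge pixel, and the function name is hypothetical.

```python
def apply_mask(image, mask):
    """Apply a square, odd-sized mask to a 2D image (list of rows).

    The mask is not flipped, so strictly speaking this is a correlation;
    for symmetric masks such as a mean filter the result is identical
    to a convolution. Pixels outside the image are clamped to the edge.
    """
    height, width = len(image), len(image[0])
    radius = len(mask) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row index
                    xx = min(max(x + dx, 0), width - 1)   # clamp column index
                    acc += image[yy][xx] * mask[dy + radius][dx + radius]
            out[y][x] = acc
    return out


mean_mask = [[1 / 9] * 3 for _ in range(3)]  # 3x3 mean filter
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(apply_mask(image, mean_mask)[1][1])  # centre pixel: mean of all nine values
```

Unlike a per-pixel index, each output pixel here depends on its neighbours, so pixels near the edge of a storage block need data from adjacent blocks. That dependency is what makes the spatial case harder to parallelise.
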
  37. Phase 4 • Performance as a function of pixel distance
  38. Planning. Timeline: literature • phase 1 • phase 2 • phase 3 • phase 4. Dates: 01/09 • 01/02 • 15/03 • 20/05. Markers: internship report • today • submit master's thesis
  39. The End • Lots of data • Thinking differently • Many possibilities • RLZ or a new elective, Big Data? ;) • MapReduce + OpenCL? • Many challenges • Many questions
  40. References
      (1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), 'The Google file system'
      (2) Krishnan, S., Baru, C. and Crosby, C. (2010), 'Evaluation of MapReduce for gridding LIDAR data'
      (3) Dean, J. and Ghemawat, S. (2004), 'MapReduce: simplified data processing on large clusters'
      (4) http://hadoop.apache.org/
      (5) Warneke, D. and Kao, O. (2009), 'Nephele: Efficient parallel data processing in the cloud', http://www.stratosphere.eu
      (6) http://hpccsystems.com/
      (7) http://hortonworks.com/
      (8) http://mapr.com/
      (9) http://cloudera.com/
      (10) Sweeney, C. (2011), 'HIPI: Hadoop image processing interface for image-based MapReduce'
      (11) Guinan, O. (2011), 'Indexing the earth - large scale satellite image processing using Hadoop', http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
      (12) Duffy, D. Q. (2013), 'Untangling the computing landscape for NASA climate simulations', http://www.nas.nasa.gov/SC12/demos/demo20.html
      (13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
      (14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), 'ATLAS data management accounting with Hadoop Pig and HBase'
