15 minute presentation about Thesis


  1. Too much Data! • Sven Meys • Saturday 9 February 13
  2. Subject • On-demand Information Extraction from Remote Sensing Images with MapReduce
  3. Contents • Context • Literature study • Planning
  4. Context • VITO • Remote Sensing • Problem statement • Research questions
  5. 700 • €103 million • 16% / 84% (chart legend: Government, Private)
  6. Organisation chart: Energy • Industrial Innovation • Quality of Environment; Energy Technology • Transition Energy & Environment • Separation & Conversion Technology • Material Technology • Environmental Analysis & Technology • Remote Sensing • Environmental Modelling • Environmental Health
  7. Context • VITO • Remote Sensing • Problem statement • Research questions
  8. (no text)
  9. (no text)
  10. Remote Sensing
  11. 1 km² per pixel • 0.5 billion pixels • 1.2 GB
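
As a quick sanity check on the slide's figures, the sketch below derives the storage per pixel and compares the covered area with the Earth's surface. The Earth's surface area (about 510 million km²) and the decimal reading of "GB" are assumptions added here, not values from the presentation.

```python
# Figures taken from the slide.
KM2_PER_PIXEL = 1.0
PIXELS = 0.5e9
IMAGE_BYTES = 1.2e9  # assumption: "GB" read as 10**9 bytes

# Assumption (not on the slide): Earth's total surface is roughly 510 million km^2.
EARTH_SURFACE_KM2 = 510e6

covered_km2 = PIXELS * KM2_PER_PIXEL
bytes_per_pixel = IMAGE_BYTES / PIXELS
coverage_ratio = covered_km2 / EARTH_SURFACE_KM2

print(f"area covered: {covered_km2:.3g} km^2")
print(f"bytes per pixel: {bytes_per_pixel:.2f}")
print(f"share of Earth's surface: {coverage_ratio:.0%}")
```

Under these assumptions, one image at this resolution is on the order of a single global coverage, at a little over two bytes per pixel.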
  12. RS applications
  13. Time Series: 01-01-2001 to 01-01-2012 • Algorithm: NDVI • Output: Mean • [SUBMIT]
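
The request on this slide (NDVI over a time series, with the mean as output) can be sketched as follows. NDVI is the standard index (NIR - Red) / (NIR + Red); the function names, the list-of-lists image layout, and the sample reflectance values are hypothetical and only serve the illustration.

```python
def ndvi(nir, red):
    """NDVI for one pixel; returns 0.0 when both bands are zero to avoid dividing by zero."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def mean_ndvi_over_time(scenes):
    """Per-pixel mean NDVI over a time series.

    scenes: list of (nir_band, red_band) pairs, each band a 2D list of the same shape.
    """
    rows = len(scenes[0][0])
    cols = len(scenes[0][0][0])
    sums = [[0.0] * cols for _ in range(rows)]
    for nir_band, red_band in scenes:
        for r in range(rows):
            for c in range(cols):
                sums[r][c] += ndvi(nir_band[r][c], red_band[r][c])
    count = len(scenes)
    return [[value / count for value in row] for row in sums]


# Two hypothetical 1x2 scenes (made-up reflectance values).
scenes = [
    ([[0.8, 0.5]], [[0.2, 0.5]]),
    ([[0.6, 0.3]], [[0.2, 0.1]]),
]
mean_image = mean_ndvi_over_time(scenes)
print(mean_image)
```

Each pixel is computed independently of its neighbours, so this kind of request splits naturally across parallel tasks.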
  14. Context • VITO • Remote Sensing • Problem statement • Research questions
  15. Problem statement • Better images • Better sensors • More information • More expensive storage • More data • Data transport • Expensive supercomputers • More computation • Parallel processing
  16. Objectives • Fast enough • Affordable • Scalable • File system + software framework
  17. Research questions • How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel? • Which algorithms can be used with this storage technique and MapReduce?
  18. Contents • Context • Literature study • Planning
  19. Literature study • Interesting projects • HDFS • MapReduce • Implementations • Distributions • Current literature
  20. Interesting projects • NASA (12) • Center for Climate Simulation • Square Kilometer Array: 700 TB/sec • Open Cloud Consortium (13) • Project Matsu: Elastic Clouds for Disaster Relief • Large Hadron Collider (14) • 20 PB/year
  21. HDFS • Distributed file system • Based on the Google File System (1) • Large blocks (128 MiB) • Commodity hardware • Failure = the norm • Read & append (1) • (diagram: nodes 1, 2, ..., n)
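
To make the block size concrete, the sketch below splits a file into fixed-size blocks and applies it to the 1.2 GB image size quoted earlier in the deck. Combining those two figures is an assumption made here for illustration, and `split_into_blocks` is a hypothetical helper, not an HDFS API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# Assumption: the 1.2 GB image is read as 1.2 * 10**9 bytes.
blocks = split_into_blocks(1_200_000_000)
print(len(blocks), "blocks; last block holds", blocks[-1][1], "bytes")
```

Only the last block is shorter than the block size; the others are full 128 MiB units that can be handed to separate tasks.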
  22. HDFS • "A DFS usually accounts for transparent file replication and fault [...] data locality for processing tasks. A DFS does this by [...] blocks within a cluster of computers. Figure 2 shows the [...] of a file (left) subdivided into three blocks." • Figure 2: File blocks, distribution and replication in a distributed file system
  23. HDFS • Figure 4: Block assembly for data retrieval from the distributed file system
  24. HDFS • "[...] how the file system handles node-failure by automated recov[...] HDFS further uses checksums to verify block integrity. As long as th[...] accessible copy of a block, it can automatically re-replicate to return [...] rate." • Figure 3: Automatic repair in case of cluster node failure by additional replication
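
The replication and repair behaviour in these figures can be imitated with a small in-memory model. This is only a sketch: the round-robin placement, the node names, and the helper functions are assumptions for illustration and do not reproduce HDFS's actual placement policy.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Round-robin placement: each block is stored on `replication` distinct nodes."""
    placement = {node: set() for node in nodes}
    for i, block in enumerate(block_ids):
        for k in range(replication):
            placement[nodes[(i + k) % len(nodes)]].add(block)
    return placement


def replica_count(placement, block):
    """Number of nodes currently holding a copy of `block`."""
    return sum(block in blocks for blocks in placement.values())


def fail_node(placement, failed, replication=3):
    """Drop a node, then copy its blocks to surviving nodes that lack them."""
    lost = placement.pop(failed)
    for block in sorted(lost):
        # Surviving nodes without this block, least loaded first.
        candidates = sorted(
            (node for node, blocks in placement.items() if block not in blocks),
            key=lambda node: (len(placement[node]), node),
        )
        while replica_count(placement, block) < replication and candidates:
            placement[candidates.pop(0)].add(block)
    return placement


# Hypothetical five-node cluster holding a file of three blocks.
nodes = ["n0", "n1", "n2", "n3", "n4"]
cluster = place_blocks([1, 2, 3], nodes)
cluster = fail_node(cluster, "n2")
print(cluster)
```

As long as one copy of a block survives and enough nodes remain, the replica count returns to its target after a failure, which is the behaviour Figure 3 depicts.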
  25. HDFS - Overview • Scalable • Fast reads/writes • Robust • A factor of 10 cheaper (2)
  26. MapReduce
  27. MapReduce - WordCount
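
The slide content for WordCount was an image, so the following is a minimal in-memory imitation of the map, shuffle, and reduce steps. It does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


result = word_count(["too much data", "much more data"])
print(result)
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves the intermediate pairs over the network to the reducers.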
  28. MapReduce - Overview • Based on Google MapReduce (3) • Data locality • Key/value pairs • Very fast • A different way of thinking
  29. Implementations
                   Hadoop   Stratosphere   HPCC
      Support      +        -              +
      Extensions   +        -              ?
      Community    +++      +/-            -
      Target       ANY      EDU            BI
      • Apache Software Foundation • Others: outdated, commercial, little support (4-6)
  30. Distributions • Hortonworks (7) • MapR (8) • Cloudera: Cloudera Manager (9) • Web interface • 1-click install (yeah right...) • Interesting licence model
  31. General • Mostly text processing • For small images (10) • Little detail • Commercial (11)
  32. Contents • Context • Literature study • Planning
  33. Planning • Timeline: literature, phase 1, phase 2, phase 3, phase 4 • Dates: 01/09, 01/02, 15/03, 20/05 • Markers: internship report, today, master's thesis submission
  34. Phase 1 - Done • Cluster diagram: Sven (192.168.10.248; TT, DN) • Master (JT, NN), Bruno (TT, DN), Tim (TT, DN), Patrick (TT, DN); addresses 192.168.10.245, 192.168.10.246, 192.168.10.247, 192.168.10.249 • Legend: JT = Job Tracker, NN = Name Node, TT = Task Tracker, DN = Data Node • Machine types: RedHat 6.2 workstation, RedHat 6.2 virtual machine
  35. Phase 2 • Simple algorithm • Rotate an image • Standard I/O • HDFS
  36. Phase 3 • More complexity: MapReduce • Spatial: convolution mask, ROI • Temporal/spectral: multiple images
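
A convolution mask is a spatial operation because every output pixel depends on its neighbours. The sketch below applies a square mask to a small image in plain Python. The zero treatment of pixels outside the image and the 3x3 mean mask are assumptions chosen for brevity, and the mask is applied without flipping, which gives the same result for symmetric masks such as this one.

```python
def apply_mask(image, mask):
    """Apply an odd-sized square mask to a 2D list; pixels outside the image count as 0."""
    rows, cols = len(image), len(image[0])
    radius = len(mask) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * mask[dr + radius][dc + radius]
            out[r][c] = acc
    return out


# Hypothetical 3x3 image of ones and a 3x3 mean mask.
image = [[1.0] * 3 for _ in range(3)]
mean_mask = [[1.0 / 9.0] * 3 for _ in range(3)]
smoothed = apply_mask(image, mean_mask)
print(smoothed)
```

The mask radius determines how far from a pixel data must be read, so a task that processes one tile of an image also needs a border of that width from the neighbouring tiles.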
  37. Phase 4 • Performance as a function of pixel distance
  38. Planning • Timeline: literature, phase 1, phase 2, phase 3, phase 4 • Dates: 01/09, 01/02, 15/03, 20/05 • Markers: internship report, today, master's thesis submission
  39. The End • Lots of data • Thinking differently • Many possibilities • RLZ or a new elective course on Big Data? ;) • MapReduce + OpenCL? • Many challenges • Many questions
  40. References
      (1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google file system’
      (2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LIDAR data’
      (3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
      (4) http://hadoop.apache.org/
      (5) Warneke, D. and Kao, O. (2009), ‘Nephele: Efficient parallel data processing in the cloud’, http://www.stratosphere.eu
      (6) http://hpccsystems.com/
      (7) http://hortonworks.com/
      (8) http://mapr.com/
      (9) http://cloudera.com/
      (10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
      (11) Guinan, O. (2011), ‘Indexing the earth - large scale satellite image processing using Hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
      (12) Duffy, D. Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’, http://www.nas.nasa.gov/SC12/demos/demo20.html
      (13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
      (14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’
