This document discusses using MapReduce and HDFS to efficiently process large remote sensing images in parallel. It provides context on the large volume of data produced by remote sensing (e.g. 1.2 GB for a 1 km resolution image of 0.5 billion pixels) and the challenges of storage, transport and processing. It reviews literature on related projects processing large datasets and the key concepts of HDFS for robust distributed storage and MapReduce for parallel processing. Finally, it outlines a planned approach that starts with simple algorithms and expands to more complex spatial and temporal processing.
This document discusses HIPI, a computer vision framework for processing large image datasets in a distributed manner. It introduces HIPI's MapReduce-based workflow and how it implements image processing tasks at scale. Performance tests show that HIPI can perform principal component analysis on image datasets orders of magnitude larger than previous works. Optimizations like culling help improve performance by only decompressing necessary images. Overall, HIPI offers performance on par with or better than alternatives and significant improvements for processing large collections of images.
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin... (Cloudera, Inc.)
The document discusses Skybox Imaging's approach to indexing the entire planet using distributed low-cost satellites and Hadoop. It notes that today's satellite imagery data is often years old. Skybox proposes to use a network of many small satellites that can generate over 1 terabyte of raw data per day. This massive amount of data would be stored and processed using Hadoop on the ground. Skybox has developed an approach called BusBoy to integrate Hadoop and native code for efficient scientific processing of the satellite imagery data.
The growth of the amount of medical image data produced on a daily basis in modern hospitals forces the adaptation of traditional medical image analysis and indexing approaches towards scalable solutions. In this work, MapReduce is used to speed up and make possible three large-scale medical image processing use-cases: (i) parameter optimization for lung texture classification using support vector machines (SVM), (ii) content-based medical image indexing, and (iii) three-dimensional directional wavelet analysis for solid texture classification.
The document discusses virtualizing Hadoop and provides details on:
1) Current and projected usage of Hadoop shows a trend toward more virtualized deployments on-premises and in the public cloud.
2) Different virtualization platform scenarios for Hadoop, including shared storage, storage appliances, or local disks. Common virtualization platforms that can be used are VMware vSphere and OpenStack.
3) Example virtualization architectures showing how Hadoop could be deployed with shared NAS storage or using direct-attached storage on each virtual machine.
Hadoop on OpenStack - Sahara @DevNation 2014 (spinningmatt)
This document provides an overview of Sahara, an OpenStack project that aims to simplify managing Hadoop infrastructure and tools. Sahara allows users to create and manage Hadoop clusters through a programmatic API or web console. It uses a plugin architecture where Hadoop distribution vendors can integrate their management software. Currently there are plugins for vanilla Apache Hadoop, Hortonworks Data Platform, and Intel Distribution for Apache Hadoop. The document outlines Sahara's architecture, APIs, roadmap, and demonstrates its use through a live demo analyzing transaction data with the BigPetStore sample application on Hadoop.
Ic Accounting Presentation 15 Min Presentation (NigelDawes)
This document discusses intellectual capital accounting and methods for measuring intangible assets. It outlines Karl-Erik Sveiby's four approaches to measuring intangibles: direct, market capitalization, return on assets, and scorecard methods. It also discusses challenges in measuring social phenomena and the need for new management tools to capture the increased importance of intangible assets like customer relationships and innovation capabilities.
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage (Cloudera, Inc.)
Learn about:
Why big data matters to your business: realize revenue, increase customer loyalty, and pinpoint effective strategies
The business and technical challenges of big data solutions
How to leverage big data for competitive advantage
The “must haves” of an effective big data solution
Real-world examples of Cloudera, Pentaho and Dell big data solutions in action
How to give a Creative Presentation in 10 minutes by Two pens (Cynthia Hartwig)
This is a primer for people who are put under the gun to present creative work in on-the-fly situations. Ten minutes can be well spent if you use Creative Director Cynthia Hartwig's tricks; otherwise, work you've spent weeks designing, writing and creating will be deep-sixed faster than you can remove yourself from the room. Includes tips on recapping the creative brief, appointing a Meeting Czar (someone who runs the meeting and drives decisions) and presenting so that everybody can see that little, teeny mouse-type (assuming it's important). This deck is a supplement to How to Rock a Presentation by Cynthia Hartwig at Two Pens.
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics (DataWorks Summit)
1. The Hadoop Image Processing (HIP) pipeline acquires vehicle images, identifies updates, generates URLs, crops and resizes images, copies them to asset servers, and removes duplicates.
2. It uses HBase for image storage and archiving, MapReduce for image processing, Kafka for publishing to asset servers, OpenCV for image processing, and Avro for data serialization.
3. Performance testing showed HIP scales linearly and is at least 10x faster than the previous system, and using cascading downloads provided a 20% performance gain.
Big Data - The 5 Vs Everyone Must Know (Bernard Marr)
This slide deck, by Big Data guru Bernard Marr, outlines the 5 Vs of big data. It describes in simple language what big data is, in terms of Volume, Velocity, Variety, Veracity and Value.
Scrum is an agile framework for managing product development that focuses on continuous delivery of working software in short cycles called sprints, typically two weeks or less. Scrum emphasizes self-organizing cross-functional teams and accountability, iterative development and progress transparency through regular inspection of working increments. Key Scrum practices include sprint planning, daily stand-up meetings, sprint reviews, and retrospectives. Scrum can scale to large, complex projects through techniques like Scrum of Scrums.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
The document discusses big data in astronomy and the LineA-DEXL case. It provides an outline and introduction to big data in science and hypothesis-driven research. It discusses data management techniques like data partitioning and parallel workflow processing. It then provides details on the Laboratorio Nacional de Computacao Cientifica (LNCC) and its role in supporting computational modeling and bioinformatics. It discusses astronomy surveys that generate large amounts of data like the Dark Energy Survey and challenges of data from the Large Synoptic Survey Telescope. Finally, it discusses the need for data infrastructure, metadata management, and distributed data management to support scientific research involving big data.
Building A Scalable Open Source Storage SolutionPhil Cryer
The Biodiversity Heritage Library (BHL), like many other projects within biodiversity informatics, maintains terabytes of data that must be safeguarded against loss. Further, a scalable and resilient infrastructure is required to enable continuous data interoperability, as BHL provides unique services to its community of users. This volume of data and associated availability requirements present significant challenges to a distributed organization like BHL, not only in funding capital equipment purchases, but also in ongoing system administration and maintenance. A new standardized system is required to bring new opportunities to collaborate on distributed services and processing across what will be geographically dispersed nodes. Such services and processing include taxon name finding, indexes or GUID/LSID services, distributed text mining, names reconciliation and other computationally intensive tasks, or tasks with high availability requirements.
The document lists the top 10 most notable open source projects on GitHub in 2012, including Ruby on Rails, CyanogenMod, CocoaPods, Symfony, Zend Framework, OpenStack Compute, Puppet, TrinityCore, and Hubot scripts. Amal Roumi's document is distributed under a Creative Commons license.
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
The Elephant in the Room: Big Data Analytics in the Cloud (Khazret Sapenov)
The document discusses big data analytics in the cloud, including definitions of big data and analytics. It covers technologies like Hadoop, Dremel, and Storm, and how they can be used for business intelligence, operational intelligence, and value creation. It also discusses architecture considerations for big data analytic systems in the cloud, including data transfer speeds. The presentation aims to provide an overview of approaches for near real-time business intelligence and analytics using these technologies, both their applicability and limitations when used in the cloud.
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg... (i_scienceEU)
Network of Excellence Internet Science Summer School. The theme of the summer school is "Internet Privacy and Identity, Trust and Reputation Mechanisms".
More information: http://www.internet-science.eu/
The computational requirements of next-generation sequencing are placing a huge demand on IT organisations.
Building compute clusters is now a well understood and relatively straightforward problem. However, NGS applications require large amounts of storage and high IO rates.
This talk details our approach for providing storage for next-gen sequencing applications.
Talk given at BIO-IT World, Europe, 2009.
This document discusses the intersection of machine learning and search-based software engineering (ML & SBSE). It provides examples of how data miners can find signals in software engineering artifacts using machine learning techniques. It then discusses how better algorithms do not yet necessarily lead to better mining, and emphasizes the importance of sharing data, models, and analysis methods. Finally, it outlines a vision for "discussion mining" to guide teams in walking across the space of local models, with the goal of building a science of localism in ML and SBSE.
Viet-Trung Tran presents information on big data and cloud computing. The document discusses key concepts like what constitutes big data, popular big data management systems like Hadoop and NoSQL databases, and how cloud computing can enable big data processing by providing scalable infrastructure. Some benefits of running big data analytics on the cloud include cost reduction, rapid provisioning, and flexibility/scalability. However, big data may not always be suitable for the cloud due to issues like data security, latency requirements, and multi-tenancy overhead.
Design, Scale and Performance of MapR's Distribution for Hadoop (mcsrivas)
Details the first ever exabyte-scale system, which can hold a trillion large files. Describes MapR's Distributed NameNode (tm) architecture and how it scales easily and seamlessly. Shows MapReduce performance across a variety of benchmarks such as dfsio, pig-mix, nnbench, terasort and YCSB.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C... (Reynold Xin)
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
From Sensor Data to Triples: Information Flow in Semantic Sensor Networks (Nikolaos Konstantinou)
Sensor networks constitute a technological approach of increasing popularity when it comes to monitoring an area, offering context-aware solutions. This shift from desktop computing to ubiquitous computing entails numerous options and challenges in designing, implementing and shaping the behavior of systems that consume, integrate, fuse and exploit sensor data. Things tend to be more complicated when, in order to extract meaning from the collected information, the Semantic Web paradigm is adopted. In this talk, we discuss the information flow in systems that collect sensor data into semantically annotated repositories. Specifically, we analyze the journey that information makes, from its capture as electromagnetic pulses by the sensors to its storage as Semantic Web triples, along with its semantics, in the system's knowledge base. We introduce the main related concepts, analyze the main components that such systems comprise and the choices that can be made, and discuss the respective benefits, drawbacks, and effects on the overall system properties.
This document discusses publishing and consuming linked sensor data. It provides motivation for representing sensor data as linked data by discussing challenges in accessing heterogeneous sensor data from different sources. It then outlines some of the key ingredients needed for linked sensor data, including ontologies to model sensor metadata and observations, guidelines for generating identifiers, and query processing engines for accessing the data. Examples of existing linked sensor data sources are also provided.
The document discusses DataONE, a project aimed at improving data repository interoperability and advancing best practices in data lifecycle management. It focuses on enabling access to multiple external data repositories from within a HUB environment. This would allow users to aggregate and integrate disparate datasets for new analyses, and enable reproducible workflows. The goal is to address issues around scattered and dispersed data by improving discovery, integration and long-term preservation of datasets.
Cloud Programming Models: eScience, Big Data, etc. (Alexandru Iosup)
This document discusses cloud programming models. It begins by defining programming models and noting that they provide an abstraction of a computer system through a language, libraries and runtime system. It then lists some key characteristics of a cloud programming model including efficiency, scalability, fault tolerance and data models. The document outlines an agenda to cover programming models for compute-intensive and big data workloads. It provides examples of bags of tasks and workflow programming models and their applications in fields like bioinformatics.
The document discusses the EGEE project, which builds and supports scientific communities using grid computing. It provides a worldwide computing infrastructure integrating software, resources, and technical support. EGEE supports many scientific domains with large data needs, such as high energy physics, astronomy, genomics and earth observation. It currently connects over 17,000 users and 136,000 CPUs, with 25 petabytes of disk and 39 petabytes of tape, across 268 sites in 48 countries. The gLite middleware allows applications to access these distributed resources for high throughput data analysis and storage.
The Synergy Between the Object Database, Graph Database, Cloud Computing and ... (InfiniteGraph)
This document summarizes a presentation given by Leon Guzenda on the synergy between object database, graph database, cloud computing and NoSQL paradigms. It provides a historical overview of object database management systems and discusses their inherent advantages over relational databases. It also covers how these technologies have evolved, including the development of "NoSQL" systems, and how an object database management system can leverage other technologies like Hadoop. The presentation concludes that object database management systems are still highly relevant and that graph databases can complement relational, NoSQL and object database technologies.
15. Problem statement
Better images
Better sensors, more information
More expensive storage
More data
Data transport
Expensive supercomputers
More computation
Parallel processing
16. Objectives
• Fast enough
• Affordable
• Scalable
File system + software framework
17. Research questions
• How can large satellite images be stored in an HDFS file system in such a way that they can be processed efficiently in parallel?
• Which algorithms can be used with this storage technique and MapReduce?
19. Literature review
• Interesting projects
• HDFS
• MapReduce
• Implementations
• Distributions
• Current literature
20. Interesting projects
• NASA (12)
• Center for Climate Simulation
• Square Kilometer Array: 700 TB/sec
• Open Cloud Consortium (13)
• Project Matsu: Elastic Clouds for Disaster Relief
• CERN: Large Hadron Collider (14)
• 20 PB/year
21. HDFS
• Distributed file system
• Based on the Google File System (1)
• Large blocks (128 MiB)
• Commodity hardware
• Failure is the norm
• Read & append (1)
[Diagram: a file split into blocks 1, 2, ..., n across the cluster]
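To make the block-based storage concrete, here is a minimal Java sketch of writing a large image to HDFS with the 128 MiB block size mentioned above, using the standard org.apache.hadoop.fs API. The NameNode address, file names and replication factor are illustrative assumptions, not values from the thesis.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsImageUpload {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; a real cluster sets this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             InputStream local = Files.newInputStream(Paths.get("scene.tif"))) {
            long blockSize = 128L * 1024 * 1024; // 128 MiB blocks, as on the slide
            short replication = 3;               // assumed replication factor
            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(
                    new Path("/images/scene.tif"), true, 4096, replication, blockSize)) {
                local.transferTo(out);           // stream the local image into HDFS
            }
        }
    }
}
```

The file is then stored as a handful of 128 MiB blocks, each of which can later be handed to a map task running on a node that holds a copy of it.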
22. HDFS
A DFS usually accounts for transparent file replication and fault tolerance, and enables data locality for processing tasks. A DFS does this by subdividing files and distributing these blocks within a cluster of computers. Figure 2 shows the distribution of a file (left) subdivided into three blocks.
[Figure 2: File blocks, distribution and replication in a distributed file system]
23. HDFS
[Figure 4: Block assembly for data retrieval from the distributed file system]
24. HDFS
[Figure 3] illustrates how the file system handles node failure by automated recovery. HDFS further uses checksums to verify block integrity. As long as there is an accessible copy of a block, it can automatically re-replicate to return to the full replication rate.
[Figure 3: Automatic repair in case of cluster node failure by additional replication]
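Checksumming and re-replication happen inside HDFS itself, but both are also reachable from the client API; a small hedged sketch (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path image = new Path("/images/scene.tif"); // illustrative path

        // HDFS keeps per-block checksums; the client can request a
        // file-level checksum to verify integrity after a transfer.
        FileChecksum checksum = fs.getFileChecksum(image);
        System.out.println("checksum: " + checksum);

        // Raising the replication factor asks the NameNode to create extra
        // copies, the same mechanism it uses to repair under-replicated blocks.
        short current = fs.getFileStatus(image).getReplication();
        fs.setReplication(image, (short) (current + 1));
    }
}
```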
25. HDFS - Overview
• Scalable
• Fast reads/writes
• Robust
• A factor of 10 cheaper (2)
28. MapReduce - Overview
• Based on Google MapReduce (3)
• Data locality
• Key/value pairs
• Very fast
• A different way of thinking
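To make the key/value model concrete, here is a self-contained MapReduce job in Java that computes a grey-value histogram. It assumes, purely for illustration, that each input line holds whitespace-separated 8-bit pixel values; real satellite imagery would need a custom InputFormat, which is exactly what the research questions above are about.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PixelHistogram {

    // Map: for every pixel value on the line, emit the key/value pair (value, 1).
    public static class PixelMapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private final IntWritable pixel = new IntWritable();
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    pixel.set(Integer.parseInt(token));
                    ctx.write(pixel, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts per grey value.
    public static class SumReducer
            extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable pixel, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(pixel, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pixel histogram");
        job.setJarByClass(PixelHistogram.class);
        job.setMapperClass(PixelMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the mapper's node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The combiner runs on the mapper's node, so most of the counting happens next to the data blocks: data locality in practice.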
29. Implementations

              Hadoop   Stratosphere   HPCC
  Support     +        -              +
  Extensions  +        -              ?
  Community   +++      +/-            -
  Target      ANY      EDU            BI

• Apache Software Foundation
• Others: outdated, commercial, little support (4-6)
30. Distributions
• MapR (8)
• Hortonworks (7)
• Cloudera: Cloudera Manager (9)
• Web interface
• 1-click install (yeah right...)
• Interesting licensing model
31. General
• Mostly text processing
• For small images (10)
• Little detail
• Commercial (11)
33. Planning
[Gantt chart: literature review and phases 1-4, plotted against the dates 01/09, 01/02, 15/03 and 20/05, with markers for the internship, today, and handing in the report, all within the master's thesis]
34. Phase 1 - Done
[Cluster diagram: the master node (192.168.10.245) runs the Job Tracker (JT) and Name Node (NN); the workstations Bruno (192.168.10.246), Tim (192.168.10.247), Sven (192.168.10.248) and Patrick (192.168.10.249) each run a Task Tracker (TT) and Data Node (DN). The nodes are RedHat 6.2 workstations and virtual machines.]
35. Phase 2
• Simple algorithm
• Rotate an image (see the sketch below)
• Standard IO
• HDFS
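A hedged sketch of what the phase 2 rotation could look like in Java, reading the image from HDFS and writing the result back; the paths and the PNG format are assumptions, not choices made in the thesis:

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;

import javax.imageio.ImageIO;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RotateImage {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Read the source image straight from an HDFS stream (illustrative path).
        BufferedImage src;
        try (FSDataInputStream in = fs.open(new Path("/images/scene.png"))) {
            src = ImageIO.read(in);
        }

        // Rotate 90 degrees: translate first so the result lands at the origin.
        AffineTransform tx = new AffineTransform();
        tx.translate(src.getHeight(), 0);
        tx.quadrantRotate(1);
        BufferedImage dst = new BufferedImage(
                src.getHeight(), src.getWidth(), BufferedImage.TYPE_INT_RGB);
        new AffineTransformOp(tx, AffineTransformOp.TYPE_NEAREST_NEIGHBOR)
                .filter(src, dst);

        // Write the rotated image back to HDFS.
        try (FSDataOutputStream out = fs.create(new Path("/images/scene_rot90.png"), true)) {
            ImageIO.write(dst, "png", out);
        }
    }
}
```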
36. Phase 3
• More complexity: MapReduce
• Spatial: convolution mask, ROI (see the sketch below)
• Temporal/spectral: multiple images
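For the spatial step, Java's built-in java.awt.image classes already express both a convolution mask and a region of interest. A hedged sketch with an assumed 3x3 sharpening kernel (the mask and ROI size are illustrative, not from the thesis):

```java
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;

public class SpatialFilter {

    // Apply a 3x3 convolution mask to a region of interest of the image.
    public static BufferedImage filterRoi(BufferedImage img,
                                          int x, int y, int w, int h) {
        // Example mask: a simple sharpening kernel (an assumption for the demo).
        float[] mask = {
                 0f, -1f,  0f,
                -1f,  5f, -1f,
                 0f, -1f,  0f
        };
        ConvolveOp op = new ConvolveOp(new Kernel(3, 3, mask),
                                       ConvolveOp.EDGE_NO_OP, null);

        // getSubimage shares pixel data with img; filter(roi, null) writes
        // into a fresh destination image, leaving the original untouched.
        BufferedImage roi = img.getSubimage(x, y, w, h);
        return op.filter(roi, null);
    }

    public static void main(String[] args) throws Exception {
        BufferedImage img = javax.imageio.ImageIO.read(new java.io.File(args[0]));
        BufferedImage filtered = filterRoi(img, 0, 0,
                Math.min(256, img.getWidth()), Math.min(256, img.getHeight()));
        javax.imageio.ImageIO.write(filtered, "png", new java.io.File(args[1]));
    }
}
```

In the MapReduce setting, the same per-tile filtering would run inside map tasks; the temporal/spectral step would then combine corresponding tiles from multiple images in the reducer.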
37. Phase 4
• Performance as a function of pixel distance
38. Planning
[Gantt chart repeated: literature review and phases 1-4 against the dates 01/09, 01/02, 15/03 and 20/05, with markers for the internship, today, and handing in the report, all within the master's thesis]
39. The End
• Lots of data
• A different way of thinking
• Many possibilities
• RLZ or a new Big Data elective? ;)
• MapReduce + OpenCL?
• Many challenges
• Many questions
40. References
(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google file system’
(2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LIDAR data’
(3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
(4) http://hadoop.apache.org/
(5) Warneke, D. and Kao, O. (2009), ‘Nephele: efficient parallel data processing in the cloud’, http://www.stratosphere.eu
(6) http://hpccsystems.com/
(7) http://hortonworks.com/
(8) http://mapr.com/
(9) http://cloudera.com/
(10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
(11) Guinan, O. (2011), ‘Indexing the earth - large scale satellite image processing using hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
(12) Duffy, D.Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’, http://www.nas.nasa.gov/SC12/demos/demo20.html
(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’