This document discusses automated summarization of large datasets from the Catlin Seaview Survey, a global coral reef monitoring effort. It collects images of reefs automatically during surveys and annotates them using machine learning. The author aims to efficiently summarize the big data using dynamic RMarkdown reports connected to a MySQL database. This will allow non-experts to explore trends in the data through interactive visualizations and maps.
1. IntRoduction
Automated Summarisation of Big Data
Using data from the Catlin Seaview Survey - a global coral reef
monitoring effort
Amy StringeR
University of Queensland
UseR! July 2018
Amy StringeR (Global Change Institute) Automated Summarisation of Big Data UseR! July 2018 1 / 35
2. IntRoduction
Outline
1 IntRoduction
2 Context
    The Catlin Data
    Image Collection
    Image Annotation
3 The Need
    Efficient SummaRisation
4 New Challenges
    Contextual Challenges
    Data Challenges
5 Solutions
    Dynamic Plotting Environments
    Interactive Maps using Leaflet
    RMySQL and Parameterised Rmarkdown
    Child Documents
6 Future Work
3. IntRoduction
[Insert witty crowd banter]
4. Context
The Catlin Seaview Survey
Coral reef monitoring program endeavouring to develop a global baseline on reef health and then monitor the state of reefs through resurvey efforts
5 regions around the world so far: Australia, the Caribbean, Southeast Asia, the Indian Ocean, the Pacific
Within these 5 major regions, we have a total of 25 survey countries
Figure: (a) Bleaching at the Maldives, 2016. (b) Bleaching at Heron Island, 2016.
6. Context The Catlin Data
Efficient Monitoring
Three main stages:
1 Collection of images
2 Annotation of images
3 Calculating proportions, and visualising trends between surveys
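As a rough illustration of the third stage, the sketch below turns point annotations into percentage cover per label and survey year. The data frame, its column names (survey_year, image_id, label) and its values are made up for the example; the talk does not show the real table layout.

```r
# Hypothetical point-annotation table: one row per annotated point.
# Column names and values are assumptions for illustration only.
annotations <- data.frame(
  survey_year = c(2012, 2012, 2012, 2012, 2016, 2016, 2016, 2016),
  image_id    = c("a", "a", "b", "b", "c", "c", "d", "d"),
  label       = c("Hard coral", "Algae", "Hard coral", "Hard coral",
                  "Algae", "Algae", "Hard coral", "Algae"),
  stringsAsFactors = FALSE
)

# Percentage cover = share of annotated points carrying each label,
# computed within each survey year (rows of the table).
percent_cover <- function(annotations) {
  counts <- table(annotations$survey_year, annotations$label)
  round(100 * prop.table(counts, margin = 1), 1)
}

cover <- percent_cover(annotations)
print(cover)
```

Comparing the rows of `cover` between survey years gives the kind of trend the later visualisations are built on.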
8. Context Image Collection
The Catlin Seaview Survey - Image Collection
High-definition images collected in 2 km transects along a reef section, taken automatically every 3 seconds
Each image is GPS located
Speed of collection increased from the traditional 60 m² per dive (45 min) to 2000 m² per dive
For more on collection methodology, see [7]
Figure: A diver pushing the SVII scooter during a survey of the Great Barrier Reef. © XL Catlin Seaview Survey
9. Context Image Collection
Figure: An example image from a survey of the Great Barrier Reef. Images like
this, along with the data, are available on the XL Catlin Global Reef Record [1].
11. Context Image Annotation
Neural Network for Image Annotations
Previously a time-consuming, manual task (potentially 3 decades of work for the CSS images)
An automatic point-annotation method based on machine learning algorithms is now used (see [3])
Colour and texture of images are used as descriptors for label categories
Coverage estimates are uploaded within a week of collection
12. Context Image Annotation
Figure: The same image from earlier showing the points used for annotation
14. The Need Efficient SummaRisation
The Next Stage
Fast data collection → fast annotation → bottleneck in processing
Data are stored in a MySQL database, allowing for easy integration with R [6, 5]
Introducing Rmarkdown [2]
rmarkdown provides a solution for quick and consistent exploratory analysis of the data in a report format
Visualisations! [8]
16. New Challenges Contextual Challenges
Contextual Challenges
Usability for non-R users
Meaningful visualisations
They need to be useful for more than just the researchers; local
government and others in charge of marine protection need to get
some value
Comparisons to literature
Many label groups, and spatial scales in the dataset
18. New Challenges Data Challenges
Data Challenges - Structure
Sub-region              Reef Count   Transect Count   Image Count
Cairns-Cooktown                 13               43         87631
Coral Sea                        3               32         23573
Far Northern                    12               33         68367
Mackay-Capricorn                 4               12         14151
Townsville-Whitsunday            4               10          5722
Total                           36              130        199444
Table: A summary of the various spatial scales within just one region, the Great
Barrier Reef. This structure is consistent across all 5 regions.
19. New Challenges Data Challenges
Data Challenges - Labels
Benthic labels describe the community benthic category
Global labels describe morphological categories
Each region has 5 functional groups:
    Hard corals
    Soft corals
    Algae
    Other invertebrates
    Other
Region            Benthic Labels   Global Labels
GBR                           27              13
Indian Ocean                  49              17
Caribbean                     67              16
Southeast Asia                71              17
Pacific                       40              12
21. Solutions Dynamic Plotting Environments
Plotting Hiccups
Plots have been created at multiple spatial scales: reef scale, subregion scale, transect scale
Plots will have varying sizes based on the number of reefs/subregions/transects in the respective regional dataset
Differing label sets among regions make visualisation an exciting challenge at each of these spatial scales
It is near impossible to assign clearly identifiable colours at the community benthic level
Visualisations at this level are more overwhelming than helpful
Regions surveyed only once need some kind of conditioning on their temporal plot construction
22. Solutions Dynamic Plotting Environments
Dynamic Plotting
The code wrapper in the rmarkdown source script allows you to set the figure height and width from a variable
Figure: Plot wrapper with an example of the figure height adjustment according to the number of plot facets. Also shown here is a boolean variable for evaluation of the code segment.
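The screenshot of the plot wrapper is not reproduced here, so the following is only a sketch of the idea. It computes a figure height from the number of facets and a boolean that switches the temporal plot off when no reef has been resurveyed. The helper name, the sizing constants and the example data are assumptions, not the author's code.

```r
# Hypothetical helper: scale figure height with the number of facet rows.
facet_fig_height <- function(n_facets, n_col = 3, row_height = 2.5, base = 1) {
  n_rows <- ceiling(n_facets / n_col)
  base + row_height * n_rows
}

# Made-up example data: reefs in the selected region and their survey years.
surveys <- data.frame(
  reef = c("Reef A", "Reef A", "Reef B", "Reef C", "Reef C"),
  year = c(2012, 2016, 2014, 2012, 2014),
  stringsAsFactors = FALSE
)

# Reefs with more than one survey year get a facet in the temporal plot.
years_per_reef <- tapply(surveys$year, surveys$reef,
                         function(y) length(unique(y)))
resurveyed <- names(years_per_reef)[years_per_reef > 1]

fig_h        <- facet_fig_height(length(resurveyed))  # value for fig.height
has_temporal <- length(resurveyed) > 0                # value for eval

# In the R Markdown source, a later chunk header can then use these variables,
# because knitr evaluates chunk options as R expressions:
#   {r reef-trends, fig.height = fig_h, fig.width = 8, eval = has_temporal}
```

The variables must be defined in a chunk that runs before the one whose header refers to them.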
23. Solutions Dynamic Plotting Environments
Reef Scale Visualisations
Figure: An example visualisation of only 3 of the 36 GBR reefs. This plot is
created at the functional group scale. Note that the x axis is year, and the y axis
is percentage coverage over the reef in question.
24. Solutions Dynamic Plotting Environments
Reef Scale Visualisations
Figure: An example of the reef scale visualisation for the reefs only surveyed once.
Coverage here is represented in the same way as the previous plot, giving a
percentage coverage for each basic functional group.
25. Solutions Dynamic Plotting Environments
Change at the Transect Scale
Extra challenges that arise from survey design
Visualised at the global label level
Investigate change only over consecutive survey years
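One possible reading of "consecutive survey years" is sketched below: a change is kept only when two surveys of the same transect are adjacent in the region's survey schedule, so a transect that skipped a survey contributes no change for that gap. This interpretation, the column names and the numbers are all assumptions for illustration.

```r
# Made-up transect cover values; column names are assumptions.
cover <- data.frame(
  transect       = c("T1", "T1", "T1", "T2", "T2"),
  year           = c(2012, 2014, 2016, 2012, 2016),
  hard_coral_pct = c(30, 25, 28, 40, 22),
  stringsAsFactors = FALSE
)

# Keep only changes between surveys that are adjacent in the survey schedule.
consecutive_change <- function(cover, survey_years) {
  survey_years <- sort(unique(survey_years))
  out <- list()
  for (tr in unique(cover$transect)) {
    d <- cover[cover$transect == tr, ]
    d <- d[order(d$year), ]
    if (nrow(d) < 2) next
    for (i in seq_len(nrow(d) - 1)) {
      # Distance between the two surveys, counted in schedule positions.
      gap <- match(d$year[i + 1], survey_years) - match(d$year[i], survey_years)
      if (gap == 1) {
        out[[length(out) + 1]] <- data.frame(
          transect = tr,
          from     = d$year[i],
          to       = d$year[i + 1],
          change   = d$hard_coral_pct[i + 1] - d$hard_coral_pct[i],
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, out)
}

changes <- consecutive_change(cover, survey_years = cover$year)
print(changes)
```

Here T2 was not surveyed in 2014, so its 2012 to 2016 difference is dropped rather than reported as a single step.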
27. Solutions Interactive Maps using Leaflet
Leaflet
Using Leaflet [4] for interactive maps allows for readers to see where
exactly the surveys take place. Each transect marker represents a 2km
survey region.
Figure: Disclaimer - this image is not so interactive
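A minimal sketch of such a map, assuming one marker per transect. The coordinates and transect names are invented for the example and are not real survey locations; the leaflet calls are guarded so the snippet still runs where the package is not installed.

```r
# Made-up transect coordinates; not real survey locations.
transects <- data.frame(
  transect = c("T1", "T2", "T3"),
  lat      = c(-16.9, -18.3, -23.4),
  lng      = c(146.0, 147.7, 151.9),
  stringsAsFactors = FALSE
)

# Text shown when a reader clicks a marker.
popup_text <- paste0("Transect ", transects$transect, " (2 km survey)")

if (requireNamespace("leaflet", quietly = TRUE)) {
  m <- leaflet::leaflet(transects)
  m <- leaflet::addTiles(m)   # default OpenStreetMap background tiles
  m <- leaflet::addMarkers(m, lng = ~lng, lat = ~lat, popup = popup_text)
  # In an HTML R Markdown report, ending a chunk with `m` embeds the widget.
}
```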
29. Solutions RMySQL and Parameterised Rmarkdown
Parameterising Rmarkdown Documents and RMySQL
RMySQL [5] allows the database to be accessed from within RStudio, negating the need for an external program
Figure (a): The header when making use of document parameters. In future, more parameters may be added to simplify the source script, but in the current stages things have been kept simple.
Figure (b): Working example accessing the database with the document input parameters. The use of RMySQL allows for connection to the database and extraction of data within the Rmarkdown source script.
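The two screenshots are not reproduced here, so the sketch below only shows the general shape of the approach. The parameter name (region), the table and column names and the connection details are all hypothetical. The YAML header is held in a character vector so the example stays in R; in practice it sits at the top of the .Rmd file.

```r
# YAML header of a hypothetical parameterised report.
# The parameter name `region` is an assumption.
report_header <- c(
  "---",
  "title: \"Regional summary\"",
  "output: html_document",
  "params:",
  "  region: \"GBR\"",
  "---"
)

# Build the query for one region. Table and column names are hypothetical.
# DBI::sqlInterpolate() quotes the value rather than pasting it in directly.
region_query <- function(con, region) {
  DBI::sqlInterpolate(
    con,
    "SELECT * FROM annotations WHERE region = ?region",
    region = region
  )
}

# Connect, fetch the region's rows and disconnect.
# All connection details are placeholders supplied by the caller.
fetch_region <- function(params, dbname, host, user, password) {
  con <- DBI::dbConnect(RMySQL::MySQL(), dbname = dbname, host = host,
                        user = user, password = password)
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, region_query(con, params$region))
}
```

Inside the report, `params$region` is available to every chunk, so a call such as `fetch_region(params, ...)` would pull only the data for the region being rendered.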
30. Solutions RMySQL and Parameterised Rmarkdown
How’s the SeRenity?
Figure: The only bit of code a user needs to deal with to generate up to 25
reports.
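The code in the figure is not reproduced here. A loop of the following kind is one way to generate several reports from a single parameterised source; the source file name, the parameter name and the output naming scheme are assumptions.

```r
# Hypothetical output naming scheme: one HTML file per region.
report_file <- function(region) {
  paste0("summary_", gsub("[^a-z0-9]+", "_", tolower(region)), ".html")
}

# Render one report per region from a single parameterised source document.
render_reports <- function(regions, source_rmd = "summary_report.Rmd") {
  for (region in regions) {
    rmarkdown::render(
      input       = source_rmd,
      params      = list(region = region),
      output_file = report_file(region),
      envir       = new.env()  # fresh environment so reports do not share state
    )
  }
  invisible(report_file(regions))
}

# Example call (not run here, since it needs the source document and database):
# render_reports(c("GBR", "Indian Ocean", "Caribbean"))
```

With this shape, the list of regions passed to `render_reports()` is the only thing a user has to edit.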
32. Solutions Child Documents
Child Documents for Textual Components
Introductions and discussions will need to be different across the
regions
Using the parameters of the source we can import a specific
introduction file for each desired region
Child documents allow for easy editing of introductions/methods/discussions without needing to open the main source document, which is overwhelming and complicated
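A sketch of how a region parameter might select a child document. The directory layout, the file naming scheme and the fallback to a default file are assumptions added for the example, not something described in the talk.

```r
# Hypothetical naming scheme: children/<section>_<region>.Rmd
child_path <- function(region, section, dir = "children") {
  slug <- gsub("[^a-z0-9]+", "_", tolower(region))
  file.path(dir, paste0(section, "_", slug, ".Rmd"))
}

# Fall back to a generic file when no region-specific text exists yet
# (the fallback is an assumption of this sketch).
pick_child <- function(region, section, dir = "children") {
  specific <- child_path(region, section, dir)
  if (file.exists(specific)) specific else child_path("default", section, dir)
}

intro_child <- pick_child("Southeast Asia", "introduction")

# In the main source document the path would be passed to knitr's `child`
# chunk option, which is evaluated as an R expression:
#   {r intro, child = pick_child(params$region, "introduction")}
```

Each child file then holds only prose, so a collaborator can edit an introduction or discussion without touching the plotting code.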
33. Future Work
Future Work
Currently this isn’t a fully automated process
Talk of linking these reports to a website
Extra parameterisation of the document - perhaps a structure change
based on the individual generating the report (e.g. management,
research etc)
34. Appendix For Further Reading
References I
[1] The Ocean Agency. Global Reef Record. 2017. url: http://globalreefrecord.org/data.
[2] JJ Allaire et al. rmarkdown: Dynamic Documents for R. R package version 1.6. 2017. url: https://CRAN.R-project.org/package=rmarkdown.
[3] O. Beijbom et al. “Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation”. In: (2015).
[4] Joe Cheng, Bhaskar Karambelkar, and Yihui Xie. leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library. R package version 1.1.0. 2017. url: https://CRAN.R-project.org/package=leaflet.
[5] Jeroen Ooms et al. RMySQL: Database Interface and ’MySQL’ Driver for R. R package version 0.10.13. 2017. url: https://CRAN.R-project.org/package=RMySQL.
[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2013. url: http://www.R-project.org/.
[7] Manuel González-Rivero et al. “Scaling up ecological measurements of coral reefs using semi-automated field image collection and analysis”. In: Remote Sensing 8 (2016). url: http://www.mdpi.com/2072-4292/8/1/30.
[8] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. isbn: 978-0-387-98140-6. url: http://ggplot2.org.