Mapping using Fixed-wing Drone
Ting Wen Ong | Operation Manager FEDS Drone-powered Solutions
Agenda
• Big Data/AI and Drone
• Opportunities
• Challenges, Why is it Hard?
• Big Data Challenges…
• Toward a new architecture for drone Big Data
• Partitioning
• Storage
• Computing
• Some existing Big Data/AI frameworks for Drone
Audience Poll
• How many of you have used Big Data/AI techniques? Hadoop? Spark? TensorFlow?
Big Data / AI and Drone
Why!
Reminder about Big Data
• “Big data …encompasses the volume of information, the speed at which it is
created and collected, and the variety of the data points being covered.” (source:
investopedia.com)
• It becomes essential to many companies’ success in today’s business landscape
(Finance, Banking, Google, Facebook, …)
Reminder about AI
• AI is learning from large amounts of data to gain new insights and to help with
prediction tasks
• Many approaches have been developed to learn from data (of various forms:
text, DB, Image, video…): Deep Neural Networks based solutions
• The more data available, the more effective the learning is and the more accurate
the prediction task is.
Opportunities of Big Data and Drone
• Drone data are a good example of what Big Data technology was created for:
Storage and Computing
• Drones can capture, store, and transmit data, giving businesses the
opportunity to integrate more data into their current processes
Opportunities of AI and Drone
• With such volumes, AI can draw on a huge amount of drone data
to learn new insights and help with prediction tasks
• Farmers use drones for agriculture
• Helping to predict crop yields
• Drones for thermal imaging
• Used for construction and maintenance
Very good, but……
• The potential of drones data is often underestimated
• Archiving collected data
• Currently, we do more archiving than efficient management of drone data
• Almost no existing Big Data infrastructure can handle drone data efficiently
• Even if Big data is almost mature for other domains: Finance, Banking…
• Often it is
• Hard to store
• Hard to manage
• Hard to process
• Hard to get insight
• How ???
Hard to store: Volume
A very small drone project can generate more than
10 GB, sometimes more than 40 GB
15 million images of drone can make up more than 175
terabytes of data.
How to Store and Compute such growing volume?
FEDS : 13,000 flights this year
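As a rough sanity check on the volume figures above, the arithmetic can be sketched in a few lines of Python. Decimal units (1 TB = 10^12 bytes) are an assumption, and the 40 GB projection simply reuses the mean image size implied by the slide.

```python
# Figures quoted on the slide.
total_bytes = 175 * 10**12      # 175 terabytes (decimal units assumed)
image_count = 15_000_000        # 15 million images

# Mean size of one image implied by those two numbers.
mean_image_mb = total_bytes / image_count / 10**6
print(f"mean image size: {mean_image_mb:.1f} MB")

# Assumption for illustration: a 40 GB project made of images of that mean size.
project_bytes = 40 * 10**9
images_per_project = project_bytes / (mean_image_mb * 10**6)
print(f"images in a 40 GB project: {images_per_project:.0f}")
```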
Hard to store: Variety
• “Drones can now provide a wide variety
of data types, everything from a few basic
photos through to complex measurable 3D
models with annotations and overlays.”
Visual Encyclopedia of drone data
Aerial Photography and Video
Orthomosaic Map
Digital Elevation Model (DEM)
3D Pointcloud Model
Multispectral Mapping
Thermal Imagery and Mapping
Hard to process:
Computing Model and Scalability
• Currently, drone image processing is done on a single server: NOT SCALABLE
• Scalability is the property of a system to handle a growing amount of work by
adding resources to the system
• In Big Data, it is mostly achieved by distributing storage and computing
• Distributed computing can provide scalability, but making it drone-data friendly is difficult
Processing/Querying drone data can take up to a few hours
• Objective: real time (a few seconds)
• Going beyond traditional algorithms
• Why not use neural networks, which have had great success with images:
▪ Semantic segmentation
▪ Object recognition, classification, ...
▪ Description Generation for Drone Images Using Attribute Attention Mechanism
• But these new algorithms require more storage capacity and computing power
Hard to get Insights
Recall that drone data are similar to
raster data structures
• Aerial imagery
• Satellite imagery
• Climate data (netCDF, …)
Currently,
How Drone Data are Stored?
Internal Storage: for short-term storage before editing (hobbyist users)
SD Cards: The majority of drones use SD or micro SD cards as their standard
storage option.
Cloud Storage: The benefit of using a cloud-based system is that you can access your
data anywhere in the world by logging on to your account.
Label and Organize Your Files: You save each session chronologically by date with additional
information such as the location of your shoot or the client it was for.
Good, But
Scalability
Currently,
How Drone Data are Managed?
File Systems
GIS
Drone Software
Current approaches are obsolete
we need to reinvent everything
Storage: Access, Availability
Computing: Fast, Accurate
Analytics: Machine Learning, Deep Learning
Search: By semantics, By spatial queries
…
1. New architecture to be redefined
Analytical Queries
Structured Storage
Cluster
Computing Cluster
…
Large Scale
Time series NDVI
• Distributing both STORAGE AND COMPUTING
2. Need to correlate drone data with external
datasets
More Insights
Census Data
Economic Data
Weather
…
3. Toward a declarative language (SQL-Like) over drone
data
Change in NDVI over the spring and early summer of 2018
SELECT normalized_difference(nir, red) AS ndvi
FROM Feds_droneDataset
WHERE date BETWEEN '10-10-2017' AND '10-10-2019'
Examples from
‘10-10-2017’ to ‘10-10-2019’
Best option for Data Scientists
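The normalized difference used in the query is the standard NDVI formula, (NIR - Red) / (NIR + Red). A minimal pure-Python sketch of what such a function computes per pixel is shown here; the function body and the sample reflectance values are assumptions for illustration, not the implementation behind the query.

```python
def normalized_difference(nir, red):
    """Return (nir - red) / (nir + red) per pixel; None where the sum is zero."""
    result = []
    for n, r in zip(nir, red):
        total = n + r
        result.append((n - r) / total if total != 0 else None)
    return result

# Hypothetical reflectance values for four pixels.
nir = [0.50, 0.40, 0.30, 0.0]
red = [0.10, 0.20, 0.30, 0.0]
ndvi = normalized_difference(nir, red)
print(ndvi)
```

Dense vegetation reflects strongly in the near infrared and weakly in red, so values close to 1 suggest healthy vegetation while values near 0 suggest bare ground.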
Drone Big Data
We Will focus on three Aspects
Storage: HDFS, NoSQL Database, Data Lake
Computing: MR, Spark
Analytics: ML, DL
Recall that storage should be distributed
across a cluster
• Before detailing storage techniques, let’s talk about Partitioning
Structured Storage
Cluster
…
Node A
Node B
Node F
Node G
Challenge for going distributed:
Data Partitioning
• Partitioning means the process of physically dividing data into separate data
stores
• Data is divided into partitions that can be managed and accessed separately.
Node 1
Node 2
Node 3
Node 4
Node 1
Node 2
Node 3
By Band
RGB
Red Band
Green Band
Blue Band
First simple approach is to partition by band
Node 1
Node 2
Node 3
By Time
Spring
Summer
Autumn
Other simple approach is to partition by time
(season)
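Both simple strategies, by band and by season, amount to grouping records by one attribute and sending each group to a node. A minimal sketch follows; the catalog records, node names, and round-robin rule are all assumptions made for illustration.

```python
from collections import defaultdict

def partition_by(records, key, nodes):
    """Assign each record to a node according to the value of `key`.

    Distinct key values are mapped to nodes round-robin, in order of
    first appearance.
    """
    assignment = {}                 # key value -> node name
    placement = defaultdict(list)   # node name -> records
    for record in records:
        value = record[key]
        if value not in assignment:
            assignment[value] = nodes[len(assignment) % len(nodes)]
        placement[assignment[value]].append(record)
    return dict(placement)

# Hypothetical catalog entries.
images = [
    {"id": "img1", "band": "red", "season": "spring"},
    {"id": "img2", "band": "green", "season": "spring"},
    {"id": "img3", "band": "blue", "season": "summer"},
    {"id": "img4", "band": "red", "season": "autumn"},
]
nodes = ["Node 1", "Node 2", "Node 3"]

by_band = partition_by(images, "band", nodes)
by_season = partition_by(images, "season", nodes)
```

The same data lands on different nodes depending on the chosen attribute, which is why the partitioning key should follow the dominant query pattern.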
Node 1
Node 2
Node 3
Decompose into NxN regular grids
But the Most efficient approach is to combine Tiling and Distribution
Tiling allows large raster datasets to be broken up into manageable pieces, which supports a higher-level raster I/O interface.
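Combining tiling and distribution can be sketched as two small steps: cut the raster into an N x N grid of pixel windows, then spread the windows over the nodes. The raster size, grid size, and round-robin placement below are assumptions, not a prescribed scheme.

```python
def tile_grid(width, height, n):
    """Split a width x height raster into an n x n grid of pixel windows.

    Returns tuples (col, row, x_off, y_off, tile_w, tile_h). Integer
    division spreads any remainder pixels across the tiles.
    """
    tiles = []
    for row in range(n):
        for col in range(n):
            x_off = col * width // n
            y_off = row * height // n
            x_end = (col + 1) * width // n
            y_end = (row + 1) * height // n
            tiles.append((col, row, x_off, y_off, x_end - x_off, y_end - y_off))
    return tiles

def distribute(tiles, node_count):
    """Place tiles on nodes round-robin; returns {node_index: [tiles]}."""
    placement = {i: [] for i in range(node_count)}
    for i, tile in enumerate(tiles):
        placement[i % node_count].append(tile)
    return placement

tiles = tile_grid(width=1000, height=800, n=4)
placement = distribute(tiles, node_count=3)
```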
Which Partition strategy to choose?
• Not in the scope of this presentation
• Check with your main objective:
• If for Scalability,
• If for Query Performance,
• If for Availability
• Many Best practices are available
• Sometimes we make use of Global Index for Optimizing Queries
1- Distributed Storage techniques
Quick reminder
HDFS- Hadoop Distributed File System
• The Most basic data store for Big Data
• HDFS breaks down very large files into large blocks (for example, 64 MB each),
• and stores three copies of these blocks on different nodes in the cluster to protect against
machine failures.
• The default is a replication factor of 3 (every block is stored on three machines)
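The effect of block size and replication on storage can be estimated with a short calculation. This sketch uses the 64 MB block size from the slide (recent Hadoop releases default to 128 MB); the 10 GiB file is a hypothetical example.

```python
import math

def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    """Return (block_count, raw_bytes_stored) for a single file.

    The last block only occupies as much space as the data it holds, so
    raw storage is the file size times the replication factor.
    """
    block_count = math.ceil(file_bytes / block_bytes)
    return block_count, file_bytes * replication

# Hypothetical 10 GiB orthomosaic.
blocks, raw = hdfs_footprint(10 * 1024**3)
print(blocks, raw / 1024**3)
```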
Extension of HDFS to Drone Data
• HDFS cannot be used directly for managing raster data
• HDFS has no awareness of the content of these files.
• HDFS is ideally suited for write-once and read-many times use cases
• HDFS works best with a smaller number of large files
NoSQL Databases
• Relational databases cannot provide on demand scalability.
• NoSQL Offers at least three advantages:
• Data Modeling (rapidly changing), Scalability, High Availability
Key Value
• The key-value database uses a map where
• Each key is associated with one and only one value in a collection. This kind of relationship is
referred to as a key-value pair.
• Value can be anything, including image, JSON, flexible schemas.
• Advantages:
• Simple data format makes write and read operations fast.
Key Value
Key Value
Key( )
Example: Space-Filling Curve
• How to create a key for drone image?
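One possible answer, in the spirit of the space-filling curve mentioned above, is a Z-order (Morton) code built from a tile's column and row. This is a sketch of that one technique, chosen here for illustration; the function names and key format are assumptions, and other curves (such as Hilbert) would work too.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code

def tile_key(col, row, bits=16):
    """Fixed-width hex key, so lexicographic key order equals Z-order."""
    return format(interleave_bits(col, row, bits), f"0{bits // 2}x")

# A plain dict stands in for the key-value store.
store = {tile_key(3, 5): b"...tile bytes..."}
```

Tiles that are close in space tend to get close keys, so a range scan over keys touches spatially neighbouring tiles.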
NoSQL Databases
NoSQL databases are not natively suited to drone data; they need to be
adapted.
Open research problem
2- The computing part
• With storage distributed, recall that computing is also
distributed in a Big Data architecture
• Pipeline of a Big Data Query
• 1. The end user writes a query Q
• 2. The system distributes Q over the cluster
• 3. Cluster servers compute individual subqueries
• 4. Subquery answers are aggregated and returned to the end user
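The four steps above can be imitated on one machine with a thread pool standing in for the cluster. The partition contents and node names are hypothetical; the point is only that each node returns a partial answer and the partials are merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node holds one partition of per-image NDVI means.
partitions = {
    "node_a": [0.61, 0.58, 0.64],
    "node_b": [0.42, 0.47],
    "node_c": [0.70],
}

def subquery(values):
    """Step 3: a node returns a partial (sum, count) for its partition."""
    return sum(values), len(values)

def run_query(partitions):
    """Steps 2 to 4: fan the query out, then merge partial answers into a mean."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(subquery, partitions.values()))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

mean_ndvi = run_query(partitions)
```

Returning (sum, count) rather than a per-node mean matters: averaging the node means would weight small partitions too heavily.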
The computing part
Computing Model: Hadoop/MapReduce, Spark/Spark SQL
We have at least two interesting computing models
Spark vs Hadoop MapReduce
Source: Data Flair
We will focus Next on Apache Spark
According to benchmark studies, Spark is much faster than Hadoop
MapReduce for many workloads
• Spark is a distributed computing engine that lets you work with distributed data
as a collection
• A (mostly) in-memory data processing engine
One of the fastest Big Data computing engines
• Not only Spark, but also other related projects
Two (or three!) Abstractions
• To handle computing over large datasets, Apache Spark represents
them through two abstractions
• RDD (program with Scala)
• DataFrame (Dataset!) (query with SQL)
• Abstracts away (partially) the complexities of distributed computing
RDD data abstraction
Resilient
• Able to recompute missing or damaged partitions after node failures.
Partitioned
• Records are split into logical partitions and distributed across nodes.
In-Memory
• Data inside an RDD is kept in memory as much (size) and as long (time) as possible.
Immutable
• It does not change once created and can only be transformed into new RDDs.
Lazily evaluated
• Data inside an RDD is not available or transformed until an action is
executed (which triggers the execution).
Cacheable
• You can hold all the data in persistent "storage" such as memory (the default and
most preferred) or disk.
• In this approach, Spark transforms a data source into RDD
(collections of elements that can be operated on in parallel)
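Immutability and lazy evaluation are easier to see in a toy model. The class below is a single-machine stand-in written for this illustration; it is not the Spark API and does no partitioning or distribution.

```python
class LazyCollection:
    """Toy stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)      # recorded transformations (the lineage)

    def map(self, fn):
        # A transformation returns a NEW collection; nothing is computed yet.
        return LazyCollection(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # An action: replay the recorded lineage over the source data.
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

calls = []

def double(x):
    calls.append(x)                     # record when the function really runs
    return 2 * x

base = LazyCollection([1, 2, 3, 4])
derived = base.map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)       # still empty: no action has run
result = derived.collect()
```

Because the lineage is kept alongside the source, a lost result can be rebuilt by replaying it, which is the idea behind the "Resilient" property.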
Dataframe abstraction
• In this approach, Spark SQL creates a tabular view over your data
• Then SQL comes into play, with built-in query optimization
Spark RDD vs Dataframe
• Dataframe has Advantages of RDD and More:
• Unlike RDD:
• You can write programs as SQL queries instead of Scala
• Optimization done automatically
Analytics with Spark
• Spark proposes a very easy pattern to
follow.
• Use Dataframe as starting point in
analytics
• Works well in a distributed environment
Recap
• Drones are a good use case for big data technology
• We need to reinvent approaches for storing and computing
• Solution is to distribute Storage and Computing
Is it possible to have the same pattern
with Drone Data?
The answer is ……
Frameworks for Raster Big Data
Apache Spark / Spark SQL
• Rasterframes (My favorite)
Earth AI (To follow)
Google Earth Engine
Rasdaman
SciDB
• Spark project for Raster Data
• Spark Dataframe like abstraction for handling Raster Data : Provides ability to work with
Raster imagery in a convenient yet scalable format
• You can use Spark ML for building ML Models
B1
B2
B3
B4
tile or tile_n (where n is a band number)
ML Pipeline for Raster Data
• 1- You ingest raster data
• 2- You construct a dataframe
• 3- You apply machine learning and statistics over your data
Source: astraea.earth
RasterFrames Data sources
• Raster data can be read from a number of sources.
• Through the Spark SQL DataSource API, RasterFrames can be constructed from
collections of:
• (preferably Cloud Optimized) GeoTIFFs,
• GeoTrellis Layers,
• an experimental catalog of Landsat 8 and MODIS data sets on the Amazon Web Services
(AWS) Public Data Set (PDS).
• Support for the evolving SpatioTemporal Asset Catalog (STAC) specification.
Source: astraea.earth
Standard Tile Operations
• Many raster operations are ready to be executed in a distributed manner : can be
executed over Spark Cluster
• Ready to use
RasterFrames: SQL Query
• Such operations can be used as predicate over tile column (like any DBMS
operator):
• Give me Min, Mean, Max over all tiles (image)… and group them by a certain key
(alphanumerical, spatial, temporal, spatio-temporal key )
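What "min, mean, max over all tiles, grouped by a key" means can be shown without a cluster. In this pure-Python sketch the rows, keys, and cell values are hypothetical, and the function is an illustration rather than the RasterFrames implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (grouping key, cell values of one tile).
rows = [
    ("2017-01", [0.2, 0.4, 0.6]),
    ("2017-01", [0.1, 0.3]),
    ("2017-02", [0.5, 0.7]),
]

def agg_stats_by_key(rows):
    """Return {key: {"min", "mean", "max"}} over all cells of all tiles per key."""
    cells = defaultdict(list)
    for key, tile in rows:
        cells[key].extend(tile)
    return {
        key: {"min": min(values), "mean": mean(values), "max": max(values)}
        for key, values in cells.items()
    }

stats = agg_stats_by_key(rows)
```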
RasterFrames: SQL Query
• Can I Use spatial predicate in my query: intersection query?
SQL query in Rasterframes
SELECT month, ndvi_stats.*
FROM (
SELECT month, rf_agg_stats(rf_normalized_difference(nir, red)) AS ndvi_stats
FROM red_nir_tiles_monthly_2017
WHERE st_intersects(st_reproject(rf_geometry(red), rf_crs(red), 'EPSG:4326'),
st_makePoint(34.870605, -4.729727))
GROUP BY month
ORDER BY month
)
• Compute the average NDVI per month for a single tile in an Area of
Interest
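The spatial predicate in the WHERE clause keeps only tiles whose footprint contains the given point. A simplified sketch of that test, using axis-aligned bounding boxes in place of full geometries, could look like this; the tile names and extents are hypothetical, and only the point comes from the query.

```python
def bbox_intersects_point(bbox, point):
    """True if point (x, y) lies inside or on the edge of bbox (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax

# Hypothetical tile extents in EPSG:4326 (longitude, latitude in degrees).
tiles = {
    "tile_a": (34.0, -5.0, 35.0, -4.0),
    "tile_b": (35.0, -5.0, 36.0, -4.0),
}
aoi_point = (34.870605, -4.729727)      # the point used in the query

hits = [name for name, bbox in tiles.items() if bbox_intersects_point(bbox, aoi_point)]
```

The query reprojects each tile geometry to EPSG:4326 first because the comparison is only meaningful when both geometries share a coordinate reference system.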
Demo
• https://beta.earthai.astraea.earth/user/hajjihi@gmail.com/lab?
All that is good, but…
• I hate creating and configuring clusters (admin tasks)
• I want to focus on my business problems, not on technical problems
• Can I have a cloud solution that does that for me:
• Let me work at scale (TB of data)
• Provisioning large cluster for my storage and computing
• Equipped with up-to-date ML techniques
• With visual interface for composing my ML pipeline
Earth AI
• is a Cloud-native software that enables you to apply advanced machine
learning algorithms to EO data at scale
• Both a non-code-based visual interface and pre-built workflows
• Ready-To-Use Datasets
• data archive includes more years of historical imagery and scientific datasets
• Elastic Compute
• Designed for scalability from the beginning, the Earth AI platform scales seamlessly, so
you can think more about insights than DevOps
Earth AI
Earth AI
• Classifying an ecoregion using Decision Tree Classifier
Earth AI
Google Earth Engine
• Yet another planetary-scale platform for Earth science data & analysis
• Ready-To-Use Datasets
• The public data archive includes more than thirty years of historical imagery and scientific
datasets, updated and expanded daily. It contains over twenty petabytes of geospatial data
instantly available for analysis.
• https://developers.google.com/earth-engine/datasets/catalog/
Google Earth Engine
• Web-based code editor for fast, interactive algorithm development with instant
access to petabytes of data: https://code.earthengine.google.com/
Google Earth Engine
• Google proposes:
• Earth Engine — geospatial analysis platform
• Earth Engine Data Catalog — comprehensive archive of geospatial data (including
NLCD)
• TensorFlow — machine learning platform with FCNN capabilities
• AI Platform — TensorFlow model training
• Colab — Jupyter notebook server for workflow development
Earth AI vs GEE: Quick comparison
• GEE is a closed platform
• GEE is limited from a storage and processing perspective
• GEE is really only a research system in today’s implementation. It is not
licensed for commercial use.
• RasterFrames and EarthAI, by contrast, are commercial systems. RasterFrames
open source code is scrupulously managed under Eclipse Foundation's
LocationTech project to ensure you can rely on it for commercial deployments.
SpatioTemporal Asset Catalogs
• New hot topic in Spatial Big Data
• Enabling online search and discovery of geospatial assets
• “The SpatioTemporal Asset Catalog (STAC) specification provides a common
language to describe a range of geospatial information, so it can more easily
be indexed and discovered. A 'spatiotemporal asset' is any file that represents
information about the earth captured in a certain space and time.”
• “The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point
Clouds, Data Cubes, Full Motion Video, etc) to expose their data as
SpatioTemporal Asset Catalogs (STAC), so that new code doesn't need to be
written whenever a new data set or API is released.”
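To make the idea concrete, here is what a minimal STAC Item for one drone orthomosaic might look like, written as a Python dictionary. The top-level field names follow the STAC Item specification; the id, coordinates, timestamp, and URL are all hypothetical values chosen for illustration.

```python
import json

# Hypothetical STAC Item; every value below is illustrative.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-flight-0001-ortho",
    "bbox": [34.80, -4.80, 34.90, -4.70],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [34.80, -4.80], [34.90, -4.80], [34.90, -4.70],
            [34.80, -4.70], [34.80, -4.80],
        ]],
    },
    "properties": {"datetime": "2019-10-10T09:30:00Z"},
    "links": [],
    "assets": {
        "ortho": {
            "href": "https://example.com/flights/0001/ortho.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}

def in_time_range(item, start, end):
    """ISO 8601 UTC timestamps in the same format compare correctly as strings."""
    return start <= item["properties"]["datetime"] <= end

print(json.dumps(item, indent=2)[:80])
```

Because the item is plain GeoJSON with a timestamp, a catalog can index it by space and time without opening the image itself.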
• Technically, rasdaman is a domain independent
Array DBMS, which makes it suitable for all
applications where raster data management is an
issue.
• The petascope component of rasdaman adds on
geo semantics for example, with full support for
the OGC standard interfaces WCS, WCPS, WCS-T,
and WMS
SciDB
• Array-based data management and analytical system
• Arrays are divided into equally sized chunks
• Chunks are distributed over many SciDB instances
• Size and shape of chunks are defined by users per array and have
strong effects on computation times
• Storage is nearly sparse
• Relies on shared nothing architectures
• Open-source version available, extensible by UDFs
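The chunking idea can be sketched independently of SciDB: with equally sized chunks, integer division gives the chunk that holds a cell. The array shape, chunk shape, and placement rule below are assumptions for illustration; in particular, the placement rule is not SciDB's own.

```python
import math

def chunk_of(cell, chunk_shape):
    """Coordinates of the chunk that holds `cell`, for equally sized chunks."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))

def chunk_count(array_shape, chunk_shape):
    """Number of chunks needed to cover the array (edge chunks may be partial)."""
    return math.prod(math.ceil(a / s) for a, s in zip(array_shape, chunk_shape))

def instance_of(chunk, instance_count):
    """Hypothetical placement rule: spread chunks by the sum of their coordinates."""
    return sum(chunk) % instance_count

array_shape = (10_000, 8_000)       # hypothetical raster, rows x columns
chunk_shape = (1_000, 1_000)        # user-chosen chunk size
chunk = chunk_of((2_500, 130), chunk_shape)
```

Smaller chunks give more parallelism but more bookkeeping per chunk, which is one reason chunk size and shape affect computation times.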
Thanks
Questions?
Processing Drone data @Scale

Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Processing Drone data @Scale

  • 1. Mapping using Fixed-wing Drone Ting Wen Ong | Operation Manager FEDS Drone-powered Solutions
  • 2. Agenda • Big Data/AI and Drone • Opportunities • Challenges, Why is it Hard? • Big Data Challenges… • Toward a new architecture for drone Big Data • Partitioning • Storage • Computing • Some existing Big Data/AI frameworks for Drone
  • 3. Audience Poll • How many of you have used Big Data/AI techniques? Hadoop ? Spark ? Tensorflow?
  • 4. Big Data / AI and Drone
  • 6. Reminder about Big Data • “Big data …encompasses the volume of information, the speed at which it is created and collected, and the variety of the data points being covered. ” source investopedia.com • It becomes essential to many companies’ success in today’s business landscape (Finance, Banking, Google, Facebook, …)
  • 7. Reminder about AI • …is Learning from an amount of data to get new insights, and to help in predicting tasks • Many approaches have been developed to learn from data (of various forms: text, DB, Image, video…): Deep Neural Networks based solutions • The more data available, the more effective the learning is and the more accurate the prediction task is.
  • 8. Opportunities of Big Data and Drone • Drone data are a good example of what Big Data technology was created for: storage and computing • Drones can capture, store, and transmit data, giving businesses the opportunity to integrate more data into their current processes
  • 9. Opportunities of AI and Drone • With such a large amount of drone data available, AI can learn new insights and help with prediction tasks • Farmers use drones for agriculture • Helping to predict crop yields • Drones for thermal imaging • Used for construction and maintenance
  • 10. Very good, but…… • The potential of drone data is often underestimated • Archiving collected data • Currently, we do more archiving than efficient management of drone data • Almost no existing Big Data infrastructure can handle drone data efficiently, • Even though Big Data is almost mature in other domains: Finance, Banking… • Often it is • Hard to store • Hard to manage • Hard to process • Hard to get insight • How ???
  • 11. Hard to store: Volume • A very small drone project can generate more than 10 GB, sometimes more than 40 GB • 15 million drone images can make up more than 175 terabytes of data • How to store and compute on such a growing volume? • FEDS: 13,000 flights this year
  • 12. Hard to store: Variety • “Drones can now provide a wide variety of data types, everything from a few basic photos through to complex measurable 3D models with annotations and overlays.” • Visual Encyclopedia of drone data: • Aerial Photography and Video • Orthomosaic Map • Digital Elevation Model (DEM) • 3D Pointcloud Model • Multispectral Mapping • Thermal Imagery and Mapping
  • 13. Hard to process: Computing Model and Scalability • Currently, drone image processing is done on one server: NOT SCALABLE • Scalability is the property of a system to handle a growing amount of work by adding resources to the system • In Big Data, it is mostly achieved by distributing storage and computing • Distributed computing can provide scalability, but making it drone-data friendly is difficult • Processing/querying drone data can take up to a few hours • Objective: real time (a few seconds)
  • 14. Hard to get Insights • Going beyond traditional algorithms • Why not use neural networks, which have had great success with images: ▪ Semantic segmentation ▪ Object recognition, classification… ▪ Description generation for drone images using an attribute attention mechanism • But these new algorithms require more storage capacity and computing power
  • 15. Recall that Drone Data are a bit similar to Raster data structures • Aerial imagery • Satellite imagery • Climate data (netCDF, …)
  • 16. Currently, How Are Drone Data Stored? • Internal Storage: for short-term storage before editing (hobbyist users) • SD Cards: The majority of drones use SD or micro SD cards as their standard storage option. • Cloud Storage: The benefit of a cloud-based system is that you can access your data anywhere in the world by logging on to your account. • Label and Organize Your Files: You save each session chronologically by date with additional information such as the location of your shoot or the client it was for.
  • 17. Currently, How Are Drone Data Managed? • File Systems • GIS • Drone Software • Good, but what about scalability?
  • 18. Current approaches are obsolete; we need to reinvent everything • Storage: access, availability • Computing: fast, accurate • Analytics: Machine Learning, Deep Learning • Search: by semantics, by spatial queries, …
  • 19. 1. New architecture to be redefined • Distributing both STORAGE AND COMPUTING • [Diagram: Analytical Queries (Large Scale Time series NDVI), Structured Storage Cluster, Computing Cluster]
  • 20. 2. Need to correlate drone data with external datasets More Insights Census Data Economic Data Weather …
  • 21. 3. Toward a declarative language (SQL-like) over drone data • Change in NDVI over the spring and early summer of 2018 • SELECT normalized_difference(nir, red) AS ndvi FROM Feds_droneDataset WHERE date BETWEEN '10-10-2017' AND '10-10-2019' • Examples from '10-10-2017' to '10-10-2019' • Best option for Data Scientists
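
The `normalized_difference(nir, red)` call in the query on slide 21 corresponds to the usual NDVI formula, (NIR - Red) / (NIR + Red), applied cell by cell. A minimal pure-Python sketch follows, assuming each band is a small list-of-lists tile. The sample reflectance values are made up, and returning 0.0 where both bands are zero is an assumption for illustration, not something the slides specify.

```python
def normalized_difference(nir, red):
    """Per-cell (nir - red) / (nir + red) over two equally sized tiles.

    Cells where nir + red == 0 are mapped to 0.0 (an illustrative choice).
    """
    return [
        [(n - r) / (n + r) if (n + r) != 0 else 0.0
         for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]


# Hypothetical 2x2 reflectance tiles (values invented for the example).
nir = [[0.8, 0.6], [0.5, 0.0]]
red = [[0.2, 0.2], [0.5, 0.0]]

ndvi = normalized_difference(nir, red)
for row in ndvi:
    print(row)
```

In a distributed engine the same function would be applied tile by tile, which is why a declarative query can hide the per-cell arithmetic from the data scientist.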
  • 22. Drone Big Data: we will focus on three aspects • Storage: HDFS, NoSQL Database, Data Lake • Computing: MR, Spark • Analytics: ML, DL
  • 23. Recall that storage should be distributed across a cluster • Before detailing storage techniques, let’s talk about Partitioning • [Diagram: Structured Storage Cluster with Node A, Node B, …, Node F, Node G]
  • 24. Challenge for going distributed: Data Partitioning • Partitioning is the process of physically dividing data into separate data stores • Data is divided into partitions that can be managed and accessed separately • [Diagram: Node 1, Node 2, Node 3, Node 4]
  • 25. By Band • A first simple approach is to partition by band • [Diagram: an RGB image split into Red Band, Green Band, and Blue Band across Node 1, Node 2, and Node 3]
  • 26. By Time • Another simple approach is to partition by time (season) • [Diagram: Spring, Summer, and Autumn data across Node 1, Node 2, and Node 3]
  • 27. But the most efficient approach is to combine Tiling and Distribution • Decompose into NxN regular grids • Tiling allows large raster datasets to be broken up into manageable pieces, which gives a higher-level raster I/O interface • [Diagram: Node 1, Node 2, Node 3]
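
The tiling-plus-distribution idea on slide 27 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raster size, the 4x4 grid, the node names, and the round-robin placement are all hypothetical, and real systems typically place tiles by hashing or by a spatial key rather than round-robin.

```python
def tile_bounds(width, height, n):
    """Split a width x height raster into an n x n grid of pixel windows.

    Returns tuples (row, col, x0, y0, x1, y1); x1 and y1 are exclusive.
    """
    tiles = []
    for row in range(n):
        for col in range(n):
            x0 = col * width // n
            x1 = (col + 1) * width // n
            y0 = row * height // n
            y1 = (row + 1) * height // n
            tiles.append((row, col, x0, y0, x1, y1))
    return tiles


def assign_to_nodes(tiles, nodes):
    """Round-robin placement of tiles on nodes (an illustrative assumption)."""
    placement = {node: [] for node in nodes}
    for i, tile in enumerate(tiles):
        placement[nodes[i % len(nodes)]].append(tile)
    return placement


# Hypothetical 1000 x 800 pixel raster, 4 x 4 grid, three nodes.
tiles = tile_bounds(1000, 800, 4)
placement = assign_to_nodes(tiles, ["node1", "node2", "node3"])
for node, node_tiles in placement.items():
    print(node, len(node_tiles))
```

Because every tile is an independent window, each node can read and process its own tiles without touching the rest of the raster.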
  • 28. Which Partition strategy to choose? • Not in the scope of this presentation • Check with your main objective: • If for Scalability, • If for Query Performance, • If for Availability • Many Best practices are available • Sometimes we make use of Global Index for Optimizing Queries
  • 29. 1- Distributed Storage techniques Quick reminder
  • 30. HDFS - Hadoop Distributed File System • The most basic data store for Big Data • HDFS breaks very large files down into large blocks (for example, 64 MB each) • and stores three copies of each block on different nodes in the cluster to protect against machine failures • The default replication factor is 3 (every block is stored on three machines)
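
The block-and-replica arithmetic on slide 30 can be made concrete with a small Python sketch. It uses the slide's 64 MB example block size (recent Hadoop releases default to 128 MB) and a replication factor of 3. The file size, node names, and the "next nodes in a ring" placement are assumptions for illustration; real HDFS placement is rack-aware and decided by the NameNode.

```python
import math


def split_into_blocks(file_size_mb, block_size_mb=64):
    """Number of blocks needed for a file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=3):
    """Toy placement: block b goes to `replication` consecutive nodes in a ring.

    This is NOT the HDFS placement policy, only a stand-in that keeps the
    replicas of a block on distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + k) % len(nodes)] for k in range(replication)]
        for b in range(num_blocks)
    }


# Hypothetical 1000 MB orthomosaic stored on a five-node cluster.
num_blocks = split_into_blocks(1000)
layout = place_replicas(num_blocks, ["A", "B", "C", "D", "E"])
raw_storage_mb = 1000 * 3  # every byte is stored three times
print(num_blocks, layout[0], raw_storage_mb)
```

The sketch makes the storage cost visible: with three replicas, the cluster needs three times the logical file size in raw capacity.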
  • 31. Extension of HDFS to Drone Data • HDFS cannot be used directly for managing raster data • HDFS has no awareness of the content of these files. • HDFS is ideally suited for write-once and read-many times use cases • HDFS works best with a smaller number of large files
  • 32. NoSQL Databases • Relational databases cannot provide on demand scalability. • NoSQL Offers at least three advantages: • Data Modeling (rapidly changing ), Scalability, High Availability
  • 33. Key Value • A key-value database uses a map where • each key is associated with one and only one value in a collection. This kind of relationship is referred to as a key-value pair. • The value can be anything, including an image, JSON, or flexible schemas. • Advantages: • The simple data format makes write and read operations fast.
  • 34. Key Value • How to create a key for a drone image? • Example: Space Filling Curve
  • 35. NoSQL Databases • NoSQL databases are not natively suited to drone data and need to be adapted • Open research problem
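
One common space-filling curve is the Z-order (Morton) curve, which interleaves the bits of two grid coordinates into a single integer. The slides do not say which curve is meant, so the sketch below is only one possible answer to "how to create a key for a drone image". The function names, the 16-bit resolution, and the linear longitude/latitude quantization are assumptions for illustration.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) code: bits of x go to even positions, y to odd ones."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def tile_key(lon, lat, bits=16):
    """Quantize a lon/lat point onto a 2^bits x 2^bits grid and return its key."""
    max_cell = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * max_cell)
    y = int((lat + 90.0) / 180.0 * max_cell)
    return interleave_bits(x, y, bits)


# Keys for the four cells of a 2 x 2 grid follow the "Z" pattern 0, 1, 2, 3.
print([interleave_bits(x, y) for y in range(2) for x in range(2)])

# A key for an image centred on a hypothetical point of interest.
print(tile_key(34.870605, -4.729727))
```

Keys built this way tend to keep spatially close images close in key order, which is what makes range scans in a key-value store useful for spatial queries.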
  • 36. 2- The computing part • With data storage distributed, recall that computing is also distributed in a Big Data architecture • Pipeline of a Big Data query • 1. The end user writes a query Q, • 2. The system distributes this query Q over the cluster • 3. Cluster servers compute individual subqueries • 4. Subquery answers are aggregated and returned to the end user
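
The four-step query pipeline on slide 36 is a scatter-gather pattern. The sketch below imitates it on one machine with Python's standard `concurrent.futures`, using threads as stand-ins for cluster servers. The partition contents (NDVI-like values), the node names, and the choice of a mean as the query are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each "node" holds part of the dataset.
partitions = {
    "node1": [0.21, 0.35, 0.40],
    "node2": [0.55, 0.62],
    "node3": [0.10, 0.18, 0.25, 0.30],
}


def subquery(values):
    """Step 3: each server computes a partial answer (sum and count)."""
    return sum(values), len(values)


def run_query(parts):
    """Steps 2 and 4: scatter the subquery, then aggregate the partial answers."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(subquery, parts.values()))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Step 1: the end user asks for the mean over the whole dataset.
mean_value = run_query(partitions)
print(round(mean_value, 4))
```

Returning a (sum, count) pair rather than a per-node mean is the design point: partial means cannot be averaged correctly when partitions have different sizes.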
  • 37. The computing part • We have at least two interesting computing models: • Hadoop/MapReduce • Spark/Spark SQL
  • 38. Spark vs Hadoop MapReduce • According to benchmark studies, Spark performs much better than Hadoop MapReduce (Source: Data Flair) • We will focus next on Apache Spark
  • 39. • Spark is a distributed computing engine that lets you work with distributed data as a collection • A (mostly) in-memory data processing engine • Fastest Big Data engine for computing • Not only Spark, but also other related projects
  • 40. Two (or three!) Abstractions • for handling computing over large datasets, Apache Spark transforms large datasets into two abstractions • RDD (program with scala) • Dataframe (Dataset!) (query with SQL) • Abstracts away (partially) the complexities of distributed computing
  • 41. RDD data abstraction • In this approach, Spark transforms a data source into an RDD (a collection of elements that can be operated on in parallel) • Resilient: able to recompute missing or damaged partitions due to node failures • Partitioned: records are split into logical partitions and distributed across nodes • In-Memory: data inside an RDD is stored in memory as much (size) and as long (time) as possible • Immutable: it does not change once created and can only be transformed into new RDDs • Lazily evaluated: data inside an RDD is not available or transformed until an action is executed (which triggers the execution) • Cacheable: you can hold all the data in persistent "storage" such as memory (the default and most preferred) or disk
  • 42. Dataframe abstraction • In this approach, Spark SQL creates a tabular view over your data • Then SQL comes into play, with built-in optimization
  • 43. Spark RDD vs Dataframe • Dataframe has the advantages of RDD and more: • Unlike RDD: • You can write programs as SQL queries instead of Scala • Optimization is done automatically
  • 44. Analytics with Spark • Spark proposes a very easy pattern to follow. • Use a Dataframe as the starting point in analytics • Works well in a distributed environment
  • 45. Recap • Drones are a good use case for Big Data technology • We need to reinvent approaches for storing and computing • The solution is to distribute storage and computing • Is it possible to have the same pattern with drone data? The answer is ……
  • 47. Frameworks for Raster Big Data • Apache Spark / Spark SQL: RasterFrames (my favorite) • Earth AI (to follow) • Google Earth Engine • Rasdaman • SciDB
  • 48. • Spark project for Raster Data • Spark Dataframe like abstraction for handling Raster Data : Provides ability to work with Raster imagery in a convenient yet scalable format • You can use Spark ML for building ML Models B1 B2 B3 B4 tile or tile_n (where n is a band number)
  • 49. ML Pipeline for Raster Data • 1- You ingest raster data • 2- You construct a dataframe • 3- You apply machine learning and statistics over your data Source: astrae aearth
  • 50. RasterFrames Data sources • Raster data can be read from a number of sources. • Through the Spark SQL DataSource API, RasterFrames can be constructed from collections of : • (preferably Cloud Optimized) GeoTIFFs, • GeoTrellis Layers • from an experimental catalog of Landsat 8 and MODIS data sets on the Amazon Web Services (AWS) Public Data Set (PDS). • support for the evolving Spatiotemporal Asset Catalog (STAC) specification. Source: astrae aearth
  • 51. Standard Tile Operations • Many raster operations are ready to be executed in a distributed manner : can be executed over Spark Cluster • Ready to use
  • 52. RasterFrames: SQL Query • Such operations can be used as predicate over tile column (like any DBMS operator): • Give me Min, Mean, Max over all tiles (image)… and group them by a certain key (alphanumerical, spatial, temporal, spatio-temporal key )
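
The "min, mean, max over all tiles, grouped by a key" request on slide 52 is an ordinary group-by aggregation. The sketch below shows the logic in plain Python rather than through the RasterFrames API, so it is only a stand-in for what the engine would do in a distributed way. The row layout, the `month` key, and the cell values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: one tile per row, with a grouping key and its cell values.
rows = [
    {"month": 1, "cells": [0.2, 0.4]},
    {"month": 1, "cells": [0.6]},
    {"month": 2, "cells": [0.1, 0.3, 0.5]},
]


def stats_by_key(tile_rows, key):
    """Group tiles by `key` and compute min, mean, and max over all their cells."""
    groups = defaultdict(list)
    for row in tile_rows:
        groups[row[key]].extend(row["cells"])
    return {
        k: {"min": min(v), "mean": mean(v), "max": max(v)}
        for k, v in sorted(groups.items())
    }


monthly_stats = stats_by_key(rows, "month")
for month, stats in monthly_stats.items():
    print(month, stats)
```

The grouping key could equally be a spatial or spatio-temporal key; only the value extracted from each row would change.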
  • 53. RasterFrames: SQL Query • Can I Use spatial predicate in my query: intersection query?
  • 54. SQL query in RasterFrames • Compute the average NDVI per month for a single tile in an Area of Interest • SELECT month, ndvi_stats.* FROM ( SELECT month, rf_agg_stats(rf_normalized_difference(nir, red)) AS ndvi_stats FROM red_nir_tiles_monthly_2017 WHERE st_intersects(st_reproject(rf_geometry(red), rf_crs(red), 'EPSG:4326'), st_makePoint(34.870605, -4.729727)) GROUP BY month ORDER BY month )
  • 57. All that is good, but… • I hate creating and configuring clusters (admin tasks) • I want to focus more on my business problems, not technical problems • Can I have a cloud solution that can do that for me: • Let me work at scale (TB of data) • Provisioning a large cluster for my storage and computing • Equipped with up-to-date ML techniques • With a visual interface for composing my ML pipeline
  • 58. Earth AI • Cloud-native software that enables you to apply advanced machine learning algorithms to EO data at scale • Both a no-code visual interface and pre-built workflows • Ready-To-Use Datasets • data archive includes more years of historical imagery and scientific datasets • Elastic Compute • Designed for scalability from the beginning, the Earth AI platform scales seamlessly, so you can think more about insights than DevOps
  • 60. Earth AI • Classifying an ecoregion using Decision Tree Classifier
  • 62. Google Earth Engine • Yet another planetary-scale platform for Earth science data & analysis • Ready-To-Use Datasets • The public data archive includes more than thirty years of historical imagery and scientific datasets, updated and expanded daily. It contains over twenty petabytes of geospatial data instantly available for analysis. • https://developers.google.com/earth-engine/datasets/catalog/
  • 63. Google Earth Engine • Web-based code editor for fast, interactive algorithm development with instant access to petabytes of data: https://code.earthengine.google.com/
  • 64. Google Earth Engine • Google proposes: • Earth Engine — geospatial analysis platform • Earth Engine Data Catalog — comprehensive archive of geospatial data (including NLCD) • TensorFlow — machine learning platform with FCNN capabilities • AI Platform — TensorFlow model training • Colab — Jupyter notebook server for workflow development
  • 65. Earth AI vs GEE: Quick comparison • GEE is a closed platform • GEE is limited from a storage and processing perspective • GEE is really only a research system in today’s implementation. It is not licensed for commercial use. • RasterFrames and EarthAI, by contrast are commercial systems. Rasterframes open source code is scrupulously managed under Eclipse Foundation's LocationTech project to ensure you can rely on it for commercial deployments.
  • 66. SpatioTemporal Asset Catalogs • New hot topic in Spatial Big Data • Enabling online search and discovery of geospatial assets • “The SpatioTemporal Asset Catalog (STAC) specification provides a common language to describe a range of geospatial information, so it can more easily be indexed and discovered. A 'spatiotemporal asset' is any file that represents information about the earth captured in a certain space and time.” • “The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point Clouds, Data Cubes, Full Motion Video, etc) to expose their data as SpatioTemporal Asset Catalogs (STAC), so that new code doesn't need to be written whenever a new data set or API is released.”
  • 67. • Technically, rasdaman is a domain-independent Array DBMS, which makes it suitable for all applications where raster data management is an issue. • The petascope component of rasdaman adds geo semantics, for example with full support for the OGC standard interfaces WCS, WCPS, WCS-T, and WMS
  • 68. SciDB • Array-based data management and analytical system • Arrays are divided into equally sized chunks • Chunks are distributed over many SciDB instances • Size and shape of chunks are defined by users per array and have strong effects on computation times • Storage is nearly sparse • Relies on shared nothing architectures • Open-source version available, extensible by UDFs