Presentation by Gerben de Boer (van Oord) at the Symposium Earth Observation and Data Science, during Delft Software Days - Edition 2017. Thursday, 2 November 2017, Delft.
DSD-INT 2017 The use of big data for dredging - De Boer
1. The use of big data for dredging
Gerben de Boer, Van Oord, Engineering, OpenEarth data management
Delft Software Days 2017
2. Big data philosophies: Statistics requires 30+ realizations

Brute force
• Hire right cloud provider: Hadoop / HDInsight, Spark, Cassandra, U-SQL, CosmosDB
• Hire: data scientist, data analytics manager, data architect, data engineer, statistician, DBA, business analyst, data analyst
• Relations: AI.
• Burn money on cloud providers.

Smart force
• Hire right people: thematic nerds (any engineering), software developer (py, js, sql), DevOps, sales/social, graphic designer
• Relations: business logic + physics.
• Burn money on wages.
5. SQL has almost no limits

For most users SQL is not big data. Only your wallet is a limiting factor.

Azure postgres
• Out of preview 15 Nov
• 1 TB

Azure SQL server
• 99.99% availability
• 35 days point-in-time restore
• 4 TB

Postgres in Azure VM
• We tried 0.5 TB, limited by SSD disk IO.
6. Overcome SQL limits: hybrid and noSQL

SQL
• Pure SQL: TB SQL database no problem
• Postgres is single threaded
• Use indexing, views, caching tools: think about the Content that needs to be Delivered (CDN)
• Postgres native jsonb datatype
• MS U-SQL can reach ascii files, and use R and python code

Hybrid
• Put (jsonb) as files on disk and load the subset you need, or when replication is needed
• csv, json, xml, yml, netCDF + many legacy formats
• Database as API, not archive: only an index to files on disk, e.g. TIFF PostGIS raster
• Van Oord vessel log = netCDF + PG index ("NASA technology")

noSQL = files
• Pure noSQL: structured folder with structured files: device/yy/mm/dd/signal
• Micro service to handle files on demand
• Regular expressions are your friend.
• netCDF/HDF was originally devised to overcome SQL limits
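The hybrid pattern above ("database as API, not archive": the database holds only an index to files on disk) can be sketched in a few lines of Python. The device/yy/mm/dd/signal folder layout comes from the slide; the table schema and function names are illustrative assumptions, and SQLite stands in for Postgres:

```python
# Hybrid sketch: raw data stays in a structured folder
# (device/yy/mm/dd/signal); the database is only an index to the
# file paths, not an archive of the data itself.
import os
import re
import sqlite3

def build_index(root, db=":memory:"):
    """Walk a device/yy/mm/dd/signal tree and index the file paths."""
    con = sqlite3.connect(db)
    con.execute("""CREATE TABLE IF NOT EXISTS files
                   (device TEXT, yy TEXT, mm TEXT, dd TEXT,
                    signal TEXT, path TEXT)""")
    # regular expressions are your friend: metadata lives in the path
    pattern = re.compile(
        r"(?P<device>[^/]+)/(?P<yy>\d{2})/(?P<mm>\d{2})/(?P<dd>\d{2})/(?P<signal>[^/]+)$")
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            m = pattern.search(path.replace(os.sep, "/"))
            if m:
                con.execute("INSERT INTO files VALUES (?,?,?,?,?,?)",
                            (*m.groups(), path))
    con.commit()
    return con

def select_paths(con, device, yy):
    """Query the index; only the selected subset is then read from disk."""
    rows = con.execute(
        "SELECT path FROM files WHERE device = ? AND yy = ?", (device, yy))
    return [r[0] for r in rows]
```

Only the index is queried; the actual payloads (netCDF, TIFF, csv) are loaded from disk on demand, which is the point of the pattern.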
7. OpenEarthRawData: partial checkout

How to get a local copy of a subset of the data?

vcs
• Git has a binary file extension, but Git cannot make a partial checkout.

WxS webservices
• Data to WxS on the server, WxS back to data by the client: 2 unnecessary processing steps.
8. Babbage: storage, bandwidth, compute

The first computer was designed to print gonio tables flawlessly. Now we replicate the algorithm, not the table.

Babbage: table vs calculator, 2 retrieval methods. The trade-off is made explicit by cloud pay-as-you-go:
• Storage: disk occupancy + IO operations
• Compute: CPU + memory
• Bandwidth: too slow: replicate the database vs replicate raw data + ETL

Cloud: copy a DB dump, or copy the raw data and rerun the ETL.
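Babbage's two retrieval methods can be made concrete with a toy Python example: ship a precomputed table (costs storage and bandwidth) or ship the algorithm and recompute on demand (costs CPU). The sine table is an invented stand-in for the gonio tables:

```python
# Two retrieval methods for the same answer, as in the slide.
import math

# method 1: replicate the table (storage + bandwidth)
table = {deg: math.sin(math.radians(deg)) for deg in range(0, 91)}

# method 2: replicate the algorithm (compute)
def lookup(deg):
    return math.sin(math.radians(deg))
```

Cloud pay-as-you-go pricing makes this trade-off explicit: a large `table` costs disk occupancy and IO on every copy, while `lookup` costs CPU on every call.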
13. WXS > CDN

WXS
• Good idea to stream graphics to screens: WMS. Limits grid data to what you can actually see.
• People actually use quad-trees, not WMTS: tiled.
• Use (geo)json for plotting vector data: plot.ly. geojson became OGC only in 2017, 9 years after conception!
• Bad idea to stream big data: WCS, WFS. Keep all processing in the datacenters; only graphical results.
• INSPIRE + OGC: not front-runners.

CDN
• CDN: content delivery network, the backbone behind YouTube and Netflix
• Makes datacenters geospatially redundant
• Rapidly replicates raw data files (tiff)
• Use your own ETL tools locally
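As a minimal sketch of "use (geo)json for plotting vector data": a GeoJSON FeatureCollection can be built with the standard library alone and handed to any web plotting tool. The coordinates and the `vessel` property are invented for illustration:

```python
# Build a GeoJSON FeatureCollection for a vector track
# (coordinates are illustrative lon, lat pairs near Delft).
import json

track = [(4.37, 51.99), (4.38, 52.00), (4.39, 52.01)]

feature_collection = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "LineString",
                     "coordinates": [list(p) for p in track]},
        "properties": {"vessel": "example"},
    }],
}
geojson_text = json.dumps(feature_collection)
```

Note GeoJSON's longitude-first coordinate order, which trips up many first-time users.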
16. Variety: parsing is ETL

Overload of historic data formats from sensor suppliers and SCADA systems: parsing.
• Datawell wave buoy: 30 kB of code to parse 93 bytes
• OGC SOS is not a solution: xml garbage.
• Satellite data is still very expensive.
Solutions are available: Google protobuffers.
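The real Datawell format is far more involved than any short example (per the slide: 30 kB of code to parse 93 bytes), so the following is only a generic sketch of unpacking a fixed-layout binary sensor record. The field layout is invented, not the actual buoy format:

```python
# Generic sketch of parsing a fixed-layout binary sensor message.
# Invented layout: big-endian unsigned short status flag,
# followed by three signed short wave samples.
import struct

RECORD = struct.Struct(">H3h")

def parse_record(payload: bytes) -> dict:
    """Unpack one 8-byte record into named fields."""
    status, *samples = RECORD.unpack(payload)
    return {"status": status, "samples": samples}
```

Every proprietary format needs code like this somewhere; schema-carrying formats such as protobuffers generate the equivalent parser for you.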
17. Share data and processing (Manhattan optimization)

ETL
• ETL processes are run once
• Database is considered an archive
• ETL removes some raw data features
• Collect once, maybe re-use many times
• Parsers do not evolve: waterfall
• Good for: known knowns

ELT: share code via github!
• In ELT the generic parsers run on each request
• Parsers can run on-the-fly in a micro-service
• All raw data features can be kept as parsers evolve
• Collect once, allow any future use
• Parsers evolve agile: extra from_* methods
• Good for: unknown unknowns
• parser.from_garbage() → parser.to_sql()
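A minimal sketch of the ELT idea: a parser class that evolves by gaining extra from_* methods while the raw data on disk stays untouched. The class and method bodies here are illustrative, not the actual OpenEarth API (the slide only shows `parser.from_garbage()` and `parser.to_sql()`):

```python
# ELT sketch: load raw data on the fly, transform only at request time.
import csv
import io
import json

class RawParser:
    """Parser that grows extra from_* methods as formats turn up."""
    def __init__(self):
        self.records = []

    def from_csv(self, text):
        self.records = list(csv.DictReader(io.StringIO(text)))
        return self

    def from_json(self, text):
        # added later, agile-style; the raw files were never changed
        self.records = json.loads(text)
        return self

    def to_rows(self):
        # the T step runs per request, not once at collection time
        return [tuple(r.values()) for r in self.records]
```

Because the transform runs per request, a bug fix or a new from_* method immediately applies to all historic raw data, which is the "unknown unknowns" advantage over run-once ETL.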
18. Datalake

• SQL server can now run R and python code
• Windows and linux can run the same containers

Big unstructured datalake:
• SQL sources + noSQL sources
• Brute force to run ELT jobs: Hadoop
• Economic trade-off: brains vs clouds

Codelake: a parser.from_garbage() for each source.
19. Big data reinvented the wheel (2)

L0 raw data → L0_L1 code → L1 products → L1_L2 code → L2 products → …
21. Micro services

Run micro services on top of the datalake, one for each specific question. This software needs to work at any data replication:
• Localhost
• Azure
• Amazon
• On-premise
• On-vessel
We need to make servers redistributable: CONTAINERS.
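A "micro service to handle files on demand" on top of the datalake can be sketched with the Python standard library alone. The datalake root and port are illustrative; a real deployment would package this in a container so the same service runs on localhost, Azure, Amazon, on-premise or on-vessel:

```python
# Minimal file-on-demand micro service over a datalake folder.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_datalake_server(root, port=0):
    """Build (but not start) an HTTP server rooted at the datalake folder.

    port=0 lets the OS pick a free port; pass e.g. 8080 in production.
    """
    handler = partial(SimpleHTTPRequestHandler, directory=root)
    return HTTPServer(("0.0.0.0", port), handler)
```

Usage: `make_datalake_server("/data/lake", 8080).serve_forever()` (the path is an assumed example). One such small service per specific question, rather than one monolith, is the architecture the slide argues for.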
22. OpenEarth: monthly Docker sprint session @ Microsoft NL, Schiphol
Van Oord, Deltares, TU Delft, KNMI, NLeSC, Sogeti, Microsoft, Maris, …
23. OpenEarth Docker Azure DigiShape

Organization
• Docker sprint session every month
• https://github.com/openearth-stack
• Van Oord, TU Delft, Deltares, Microsoft, NLeSC, KNMI, Maris
• Gerben.deboer@vanoord.com

Components
• Pyramid python web framework
• PostgreSQL
• KNMI Adaguc
• Geoserver
• …
25. Variety: Low-code Apps

Excel is our only big data nightmare: old, grey clerks and managers use Excel as paper. Manual data can be digitized with rapid apps. Low-code revolution: app-in-a-day.
http://www.janbanning.com/
26. Excel course: who ever read the instructions?
https://danjharrington.wordpress.com/2012/08/01/excel-logos-over-the-years/
Gerben J de Boer, Van Oord, E&E, OpenEarth Data Management