Slides by Stavros Papadopoulos (TileDB) and Jason Brown (Capella Space) from the joint TileDB-Capella Space webinar held in April 2022 on SAR and LiDAR data analytics.
1. Analyzing LiDAR & SAR data with Capella Space and TileDB
TileDB webinars - April 12, 2022
Stavros Papadopoulos, Founder & CEO of TileDB, Inc.
2. Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
45+ members with expertise across all applications and domains
Who we are
TileDB was spun out from MIT and Intel Labs in 2017
WHERE IT ALL STARTED
Raised over $20M from world-class investors
INVESTORS
3. The Problem
Low productivity for data analysts and scientists
Huge TCO for organizations
Organizations are drowning in a data infrastructure mess
Too many domain-specific file formats
Difficult to handle data beyond tables and SQL
Overly complex metadata handling and data sharing
Numerous vendors and in-house solutions
Difficult to govern all data holistically
4. The Solution | Universal Database
All Data. Faster. Cheaper.
Securely manage all your data assets and
supercharge your analytics, data science and
machine learning with a universal database
All data in one place
Superior performance, at a lower cost
Analytics, data science and ML
Holistic governance and collaboration
5. The Universal Database Pillars
All data in one place: Manage any type of data – tables, files, images, video, genomics, ML features, metadata, even flat files and folders – in a single powerful database.
Superior performance, at a lower cost: Structure all your data in a canonical, multi-dimensional array format, which adapts to any data shape and workload for maximum performance and minimum cost.
Analytics, data science and ML: Run data science and machine learning workloads in a single platform that unifies data management with analytics and scientific workloads.
Holistic governance and collaboration: Securely control access over all your data assets, and enable collaboration and reproducibility, while monitoring all activity in a centralized way.
6. The Secret Sauce | The Data Model
Store everything as dense or sparse multi-dimensional arrays
[Figure: example of a dense array and a sparse array]
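The dense/sparse distinction above can be illustrated with a minimal sketch in plain Python (not the TileDB API; names are illustrative): a dense array materializes a value for every cell in its domain, while a sparse array stores only the non-empty cells as coordinate–value pairs.

```python
# Dense 2D array: every cell in the 3x3 domain has a value.
dense = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Sparse 2D array: a mapping from coordinates to values;
# cells absent from the mapping are simply empty.
sparse = {(0, 2): 3, (2, 0): 7, (2, 2): 9}

def read_sparse(arr, i, j):
    """Return the cell value, or None for an empty cell."""
    return arr.get((i, j))

print(dense[0][2])                # 3
print(read_sparse(sparse, 0, 2))  # 3
print(read_sparse(sparse, 1, 1))  # None (empty cell)
```

The sparse representation pays the cost of storing coordinates explicitly, which is why it suits data such as point clouds where most of the domain is empty.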
8. Applications
What can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
Tables (1D dense or ND sparse)
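As one concrete case from the list above, here is a sketch of how a table maps to an array (plain Python, not the TileDB API; the sample data is made up): as a 1D dense array the row number is the implicit dimension, while as an ND sparse array chosen columns become the dimensions and the rest become attributes.

```python
# Hypothetical table rows: (name, age, city).
rows = [("alice", 34, "NYC"), ("bob", 27, "SF")]

# 1D dense modeling: the row id is the implicit integer dimension.
table_1d_dense = list(rows)

# 2D sparse modeling: dimensions (name, age) -> attribute (city),
# which enables direct slicing on those two columns.
table_2d_sparse = {(name, age): city for name, age, city in rows}

print(table_1d_dense[1])               # ('bob', 27, 'SF')
print(table_2d_sparse[("alice", 34)])  # 'NYC'
```

The choice of which columns become dimensions determines which predicates can be answered by slicing rather than scanning.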
9. How we built a Universal Database
Applications: SQL, ML & data science, distributed computing
APIs: efficient APIs and tool integrations with zero-copy techniques
TileDB Cloud – unified data management and easy serverless compute at global scale:
Access control and logging
Serverless SQL, UDFs, task graphs
Jupyter notebooks and dashboards
TileDB Embedded – open-source interoperable storage with a universal open-spec array format:
Parallel IO, rapid reads and writes
Columnar, cloud-optimized
Data versioning and time traveling
10. TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance: built in C++, fully parallelized, columnar format, multiple compressors, R-trees for sparse arrays
Rapid updates & data versioning: immutable writes, lock-free, parallel reader/writer model, time traveling
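The interplay of immutable writes and time traveling can be sketched as a toy model (plain Python, not TileDB internals; the fragment structure is simplified): each write lands as a new immutable "fragment" tagged with a timestamp, and a read at time t merges all fragments with timestamp <= t, later fragments winning on overlapping cells.

```python
fragments = []  # append-only list of (timestamp, {coords: value}) pairs

def write(timestamp, cells):
    # A write never mutates existing data; it appends a new fragment.
    fragments.append((timestamp, dict(cells)))

def read_at(timestamp):
    # Merge all fragments up to the requested timestamp,
    # with newer fragments overriding older ones on overlap.
    view = {}
    for ts, cells in sorted(fragments, key=lambda f: f[0]):
        if ts <= timestamp:
            view.update(cells)
    return view

write(1, {(0, 0): "a", (0, 1): "b"})
write(2, {(0, 1): "B"})  # an update is just another fragment

print(read_at(1))  # {(0, 0): 'a', (0, 1): 'b'}  -- the array as of t=1
print(read_at(2))  # {(0, 0): 'a', (0, 1): 'B'}  -- the latest view
```

Because fragments are never modified in place, readers and writers need no locks, and any past version remains reconstructible.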
13. TileDB Cloud
Built to work anywhere: works as SaaS (https://cloud.tiledb.com) or on premises; currently on AWS, soon on any cloud
It is completely serverless: slicing, SQL, UDFs, task graphs
Can launch Jupyter notebooks: on-demand JupyterHub instances
It is geo-aware: compute is sent to the data
It is secure: authentication, compliance, etc.
14. TileDB Cloud
Everything is monetizable: full marketplace (via Stripe)
Everything is shareable at global scale: access control inside and outside your organization; make any data and code public; discover any public data and code (central catalog)
Everything is an array!: Jupyter notebooks, UDFs and task graphs, ML models, dashboards (e.g., R Shiny apps), all types of data (even flat files)
Everything is logged: full auditability (data, code, any action)
15. SAR in TileDB
SAR data is stored in TileDB as 3D dense arrays
Rapid dense array slicing via implicit indexing on dimensions
Width, height, time are the dimensions
Cloud-native (rapid writes and reads)
Versioning and time traveling
Integration with GDAL
Visualization on TileDB Cloud
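The "implicit indexing on dimensions" above can be sketched in plain Python (not the TileDB API; the extents are made up): in a dense layout no per-cell coordinates are stored, because the position of cell (width, height, time) in the flat buffer follows directly from the dimension extents.

```python
W, H, T = 4, 3, 2  # hypothetical extents of a tiny (width, height, time) array

def offset(w, h, t):
    """Row-major flat offset of cell (w, h, t) -- no index lookup needed."""
    return (w * H + h) * T + t

# Flat buffer holding all W*H*T cell values.
buf = list(range(W * H * T))

def slice_all_times(w, h):
    """Slice one (w, h) pixel across the whole time dimension."""
    return [buf[offset(w, h, t)] for t in range(T)]

print(offset(1, 2, 1))        # (1*3 + 2)*2 + 1 = 11
print(slice_all_times(1, 2))  # [10, 11]
```

This is why dense slicing is so fast: the cells of any (width, height, time) window occupy computable positions, so a read translates straight into buffer ranges.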
16. LiDAR in TileDB
LiDAR data is stored in TileDB as 3D sparse arrays
Efficient indexing with R-trees and Hilbert curves
Native float indexing - e.g., A[123.34:124.22, 30.23:31.00, :]
Cloud-native (rapid writes and reads)
Versioning and time traveling
Schema evolution
Integration with PDAL and PCL
Visualization on TileDB Cloud
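The float-indexed query shown above, A[123.34:124.22, 30.23:31.00, :], can be illustrated with a plain-Python sketch (not the TileDB API; the point data is made up): points are (x, y, z) tuples and the slice keeps those whose x and y fall in the given inclusive ranges, with z unconstrained.

```python
# Hypothetical LiDAR points as (x, y, z) coordinates.
points = [
    (123.50, 30.50, 12.0),
    (123.90, 30.90, 15.5),
    (125.00, 30.50, 10.0),  # x outside the slice
    (123.40, 31.50,  9.0),  # y outside the slice
]

def slice_points(pts, x_rng, y_rng):
    """Keep points inside the inclusive x/y ranges; z is unconstrained."""
    (x0, x1), (y0, y1) = x_rng, y_rng
    return [p for p in pts if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]

hits = slice_points(points, (123.34, 124.22), (30.23, 31.00))
print(hits)  # [(123.5, 30.5, 12.0), (123.9, 30.9, 15.5)]
```

In the real system this filter is not a linear scan: the R-tree and Hilbert-curve ordering described above prune most of the data before any point is touched.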
17. Indexing
Given the non-empty domain, the space tile extents and the tile order, we can easily find that this slice overlaps the second and fourth tile.
A slicing query would just traverse the R-tree top-down, visiting only nodes/MBRs that intersect the slice.
[Figure: a 4x4 domain of cells numbered 1-16, partitioned into 2x2 space tiles in row-major tile order; a sparse fragment with col-major tile order, row-major cell order and tile capacity 2, whose four MBRs (MBR1-MBR4) form an R-tree stored in the fragment metadata]
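The pruning described on this slide can be sketched in plain Python (a simplified model, not TileDB internals): for a 4x4 domain split into 2x2 space tiles in row-major tile order, integer division locates the tiles a slice overlaps, and an axis-aligned overlap test decides which R-tree MBRs a query must visit.

```python
TILE = 2           # space tile extent per dimension
TILES_PER_ROW = 2  # a 4x4 domain of 2x2 tiles has 2 tiles per row

def overlapping_tiles(r0, r1, c0, c1):
    """1-based ids of space tiles intersecting rows r0..r1, cols c0..c1."""
    tiles = set()
    for tr in range(r0 // TILE, r1 // TILE + 1):
        for tc in range(c0 // TILE, c1 // TILE + 1):
            tiles.add(tr * TILES_PER_ROW + tc + 1)  # row-major tile order
    return sorted(tiles)

def intersects(mbr, slc):
    """Axis-aligned overlap test used while walking the R-tree top-down."""
    (a0, a1, b0, b1), (s0, s1, t0, t1) = mbr, slc
    return a0 <= s1 and s0 <= a1 and b0 <= t1 and t0 <= b1

# A slice over rows 0-3 and cols 2-3 overlaps the second and fourth tile,
# matching the example on the slide.
print(overlapping_tiles(0, 1, 2, 3))  # [2]
print(overlapping_tiles(0, 3, 2, 3))  # [2, 4]
```

Only the tiles and MBRs that survive these checks are fetched and decompressed, which is what keeps sparse slicing cheap.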
18. Machine Learning in TileDB
Fusion of SAR with LiDAR data in a single platform
Integration with TensorFlow, PyTorch and more
Storage of ML models on TileDB Cloud
A full-fledged platform for exploration, analytics and ML
21. SAR: A Window to See What Others Can't
Optical: only observable 25% of the time
With SAR: observable 100% of the time
High Revisit
Low Latency
Cloud & smoke piercing visibility
Night vision for the planet’s activity
22. Capella Space is Changing Access to Earth Information
Any Time, Any Weather
Frequent Revisit: high-cadence revisit with multiple imaging opportunities per day at various times of day/night
Very High-Resolution Imaging: Very High Resolution (VHR) and radiometrically enhanced multi-looked imagery with low noise
Fastest From Order to Delivery: fully automated tasking & data processing with the fastest order-to-delivery times available in the market
23. Capella SAR Imaging
Central Frequency: X-Band
Polarization: Single-Pol, HH or VV
Imaging Bandwidth: up to 500 MHz
Acquisition Direction: ascending + descending orbit direction; left + right look direction
Imaging Modes: Spotlight, Sliding Spotlight, Stripmap
SAR Imagery Products: Spot (spotlight imaging mode), Site (sliding spotlight imaging mode), Strip (stripmap imaging mode)
24. SAR Imagery Product Scenes
VERY HIGH RESOLUTION: Spot | 5 km x 5 km | 0.5 m
VERY HIGH RESOLUTION: Site | 5 km x 10 km | 1.0 m
HIGH RESOLUTION: Strip | 5 km x 20 km | 1.2 m
26. Capella Console
Simple-to-Use GUI
Task or purchase archived imagery via coordinates, AOI creation tool or
shapefile upload.
Fully automated and secure operations: satellite ops, SAR processing and data storage are cloud-based, fully confidential.
Real-Time Status Updates
New tasking scheduling in ≤ 15 minutes and users are
provided real-time status updates.
Predicted time of collection displayed to enable
timely post-imaging operations.
27. Capella API Integration
● Tip-and-cue scenario for immediate responsiveness via API integration. Existing system alerts can push task requests.
● React to emergencies in real time. Deliver data to teams on the ground hours after image capture.
[Diagram: Task the Capella Constellation / Queue from Existing Systems / Pull Scenes & Metadata]