Slides by Stavros Papadopoulos (TileDB) and Jason Brown (Capella Space) from the joint TileDB-Capella Space webinar held in April 2022 on SAR and LiDAR data analytics.
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
Purpose-built databases and platforms have actually created more complexity, effort, and unnecessary reinvention. The status quo is a big mess. TileDB took the opposite approach.
In this presentation, Stavros, the original creator of TileDB, shared the underlying principles of the TileDB universal database built on multi-dimensional arrays, making the case for it as a true first in the data management industry.
TileDB webinars - Nov 4, 2021
The document summarizes a webinar about TileDB, a universal data management platform that represents data as dense and sparse multi-dimensional arrays. It addresses the data management problems in population genomics by storing variant call data as 3D sparse arrays. TileDB provides a unified storage and serverless computing model that allows efficient data access and analysis at global scale through its open source TileDB Embedded storage and TileDB Cloud platform. The webinar highlights how TileDB solves data production, distribution, and consumption problems and empowers data sharing and collaboration through its marketplace and security features.
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...Stavros Papadopoulos
Slides used in the webinar TileDB hosted with participation from Spire Maritime, describing the use and accessibility of massive time series maritime data on TileDB Cloud.
Bradley Skelton, Chief Technology Strategist for Geospatial Portfolio at Hexagon Geospatial, looks at the increasing amount and variety of data available that can be turned into actionable information.
See more presentations from the FME User Conference 2014 at: www.safe.com/fmeuc
Irish Earth Observation Symposium 2014: Point Cloud Data Management with ERDA...IMGS
Rapid increase in the volume and number of imagery data sets places significant pressures on data managers – they need to be able to place new data sets in a controlled enterprise data management environment that ensures the maximum number of authorised users are able to find and access the resource.
Workshop
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
http://sarithdivakar.info/2015/12/09/wordcount-program-in-python-using-apache-spark-for-data-stored-in-hadoop-hdfs/
Roelof Pieters (Overstory) – Tackling Forest Fires and Deforestation with Sat...Codiax
This document provides an overview of Overstory, a company that uses satellite data and AI to monitor forests and tackle issues like deforestation and wildfires. It discusses how Overstory uses machine learning on high-resolution satellite imagery to create segmentation maps and monitor changes in forests over time. It also describes Overstory's infrastructure including its use of JupyterHub, Dask, and Papermill to enable large-scale distributed processing of satellite data and training of deep learning models.
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
Purpose-built databases and platforms have actually created more complexity, effort, and unnecessary reinvention. The status quo is a big mess. TileDB took the opposite approach.
In this presentation, Stavros, the original creator of TileDB, shared the underlying principles of the TileDB universal database built on multi-dimensional arrays, making the case for it as a true first in the data management industry.
TileDB webinars - Nov 4, 2021
The document summarizes a webinar about TileDB, a universal data management platform that represents data as dense and sparse multi-dimensional arrays. It addresses the data management problems in population genomics by storing variant call data as 3D sparse arrays. TileDB provides a unified storage and serverless computing model that allows efficient data access and analysis at global scale through its open source TileDB Embedded storage and TileDB Cloud platform. The webinar highlights how TileDB solves data production, distribution, and consumption problems and empowers data sharing and collaboration through its marketplace and security features.
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...Stavros Papadopoulos
Slides used in the webinar TileDB hosted with participation from Spire Maritime, describing the use and accessibility of massive time series maritime data on TileDB Cloud.
Bradley Skelton, Chief Technology Strategist for Geospatial Portfolio at Hexagon Geospatial, looks at the increasing amount and variety of data available that can be turned into actionable information.
See more presentations from the FME User Conference 2014 at: www.safe.com/fmeuc
Irish Earth Observation Symposium 2014: Point Cloud Data Management with ERDA...IMGS
Rapid increase in the volume and number of imagery data sets places significant pressures on data managers – they need to be able to place new data sets in a controlled enterprise data management environment that ensures the maximum number of authorised users are able to find and access the resource.
Workshop
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
http://sarithdivakar.info/2015/12/09/wordcount-program-in-python-using-apache-spark-for-data-stored-in-hadoop-hdfs/
Roelof Pieters (Overstory) – Tackling Forest Fires and Deforestation with Sat...Codiax
This document provides an overview of Overstory, a company that uses satellite data and AI to monitor forests and tackle issues like deforestation and wildfires. It discusses how Overstory uses machine learning on high-resolution satellite imagery to create segmentation maps and monitor changes in forests over time. It also describes Overstory's infrastructure including its use of JupyterHub, Dask, and Papermill to enable large-scale distributed processing of satellite data and training of deep learning models.
1. Analyzing LiDAR & SAR data
with Capella Space and TileDB
TileDB webinars - April 12, 2022
Stavros Papadopoulos
Founder & CEO of TileDB, Inc.
2. Who we are
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
45+ members with expertise across all applications and domains
WHERE IT ALL STARTED
TileDB was spun out from MIT and Intel Labs in 2017
INVESTORS
Raised over $20M from world-class investors
3. The Problem
Low productivity for data analysts and scientists
Huge TCO for organizations
Organizations are drowning in a data infrastructure mess
Too many domain-specific file formats
Difficult to handle data beyond tables and SQL
Overly complex metadata handling and data sharing
Numerous vendors and in-house solutions
Difficult to govern all data holistically
4. The Solution | Universal Database
All Data. Faster. Cheaper.
Securely manage all your data assets and supercharge your analytics, data science and machine learning with a universal database
All data in one place
Superior performance, at a lower cost
Analytics, data science and ML
Holistic governance and collaboration
5. The Universal Database Pillars
All data in one place
Manage any type of data – tables, files, images, video, genomics, ML features, metadata, even flat files and folders – in a single powerful database.
Superior performance, at a lower cost
Structure all your data in a canonical, multi-dimensional array format, which adapts to any data shape and workload for maximum performance and minimum cost.
Analytics, data science and ML
Run data science and machine learning workloads in a single platform that unifies data management with analytics and scientific workloads.
Holistic governance and collaboration
Securely control access to all your data assets, and enable collaboration and reproducibility, while monitoring all activity in a centralized way.
6. The Secret Sauce | The Data Model
Store everything as dense or sparse multi-dimensional arrays
Dense array
Sparse array
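As a concrete, minimal illustration of the two flavors, the sketch below defines a small dense array and a small sparse array with the open-source TileDB Python API; the array names, domains, and attributes are illustrative, not from the slides.

```python
import numpy as np
import tiledb

# Dense 2D array: every cell in a rectangular domain has a value.
dense_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="rows", domain=(0, 3), tile=2, dtype=np.int32),
        tiledb.Dim(name="cols", domain=(0, 3), tile=2, dtype=np.int32),
    ),
    sparse=False,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.Array.create("dense_example", dense_schema)

# Sparse 2D array: only non-empty cells are materialized, and dimensions
# may be floating point (useful for point clouds and geospatial data).
sparse_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="x", domain=(0.0, 100.0), tile=10.0, dtype=np.float64),
        tiledb.Dim(name="y", domain=(0.0, 100.0), tile=10.0, dtype=np.float64),
    ),
    sparse=True,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.Array.create("sparse_example", sparse_schema)
```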
8. Applications
What can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
Tables (1D dense or ND sparse)
9. How we built a Universal Database
Applications: SQL, ML & data science, distributed computing
APIs: efficient APIs and tool integrations with zero-copy techniques
TILEDB CLOUD: unified data management and easy serverless compute at global scale
Access control and logging
Serverless SQL, UDFs, task graphs
Jupyter notebooks and dashboards
TILEDB EMBEDDED: open-source interoperable storage with a universal open-spec array format
Parallel IO, rapid reads and writes
Columnar, cloud-optimized
Data versioning and time traveling
10. TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance
Built in C++
Fully parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
Rapid updates & data versioning
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
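A minimal sketch of immutable writes and time traveling with the open-source Python API, reusing the illustrative dense array from the earlier sketch; the timestamp value is made up.

```python
import numpy as np
import tiledb

# Each write creates a new immutable fragment; nothing is overwritten in place.
with tiledb.open("dense_example", mode="w") as A:
    A[0:2, 0:2] = np.ones((2, 2))

with tiledb.open("dense_example", mode="w") as A:
    A[0:2, 0:2] = np.full((2, 2), 2.0)

# Time traveling: open the array at an earlier timestamp (ms since epoch,
# illustrative value) to read the data exactly as it existed then.
with tiledb.open("dense_example", mode="r", timestamp=1649764800000) as A:
    print(A[0:2, 0:2]["value"])
```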
13. TileDB Cloud
Built to work anywhere
Works as SaaS: https://cloud.tiledb.com
Works on premises
Currently on AWS, soon on any cloud
It is completely serverless
Slicing, SQL, UDFs, task graphs
Can launch Jupyter notebooks
On-demand JupyterHub instances
It is geo-aware
Compute sent to the data
It is secure
Authentication, compliance, etc.
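For a flavor of the serverless model, here is a hedged sketch using the tiledb-cloud Python client; the array URI and query are illustrative, and the exact signatures of sql.exec / udf.exec may differ across client versions.

```python
import tiledb.cloud
from tiledb.cloud import sql, udf

tiledb.cloud.login(token="***")  # API token from cloud.tiledb.com

# Serverless SQL: the query runs next to the data, not on the client
# (array URI and query are illustrative).
df = sql.exec("SELECT AVG(`value`) FROM `tiledb://demo/my_array`")

# Serverless UDF: ship a Python function to the service for execution.
result = udf.exec(lambda x: x * 2, 21)
print(df, result)
```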
14. TileDB Cloud
Everything is monetizable
Full marketplace (via Stripe)
Everything is shareable at global scale
Access control inside and outside your organization
Make any data and code public
Discover any public data and code (central catalog)
Everything is an array!
Jupyter notebooks
UDFs and task graphs
ML models
Dashboards (e.g., R Shiny apps)
All types of data (even flat files)
Everything is logged
Full auditability (data, code, any action)
15. SAR in TileDB
SAR data is stored in TileDB as 3D dense arrays
Rapid dense array slicing via implicit indexing on dimensions
Width, height, time are the dimensions
Cloud-native (rapid writes and reads)
Versioning and time traveling
Integration with GDAL
Visualization on TileDB Cloud
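As a minimal sketch of that implicit dimension indexing, the snippet below slices a window out of a SAR scene stored as a 3D dense array; the URI, the dimension ordering, and the attribute name are assumptions.

```python
import tiledb

uri = "tiledb://demo/sar_scene"  # illustrative URI
with tiledb.open(uri, mode="r") as A:
    # Implicit indexing on the dimensions: a 512x512 window from the
    # first two acquisitions, without reading the rest of the scene.
    chip = A[0:512, 0:512, 0:2]
    print(chip["amplitude"].shape)  # attribute name is an assumption
```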
16. LiDAR in TileDB
LiDAR data is stored in TileDB as 3D sparse arrays
Efficient indexing with R-trees and Hilbert curves
Native float indexing - e.g., A[123.34:124.22, 30.23:31.00, :]
Cloud-native (rapid writes and reads)
Versioning and time traveling
Schema evolution
Integration with PDAL and PCL
Visualization on TileDB Cloud
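For illustration, here is a minimal Python sketch of the float slicing shown above against a 3D sparse LiDAR array; the array URI, dimension names and attribute names are assumptions.

```python
import tiledb

uri = "tiledb://demo/lidar_points"  # illustrative URI
with tiledb.open(uri, mode="r") as A:
    # Slice directly on floating-point X/Y ranges and all Z values;
    # only points whose coordinates fall inside the ranges are returned.
    pts = A[123.34:124.22, 30.23:31.00, :]
    print(pts["X"][:5], pts["Intensity"][:5])  # dim/attr names are assumptions
```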
17. Indexing
Dense arrays: given the non-empty domain, the space tile extents and the tile order, we can easily determine that this slice overlaps the second and fourth space tiles.
[Diagram: 4x4 grid of cells, 2x2 space tiles, row-major tile order]
Sparse arrays: a slicing query traverses the R-tree (stored in the fragment metadata) top-down, visiting only the nodes/MBRs that intersect the slice.
[Diagram: 2x2 space tiles, col-major tile order, row-major cell order, capacity 2, R-tree over MBR1-MBR4]
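The traversal described above can be sketched with a toy R-tree in Python; this is an illustration of the idea, not TileDB's C++ internals. Subtrees whose MBRs do not intersect the slice are pruned, so non-overlapping tiles are never fetched. The usage at the end echoes the 2x2 example above, where a slice overlaps only the second and fourth tiles.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    mbr: tuple                      # ((x_min, x_max), (y_min, y_max))
    children: List["Node"] = field(default_factory=list)
    leaf_id: Optional[int] = None   # set on leaves (a data tile / MBR)

def intersects(a, b):
    # Two boxes intersect iff their ranges overlap in every dimension.
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))

def slice_query(node, slc, hits):
    if not intersects(node.mbr, slc):
        return                      # prune this whole subtree
    if node.leaf_id is not None:
        hits.append(node.leaf_id)   # fetch only intersecting tiles
    for child in node.children:
        slice_query(child, slc, hits)

# Toy usage: a root covering four leaf MBRs; the slice overlaps MBR2 and MBR4.
leaves = [Node(((0, 1), (0, 1)), leaf_id=1), Node(((2, 3), (0, 1)), leaf_id=2),
          Node(((0, 1), (2, 3)), leaf_id=3), Node(((2, 3), (2, 3)), leaf_id=4)]
root = Node(((0, 3), (0, 3)), children=leaves)
hits = []
slice_query(root, ((2, 3), (0, 3)), hits)
print(hits)  # [2, 4]
```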
18. Machine Learning in TileDB
Fusion of SAR with LiDAR data in a single platform
Integration with TensorFlow, PyTorch and more
Storage of ML models on TileDB Cloud
A full-fledged platform for exploration, analytics and ML
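As a rough illustration of that workflow, the sketch below streams chips out of a TileDB dense array into a PyTorch DataLoader. The array URI, dimension layout, chip logic, and attribute name are assumptions, and the Dataset is a plain PyTorch class rather than a TileDB-provided integration.

```python
import tiledb
import torch
from torch.utils.data import Dataset, DataLoader

class ChipDataset(Dataset):
    """Serve fixed-size SAR chips straight out of a dense TileDB array."""
    def __init__(self, uri, chip=256, count=100):
        self.uri, self.chip, self.count = uri, chip, count

    def __len__(self):
        return self.count

    def __getitem__(self, i):
        with tiledb.open(self.uri, mode="r") as A:
            r = (i * self.chip) % 4096          # illustrative tiling of the scene
            data = A[r:r + self.chip, 0:self.chip, 0]["amplitude"]  # attr name assumed
        return torch.from_numpy(data).float()

loader = DataLoader(ChipDataset("tiledb://demo/sar_scene"), batch_size=8)
```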
21. SAR: A Window to See What Others Can't
Optical: only observable 25% of the time
With SAR: observable 100% of the time
High Revisit
Low Latency
Cloud & Smoke piercing visibility
Night Vision for the planet’s activity
22. Capella Space is Changing Access to Earth Information
Any time, Any Weather
Frequent Revisit: high-cadence revisit with multiple imaging opportunities per day at various times of day/night
Very High-Resolution Imaging: Very High Resolution (VHR) and radiometrically enhanced multi-looked imagery with low noise
Fastest From Order to Delivery: fully automated tasking & data processing with the fastest order-to-delivery times available in market
23. Capella SAR Imaging
Central Frequency: X-Band
Polarization: Single-Pol HH or VV
Imaging Bandwidth: Up to 500 MHz
Acquisition Direction: Ascending + Descending orbit direction; Left + Right look direction
Imaging Modes: Spotlight, Sliding Spotlight, Stripmap
SAR Imagery Products: Spot (spotlight imaging mode), Site (sliding spotlight imaging mode), Strip (stripmap imaging mode)
24. SAR Imagery Product Scenes
VERY HIGH RESOLUTION: Spot | 5 km x 5 km | 0.5 m
VERY HIGH RESOLUTION: Site | 5 km x 10 km | 1.0 m
HIGH RESOLUTION: Strip | 5 km x 20 km | 1.2 m
26. Capella Console
Simple-to-Use GUI
Task or purchase archived imagery via coordinates, AOI creation tool or shapefile upload.
Fully automated and secure operations: satellite ops, SAR processing and data storage are cloud based, fully confidential.
Real-Time Status Updates
New tasking is scheduled in ≤ 15 minutes, and users are provided real-time status updates.
Predicted time of collection displayed to enable timely post-imaging operations.
27. Capella API Integration
● Tip-and-cue scenario for immediate responsiveness via API integration. Existing system alerts can push task requests (see the sketch below).
● React to emergencies in real time. Deliver data to teams on the ground hours after image capture.
Workflow: Queue from Existing Systems → Task the Capella Constellation → Pull Scenes & Metadata
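To make the tip-and-cue flow concrete, below is a schematic Python sketch of an alert handler pushing a tasking request over HTTPS. The endpoint URL, payload fields, and token handling are placeholders invented for illustration, not Capella's documented API; consult the Capella API documentation for the real interface.

```python
import requests

def cue_collection(alert, token):
    # Placeholder payload fields describing where and when to image.
    payload = {
        "geometry": {"type": "Point", "coordinates": [alert["lon"], alert["lat"]]},
        "collectionMode": "spotlight",
        "windowOpen": alert["time"],
    }
    resp = requests.post(
        "https://api.example.com/tasking/requests",   # placeholder URL, not Capella's
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # poll this request for real-time status updates
```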