2023 GEOINT Tutorial - Synthetic Data Tools for Computer Vision-Based AI -

  1. Synthetic Data Tools for Computer Vision-Based AI Chris Andrews COO, Dan Hedges Lead Solution Architect,
  2. Presenters • Chris Andrews: COO & Head of Product • 25 years experience in commercial and government geospatial-related products and technologies • 3D, enterprise integration, BIM-GIS, defense-related apps & solutions • 15 years experience with product development at companies including Esri, IBM, and Autodesk • Dan Hedges: Lead Solution Engineer • 11+ years experience building geospatial solutions for industry verticals including urban planning, local government, federal government • Subject matter expertise in remote sensing, 3D data, and feature extraction
  3.’s cloud-hosted platform for synthetic data enables customers to overcome the costs and challenges of acquiring and using real data for training and validating computer vision ML and AI systems and algorithms • Established in 2019 in Bellevue, WA • Inclusive subscription encompasses 2D & 3D content creation, simulation design, data generation • Rapid setup and configuration for shortest path to synthetic data generation for multiple applications • Available on the AWS Marketplace Synthetic Data experts with experience in: • Remote sensing – Satellite, Aerial • Ground-based imagery & video • Non-visible EM spectra • 2D and 3D modeling and simulation • GAN training and dataset post processing • Dataset comparison and validation The Platform as a Service for Synthetic Data Partnering with Member
  4. The AI Data Problem • COST & TIME: Real data is expensive and time consuming to acquire and label • BIAS: Rare objects and scenarios are hard to capture • INNOVATION: Without data it is impossible to explore new sensors and data types • PRIVACY/SECURITY: Real data can have security or high-risk information concerns that limit usage
  5. Intro to Synthetic Data
  6. Synthetic Data solves the AI data problem • is a PaaS and developer framework for synthetic data • Synthetic Data is engineered data that AI interprets as real data • 60% of data used for AI and data analytics projects will be synthetic, and by 2030, synthetic data will have completely overtaken real data in AI models. - Gartner, September 2021 • Imagine if it were possible to produce infinite amounts of the world’s most valuable resource, cheaply and quickly… This is a reality today. It is called synthetic data. - Forbes, July 2022
  7. What do we mean by Synthetic Data? • Synthetic data can be created for any type of data used to train or validate AI/ML systems, even for sensors or systems that don’t exist • CV-based synthetic data simulates bitmap sensor data capture whether from sensors, recorded spatial patterns, or other CV input content • Physics-based synthetic data includes creation of 2D/3D/4D output based on ‘digital twins’ of physical sensors, the sensor platform, and the scene in which the sensor would operate • can be used to generate any kind of synthetic data • Initial focus has been on physics-based synthetic data generation for CV workflows • RGB imagery and video, RGB microscopy, IR imagery, X-ray, SAR,… • Source: Wikipedia
  8. Today’s AI workflow relies on finding or acquiring data • Acquire or find data: expensive, unpredictable data acquisition costs • Train algorithm: difficulty training algorithms on inconsistent data • Test algorithm: testing requires reuse of real datasets • Accept/Reject result: results are limited to what can be achieved with real datasets
  9. Tomorrow’s AI workflow incorporates synthetic data • Steps: Create data, Train algorithm, Test algorithm, Compare datasets • Inexpensive, unlimited data generation • 100% accurate labeling, consistent data • Real datasets used for comparison and post processing • Data can be designed for edge or impossible cases and for removing bias
  10. Hypothetical workflow: Simulator • Synthetic Dataset • AI Model. Real-world workflow (diagram labels): Synthetic Data Engineers • Data Scientist • Platform Automation • Asset Acquisition / Integration • World building and procedural gen • Simulator • Managed Compute • Dataset and metadata • Post processing / Domain adaptation • Quality assessment • AI Model • Improved and explainable outcomes
  11. Synthetic data generation steps • 1. Scenario characterization - Data output, variability, problem(s) addressed or tested • 2. World building - Asset and scene content composition and aggregation • 3. Sensor modeling & simulation - Rendering, visual effects, environmental effects • 4. Annotation & mask calculation • 5. Job execution & dataset compilation • 6. Annotation mapping • 7. Domain Adaptation post-processing • 8. Dataset characterization and comparison
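A minimal sketch of how the first few steps above could fit together, assuming a toy setup: the `Scenario` fields, `build_world`, and `generate_dataset` are hypothetical names invented for illustration and are not part of any platform or SDK mentioned in the slides. Sensor simulation is left as a comment; the point is only that annotations fall out of the scene description itself, so no manual labeling pass is needed.

```python
import random
from dataclasses import dataclass


@dataclass
class Scenario:
    """Step 1, scenario characterization (all fields are illustrative)."""
    image_size: int = 256
    object_classes: tuple = ("crane", "crane_truck")
    objects_per_image: int = 3
    num_images: int = 4
    seed: int = 0


def build_world(scenario, rng):
    """Step 2, world building: place square objects at random positions."""
    placements = []
    for _ in range(scenario.objects_per_image):
        size = rng.randint(8, 32)
        x = rng.randint(0, scenario.image_size - size)
        y = rng.randint(0, scenario.image_size - size)
        placements.append({
            "label": rng.choice(scenario.object_classes),
            "bbox": (x, y, x + size, y + size),  # known exactly from placement
        })
    return placements


def generate_dataset(scenario):
    """Steps 4-5: annotations come from the scene; records are compiled."""
    rng = random.Random(scenario.seed)  # seeded so a job is reproducible
    dataset = []
    for image_id in range(scenario.num_images):
        world = build_world(scenario, rng)
        # Step 3 (sensor modeling and rendering) would run here in a real
        # pipeline; this sketch only keeps the scene description.
        dataset.append({"image_id": image_id, "annotations": world})
    return dataset


dataset = generate_dataset(Scenario())
```

Seeding the random generator is a design choice for the sketch: rerunning the same scenario reproduces the same dataset, which makes later dataset comparisons meaningful.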
  12. New AI job: Synthetic Data Engineer • If most data used to train AI will be synthetic… …who will be engineering the data? • Design & engineer datasets to achieve specific AI outcomes • Software development-oriented • Python, data science, 3D, game engines • Domain or industry expertise • Expert in specific data types & technologies • Sensors, Renderers, Modeling, Simulation
  13. What about Generative AI? Physics-based synthetic data • Starts from a 3D simulation • Can add wide variation including absurd, unnatural, or extremely rare phenomena • Can generate multiple ‘maps’ for depth, instances, surfaces, normals, motion • Can generate fully pixel-labeled content • Can incorporate accurate physics-based models for imagery generation Generative AI (2023) • Starts with large, known datasets • Can add variation, but only by adding more training data • Cannot generate extra maps with information in the scene • Cannot label at the pixel level • Does not incorporate physics-based models Generative AI is moving fast, and we see it as another tool for both content generation and post processing or consuming other synthetic data
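To illustrate what "fully pixel-labeled content" buys you, here is a small sketch, assuming a toy renderer that paints axis-aligned rectangles into an instance-ID mask. The functions `render_instance_mask` and `boxes_from_mask` are hypothetical helpers written for this example, not part of any tool named in the slides. Because the simulator knows which object produced each pixel, bounding boxes of the visible part of each object can be derived directly from the mask.

```python
def render_instance_mask(width, height, objects):
    """Paint rectangles into an instance-ID mask (0 means background).

    objects: list of (instance_id, x0, y0, x1, y1), half-open on x1 and y1.
    Later objects overwrite earlier ones, which models simple occlusion.
    """
    mask = [[0] * width for _ in range(height)]
    for instance_id, x0, y0, x1, y1 in objects:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = instance_id
    return mask


def boxes_from_mask(mask):
    """Derive inclusive (xmin, ymin, xmax, ymax) boxes of visible pixels."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, instance_id in enumerate(row):
            if instance_id == 0:
                continue
            if instance_id not in boxes:
                boxes[instance_id] = [x, y, x, y]
            else:
                box = boxes[instance_id]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {key: tuple(value) for key, value in boxes.items()}


# Object 2 partially occludes object 1 in the lower-right corner.
mask = render_instance_mask(8, 6, [(1, 1, 1, 4, 3), (2, 3, 2, 6, 5)])
boxes = boxes_from_mask(mask)
```

The same mask could also feed per-pixel class labels or occlusion statistics; the box derivation is just the simplest case.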
  14. New AI job: Prompt Engineer • In the world of Generative AI, someone needs to tell the AI what to produce! • Design & engineer inputs to Generative AI systems to achieve specific outcomes • Narrative-oriented • Good at defining context, describing problems • Domain or industry expertise • Expert in specific data types & technologies • Sensors, Renderers, Modeling, Simulation
  15. Common gaps when introducing customers to synthetic data • Hyper focus on the bounds of found or acquired data only • Most data scientists aren’t sensor experts • Concern about ‘good data’ • Concern about one-off datasets vs. investment in data • Belief that human perception is good enough to judge data quality • Confusion over Generative AI vs. simulation techniques … Note that the biggest hurdle is that customers rarely stop to ask what the ideal dataset would be that would address their business problem!
  16. Synthetic data generation is an empirical process • Identify the problem • Describe the (ideal) data • Generate data • Can I achieve any training? • Refine data generation • Can I improve training?
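The loop on this slide can be sketched as a small driver function. This is only an illustration under stated assumptions: `refine_until_good` and its callback parameters are hypothetical, and the toy callbacks below stand in for real generation, training, and evaluation. In particular, the toy score that grows with dataset size is a placeholder, not a claim about how training behaves.

```python
def refine_until_good(generate, train_and_score, refine, params, target,
                      max_rounds=5):
    """Generate data, train, check the score, refine, and repeat.

    generate(params) -> dataset
    train_and_score(dataset) -> float, higher is better
    refine(params, score) -> new params for the next round
    """
    history = []
    for _ in range(max_rounds):
        dataset = generate(params)
        score = train_and_score(dataset)
        history.append((dict(params), score))
        if score >= target:  # training is good enough, stop refining
            break
        params = refine(params, score)
    return params, history


# Toy stand-ins so the sketch runs end to end.
def toy_generate(params):
    return list(range(params["num_images"]))


def toy_score(dataset):
    return min(1.0, len(dataset) / 1000)


def toy_refine(params, score):
    return {**params, "num_images": params["num_images"] * 2}


final_params, history = refine_until_good(
    toy_generate, toy_score, toy_refine, {"num_images": 100}, target=0.8
)
```

Keeping the history of parameters and scores is deliberate: it records which change to data generation produced which change in training, which is the empirical evidence the slide is asking for.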
  17. Supporting GEOINT workflows with continuously evolving AI • Model digital sensor • Aggregate & create scene content • Create Channel configuration • Publish to • Add Channel to Workspace • Create & configure Graphs • Run Jobs • Channel development (GIS Developer, Database Engineer, Synthetic Data Engineer) • Train and Evaluate AI • Datasets • Graph configuration and job execution (GIS Analyst, Computer Vision Engineers, Data Scientists & Automated Workflows) • Change graph configuration • Add/update sensor configuration, scene content, scene configuration • Annotation, Images, Masks, Statistics • GIS tools, Data Science toolkits, Embedded AI tools
  18. Scene config & generation Render simulation Post-processing Dataset Packaging Sensor configuration Platform configuration Sensor simulator Sensor simulator Environment effects Objects of interest Animated people & animals Geospatial services World construction Scenario composition Content distribution Environmental conditions Filters Architecture of Synthetic Data Channels Cloud-hosted PaaS (COTS) Job manager User management & roles Archive & search Content volumes Remote access APIs Characterization (UMAP) Annotation microservices Images Masks Statistics Annotation CycleGAN microservices Annotation Channels become open-source examples for users to build upon Textures & shaders
  19. Don’t rebuild everything for every AI application • Applications: Remote Sensing, Supply Chain, Object detection, Automotive, Economic monitoring, Medical Imaging, Security, … • Sensors: Radar Imagery, RGB Camera, Panchromatic, Infrared, High-Definition Radar, Microscopy, X-Ray/CT Scan, MRI, … • Reusable modular architecture in the cloud • Content pipelines • Sensor models • Analytics toolsets • AI integrations • Enabling access to synthetic data as an enterprise capability
  20. Channel Development | Blender Content Code: SATRDEMO - Dependencies installed: - Blender and Python (versions harmonized), OpenCV, GPU drivers, Ana, Anatools SDK - Can Edit and Deploy Channels with SDK - Offered as AMI or from git with .devcontainer for VS Code - ArcGIS integration for 2D raster backgrounds Custom Code Available now
  21. Case study slide: EO scenarios • Searching for cranes and crane trucks as an economic indicator in satellite imagery • Objects are rare relative to other features in overhead imagery, which means very large labelling campaigns are needed to collect examples. The original dataset had only ~100 examples of each class. • Objects are difficult to label. Inconsistent sizing of crane bounding boxes and similarities between crane trucks and cement pumps were two notable challenges in the real datasets. • Synthetic and real datasets • 2-3x improvement in AP scores over peak performance without synthetic data
  22. Channel Development | DIRSIG™ Content Code: DIRSIGDEMO - DIRSIG accessed through Python and web interface - Can Edit and Deploy Channels with SDK - No RIT DIRSIG training required! Custom Code Available now
  23. Example Applications: Hyper-spectral Imaging, Multi-spectral Imaging Unique relationship with RIT allows to package DIRSIG in synthetic data channels for customers MSI, HSI, other radiometrically complex imagery output Validation possible with calibration panels, 3rd party consulting Pixel-level geospatial accuracy Geospatially accurate, high resolution scene content used in cloud-based generation for very large datasets RGB bands from MSI, HSI images created with DIRSIG and
  24. Channel Development | Omniverse Available on request • Preinstalled dependencies: • USD, Python, OpenCV, GPU drivers, Ana, Anatools SDK • Edit and Deploy Channels with SDK • Offered as AMI or from git with .devcontainer for VS Code Custom Code
  25. Example Applications: Omniverse Replicator channel • Use industry-leading 3D toolkit in the cloud • Configurable in a web-based SaaS experience • Starting place for users who may already have some experience or investment in NVIDIA tools • Familiar architecture that extends to multiple use cases • Synthetic imagery chips generated with Omniverse Replicator running inside on AWS
  26. Example Application: Synthetic Aperture Radar Enterprise & Developer Subscription Customers Experimental, cutting-edge Synthetic Aperture Radar simulation built by SAR output is not human-readable, making human labeling impossible Emerging commercial SAR industry seeking better tools for exploitation, value creation Applications in defense, disaster response, Earth observation & monitoring, insurtech Synthetic SAR images generated using Identical object shown with several image capture scenarios
  27. Example Application: Marine Imagery • Enterprise tier customer • Vessel detection in open ocean scenarios for defense and contraband interdiction • Supporting edge-based, onboard object detection systems • Variable weather, wave, obstruction characteristics • Variable object placement generators • Synthetic RGB images simulating marine UAV imagery capture
  28. Satellite Visible Synthetic IR (MWIR) Synthetic SAR Over 1.2TB of synthetic images produced with channel coverage growing Security Imaging FLIR Camera Examples of synthetic CV content X-Ray and CT scans Urban & natural environments Industrial and residential settings
  29. And after you have your imagery… compare it! • Creating datasets is a starting point • Training and Validation are next • Compare datasets to explore similarity • Real-synthetic, synthetic-synthetic • Use tools such as UMAP, FID • Use inference to change SDG • Try again! • UMAP analysis enables data scientists to explore similarities and differences in the parameter embedding space of multiple datasets
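As a rough illustration of the dataset comparison idea, the sketch below computes a simplified Fréchet distance between two sets of feature vectors. It is a deliberate simplification and an assumption of this example: real FID uses features from an Inception network and full covariance matrices, while this version treats each feature dimension as an independent Gaussian (diagonal covariance), for which the distance reduces to a sum of squared differences in means and standard deviations. The function name and the toy feature values are hypothetical.

```python
from statistics import fmean, pstdev


def diagonal_frechet_distance(features_a, features_b):
    """Fréchet distance between two feature sets, assuming independent
    Gaussian dimensions: sum of (mean_a - mean_b)^2 + (std_a - std_b)^2.

    features_a, features_b: lists of equal-length numeric feature vectors.
    """
    dims = len(features_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [row[d] for row in features_a]
        col_b = [row[d] for row in features_b]
        total += (fmean(col_a) - fmean(col_b)) ** 2
        total += (pstdev(col_a) - pstdev(col_b)) ** 2
    return total


# Toy feature vectors; in practice these would come from a feature extractor.
real = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
synthetic_close = [[0.1, 1.1], [1.1, 3.1], [2.1, 5.1]]
synthetic_far = [[5.0, 9.0], [6.0, 9.5], [7.0, 10.0]]

close_score = diagonal_frechet_distance(real, synthetic_close)
far_score = diagonal_frechet_distance(real, synthetic_far)
```

A lower score means the two feature distributions are more alike, so the score can be tracked across rounds of synthetic data generation for both real-synthetic and synthetic-synthetic comparisons.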
  30. Demo
  31. Typical challenges adopting synthetic data Internal • Past experience with cost or failure of one-off synthetic data experiments • Unprepared for experimentation • Effort to achieve acceptable level of realism • Complexity/difficulty with physics-based modeling External • Information about emerging tools • TCO of yet-another-IT project • Talent shortage • Lack of benchmarks/standards ─ Need for analytic tools ─ Need for sensitivity analysis • Lack of industry collaboration
  32. Opportunity of Synthetic Data • Supplement real data • Evaluate and remove bias • Reduce expensive dataset labeling and reacquisition • Explore scenarios • Simulate sensor models and collection techniques • Create novel data with zero PII or security concerns
  33. Synthetic data as a Standard Synthetic data is rapidly moving from uncertain value to required tool. Synthetic data has the opportunity to be used as part of regulatory and ethical frameworks around bias reduction, demonstrable sensitivity analysis, and reducing the need for human curation of training data. Regulatory & compliance • Bias reduction and testing • Sensitivity analysis • Efficacy demonstration • Removing human-in-the-loop from ethical/harmful scenarios
  34. Synthetic data as an enabler for innovation As synthetic data generation capabilities improve and become more accessible, users will have expanded opportunity to experiment, innovate, and build AI without expensive or impossible real sensor dataset collection. Innovation • Complex sensor fusion • New & hard-to-acquire sensors • New dataset combinations • Digital Twins
  35. Synthetic data driving sustainability Synthetic data is 100% reliably labeled, has been shown to reduce the size of training datasets, and potentially reduces the need for real sensor-based data collection. Cost and impact • Reducing labeling costs • Reducing collection costs • Reducing environmental footprint of real sensor data collection • Enabling innovation without physical material consumption/investment
  36. Wrap up • For slides and supporting content: • Try it at:
  37. Thank you