Text-Guided Well Log-Constrained High-Fidelity Subsurface Model Generation via Stable Diffusion
Oleg Ovcharenko 1, Vladimir Kazei 2, Weichang Li 2, Issam Said 1
1 – NVIDIA, Dubai, UAE; 2 – Aramco Americas, Houston Research Center
Generation of random subsurface models from a text description
"Anticlinal structure with potential reservoir below a salt dome"
Why do we need synthetic data?
• Train deep learning (DL) models
• Create benchmarks
Problem
Solution
• Latent diffusion
• Low-Rank Adaptation
Method
• Data generation
• Training
• Results
Conclusions
Problem
[Diagram: subsurface models, synthetic data, inverted data, well-logs, metadata, text, solver, training]
A text description of the desired model distribution is missing from this workflow
Solution
Need to merge knowledge from different modalities: Text → Image
Leverage a diffusion model for conditional image generation
Example generated with OpenAI DALL-E
[Diagram: Text → Subsurface models]
Getting Lost in the Noise
Forward Diffusion
[Images: sample progressively noised from t = 0 to t = 10]
$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right)$
Step 1:
• Generate noise from a standard normal distribution
• Multiply the result by $\sqrt{\beta_t}$
Step 2:
• Multiply the image at the previous step by $\sqrt{1 - \beta_t}$
• Add it to the result from Step 1
Code:
noise = torch.randn_like(x_t)
x_t = torch.sqrt(1 - B[t]) * x_t + torch.sqrt(B[t]) * noise
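A minimal runnable sketch of the full forward-diffusion loop built from the step above; the step count, linear beta schedule, and image shape are illustrative assumptions, not the exact values used in this work.

import torch

T = 10                                  # number of diffusion steps (illustrative)
B = torch.linspace(1e-4, 0.2, T)        # assumed linear variance schedule beta_t
x_t = torch.rand(1, 1, 64, 64)          # assumed single-channel image in [0, 1]

for t in range(T):
    noise = torch.randn_like(x_t)                                 # Step 1: sample standard normal noise
    x_t = torch.sqrt(1 - B[t]) * x_t + torch.sqrt(B[t]) * noise   # Step 2: apply q(x_t | x_{t-1})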
Getting Lost in the Noise
Reverse Diffusion
[Images: sample progressively denoised from t = 10 back to t = 0]
$\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)$
$\epsilon_t$ is the noise added at time $t$: this is what the neural network will estimate
$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}(\mathbf{x}_t, \mathbf{x}_0),\ \beta_t \mathbf{I}\right)$
$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \Sigma_\theta(\mathbf{x}_t, t)\right)$
[Diagram: forward chain $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ and learned reverse chain $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ for $t = 1, \dots, 10$]
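A minimal sketch of one reverse (denoising) step assembled from the formulas above; `model`, the schedule values, and the shapes are assumptions for illustration, not the network trained in this work.

import torch

T = 10
beta = torch.linspace(1e-4, 0.2, T)       # assumed variance schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)   # cumulative product alpha-bar_t

def reverse_step(model, x_t, t):
    # The network predicts the noise eps_t that was added between t-1 and t
    eps = model(x_t, t)
    mu = (x_t - (1 - alpha[t]) / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])
    if t > 0:
        return mu + torch.sqrt(beta[t]) * torch.randn_like(x_t)   # sample from p_theta(x_{t-1} | x_t)
    return mu                                                     # last step: return the mean estimate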
Model Training
Training Data
$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)$ (the "skip ahead" function)
[Images: noisy image at time t and the noise added at time t, for t = 5, 25, 100]
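A sketch of the "skip ahead" function used to build training pairs (noisy image, added noise) directly from a clean image; the step count and schedule are illustrative assumptions.

import torch

T = 100
beta = torch.linspace(1e-4, 0.02, T)            # assumed variance schedule
alpha_bar = torch.cumprod(1.0 - beta, dim=0)    # cumulative product alpha-bar_t

def skip_ahead(x_0, t):
    # Sample x_t ~ q(x_t | x_0) in one shot and return the (input, target) training pair
    noise = torch.randn_like(x_0)
    x_t = torch.sqrt(alpha_bar[t]) * x_0 + torch.sqrt(1 - alpha_bar[t]) * noise
    return x_t, noise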
It's About Time
Adding Time
[Diagram: U-Net with encoder blocks Down0, Down1, Down2, a latent vector, and decoder blocks Up0, Up1, Up2; "Copy" skip connections link encoder and decoder feature maps, and time embeddings (Time Embed 1, Time Embed 2) are added to the feature maps]
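A minimal sketch of injecting a timestep embedding into a U-Net feature map; the shapes and the simple additive projection are assumptions for illustration (the actual Stable Diffusion U-Net also uses cross-attention).

import torch
import torch.nn as nn

class TimeEmbedBlock(nn.Module):
    def __init__(self, t_dim, channels):
        super().__init__()
        self.proj = nn.Linear(t_dim, channels)      # project time embedding to the channel dimension

    def forward(self, feat, t_emb):
        # feat: (B, C, H, W) feature map; t_emb: (B, t_dim) timestep embedding
        shift = self.proj(t_emb)[:, :, None, None]  # broadcast over spatial dimensions
        return feat + shift                          # add timestep information to the features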
LoRA
https://arxiv.org/pdf/2106.09685
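A minimal LoRA sketch following the paper above: the pretrained weight is frozen and a trainable low-rank update B·A is learned on top of it; the rank and scaling values are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r=4, alpha=4):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)                                        # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))         # low-rank factor B (zero init)
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)       # W x + (alpha / r) * B A x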
Image by Aayush Agrawal
https://towardsdatascience.com/stable-diffusion-using-hugging-face-501d8dbdd8
Make text and image embeddings
• CLIP – Contrastive Language-Image Pre-Training
• VAE – Variational Autoencoder
Image by Aayush Agrawal
https://towardsdatascience.com/stable-diffusion-using-hugging-face-501d8dbdd8
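A hedged sketch of the full workflow with the Hugging Face diffusers library (CLIP encodes the prompt, the U-Net denoises in latent space, the VAE decodes to pixels); the model id, prompt, and file name are illustrative, not the fine-tuned checkpoint from this work.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16   # illustrative base checkpoint
).to("cuda")

prompt = "Anticlinal structure with potential reservoir below a salt dome"
image = pipe(prompt, num_inference_steps=30).images[0]            # text -> latent denoising -> VAE decode
image.save("generated_model.png")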
Dataset creation: fetch public subsurface models
2D: Marmousi II, BP 2004, Sigsbee
3D: SEG Salt model, Overthrust 3D
https://wiki.seg.org/wiki/
Dataset creation: sample crops from base models
Keep the aspect ratio but rescale so that the minimum side is 512 pixels
45 images in total after QC
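A sketch of the rescaling step, assuming a PIL image and placeholder file names.

from PIL import Image

img = Image.open("marmousi_crop.png")                    # placeholder file name
w, h = img.size
scale = 512 / min(w, h)                                  # make the minimum side 512 pixels
img = img.resize((round(w * scale), round(h * scale)))   # aspect ratio preserved
img.save("marmousi_crop_512.png")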
Dataset creation: prompt for image annotation
"Overhanging salt dome with flank collapse and potential
hydrocarbon trap at crest."
"Thrust faulting with folded sedimentary layers and potential
hydrocarbon trap in anticlinal structure."
Dataset creation: prompt for image annotation
Naïve image descriptions are far from ideal; context-aware annotation is the better approach
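A hedged sketch of prompting a vision-language model for context-aware annotations that fit the 77-token CLIP limit; the OpenAI model name, prompt wording, and file handling are assumptions for illustration.

import base64
from openai import OpenAI

client = OpenAI()
with open("marmousi_crop_512.png", "rb") as f:               # placeholder image
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",                                          # assumed vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the geological structures in this velocity model "
                     "in one sentence of at most 60 tokens."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)                   # candidate annotation for the image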
Training takes 5 min on 1 GPU
[Images: generated samples at increasing numbers of training steps]
Condition generation on well-logs and text
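A hedged sketch of the inpainting-style conditioning: the mask keeps the water layer and well-log columns unchanged while the rest is regenerated from text; the model id, array shapes, and column indices are illustrative assumptions.

import torch
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16   # illustrative checkpoint
).to("cuda")

mask = np.full((512, 512), 255, dtype=np.uint8)   # 255 = regenerate, 0 = keep unchanged
mask[:64, :] = 0                                  # keep the water layer
mask[:, 200:204] = 0                              # keep a well-log column
init = np.random.randn(512, 512)                  # Gaussian noise, here simply normalized to image range
init = ((init - init.min()) / (init.max() - init.min()) * 255).astype(np.uint8)

image = pipe(
    prompt="Salt dome intrusion with potential hydrocarbon traps",
    image=Image.fromarray(init).convert("RGB"),
    mask_image=Image.fromarray(mask),
).images[0]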
Results
"Anticline fold layered structure with potential
hydrocarbon traps" (t/l)
"Flat sedimentary layers with angular
unconformities" (t/r)
"Salt dome intrusion with potential hydrocarbon
traps" (b/l)
"Complex layered geological environment” (b/r)
Text- and log-conditioned (left)
Text-conditioned only (right)
How to accelerate?
https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion
https://github.com/NVIDIA/TensorRT/blob/release/9.3/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
Conclusion
Stable diffusion makes it possible to incorporate textual geological descriptions as a source of information into numerical subsurface model building
The iterative nature of stable diffusion is a computational bottleneck
Parameter-efficient fine-tuning (PEFT) methods such as LoRA are suitable for introducing domain knowledge even with limited data sets
Diffusion hyperparameters affect the fidelity of the generated data


Editor's Notes

  • #2 Hi, I am Oleg Ovcharenko of NVIDIA; my co-authors are Vladimir Kazei (Aramco Americas, Houston Research Center), Weichang Li (Aramco Americas, Houston Research Center) and Issam Said (NVIDIA, Dubai, UAE). I am looking forward to presenting our novel approach for generating high-fidelity subsurface velocity models via stable diffusion.
  • #3 Iterative denoising over 30 iterations (iterations 0, 4, 8, 12, 16, 20, 24, and 30 are shown) by a diffusion model fine-tuned with Low-Rank Adaptation (LoRA) on 45 sample images, text-guided as "Anticlinal structure with potential reservoir below a salt dome".
  • #5 Let me introduce the outline. Problem: introduce the challenge related to seismic imaging and highlight the need for improved subsurface velocity models. Solution: present the proposed approach, mentioning the use of latent diffusion and LoRA. Method: describe the steps involved, namely data generation (how data is collected or generated) and training (the model training process). Results: share key findings and improvements achieved, highlighting performance metrics. Conclusions: summarize the study's impact and discuss future directions.
  • #6 In this work, we experiment with a text-conditioned diffusion process by supplying descriptions of the desired geological features of velocity models to generate realistically-textured subsurface models tailored to particular regions and areas. Moreover we show how well log information can be incorporated into the stable diffusion process enabling generated models to be even more realizable for a given area of interest.
  • #7 The problem setup forces us to merge knowledge coming as text and as numerical (typically floating-point) arrays such as logs or seismic data.
  • #8 Here is the q probability distribution. The mean mu of the distribution is the square root of 1 − βt multiplied by the corresponding pixel value at the previous timestep. The variance is βt, hence the name variance schedule. The code looks less intense than the math equations. We use the same function as in the previous lab to generate noise by sampling from a standard normal distribution, then multiply the result by the square root of βt. Next, we multiply the value of each pixel in the image we want to modify by the square root of one minus βt. Finally, we add the modified image and the generated noise to create a new, noisier image for the next timestep. https://arxiv.org/pdf/1503.03585.pdf
  • #9 We can create an estimate for what the average image at timestep t would look like. We know our neural network is decent at recognizing added noise, which is 𝜖 𝑡 in this equation. We will use our network to predict the noise, plug in our values of alpha, and this will help us predict what an image at the previous timestep could look like. Then, we can start from pure noise, and use this function repeatedly to generate an unseen image!
  • #10 Given a noisy image at time t, our model will predict what noise was added to it between t−1 and t. Thanks to some handy properties of the normal distribution, we don't have to generate a noisy image at each timestep to do this. We can use the power of recursion to predict what a noisy image at time t would look like straight from the original clean image at t = 0.
  • #11 This. This is a time embedding. Since the variance schedule changes over time, the model will perform better if it knows which timestep it is removing the noise for. We're going to add this embedding straight into our feature maps. Let's take a moment to break down what that means.
  • #12 LoRA (Low-Rank Adaptation) is a highly efficient method for fine-tuning large language models (LLMs). It involves freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. This significantly reduces the number of trainable parameters for downstream tasks, making it feasible to run specialized LLM models on a single machine. Essentially, LoRA strikes a balance between model quality and computational efficiency, opening up opportunities for LLM development in the broader data science community.
  • #13 The full text-to-image generation workflow involves three main components: a variational autoencoder (VAE), a UNet (or other image-to-image network), and a text encoder. The encoded representation is processed by the UNet, which features cross-attention layers in its structure. The transformer-based Contrastive Language-Image Pre-Training (CLIP) model produces text embeddings which guide the generation of conditioned images by injecting the text encoding into the cross-attention layers of the network. The output latent representation from the UNet is then decoded by the VAE into the pixel space.
  • #14 The VAE first compresses the image into a lower-dimensional latent representation. This representation is then processed by the UNet, which features cross-attention layers in its structure. The transformer-based Contrastive Language-Image Pre-Training (CLIP) model produces text embeddings which guide the generation of conditioned images by injecting the text encoding into the cross-attention layers of the network.
  • #15 Here we selected five public P-wave velocity models as sources of information about the target: Marmousi II (Martin et al., 2006), Overthrust and Salt 3D SEG/EAGE models (Aminzadeh et al., 1996), Sigsbee (Irons, 2007), BP 2004 (Billette and Brandsberg-Dahl, 2005).
  • #16 We extracted 45 patches from these models by randomly sampling subsets equivalent to half-depth from each model (Figure 2). Specifically, we sample 5 images from each 2D model by cropping and 2 additional images cropped from central cross-line and inline projections of 3D models.
  • #17 We then generate annotations for each image utilizing the vision capability of the GPT-4 vision API. The CLIP text encoder used in this work expects input of up to 77 tokens, so the image description generation should be prompted to account for this limitation.
  • #18 Generated annotations are generally meaningful but may not be accurate from a geological point of view. We find these generated descriptions sufficient for a prototype, but alternative ways of obtaining geological descriptions should be considered. For example, using the original model description from the research paper and asking the LLM to build on this description with variations for each extracted subset would yield a decent dataset that requires minimal editing.
  • #19 We build on an open-source diffusion model for text-to-image and image-to-image inpainting pipelines (von Platen et al., 2022). We select four prompts to assess the generation capability of the network: "Anticline fold layered structure with potential hydrocarbon traps", "Flat sedimentary layers with angular unconformities", "Salt dome intrusion with potential hydrocarbon traps", and "Complex layered geological environment".
  • #20 To condition the model generation on well-logs from the Marmousi model we follow the inpainting approach and create a binary mask where water layer and well-logs are set to be unchanged (Figure at the bottom left). The initial model is built as a random Gaussian noise scaled to match selected well-logs (Figure at the bottom right).
  • #21 Here we show some examples with (left) and without (right) log conditioning.
  • #24 Text is a valuable source of geological information We demonstrate feasibility and viability of the approach Iterative nature of stable diffusion is a bottleneck PEFT methods are sufficient to introduce domain knowledge given limited data Selection of diffusion hyperparameters affects fidelity of generated data