Reverse Time Migration via Resilient Distributed Datasets: Towards In-Memory Coherence of Seismic-Reflection Wavefields using Apache Spark

Ultimately, in Reverse Time Seismic Migration (RTM), the coherence between two wavefields is determined across all depth-common gathers (i.e., source-receiver pairings) of seismic-reflection data. Because coherence between the two wavefields minimizes the impact of artifacts in the imaged section (or volume) arising from complex geological structures (e.g., folds, faults, domes, steeply dipping lithological interfaces), seismic-reflection data processed via RTM most accurately depicts all reflectors in their actual locations in space and time (e.g., Zhou, Practical Seismic Data Analysis, Cambridge University Press, 2014).

In the classical approach for RTM, forward modeling involving the three-dimensional wave equation (3D-WEM) results in source wavefields that are computed using the Finite Difference Method (FDM), and then stored to disk. In a subsequent step, and on a per-gather basis, source wavefields are read from disk so that they can be cross-correlated with the backwards-propagated (i.e., time-reversed) wavefields corresponding to the receivers, a step that again requires use of the FDM modeling kernel for the 3D-WEM. The inherent requirement for disk I/O involving multiple-TB volumes of seismic-reflection data, during the application of the imaging condition (i.e., the cross-correlation step), results in a performance penalty well known to be highly problematic throughout the petroleum-exploration industry.
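The imaging condition described above can be stated compactly. In its common zero-lag cross-correlation form (the notation below is assumed for illustration, not taken from the abstract):

```latex
% Zero-lag cross-correlation imaging condition, summed over shots s
% (gathers) and time steps t; S is the forward-propagated source
% wavefield and R the back-propagated receiver wavefield.
I(\mathbf{x}) = \sum_{s} \sum_{t=0}^{T} S_{s}(\mathbf{x}, t)\, R_{s}(\mathbf{x}, t)
```

It is the time sum over both wavefields, per gather, that forces either storing the source wavefield to disk or keeping it resident in memory.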

Over the past decade or so, General Purpose Graphics Processing Units (GPGPUs) have been employed to significantly reduce the processing burden of disk I/O in executing RTM. Broadly speaking, in applying RTM’s imaging condition, algorithms have made effective and efficient use of both the memory hierarchy as well as parallel-processing capabilities inherent in GPGPUs. Despite the progress that has been made, particularly in the implementation of algorithms using CUDA for programming GPGPUs, the computational performance of RTM remains an active area of research that continues to engage academics as well as industry.

The need to cross-correlate two wavefields in the application of RTM’s imaging condition remains one of two fundamental challenges with use of the method in practice (e.g., Liu et al., Computers & Geosciences 59, 17–23, 2013). In a significant departure from previous approaches, this computational challenge is addressed here through the introduction of Resilient Distributed Datasets (RDDs) for RTM’s precomputed source wavefields. RDDs are a relatively recent abstraction for in-memory computing ideally suited to distributed computing environments like clusters (Zaharia et al., NSDI 2012, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). RDDs were originally introduced for Big Data Analytics and popularized (e.g., Lumb, “8 Reasons Apache Spark is So Hot”, insideBIGDATA, http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/, 2015) through the open-source implementation known as Apache Spark.

Published in: Science
  1. Reverse Time Migration via Resilient Distributed Datasets: Towards In-Memory Coherence of Seismic-Reflection Wavefields using Apache Spark. Ian Lumb, HPCS 2015, Montreal. http://hpcs.ca
  2. Outline
     ● The challenges and opportunities of RTM
     ● Refactoring RTM with Spark/RDDs
       o Spark’ing coherence between wavefields
     ● Summary
  3. http://www.acceleware.com/technical-papers
  4. Zhou 2014, Fig. 7.25
  5. Motivation
     ● RTM is performance-challenged
       o Algorithms research remains topical
         ▪ GPUs responsible for compelling results
     ● Revisit RTM as a ‘Big Data problem’
       o In-memory analytics has the potential to
         ▪ Improve performance of data and wavefield manipulations in concert with computations
         ▪ Introduce new prospects for imaging conditions
  6. Key Performance Challenges
     ● RTM modeling kernel is compute intensive
       o Stable, non-dispersive solution via FDM requires
         ▪ Small time steps and small grid intervals
         ▪ Higher-order approximations of the spatial derivatives
     ● RTM wavefields exceed memory capacity
       o Multiple-TB source volumes must be stored to disk
     e.g., Liu et al., Computers & Geosciences 59 (2013) 17–23
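The stability constraint behind "small time steps and small grid intervals" can be made concrete with a minimal 1D constant-velocity acoustic FDM time-stepper. This is an illustrative sketch of the kind of kernel the slide describes, not the presenter's code; all names and parameter values are assumptions.

```python
import numpy as np

# Minimal 1D acoustic FDM time-stepper, 2nd order in time and space.
def fdm_1d(nx=201, nt=400, dx=10.0, c=1500.0):
    # CFL stability condition for this scheme: c * dt / dx <= 1.
    # This is why a stable, non-dispersive solution forces small dt
    # (and, for dispersion control, small dx).
    dt = 0.5 * dx / c                      # safely inside the CFL limit
    r2 = (c * dt / dx) ** 2
    prev = np.zeros(nx)
    curr = np.zeros(nx)
    curr[nx // 2] = 1.0                    # impulsive "source" at the centre
    for _ in range(nt):
        nxt = np.empty(nx)
        # Standard 3-point second spatial derivative on interior points
        nxt[1:-1] = (2 * curr[1:-1] - prev[1:-1]
                     + r2 * (curr[2:] - 2 * curr[1:-1] + curr[:-2]))
        nxt[0] = nxt[-1] = 0.0             # rigid boundaries (no absorbing layer)
        prev, curr = curr, nxt
    return curr

wavefield = fdm_1d()
print(np.isfinite(wavefield).all())        # stable: no blow-up at this dt
```

Doubling dt past the CFL limit makes the same loop diverge, which in 3D translates directly into the compute intensity the slide highlights.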
  7. Resilient Distributed Datasets (RDDs)
     ● Abstraction for in-memory computing
     ● Fault-tolerant, parallel data structures
       o Cluster-ready
     ● Optionally persistent
     ● Can be partitioned for optimal placement
     ● Manipulated via operators
     Zaharia et al., NSDI 2012
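The "manipulated via operators" point is the heart of the RDD model: computations are chains of transformations and a final action. A serial, plain-Python stand-in for such a chain is shown below; in PySpark the same pipeline would read roughly `sc.parallelize(gathers).map(...).filter(...).reduce(...)`, with each gather living in a partition. All names and values here are illustrative, not from the presentation.

```python
from functools import reduce

# Toy "gathers": (id, list of samples) pairs, standing in for records
# that PySpark would distribute across cluster partitions.
gathers = [("gather-%d" % i, list(range(i, i + 4))) for i in range(3)]

# "map": a per-record transformation (here, sum each gather's samples)
mapped = [(gid, sum(samples)) for gid, samples in gathers]

# "filter": keep only records satisfying a predicate
kept = [rec for rec in mapped if rec[1] > 6]

# "reduce": combine the surviving records into a single result
total = reduce(lambda a, b: a + b, (v for _, v in kept), 0)
print(total)  # → 24
```

In Spark the intermediate results (`mapped`, `kept`) would stay in cluster memory, optionally persisted, which is precisely the property the slide's "optionally persistent" bullet refers to.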
  8. RTM via RDDs: Implementation using Spark
     ● Apache Spark is an implementation of RDDs
     ● Make use of HDFS or an alternative FS
       o GPFS, AWS S3, OpenStack Swift, Ceph or Lustre
     ● Choose appropriate programming model(s)
       o Not limited to MapReduce
       o Iterative and/or interactive (including streaming)
     ● Manage Spark workloads
       o Built-in mode or YARN mode, Mesos
       o Univa Universal Resource Broker
     after Lumb, insideBIGDATA http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
  9. RTM via RDDs: Implementation using Spark (2)
     ● Deployable on bare metal … clouds
       o Monitoring/management via Bright Cluster Manager
     ● Introduces analytics possibilities for RTM
       o Program in Java (C/C++ via JNA), Scala or Python
     ● Uptake is significant - rapidly growing community
     ● Results are extremely impressive
       o Exploit CPUs and/or GPUs
     after Lumb, insideBIGDATA http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
  10. RTM via RDDs: Opportunities
     ● Apply RDDs to gathers of seismic data
       o Partition RDDs optimally for wavefield calculations
     ● Apply RDDs to source wavefields
       o Partition RDDs optimally for cross-correlation of forward and reverse time wavefields
         ▪ Significantly reduce/eliminate disk I/O
     ● Investigate alternate imaging conditions
       o Machine-learning and/or graph-analytics algorithms in addition to cross-correlation
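The cross-correlation the opportunity slide targets is straightforward to sketch serially with numpy; in the proposed RDD formulation, each (source, receiver) wavefield pair would occupy one partition and the final stack would be a reduce over partitions. Array names, shapes, and sizes below are toy assumptions for illustration only.

```python
import numpy as np

# Zero-lag cross-correlation imaging condition for one gather:
# I(x) = sum over t of S(x, t) * R(x, t); the full image sums over gathers.
def image_gather(src_wavefield, rcv_wavefield):
    return np.sum(src_wavefield * rcv_wavefield, axis=-1)

rng = np.random.default_rng(0)
nx, nt, ngathers = 8, 16, 4            # toy sizes; real volumes are multiple TB
pairs = [(rng.standard_normal((nx, nt)), rng.standard_normal((nx, nt)))
         for _ in range(ngathers)]

# Stack the per-gather images; with RDDs this loop becomes a map over
# partitions (one wavefield pair each) followed by a reduce, so the
# precomputed source wavefields never round-trip through disk.
image = sum(image_gather(s, r) for s, r in pairs)
print(image.shape)  # (8,)
```

Keeping `pairs` memory-resident across the map and reduce steps is exactly the disk-I/O elimination the slide claims for RDD-partitioned wavefields.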
  11. [Architecture diagram: Spark Workers; Spark (YARN) Master; Spark or YARN]
  12. http://www.informationweek.com/big-data/big-data-analytics/apache-spark-3-promising-use-cases/a/d-id/1319660
  13. http://ipython.org/notebook.html
  14. Thunder: Initial Impressions
     ● Written in Spark's Python API (PySpark)
       o Makes use of scipy, numpy, and scikit-learn
     ● IPython Notebook serves as interactive GUI
       ▪ Runs in a Web browser
       ▪ Notebooks can include text and graphics
       ▪ Secure, remote access to an in-cluster IPython Notebook server
     ● Includes modular functions for time-series analysis
     ● Can interface with C/C++ from Python
     http://thunder-project.org/
  15. Is there a case for migration?
     ● In-memory computing via RDDs is promising
       o Application to gathers and wavefields
     ● Spark provides analytics upside
       o Imaging conditions other than cross-correlation
     ● Spark may be applicable to modeling kernels
     ● Spark can be easily incorporated into pre-existing IT infrastructures
       o Complements existing HPC environments
     http://rice2015oghpc.rice.edu/technical-program/
  16. Summary
     ● Is there a case for migration?
       o From: RTM via HPC
       o To: RTM via Big Data or ( Big Data and HPC )
     ● Does it make sense to refactor other HPC problems as ‘Big Data problems’?
  17. Resilient Distributed Datasets (RDDs)
     ● Abstraction for in-memory computing
     ● Fault-tolerant, parallel data structures
       o Cluster-ready
     ● Optionally persistent
     ● Can be partitioned for optimal placement
     ● Manipulated via operators
     Zaharia et al., NSDI 2012
  18. Refactoring HPC with Spark/RDDs …
     ● Could Spark/RDDs replace MPI?
       o Spark has primitives for distributed in-memory parallel computing … including fault tolerance
  19. Acknowledgements
     ● M. Zaharia et al. for RDDs
     ● Communities responsible for Spark, Python & Thunder
     ● M. Lamarca, P. Labropoulos, D. Shestakov & L. Gibbons at Bright Computing
  20. Questions? Ian Lumb ianlumb@yorku.ca ian.lumb@brightcomputing.com
  21. Resources
     ● RTM's scientific context
     ● Spark support in Bright Cluster Manager for Apache Hadoop
