Bridging the Completeness of Big Data on Databricks


Data completeness is key for building any machine learning or deep learning model. In reality, outliers and nulls are widespread in the data. Traditional methods that fill nulls with fixed values or statistical metrics (min, max and mean) do not consider the relationships and patterns within the data; most of the time they offer poor accuracy and can introduce additional outliers. Also, given our large data size, the computation is extremely time-consuming and is often constrained by the limited resources of a local computer.
To address those issues, we have developed a new approach that first leverages the similarity between data points, based on the nature of the data source, and then uses a collaborative AI model to fill null values and correct outliers.
In this talk, we will walk through how we use a distributed framework to partition data by KDB tree for neighbor discovery, and a collaborative filtering AI technique to fill missing values and correct outliers. In addition, we will demonstrate how we rely on Delta Lake and MLflow for data and model management.

Bridging the Completeness of Big Data on Databricks

  1. 1. Bridging the Completeness of Big Data on Databricks. Yanyan Wu, VP of Data, Wood Mackenzie, Verisk; Chao Yang, Director of Data, Wood Mackenzie, Verisk
  2. 2. Acknowledgement • This work is based on US patent application number 63/142,551, filed on 01/28/2021 • Thanks to the co-inventors for their support: Bernard Ajiboye, Hugh Hopewell, Rhodri Thomas @Wood Mackenzie (Verisk)
  3. 3. Agenda • Introduction & use cases • Limitation of existing approaches • Null filling processes • Similarity discovery • Collaborative AI • AI model management • Our application • Application tips
  4. 4. Why? The importance of data completeness • Null values exist in almost all data sets • Limited or key data can't be thrown away just because nulls exist in some attributes • Machine learning models do not work well with null values
  5. 5. Our Data Platform for Clients - LENS: energy data powerhouse augmented by a world-class platform, with data directly integrated into clients' systems • Upstream conventional Oil & Gas: discover, model and value upstream data worldwide • Unconventional Oil & Gas: operational analysis for improved business performance • Subsurface: analytics-ready, global subsurface data to optimise your resource portfolios with confidence • Power & Renewables: navigate the energy transition by connecting the dots across the electricity value chain
  6. 6. Issues with existing null-filling methods • Low accuracy for backward or forward filling, or for filling with fixed values or statistical metrics (min, max, mean) • Time-consuming when using machine learning or regression methods • Isolated: does not take other attributes into account when filling nulls for one attribute • We need a new method that can fill nulls with better speed & accuracy
  7. 7. Lens Data Platform: built with Spark on Databricks (Unified Platform), using Apache Sedona, Spark MLlib, MLflow and Parquet files on AWS S3. Input: Parquet data files with null values in S3; output: enriched data with high completeness. • 01 Neighbor discovery: 1. Spatial RDD partitioned by KDB tree 2. Distance-based spatial join 3. Replace null values with neighbor information 4. Save data in Delta Lake • 02 Collaborative AI model: 1. Label encoding 2. Remove noise 3. Bin to create userID groups 4. Reformat for ALS model • 03 AI model management: 1. Use ML pipeline & cross validation 2. Save model hyperparameters with MLflow 3. Set model to production stage
  8. 8. 01. Neighbor discovery: distributed spatial data partitioning on Spark • Discover the neighbors of every entity (oil well) within a defined limit • Challenges: large data size, long compute time, limited compute power on a single machine • Apache Sedona: a distributed framework for processing large-scale spatial data • KDB-tree: a geometrical approach that successively divides the data in an n-dimensional space; its tree structure enables fast query processing
  9. 9. 01. Neighbor discovery: distributed spatial data partitioning on Spark • Import libraries • Set up Spark context • Load data • Create geometry object column
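A minimal PySpark sketch of this setup step, assuming Apache Sedona is installed on the cluster; the S3 path and the longitude/latitude column names are illustrative, not from the talk:

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

# Spark session with the Kryo serializer Sedona expects, then register Sedona SQL functions
spark = (
    SparkSession.builder
    .appName("neighbor-discovery")
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)

# Load the well data from Parquet on S3 (path is a placeholder)
wells = spark.read.parquet("s3://my-bucket/wells/")
wells.createOrReplaceTempView("wells")

# Build a point geometry column from assumed longitude/latitude columns
wells_geom = spark.sql("""
    SELECT *,
           ST_Point(CAST(longitude AS DOUBLE), CAST(latitude AS DOUBLE)) AS geometry
    FROM wells
""")
```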
  10. 10. 01. Neighbor discovery: distributed spatial data partitioning on Spark • Create Spatial RDD • Create Circle RDD with defined range • Partition data by KDB tree • Distance join • Convert to DataFrame
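A sketch of the partition-and-join step with Sedona's core RDD API, continuing from the DataFrame above; the search radius and the Delta output path are assumptions for illustration:

```python
from sedona.core.SpatialRDD import CircleRDD
from sedona.core.enums import GridType
from sedona.core.spatialOperator import JoinQuery
from sedona.utils.adapter import Adapter

# DataFrame with a geometry column -> SpatialRDD
wells_rdd = Adapter.toSpatialRdd(wells_geom, "geometry")
wells_rdd.analyze()

# Circle RDD: a search circle of a defined radius around every well
search_radius = 0.05  # assumed radius, in the units of the data's coordinate system
circles_rdd = CircleRDD(wells_rdd, search_radius)
circles_rdd.analyze()

# Partition both RDDs with the KDB tree so candidate neighbors land in the same partition
circles_rdd.spatialPartitioning(GridType.KDBTREE)
wells_rdd.spatialPartitioning(circles_rdd.getPartitioner())

# Distance-based spatial join: one row per (well, neighbor-within-radius) pair
using_index = False
consider_boundary = True
pairs_rdd = JoinQuery.DistanceJoinQueryFlat(wells_rdd, circles_rdd, using_index, consider_boundary)

# Back to a DataFrame, then persist the neighbor pairs to Delta Lake for the filling step
neighbors_df = Adapter.toDf(pairs_rdd, spark)
neighbors_df.write.format("delta").mode("overwrite").save("/delta/well_neighbors")
```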
  11. 11. 01. Neighbor discovery Distributed spatial data partitioning on Spark
  12. 12. 02. Collaborative AI: leverage Spark MLlib • Similar to the popular methods used for movie recommendation • Leverage the ALS (Alternating Least Squares) model from Spark MLlib • Code example: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html • Mapping: userID = each object or each object group (better to use groups due to noise in the data); item = an attribute of the object; rating = the attribute's value
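With that mapping, the filling step reduces to a standard MLlib ALS fit. A minimal sketch with the DataFrame-based API; the column names (well_group_id, attribute_id, attribute_value) are illustrative stand-ins for the userID/item/rating roles above, and train_df / missing_pairs_df are assumed to already be in the long (user, item, rating) layout (that reshaping is sketched under slide 14):

```python
from pyspark.ml.recommendation import ALS

# One row per observed (well group, attribute, value) triple; nulls are simply absent rows
als = ALS(
    userCol="well_group_id",      # userID: the binned object group
    itemCol="attribute_id",       # item: the encoded attribute name
    ratingCol="attribute_value",  # rating: the attribute's value
    nonnegative=True,             # physical attributes such as depths/lengths are non-negative
    coldStartStrategy="drop",     # skip predictions for groups/attributes unseen in training
)
model = als.fit(train_df)

# Predicted "ratings" for the missing (group, attribute) pairs become the filled values
predictions = model.transform(missing_pairs_df)
```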
  13. 13. 02. Collaborative AI Spark MLlib: ALS and Pipeline • ML pipeline • Grid search • Cross Validation
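A sketch of wrapping the ALS estimator in an ML Pipeline and tuning it with a parameter grid and cross-validation; the grid values and fold count are placeholders, not the tuned values from the talk:

```python
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

pipeline = Pipeline(stages=[als])  # the ALS estimator from the previous sketch

# Grid search over a few ALS hyperparameters (placeholder values)
param_grid = (
    ParamGridBuilder()
    .addGrid(als.rank, [10, 20, 50])
    .addGrid(als.regParam, [0.01, 0.1, 1.0])
    .build()
)

# Evaluate predicted attribute values against the held-out observed values
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="attribute_value",
                                predictionCol="prediction")

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

cv_model = cv.fit(train_df)
best_model = cv_model.bestModel  # the winning PipelineModel
```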
  14. 14. 02. Collaborative AI Transform data to fit the format required by ALS
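One way this reshaping could look, as a sketch only: assume a wide DataFrame wells_df with one row per well, one column per numeric attribute, and a well_group column produced by the binning step; the attribute list shown is an illustrative subset:

```python
from pyspark.ml.feature import StringIndexer

attribute_cols = ["vertical_depth", "lateral_length"]  # illustrative subset of the >20 attributes

# Melt the wide attribute columns into long (group, attribute name, value) rows, dropping nulls
stack_expr = ", ".join(f"'{c}', {c}" for c in attribute_cols)
long_df = wells_df.selectExpr(
    "well_group",
    f"stack({len(attribute_cols)}, {stack_expr}) AS (attribute_name, attribute_value)",
).dropna(subset=["attribute_value"])

# Label-encode the group and attribute names into the integer ids ALS expects
for in_col, out_col in [("well_group", "well_group_id"), ("attribute_name", "attribute_id")]:
    long_df = StringIndexer(inputCol=in_col, outputCol=out_col).fit(long_df).transform(long_df)

# 80/20 split for training and testing, as in the results slide
train_df, test_df = long_df.randomSplit([0.8, 0.2], seed=42)
```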
  15. 15. 03. AI model management Use MLflow to manage model revisions/stages
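A sketch of logging the tuned model and promoting it, using standard MLflow tracking and Model Registry calls; the run name, registered model name and logged parameters are illustrative:

```python
import mlflow
import mlflow.spark
from mlflow.tracking import MlflowClient

best_als = cv_model.bestModel.stages[-1]  # the fitted ALSModel from the pipeline sketch

with mlflow.start_run(run_name="als_null_filling"):
    # Record the winning hyperparameters and the held-out error
    mlflow.log_param("rank", best_als.rank)
    mlflow.log_metric("rmse", evaluator.evaluate(cv_model.transform(test_df)))
    # Log the Spark model and register it so revisions can be versioned
    mlflow.spark.log_model(cv_model.bestModel, "als_null_filler",
                           registered_model_name="well_attribute_als")

# Promote the newly registered version to the Production stage
client = MlflowClient()
latest = client.get_latest_versions("well_attribute_als", stages=["None"])[0]
client.transition_model_version_stage(name="well_attribute_als",
                                      version=latest.version,
                                      stage="Production")
```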
  16. 16. Our Application Result (on Oil & Gas unconventional well data) • 314,000+ well objects with >20 attributes containing missing values • Neighbor discovery: <10 mins to generate 144,000,000+ neighbor combinations • Fill nulls with similarity: null reduction of vertical_depth 36% -> 9.5% and lateral_length 46% -> 14% • Fill nulls with collaborative AI: 3.7 million training records (80% training, 20% testing); took 5 minutes to train with grid search and cross validation on Databricks; null reduction to 0%; error is 7% to 18% for key attributes
  17. 17. Tips for Applications: attention to detail • Remove outliers in the training data for the AI model • No need to normalize the values • Form object userID groups to deal with the noise in the data for the AI model • More attributes and more data lead to higher accuracy • Accuracy is higher for non-derived attributes (i.e., for attributes with less noise)
  18. 18. Thank you! Questions?
  19. 19. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
