ADLSg2 management toolkit
aka “OctopuFS”
Jacek Tokar
Lead Data Engineer, Advanced Analytics @ Procter&Gamble
Agenda
Use case and design approach
Warm up – get size of your data
Distributed file copy
Basic file operations (multithreaded)
Managing file ACLs
File delta operations
Metastore functions
Databricks setup requirements
Challenges and learnings
Where to get OctopuFS
A Company of Leading Brands
P&G – main IT hubs
Data/ML Engineering:
• Warsaw, Poland
• Cincinnati, OH, USA
• Guangzhou, China
• San Jose, Costa Rica
Data Science:
• Cincinnati, OH, USA
• Geneva, Switzerland
• Guangzhou, China
We’re hiring!
https://www.pgcareers.com/
Use case and design approach
“I am not lazy! I’m efficient!”
Use case
[Diagram: ADLSgen2 with areas labeled Reporting, Prev and Pre-PROD, and the operations Copy, Move, Delete and ACLs]
▪ Avoid direct use of Storage Account API - use Hadoop FS
Performance
[Bar chart: standard approach vs OctopuFS, axis from 0 to 80, for Copy (1.6TB), Move (folder with 21k files) and Modify ACL (16k paths). Annotations: Spark read/write; dbutils.fs.mv; Killed @70min; Lack of alternatives]
Warm up
I’d like to know the size of my data
Function getSize
▪ Function in com.pg.bigdata.octopufs.fs
▪ Prints size and number of files
▪ Returns FsSizes, which holds all paths with their sizes
▪ Enables drilldown without sending requests to the storage
val sizes = getSize("abfss://dev@myAdls.dfs.core.windows.net/somePath")
Number of files in abfss://dev@myAdls.dfs.core.windows.net/somePath is 21003
Size of abfss://dev@myAdls.dfs.core.windows.net/somePath is 1.58 TB
sizes.getSizeOfPath("abfss://dev@myAdls.dfs.core.windows.net/somePath/myData/myDataset")
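The drilldown needs no further storage requests because the full listing is already held in memory after the first call. A minimal sketch of that idea in plain Scala follows. The names `SizeSketch`, `FileSize` and `SizeIndex`, and the sample file names and sizes, are illustrative assumptions and not the OctopuFS `FsSizes` implementation.

```scala
object SizeSketch {
  // Hypothetical stand-in for one entry of the cached listing.
  final case class FileSize(path: String, bytes: Long)

  // Hypothetical holder of the listing; all lookups below work on the cached data only.
  final case class SizeIndex(files: Seq[FileSize]) {
    // Sum the sizes of all files at or below the given folder.
    def sizeOfPath(folder: String): Long = {
      val prefix = if (folder.endsWith("/")) folder else folder + "/"
      files.filter(f => f.path == folder || f.path.startsWith(prefix)).map(_.bytes).sum
    }
  }

  val root = "abfss://dev@myAdls.dfs.core.windows.net/somePath"

  // Made-up sample listing, as if collected once from the storage.
  val index: SizeIndex = SizeIndex(Seq(
    FileSize(root + "/myData/myDataset/part-0.parquet", 100L),
    FileSize(root + "/myData/myDataset/part-1.parquet", 250L),
    FileSize(root + "/myData/other/part-0.parquet", 50L)
  ))
}
```

Appending the trailing slash before matching keeps a folder such as `myData/my` from accidentally matching `myData/myDataset`.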
File copy
I’d like to back up my data
Distributed copy
▪ Evenly distributed files across tasks
▪ Runs 1 file per task by default
▪ Number of tasks can be customized – may help when there are many small files
▪ Performance depends on network throughput (vs CPU)
▪ Can copy between different filesystems
Leverages spark tasks to perform FileSystem copy operation
Package com.pg.bigdata.octopufs.fs
DistributedExecution.copyFolder(sourceFolderPath, destinationFolderPath)(implicit spark: SparkSession)
Copy operation – listing files – on driver node
DS4_V2 (28GB RAM, 8 cores)
Listing 21k files
21 seconds
Copy operation – cluster load
spark.read.parquet(path).
write.mode("overwrite").parquet(path2)
DistributedExecution.copyFolder(path, path2)
Distributed copy - summary
▪ 3x faster than spark read/write
▪ Uses all worker nodes
▪ Maximizes usage of network throughput
Basic file operations
Promote the data without interruption
Local (multi-threaded) fs operations
▪ Runs on driver node only
▪ movePaths - FileSystem.rename on all provided Paths
▪ moveFolderContent - FileSystem.rename on all descendants
▪ deletePaths – FileSystem.delete on all Paths
▪ deleteFolder – deletes folder or its content only
The Future is here
Package com.pg.bigdata.octopufs.fs.LocalExecution
Paths case class com.pg.bigdata.octopufs.fs.Paths
Local (multi-threaded) fs operations
▪ Default parallelism is 1000
▪ Storage Account limit 20,000 requests/s
▪ Avg request time is ~50ms
▪ Parallelism can be customized by modifying com.pg.bigdata.octopufs.helpers.implicits
▪ Retry built-in (up to 5 attempts)
▪ If the operation fails, the move will resume from where it failed
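The numbers above fit together: roughly 1000 requests in flight at about 50 ms each comes to about 20,000 requests per second, which is the Storage Account limit quoted. The sketch below shows the general pattern of running one operation per path on a bounded thread pool with retries, in plain Scala. `ParallelOpsSketch` and its functions are hypothetical names for illustration and are not the OctopuFS `LocalExecution` code. The `op` argument stands in for a call such as `FileSystem.rename` or `FileSystem.delete`.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

object ParallelOpsSketch {
  // Rough throughput ceiling: concurrent requests divided by average request time.
  def maxRequestsPerSecond(parallelism: Int, avgRequestMillis: Int): Int =
    parallelism * 1000 / avgRequestMillis

  // Try the operation up to maxAttempts times; an exception counts as a failed attempt.
  def withRetry(maxAttempts: Int)(op: () => Boolean): Boolean = {
    var attempt = 0
    var done = false
    while (!done && attempt < maxAttempts) {
      attempt += 1
      done = Try(op()).getOrElse(false)
    }
    done
  }

  // Run op for every path on a fixed-size pool and return the paths that still failed.
  def runAll(paths: Seq[String], parallelism: Int, maxAttempts: Int)(op: String => Boolean): Seq[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val results = paths.map(p => Future(p -> withRetry(maxAttempts)(() => op(p))))
      Await.result(Future.sequence(results), 10.minutes).collect { case (p, false) => p }
    } finally pool.shutdown()
  }
}
```

Returning the failed paths is one possible way to support resuming: a rerun only needs to process what is left. Whether OctopuFS resumes this way is not stated in the slides.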
Local (multi-threaded) fs operations - summary
▪ ∞ faster than dbutils.fs.mv
▪ Does not require cluster to run
▪ Driver VM requirements are low
▪ The only real limitation is Storage Account request throughput
Managing file ACLs
Modify access of loaded files
File ACL modification
com.pg.bigdata.octopufs.acl.AclManager
val acl = AclManager.FsPermission(
"group", "rwx","ACCESS","b###################c")
Grantee type: “user”, “group”
Access details: r(ead), w(rite), (e)x(ecute)
Access type: “ACCESS”, “DEFAULT”
▪ Applying ACL to directory tree
▪ Applies ACCESS type to files and ACCESS+DEFAULT to folders
▪ Takes 19 seconds for 15k files and 1.3k folders
AclManager.modifyFolderTreeAcl(path, acl)
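The rule "ACCESS for files, ACCESS plus DEFAULT for folders" can be illustrated without touching storage. The sketch below only builds POSIX-style ACL entry strings per path. `AclTreeSketch`, `Node`, `Permission` and the grantee id `my-group-id` are hypothetical and do not reproduce `AclManager` or its `FsPermission` class.

```scala
object AclTreeSketch {
  // Hypothetical model of one permission; scope is "ACCESS" or "DEFAULT".
  final case class Permission(scope: String, granteeType: String, granteeId: String, rights: String) {
    // Entry text such as "group:<id>:rwx" or "default:group:<id>:rwx".
    def spec: String = {
      val base = granteeType + ":" + granteeId + ":" + rights
      if (scope == "DEFAULT") "default:" + base else base
    }
  }

  // Hypothetical node of the directory tree.
  final case class Node(path: String, isDirectory: Boolean)

  // Files get the ACCESS entry only; folders also get DEFAULT so new children inherit it.
  def entriesFor(node: Node, granteeType: String, granteeId: String, rights: String): Seq[String] = {
    val access = Permission("ACCESS", granteeType, granteeId, rights).spec
    val default = Permission("DEFAULT", granteeType, granteeId, rights).spec
    if (node.isDirectory) Seq(access, default) else Seq(access)
  }

  // Plan the entries for a whole tree: path -> entries to apply.
  def plan(tree: Seq[Node], granteeType: String, granteeId: String, rights: String): Map[String, Seq[String]] =
    tree.map(n => n.path -> entriesFor(n, granteeType, granteeId, rights)).toMap
}
```

Because each path's entries are independent of the others, a plan like this could be applied in parallel, which would be consistent with the timing quoted above.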
Synchronize ACLs between folder trees
com.pg.bigdata.octopufs.acl.AclManager
AclManager.synchronizeAcls (source, target)
[Diagram: Source (Folder1, Folder2) and Target (Folder1, Folder3, File1), shown before and after synchronization]
AclManager.synchronizeAcls (target, target)
[Diagram: Target (File1, File2, File3), shown before and after synchronization]
Synchronize ACLs between folder trees
com.pg.bigdata.octopufs.acl.AclManager
▪ Gets ACL list from “source” folder tree
▪ Finds corresponding folder/files in “target” folder tree
▪ When a path is matched, copies its ACL over to the “target” path
▪ If a path is not matched, it inherits security from its parent folder in “target”
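The matching steps above can be sketched over in-memory maps. This is a simplified model in plain Scala, not the `AclManager` implementation. `AclSyncSketch` and the sample ACL strings are hypothetical, and the sketch assumes that "inherits from the parent" means taking the ACL already resolved for the parent folder in the target tree.

```scala
object AclSyncSketch {
  // Keys are paths relative to the tree root ("" is the root); values are ACL entry strings.
  type AclTree = Map[String, Seq[String]]

  private def parentOf(relPath: String): String = {
    val i = relPath.lastIndexOf('/')
    if (i < 0) "" else relPath.substring(0, i)
  }

  private def depth(relPath: String): Int =
    if (relPath.isEmpty) 0 else relPath.count(_ == '/') + 1

  // Parents are processed before children, so a parent's resolved ACL is available
  // when an unmatched child needs to inherit it (assumed interpretation).
  def synchronize(source: AclTree, target: AclTree): AclTree =
    target.keys.toSeq.sortBy(depth).foldLeft(Map.empty[String, Seq[String]]) { (resolved, path) =>
      val inherited = if (path.isEmpty) None else resolved.get(parentOf(path))
      val acl = source.get(path).orElse(inherited).getOrElse(target(path))
      resolved + (path -> acl)
    }

  // Made-up sample trees: Folder1 exists on both sides, Folder3 and File1 only in target.
  val source: AclTree = Map(
    "" -> Seq("group:a:r-x"),
    "Folder1" -> Seq("group:a:rwx"),
    "Folder2" -> Seq("group:a:r--")
  )
  val target: AclTree = Map(
    "" -> Seq("group:t:---"),
    "Folder1" -> Seq("group:t:---"),
    "Folder3" -> Seq("group:t:---"),
    "Folder3/File1" -> Seq("group:t:---")
  )
  val result: AclTree = synchronize(source, target)
}
```

In this sample, `Folder1` receives the source ACL, while `Folder3` and `Folder3/File1` fall back to what was resolved for their parents.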
Other ACL functions
com.pg.bigdata.octopufs.acl.AclManager
▪ Modify table ACLs
▪ Modifies ACLs on files/folders related to Hive table (based on table location)
▪ Modify ACLs for paths
▪ Get ACLs for paths
File delta
I’d like to copy only what has changed
File delta
▪ getDelta
▪ Returns lists of paths which differ between the two folders (exist in only one of them, or have a different size)
▪ synchronize
▪ Executes delete operation on paths not existing in source
▪ Executes distributed copy for files existing only in source
Package com.pg.bigdata.octopufs.Delta
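A delta of this kind can be computed from two folder listings alone. The following is a minimal sketch in plain Scala over maps of relative path to size. `DeltaSketch` and `FileDelta` are hypothetical names and not the OctopuFS `Delta` API. The slides do not say how `synchronize` treats files whose size differs, so re-copying them here is an assumption.

```scala
object DeltaSketch {
  // Listing of one folder: relative path -> size in bytes.
  type Listing = Map[String, Long]

  final case class FileDelta(onlyInSource: Set[String], onlyInTarget: Set[String], sizeDiffers: Set[String])

  // Compare the two listings by path and size only; file contents are never read.
  def getDelta(source: Listing, target: Listing): FileDelta = {
    val common = source.keySet.intersect(target.keySet)
    FileDelta(
      onlyInSource = source.keySet -- target.keySet,
      onlyInTarget = target.keySet -- source.keySet,
      sizeDiffers = common.filter(p => source(p) != target(p))
    )
  }

  // Model the outcome of a synchronization: delete what exists only in target,
  // copy what exists only in source, and (assumption) re-copy files whose size differs.
  def synchronize(source: Listing, target: Listing): Listing = {
    val d = getDelta(source, target)
    val toCopy = d.onlyInSource ++ d.sizeDiffers
    (target -- d.onlyInTarget) ++ toCopy.map(p => p -> source(p))
  }

  // Sample listings with made-up sizes.
  val source: Listing = Map("File1" -> 10L, "File3" -> 30L)
  val target: Listing = Map("File1" -> 10L, "File2" -> 20L)
}
```

With these sample listings, `File3` is copied and `File2` is deleted, so the target ends up identical to the source.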
[Diagram: Source (File1, File3); Target before synchronization (File1, File2); Target after synchronization (File1, File3, with File2 removed)]
Hive tables/metastore operations
Hive Tables/Metastore operations
▪ Copy / move files between tables
▪ Copy / move table partitions
▪ Partition exchange not available for non-Delta tables
▪ Relies on metastore file list for the table
▪ Keep hive metadata up to date
▪ refreshTable
▪ recoverPartitions
Package com.pg.bigdata.octopufs.Promotor
Interesting metastore functions com.pg.bigdata.octopufs.metastore
Prerequisites
▪ RDD API security setup
▪ https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2#rdd-api
▪ Turn off (or tune) speculative execution (recommended)
▪ spark.conf.set("spark.speculation","false")
▪ Most methods require implicit parameter
▪ SparkSession – for distributed copy
▪ Configuration – for local, multithreaded operation
implicit val c = spark.sparkContext.hadoopConfiguration
implicit val s = spark
Challenges and learnings
Challenge #1
▪ Hadoop configuration not available in task function
▪ Initial approach:
▪ Create serializable shell-class and put configuration inside
▪ “unpack” configuration in task function
▪ Solution:
▪ Broadcast configuration from driver to the tasks
Access storage from spark task
Driver: val confBroadcast = spark.sparkContext.broadcast(
new SerializableWritable(spark.sparkContext.hadoopConfiguration))
Task: val conf: Configuration = confBroadcast.value.value
Challenge #2
▪ Default spark partitioner was not ideal
▪ Solution:
▪ Index each path
▪ Define very simple custom partitioner
Precisely control distribution of file paths in copy operation
class PromotorPartitioner(override val numPartitions: Int) extends Partitioner {
override def getPartition(key: Any): Int = key match {
case (ind: Int) => ind % numPartitions
}
}
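The effect of the `ind % numPartitions` rule can be checked without a cluster. The sketch below applies the same modulo rule to a plain list of paths. `PartitionSketch` and its functions are illustrative names, not part of OctopuFS or Spark.

```scala
object PartitionSketch {
  // Same rule as the partitioner above: the path's index decides its partition.
  def partitionOf(index: Int, numPartitions: Int): Int = index % numPartitions

  // Index each path, then group the paths by the partition they would land in.
  def distribute(paths: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    paths.zipWithIndex
      .groupBy { case (_, i) => partitionOf(i, numPartitions) }
      .map { case (partition, indexed) => partition -> indexed.map(_._1) }
}
```

Because the indexes are consecutive, partition sizes differ by at most one. Setting the number of partitions equal to the number of paths gives one file per task.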
Where to find OctopuFS?
GitHub repo
OctopuFS is now open-sourced
Use and contribute!
https://github.com/procter-gamble-tech/octopufs
Special thanks to
▪ NAS Development team @P&G
▪ Jason Hubbard @Databricks
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Backup slides
Use case
▪ Cloud – Azure
▪ Data promotion to reporting layer with minimal interruption
▪ Data backup or copy to non-Prod environment
▪ Synchronize file security of newly loaded data with production
▪ File delta detection and synchronization
▪ Do all the above without using Storage Account API directly

More Related Content

What's hot

Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Databricks
 
Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFood
Databricks
 
How to performance tune spark applications in large clusters
How to performance tune spark applications in large clustersHow to performance tune spark applications in large clusters
How to performance tune spark applications in large clusters
Omkar Joshi
 
Optimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningOptimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File Pruning
Databricks
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Databricks
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
Databricks
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Databricks
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
Databricks
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
Databricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Databricks
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Databricks
 
Continuous Processing in Structured Streaming with Jose Torres
 Continuous Processing in Structured Streaming with Jose Torres Continuous Processing in Structured Streaming with Jose Torres
Continuous Processing in Structured Streaming with Jose Torres
Databricks
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
Databricks
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Databricks
 

What's hot (20)

Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFood
 
How to performance tune spark applications in large clusters
How to performance tune spark applications in large clustersHow to performance tune spark applications in large clusters
How to performance tune spark applications in large clusters
 
Optimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningOptimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File Pruning
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Continuous Processing in Structured Streaming with Jose Torres
 Continuous Processing in Structured Streaming with Jose Torres Continuous Processing in Structured Streaming with Jose Torres
Continuous Processing in Structured Streaming with Jose Torres
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 

Similar to Managing ADLS gen2 using Apache Spark

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
Radu Chilom
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
Kyle Burke
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
Sparkstreaming
SparkstreamingSparkstreaming
Sparkstreaming
Marilyn Waldman
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
felixcss
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
Fabio Fumarola
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
Haoyuan Li
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Speed it up and Spark it up at Intel
Speed it up and Spark it up at IntelSpeed it up and Spark it up at Intel
Speed it up and Spark it up at Intel
DataWorks Summit
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
ProTechSkills Training
 
Hadoop spark online demo
Hadoop spark online demoHadoop spark online demo
Hadoop spark online demo
Tripti Jha
 

Similar to Managing ADLS gen2 using Apache Spark (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Sparkstreaming
SparkstreamingSparkstreaming
Sparkstreaming
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Speed it up and Spark it up at Intel
Speed it up and Spark it up at IntelSpeed it up and Spark it up at Intel
Speed it up and Spark it up at Intel
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hadoop spark online demo
Hadoop spark online demoHadoop spark online demo
Hadoop spark online demo
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform



Managing ADLS gen2 using Apache Spark

  • 1.
  • 2. ADLSg2 management toolkit aka “OctopuFS” Jacek Tokar Lead Data Engineer, Advanced Analytics @ Procter&Gamble
  • 3. Agenda Use case and design approach Warm up – get size of your data Distributed file copy Basic file operations (multithreaded) Managing file ACLs File delta operations Metastore functions Databricks setup requirements Challenges and learnings Where to get OctopuFS
  • 4. A Company of Leading Brands
  • 5. P&G – main IT hubs Data/ML Engineering: • Warsaw, Poland • Cincinnati, OH, USA • Guangzhou, China • San Jose, Costa Rica Data Science: • Cincinnati, OH, USA • Geneva, Switzerland • Guangzhou, China We’re hiring! https://www.pgcareers.com/
  • 6. Use case and design approach “I am not lazy! I’m efficient!”
  • 7. Use case Reporting Prev Pre-PROD Copy Move Delete ACLs ADLSgen2 ▪ Avoid direct use of Storage Account API - use Hadoop FS
  • 8. Performance [Bar chart, axis 0 to 80: standard approach vs OctopuFS for Copy (1.6TB), Move (folder with 21k files) and Modify ACL (16k paths). Chart annotations: Killed @70min; Spark read/write; dbutils.fs.mv; Lack of alternatives]
  • 9. Warm up I’d like to know size of my data
  • 10. Function getSize ▪ Function in com.pg.bigdata.octopufs.fs ▪ Prints size and number of files ▪ Returns FsSizes – returns all paths with their size ▪ Enables drilldown without sending requests to the storage val sizes = getSize("abfss://dev@myAdls.dfs.core.windows.net/somePath") Number of files in abfss://dev@myAdls.dfs.core.windows.net/somePath is 21003 Size of abfss://dev@myAdls.dfs.core.windows.net/somePath is 1.58 TB sizes.getSizeOfPath("abfss://dev@myAdls.dfs.core.windows.net/somePath/myData/myDataset")
  • 11. File copy I’d like to back up my data
  • 12. Distributed copy ▪ Evenly distributed files across tasks ▪ Runs 1 file per task by default ▪ Number of tasks can be customized – may be helpful if many small files ▪ Performance depends on network throughput (vs CPU) ▪ Can copy between different filesystems Leverages spark tasks to perform FileSystem copy operation Package com.pg.bigdata.octopufs.fs DistributedExecution.copyFolder(sourceFolderPath, destinationFolderPath) (implicit val spark: SparkSession)
  • 13. Copy operation – listing files – on driver node DS4_V2 (28GB RAM, 8 cores) Listing 21k files 21 seconds
  • 14. Copy operation – cluster load spark.read.parquet(path).write.mode("overwrite").parquet(path2) DistributedExecution.copyFolder(path, path2)
  • 15. Distributed copy - summary ▪ 3x faster than spark read/write ▪ Uses all worker nodes ▪ Maximizes usage of network throughput
  • 16. Basic file operations Promote the data without interruption
  • 17. Local (multi-threaded) fs operations ▪ Runs on driver node only ▪ movePaths - FileSystem.rename on all provided Paths ▪ moveFolderContent - FileSystem.rename on all descendants ▪ deletePaths – FileSystem.delete on all Paths ▪ deleteFolder – deletes folder or its content only The Future is here Package com.pg.bigdata.octopufs.fs.LocalExecution Paths case class com.pg.bigdata.octopufs.fs.Paths
  • 18. Local (multi-threaded) fs operations ▪ Default parallelism is 1000 ▪ Storage Account limit 20,000 requests/s ▪ Avg request time is ~50ms ▪ Parallelism can be customized by modification of com.pg.bigdata.octopufs.helpers.implicits ▪ Retry built-in (up to 5 attempts) ▪ If operation fails, move will resume from where it failed The Future is here Package com.pg.bigdata.octopufs.fs.LocalExecution Paths case class com.pg.bigdata.octopufs.fs.Paths
  • 19. Local (multi-threaded) fs operations - summary ▪ ∞ faster than dbutils.fs.mv ▪ Does not require cluster to run ▪ Driver VM requirements are low ▪ The only real limitation is Storage Account request throughput
  • 20. Managing file ACLs Modify access of loaded files
  • 21. File ACL modification com.pg.bigdata.octopufs.acl.AclManager val acl = AclManager.FsPermission("group", "rwx", "ACCESS", "b###################c") Access type: “ACCESS”, “DEFAULT” Grantee type: “user”, “group” Access details: r(ead), w(rite), (e)x(ecute) ▪ Applying ACL to directory tree ▪ Applies ACCESS type to files and ACCESS+DEFAULT to folders ▪ Takes 19 seconds for 15k files and 1.3k folders AclManager.modifyFolderTreeAcl(path, acl)
  • 22. Synchronize ACLs between folder trees com.pg.bigdata.octopufs.acl.AclManager AclManager.synchronizeAcls(source, target) [Diagram: Source tree (Folder1, Folder2) and Target tree (Folder1, Folder3, File1)] AclManager.synchronizeAcls(target, target) [Diagram: Target (File1, File2, File3)]
  • 23. Synchronize ACLs between folder trees com.pg.bigdata.octopufs.acl.AclManager ▪ Gets ACL list from “source” folder tree ▪ Finds corresponding folders/files in “target” folder tree ▪ When a path is matched, copies its ACL over to the “target” path ▪ If not matched, the path inherits security from the target parent folder AclManager.synchronizeAcls(source, target) [Diagram: Source tree (Folder1, Folder2) and Target tree (Folder1, Folder3, File1)]
  • 24. Other ACL functions com.pg.bigdata.octopufs.acl.AclManager ▪ Modify table ACLs ▪ Modifies ACLs on files/folders related to Hive table (based on table location) ▪ Modify ACLs for paths ▪ Get ACLs for paths
  • 25. File delta I’d like to copy only what has changed
  • 26. File delta ▪ getDelta ▪ Returns lists of paths that differ between the two folders (exist in only one of them, or have a different size) ▪ synchronize ▪ Executes delete operation on paths not existing in source ▪ Executes distributed copy for files existing only in source Package com.pg.bigdata.octopufs.Delta Source File1 File3 Target File1 File2 Target File1 File3 File2
  • 28. Hive Tables/Metastore operations ▪ Copy / move files between tables ▪ Copy / move table partitions ▪ Partition exchange not available for non-Delta tables ▪ Relies on metastore file list for the table ▪ Keep hive metadata up to date ▪ refreshTable ▪ recoverPartitions Package com.pg.bigdata.octopufs.Promotor Interesting metastore functions com.pg.bigdata.octopufs.metastore
  • 30. Prerequisites ▪ RDD API security setup ▪ https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2#rdd-api ▪ Turn off (or tune) speculative execution (recommended) ▪ spark.conf.set("spark.speculation","false") ▪ Most methods require implicit parameter ▪ SparkSession – for distributed copy ▪ Configuration – for local, multithreaded operation implicit val c = spark.sparkContext.hadoopConfiguration implicit val s = spark
  • 32. Challenge #1 ▪ Hadoop configuration not available in task function ▪ Initial approach: ▪ Create serializable shell-class and put configuration inside ▪ “unpack” configuration in task function ▪ Solution: ▪ Broadcast configuration from driver to the tasks Access storage from spark task Driver: val confBroadcast = spark.sparkContext.broadcast( new SerializableWritable(spark.sparkContext.hadoopConfiguration)) Task: val conf: Configuration = confBroadcast.value.value
  • 33. Challenge #2 ▪ Default spark partitioner was not ideal ▪ Solution: ▪ Index each path ▪ Define very simple custom partitioner Precisely control distribution of file paths in copy operation class PromotorPartitioner(override val numPartitions: Int) extends Partitioner { override def getPartition(key: Any): Int = key match { case (ind: Int) => ind % numPartitions } }
  • 34. Where to find OctopuFS?
  • 35. GitHub repo OctopuFS is now open-sourced Use and contribute! https://github.com/procter-gamble-tech/octopufs
  • 36. Special thanks to ▪ NAS Development team @P&G ▪ Jason Hubbard @Databricks
  • 37. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 38.
  • 40. Use case ▪ Cloud – Azure ▪ Data promotion to reporting layer with minimal interruption ▪ Data backup or copy to non-Prod environment ▪ Synchronize file security of newly loaded data with production ▪ File delta detection and synchronization ▪ Do all the above without using Storage Account API directly
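
Slides 17 and 18 describe driver-side operations that run in parallel with a built-in retry of up to 5 attempts. The following is a minimal sketch of that pattern in plain Scala, assuming Scala Futures on a fixed thread pool. The names `ParallelOpsSketch`, `retry`, `runAll` and `requestsPerSecond` are hypothetical and are not taken from OctopuFS; the operation is passed in as a function so no storage access is needed to try it.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical helper object, not part of OctopuFS.
object ParallelOpsSketch {

  // Runs `op` up to `maxAttempts` times and returns the first success,
  // or the last failure if every attempt throws.
  @annotation.tailrec
  def retry[T](maxAttempts: Int)(op: () => T): Try[T] =
    Try(op()) match {
      case s @ Success(_)                => s
      case Failure(_) if maxAttempts > 1 => retry(maxAttempts - 1)(op)
      case f                             => f
    }

  // Applies `op` to every path on a pool of `parallelism` threads.
  // Each path gets its own Future; results come back in input order.
  def runAll[T](paths: Seq[String], parallelism: Int, maxAttempts: Int = 5)(
      op: String => T): Seq[(String, Try[T])] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = paths.map(p => Future(p -> retry(maxAttempts)(() => op(p))))
      Await.result(Future.sequence(futures), 10.minutes)
    } finally pool.shutdown()
  }

  // Idealized throughput estimate: concurrent requests per average request time.
  def requestsPerSecond(parallelism: Int, avgRequestMillis: Double): Double =
    parallelism * 1000.0 / avgRequestMillis
}
```

With the figures from slide 18, `ParallelOpsSketch.requestsPerSecond(1000, 50)` evaluates to `20000.0`, the same number as the quoted Storage Account limit of 20,000 requests/s. Whether the default parallelism was chosen for that reason is a guess, and real throughput will vary with request latency.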
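
The file delta logic on slide 26 can be illustrated without touching storage by modelling each folder listing as a map from relative path to file size. This is a sketch under stated assumptions: `DeltaSketch`, its `Delta` case class and both method signatures are hypothetical and do not reflect the real `com.pg.bigdata.octopufs.Delta` API. Re-copying files whose size differs is also an assumption of this sketch, since the slide only lists deletion and copying of files that exist only in source.

```scala
// Hypothetical in-memory model, not the OctopuFS implementation.
// A listing maps a relative path to its size in bytes.
object DeltaSketch {

  final case class Delta(
      onlyInSource: Set[String], // candidates for copy to target
      onlyInTarget: Set[String], // candidates for delete from target
      sizeMismatch: Set[String]  // present in both, but sizes differ
  )

  // Compares two listings by path existence and file size only.
  def getDelta(source: Map[String, Long], target: Map[String, Long]): Delta = {
    val common = source.keySet intersect target.keySet
    Delta(
      onlyInSource = source.keySet -- target.keySet,
      onlyInTarget = target.keySet -- source.keySet,
      sizeMismatch = common.filter(p => source(p) != target(p))
    )
  }

  // Returns what the target listing would look like after synchronization:
  // paths missing from source are dropped, new paths are added, and
  // (assumption) size-mismatched paths are overwritten from source.
  def synchronize(source: Map[String, Long], target: Map[String, Long]): Map[String, Long] = {
    val d = getDelta(source, target)
    val afterDelete = target -- d.onlyInTarget
    afterDelete ++ (d.onlyInSource ++ d.sizeMismatch).map(p => p -> source(p))
  }
}
```

For the slide's example, a source holding File1 and File3 and a target holding File1 and File2 yields File3 as copy candidate and File2 as delete candidate. Comparing by size alone keeps the check cheap, at the cost of missing changes that leave the size unchanged.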
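
The effect of the index-modulo partitioner from slide 33 can be checked without a Spark cluster. The sketch below applies the same `index % numPartitions` rule to a plain list of paths; `RoundRobinSketch` and `assign` are hypothetical names used only for illustration, and the Spark `Partitioner` itself is left as shown on the slide.

```scala
// Hypothetical stand-alone illustration of the index % numPartitions rule.
object RoundRobinSketch {

  // Indexes each path, then groups paths by (index modulo numPartitions),
  // mirroring what the custom partitioner does with indexed keys.
  def assign(paths: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    paths.zipWithIndex
      .groupBy { case (_, idx) => idx % numPartitions }
      .map { case (partition, indexed) => partition -> indexed.map(_._1) }
}
```

Because consecutive indexes land in consecutive partitions, partition sizes differ by at most one file, which is the even spread of files across tasks that slide 12 mentions. It balances file counts only, not bytes, so a few very large files can still dominate one task.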