Managing ADLS gen2 using Apache Spark

ADLSg2 management toolkit
aka “OctopuFS”
Jacek Tokar
Lead Data Engineer, Advanced Analytics @ Procter&Gamble

Agenda
Use case and design approach
Warm up – get size of your data
Distributed file copy
Basic file operations (multithreaded)
Managing file ACLs
File delta operations
Metastore functions
Databricks setup requirements
Challenges and learnings
Where to get OctopuFS

P&G – main IT hubs
Data/ML Engineering:
• Warsaw, Poland
• Cincinnati, OH, USA
• Guangzhou, China
• San Jose, Costa Rica
Data Science:
• Cincinnati, OH, USA
• Geneva, Switzerland
• Guangzhou, China
We’re hiring!
https://www.pgcareers.com/

Use case and design approach
“I am not lazy! I’m efficient!”

Use case
Reporting
Prev
Pre-PROD
Copy
Move
Delete
ACLs
ADLSgen2
▪ Avoid direct use of Storage Account API - use Hadoop FS

Performance
0
10
20
30
40
50
60
70
80
Copy (1.6TB) Move (folder with 21k files) Modify ACL (16k paths)
standard OctopuFS
Killed @70min
Sparkread/write
dbutils.fs.mv
Lackofalternatives

Warm up
I’d like to know size of my data

Function getSize
▪ Function in com.pg.bigdata.octopufs.fs
▪ Prints size and number of files
▪ Returns FsSizes – returns all paths with their size
▪ Enables drilldown without sending requests to the storage
val sizes = getSize("abfss://dev@myAdls.dfs.core.windows.net/somePath")
Number of files in abfss://dev@myAdls.dfs.core.windows.net/somePath is 21003
Size of abfss://dev@myAdls.dfs.core.windows.net/somePath is 1.58 TB
sizes.getSizeOfPath("abfss://dev@myAdls.dfs.core.windows.net/somePath/myData/myDataset")

File copy
I’d like to backup my data

Distributed copy
▪ Evenly distributed files across tasks
▪ Runs 1 file per task by default
▪ Number of tasks can be customized – may be helpful if many small files
▪ Performance depends on network throughput (vs CPU)
▪ Can copy between different filesystems
Leverages spark tasks to perform FileSystem copy operation
Package com.pg.bigdata.octopufs.fs
DistributedExecution.copyFolder(sourceFolderPath, destinationFolderPath)
(implicit val spark: SparkSession)

Copy operation – listing files – on driver node
DS4_V2 (28GB RAM, 8 cores)
Listing 21k files
21 seconds

Copy operation – cluster load
spark.read.parquet(path).
write.mode("overwrite").parquet(path2)
DistributedExecution.copyFolder(path, path2)

Distributed copy - summary
▪ 3x faster than spark read/write
▪ Uses all worker nodes
▪ Maximizes usage of network throughput

Basic file operations
Promote the data without interruption

Local (multi-threaded) fs operations
▪ Runs on driver node only
▪ movePaths - FileSystem.rename on all provided Paths
▪ moveFolderContent - FileSystem.rename on all descendants
▪ deletePaths – FileSystem.delete on all Paths
▪ deleteFolder – deletes folder or its content only
The Future is here
Package com.pg.bigdata.octopufs.fs.LocalExecution
Paths case class com.pg.bigdata.octopufs.fs.Paths

Local (multi-threaded) fs operations
▪ Default parallelism is 1000
▪ Storage Account limit 20,000 requests/s
▪ Avg request time is ~50ms
▪ Parallelism can be customized by modification of com.pg.bigdata.octopufs.helpers.implicits
▪ Retry built-in (up to 5 attempts)
▪ If operation fails, move will resume from where it failed
The Future is here
Package com.pg.bigdata.octopufs.fs.LocalExecution
Paths case class com.pg.bigdata.octopufs.fs.Paths

Local (multi-threaded) fs operations - summary
▪ ∞ faster than dbutils.fs.mv
▪ Does not require cluster to run
▪ Driver VM requirements are low
▪ The only real limitation is Storage Account request throughput

Managing files ACLs
Modify access of loaded files

File ACL modification
com.pg.bigdata.octopufs.acl.AclManager
val acl = AclManager.FsPermission(
"group", "rwx","ACCESS","b###################c")
Access type: “ACCESS”,”DEFAULT”Grantee type:
“user”, “group”
Access details: r(ead),w(rite),(e)x(ecute)
▪ Applying ACL to directory tree
▪ Applies ACCESS type to files and ACCESS+DEFAULT to folders
▪ Takes 19 seconds for 15k files and 1.3k folders
AclManager.modifyFolderTreeAcl(path, acl)

Synchronize ACLs between folder trees
AclManager.synchronizeAcls (source, target)
Source
Folder1
Folder2
Target
Folder1
Folder3
File1
Source
Folder1
Folder2
Target
Folder1
Folder3
File1
AclManager.synchronizeAcls (target, target)
Target
File1
File2
File3
Target
File1
File2
File3

Synchronize ACLs between folder trees
▪ Gets ACL list from “source” folder tree
▪ Finds corresponding folder/files in “target” folder tree
▪ When path was matched, copies ACL over to “target” path
▪ If not matched, inherits security from target parent folder
AclManager.synchronizeAcls (source, target)
Source
Folder1
Folder2
Target
Folder1
Folder3
File1
Source
Folder1
Folder2
Target
Folder1
Folder3
File1

Other ACL functions
▪ Modify table ACLs
▪ Modifies ACLs on files/folders related to Hive table (based on table location)
▪ Modify ACLs for paths
▪ Get ACLs for paths

File delta
I’d like to copy only what has changed

File delta
▪ getDelta
▪ Returns lists of paths which vary (exist/don’t exist, have different size) in both folders
▪ synchronize
▪ Executes delete operation on paths not existing in source
▪ Executes distributed copy for files existing only in source
Package com.pg.bigdata.octopufs.Delta
Source
File1
File3
Target
File1
File2
Target
File1
File3File2

Hive tables/metastore operations

Hive Tables/Metastore operations
▪ Copy / move files between tables
▪ Copy / move table partitions
▪ Partition exchange not available for non-Delta tables
▪ Relies on metastore file list for the table
▪ Keep hive metadata up to date
▪ refreshTable
▪ recoverPartitions
Package com.pg.bigdata.octopufs.Promotor
Interesting metastore functions com.pg.bigdata.octopufs.metastore

Prerequisits
▪ RDD API security setup
▪ https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2#rdd-api
▪ Turn off (or tune) speculative execution (recommended)
▪ spark.conf.set("spark.speculation","false")
▪ Most methods require implicit parameter
▪ SparkSession – for distributed copy
▪ Configuration – for local, multithreaded operation implicit val c = spark.sparkContext.hadoopConfiguration
implicit val s = spark

Challenge #1
▪ Hadoop configuration not available in task function
▪ Initial approach:
▪ Create serializable shell-class and put configuration inside
▪ ”unpack” configuration in task function
▪ Solution:
▪ Broadcast configuration from driver to the tasks
Access storage from spark task
Driver: val confBroadcast = spark.sparkContext.broadcast(
new SerializableWritable(spark.sparkContext.hadoopConfiguration))
Task: val conf: Configuration = confBroadcast.value.value

Challenge #2
▪ Default spark partitioner was not ideal
▪ Solution:
▪ Index each path
▪ Define very simple custom partitioner
Precisely control distribution of file paths in copy operation
class PromotorPartitioner(override val numPartitions: Int) extends Partitioner {
override def getPartition(key: Any): Int = key match {
case (ind: Int) => ind % numPartitions
}
}

GitHub repo
OctopuFS is now open-sourced
Use and contribute!
https://github.com/procter-gamble-tech/octopufs

Special thanks to
▪ NAS Development team @P&G
▪ Jason Hubbard @Databricks

Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

Use case
▪ Cloud – Azure
▪ Data promotion to reporting layer with minimal interruption
▪ Data backup or copy to non-Prod environment
▪ Synchronize file security of newly loaded data with production
▪ File delta detection and synchronization
▪ Do all the above without using Storage Account API directly

Managing ADLS gen2 using Apache Spark

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Managing ADLS gen2 using Apache Spark

Similar to Managing ADLS gen2 using Apache Spark (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Managing ADLS gen2 using Apache Spark