Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments with Sim Simeonov
While processing more data through an existing set of ETL or ML/AI pipelines is easy with Spark, dealing with an ever-expanding and/or changing set of pipelines can be quite challenging, all the more so when there are complex inter-dependencies. Workflow-based job orchestration offers some help in the case of relatively static flows but fails miserably when it comes to supporting fast-paced data production such as data science experimentation (feature exploration, model tuning, …), ad hoc analytics and root cause analysis.

This talk will introduce three patterns for large-scale data production in fast-paced environments–just-in-time dependency resolution (JDR), configuration-addressed production (CAP) and automated lifecycle management (ALM)–with ETL & ML/AI demos as well as open-source code you can use in your projects. These patterns have been production-tested in Swoop’s petabyte-scale environment where they have significantly increased human productivity and processing flexibility while reducing costs by more than 10x.

By adopting these patterns you’ll get the benefits typically associated with rigidly-planned and highly-coordinated data production quickly & efficiently, without endless meetings or even a workflow server. You will be able to transparently ensure result accuracy even in the face of hundreds of constantly-changing inputs, eliminate duplicate computation within and across clusters and automate lifecycle management.


  1. Unafraid of change: Optimizing ETL, ML & AI in fast-paced environments. Sim Simeonov, CTO, Swoop. sim at swoop.com / @simeons #SAISDD13
  2. omni-channel marketing for your ideal population, supported by privacy-preserving petabyte-scale ML/AI over rich data (thousands of columns). e.g., we improve health outcomes by increasing the diagnosis rate of rare diseases through doctor/patient education
  3. exploratory data science
  4. the myth of data science
  5. The reality of data science is the curse of too many choices (most of which are no good)
     • 10? Too slow to train.
     • 2? Insufficient differentiation.
     • 5? False differentiation.
     • 3? Insufficient differentiation.
     • 4? Looks OK to me. Ship it!
  6. exploratory data science is an interactive search process (with fuzzy termination criteria) driven by the discovery of new information
  7. Exploration tree: notebook cell evaluation; backtracking is code modification
  8. Backtracking changes production plans
     df.sample(0.1, seed = 0).select('x)
     df.where('x.isNotNull).sample(0.1, seed = 0).select('x, 'y)
     df.where('x > 0).sample(0.1, seed = 0).select('x, 'y)
     df.where('x > 0).sample(0.2, seed = 0).select('x, 'y)
     df.where('x > 0).sample(0.2, seed = 123).select('x, 'y)
  9. Backtracking is not easy with Spark & notebooks
     • Go back & change code: past state is lost
     • Duplicate code:
       val df1 = …
       val df2 = …
       val df3 = …
       // No, extensibility through functions doesn’t work
  10. Team view: many do this; one does this; thick branches benefit from caching
  11. Spark’s df.cache() has many limitations
      • Single cluster only
        – cached data cannot be shared by the team
      • Lost after cluster crash or restart
        – hours of work gone due to OOM or spot instance churn
      • Requires df.unpersist() to release resources
        – but backtracking means df (the cached plan) is lost
      • No dependency checks
        – inconsistent computation across team; stale data inputs
  12. Ideal exploratory data production requires…
      • Automatic cross-cluster reuse based on the Spark plan
        – Configuration-addressed production (CAP)
      • Not having to worry about saving/updating/deleting
        – Automated lifecycle management (ALM)
      • Easy control over the freshness/staleness of data used
        – Just-in-time dependency resolution (JDR)
  13. df.cache() becomes df.capCache(), part of https://github.com/swoop-inc/spark-alchemy
  14. Example: resampling with data enhancement (diagram). A large production table (updated frequently) is joined with a data enrichment dimension (updated sometimes); filter(condition) & sample(rate, seed) yields the data sample; 1..k: sample(rate, seed) yields sub-sample i; project, union and a final project produce the result.
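The seeded sampling that the resampling pipeline relies on can be sketched with plain Scala collections, without Spark. This is a hypothetical miniature (the names `sample` and `resampled` are illustrative only): the point is that a fixed seed reproduces the same sample on every rerun, which is what keeps the 1..k sub-samples stable when you backtrack.

```scala
import scala.util.Random

// Spark-free sketch of the seeded sampling used throughout the pipeline.
object ResampleSketch {
  // Bernoulli sample: keep each element with probability `rate`.
  // A fixed seed makes the result deterministic across reruns.
  def sample[A](xs: Seq[A], rate: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < rate)
  }

  // 1..k sub-samples, each driven by its own seed, as in the diagram.
  def resampled[A](xs: Seq[A], rate: Double, k: Int): Seq[(Int, Seq[A])] =
    (1 to k).map(i => i -> sample(xs, rate, seed = i))
}
```

Because each sub-sample is tied to its seed, changing k or the rate and re-running re-derives exactly the same earlier sub-samples, which is what makes the pipeline's intermediate results cacheable.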
  15. Are primes % 10,000 interesting numbers?
  16. // Get the data
      val inum = spark.table("interesting_numbers")
      val primes = spark.table("primes")
      val limit = 10000
      // Sample the primes
      val sample = primes.sample(0.1, seed = 0)
      // Resample, calculate stats & union
      val stats = (1 to 30).map { i =>
        sample.select((col("prime") % limit).as("n"))
          .sample(0.2, seed = i).distinct()
          .join(inum.select(col("n")), Seq("n"))
          .select(lit(i).as("sample"), count(col("*")).as("cnt_ns"))
      }.reduce(_ union _)
      // Show distribution of the percentage of interesting prime modulos
      display(stats.select(('cnt_ns * 100 / limit).as("pct_ns_of_limit")))
  17. // Get the data
      val inum = spark.table("interesting_numbers")
      val primes = spark.table("primes")
      val limit = 10000
      // Sample the primes
      val sample = primes.sample(0.1, seed = 0).capCache()
      // Resample, calculate stats & union
      val stats = (1 to 30).map { i =>
        sample.select((col("prime") % limit).as("n"))
          .sample(0.2, seed = i).distinct()
          .join(inum.select(col("n")), Seq("n"))
          .select(lit(i).as("sample"), count(col("*")).as("cnt_ns")).capCache()
      }.reduce(_ union _)
      // Show distribution of the percentage of interesting prime modulos
      display(stats.select(('cnt_ns * 100 / limit).as("pct_ns_of_limit")))
  18. behind the curtain…
  19. primes.sample(0.1, seed=0).explain(true)
      == Analyzed Logical Plan ==
      ordinal: int, prime: int, delta: int
      Sample 0.0, 0.1, false, 0
      +- SubqueryAlias primes
         +- Relation[ordinal#2034,prime#2035,delta#2036] parquet
      == Optimized Logical Plan ==
      Sample 0.0, 0.1, false, 0
      +- Relation[ordinal#2034,prime#2035,delta#2036] parquet
      == Physical Plan ==
      *(1) Sample 0.0, 0.1, false, 0
      +- *(1) FileScan parquet default.primes[ordinal#2034,prime#2035,delta#2036] Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/primes]
  20. cross-cluster canonicalization
      ordinal: int, prime: int, delta: int
      Sample 0.0, 0.1, false, 0
      +- Relation[ordinal#2034,prime#2035,delta#2036] parquet
         +- InMemoryFileIndex[dbfs:/user/hive/warehouse/primes]
      ordinal: int, prime: int, delta: int
      Sample 0.0, 0.1, false, 0
      +- Relation[ordinal#0,prime#1,delta#2] parquet
         +- InMemoryFileIndex[dbfs:/user/hive/warehouse/primes]
      CAP (hash: 4bb0070bd5f50bec95dd95f76130f55cd2d0c6bc)
      ALM (createdAt: {now}, expiresAt: {based on TTL})
      JDR (dependencies: ["table:default.primes"])
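The cross-cluster canonicalization idea on this slide can be sketched in a few lines of plain Scala. This is a hypothetical miniature, not Swoop's implementation: session-specific expression IDs such as #2034 are renumbered in order of first appearance, so the same logical plan hashes to the same value on any cluster.

```scala
import java.security.MessageDigest

// Hypothetical sketch: canonicalize a textual Spark plan by renumbering
// session-specific expression IDs (#2034, #0, ...) in order of first
// appearance, then hash the result so identical plans collide across clusters.
object PlanHash {
  private val ExprId = "#(\\d+)".r

  def canonicalize(plan: String): String = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    ExprId.replaceAllIn(plan, m => {
      val id = seen.getOrElseUpdate(m.group(1), seen.size)
      s"#$id"
    })
  }

  def hash(plan: String): String =
    MessageDigest.getInstance("SHA-1")
      .digest(canonicalize(plan).getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
}
```

With this, two plans that differ only in their expression IDs, like the pair shown above, produce the same CAP hash and can therefore share one materialized result.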
  21. Spark’s user-level data reading/writing APIs are inconsistent, inflexible and not extensible. To implement CAP, we had to design a parallel set of APIs.
  22. Reading
      • Inconsistent
        spark.table(tableName)
        spark.read + configuration + .load(path)
        .load(path1, path2, …)
        .jdbc(…)
      • Consistent
        spark.reader + configuration + .read()
  23. Writing
      • Inconsistent
        df.write + configuration + .save()
        .saveAsTable(tableName)
        .jdbc(…)
      • Consistent
        df.writer + configuration + .write()
        // You can read() from the result of write()
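What such a consistent surface could look like, as a self-contained miniature (the names MiniReader/MiniWriter and the in-memory store are hypothetical, not the spark-alchemy API): configuration accumulates on a builder, a single read()/write() verb executes, and write() hands back a reader over what was just written.

```scala
// Hypothetical miniature of the consistent reader/writer shape:
// builder + configuration + one verb, with write() returning a reader.
final case class MiniReader(config: Map[String, String],
                            store: Map[String, String]) {
  def option(k: String, v: String): MiniReader = copy(config = config + (k -> v))
  def read(): String = store(config("path"))
}

final case class MiniWriter(data: String,
                            config: Map[String, String] = Map.empty) {
  def option(k: String, v: String): MiniWriter = copy(config = config + (k -> v))
  // write() returns a MiniReader over the freshly written data,
  // mirroring "you can read() from the result of write()".
  def write(store: Map[String, String] = Map.empty): MiniReader =
    MiniReader(config, store + (config("path") -> data))
}
```

For example, MiniWriter("42 rows").option("path", "/tmp/t").write().read() round-trips the data through a single fluent chain, with no separate save()/saveAsTable()/jdbc() verbs to choose between.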
  24. Spark has two implicit mechanisms to manage where data is written & read from: the Hive metastore & “you’re on your own”. We made authorities explicit first-class objects. df.capCache() is df.writer.authority("CAP").readOrCreate()
  25. Spark offers no ability to inject behaviors into reading/writing. We added reading & writing interceptors. Dependency management works via an interceptor.
      trait InterceptorSupport[In, Out] {
        def intercept(f: In => Out): (In => Out)
      }
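The trait on slide 25 is enough to build a working interceptor. Here is a self-contained sketch (RecordingInterceptor and its logging are hypothetical, for illustration) of how dependency tracking could ride along a read without the caller changing anything:

```scala
// The trait from the slide: an interceptor wraps a read/write
// function and can add behavior around it.
trait InterceptorSupport[In, Out] {
  def intercept(f: In => Out): (In => Out)
}

// Hypothetical example: record every path that gets read, the way
// dependency management could track inputs behind the scenes.
class RecordingInterceptor(log: scala.collection.mutable.Buffer[String])
    extends InterceptorSupport[String, String] {
  def intercept(f: String => String): (String => String) =
    path => { log += path; f(path) }
}

object InterceptorDemo {
  def run(): Seq[String] = {
    val log = scala.collection.mutable.ArrayBuffer.empty[String]
    val read: String => String = path => s"data from $path"
    // The wrapped function has the same type as the original,
    // so interceptors compose and stay invisible to callers.
    val intercepted = new RecordingInterceptor(log).intercept(read)
    intercepted("table:default.primes")
    log.toSeq
  }
}
```

Because intercept returns a function of the same type it receives, interceptors can be chained, which is what lets dependency tracking (JDR's "dependencies" list on slide 20) accumulate as a side effect of ordinary reads.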
  26. It’s not hard to fix the user-level reading/writing APIs. Let’s not waste the chance to do it for Spark 3.0.
  27. Interested in challenging data engineering, ML & AI on big data? I’d love to hear from you. sim at swoop.com / @simeons. Data science productivity matters. https://github.com/swoop-inc/spark-alchemy
