www.scling.com
Schema on read is obsolete.
Welcome metaprogramming.
Data Innovation Summit, 2024-04-24
Lars Albertsson
Scling
IT craft to factory
[Diagram: traditional IT practices (waterfall, security, application delivery, traditional operations, traditional QA, infrastructure) alongside their industrialised counterparts (agile, DevSecOps, DevOps, CI/CD, containers, infrastructure as code)]
Data factories
[Diagram: the same progression extended to data. Traditional side: waterfall, security, application delivery, traditional operations, traditional QA, infrastructure, DB-oriented architecture. Industrialised side: agile, DevSecOps, DevOps, CI/CD, containers, infrastructure as code, and data factories, data pipelines, DataOps]
Craft vs industry
Craft:
● Each step steered by human
○ Or primitive automation
● Improving artifacts
● Craft is primary competence
● Components made for humans
○ Look nice, "easy to use"
○ More popular
Industry:
● Autonomous processes
● Improving process that creates artifacts
● Multitude of competences
● Some components unusable by humans
○ Hard, greasy
○ Made for integration
○ Less popular
Data engineering in the future
DW
~10 year capability gap
"data factory engineering"
Enterprise big data failures
"Modern data stack": traditional workflows, new technology
4GL / UML phase of data engineering
Data engineering education
Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 100-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2018: 100000+ datasets / day,
25% of staff use BigQuery
2021: 500B events collected / day
2016: 1 600 000 000 datasets / day
[Chart: value versus effort. Labels: financial, reporting; insights, data-fed features; disruptive value of data, machine learning]
Data-factory-as-a-service
Data lake
● Data factory
○ Collected, raw data → processed, valuable data
● Data pipelines customised for client
○ Analytics (BI, reports, A/B testing)
○ Data-fed features (autocomplete, search)
○ Learning systems (recommendations, fraud)
● Compete with data leaders:
○ Quick idea-to-production
○ Operational efficiency
Data agility
● Siloed: 6+ months
Cultural work
● Autonomous: 1 month
Technical work
● Coordinated: days
Data lake
Latency?
What is a schema?
● Lowest common denominator = name, type, required
○ Types: string, long, double, binary, array, map, union, record
● Schema specification may support additional constraints, e.g. integer range, other collections
Example table:
Id  Name    Age  Phone
1   "Anna"  34   null
2   "Bob"   42   "08-123456"
Fields: Name, Type, Required?
In RDBMS, relations are explicit
In lake/stream datasets, relations are implicit
Schema definitions
{
"type" : "record",
"namespace" : "com.mapflat.example",
"name" : "User",
"fields" : [
{ "name" : "id" , "type" : "int" },
{ "name" : "name" , "type" : "string" },
{ "name" : "age" , "type" : "int" },
{ "name" : "phone" , "type" : ["null", "string"],
"default": null }
]
}
● RDBMS: Table metadata
● Avro format: JSON/DSL definition
○ Definition is bundled with Avro data files
○ Reused by Parquet format
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
{ "id": 1, "name": "Alice", "age": "34" }
{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))
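A minimal Python sketch of why one JSON record is insufficient to deduce a schema. The helper name `infer_fields` is hypothetical, and the input is the two sample records from this slide.

```python
import json

def infer_fields(lines):
    """Per field: the JSON types observed and whether every record had it.

    Hypothetical helper for illustration; not part of any library.
    """
    records = [json.loads(line) for line in lines]
    names = {name for record in records for name in record}
    return {
        name: {
            "types": sorted({type(r[name]).__name__ for r in records if name in r}),
            "required": all(name in r for r in records),
        }
        for name in names
    }

lines = [
    '{ "id": 1, "name": "Alice", "age": "34" }',
    '{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }',
]

first_only = infer_fields(lines[:1])  # schema guessed from one record
both = infer_fields(lines)            # schema guessed from both records

print(sorted(first_only))  # field names seen in the first record only
print(both["phone"])       # phone shows up only once the second record is read
```

The guess is only as good as the sample: `phone` is invisible in the first record, and `age` is inferred as a string because the sample records quote it.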
Schema on write
● Schema defined by writer
● Destination (table / dataset / stream topic) has defined schema
○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
○ By convention
● Writes not in compliance are not accepted
○ Technically aborted (e.g. RDBMS)
○ In violation of intent (e.g. HDFS datasets)
● Can be technically enforced by producer driver
○ Through ORM / code generation
○ Schema registry lookup
Strict checking philosophy
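A minimal sketch of the strict philosophy, assuming an in-memory destination: the dataset owns a schema and aborts writes that do not comply. `Field`, `Dataset`, `SchemaViolation` and `USER_SCHEMA` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    type: type
    required: bool = True

class SchemaViolation(Exception):
    """Raised when a record does not comply with the destination schema."""

# Hypothetical schema mirroring the User example: phone is optional.
USER_SCHEMA = [
    Field("id", int),
    Field("name", str),
    Field("age", int),
    Field("phone", str, required=False),
]

def validate(record, schema):
    unknown = set(record) - {f.name for f in schema}
    if unknown:
        raise SchemaViolation(f"unknown fields: {sorted(unknown)}")
    for f in schema:
        value = record.get(f.name)
        if value is None:
            if f.required:
                raise SchemaViolation(f"missing required field: {f.name}")
        elif not isinstance(value, f.type):
            raise SchemaViolation(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )

class Dataset:
    """Destination with a defined schema; non-compliant writes are aborted."""
    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        validate(record, self.schema)  # check before anything is stored
        self.rows.append(record)

users = Dataset(USER_SCHEMA)
users.write({"id": 1, "name": "Anna", "age": 34})
try:
    users.write({"id": 2, "name": "Bob", "age": "42"})  # age has the wrong type
    rejected = False
except SchemaViolation:
    rejected = True
```

The violation surfaces in the producer, at the moment of the write, rather than in a downstream consumer.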
Schema on read
● Anything (technically) accepted when writing
● Schema defined by reader, at consumption
○ Reader may impose requirements on type & value
● In dynamic languages, fields propagate implicitly
○ E-shopping example:
i. Join order + customer.
ii. Add device_type to order schema
iii. device_type becomes available in downstream datasets
● Violations of constraints are detected at read
○ Perhaps long after production?
○ By team not owning producer?
Loose checking philosophy
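The e-shopping example can be sketched in Python with plain dicts. All field names (`customer_id`, `amount`, `device_type`, ...) and function names are assumptions made for illustration.

```python
def join_orders_with_customers(orders, customers):
    """Join order + customer; every field on either side is carried along."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**by_id[o["customer_id"]], **o} for o in orders]

def read_order_totals(rows):
    """The reader imposes its own type requirement at consumption time."""
    totals = []
    for row in rows:
        amount = row.get("amount")
        if not isinstance(amount, (int, float)):
            raise ValueError(f"bad amount in order {row.get('order_id')}: {amount!r}")
        totals.append(amount)
    return totals

customers = [{"customer_id": "c1", "country": "SE"}]

# The producer adds device_type; the join code is not changed.
orders = [{"order_id": "o1", "customer_id": "c1", "amount": 10.0,
           "device_type": "mobile"}]
joined = join_orders_with_customers(orders, customers)

# A malformed amount is accepted at write time and only fails at read time.
bad_orders = [{"order_id": "o2", "customer_id": "c1", "amount": "10,0"}]
bad_joined = join_orders_with_customers(bad_orders, customers)
try:
    read_order_totals(bad_joined)
    failed_at_read = False
except ValueError:
    failed_at_read = True
```

`device_type` reaches the downstream dataset without any code change, and the bad `amount` travels just as freely until a reader finally checks it.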
Dynamic vs static typing
Schema on write | Schema on read
Static typing | Dynamic typing
Strict | Loose
Possible
Static typing:
Java:
user.setName("Alice");
user2.getName();
Scala:
user = User(name = "Alice", ...)
user2.name
Dynamic typing:
Java:
user.set("name", "Alice");
user2.get("name");
Python:
user.name = "Alice"
user2.name
Schema on read or write?
[Diagram: services with their DBs, an export, and business intelligence. Annotations: "Change agility important here", "Production stability important here"]
Schema definition choice
● Expressive
● Custom types
● IDE support
● Avro for data lake storage
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with Avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))
Schema offspring
[Diagram: case classes as the source of generated artifacts]
● Test record difference render type classes
● Test equality type classes
● Avro definitions
● Java Avro codec classes
● Java <-> Scala converters
● Avro type annotations
● MySQL schemas
● CSV codecs
● Privacy by design machinery
● Python
● Logical types
Avro codecs
[Diagram: case classes → Avro definitions → Java Avro codec classes → Java <-> Scala converters]
{
  "name": "JavaUser",
  "type": "record",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone", "type": [ "null", "string" ] }
  ]
}
public class JavaUser implements SpecificRecord {
public Integer getAge() { ... }
public String getPhone() { ... }
}
object UserConverter extends AvroConverter[User] {
def fromSpecific(u: JavaUser): User
def toSpecific(u: User): JavaUser
}
case class User(age: Int,
phone: Option[String] = None)
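The first step of this chain, deriving an Avro definition from a class, can be sketched with Python dataclasses. The functions `avro_type` and `avro_schema` and the type mapping are assumptions for illustration, not the generator used in the talk; mapping Python `int` to Avro `int` ignores the 32-bit range limit.

```python
import dataclasses
import json
import typing
from typing import Optional

# Assumed mapping from Python types to Avro primitive type names.
AVRO_PRIMITIVES = {int: "int", str: "string", float: "double",
                   bool: "boolean", bytes: "bytes"}

def avro_type(py_type):
    """Translate a type hint; Optional[X] becomes the union ["null", X]."""
    if typing.get_origin(py_type) is typing.Union:
        args = typing.get_args(py_type)
        non_null = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(non_null) == 1:
            return ["null", avro_type(non_null[0])]
    return AVRO_PRIMITIVES[py_type]  # KeyError for types this sketch ignores

def avro_schema(cls, namespace):
    """Build an Avro record definition (as a dict) from a dataclass."""
    hints = typing.get_type_hints(cls)
    fields = []
    for f in dataclasses.fields(cls):
        entry = {"name": f.name, "type": avro_type(hints[f.name])}
        if f.default is None:  # a None default becomes an Avro null default
            entry["default"] = None
        fields.append(entry)
    return {"type": "record", "namespace": namespace,
            "name": cls.__name__, "fields": fields}

@dataclasses.dataclass
class User:
    age: int
    phone: Optional[str] = None

schema = avro_schema(User, "com.mapflat.example")
print(json.dumps(schema, indent=2))
```

The class stays the single source of truth; the JSON definition is an offspring that is regenerated whenever the class changes.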
Scalameta
● Parsing and analysis of Scala source code
Source:
val a = b() + 3
After lex:
["val", " ", "a", " ", "=", " ", "b", "(", ")", " ", "+", " ", "3"]
After parse:
[val, "a", =, Call("b"), +, Int(3)]
After semantic analysis:
[val, Int(a), =, Call(com.scling.func.b), +, Int(3)]
Scalameta use cases
● Scalafmt
● Scalafix
○ Static analysis
○ Code transformation
● Online code generation - macros
● Offline code generation
// Example from scio 0.7 -> 0.8 upgrade rules
import scalafix.v1._
import scala.meta._

final class FixTensorflow extends SemanticRule("FixTensorflow") {
  override def fix(implicit doc: SemanticDocument): Patch =
    doc.tree.collect {
      // Rewrite calls to the renamed method, keeping the receiver `s`
      case t @ Term.Select(s, Term.Name("saveAsTfExampleFile")) =>
        Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax)
    }.asPatch
}
Schema & syntax tree
Defn.Class(
List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case),
Type.Name("SaleTransaction"),
List(),
Ctor.Primary(
List(),
,
List(
List(
Term.Param(
List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))),
Term.Name("customerClubId"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(
List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))),
Term.Name("storeId"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(
List(),
Term.Name("item"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None)
)
)
),
Template(List(), List(), Self(, None), List()))
@PrivacyShielded
case class SaleTransaction(
@PersonalId customerClubId: Option[String],
@PersonalData storeId: Option[String],
item: Option[String],
timestamp: String
)
Quasiquotes
val stat: Stat = "val a = b() + 3".parse[Stat].get
val stat: Stat = q"val a = b() + 3"
Quasiquotes in practice
q"""
  object $converterName extends AvroConverter[${srcClass.clazz.name}] {
    import RecordFieldConverters._
    type S = $jClassName
    def schema: Schema = $javaClassTerm.getClassSchema()
    def tag: ClassTag[S] = implicitly[ClassTag[S]]
    def datumReader: SpecificDatumReader[S] = new SpecificDatumReader[$jClassName](classOf[$jClassName])
    def datumWriter: SpecificDatumWriter[S] = new SpecificDatumWriter[$jClassName](classOf[$jClassName])
    def fromSpecific(record: $jClassName): ${srcClass.clazz.name} =
      ${Term.Name(srcClass.clazz.name.value)}(..$fromInits)
    def toSpecific(record: ${srcClass.clazz.name}): $jClassName =
      new $jClassName(..$specificArgs)
  }
"""
Test equality
[Diagram: case classes, test equality type classes, test record difference render type classes]
trait REquality[T] { def equal(left: T, right: T): Boolean }
object REquality {
  implicit val double: REquality[Double] = new REquality[Double] {
    def equal(left: Double, right: Double): Boolean = {
      // Use a combination of absolute and relative tolerance
      left === right +- 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5)
    }
  }
  /** binds the Magnolia macro to the `gen` method */
  implicit def gen[T]: REquality[T] = macro Magnolia.gen[T]
}
object Equalities {
  implicit val equalityUser: REquality[User] =
    REquality.gen[User]
}
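The same idea, sketched in Python: equality is derived generically from the record structure, with a tolerance for floating-point fields. This uses runtime reflection over dataclasses in place of Magnolia's compile-time derivation; `tolerant_equal` and `Measurement` are hypothetical names.

```python
import dataclasses

def tolerant_equal(left, right, tol=1e-5):
    """Structural equality with absolute and relative tolerance for floats."""
    if isinstance(left, float) and isinstance(right, float):
        bound = max(tol, abs(left) * tol, abs(right) * tol)
        return abs(left - right) <= bound
    if dataclasses.is_dataclass(left) and type(left) is type(right):
        # Derive equality field by field, like a generated type class instance.
        return all(
            tolerant_equal(getattr(left, f.name), getattr(right, f.name), tol)
            for f in dataclasses.fields(left)
        )
    if isinstance(left, (list, tuple)) and type(left) is type(right):
        return len(left) == len(right) and all(
            tolerant_equal(a, b, tol) for a, b in zip(left, right)
        )
    return left == right

@dataclasses.dataclass
class Measurement:  # hypothetical record type for the example
    sensor: str
    value: float

computed = Measurement("a", 0.1 + 0.2)
expected = Measurement("a", 0.3)
```

Plain `==` treats `computed` and `expected` as different because of rounding, while `tolerant_equal` accepts them and still rejects a value that is off by 0.01.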
Python + RDBMS
[Diagram: case classes, Avro type annotations, Avro definitions, Python, MySQL schemas]
case class User(
  age: Int,
  @AvroProp("sqlType", "varchar(1012)")
  phone: Option[String] = None)
{
  "name": "User",
  "type": "record",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone",
      "type": [ "null", "string" ],
      "sqlType": "varchar(1012)" }
  ]
}
class UserEgressJob(CopyToTable):
    columns = [
        ("age", "int"),
        ("phone", "varchar(1012)"),
    ]
    ...
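A minimal sketch of how such a column list could be derived from the Avro definition rather than written by hand. The function `sql_columns` and the fallback mapping `AVRO_TO_SQL` are assumptions for illustration; only the `sqlType` property comes from the slide.

```python
import json

# Assumed fallback mapping for fields that carry no explicit sqlType.
AVRO_TO_SQL = {"int": "int", "long": "bigint", "string": "text",
               "double": "double", "boolean": "boolean"}

def sql_columns(avro_definition):
    """Return (column name, SQL type) pairs for an Avro record definition."""
    schema = json.loads(avro_definition)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        if isinstance(avro_type, list):  # nullable union, e.g. ["null", "string"]
            avro_type = [t for t in avro_type if t != "null"][0]
        # An explicit sqlType annotation wins over the fallback mapping.
        columns.append((field["name"], field.get("sqlType", AVRO_TO_SQL[avro_type])))
    return columns

USER_AVRO = """
{
  "name": "User",
  "type": "record",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone", "type": ["null", "string"], "sqlType": "varchar(1012)" }
  ]
}
"""

columns = sql_columns(USER_AVRO)
print(columns)
```

The resulting list has the shape expected by the `columns` attribute of the egress job, so the table definition follows the schema instead of drifting from it.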
Logical types
[Diagram: case classes, logical types]
case t"""Instant""" =>
  JObject(List(JField("type", JString("long")),
    JField("logicalType", JString("timestamp-micros"))))
case t"""LocalDate""" =>
  JObject(List(JField("type", JString("int")),
    JField("logicalType", JString("date"))))
case t"""YearMonth""" => JObject(List(JField("type", JString("int"))))
case t"""JObject""" => JString("string")
● Avro logical types
○ E.g. date → int, timestamp → long
○ Default is timestamp-millis
■ Great for year > 294441 (!)
● Custom logical types
○ Time
○ Collections
○ Physical
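A small Python sketch of what the built-in logical types store, and of the arithmetic behind the year 294441 remark. The function names are hypothetical, and the range estimate assumes 365-day years, which reproduces the figure on the slide.

```python
from datetime import date, datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_avro_date(d):
    """Avro `date`: days since 1970-01-01, stored as an int."""
    return (d - date(1970, 1, 1)).days

def to_timestamp_micros(ts):
    """Avro `timestamp-micros`: microseconds since the epoch, stored as a long."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds

def to_timestamp_millis(ts):
    """Avro `timestamp-millis`: milliseconds since the epoch, stored as a long."""
    return to_timestamp_micros(ts) // 1_000

# Range of a signed 64-bit long holding microseconds, in 365-day years.
MAX_LONG = 2**63 - 1
MICROS_PER_YEAR = 1_000_000 * 86_400 * 365
last_micros_year = 1970 + MAX_LONG // MICROS_PER_YEAR

one_second = datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)
print(to_avro_date(date(1970, 1, 2)), to_timestamp_micros(one_second), last_micros_year)
```

Microsecond precision only runs out of range around year 294441, so millisecond precision buys range that is needed only for timestamps beyond that year.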
Schema on read or write?
[Diagram: services with their DBs, an export, and business intelligence. Annotations: "Change agility important here", "Production stability important here"]
Hydration boilerplate
Chimney - case class transformer
● Commonality in schema classes
○ Copy + a few more fields
○ Drop fields
● Statically typed
○ Forgot a field - error
○ Wrong type - error
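The pattern can be sketched in Python with dataclasses. This is not Chimney's API: `transform_into` and the record types are hypothetical, and the checks happen at runtime, whereas the bullets above describe compile-time errors.

```python
import dataclasses

def transform_into(source, target_cls, **extra):
    """Copy matching fields from source into target_cls, plus extra fields."""
    values = {f.name: getattr(source, f.name) for f in dataclasses.fields(source)}
    values.update(extra)
    kwargs = {}
    for f in dataclasses.fields(target_cls):
        if f.name not in values:
            has_default = (f.default is not dataclasses.MISSING
                           or f.default_factory is not dataclasses.MISSING)
            if has_default:
                continue
            # Forgot a field - error
            raise TypeError(f"{target_cls.__name__}.{f.name} has no source value")
        value = values[f.name]
        if isinstance(f.type, type) and not isinstance(value, f.type):
            # Wrong type - error
            raise TypeError(f"{f.name}: expected {f.type.__name__}, "
                            f"got {type(value).__name__}")
        kwargs[f.name] = value
    return target_cls(**kwargs)  # source fields absent from the target are dropped

@dataclasses.dataclass
class Order:
    order_id: str
    amount: float

@dataclasses.dataclass
class EnrichedOrder:  # copy + one more field
    order_id: str
    amount: float
    country: str

@dataclasses.dataclass
class OrderRef:  # drop fields
    order_id: str

order = Order("o1", 10.0)
enriched = transform_into(order, EnrichedOrder, country="SE")
ref = transform_into(order, OrderRef)
```

Calling `transform_into(order, EnrichedOrder)` without `country`, or with `country=46`, raises `TypeError`, mirroring the two error cases above.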
Chimney in real code
Stretching the type system
● Fail: mix-up of kW and kWh
● Could be a compile-time error. Should be.
● Physical dimension libraries
○ Boost.Units - C++
○ Coulomb - Scala
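A minimal Python sketch of keeping power and energy apart with distinct types. `Kilowatts` and `KilowattHours` are hypothetical classes; unlike the libraries listed above, this catches the mix-up at runtime, not at compile time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KilowattHours:
    """Energy."""
    value: float

    def __add__(self, other):
        if not isinstance(other, KilowattHours):
            return NotImplemented  # refuse to add anything that is not energy
        return KilowattHours(self.value + other.value)

@dataclass(frozen=True)
class Kilowatts:
    """Power."""
    value: float

    def __add__(self, other):
        if not isinstance(other, Kilowatts):
            return NotImplemented  # refuse to add anything that is not power
        return Kilowatts(self.value + other.value)

    def over_hours(self, hours):
        """Power sustained over a duration yields energy."""
        return KilowattHours(self.value * hours)

total_power = Kilowatts(2.0) + Kilowatts(3.0)
energy = Kilowatts(2.0).over_hours(3.0)

try:
    Kilowatts(2.0) + KilowattHours(3.0)  # the kW / kWh mix-up
    mixup_rejected = False
except TypeError:
    mixup_rejected = True
```

With a static type system the same mistake would be rejected before the job ever runs, which is the point the slide argues for.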
Ingest prepared for deletion
[Diagram: ingest into a data lake with a landing pond, a private pond and a cold store. Labels: mutation; append + delete; immutable, limited retention]
Lost key pattern
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles transformed PII fields
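A toy Python sketch of the lost key pattern. The XOR keystream below is an assumption made to keep the example self-contained and is not secure; a real system would use a vetted cipher. All function names and the in-memory `key_table` are hypothetical.

```python
import hashlib
import secrets

key_table = {}  # per-user decryption key table: user id -> key

def _xor_with_keystream(key, data):
    """Toy stream cipher (NOT secure): XOR data with a SHA-256 based keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def encrypt_field(user_id, plaintext):
    """Encrypt one PII field with the user's key, creating the key if needed."""
    key = key_table.setdefault(user_id, secrets.token_bytes(32))
    return _xor_with_keystream(key, plaintext.encode("utf-8")).hex()

def decrypt_field(user_id, ciphertext_hex):
    """Extra join + decrypt: returns None once the user's key is gone."""
    key = key_table.get(user_id)
    if key is None:
        return None
    return _xor_with_keystream(key, bytes.fromhex(ciphertext_hex)).decode("utf-8")

def forget_user(user_id):
    """Clear a single user's key => oblivion for every field encrypted with it."""
    key_table.pop(user_id, None)

phone = encrypt_field("user-1", "08-123456")
name = encrypt_field("user-1", "Anna")
before = decrypt_field("user-1", phone)
forget_user("user-1")
after = (decrypt_field("user-1", phone), decrypt_field("user-1", name))
```

Deleting one row in the key table makes every encrypted field of that user unreadable, in every dataset, without rewriting the datasets themselves.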
Shieldformation
@PrivacyShielded
case class Sale(
  @PersonalId customerClubId: Option[String],
  @PersonalData storeId: Option[String],
  item: Option[String],
  timestamp: String
)
case class SaleShielded(
  shieldId: Option[String],
  customerClubIdEncrypted: Option[String],
  storeIdEncrypted: Option[String],
  item: Option[String],
  timestamp: String
)
case class SaleAnonymous(
  item: Option[String],
  timestamp: String
)
object SaleAnonymize extends SparkJob {
  ...
}
ShieldForm
object SaleExpose extends SparkJob {
  ...
}
object SaleShield extends SparkJob {
  ...
}
case class Shield(
  shieldId: String,
  personId: Option[String],
  keyStr: Option[String],
  encounterDate: String
)
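How the shielded and anonymous record types follow from the annotations can be sketched with Python dataclass metadata. This is a runtime analogue under stated assumptions, not the ShieldForm generator itself; `shielded_class`, `anonymous_class` and the metadata keys are hypothetical.

```python
import dataclasses
from typing import Optional

# Hypothetical stand-ins for the @PersonalId / @PersonalData annotations.
PERSONAL_ID = {"privacy": "personal_id"}
PERSONAL_DATA = {"privacy": "personal_data"}

@dataclasses.dataclass
class Sale:
    customerClubId: Optional[str] = dataclasses.field(metadata=PERSONAL_ID)
    storeId: Optional[str] = dataclasses.field(metadata=PERSONAL_DATA)
    item: Optional[str]
    timestamp: str

def shielded_class(cls):
    """Personal fields become <name>Encrypted; a shieldId field is prepended."""
    fields = [("shieldId", Optional[str])]
    for f in dataclasses.fields(cls):
        if f.metadata.get("privacy"):
            fields.append((f.name + "Encrypted", Optional[str]))
        else:
            fields.append((f.name, f.type))
    return dataclasses.make_dataclass(cls.__name__ + "Shielded", fields)

def anonymous_class(cls):
    """Personal fields are dropped entirely."""
    fields = [(f.name, f.type) for f in dataclasses.fields(cls)
              if not f.metadata.get("privacy")]
    return dataclasses.make_dataclass(cls.__name__ + "Anonymous", fields)

SaleShielded = shielded_class(Sale)
SaleAnonymous = anonymous_class(Sale)

shielded_names = [f.name for f in dataclasses.fields(SaleShielded)]
anonymous_names = [f.name for f in dataclasses.fields(SaleAnonymous)]
print(shielded_names, anonymous_names)
```

The derived field lists match the `SaleShielded` and `SaleAnonymous` case classes on the slide, so only the annotated source class has to be maintained by hand.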
Shieldformation & lost key
[Diagram: datasets Sale, Sale Shielded, Shield, Sale Anonymous, Sale Stats, Customer History and deletion requests; jobs SaleShield, SaleExpose and SaleAnonymize. Labels: exposed egress; limited retention]
Schema on write!
[Diagram: services with their DBs, an export, and business intelligence. Annotations: "Change agility important here", "Production stability important here"]
Data factory track record
Industry           Time to     Staff    1st flow        1st flow cost   Time to      Flows 1y
                   first flow  size     effort, weeks   (w * 50K ?)     innovation   after first
Media              1+ years    10-30    1500?           100M (0.5-1B)   1+ year      ?
Finance            2 years     10-50    2000?           100M?           Years        10?
Media              3 weeks     4.5 - 8  15              750K            3 months     30
Retail             7 weeks     1-3      7               500K *          6 months     70
Telecom            12 weeks    2-5      30              1500K           6 months     50
Consumer products  20+ weeks   1.5      30+             1200+K          6+ months    20
Construction       8 weeks     0.5      4               150K *          7 months     10
Manufacturing      8 weeks     0.5      4               200K *          6 months     ?

 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
huseindihon
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
NABLAS株式会社
 

Recently uploaded (20)

🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
 
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
 
ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
Data Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learnersData Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learners
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
 

Schema on read is obsolete. Welcome metaprogramming..pdf

  • 1. www.scling.com Schema on read is obsolete. Welcome metaprogramming. Data Innovation Summit, 2024-04-24 Lars Albertsson Scling 1
  • 2. IT craft to factory
    Diagram labels: Security, Waterfall, Application delivery, Traditional operations, Traditional QA, Infrastructure, DevSecOps, Agile, Containers, DevOps, CI/CD, Infrastructure as code
  • 4. Craft vs industry
    Craft:
    ● Each step steered by human
      ○ Or primitive automation
    ● Improving artifacts
    ● Craft is primary competence
    ● Components made for humans
      ○ Look nice, "easy to use"
      ○ More popular
    Industry:
    ● Autonomous processes
    ● Improving process that creates artifacts
    ● Multitude of competences
    ● Some components unusable by humans
      ○ Hard, greasy
      ○ Made for integration
      ○ Less popular
  • 5. Data engineering in the future
    Diagram labels: DW; ~10 year capability gap; "data factory engineering"; Enterprise big data failures; "Modern data stack" - traditional workflows, new technology; 4GL / UML phase of data engineering; Data engineering education
  • 6. Efficiency gap, data cost & value
    ● Data processing produces datasets
      ○ Each dataset has business value
    ● Proxy value/cost metric: datasets / day
      ○ S-M traditional: < 10
      ○ Bank, telecom, media: 100-1000
    Data points shown:
    ● 2014: 6500 datasets / day
    ● 2016: 20000 datasets / day
    ● 2018: 100000+ datasets / day, 25% of staff use BigQuery
    ● 2021: 500B events collected / day
    ● 2016: 1600 000 000 datasets / day
    Chart (effort vs value): Financial, reporting; Insights, data-fed features; Disruptive value of data, machine learning
  • 7. Data-factory-as-a-service
    ● Data factory
      ○ Collected, raw data → processed, valuable data
    ● Data pipelines customised for client
      ○ Analytics (BI, reports, A/B testing)
      ○ Data-fed features (autocomplete, search)
      ○ Learning systems (recommendations, fraud)
    ● Compete with data leaders:
      ○ Quick idea-to-production
      ○ Operational efficiency
    Diagram label: Data lake
  • 8. Data agility
    ● Siloed: 6+ months
    ● Autonomous: 1 month
    ● Coordinated: days
    Diagram labels: Cultural work, Technical work, Data lake, ∆, ∆, Latency?
  • 9. What is a schema?
    ● Lowest common denominator = name, type, required
      ○ Types: string, long, double, binary, array, map, union, record
    ● Schema specification may support additional constraints, e.g. integer range, other collections
    Fields have: Name, Type, Required?
    Example table:
      Id | Name   | Age | Phone
      1  | "Anna" | 34  | null
      2  | "Bob"  | 42  | "08-123456"
    In RDBMS, relations are explicit.
    In lake/stream datasets, relations are implicit.
  • 10. Schema definitions
    ● RDBMS: Table metadata
    ● Avro format: JSON/DSL definition
      ○ Definition is bundled with avro data files
      ○ Reused by Parquet format
    ● pyschema / dataclass
    ● Scala case classes
    ● JSON-schema
    ● JSON: Each record
      ○ One record insufficient to deduce schema
    Avro definition:
      {
        "type" : "record",
        "namespace" : "com.mapflat.example",
        "name" : "User",
        "fields" : [
          { "name" : "id" , "type" : "int" },
          { "name" : "name" , "type" : "string" },
          { "name" : "age" , "type" : "int" },
          { "name" : "phone" , "type" : ["null", "string"], "default": null }
        ]
      }
    JSON records:
      { "id": 1, "name": "Alice", "age": "34" }
      { "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }
    Scala case class:
      case class User(id: String, name: String, age: Int, phone: Option[String] = None)
      val users = Seq(
        User("1", "Alice", 32),
        User("2", "Bob", 43, Some("08-123456")))
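The point that one record is insufficient to deduce a schema can be shown with the two JSON records from the slide. This is a minimal Python sketch; `infer_fields` is a hypothetical helper, not part of any library mentioned in the talk.

```python
import json

def infer_fields(record: dict) -> dict:
    """Naive per-record inference: field name -> Python type name."""
    return {name: type(value).__name__ for name, value in record.items()}

# The two records shown on the slide.
records = [
    json.loads('{ "id": 1, "name": "Alice", "age": "34" }'),
    json.loads('{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }'),
]

first = infer_fields(records[0])
second = infer_fields(records[1])

# Fields present in only some records must be optional,
# which no single record can reveal on its own.
optional = set(first) ^ set(second)

print(first)     # schema guessed from record 1 only: no "phone" field at all
print(optional)  # fields that only show up when more records are inspected
```

Inference from the first record alone would also conclude that `age` is a string, which conflicts with the `int` in the Avro definition; only an explicit definition settles that.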
  • 11. Schema on write
    ● Schema defined by writer
    ● Destination (table / dataset / stream topic) has defined schema
      ○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
      ○ By convention
    ● Writes not in compliance are not accepted
      ○ Technically aborted (e.g. RDBMS)
      ○ In violation of intent (e.g. HDFS datasets)
    ● Can be technically enforced by producer driver
      ○ Through ORM / code generation
      ○ Schema registry lookup
    Strict checking philosophy
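A rough Python sketch of the strict philosophy, using the name / type / required triple as the schema. `Field`, `Dataset`, `SchemaViolation` and `validate` are hypothetical names invented for illustration; a real system would rely on an RDBMS, a schema registry, or generated codecs instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    type: type
    required: bool = True

# Mirrors the User schema used elsewhere in the deck.
USER_SCHEMA = [
    Field("id", int),
    Field("name", str),
    Field("age", int),
    Field("phone", str, required=False),
]

class SchemaViolation(Exception):
    pass

def validate(record: dict, schema: list) -> None:
    """Raise SchemaViolation unless the record complies with the schema."""
    unknown = set(record) - {f.name for f in schema}
    if unknown:
        raise SchemaViolation(f"unknown fields: {sorted(unknown)}")
    for f in schema:
        value = record.get(f.name)
        if value is None:
            if f.required:
                raise SchemaViolation(f"missing required field: {f.name}")
        elif not isinstance(value, f.type):
            raise SchemaViolation(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}")

class Dataset:
    """Destination with a defined schema; non-compliant writes are aborted."""
    def __init__(self, schema: list):
        self.schema = schema
        self.rows = []

    def write(self, record: dict) -> None:
        validate(record, self.schema)  # checked at write time, by the producer
        self.rows.append(record)

users = Dataset(USER_SCHEMA)
users.write({"id": 1, "name": "Alice", "age": 34})
```

Writing `{"id": 2, "name": "Bob", "age": "42"}` to `users` raises `SchemaViolation`, so the producer learns about the problem immediately.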
  • 12. Schema on read
    ● Anything (technically) accepted when writing
    ● Schema defined by reader, at consumption
      ○ Reader may impose requirements on type & value
    ● In dynamic languages, fields propagate implicitly
      ○ E-shopping example:
        i. Join order + customer.
        ii. Add device_type to order schema
        iii. device_type becomes available in downstream datasets
    ● Violations of constraints are detected at read
      ○ Perhaps long after production?
      ○ By team not owning producer?
    Loose checking philosophy
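The e-shopping example can be sketched with plain Python dictionaries. The record contents and the helper names `join` and `read_device_type` are assumptions made up for this illustration.

```python
def join(order: dict, customer: dict) -> dict:
    """Dynamic join: every field of both inputs flows downstream implicitly."""
    return {**customer, **order}

def read_device_type(record: dict) -> str:
    """Reader-side schema: the consumer decides what it requires, at read time."""
    value = record.get("device_type")
    if not isinstance(value, str):
        raise ValueError(f"device_type must be a string, got {value!r}")
    return value

customer = {"customer_id": "c1", "name": "Alice"}
old_order = {"order_id": "o1", "customer_id": "c1"}
# The producer adds device_type; the join code does not change.
new_order = {"order_id": "o2", "customer_id": "c1", "device_type": "mobile"}
# Nothing stops a producer from writing a malformed value.
bad_order = {"order_id": "o3", "customer_id": "c1", "device_type": 7}

joined = [join(o, customer) for o in (old_order, new_order, bad_order)]

print(read_device_type(joined[1]))
```

The new field appears downstream without any change to the join, which is the agility benefit. The malformed `bad_order` is accepted on write and only fails when some consumer calls `read_device_type`, possibly much later and in another team.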
  • 13. Dynamic vs static typing
    Grid axes: Schema on write / Schema on read; Static typing / Dynamic typing
    Cell labels: Strict, Loose, Possible
    Static typing:
      Java:  user.setName("Alice"); user2.getName();
      Scala: user = User(name = "Alice", ...); user2.name
    Dynamic typing:
      Java:   user.set("name", "Alice"); user2.get("name");
      Python: user.name = "Alice"; user2.name
  • 14. Schema on read or write?
    Diagram labels: DB, DB, DB, Service, Service, Export, Business intelligence
    Annotations: "Change agility important here"; "Production stability important here"
  • 15. Schema definition choice
    ● RDBMS: Table metadata
    ● Avro: JSON/DSL definition
      ○ Definition is bundled with avro data files
    ● Parquet
    ● pyschema / dataclass
    ● Scala case classes
    ● JSON-schema
    ● JSON: Each record
      ○ One record insufficient to deduce schema
    Highlighted properties:
    ● Expressive
    ● Custom types
    ● IDE support
    ● Avro for data lake storage
    Scala case class:
      case class User(id: String, name: String, age: Int, phone: Option[String] = None)
      val users = Seq(
        User("1", "Alice", 32),
        User("2", "Bob", 43, Some("08-123456")))
  • 16. Schema offspring
    Diagram labels: case classes; test equality type classes; test record difference render type classes; Avro definitions; Java Avro codec classes; Java <-> Scala converters; Avro type annotations; MySQL schemas; CSV codecs; Privacy by design machinery; Python; Logical types
  • 17. Avro codecs
    Diagram labels: case classes; Avro definitions; Java Avro codec classes; Java <-> Scala converters
    Scala case class:
      case class User(age: Int, phone: Option[String] = None)
    Avro definition (abbreviated on the slide):
      { "name": "JavaUser",
        { "name": "age", "type": "int" }
        { "name": "phone", "type": [ "null", "string" ] }
      }
    Java Avro codec class:
      public class JavaUser implements SpecificRecord {
        public Integer getAge() { ... }
        public String getPhone() { ... }
      }
    Java <-> Scala converter:
      object UserConverter extends AvroConverter[User] {
        def fromSpecific(u: JavaUser): User
        def toSpecific(u: User): JavaUser
      }
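The idea of deriving an Avro definition from a class can be sketched in Python with dataclass reflection. This is an illustrative analogue, not the Scalameta-based generator the talk describes; `avro_type`, `avro_schema` and the primitive mapping are assumptions, and only primitives and `Optional` are handled.

```python
import dataclasses
import json
import typing
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from Python types to Avro primitive type names.
AVRO_PRIMITIVES = {int: "int", str: "string", float: "double",
                   bool: "boolean", bytes: "bytes"}

def avro_type(tp):
    """Translate a Python type hint to an Avro type (primitives and Optional only)."""
    if typing.get_origin(tp) is typing.Union:
        args = typing.get_args(tp)
        non_null = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(non_null) == 1:
            return ["null", avro_type(non_null[0])]
        raise TypeError(f"unsupported union: {tp}")
    return AVRO_PRIMITIVES[tp]

def avro_schema(cls, namespace: str) -> dict:
    """Generate an Avro record definition from a dataclass."""
    hints = typing.get_type_hints(cls)
    fields = []
    for f in dataclasses.fields(cls):
        entry = {"name": f.name, "type": avro_type(hints[f.name])}
        if f.default is None:  # an explicit None default becomes an Avro null default
            entry["default"] = None
        fields.append(entry)
    return {"type": "record", "namespace": namespace,
            "name": cls.__name__, "fields": fields}

@dataclass
class User:
    id: int
    name: str
    age: int
    phone: Optional[str] = None

schema = avro_schema(User, "com.mapflat.example")
print(json.dumps(schema, indent=2))
```

With the class as the single source of truth, adding a field to `User` changes the generated definition automatically, which is the "schema offspring" idea in miniature.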
  • 18. Scalameta
    ● Parsing and analysis of scala source code
    Source:
      val a = b() + 3
    After lex:
      ["val", " ", "a", " ", "=", " ", "b", "(", ")", " ", "+", " ", "3"]
    After parse:
      [val, "a", =, Call("b"), +, Int(3)]
    After semantic analysis:
      [val, Int(a), =, Call(com.scling.func.b), +, Int(3)]
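The lex and parse stages have a close analogue in Python's standard library, shown here as a sketch on the equivalent statement `a = b() + 3`. It uses `tokenize` and `ast` rather than Scalameta, and Python's standard library has no counterpart to the semantic analysis stage.

```python
import ast
import io
import tokenize

source = "a = b() + 3"

# Lex: source text -> flat token strings (empty NEWLINE/ENDMARKER tokens dropped).
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]

# Parse: source text -> syntax tree.
tree = ast.parse(source)
assign = tree.body[0]

print(tokens)
print(ast.dump(assign))
```

The tree exposes the same structure as the slide's parse output: an assignment whose value is a binary `+` with a call to `b` on the left and the constant 3 on the right.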
  • 19. Scalameta use cases
    ● Scalafmt
    ● Scalafix
      ○ Static analysis
      ○ Code transformation
    ● Online code generation - macros
    ● Offline code generation
    Example:
      // Example from scio 0.7 -> 0.8 upgrade rules
      final class FixTensorflow extends SemanticRule("FixTensorflow") {
        override def fix(implicit doc: SemanticDocument): Patch =
          doc.tree.collect {
            case t @ Term.Select(s, Term.Name("saveAsTfExampleFile")) =>
              Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax)
          }.asPatch
      }
  • 20. Schema & syntax tree
    Source:
      @PrivacyShielded
      case class SaleTransaction(
        @PersonalId customerClubId: Option[String],
        @PersonalData storeId: Option[String],
        item: Option[String],
        timestamp: String
      )
    Syntax tree:
      Defn.Class(
        List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case),
        Type.Name("SaleTransaction"),
        List(),
        Ctor.Primary(
          List(),
          ,
          List(
            List(
              Term.Param(
                List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))),
                Term.Name("customerClubId"),
                Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
                None
              ),
              Term.Param(
                List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))),
                Term.Name("storeId"),
                Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
                None
              ),
              Term.Param(
                List(),
                Term.Name("item"),
                Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
                None
              ),
              Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None)
            )
          )
        ),
        Template(List(), List(), Self(, None), List()))
  • 21. Quasiquotes
      val stat: Stat = "val a = b() + 3".parse[Stat].get
      val stat: Stat = q"val a = b() + 3"
  • 22. Quasiquotes in practice
      q"""
        object $converterName extends AvroConverter[${srcClass.clazz.name}] {
          import RecordFieldConverters._
          type S = $jClassName
          def schema: Schema = $javaClassTerm.getClassSchema()
          def tag: ClassTag[S] = implicitly[ClassTag[S]]
          def datumReader: SpecificDatumReader[S] =
            new SpecificDatumReader[$jClassName](classOf[$jClassName])
          def datumWriter: SpecificDatumWriter[S] =
            new SpecificDatumWriter[$jClassName](classOf[$jClassName])
          def fromSpecific(record: $jClassName): ${srcClass.clazz.name} =
            ${Term.Name(srcClass.clazz.name.value)}(..$fromInits)
          def toSpecific(record: ${srcClass.clazz.name}): $jClassName =
            new $jClassName(..$specificArgs)
        }
      """
  • 23. Test equality
    Diagram labels: case classes; test equality type classes; test record difference render type classes
      trait REquality[T] {
        def equal(left: T, right: T): Boolean
      }

      object REquality {
        implicit val double: REquality[Double] = new REquality[Double] {
          def equal(left: Double, right: Double): Boolean = {
            // Use a combination of absolute and relative tolerance
            left === right +- 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5)
          }
        }

        /** binds the Magnolia macro to the `gen` method */
        implicit def gen[T]: REquality[T] = macro Magnolia.gen[T]
      }

      object Equalities {
        implicit val equalityUser: REquality[User] = REquality.gen[User]
      }
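The same tolerance rule can be sketched in Python. Where the slide derives type class instances at compile time with Magnolia, this sketch swaps in runtime reflection over dataclass fields; `approx_equal` and `Measurement` are hypothetical names.

```python
import dataclasses
from dataclasses import dataclass

def approx_equal(left, right, tol: float = 1e-5) -> bool:
    """Structural equality where floats use combined absolute and relative tolerance."""
    if isinstance(left, float) and isinstance(right, float):
        bound = max(tol, abs(left) * tol, abs(right) * tol)
        return abs(left - right) <= bound
    if dataclasses.is_dataclass(left) and type(left) is type(right):
        # Recurse field by field, as a derived type class instance would.
        return all(approx_equal(getattr(left, f.name), getattr(right, f.name), tol)
                   for f in dataclasses.fields(left))
    if isinstance(left, (list, tuple)) and type(left) is type(right):
        return len(left) == len(right) and all(
            approx_equal(a, b, tol) for a, b in zip(left, right))
    return left == right

@dataclass
class Measurement:
    sensor: str
    values: list

a = Measurement("s1", [0.1 + 0.2, 1e9])
b = Measurement("s1", [0.3, 1e9 + 1.0])
print(a == b, approx_equal(a, b))
```

Plain `==` fails on floating point rounding, whereas `approx_equal` accepts both the tiny absolute difference near 0.3 and the small relative difference near 1e9.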
  • 24. Python + RDBMS
    Diagram labels: case classes; Avro definitions; Avro type annotations; MySQL schemas; Python
    Scala case class:
      case class User(
        age: Int,
        @AvroProp("sqlType", "varchar(1012)") phone: Option[String] = None)
    Avro definition (abbreviated on the slide):
      { "name": "User",
        { "name": "age", "type": "int" }
        { "name": "phone", "type": [ "null", "string" ], "sqlType": "varchar(1012)" }
      }
    Python egress job:
      class UserEgressJob(CopyToTable):
          columns = [
              ("age", "int"),
              ("phone", "varchar(1012)"),
          ]
          ...
  • 25. Logical types
    Diagram labels: case classes; Logical types
      case t"""Instant""" =>
        JObject(List(JField("type", JString("long")),
                     JField("logicalType", JString("timestamp-micros"))))
      case t"""LocalDate""" =>
        JObject(List(JField("type", JString("int")),
                     JField("logicalType", JString("date"))))
      case t"""YearMonth""" =>
        JObject(List(JField("type", JString("int"))))
      case t"""JObject""" =>
        JString("string")
    ● Avro logical types
      ○ E.g. date → int, timestamp → long
      ○ Default is timestamp-millis
        ■ Great for year > 294441 (!)
    ● Custom logical types
      ○ Time
      ○ Collections
      ○ Physical
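The year in the remark about timestamp-millis can be reproduced with a little arithmetic, assuming timestamps are signed 64-bit counts since 1970 and years are 365 days long (an assumption that happens to match the slide's figure). `last_year` is a hypothetical helper.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day years
MAX_LONG = 2**63 - 1                    # largest signed 64-bit value

def last_year(units_per_second: int) -> int:
    """Last calendar year representable when counting the given unit from 1970."""
    return 1970 + MAX_LONG // units_per_second // SECONDS_PER_YEAR

micros_limit = last_year(10**6)  # timestamp-micros
millis_limit = last_year(10**3)  # timestamp-millis

print(micros_limit)  # 294441
print(millis_limit)
```

Microsecond precision already reaches year 294441, so millisecond precision only gains range for dates beyond that, which reads as the point of the "(!)" on the slide.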
  • 26. Schema on read or write?
    Diagram labels: DB, DB, DB, Service, Service, Export, Business intelligence
    Annotations: "Change agility important here"; "Production stability important here"
  • 28. Chimney - case class transformer
    ● Commonality in schema classes
      ○ Copy + a few more fields
      ○ Drop fields
    ● Statically typed
      ○ Forgot a field - error
      ○ Wrong type - error
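A rough Python analogue of the "copy + a few more fields" and "drop fields" cases, using dataclasses. Chimney performs these checks at compile time in Scala; this sketch can only fail at run time, and `transform`, `Order`, `OrderWithDevice` and `OrderSummary` are hypothetical names.

```python
import dataclasses
from dataclasses import dataclass

def transform(source, target_cls, **extra):
    """Copy same-named fields from source into target_cls; extra supplies new fields."""
    values = {f.name: getattr(source, f.name) for f in dataclasses.fields(source)}
    values.update(extra)
    target_fields = dataclasses.fields(target_cls)
    required = {f.name for f in target_fields
                if f.default is dataclasses.MISSING
                and f.default_factory is dataclasses.MISSING}
    missing = required - set(values)
    if missing:  # "Forgot a field - error"
        raise TypeError(f"no value for fields: {sorted(missing)}")
    names = {f.name for f in target_fields}
    # Source fields absent from the target are dropped.
    return target_cls(**{k: v for k, v in values.items() if k in names})

@dataclass
class Order:
    order_id: str
    customer_id: str
    amount: float

@dataclass
class OrderWithDevice:  # copy + one more field
    order_id: str
    customer_id: str
    amount: float
    device_type: str

@dataclass
class OrderSummary:  # drop fields
    order_id: str
    amount: float

order = Order("o1", "c1", 12.5)
enriched = transform(order, OrderWithDevice, device_type="mobile")
summary = transform(order, OrderSummary)
print(enriched, summary)
```

Calling `transform(order, OrderWithDevice)` without `device_type` raises `TypeError`, the run-time counterpart of the compile error mentioned on the slide.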
  • 30. Stretching the type system
    ● Fail: mixup kW and kWh
    ● Could be a compile-time error. Should be.
    ● Physical dimension libraries
      ○ Boost.Units - C++
      ○ Coulomb - Scala
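The kW / kWh mixup can be made an error by giving each physical dimension its own type. This hand-rolled Python sketch only fails at run time, unlike the compile-time checking of Boost.Units or Coulomb; the classes `Hours`, `Kilowatts` and `KilowattHours` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hours:
    value: float

@dataclass(frozen=True)
class KilowattHours:  # energy
    value: float

    def __add__(self, other):
        if not isinstance(other, KilowattHours):
            return NotImplemented  # leads to TypeError for mismatched dimensions
        return KilowattHours(self.value + other.value)

@dataclass(frozen=True)
class Kilowatts:  # power
    value: float

    def __add__(self, other):
        if not isinstance(other, Kilowatts):
            return NotImplemented
        return Kilowatts(self.value + other.value)

    def __mul__(self, other):
        # power * time = energy
        if not isinstance(other, Hours):
            return NotImplemented
        return KilowattHours(self.value * other.value)

energy = Kilowatts(2.0) * Hours(3.0) + KilowattHours(1.5)
print(energy)
```

An expression such as `Kilowatts(2.0) + KilowattHours(1.5)` raises `TypeError` instead of silently producing a meaningless number.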
  • 31. Ingest prepared for deletion
    Diagram labels: Data lake; Private pond; Cold store; Landing pond; Mutation; Append + delete; Immutable, limited retention
  • 32. Lost key pattern
    ● PII fields encrypted
    ● Per-user decryption key table
    ● Clear single user key => oblivion
    - Extra join + decrypt
    - Decryption (user) id needed
    + Multi-field oblivion
    + Single dataset leak → no PII leak
    + Handles transformed PII fields
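The mechanics of the pattern can be sketched in Python. The cipher below is a toy XOR keystream built from `hashlib`, chosen only to keep the example dependency-free; it is not secure and a real pipeline would use an authenticated cipher from a vetted library. The record layout and the names `key_table`, `shield`, `expose` and `forget` are assumptions.

```python
import hashlib
import secrets

def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode. NOT secure, illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def toy_encrypt(key: bytes, plaintext: str) -> bytes:
    data = plaintext.encode("utf-8")
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))

def toy_decrypt(key: bytes, ciphertext: bytes) -> str:
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream)).decode("utf-8")

# Per-user decryption key table, kept separate from the datasets.
key_table = {"u1": secrets.token_bytes(32), "u2": secrets.token_bytes(32)}

def shield(record: dict) -> dict:
    """Encrypt the PII field with the user's key; non-PII fields stay readable."""
    key = key_table[record["user_id"]]
    return {"user_id": record["user_id"],
            "email_encrypted": toy_encrypt(key, record["email"]),
            "item": record["item"]}

def expose(shielded: dict):
    """Extra join with the key table + decrypt; None if the key has been cleared."""
    key = key_table.get(shielded["user_id"])
    if key is None:
        return None
    return toy_decrypt(key, shielded["email_encrypted"])

def forget(user_id: str) -> None:
    """Clearing a single key makes all of that user's encrypted fields unreadable."""
    key_table.pop(user_id, None)

dataset = [shield({"user_id": "u1", "email": "anna@example.com", "item": "book"}),
           shield({"user_id": "u2", "email": "bob@example.com", "item": "pen"})]

before = [expose(r) for r in dataset]
forget("u1")
after = [expose(r) for r in dataset]
print(before, after)
```

The stored dataset is never rewritten: deleting one row in the key table is enough, while the non-PII `item` field remains usable for statistics.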
  • 33. Shieldformation
    Input:
      @PrivacyShielded
      case class Sale(
        @PersonalId customerClubId: Option[String],
        @PersonalData storeId: Option[String],
        item: Option[String],
        timestamp: String
      )
    Generated by ShieldForm:
      case class SaleShielded(
        shieldId: Option[String],
        customerClubIdEncrypted: Option[String],
        storeIdEncrypted: Option[String],
        item: Option[String],
        timestamp: String
      )

      case class SaleAnonymous(
        item: Option[String],
        timestamp: String
      )

      case class Shield(
        shieldId: String,
        personId: Option[String],
        keyStr: Option[String],
        encounterDate: String
      )

      object SaleShield extends SparkJob { ... }
      object SaleExpose extends SparkJob { ... }
      object SaleAnonymize extends SparkJob { ... }
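Deriving the shielded and anonymous record types from annotations can be sketched with Python dataclass field metadata and `make_dataclass`. This is a runtime analogue for illustration, not the Scalameta-based generator from the talk; the metadata keys and the helpers `derive_shielded` and `derive_anonymous` are assumptions.

```python
import dataclasses
from dataclasses import dataclass, field, make_dataclass
from typing import Optional

# Stand-ins for the @PersonalId / @PersonalData annotations.
PERSONAL_ID = {"privacy": "personal_id"}
PERSONAL_DATA = {"privacy": "personal_data"}

@dataclass
class Sale:
    customerClubId: Optional[str] = field(metadata=PERSONAL_ID)
    storeId: Optional[str] = field(metadata=PERSONAL_DATA)
    item: Optional[str]
    timestamp: str

def derive_shielded(cls):
    """New class: shieldId first, personal fields replaced by <name>Encrypted."""
    specs = [("shieldId", Optional[str])]
    for f in dataclasses.fields(cls):
        if f.metadata.get("privacy"):
            specs.append((f.name + "Encrypted", Optional[str]))
        else:
            specs.append((f.name, f.type))
    return make_dataclass(cls.__name__ + "Shielded", specs)

def derive_anonymous(cls):
    """New class containing only the fields without a privacy annotation."""
    specs = [(f.name, f.type) for f in dataclasses.fields(cls)
             if not f.metadata.get("privacy")]
    return make_dataclass(cls.__name__ + "Anonymous", specs)

SaleShielded = derive_shielded(Sale)
SaleAnonymous = derive_anonymous(Sale)

shielded_names = [f.name for f in dataclasses.fields(SaleShielded)]
anonymous_names = [f.name for f in dataclasses.fields(SaleAnonymous)]
print(shielded_names, anonymous_names)
```

The derived field lists match the `SaleShielded` and `SaleAnonymous` classes on the slide, so adding an annotated field to `Sale` would flow into both without hand-written changes.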
  • 34. Shieldformation & lost key
    Diagram labels: Sale; SaleShield; Sale Shielded; Shield; Deletion requests; Customer History; SaleExpose; Exposed egress; Limited retention; SaleAnonymize; Sale Anonymous; Sale Stats
  • 36. Data factory track record
                        | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost (w * 50K ?) | Time to innovation | Flows 1y after first
      Media             | 1+ years           | 10-30      | 1500?                  | 100M (0.5-1B)             | 1+ year            | ?
      Finance           | 2 years            | 10-50      | 2000?                  | 100M?                     | Years              | 10?
      Media             | 3 weeks            | 4.5 - 8    | 15                     | 750K                      | 3 months           | 30
      Retail            | 7 weeks            | 1-3        | 7                      | 500K *                    | 6 months           | 70
      Telecom           | 12 weeks           | 2-5        | 30                     | 1500K                     | 6 months           | 50
      Consumer products | 20+ weeks          | 1.5        | 30+                    | 1200+K                    | 6+ months          | 20
      Construction      | 8 weeks            | 0.5        | 4                      | 150K *                    | 7 months           | 10
      Manufacturing     | 8 weeks            | 0.5        | 4                      | 200K *                    | 6 months           | ?