SlideShare a Scribd company logo
Extensible JSON validation
at scale with Cerberus and PySpark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
Cerberus
API
● from cerberus import Validator
○ def __init(..., schema, ignore_none_values, allow_unknown,
purge_unknown, error_handler)
○ def validate(self, document, schema=None, update=False,
normalize=True)
● from cerberus.errors import BaseErrorHandler
○ def __call__(self, errors)
● min/max
'id': {'type': 'integer', 'min': 1}
● RegEx
'email': {'type': 'string', 'regex': '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-
]+.[a-zA-Z0-9-.]+$'}
● empty, contains (all values)
'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty':
False, 'contains': ['item 1', 'item 2']}
● allowed, forbidden
{'role': {'type': 'string', 'allowed': ['agent', 'client',
'supplier']}}
{'role': {'forbidden': ['owner']}}
● required
'first_order': {'type': 'datetime', 'required': False}
Validation rules
class ExtendedValidator(Validator):
def _validate_productexists(self, lookup_table,
field, value):
if lookup_table == 'memory':
existing_items = ['item1', 'item2',
'item3', 'item4']
not_existing_items = list(filter(lambda
item_name: item_name not in existing_items,
value))
if not_existing_items:
self._error(field, "{} items don't
exist in the lookup table"
.format(not_existing_items))
Custom validation rule
extend Validator
prefix custom rules with
_validate_{rule_name}
methods
call def _error(self,
*args) to add errors
Validation process
{
'id': { 'type': 'integer', 'min': 1},
'first_order': {
'type': 'datetime', 'required':
False
},
Validator
#validate{"id": -3, "amount": 30.97, ...} True / False
#errors
Cerberus and
PySpark
PySpark integration - whole pipeline
dataframe_schema = StructType(
fields=[ # ...
StructField("source", StructType(
fields=[StructField("site", StringType(), True),
StructField("api_version", StringType(), True)] ), False)
])
def sum_errors_number(errors_count_1, errors_count_2):
merged_dict = {dict_key: errors_count_1.get(dict_key, 0) +
errors_count_2.get(dict_key, 0) for dict_key in
set(errors_count_1) | set(errors_count_2)}
return merged_dict
spark = SparkSession.builder.master("local[4]")
.appName("...").getOrCreate()
errors_distribution = spark.read
.json(input, schema=dataframe_schema, lineSep='n')
.rdd.mapPartitions(check_for_errors)
.reduceByKey(sum_errors_number).collectAsMap()
potential
data quality
issue
collect data to
the driver /!
PySpark integration - extended Cerberus
UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')
class ExtendedValidator(Validator):
def _validate_networkexists(self, allowed_values, field, value):
if value not in allowed_values:
self._error(field, UNKNOWN_NETWORK, {})
class ErrorCodesHandler(SchemaErrorHandler):
def __call__(self, validation_errors):
def concat_path(document_path):
return '.'.join(document_path)
output_errors = {}
for error in validation_errors:
if error.is_group_error:
for child_error in error.child_errors:
output_errors[concat_path(child_error.document_path)] =
child_error.code
else:
output_errors[concat_path(error.document_path)] = error.code
return output_errors
extended
Validator,
custom
validation
rule
not the
same as
previously
custom
output for
#errors
call
PySpark integration - .mapPartitions function
def check_for_errors(rows):
validator = ExtendedValidator(schema,
error_handler=ErrorCodesHandler())
def default_dictionary():
return defaultdict(int)
errors = defaultdict(default_dictionary)
for row in rows:
validation_result =
validator.validate(row.asDict(recursive=True),
normalize=False)
if not validation_result:
for error_field, error_code in
validator.errors.items():
errors[error_field][error_code] += 1
return [(k, dict(v)) for k, v in errors.items()]
disabled
normalization
Resources
● Cerberus: http://docs.python-cerberus.org/en/stable/
● Github Cerberus+PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk
● Github data generator: https://github.com/bartosz25/data-generator
● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark
Thank you !
@waitingforcode / waitingforcode.com

More Related Content

What's hot

How to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better PerformanceHow to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better Performance
oysteing
 
Top 65 SQL Interview Questions and Answers | Edureka
Top 65 SQL Interview Questions and Answers | EdurekaTop 65 SQL Interview Questions and Answers | Edureka
Top 65 SQL Interview Questions and Answers | Edureka
Edureka!
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
 
New Features for Multitenant in Oracle Database 21c
New Features for Multitenant in Oracle Database 21cNew Features for Multitenant in Oracle Database 21c
New Features for Multitenant in Oracle Database 21c
Markus Flechtner
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
Russell Jurney
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
Amazon Web Services
 
PLSQL Cursors
PLSQL CursorsPLSQL Cursors
PLSQL Cursors
spin_naresh
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
A Cloud Journey - Move to the Oracle Cloud
A Cloud Journey - Move to the Oracle CloudA Cloud Journey - Move to the Oracle Cloud
A Cloud Journey - Move to the Oracle Cloud
Markus Michalewicz
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
Zhenxiao Luo
 
Oracle DBA Competency Roadmap
Oracle DBA Competency RoadmapOracle DBA Competency Roadmap
Oracle DBA Competency Roadmap
Mahesh Vallampati
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
Mohammad Hossein Rimaz
 
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at ScaleSmart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
Databricks
 

What's hot (20)

How to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better PerformanceHow to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better Performance
 
Top 65 SQL Interview Questions and Answers | Edureka
Top 65 SQL Interview Questions and Answers | EdurekaTop 65 SQL Interview Questions and Answers | Edureka
Top 65 SQL Interview Questions and Answers | Edureka
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
New Features for Multitenant in Oracle Database 21c
New Features for Multitenant in Oracle Database 21cNew Features for Multitenant in Oracle Database 21c
New Features for Multitenant in Oracle Database 21c
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
 
PLSQL Cursors
PLSQL CursorsPLSQL Cursors
PLSQL Cursors
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
A Cloud Journey - Move to the Oracle Cloud
A Cloud Journey - Move to the Oracle CloudA Cloud Journey - Move to the Oracle Cloud
A Cloud Journey - Move to the Oracle Cloud
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad Query
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Oracle DBA Competency Roadmap
Oracle DBA Competency RoadmapOracle DBA Competency Roadmap
Oracle DBA Competency Roadmap
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at ScaleSmart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
 

Similar to Using Cerberus and PySpark to validate semi-structured datasets

AST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptAST - the only true tool for building JavaScript
AST - the only true tool for building JavaScript
Ingvar Stepanyan
 
Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0
Frost
 
Future of Web Apps: Google Gears
Future of Web Apps: Google GearsFuture of Web Apps: Google Gears
Future of Web Apps: Google Gears
dion
 
Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17
LogeekNightUkraine
 
用Tornado开发RESTful API运用
用Tornado开发RESTful API运用用Tornado开发RESTful API运用
用Tornado开发RESTful API运用
Felinx Lee
 
Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2
Shinya Ohyanagi
 
From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)
Night Sailer
 
The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019
Ludmila Nesvitiy
 
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverAltitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Fastly
 
Design Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron PattersonDesign Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron Patterson
ManageIQ
 
JS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless BebopJS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless Bebop
JSFestUA
 
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customization
Bartosz Konieczny
 
Expert JavaScript tricks of the masters
Expert JavaScript  tricks of the mastersExpert JavaScript  tricks of the masters
Expert JavaScript tricks of the masters
Ara Pehlivanian
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
Stéphane Wirtel
 
Python magicmethods
Python magicmethodsPython magicmethods
Python magicmethods
dreampuf
 
Postman On Steroids
Postman On SteroidsPostman On Steroids
Postman On Steroids
Sara Tornincasa
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveTools for Making Machine Learning more Reactive
Tools for Making Machine Learning more Reactive
Jeff Smith
 
DataMapper
DataMapperDataMapper
DataMapper
Yehuda Katz
 
Flashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas MacFlashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas Mac
ESET Latinoamérica
 
Cnam azure 2014 mobile services
Cnam azure 2014   mobile servicesCnam azure 2014   mobile services
Cnam azure 2014 mobile services
Aymeric Weinbach
 

Similar to Using Cerberus and PySpark to validate semi-structured datasets (20)

AST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptAST - the only true tool for building JavaScript
AST - the only true tool for building JavaScript
 
Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0
 
Future of Web Apps: Google Gears
Future of Web Apps: Google GearsFuture of Web Apps: Google Gears
Future of Web Apps: Google Gears
 
Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17
 
用Tornado开发RESTful API运用
用Tornado开发RESTful API运用用Tornado开发RESTful API运用
用Tornado开发RESTful API运用
 
Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2
 
From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)
 
The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019
 
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverAltitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
 
Design Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron PattersonDesign Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron Patterson
 
JS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless BebopJS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless Bebop
 
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customization
 
Expert JavaScript tricks of the masters
Expert JavaScript  tricks of the mastersExpert JavaScript  tricks of the masters
Expert JavaScript tricks of the masters
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
 
Python magicmethods
Python magicmethodsPython magicmethods
Python magicmethods
 
Postman On Steroids
Postman On SteroidsPostman On Steroids
Postman On Steroids
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveTools for Making Machine Learning more Reactive
Tools for Making Machine Learning more Reactive
 
DataMapper
DataMapperDataMapper
DataMapper
 
Flashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas MacFlashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas Mac
 
Cnam azure 2014 mobile services
Cnam azure 2014   mobile servicesCnam azure 2014   mobile services
Cnam azure 2014 mobile services
 

Recently uploaded

1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
MadhavJungKarki
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
MuhammadJazib15
 
5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf
AlvianRamadhani5
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
snaprevwdev
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
Gino153088
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
aryanpankaj78
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
mahaffeycheryld
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
harshapolam10
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
ijaia
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
AnasAhmadNoor
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...
um7474492
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
pvpriya2
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
ecqow
 

Recently uploaded (20)

1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
 
5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
 

Using Cerberus and PySpark to validate semi-structured datasets

  • 1. Extensible JSON validation at scale with Cerberus and PySpark Bartosz Konieczny @waitingforcode
  • 2. First things first Bartosz Konieczny #dataEngineer #ApacheSparkEnthusiast #AWSuser #waitingforcode.com #becomedataengineer.com #@waitingforcode #github.com/bartosz25 /data-generator /spark-scala-playground ...
  • 4. API ● from cerberus import Validator ○ def __init(..., schema, ignore_none_values, allow_unknown, purge_unknown, error_handler) ○ def validate(self, document, schema=None, update=False, normalize=True) ● from cerberus.errors import BaseErrorHandler ○ def __call__(self, errors)
  • 5. ● min/max 'id': {'type': 'integer', 'min': 1} ● RegEx 'email': {'type': 'string', 'regex': '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9- ]+.[a-zA-Z0-9-.]+$'} ● empty, contains (all values) 'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty': False, 'contains': ['item 1', 'item 2']} ● allowed, forbidden {'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']}} {'role': {'forbidden': ['owner']}} ● required 'first_order': {'type': 'datetime', 'required': False} Validation rules
  • 6. class ExtendedValidator(Validator): def _validate_productexists(self, lookup_table, field, value): if lookup_table == 'memory': existing_items = ['item1', 'item2', 'item3', 'item4'] not_existing_items = list(filter(lambda item_name: item_name not in existing_items, value)) if not_existing_items: self._error(field, "{} items don't exist in the lookup table" .format(not_existing_items)) Custom validation rule extend Validator prefix custom rules with _validate_{rule_name} methods call def _error(self, *args) to add errors
  • 7. Validation process { 'id': { 'type': 'integer', 'min': 1}, 'first_order': { 'type': 'datetime', 'required': False }, Validator #validate{"id": -3, "amount": 30.97, ...} True / False #errors
  • 9. PySpark integration - whole pipeline dataframe_schema = StructType( fields=[ # ... StructField("source", StructType( fields=[StructField("site", StringType(), True), StructField("api_version", StringType(), True)] ), False) ]) def sum_errors_number(errors_count_1, errors_count_2): merged_dict = {dict_key: errors_count_1.get(dict_key, 0) + errors_count_2.get(dict_key, 0) for dict_key in set(errors_count_1) | set(errors_count_2)} return merged_dict spark = SparkSession.builder.master("local[4]") .appName("...").getOrCreate() errors_distribution = spark.read .json(input, schema=dataframe_schema, lineSep='n') .rdd.mapPartitions(check_for_errors) .reduceByKey(sum_errors_number).collectAsMap() potential data quality issue collect data to the driver /!
  • 10. PySpark integration - extended Cerberus UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists') class ExtendedValidator(Validator): def _validate_networkexists(self, allowed_values, field, value): if value not in allowed_values: self._error(field, UNKNOWN_NETWORK, {}) class ErrorCodesHandler(SchemaErrorHandler): def __call__(self, validation_errors): def concat_path(document_path): return '.'.join(document_path) output_errors = {} for error in validation_errors: if error.is_group_error: for child_error in error.child_errors: output_errors[concat_path(child_error.document_path)] = child_error.code else: output_errors[concat_path(error.document_path)] = error.code return output_errors extended Validator, custom validation rule not the same as previously custom output for #errors call
  • 11. PySpark integration - .mapPartitions function def check_for_errors(rows): validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler()) def default_dictionary(): return defaultdict(int) errors = defaultdict(default_dictionary) for row in rows: validation_result = validator.validate(row.asDict(recursive=True), normalize=False) if not validation_result: for error_field, error_code in validator.errors.items(): errors[error_field][error_code] += 1 return [(k, dict(v)) for k, v in errors.items()] disabled normalization
  • 12. Resources ● Cerberus: http://docs.python-cerberus.org/en/stable/ ● Github Cerberus+PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk ● Github data generator: https://github.com/bartosz25/data-generator ● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark
  • 13. Thank you ! @waitingforcode / waitingforcode.com

Editor's Notes

  1. SchemaErrorHandler ⇒ child of BasicErrorHandler; BasicErrorHandler child of BaseErrorHandler
  2. define what is the type (datetime, int, string, ..) mapping ⇒ type, items, empty ⇒ validator types https://docs.python-cerberus.org/en/stable/validation-rules.html
  3. explain why normalize=False calls __normalize_mapping * you can specify a new attribute - normalization will rename it if needed >>> v = Validator({'foo': {'rename': 'bar'}}) >>> v.normalized({'foo': 0}) {'bar': 0} * later it purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified * applies the defaults also * can also call a coercion, a callable that can for instance convert the type of a field Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated check if validator.clear_caches can help
  4. explain why normalize=False calls __normalize_mapping * you can specify a new attribute - normalization will rename it if needed >>> v = Validator({'foo': {'rename': 'bar'}}) >>> v.normalized({'foo': 0}) {'bar': 0} * later it purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified * applies the defaults also * can also call a coercion, a callable that can for instance convert the type of a field Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated check if validator.clear_caches can help