SlideShare a Scribd company logo
1 of 13
Extensible JSON validation
at scale with Cerberus and PySpark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
Cerberus
API
● from cerberus import Validator
○ def __init(..., schema, ignore_none_values, allow_unknown,
purge_unknown, error_handler)
○ def validate(self, document, schema=None, update=False,
normalize=True)
● from cerberus.errors import BaseErrorHandler
○ def __call__(self, errors)
● min/max
'id': {'type': 'integer', 'min': 1}
● RegEx
'email': {'type': 'string', 'regex': '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-
]+.[a-zA-Z0-9-.]+$'}
● empty, contains (all values)
'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty':
False, 'contains': ['item 1', 'item 2']}
● allowed, forbidden
{'role': {'type': 'string', 'allowed': ['agent', 'client',
'supplier']}}
{'role': {'forbidden': ['owner']}}
● required
'first_order': {'type': 'datetime', 'required': False}
Validation rules
class ExtendedValidator(Validator):
def _validate_productexists(self, lookup_table,
field, value):
if lookup_table == 'memory':
existing_items = ['item1', 'item2',
'item3', 'item4']
not_existing_items = list(filter(lambda
item_name: item_name not in existing_items,
value))
if not_existing_items:
self._error(field, "{} items don't
exist in the lookup table"
.format(not_existing_items))
Custom validation rule
extend Validator
prefix custom rules with
_validate_{rule_name}
methods
call def _error(self,
*args) to add errors
Validation process
{
'id': { 'type': 'integer', 'min': 1},
'first_order': {
'type': 'datetime', 'required':
False
},
Validator
#validate{"id": -3, "amount": 30.97, ...} True / False
#errors
Cerberus and
PySpark
PySpark integration - whole pipeline
dataframe_schema = StructType(
fields=[ # ...
StructField("source", StructType(
fields=[StructField("site", StringType(), True),
StructField("api_version", StringType(), True)] ), False)
])
def sum_errors_number(errors_count_1, errors_count_2):
merged_dict = {dict_key: errors_count_1.get(dict_key, 0) +
errors_count_2.get(dict_key, 0) for dict_key in
set(errors_count_1) | set(errors_count_2)}
return merged_dict
spark = SparkSession.builder.master("local[4]")
.appName("...").getOrCreate()
errors_distribution = spark.read
.json(input, schema=dataframe_schema, lineSep='n')
.rdd.mapPartitions(check_for_errors)
.reduceByKey(sum_errors_number).collectAsMap()
potential
data quality
issue
collect data to
the driver /!
PySpark integration - extended Cerberus
UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')
class ExtendedValidator(Validator):
def _validate_networkexists(self, allowed_values, field, value):
if value not in allowed_values:
self._error(field, UNKNOWN_NETWORK, {})
class ErrorCodesHandler(SchemaErrorHandler):
def __call__(self, validation_errors):
def concat_path(document_path):
return '.'.join(document_path)
output_errors = {}
for error in validation_errors:
if error.is_group_error:
for child_error in error.child_errors:
output_errors[concat_path(child_error.document_path)] =
child_error.code
else:
output_errors[concat_path(error.document_path)] = error.code
return output_errors
extended
Validator,
custom
validation
rule
not the
same as
previously
custom
output for
#errors
call
PySpark integration - .mapPartitions function
def check_for_errors(rows):
validator = ExtendedValidator(schema,
error_handler=ErrorCodesHandler())
def default_dictionary():
return defaultdict(int)
errors = defaultdict(default_dictionary)
for row in rows:
validation_result =
validator.validate(row.asDict(recursive=True),
normalize=False)
if not validation_result:
for error_field, error_code in
validator.errors.items():
errors[error_field][error_code] += 1
return [(k, dict(v)) for k, v in errors.items()]
disabled
normalization
Resources
● Cerberus: http://docs.python-cerberus.org/en/stable/
● Github Cerberus+PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk
● Github data generator: https://github.com/bartosz25/data-generator
● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark
Thank you !
@waitingforcode / waitingforcode.com

More Related Content

What's hot

Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...Philip Schwarz
 
(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and ProtobufGuido Schmutz
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit
 
Keep your code clean
Keep your code cleanKeep your code clean
Keep your code cleanmacrochen
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLVillu Ruusmann
 
RSpec 讓你愛上寫測試
RSpec 讓你愛上寫測試RSpec 讓你愛上寫測試
RSpec 讓你愛上寫測試Wen-Tien Chang
 
Izumi 1.0: Your Next Scala Stack
Izumi 1.0: Your Next Scala StackIzumi 1.0: Your Next Scala Stack
Izumi 1.0: Your Next Scala Stack7mind
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Inside MongoDB: the Internals of an Open-Source Database
Inside MongoDB: the Internals of an Open-Source DatabaseInside MongoDB: the Internals of an Open-Source Database
Inside MongoDB: the Internals of an Open-Source DatabaseMike Dirolf
 
ZIO-Direct - Functional Scala 2022
ZIO-Direct - Functional Scala 2022ZIO-Direct - Functional Scala 2022
ZIO-Direct - Functional Scala 2022Alexander Ioffe
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
 
좌충우돌 ORM 개발기 | Devon 2012
좌충우돌 ORM 개발기 | Devon 2012좌충우돌 ORM 개발기 | Devon 2012
좌충우돌 ORM 개발기 | Devon 2012Daum DNA
 
An in Depth Journey into Odoo's ORM
An in Depth Journey into Odoo's ORMAn in Depth Journey into Odoo's ORM
An in Depth Journey into Odoo's ORMOdoo
 

What's hot (20)

Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...
 
Practical Celery
Practical CeleryPractical Celery
Practical Celery
 
(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf
 
Defensive Apex Programming
Defensive Apex ProgrammingDefensive Apex Programming
Defensive Apex Programming
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Keep your code clean
Keep your code cleanKeep your code clean
Keep your code clean
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
 
RSpec 讓你愛上寫測試
RSpec 讓你愛上寫測試RSpec 讓你愛上寫測試
RSpec 讓你愛上寫測試
 
Izumi 1.0: Your Next Scala Stack
Izumi 1.0: Your Next Scala StackIzumi 1.0: Your Next Scala Stack
Izumi 1.0: Your Next Scala Stack
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Inside MongoDB: the Internals of an Open-Source Database
Inside MongoDB: the Internals of an Open-Source DatabaseInside MongoDB: the Internals of an Open-Source Database
Inside MongoDB: the Internals of an Open-Source Database
 
ZIO-Direct - Functional Scala 2022
ZIO-Direct - Functional Scala 2022ZIO-Direct - Functional Scala 2022
ZIO-Direct - Functional Scala 2022
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Introduction to Celery
Introduction to CeleryIntroduction to Celery
Introduction to Celery
 
Clean coding-practices
Clean coding-practicesClean coding-practices
Clean coding-practices
 
좌충우돌 ORM 개발기 | Devon 2012
좌충우돌 ORM 개발기 | Devon 2012좌충우돌 ORM 개발기 | Devon 2012
좌충우돌 ORM 개발기 | Devon 2012
 
An in Depth Journey into Odoo's ORM
An in Depth Journey into Odoo's ORMAn in Depth Journey into Odoo's ORM
An in Depth Journey into Odoo's ORM
 

Similar to Using Cerberus and PySpark to validate semi-structured datasets

AST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptAST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptIngvar Stepanyan
 
Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0Frost
 
Future of Web Apps: Google Gears
Future of Web Apps: Google GearsFuture of Web Apps: Google Gears
Future of Web Apps: Google Gearsdion
 
Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17LogeekNightUkraine
 
用Tornado开发RESTful API运用
用Tornado开发RESTful API运用用Tornado开发RESTful API运用
用Tornado开发RESTful API运用Felinx Lee
 
Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2Shinya Ohyanagi
 
From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)Night Sailer
 
The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019Ludmila Nesvitiy
 
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverAltitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverFastly
 
Design Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron PattersonDesign Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron PattersonManageIQ
 
JS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless BebopJS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless BebopJSFestUA
 
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationBartosz Konieczny
 
Expert JavaScript tricks of the masters
Expert JavaScript  tricks of the mastersExpert JavaScript  tricks of the masters
Expert JavaScript tricks of the mastersAra Pehlivanian
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful weddingStéphane Wirtel
 
Python magicmethods
Python magicmethodsPython magicmethods
Python magicmethodsdreampuf
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveTools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveJeff Smith
 
Flashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas MacFlashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas MacESET Latinoamérica
 
Cnam azure 2014 mobile services
Cnam azure 2014   mobile servicesCnam azure 2014   mobile services
Cnam azure 2014 mobile servicesAymeric Weinbach
 

Similar to Using Cerberus and PySpark to validate semi-structured datasets (20)

AST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptAST - the only true tool for building JavaScript
AST - the only true tool for building JavaScript
 
Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0Building an api using golang and postgre sql v1.0
Building an api using golang and postgre sql v1.0
 
Future of Web Apps: Google Gears
Future of Web Apps: Google GearsFuture of Web Apps: Google Gears
Future of Web Apps: Google Gears
 
Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17Alexey Tsoy Meta Programming in C++ 16.11.17
Alexey Tsoy Meta Programming in C++ 16.11.17
 
用Tornado开发RESTful API运用
用Tornado开发RESTful API运用用Tornado开发RESTful API运用
用Tornado开发RESTful API运用
 
Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2Zend Framework Study@Tokyo #2
Zend Framework Study@Tokyo #2
 
From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)From mysql to MongoDB(MongoDB2011北京交流会)
From mysql to MongoDB(MongoDB2011北京交流会)
 
The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019The battle of Protractor and Cypress - RunIT Conference 2019
The battle of Protractor and Cypress - RunIT Conference 2019
 
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverAltitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
 
Design Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron PattersonDesign Summit - Rails 4 Migration - Aaron Patterson
Design Summit - Rails 4 Migration - Aaron Patterson
 
JS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless BebopJS Fest 2019. Anjana Vakil. Serverless Bebop
JS Fest 2019. Anjana Vakil. Serverless Bebop
 
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customization
 
Expert JavaScript tricks of the masters
Expert JavaScript  tricks of the mastersExpert JavaScript  tricks of the masters
Expert JavaScript tricks of the masters
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
 
Python magicmethods
Python magicmethodsPython magicmethods
Python magicmethods
 
Postman On Steroids
Postman On SteroidsPostman On Steroids
Postman On Steroids
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveTools for Making Machine Learning more Reactive
Tools for Making Machine Learning more Reactive
 
DataMapper
DataMapperDataMapper
DataMapper
 
Flashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas MacFlashback, el primer malware masivo de sistemas Mac
Flashback, el primer malware masivo de sistemas Mac
 
Cnam azure 2014 mobile services
Cnam azure 2014   mobile servicesCnam azure 2014   mobile services
Cnam azure 2014 mobile services
 

Recently uploaded

Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 

Recently uploaded (20)

Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 

Using Cerberus and PySpark to validate semi-structured datasets

  • 1. Extensible JSON validation at scale with Cerberus and PySpark Bartosz Konieczny @waitingforcode
  • 2. First things first Bartosz Konieczny #dataEngineer #ApacheSparkEnthusiast #AWSuser #waitingforcode.com #becomedataengineer.com #@waitingforcode #github.com/bartosz25 /data-generator /spark-scala-playground ...
  • 4. API ● from cerberus import Validator ○ def __init(..., schema, ignore_none_values, allow_unknown, purge_unknown, error_handler) ○ def validate(self, document, schema=None, update=False, normalize=True) ● from cerberus.errors import BaseErrorHandler ○ def __call__(self, errors)
  • 5. ● min/max 'id': {'type': 'integer', 'min': 1} ● RegEx 'email': {'type': 'string', 'regex': '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9- ]+.[a-zA-Z0-9-.]+$'} ● empty, contains (all values) 'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty': False, 'contains': ['item 1', 'item 2']} ● allowed, forbidden {'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']}} {'role': {'forbidden': ['owner']}} ● required 'first_order': {'type': 'datetime', 'required': False} Validation rules
  • 6. class ExtendedValidator(Validator): def _validate_productexists(self, lookup_table, field, value): if lookup_table == 'memory': existing_items = ['item1', 'item2', 'item3', 'item4'] not_existing_items = list(filter(lambda item_name: item_name not in existing_items, value)) if not_existing_items: self._error(field, "{} items don't exist in the lookup table" .format(not_existing_items)) Custom validation rule extend Validator prefix custom rules with _validate_{rule_name} methods call def _error(self, *args) to add errors
  • 7. Validation process { 'id': { 'type': 'integer', 'min': 1}, 'first_order': { 'type': 'datetime', 'required': False }, Validator #validate{"id": -3, "amount": 30.97, ...} True / False #errors
  • 9. PySpark integration - whole pipeline dataframe_schema = StructType( fields=[ # ... StructField("source", StructType( fields=[StructField("site", StringType(), True), StructField("api_version", StringType(), True)] ), False) ]) def sum_errors_number(errors_count_1, errors_count_2): merged_dict = {dict_key: errors_count_1.get(dict_key, 0) + errors_count_2.get(dict_key, 0) for dict_key in set(errors_count_1) | set(errors_count_2)} return merged_dict spark = SparkSession.builder.master("local[4]") .appName("...").getOrCreate() errors_distribution = spark.read .json(input, schema=dataframe_schema, lineSep='n') .rdd.mapPartitions(check_for_errors) .reduceByKey(sum_errors_number).collectAsMap() potential data quality issue collect data to the driver /!
  • 10. PySpark integration - extended Cerberus UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists') class ExtendedValidator(Validator): def _validate_networkexists(self, allowed_values, field, value): if value not in allowed_values: self._error(field, UNKNOWN_NETWORK, {}) class ErrorCodesHandler(SchemaErrorHandler): def __call__(self, validation_errors): def concat_path(document_path): return '.'.join(document_path) output_errors = {} for error in validation_errors: if error.is_group_error: for child_error in error.child_errors: output_errors[concat_path(child_error.document_path)] = child_error.code else: output_errors[concat_path(error.document_path)] = error.code return output_errors extended Validator, custom validation rule not the same as previously custom output for #errors call
  • 11. PySpark integration - .mapPartitions function def check_for_errors(rows): validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler()) def default_dictionary(): return defaultdict(int) errors = defaultdict(default_dictionary) for row in rows: validation_result = validator.validate(row.asDict(recursive=True), normalize=False) if not validation_result: for error_field, error_code in validator.errors.items(): errors[error_field][error_code] += 1 return [(k, dict(v)) for k, v in errors.items()] disabled normalization
  • 12. Resources ● Cerberus: http://docs.python-cerberus.org/en/stable/ ● Github Cerberus+PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk ● Github data generator: https://github.com/bartosz25/data-generator ● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark
  • 13. Thank you ! @waitingforcode / waitingforcode.com

Editor's Notes

  1. SchemaErrorHandler ⇒ child of BasicErrorHandler; BasicErrorHandler child of BaseErrorHandler
  2. define what is the type (datetime, int, string, ..) mapping ⇒ type, items, empty ⇒ validator types https://docs.python-cerberus.org/en/stable/validation-rules.html
  3. explain why normalize=False calls __normalize_mapping * you can specify a new attribute - normalization will rename it if needed >>> v = Validator({'foo': {'rename': 'bar'}}) >>> v.normalized({'foo': 0}) {'bar': 0} * later it purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified * applies the defaults also * can also call a coercion, a callable that can for instance convert the type of a field Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated check if validator.clear_caches can help
  4. explain why normalize=False calls __normalize_mapping * you can specify a new attribute - normalization will rename it if needed >>> v = Validator({'foo': {'rename': 'bar'}}) >>> v.normalized({'foo': 0}) {'bar': 0} * later it purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified * applies the defaults also * can also call a coercion, a callable that can for instance convert the type of a field Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated check if validator.clear_caches can help