Using Cerberus and PySpark to validate semi-structured datasets

Extensible JSON validation at scale with Cerberus and PySpark
Bartosz Konieczny
@waitingforcode

First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...

API
● from cerberus import Validator
○ def __init__(..., schema, ignore_none_values, allow_unknown, purge_unknown, error_handler)
○ def validate(self, document, schema=None, update=False, normalize=True)
● from cerberus.errors import BaseErrorHandler
○ def __call__(self, errors)

Validation rules
● min/max
'id': {'type': 'integer', 'min': 1}
● RegEx
'email': {'type': 'string', 'regex': r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'}
● empty, contains (all values)
'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty': False, 'contains': ['item 1', 'item 2']}
● allowed, forbidden
{'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']}}
{'role': {'forbidden': ['owner']}}
● required
'first_order': {'type': 'datetime', 'required': False}
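The RegEx rule is worth checking in isolation, because an unescaped dot matches any character. The snippet below is a minimal sketch using Python's re module directly (not Cerberus itself), assuming the intended email pattern escapes the dot before the domain suffix; the sample addresses and the is_email helper are made up for illustration.

```python
import re

# Email pattern with the dot before the domain suffix escaped.
EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')
# Same pattern with a bare dot, which matches any single character.
LOOSE_PATTERN = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+$')


def is_email(value, pattern=EMAIL_PATTERN):
    """Return True when the whole value matches the pattern."""
    return pattern.match(value) is not None


print(is_email('user@example.com'))
print(is_email('user@examplecom'))
print(is_email('user@examplecom', LOOSE_PATTERN))
```

With the bare dot, an address without any dot in the domain still passes, which is probably not what an email check is meant to allow.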

Custom validation rule
● extend Validator
● name custom rule methods _validate_{rule_name}
● call def _error(self, *args) to add errors

from cerberus import Validator

class ExtendedValidator(Validator):
    # Enabled in a schema with the 'productexists' key.
    def _validate_productexists(self, lookup_table, field, value):
        if lookup_table == 'memory':
            existing_items = ['item1', 'item2', 'item3', 'item4']
            not_existing_items = list(filter(
                lambda item_name: item_name not in existing_items, value))
            if not_existing_items:
                self._error(field, "{} items don't exist in the lookup table"
                            .format(not_existing_items))
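The naming convention works because a validator can find a rule's handler from the rule name used in the schema. The toy class below is not Cerberus code; it is a minimal sketch of that kind of name-based dispatch, with hypothetical names (ToyValidator, _validate_maxlen), to show why a method called _validate_{rule_name} becomes usable as a schema key.

```python
class ToyValidator:
    """Hypothetical stand-in that mimics name-based rule dispatch."""

    def __init__(self, schema):
        self.schema = schema
        self.errors = {}

    def _error(self, field, message):
        # Collect messages per field, similar in spirit to Validator#errors.
        self.errors.setdefault(field, []).append(message)

    def _validate_maxlen(self, constraint, field, value):
        # Illustrative custom rule: value must not be longer than constraint.
        if len(value) > constraint:
            self._error(field, "longer than {}".format(constraint))

    def validate(self, document):
        self.errors = {}
        for field, rules in self.schema.items():
            for rule_name, constraint in rules.items():
                # Look up the handler from the rule name found in the schema.
                handler = getattr(self, "_validate_" + rule_name)
                handler(constraint, field, document[field])
        return not self.errors


validator = ToyValidator({"name": {"maxlen": 3}})
print(validator.validate({"name": "abcd"}), validator.errors)
```

The real library does considerably more (normalization, nested schemas, error trees), so this only illustrates the lookup idea.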

Validation process
schema:
{
    'id': {'type': 'integer', 'min': 1},
    'first_order': {'type': 'datetime', 'required': False},
    ...
}
Validator
● #validate({"id": -3, "amount": 30.97, ...}) → True / False
● #errors

PySpark integration - whole pipeline
● potential data quality issue
● collectAsMap: collect data to the driver /!\

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

dataframe_schema = StructType(
    fields=[  # ...
        StructField("source", StructType(
            fields=[StructField("site", StringType(), True),
                    StructField("api_version", StringType(), True)]), False)
    ])

def sum_errors_number(errors_count_1, errors_count_2):
    # Merge two {error_code: count} dictionaries by summing the counts.
    merged_dict = {dict_key: errors_count_1.get(dict_key, 0) + errors_count_2.get(dict_key, 0)
                   for dict_key in set(errors_count_1) | set(errors_count_2)}
    return merged_dict

spark = (SparkSession.builder.master("local[4]")
         .appName("...").getOrCreate())
errors_distribution = (spark.read
    .json(input, schema=dataframe_schema, lineSep='\n')
    .rdd.mapPartitions(check_for_errors)
    .reduceByKey(sum_errors_number).collectAsMap())
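To see what the reduce step produces without a Spark session, the merge can be exercised in plain Python. This is a minimal sketch: reduce_by_key is a hypothetical stand-in for RDD.reduceByKey, and the field names and error codes in partition_outputs are made up.

```python
def sum_errors_number(errors_count_1, errors_count_2):
    """Merge two {error_code: count} dictionaries by summing counts."""
    return {key: errors_count_1.get(key, 0) + errors_count_2.get(key, 0)
            for key in set(errors_count_1) | set(errors_count_2)}


def reduce_by_key(pairs, merge):
    """Plain-Python stand-in for RDD.reduceByKey, for illustration only."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = merge(grouped[key], value) if key in grouped else value
    return grouped


# Hypothetical per-partition outputs: (field, {error_code: count}) pairs.
partition_outputs = [
    ('id', {66: 2}),
    ('source.site', {2: 1}),
    ('id', {66: 1, 36: 4}),
]

errors_distribution = reduce_by_key(partition_outputs, sum_errors_number)
print(errors_distribution)
```

The result has one entry per field and one counter per error code, so its size depends on the schema rather than on the number of rows, which is what keeps the final collect to the driver small.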

PySpark integration - extended Cerberus
● extended Validator with a custom validation rule (not the same as previously)
● custom output for the #errors call

from cerberus import Validator
from cerberus.errors import ErrorDefinition, SchemaErrorHandler

UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')

class ExtendedValidator(Validator):
    def _validate_networkexists(self, allowed_values, field, value):
        if value not in allowed_values:
            # Report an error definition (code + rule) instead of a text message.
            self._error(field, UNKNOWN_NETWORK, {})

class ErrorCodesHandler(SchemaErrorHandler):
    def __call__(self, validation_errors):
        def concat_path(document_path):
            return '.'.join(document_path)

        # Maps the dotted path of every invalid field to its error code.
        output_errors = {}
        for error in validation_errors:
            if error.is_group_error:
                for child_error in error.child_errors:
                    output_errors[concat_path(child_error.document_path)] = child_error.code
            else:
                output_errors[concat_path(error.document_path)] = error.code
        return output_errors
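The path flattening done by the handler can be checked on its own. The sketch below uses a hypothetical FakeError tuple in place of Cerberus error objects and made-up error codes; it also converts every path element with str(), on the assumption that a path may contain a non-string element such as a list index, which a plain '.'.join would reject.

```python
from collections import namedtuple

# Hypothetical stand-in exposing only the two attributes the handler reads.
FakeError = namedtuple('FakeError', ['document_path', 'code'])


def concat_path(document_path):
    # str() keeps the join working if a path holds a list index (an int).
    return '.'.join(str(part) for part in document_path)


def to_error_codes(validation_errors):
    """Map each dotted field path to its error code."""
    return {concat_path(error.document_path): error.code
            for error in validation_errors}


errors = [
    FakeError(('id',), 66),
    FakeError(('source', 'site'), 2),
    FakeError(('items', 0), 36),
]
print(to_error_codes(errors))
```

Keeping a single code per path means that a field failing two rules reports only the last code written for it, which may or may not be acceptable for a given quality report.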

PySpark integration - .mapPartitions function
from collections import defaultdict

def check_for_errors(rows):
    # One validator per partition; `schema` is the Cerberus schema defined elsewhere.
    validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler())

    def default_dictionary():
        return defaultdict(int)

    errors = defaultdict(default_dictionary)
    for row in rows:
        # normalize=False: disabled normalization
        validation_result = validator.validate(row.asDict(recursive=True),
                                               normalize=False)
        if not validation_result:
            for error_field, error_code in validator.errors.items():
                errors[error_field][error_code] += 1
    return [(k, dict(v)) for k, v in errors.items()]
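The counting logic of the partition function can be exercised without Spark or a validator. In this minimal sketch, count_errors is a hypothetical helper and partition_errors holds made-up values standing in for what validator.errors would return for each row.

```python
from collections import defaultdict


def count_errors(errors_per_row):
    """Count (field, error_code) occurrences over the rows of one partition."""
    errors = defaultdict(lambda: defaultdict(int))
    for row_errors in errors_per_row:
        for error_field, error_code in row_errors.items():
            errors[error_field][error_code] += 1
    # Convert nested defaultdicts to plain dicts before handing them on.
    return [(field, dict(codes)) for field, codes in errors.items()]


# Hypothetical validator.errors values for three rows; {} means a valid row.
partition_errors = [{'id': 66}, {}, {'id': 66, 'source.site': 2}]
print(count_errors(partition_errors))
```

Each partition therefore emits at most one pair per invalid field, and the later reduce step only has to merge these small dictionaries.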

Resources
● Cerberus: http://docs.python-cerberus.org/en/stable/
● GitHub Cerberus+PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk
● GitHub data generator: https://github.com/bartosz25/data-generator
● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark

Thank you!
@waitingforcode / waitingforcode.com
