Using Cerberus and PySpark to validate semi-structured datasets

This short presentation shows one way to integrate Cerberus and PySpark. It was initially given at a Paris.py meetup (https://www.meetup.com/Paris-py-Python-Django-friends/events/264404036/).

  1. Extensible JSON validation at scale with Cerberus and PySpark
     Bartosz Konieczny @waitingforcode
  2. First things first
     Bartosz Konieczny
     #dataEngineer #ApacheSparkEnthusiast #AWSuser
     #waitingforcode.com #becomedataengineer.com #@waitingforcode
     #github.com/bartosz25 /data-generator /spark-scala-playground ...
  3. Cerberus
  4. API
     ● from cerberus import Validator
       ○ def __init__(..., schema, ignore_none_values, allow_unknown, purge_unknown, error_handler)
       ○ def validate(self, document, schema=None, update=False, normalize=True)
     ● from cerberus.errors import BaseErrorHandler
       ○ def __call__(self, errors)
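
     A minimal, hedged sketch of this API in use (the schema and documents below are illustrative,
     not taken from the deck):

        from cerberus import Validator

        schema = {'id': {'type': 'integer'}}

        # allow_unknown controls whether fields absent from the schema are tolerated
        strict_validator = Validator(schema)
        lenient_validator = Validator(schema, allow_unknown=True)

        document = {'id': 1, 'extra_field': 'abc'}
        print(strict_validator.validate(document))   # False - 'extra_field' is an unknown field
        print(lenient_validator.validate(document))  # True
        print(strict_validator.errors)               # details of the failed validation
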
  5. Validation rules
     ● min/max
       'id': {'type': 'integer', 'min': 1}
     ● RegEx
       'email': {'type': 'string', 'regex': '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+$'}
     ● empty, contains (all values)
       'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty': False, 'contains': ['item 1', 'item 2']}
     ● allowed, forbidden
       {'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']}}
       {'role': {'forbidden': ['owner']}}
     ● required
       'first_order': {'type': 'datetime', 'required': False}
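
     A short sketch combining a few of the rules above in one schema (field values are made up for
     illustration):

        from cerberus import Validator

        schema = {
            'id': {'type': 'integer', 'min': 1},
            'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']},
            'first_order': {'type': 'datetime', 'required': False},
        }
        validator = Validator(schema)

        # 'id' violates min, 'role' is not in the allowed list; 'first_order' may be omitted
        print(validator.validate({'id': 0, 'role': 'owner'}))  # False
        print(validator.errors)                                # one entry per failing field
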
  6. Custom validation rule

     class ExtendedValidator(Validator):
         def _validate_productexists(self, lookup_table, field, value):
             if lookup_table == 'memory':
                 existing_items = ['item1', 'item2', 'item3', 'item4']
                 not_existing_items = list(filter(lambda item_name: item_name not in existing_items, value))
                 if not_existing_items:
                     self._error(field, "{} items don't exist in the lookup table".format(not_existing_items))

     ● extend Validator
     ● prefix custom rules with _validate_{rule_name} methods
     ● call def _error(self, *args) to add errors
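
     A hedged usage sketch of the ExtendedValidator above: the schema refers to the custom rule by
     the name following the _validate_ prefix, and the constraint value ('memory' here) is passed as
     the lookup_table argument (the field name and items are illustrative):

        schema = {'items': {'type': 'list', 'productexists': 'memory'}}
        validator = ExtendedValidator(schema)

        print(validator.validate({'items': ['item1', 'item99']}))  # False
        print(validator.errors)  # reports that ['item99'] doesn't exist in the lookup table
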
  7. Validation process
     schema: {
         'id': {'type': 'integer', 'min': 1},
         'first_order': {'type': 'datetime', 'required': False},
     }
     document: {"id": -3, "amount": 30.97, ...}
     Validator #validate ⇒ True / False, details in #errors
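
     The flow above, sketched in code (the document values come from the slide; the error contents
     are indicative):

        from cerberus import Validator

        schema = {
            'id': {'type': 'integer', 'min': 1},
            'first_order': {'type': 'datetime', 'required': False},
        }
        validator = Validator(schema)

        # validate() answers True/False; per-field details are exposed through validator.errors
        print(validator.validate({'id': -3, 'amount': 30.97}))  # False
        print(validator.errors)  # 'id' breaks the min rule, 'amount' is an unknown field
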
  8. Cerberus and PySpark
  9. PySpark integration - whole pipeline

     dataframe_schema = StructType(
         fields=[
             # ...
             StructField("source", StructType(
                 fields=[StructField("site", StringType(), True),
                         StructField("api_version", StringType(), True)]
             ), False)
         ])

     def sum_errors_number(errors_count_1, errors_count_2):
         merged_dict = {dict_key: errors_count_1.get(dict_key, 0) + errors_count_2.get(dict_key, 0)
                        for dict_key in set(errors_count_1) | set(errors_count_2)}
         return merged_dict

     spark = SparkSession.builder.master("local[4]") \
         .appName("...").getOrCreate()

     errors_distribution = spark.read \
         .json(input, schema=dataframe_schema, lineSep='\n') \
         .rdd.mapPartitions(check_for_errors) \
         .reduceByKey(sum_errors_number).collectAsMap()

     potential data quality issue
     /!\ collect data to the driver
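
     To make the reduce step concrete, here is how sum_errors_number (defined on the slide above)
     merges two per-partition error maps produced for the same field (error codes and counts are
     made up):

        # each partition emits (field, {error_code: count}) pairs; reduceByKey merges the dicts
        errors_partition_1 = {333: 2, 66: 1}
        errors_partition_2 = {333: 5}

        print(sum_errors_number(errors_partition_1, errors_partition_2))
        # {333: 7, 66: 1} (key order may differ)
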
  10. PySpark integration - extended Cerberus

      UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')

      class ExtendedValidator(Validator):
          def _validate_networkexists(self, allowed_values, field, value):
              if value not in allowed_values:
                  self._error(field, UNKNOWN_NETWORK, {})

      class ErrorCodesHandler(SchemaErrorHandler):
          def __call__(self, validation_errors):
              def concat_path(document_path):
                  return '.'.join(document_path)

              output_errors = {}
              for error in validation_errors:
                  if error.is_group_error:
                      for child_error in error.child_errors:
                          output_errors[concat_path(child_error.document_path)] = child_error.code
                  else:
                      output_errors[concat_path(error.document_path)] = error.code
              return output_errors

      ● extended Validator, custom validation rule (not the same as previously)
      ● custom output for #errors call
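
      A hedged sketch of what the custom handler changes: with ErrorCodesHandler plugged in,
      validator.errors maps the (dotted) document path of each failing field to a numeric error code
      instead of a human-readable message (the schema and values are illustrative):

         # flat document here; for nested fields the handler joins the path with '.', e.g. 'source.site'
         schema = {'site': {'type': 'string', 'networkexists': ['siteA', 'siteB']}}
         validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler())

         print(validator.validate({'site': 'siteC'}))  # False
         print(validator.errors)                       # e.g. {'site': 333} - the UNKNOWN_NETWORK code
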
  11. PySpark integration - .mapPartitions function

      def check_for_errors(rows):
          validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler())

          def default_dictionary():
              return defaultdict(int)

          errors = defaultdict(default_dictionary)
          for row in rows:
              validation_result = validator.validate(row.asDict(recursive=True), normalize=False)
              if not validation_result:
                  for error_field, error_code in validator.errors.items():
                      errors[error_field][error_code] += 1

          return [(k, dict(v)) for k, v in errors.items()]

      ● disabled normalization
  12. Resources
      ● Cerberus: http://docs.python-cerberus.org/en/stable/
      ● Github Cerberus+PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk
      ● Github data generator: https://github.com/bartosz25/data-generator
      ● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark
  13. Thank you!
      @waitingforcode / waitingforcode.com

Editor's Notes

  • SchemaErrorHandler ⇒ child of BasicErrorHandler; BasicErrorHandler child of BaseErrorHandler
  • define what the type is (datetime, int, string, ...)
    mapping ⇒ type, items, empty ⇒ validator types
    https://docs.python-cerberus.org/en/stable/validation-rules.html
  • explain why normalize=False
    validate() with normalize=True calls __normalize_mapping, which:
    * can rename a field if the schema specifies a new name:
      >>> v = Validator({'foo': {'rename': 'bar'}})
      >>> v.normalized({'foo': 0})
      {'bar': 0}
    * later purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified
    * also applies the defaults
    * can also call a coercion, i.e. a callable (given as an object or the name of a custom coercion
      method) applied to a value before the document is validated, for instance to convert the type of a field
    check if validator.clear_caches can help
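
    A compact, hedged example of the normalization steps mentioned above (purge_unknown, defaults and
    coercion; the field names and exact output are illustrative):
    >>> from cerberus import Validator
    >>> v = Validator({'count': {'type': 'integer', 'coerce': int},
    ...                'status': {'type': 'string', 'default': 'new'}},
    ...               purge_unknown=True)
    >>> v.normalized({'count': '3', 'unknown_field': 'dropped'})
    {'count': 3, 'status': 'new'}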