1. Automatic Data Validation & Cleaning with PySemantic
Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd
2. About Me
● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack
/ jaidevd
4. The Problem
● Curating the data and standardizing it across the team
● Data quality problems:
○ Unstructured data
○ Unorganized data
○ Duplicated data
○ Irrelevant data
● Communication problems:
○ Large and distributed teams
○ “What has happened to get the dataset to the current stage?”
○ Messier data means more communication.
HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?
7. Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines.
● Not a parser
● A dataset loader geared toward feature extraction for machine learning tasks
● A logger for all operations on a dataset
PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules, as sketched below
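For instance, the data dictionary could carry per-column rules that the loader enforces at read time. The keys below (column_rules, regex, min_value) are a hypothetical sketch of such a schema, not necessarily PySemantic's exact vocabulary:

mydataset1:
  path: /path/to/mydataset.csv
  column_rules:                     # hypothetical key, for illustration only
    col_a:
      regex: "^[A-Z]{3}[0-9]+$"     # drop rows whose col_a fails this pattern
    col_b:
      min_value: 0                  # reject negative values in col_b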
8. How it works
$ semantic add mydictionary.yaml
mydataset1:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c
>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset1")
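Since PySemantic wraps the pandas parsers, the schema above corresponds roughly to a plain pandas call. A minimal sketch, assuming the schema shown on this slide; the mapping of schema keys to read_csv arguments is an illustration, not PySemantic's internal code:

import pandas as pd

# Roughly the call that the schema above configures:
df = pd.read_csv(
    "/path/to/mydataset.csv",             # path
    nrows=100,                            # nrows
    usecols=["col_a", "col_b", "col_c"],  # use_columns
)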
9. PySemantic Internals
● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any (see the sketch after this list)
● Log everything
● After loading a dataset, apply common preprocessing methods by default
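As an illustration of the error-driven retry idea, a loader can catch parser errors, log them, relax the offending argument, and try again. This is a hypothetical sketch of the general pattern, not PySemantic's actual implementation:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

def load_with_retries(path, parser_args, max_attempts=3):
    # Hypothetical retry-on-error loop; PySemantic's real logic lives in
    # its own classes and handles many more argument adjustments.
    for attempt in range(1, max_attempts + 1):
        try:
            return pd.read_csv(path, **parser_args)
        except (ValueError, pd.errors.ParserError) as err:
            logging.info("Attempt %d failed: %s", attempt, err)
            # Example adjustment: drop explicit dtypes so pandas can
            # infer them on the next attempt.
            parser_args.pop("dtype", None)
    raise RuntimeError("Could not parse %s after %d attempts"
                       % (path, max_attempts))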
11. Limitations
● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing