Automatic Data Validation & Cleaning with PySemantic
Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd
About Me
● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack
/ jaidevd
/ jaidevd
Typical Data Pipeline
The Problem
● Curating the data and standardizing it across the team
● Data quality problems:
○ Unstructured data
○ Unorganized data
○ Duplicated data
○ Irrelevant data
● Communication problems:
○ Large and distributed teams
○ “What has happened to get the dataset to the current stage?”
○ Messier data means more communication.
HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?
PySemantic
Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines.
● Not a parser
● A loader for feature extraction for machine learning tasks
● A logger for all operations on a dataset
PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules
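As a rough illustration of rule-based validation, here is a minimal pandas-only sketch. It is not PySemantic's API: the `RULES` layout, the rule names (`min`, `max`, `unique`) and the `validate` helper are all hypothetical, chosen only to show the idea of checking a dataframe against declared rules.

```python
import pandas as pd

# Hypothetical rule set (not PySemantic's schema format): per-column
# value bounds and a uniqueness requirement.
RULES = {
    "age": {"min": 0, "max": 120},
    "user_id": {"unique": True},
}


def validate(frame, rules):
    """Return a list of human-readable rule violations (empty if clean)."""
    problems = []
    for column, rule in rules.items():
        if column not in frame.columns:
            problems.append(f"{column}: missing column")
            continue
        series = frame[column]
        if "min" in rule and (series < rule["min"]).any():
            problems.append(f"{column}: values below {rule['min']}")
        if "max" in rule and (series > rule["max"]).any():
            problems.append(f"{column}: values above {rule['max']}")
        if rule.get("unique") and series.duplicated().any():
            problems.append(f"{column}: duplicate values")
    return problems


# Toy data with one out-of-range age and one repeated user_id.
frame = pd.DataFrame({"user_id": [1, 2, 2], "age": [34, -5, 60]})
problems = validate(frame, RULES)
print(problems)
```

Returning a list of violations instead of raising on the first one makes it easier to log every problem with a dataset in a single pass.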
How it works
$ semantic add mydictionary.yaml
mydataset1:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c
>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset1")
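One way to picture what a data dictionary entry does is as a set of keyword arguments for `pandas.read_csv`. The sketch below is an assumption about that mapping, not PySemantic's actual code: the helper `spec_to_read_csv_kwargs` is hypothetical, the entry is written as a plain Python dict instead of YAML, and a throwaway CSV stands in for `/path/to/mydataset.csv`.

```python
import os
import tempfile

import pandas as pd


def spec_to_read_csv_kwargs(spec):
    """Translate a dictionary entry into pandas.read_csv keyword arguments.

    The key names follow the slide; the mapping itself is an assumption.
    """
    kwargs = {"filepath_or_buffer": spec["path"]}
    if "nrows" in spec:
        kwargs["nrows"] = spec["nrows"]
    if "use_columns" in spec:
        kwargs["usecols"] = spec["use_columns"]
    return kwargs


# Write a small throwaway CSV so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mydataset.csv")
    with open(csv_path, "w") as handle:
        handle.write("col_a,col_b,col_c,col_d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n")

    # Same shape as the YAML entry on the slide, with a smaller nrows.
    spec = {
        "path": csv_path,
        "nrows": 2,
        "use_columns": ["col_a", "col_b", "col_c"],
    }
    frame = pd.read_csv(**spec_to_read_csv_kwargs(spec))

print(frame)
```

Keeping the loading options in one declarative entry means every team member reads the file the same way, which is the standardization problem the earlier slide describes.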
PySemantic Internals
● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any
● Log everything
● After loading a dataset, apply common preprocessing methods by default
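The idea of changing parser arguments in response to errors, and logging each step, can be sketched with plain pandas. This is only an illustration under stated assumptions: `load_with_retries` and the `RELAXATIONS` order are hypothetical, and the strategy of dropping one argument per failed attempt is a simplification, not a description of how PySemantic actually decides what to change.

```python
import io
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("loader_sketch")

# Hypothetical order in which parser arguments are given up when loading fails.
RELAXATIONS = ("dtype", "usecols")


def load_with_retries(make_buffer, kwargs, relaxations=RELAXATIONS):
    """Try pandas.read_csv; on failure drop one argument and recurse.

    Returns the dataframe together with the arguments that finally worked.
    """
    try:
        return pd.read_csv(make_buffer(), **kwargs), kwargs
    except (ValueError, pd.errors.ParserError) as err:
        for name in relaxations:
            if name in kwargs:
                logger.info("read_csv failed (%s); retrying without %r", err, name)
                relaxed = {k: v for k, v in kwargs.items() if k != name}
                return load_with_retries(make_buffer, relaxed, relaxations)
        raise  # nothing left to relax, so surface the original error


# col_b is declared as int but contains the string "x".
CSV_TEXT = "col_a,col_b\n1,x\n2,3\n"
frame, used = load_with_retries(
    lambda: io.StringIO(CSV_TEXT),
    {"dtype": {"col_b": int}, "usecols": ["col_a", "col_b"]},
)
print(used)
```

A fresh buffer is created for every attempt because a partially consumed file object cannot be safely reused after a failed parse.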
Software Development Practices
● Fully test-driven
● Fully documented
● Pylint score > 9.0
Limitations
● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing
Feedback, Issues, PRs Welcome!
http://github.com/jaidevd/pysemantic
