The purpose of data profiling
is to provide a clear and detailed understanding of
the structure, content and quality of the data,
which is essential before the data is used in any application.
It can help detect and eliminate errors that are common in
databases. These errors include incorrect or
missing values, out-of-range values,
unexpected patterns in the data, etc. It involves the
following processes:
•Identifying the basic features of the data while going
through the profiling process.
•Performing data quality assessment.
•Identifying data types, recurring
patterns, etc.
•Tagging data with descriptions and
keywords.
•Grouping data into categories.
•Identifying the metadata and its
accuracy.
•Performing inter-table analysis.
•Identifying functional dependencies.
Why do you need data profiling?
First of all, data profiling helps to cover the basics of
the data and verify that the information in a table
matches its description. Secondly, it can help you
understand your data better by
revealing relationships across different databases,
source applications, or tables.
Types of data profiling
Structure Discovery: This type of profiling involves performing mathematical checks on the data, such as sum, minimum, maximum, etc., along with other descriptive statistics.
For example, consider a database with the contact
numbers of all users. A Structure Discovery process would
involve finding the percentage of phone numbers that do
not have the correct number of digits.
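The phone-number check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample list `phone_numbers`, the helper names, and the rule that a valid number has exactly 10 digits are all hypothetical and not part of the original material.

```python
import re

# Hypothetical sample data; a real profile would read these from the database.
phone_numbers = [
    "555-867-5309",
    "(555) 123-4567",
    "555-12",
    "5551234567",
    "12345678901234",
]

EXPECTED_DIGITS = 10  # assumption: a valid number has exactly 10 digits


def digit_count(number: str) -> int:
    """Count the digits in a phone number, ignoring punctuation and spaces."""
    return len(re.sub(r"\D", "", number))


def percent_wrong_length(numbers, expected=EXPECTED_DIGITS):
    """Return the percentage of numbers whose digit count differs from expected."""
    if not numbers:
        return 0.0
    wrong = sum(1 for n in numbers if digit_count(n) != expected)
    return 100.0 * wrong / len(numbers)


print(f"{percent_wrong_length(phone_numbers):.1f}% of numbers have the wrong length")
```

The expected digit count depends on the country and format in use, so it is kept as a parameter rather than hard-coded into the check.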
Content Discovery: Content discovery profiling involves
looking into individual data records to identify errors.
Content discovery identifies which rows in a given
dataset contain problems or any systemic issues
occurring in the data.
For example, consider a database with the contact
numbers of all users. A content discovery process
would involve determining the percentage of phone
numbers with no area code.
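A row-level version of this example might look like the sketch below. It assumes a North American style format in which 10 digits include a 3-digit area code and 7 digits do not; the `records` data and the function names are hypothetical.

```python
import re

# Hypothetical records as (user_id, phone) pairs.
# Assumption: 10 digits include an area code, 7 digits do not.
records = [
    (1, "555-867-5309"),
    (2, "867-5309"),
    (3, "(555) 123-4567"),
    (4, "123 4567"),
]


def missing_area_code(phone: str) -> bool:
    """True when the number contains only the 7 local digits."""
    return len(re.sub(r"\D", "", phone)) == 7


def rows_missing_area_code(rows):
    """Return the ids of the rows whose phone number lacks an area code."""
    return [user_id for user_id, phone in rows if missing_area_code(phone)]


bad_ids = rows_missing_area_code(records)
share = 100.0 * len(bad_ids) / len(records) if records else 0.0
print(f"rows without area code: {bad_ids} ({share:.1f}%)")
```

Returning the row ids as well as the percentage reflects the point of content discovery: it locates the individual records that need attention, not only the overall rate.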
•Relationship Discovery: Relationship discovery
involves identifying how parts of the data are related
to each other.
•For example: identifying key relationships between
tables in a database, references between cells and
tables in a spreadsheet, etc.
Data Profiling Methods
•Column Profiling: In this method, the number of times
every value appears within each table column is counted.
This method helps to uncover patterns within the data.
•Cross-column Profiling: In this method, users look across
columns to perform key and dependency analysis.
•Key analysis scans the collections of
values in a table to identify a potential primary key.
•Dependency analysis determines the dependent
relationships within a data set. Together, these analyses
help determine the relationships and dependencies
within a table.
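Column profiling, key analysis and dependency analysis can each be sketched with the Python standard library. The table `rows` and the function names below are hypothetical, and the checks only describe the sample at hand: a dependency that holds on these rows is a candidate, not a proven rule.

```python
from collections import Counter

# Hypothetical table represented as a list of row dictionaries.
rows = [
    {"id": 1, "city": "Paris", "country": "France"},
    {"id": 2, "city": "Lyon", "country": "France"},
    {"id": 3, "city": "Paris", "country": "France"},
    {"id": 4, "city": "Rome", "country": "Italy"},
]


def column_profile(table, column):
    """Column profiling: count how often each value appears in a column."""
    return Counter(row[column] for row in table)


def is_candidate_key(table, column):
    """Key analysis: a column is a potential key if its values are unique and not None."""
    values = [row[column] for row in table]
    return None not in values and len(set(values)) == len(values)


def determines(table, lhs, rhs):
    """Dependency analysis: True if each lhs value maps to exactly one rhs value."""
    seen = {}
    for row in table:
        # setdefault stores the first rhs seen for this lhs and returns it afterwards.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


print(column_profile(rows, "city"))
print(is_candidate_key(rows, "id"), is_candidate_key(rows, "city"))
print(determines(rows, "city", "country"), determines(rows, "country", "city"))
```

In this sample, `id` is a candidate key, and `city` determines `country` but not the other way around, since France appears with more than one city.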
•Cross-table Profiling: In this method, users look across
tables to identify all potential foreign keys. It also
attempts to identify similarities and differences among
data types and syntax between tables to determine
which data can be mapped together and which might be
redundant.
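One simple way to look for potential foreign keys is an inclusion check: a column in one table is a candidate if all of its values appear in the key column of another table. The sketch below illustrates that idea together with a basic comparison of observed data types. The tables `customers` and `orders` and all function names are hypothetical, and a real profiling tool would apply further heuristics.

```python
# Hypothetical tables represented as lists of row dictionaries.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Lin"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.5},
]


def column_values(table, column):
    """Return the set of non-null values found in a column."""
    return {row[column] for row in table if row[column] is not None}


def foreign_key_candidates(child, parent, parent_key):
    """Return child columns whose values all appear in the parent's key column."""
    parent_values = column_values(parent, parent_key)
    columns = list(child[0].keys()) if child else []
    candidates = []
    for col in columns:
        values = column_values(child, col)
        if values and values <= parent_values:  # subset test
            candidates.append(col)
    return candidates


def column_types(table):
    """Map each column to the set of Python type names observed in it."""
    types = {}
    for row in table:
        for col, value in row.items():
            types.setdefault(col, set()).add(type(value).__name__)
    return types


print(foreign_key_candidates(orders, customers, "customer_id"))
print(column_types(orders))
```

Comparing the output of `column_types` for two tables gives a rough view of which columns have compatible types and could be mapped together.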