1. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 1
U21CS303 – Data Visualization
UNIT II: SHAPES OF DATA AND VISUALIZATION IDIOMS
Input for Visualization – Data and Tasks – Loading and Parsing Data – Data Cleansing and its importance
– Reusable Dynamic Components using the General Update Pattern – Reusable Scatter Plot – Common
Visualization Idioms – Bar Chart – Vertical Pie Chart – Horizontal Pie Chart – Coxcomb Plot – Line Chart
– Area Chart – Criteria to choose charts
CO2: Apply visualization techniques for various data analysis tasks
INPUT FOR VISUALIZATION (DATA AND TASKS)
Data is the core of any visualization. Understanding the nature of the data is vital before attempting to
visualize it.
Tasks refer to the goals or objectives of the visualization. Understanding the task at hand helps in
choosing the right visualization technique.
Nature of Data:
1. Numerical Data
a. Continuous Data
b. Discrete Data
2. Categorical Data
3. Ordinal Data
4. Interval Data
Nature of Data:
1. Numerical Data
Numerical data represents quantities or measurements.
a. Continuous Data
Data that can take any value within a given range.
Examples:
Height of individuals (someone can be 5.7 feet tall or 5.71 feet tall), Temperature measurements.
Visualization Techniques: Histograms, line graphs, scatter plots.
b. Discrete Data
Data that can take only specific, separate values.
Examples:
Number of cars in a household (can't have 2.5 cars), Number of students in a class.
Visualization Techniques: Bar charts, pie charts.
2. Categorical Data
Categorical data represents categories or labels.
Examples:
Colors (red, blue, green), Genders (male, female).
Visualization Techniques:
Bar charts, pie charts, stacked bars.
3. Ordinal Data
Ordinal data represents categories with a specific order or rank.
Examples:
Education level (high school, bachelor's, master's, PhD).
Survey responses (strongly disagree, disagree, neutral, agree, strongly agree).
Visualization Techniques:
Bar charts (ordered), line graphs.
4. Interval Data
Interval data is ordered like ordinal data, but the differences between values are consistent and
meaningful. It lacks a true zero point.
Examples:
Temperature in Celsius (0°C doesn't mean absence of temperature).
IQ scores.
Visualization Techniques:
Line graphs, scatter plots.
Hands-on Exercise:
Volume of Data:
The amount of data impacts the choice of visualization. Small datasets can be visualized in detail, while
large datasets often require techniques to simplify, aggregate, or sample the data to make it understandable.
1. Small Datasets
2. Large Datasets
3. Techniques for Handling Large Datasets
A. Aggregation
B. Sampling
Techniques for Handling Large Datasets:
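Aggregation and sampling can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the field names (`region`, `value`) and the generated readings are hypothetical, not data from the course.

```python
import random
from collections import defaultdict

def aggregate_mean(records, key, value):
    """Aggregation: group records by `key` and return the mean of `value` per group."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

def sample_rows(records, n, seed=0):
    """Sampling: return a simple random sample of at most n rows (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

# Hypothetical large dataset: 10,000 readings spread over 4 regions.
readings = [{"region": f"R{i % 4}", "value": float(i % 100)} for i in range(10_000)]

means = aggregate_mean(readings, "region", "value")   # 4 numbers instead of 10,000 rows
subset = sample_rows(readings, 200)                   # 200 rows that can be plotted in detail
```

Aggregation reduces the data to one summary value per group, which suits bar charts; sampling keeps individual rows, which suits scatter plots where every mark is a record.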
Sources of Data:
In the modern digital era, data is abundant and can be sourced from a variety of channels.
Data can come from
Databases
Spreadsheets
APIs
Web scraping
Manual input
1. Databases
A structured collection of data, often managed by a database management system (DBMS).
Databases are typically used by organizations to store vast amounts of transactional or operational data.
Characteristics:
Structured, often relational in nature.
Large volumes.
Can be queried using SQL or other query languages.
2. Spreadsheets
Files that allow users to organize, format, and calculate data using a grid of cells.
Examples: Microsoft Excel and Google Sheets.
Characteristics:
Tabular format.
Easily accessible and modifiable.
Contains formulas and calculations.
3. APIs (Application Programming Interfaces)
A set of rules and protocols that allow different software entities to communicate with each other. APIs
often serve as gateways to retrieve data from web servers or platforms.
Characteristics:
Automated data retrieval.
Real-time or periodic updates.
Structured, often in XML or JSON format.
4. Web Scraping
The process of extracting data from websites. This is done using scripts or bots that fetch web pages and
extract necessary information.
Characteristics:
Data can be unstructured or semi-structured.
Prone to changes based on website layout updates.
5. Manual Input
Data entry done by individuals manually, often into systems, applications, or forms.
Characteristics:
Time-consuming.
Highly susceptible to human error.
Data Quality:
In any data-driven decision-making process, the quality of the data used is paramount.
Poor data quality can lead to incorrect conclusions, flawed business decisions, and resource wastage.
1. Accuracy
2. Consistency
3. Completeness
4. Reliability
5. Timeliness of data
1. Accuracy
Definition: Accuracy refers to the closeness of a data value to its true or actual value.
Key Points:
Errors: Result from incorrect data entry, faulty sensors, or misinformation.
Validation: Use validation rules or checks to ensure data accuracy.
Verification: Cross-reference with trusted data sources.
Real-world Scenario:
Imagine an e-commerce platform recording product prices. If a product's price is incorrectly listed as $10
instead of $100, it could lead to significant revenue loss.
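The validation and verification points above can be sketched as two small checks. This is an illustrative Python sketch; the record layout (`sku`, `price`), the allowed price range, and the 20% tolerance are assumptions made for the example.

```python
def validate_price(record, low=0.01, high=10_000.0):
    """Validation rule: return a list of problems found in a record (empty list = valid)."""
    problems = []
    price = record.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price is not numeric")
    elif not (low <= price <= high):
        problems.append("price outside allowed range")
    return problems

def differs_from_reference(record, reference_prices, tolerance=0.2):
    """Verification: flag a price that deviates from a trusted reference by more than `tolerance`."""
    ref = reference_prices.get(record["sku"])
    if ref is None:
        return False  # nothing to cross-reference against
    return abs(record["price"] - ref) / ref > tolerance

# Hypothetical listing: the price passes the range rule but fails the cross-check.
listing = {"sku": "A1", "price": 10.0}
trusted = {"A1": 100.0}
```

A range rule alone would accept the $10 listing, since $10 is a plausible price; only the cross-reference against a trusted source reveals the error.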
2. Consistency
Definition: Consistency means that data is uniform across all data sources and does not contradict other
recorded data.
Key Points:
Standardization: Use uniform units, formats, and conventions.
Deduplication: Ensure no duplicate or contradictory records exist.
Harmonization: Make sure data from different sources aligns well when integrated.
Real-world Scenario:
Consider a multinational company's employee database. If one branch records dates in "DD/MM/YYYY"
and another in "MM/DD/YYYY", the same string can be read as two different dates once the records are merged.
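One way to standardize the dates in this scenario is to parse each value with the convention of the branch it came from and store a single ISO format. The sketch below uses Python's `datetime`; the branch names and the mapping `BRANCH_FORMATS` are hypothetical.

```python
from datetime import datetime

# Assumed mapping from branch to the date convention it uses (illustrative only).
BRANCH_FORMATS = {"london": "%d/%m/%Y", "new_york": "%m/%d/%Y"}

def to_iso_date(text, branch):
    """Parse a branch-specific date string and return it as YYYY-MM-DD."""
    return datetime.strptime(text, BRANCH_FORMATS[branch]).date().isoformat()

same_text_london = to_iso_date("03/04/2020", "london")      # 3 April 2020
same_text_new_york = to_iso_date("03/04/2020", "new_york")  # 4 March 2020
```

The key design choice is that the source of each record, not the text itself, decides how the date is interpreted.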
3. Completeness
Definition: Completeness ensures that no necessary data is missing.
Key Points:
Null Values: Identify and address missing data points.
Default Values: Ensure they don't misrepresent actual data.
Data Imputation: Techniques can be used to fill in missing data based on existing data patterns.
Real-world Scenario:
In a medical trial, if patient outcomes are missing, the results of the trial could be skewed or inconclusive.
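A simple completeness check counts, per field, how many records actually hold a value. This is a minimal Python sketch; the trial records and the field names `id` and `outcome` are invented for illustration.

```python
def completeness(records, fields):
    """Return, for each field, the fraction of records with a non-missing value."""
    total = len(records)
    return {
        f: sum(r.get(f) not in (None, "") for r in records) / total
        for f in fields
    }

# Hypothetical trial records: two of the four outcomes are missing.
trial = [
    {"id": 1, "outcome": "recovered"},
    {"id": 2, "outcome": None},
    {"id": 3},
    {"id": 4, "outcome": "recovered"},
]
report = completeness(trial, ["id", "outcome"])
```

Here a missing key, `None`, and an empty string are all treated as missing; a real dataset may use other placeholders (such as -1 or "N/A") that need to be added to the check.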
4. Reliability
Definition: Reliability pertains to the trustworthiness and consistency of the data over time.
Key Points:
Source: Data from reputable sources is often more reliable.
Duplication: Repeated or duplicate entries can compromise reliability.
Audit Trails: Maintain logs to trace back data changes or updates.
Real-world Scenario:
If a weather sensor occasionally malfunctions and records extreme values, the overall climate data
becomes less reliable.
5. Timeliness
Definition: Timeliness relates to the age of the data and its relevance to the current scenario.
Key Points:
Update Frequency: Ensure data is updated at appropriate intervals.
Real-time Data: Necessary for certain applications, like stock trading.
Historical Data: While not "timely", it's crucial for trend analysis.
Real-world Scenario:
In stock market analytics, using outdated stock prices can result in financial misjudgements.
Tasks
Tasks refer to the goals or objectives of the visualization.
Understanding the task at hand helps in choosing the right visualization technique.
Some common tasks include:
Comparison: Comparing two or more variables to see how they relate.
Trends: Observing data over a period of time.
Distribution: Understanding the spread and distribution of data.
Relationship: Identifying correlation between variables.
Geospatial: Mapping data to geographical locations.
1. Comparison
Definition: Comparison involves assessing two or more variables to understand their similarities,
differences, or relationships.
Key Points:
Helps in benchmarking and relative judgment.
Can be used for both categorical and numerical data.
Visualization Techniques:
Bar Charts: Useful for comparing quantities across categories.
Grouped or Stacked Bar Charts: Compare multiple variables across categories.
Dot Plots: Good for comparing individual data points.
Real-world Scenario:
Imagine a company wants to compare sales of different products across regions. A grouped bar chart can
show each product's sales in each region, allowing for a clear comparison.
2. Trends
Definition: Trends involve observing data changes over a continuous interval, usually time.
Key Points:
Reveals patterns, fluctuations, and cycles.
Helps in forecasting or predicting future values.
Visualization Techniques:
Line Charts: Ideal for showing trends over time.
Area Charts: Similar to line charts but with the area under the curve shaded.
Real-world Scenario:
Consider stock market data. Investors often use line charts to track the price of a stock over time and
identify its upward or downward trend.
3. Distribution
Definition: Distribution refers to understanding how data is spread across different categories or ranges.
Key Points:
Helps identify central tendencies, spread, and outliers.
Reveals the overall shape of data distribution.
Visualization Techniques:
Histograms: Show the distribution of a single variable.
Box Plots: Highlight the median, quartiles, and potential outliers.
Violin Plots: Combine aspects of box plots and density plots.
Real-world Scenario:
In a manufacturing unit, understanding the distribution of product sizes can help identify anomalies or
deviations from standards.
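The numbers behind a box plot and a histogram can be computed directly. The sketch below uses Python's `statistics` module; the sample values, the bin width, and the choice of the "inclusive" quartile method are assumptions for illustration (other quartile conventions give slightly different results).

```python
import statistics

def five_number_summary(values):
    """Values a box plot draws: minimum, Q1, median, Q3, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def histogram_counts(values, bin_width, start=0):
    """Values a histogram draws: count of observations per bin, keyed by the bin's left edge."""
    counts = {}
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical product sizes
summary = five_number_summary(sizes)
bins = histogram_counts(sizes, bin_width=3)
```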
4. Relationship
Definition: This task focuses on determining if, how, and to what extent two or more variables are
related.
Key Points:
Helps in identifying correlations or potential causations.
Reveals how variables interact with each other.
Visualization Techniques:
Scatter Plots: Plot two variables to see their relationship.
Heat Maps: Great for showing correlations in larger datasets.
Pair Plots: Visualize pairwise relationships in a dataset.
Real-world Scenario:
In healthcare, researchers might use scatter plots to see if there's a relationship between calorie intake and
weight gain among participants.
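The strength of a linear relationship seen in a scatter plot is commonly summarized with the Pearson correlation coefficient. The following Python sketch computes it from first principles; the calorie and weight figures are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical participants: daily calorie intake and weight gain in kg.
calories = [1800, 2000, 2200, 2400, 2600]
gain_kg = [0.1, 0.4, 0.5, 0.9, 1.1]
r = pearson_r(calories, gain_kg)
```

A value of `r` close to +1 or -1 indicates a strong linear relationship, but it does not by itself show that one variable causes the other.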
5. Geospatial
Definition: Geospatial visualization pertains to data that has a geographical or spatial component.
Key Points:
Maps real-world data to geographical locations.
Helps in spatial analysis and location-based insights.
Visualization Techniques:
Choropleth Maps: Regions colored based on a variable.
Point Maps: Individual data points plotted on a map.
Heat Maps: Show the density or intensity of data points on a map.
Real-world Scenario:
During an epidemic, health agencies might use choropleth maps to show the spread and intensity of the
disease across regions.
Shapes of Data
In the world of data analysis and visualization, the format or "shape" in which data is presented plays a
crucial role in determining how it can be processed, analyzed, and visualized.
Data can be represented in various shapes or structures.
The most common ones include:
Tabular Data: Data in rows and columns, like a spreadsheet.
Hierarchical Data: Data in tree structures, like JSON.
Network Data: Data representing connections, like social networks.
Geospatial Data: Data with geographic information, like maps.
1. Tabular Data
Definition: Tabular data is structured data that's organized in rows and columns, much like a table or
spreadsheet.
Characteristics:
Each row typically represents a single record or entity.
Each column represents a specific attribute or field of the entity.
Contains headers to label each column.
Visualization Techniques:
Bar charts, line charts, and scatter plots for various columns.
Heatmaps for visualizing patterns within the table.
Real-world Example:
A company's sales record where each row represents a transaction, and columns may include date,
product ID, quantity sold, and total price.
2. Hierarchical Data
Definition:
Hierarchical data represents relationships in a tree-like structure, where entities can have parent-child
relationships.
Characteristics:
Data is nested.
Each entity (except the root) has one parent and can have multiple children.
Commonly represented in formats like JSON or XML.
Visualization Techniques:
Tree diagrams.
Dendrograms in hierarchical clustering.
Real-world Example:
A company's organizational chart where the CEO is at the top, followed by VPs, managers, and individual
contributors in a tree structure.
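Nested data of this kind is usually processed with a recursive traversal. The Python sketch below walks a small, hypothetical organizational chart stored as nested dictionaries (the same structure JSON would produce) and records each node's depth, which is what a tree diagram needs for layout.

```python
def walk(node, depth=0):
    """Yield (name, depth) for a node and all of its descendants, parents before children."""
    yield node["name"], depth
    for child in node.get("children", []):
        yield from walk(child, depth + 1)

# Hypothetical organizational chart.
org = {
    "name": "CEO",
    "children": [
        {"name": "VP Engineering", "children": [{"name": "Manager A"}]},
        {"name": "VP Sales"},
    ],
}
rows = list(walk(org))
```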
3. Network Data
Definition:
Network data, also known as graph data, represents entities (nodes) and the relationships or connections
(edges) between them.
Characteristics:
Nodes represent entities.
Edges represent relationships.
Can be directed (with a starting and ending point) or undirected.
Visualization Techniques:
Network graphs.
Force-directed layouts.
Matrix representation.
Real-world Example:
A social media platform where individuals are nodes and their friendships are edges connecting them.
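Network data is often stored as an edge list and converted to an adjacency structure before analysis or layout. This Python sketch builds an undirected adjacency map; the names and friendships are invented.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbours mapping."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)   # undirected: store the edge in both directions
    return adjacency

# Hypothetical friendships (edges) between people (nodes).
friendships = [("Asha", "Ben"), ("Ben", "Chen"), ("Asha", "Chen"), ("Chen", "Dev")]
adjacency = build_adjacency(friendships)
degree = {person: len(friends) for person, friends in adjacency.items()}
```

For a directed network, only the first `add` call would be kept, so that an edge from `a` to `b` does not imply one from `b` to `a`.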
4. Geospatial Data
Definition:
Geospatial data pertains to information associated with a specific geographic location.
Characteristics:
Contains spatial coordinates (like latitude and longitude).
Can also include altitude, distance, or other geographical metrics.
Often used in conjunction with maps.
Visualization Techniques:
Choropleth maps (regions colored based on data).
Point maps (points placed based on coordinates).
Heatmaps (density visualization).
Real-world Example:
A city's public health department mapping reported cases of a disease to identify hotspots.
LOADING AND PARSING DATA
Loading data involves importing data from its source into the environment where the visualization
will be created.
Once data is loaded, it might not be in a ready-to-use state.
Parsing involves converting or arranging data into a usable format.
Loading Data
• Loading data is a foundational step in the process of data visualization.
• Data can be loaded from three main kinds of sources:
1. File Formats
2. Databases
3. Application Programming Interfaces
1. File Formats:
One of the primary aspects to consider is the format of the data. Common formats include:
Comma-Separated Values (CSV), a plain-text tabular format.
Microsoft Excel workbooks.
JavaScript Object Notation (JSON), a hierarchical format built from key-value pairs.
XML, another hierarchical format, used in applications ranging from web services to document storage.
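As a rough illustration, CSV and JSON can both be loaded with Python's standard library. The sample text below is made up, and it is read from in-memory strings so the sketch is self-contained; with real files, `open(path)` would replace `io.StringIO(text)`.

```python
import csv
import io
import json

# Hypothetical file contents, inlined for the example.
CSV_TEXT = "date,product_id,quantity\n2024-01-02,P1,3\n2024-01-03,P2,5\n"
JSON_TEXT = '{"name": "root", "children": [{"name": "leaf", "value": 7}]}'

def load_csv(text):
    """Load tabular data: one dictionary per row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text):
    """Load hierarchical data as nested dictionaries and lists."""
    return json.loads(text)

rows = load_csv(CSV_TEXT)
tree = load_json(JSON_TEXT)
```

Note that the CSV loader returns every field as a string (`"3"`, not `3`), while JSON preserves numbers. This is one reason loaded data usually needs a parsing step before it can be visualized.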
2. Databases:
Beyond file formats, databases play a crucial role in the world of data.
Relational databases organize data into tables, allowing for complex queries that can join, filter,
and aggregate data.
NoSQL databases, like MongoDB or Cassandra, offer more flexible storage solutions, especially
for unstructured or semi-structured data.
3. Application Programming Interfaces
APIs allow applications to communicate with each other, often fetching real-time data.
For instance, if one were creating a visualization about global weather patterns, they might fetch
data in real-time from a weather API.
Parsing Data:
Data parsing is the process of converting or arranging data into a format or structure that's more
usable and insightful for our specific needs.
1. Data Cleaning
2. Data Transformation
3. Feature Engineering
4. Date and Time Parsing
Data Cleaning: The first and most fundamental step in data parsing is data cleaning.
Common Issues in Raw Data
Missing values.
Duplicate entries.
Incorrect data types.
Outliers or anomalies.
Importance of Data Cleaning:
Eliminate errors and inconsistencies.
Enhance accuracy of data analysis.
Save time and resources in the long run.
Real-world Example
Scenario: Imagine an e-commerce company collecting data on customer purchases. Incorrect data
might show a product being purchased before it was even launched. Data cleaning will identify and
rectify such anomalies.
Outcome: Clean data ensures accurate customer insights and better business decisions.
Data Transformation: Once our data is clean, it often needs to be transformed into a more suitable format
or scale.
Data types might need conversion, like turning a number stored as a string into an actual numeric value.
Such transformations are essential for mathematical operations and comparisons.
Examples of Data Transformation
Numeric scaling: Normalizing or standardizing values.
Encoding: Converting categorical data into numeric format.
Date formatting: Ensuring a consistent date format across data.
Aggregation: Summarizing data into broader categories or metrics.
Real-world Example
Scenario: Consider a global e-commerce platform collecting data on customer purchases. Prices are stored
in various currencies. Before analyzing global sales, data needs to be transformed to a standard currency, like
USD. Clean data ensures correct exchange rates are applied consistently.
Outcome: Accurate insights on global sales and profit margins.
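The currency scenario combines two transformations: converting text to numbers and converting values to a common unit. A minimal Python sketch follows; the exchange rates are made-up placeholders, not real rates.

```python
# Made-up exchange rates for illustration only; real rates must come from a trusted source.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}

def to_usd(amount_text, currency):
    """Convert a price stored as text in `currency` to a USD number, rounded to cents."""
    amount = float(amount_text.strip())        # type conversion: string -> number
    return round(amount * RATES_TO_USD[currency], 2)

# Hypothetical purchases as (amount as text, currency code).
purchases = [("100", "EUR"), ("2500", "INR"), (" 20.50 ", "USD")]
total_usd = round(sum(to_usd(amount, code) for amount, code in purchases), 2)
```

Keeping the rates in one table means every record is converted with the same rate, which is the consistency the scenario calls for.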
Feature Engineering: Feature engineering is the process of transforming raw data into features that better
represent the underlying patterns in the data, making them more suitable for modeling.
Parsing: Breaking down data into understandable and usable components.
Feature Engineering: Optimizing these components (features) for analysis or modelling.
Common Feature Engineering Techniques
Normalization: Scaling features to a standard scale (e.g., 0 to 1) so they contribute equally to model
performance.
One-hot encoding: Converting a categorical variable into one binary (0/1) column per category, a numeric
form that ML algorithms can consume.
Binning: Converting continuous variables into discrete bins to handle outliers or make data more
interpretable.
Feature extraction: Transforming high-dimensional data into a lower-dimensional form, retaining the most
important information.
Feature interaction: Creating new features by combining two or more features, capturing interactions
between them.
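Three of the techniques above (normalization, one-hot encoding, and binning) are small enough to sketch in plain Python. The function names and sample values are illustrative; libraries such as Pandas provide their own versions.

```python
def min_max(values):
    """Normalization: rescale values linearly to the range 0..1 (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """One-hot encoding: one 0/1 column per category, columns in sorted category order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def bin_index(value, edges):
    """Binning: index of the bin a value falls in, given ascending bin edges."""
    return sum(value >= edge for edge in edges)

scaled = min_max([10, 20, 30])
encoded = one_hot(["red", "blue", "red"])                   # columns: blue, red
age_bins = [bin_index(a, [18, 30, 50]) for a in (10, 25, 50)]
```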
Date and Time Parsing: The process of converting textual data representing dates and times into a
standardized and usable format.
Time-based data is common in various domains (e.g., finance, healthcare, logistics). Proper parsing ensures
it's utilized effectively for analysis and modeling.
Proper date and time parsing is foundational for many data analysis tasks.
Common Formats & Challenges
Standard Date Formats: YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY, etc.
Time Formats: HH:MM:SS, HH:MM AM/PM, etc.
Time Zones: UTC, GMT, PST, etc.
Daylight Saving Time: Adjustments can shift times.
Ambiguities: 03/04/2020 (Is it March 4th or April 3rd?)
Real-world Scenarios & Case Studies
Healthcare: Parsing appointment and treatment dates for patient data analysis.
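Two of the challenges listed above, ambiguous day/month order and time zones, can be illustrated with Python's `datetime` module. This is a sketch under stated assumptions: only the two slash-separated formats are tried, and the timestamp format with a numeric UTC offset is chosen for the example.

```python
from datetime import datetime, timezone

def possible_dates(text):
    """Return every calendar date the text could mean under DD/MM/YYYY or MM/DD/YYYY."""
    found = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # the text is not a valid date under this format
    return found

def is_ambiguous(text):
    """True if the two conventions give different valid dates."""
    return len(possible_dates(text)) > 1

def to_utc(text):
    """Parse 'YYYY-MM-DD HH:MM:SS +HHMM' and express it in UTC as an ISO string."""
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")
    return local.astimezone(timezone.utc).isoformat()

appointment_utc = to_utc("2024-01-02 14:30:00 +0530")
```

Detecting ambiguity does not resolve it; the correct reading still has to come from knowledge of the data source.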
DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
Data Cleansing is a systematic approach to ensuring the accuracy, consistency, and reliability of data by
identifying and rectifying errors and inconsistencies.
Every step, from data collection to visualization, can introduce errors. Without cleansing, these errors can
propagate and mislead.
Why is Data Cleansing Essential for Data Visualization?
Ensuring Accuracy
Maintaining Consistency
Building Trust
Enhancing Efficiency
Common Data Quality Issues
Missing Data
Duplicate Data
Inaccurate Data
Outliers
Inconsistent Formats
Steps in Data Cleansing
Data cleansing can be described as a five-step process.
Step 1. Data Auditing
Step 2. Data Cleaning
Step 3. Data Verification
Step 4. Data Reporting
Step 5. Data Maintenance
Step 1. Data Auditing:
Data auditing is the initial phase where data is meticulously examined to detect any anomalies,
inconsistencies, or inaccuracies.
This process typically employs both descriptive statistics and visual methods.
Descriptive statistics, such as means, medians, modes, standard deviations, and variance, can provide
insights into the central tendencies and dispersions within the data.
Visual methods involve representing the data graphically to detect outliers or patterns that might not be
evident in tabulated data.
Tools:
Histograms
Scatter Plots
Frequency Distributions
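A first audit pass can be done with descriptive statistics for numeric columns and a frequency distribution for categorical ones. The Python sketch below uses the standard library; the `ages` and `genders` columns are invented to show the kind of problem each check exposes.

```python
import statistics
from collections import Counter

def audit_numeric(values):
    """Descriptive statistics for a numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def audit_categories(labels):
    """Frequency distribution for a categorical column."""
    return Counter(labels)

# Hypothetical columns containing typical problems.
ages = [23, 25, 24, 26, 25, 250]                        # 250 is likely a typing error
genders = ["male", "Male", "female", "M", "female"]     # three spellings of one category

age_summary = audit_numeric(ages)
gender_counts = audit_categories(genders)
```

A mean far above the median and an implausible maximum point to the suspicious age, while the frequency table shows more distinct labels than there are real categories.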
Step 2. Data Cleaning:
Once data auditing has identified problems, data cleaning rectifies these issues. It is the actual process of
amending or removing data in a dataset that is incorrect, incomplete, improperly formatted, or duplicated.
Methods:
Imputation: Missing data can distort analysis. One way to address this is by filling in the gaps with imputed
values.
Outlier Treatment: Not all outliers are errors; some might be genuine extreme values. The key is to ascertain
whether to keep them (if they're valid), adjust them (if they've been recorded incorrectly), or remove them (if
they're anomalies without relevance).
Deduplication: Duplicate data can arise due to various reasons, such as data merge errors or multiple data
entries. This step involves identifying and removing these redundant records to ensure each data point is
unique and accurate.
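The three methods can be sketched as small Python functions. The specific choices here (median imputation, the 1.5 × IQR rule for flagging outliers, and keeping the first of each duplicate) are common conventions picked for illustration, not the only valid options.

```python
import statistics

def impute_median(values):
    """Imputation: replace missing entries (None) with the median of the present values."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Outlier detection: flag values more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def deduplicate(records, key_fields):
    """Deduplication: keep the first record seen for each combination of key fields."""
    seen, unique = set(), []
    for record in records:
        key = tuple(record[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

`iqr_outliers` only flags candidates; deciding whether to keep, adjust, or remove each one remains a judgment call, as described above.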
Step 3. Data Verification:
After cleaning, it's vital to revisit the data to ensure no new errors were introduced during the cleaning
process. Additionally, data verification ensures that no essential information was inadvertently removed.
Methods: This might involve cross-referencing with other trusted datasets, re-running the initial auditing
techniques, or even sampling and manually reviewing records.
Step 4. Data Reporting:
Documenting the cleansing process is crucial for several reasons. It maintains transparency, ensures
replicability, and builds trustworthiness.
Methods: This involves creating a detailed report outlining what anomalies were identified, the
techniques used to address them, and any potential implications of the cleaning process. This
documentation can be invaluable for future reference, for stakeholders to understand the data's quality, or
for other data professionals working on the dataset.
Step 5. Data Maintenance:
Data doesn't remain static. As new data is added or as existing data evolves, it's crucial to ensure that it
maintains its quality.
Methods: Implementing routine checks, validations, and periodic audits can help in identifying and
rectifying new issues promptly. Automated scripts, data quality software, or even manual reviews (for
smaller datasets) can be employed to maintain data cleanliness over time.
Data Cleansing Tools and Techniques
Manual Review:
Pros: Full control, detailed understanding.
Cons: Time-consuming, not feasible for large datasets.
Automated Cleaning Tools:
Examples: DataWrangler, OpenRefine, and Trifacta.
Pros: Faster, can handle large datasets.
Cons: May involve a learning curve, can miss domain-specific nuances.
Programming Libraries:
Examples: Pandas in Python, dplyr in R.
Pros: Highly customizable, can integrate with other data processing steps.
Cons: Requires programming knowledge.
Data Quality Software:
Tools like Talend and Informatica offer end-to-end data management solutions.
Pros: Comprehensive, often have built-in workflows.
Cons: Can be expensive, might have a steeper learning curve.
Reusable Dynamic Components using the General Update Pattern
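Visualizations often need to change as their underlying data changes, which calls for a mechanism to keep
the display in sync with the data. The General Update Pattern in D3.js (a popular data visualization library
for JavaScript) provides such a mechanism, facilitating the creation of dynamic and efficient visualizations.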
What is the General Update Pattern?
The General Update Pattern is a methodology used in D3.js to handle data changes in a visualization. It breaks the
data-binding process into three distinct phases:
Enter: Elements that need to be added to the visualization because they have new data associated.
Update: Existing elements that are still matched to data and may need to reflect changed values.
Exit: Elements that no longer have associated data and need to be removed.
The Importance of Reusability
Efficiency: Avoids re-writing code, saving time and reducing errors.
Consistency: Ensures visualizations maintain a consistent look and feel across different parts of an application or
different applications.
Modularity: Allows for easier debugging, testing, and understanding, as each component can be developed and
maintained separately.
Implementing the General Update Pattern
Binding Data
Start by binding a dataset to a selection. This binding process pairs data points from the dataset with DOM
elements in the visualization.
Handling the Enter Selection
Create a new element for each data point that has no matching element yet.
Define attributes and styles for these new elements based on their data.
Handling the Update Selection
Adjust attributes and styles of the existing elements to reflect any changes in their bound data.
Handling the Exit Selection
Remove the elements that no longer have data, or transition them out first, e.g., by fading or shrinking them.
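The logic of the three selections can be illustrated without a browser. The sketch below is a Python stand-in, not D3 code: a dictionary keyed by data ID plays the role of the DOM elements, and the names `data_join` and `render` are hypothetical. In D3 itself the equivalent work is done in JavaScript on selections.

```python
def data_join(elements, new_data, key):
    """Split new data against existing elements into enter, update, and exit groups."""
    old_keys = set(elements)
    new_by_key = {key(d): d for d in new_data}
    enter = {k: d for k, d in new_by_key.items() if k not in old_keys}
    update = {k: d for k, d in new_by_key.items() if k in old_keys}
    exit_ = old_keys - set(new_by_key)
    return enter, update, exit_

def render(elements, new_data, key=lambda d: d["id"]):
    """Apply the three phases to `elements` (a dict standing in for DOM nodes)."""
    enter, update, exit_ = data_join(elements, new_data, key)
    for k, d in enter.items():       # enter: create an element for new data
        elements[k] = {"tag": "circle", "r": d["value"]}
    for k, d in update.items():      # update: refresh attributes of existing elements
        elements[k]["r"] = d["value"]
    for k in exit_:                  # exit: remove elements whose data is gone
        del elements[k]
    return sorted(enter), sorted(update), sorted(exit_)

elements = {}
first = render(elements, [{"id": "a", "value": 1}, {"id": "b", "value": 2}])
second = render(elements, [{"id": "b", "value": 5}, {"id": "c", "value": 3}])
```

On the first call everything enters; on the second, `c` enters, `b` is updated, and `a` exits. Matching by a key rather than by position is what keeps each element tied to the same data point across renders.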
Making Components Reusable
Encapsulation
Wrap your visualization code inside a function. This function can then be called whenever you want to
create a new instance of the visualization.
Customizable Properties
Allow users of your component to customize its properties. This could be achieved using getter-setter functions
for each property.
Event Handling
For interactivity, allow users to define custom event handlers. This ensures that the component can respond to
user interactions in a way that suits the specific application it's being used in.
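The three ideas (encapsulation, getter-setter properties, and custom event handlers) can be shown in a language-neutral way. The following Python class is a hypothetical sketch of the convention, in which calling a property method with no argument reads the value and calling it with an argument sets the value and returns the component for chaining. It is not a D3 API.

```python
class ReusableChart:
    """Hypothetical component: all configuration lives on the instance, not in globals."""

    def __init__(self):
        self._props = {"width": 400, "height": 300, "radius": 3}   # assumed defaults
        self._handlers = {}

    def _prop(self, name, value):
        if value is None:              # no argument: act as a getter
            return self._props[name]
        self._props[name] = value      # argument given: act as a setter
        return self                    # return the component to allow chaining

    def width(self, value=None):
        return self._prop("width", value)

    def height(self, value=None):
        return self._prop("height", value)

    def radius(self, value=None):
        return self._prop("radius", value)

    def on(self, event, handler):
        """Register a user-supplied handler for an event name such as 'click'."""
        self._handlers[event] = handler
        return self

    def trigger(self, event, datum):
        """Call the registered handler, if any, with the datum involved."""
        handler = self._handlers.get(event)
        if handler:
            handler(datum)

clicked = []
small = ReusableChart().width(200).radius(5).on("click", clicked.append)
large = ReusableChart().width(800)
small.trigger("click", {"id": 1})
```

Because each instance holds its own settings, `small` and `large` can coexist on one page without interfering with each other.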
REUSABLE SCATTER PLOT
A scatter plot, one of the most versatile visualization tools, displays individual data points on a two-dimensional
plane.
Components of a Scatter Plot
1. Axes and Scale
X and Y Axes: Represent the domains of the two variables being compared.
Scale: The mechanism that translates data values into pixel values. Common scales include linear,
logarithmic, and time-based.
2. Data Points
Typically represented as circles, though other symbols can be used.
Position determined by the data value for each variable.
3. Optional Components
Gridlines: Enhance readability.
Labels: Provide context or additional data about specific points.
Trend Line: Indicates a general direction of data points.
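A linear scale, the most common of the scales mentioned above, is a straight-line mapping from a data domain to a pixel range. The Python sketch below shows the arithmetic; the domain, the pixel range, and the sample points are assumptions for the example.

```python
def linear_scale(domain, pixel_range):
    """Return a function mapping values in `domain` linearly onto `pixel_range`."""
    d0, d1 = domain
    r0, r1 = pixel_range

    def scale(value):
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

    return scale

# Hypothetical 500 x 300 pixel plotting area.
x = linear_scale((0, 100), (0, 500))
y = linear_scale((0, 50), (300, 0))    # range reversed: pixel rows grow downward on screen

points = [(0, 0), (50, 25), (100, 50)]               # data values
pixels = [(x(px), y(py)) for px, py in points]       # positions of the circles
```

Reversing the y range is a common convention because screen coordinates start at the top-left corner, while chart values are expected to increase upward.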
A reusable scatter plot is a type of data visualization that displays individual data points as markers on a
graph.
The term "reusable" implies that the scatter plot is designed in a way that allows it to be easily used and
applied to different datasets or situations without requiring extensive modification.
In a scatter plot, each data point is represented by a marker, typically a dot, and the position of the dot is
determined by the values of two variables (usually one on the x-axis and one on the y-axis).
This allows for a visual representation of the relationship, if any, between the two variables.
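As a rough illustration of what "reusable" means here, the sketch below places points for two unrelated datasets with the same function, changing only the accessors and canvas size. The function `scatter_points` and both datasets are made up for this example; it computes positions only and does not draw anything.

```python
def scatter_points(data, x, y, width=400, height=300):
    """Place each record on a width-by-height canvas.

    x, y -- accessor functions that pull the two variables out of a record
    """
    xs = [x(d) for d in data]
    ys = [y(d) for d in data]

    def to_pixels(value, lo, hi, size):
        # Map value from [lo, hi] to [0, size]; a flat domain maps to the middle.
        return size / 2 if hi == lo else (value - lo) / (hi - lo) * size

    return [
        (to_pixels(vx, min(xs), max(xs), width),
         height - to_pixels(vy, min(ys), max(ys), height))  # flip y so larger is higher
        for vx, vy in zip(xs, ys)
    ]


# Two made-up datasets with different shapes.
people = [{"height": 150, "weight": 50}, {"height": 180, "weight": 80}]
cars = [(1000, 30), (1500, 20), (2000, 10)]

people_pts = scatter_points(people, x=lambda d: d["height"], y=lambda d: d["weight"])
car_pts = scatter_points(cars, x=lambda d: d[0], y=lambda d: d[1], width=200, height=100)
print(people_pts)
print(car_pts)
```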
COMMON VISUALIZATION IDIOMS
Bar Chart
A bar chart represents data using vertical or horizontal rectangular bars of different lengths.
The length of each bar is proportional to the value it represents.
Bar charts are also known as bar graphs. They have three major characteristics:
• Bar charts are used to compare data among different groups.
• Bar charts show the relationship with the help of two axes: one axis represents the
categories and the other represents the discrete values.
• Bar charts show major changes in data over a period of time.
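The proportionality between value and bar length can be shown with a small text-only sketch. The function `bar_lengths` and the sales figures are made up for illustration; the example assumes positive values.

```python
def bar_lengths(values, max_length=40):
    """Scale values to bar lengths so the largest value fills max_length."""
    top = max(values.values())
    return {name: round(max_length * v / top) for name, v in values.items()}


sales = {"North": 120, "South": 60, "East": 30}  # made-up figures
lengths = bar_lengths(sales)
for name, length in lengths.items():
    # One '#' per unit of length gives a horizontal bar per category.
    print(f"{name:<6}{'#' * length} {sales[name]}")
```

Halving a value halves its bar, which is what lets the reader compare categories by length alone.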
Types of Bar Chart
The grouped bar chart is also referred to as the clustered bar chart (graph).
The stacked bar chart is also known as the composite bar chart.
Pie Chart
A pie chart is a pictorial representation of data in the form of a circular chart or pie, where each slice
shows the relative size of a data value.
A pie chart needs a categorical variable (the slice labels) together with a numerical variable (the slice
sizes).
Example: The whole pie represents a value of 100. It is divided into 10 slices or sectors. The various colors
represent the ingredients used to prepare the cake. What would be the exact quantity of each of the
ingredients represented in specific colors in the pie chart?
Quantity of Flour: 30
Quantity of Sugar: 20
Quantity of Egg: 40
Quantity of Butter: 10
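Each quantity above translates into a sector angle as quantity / total × 360°. The sketch below works this out for the cake example; the function name `slice_angles` is hypothetical.

```python
def slice_angles(quantities):
    """Convert each quantity to its sector angle in degrees."""
    total = sum(quantities.values())
    return {name: 360 * q / total for name, q in quantities.items()}


cake = {"Flour": 30, "Sugar": 20, "Egg": 40, "Butter": 10}
angles = slice_angles(cake)
for name, angle in angles.items():
    print(f"{name}: {angle} degrees")
```

Since the total is 100, each quantity is also its percentage share of the pie.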
Line Graph
Line graphs, also called line charts, represent quantitative data collected on a specific subject over a
specific time interval.
All the data points are connected by a line.
Data points represent the observations collected in a survey or study.
The line graph has an x-axis and a y-axis.
A line graph shows continuous data over a period, set against a common scale, and connects individual data
points; it is ideal for showing growth rates or trends at even intervals.
Scatter Plot
It works best when comparing large numbers of data points without regard to time. This tool is very
powerful when we are trying to show the relationship between two variables (x and y-axis), for example, a
person's weight and height.
AREA CHART
An area chart, also known as a mountain chart, is a data visualization type that combines the
appearance of a line chart and a bar chart.
It is commonly used to show how numerical values change based on a second variable, usually a time
period.
TYPES OF AREA CHART
Basic area chart: shows a single series, e.g., tracking the number of weekly profile views of a business's LinkedIn page.
Overlapping area chart: compares two or more data groups based on a specific variable.
Stacked area chart (cumulative area chart): goes deeper than comparing values and shows how
each group contributes to a total.
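The difference between overlapping and stacked area charts comes down to the baseline: in a stacked chart each layer starts where the previous one ended. A minimal sketch of that computation follows; the function `stack` and the weekly view counts are made up for illustration.

```python
def stack(series):
    """Return (lower, upper) boundaries per layer for a stacked area chart.

    series -- dict mapping group name -> list of values, all the same length
    """
    names = list(series)
    baseline = [0] * len(series[names[0]])
    layers = {}
    for name in names:
        # Each layer sits on top of the running total of the layers below it.
        top = [b + v for b, v in zip(baseline, series[name])]
        layers[name] = list(zip(baseline, top))
        baseline = top
    return layers


views = {"organic": [10, 20, 30], "paid": [5, 5, 10]}  # made-up weekly counts
layers = stack(views)
print(layers["organic"])
print(layers["paid"])
```

The upper boundary of the last layer is the total across groups, which is why a stacked area chart shows both the total and each group's contribution.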
DIFFERENT CRITERIA TO CHOOSE CHARTS
Choosing the right chart type is crucial in data visualization.
It ensures that data is presented clearly, accurately, and effectively.
Selecting an inappropriate chart can lead to misinterpretation of data or obscure critical insights.
Nature of Data
1. Qualitative vs. Quantitative
Qualitative Data (Categorical): Data that can be divided into categories.
Suitable Charts: Pie charts, bar charts, treemaps.
Quantitative Data (Numerical): Data that can be measured.
Suitable Charts: Line charts, scatter plots, histograms.
2. Time Series vs. Non-Time Series
Time Series Data: Data points indexed in time order.
Suitable Charts: Line charts, area charts, stacked area charts.
Non-Time Series Data: Data without a chronological order.
Suitable Charts: Bar charts, pie charts, scatter plots.
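The two criteria above can be encoded as simple lookup tables. The sketch below is one possible reading, not a definitive rule: the chart lists are copied from the criteria above, while the function names, the numeric test used to separate quantitative from qualitative data, and the choice to prefer charts that appear under both criteria are assumptions made for illustration.

```python
from numbers import Number

CHARTS_BY_KIND = {
    "qualitative": ["pie chart", "bar chart", "treemap"],
    "quantitative": ["line chart", "scatter plot", "histogram"],
}
CHARTS_BY_ORDER = {
    True: ["line chart", "area chart", "stacked area chart"],   # time series
    False: ["bar chart", "pie chart", "scatter plot"],          # non-time series
}


def data_kind(values):
    """Classify a column as quantitative if every value is a number."""
    numeric = all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    return "quantitative" if numeric else "qualitative"


def suggest(values, time_series=False):
    """Suggest charts matching both criteria, falling back to data kind alone."""
    by_kind = CHARTS_BY_KIND[data_kind(values)]
    by_order = CHARTS_BY_ORDER[time_series]
    both = [chart for chart in by_kind if chart in by_order]
    return both or by_kind


print(suggest([21.5, 22.1, 23.0], time_series=True))
print(suggest(["red", "blue", "red"]))
```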
Number of Variables
1. Univariate Data
Data focusing on a single variable.
Suitable Charts: Histograms, pie charts, line charts (for time series).
2. Bivariate Data
Data exploring the relationship between two variables.
Suitable Charts: Scatter plots, line charts, bar charts.
3. Multivariate Data
Data involving more than two variables.
Suitable Charts: Scatter plot matrices, parallel coordinate plots, radar charts.
Relationships and Comparisons
1. Comparing Categories: Viewing the difference in values across different categories.
Suitable Charts: Bar charts, column charts, dot plots.
2. Distribution of Data: Observing the spread and shape of a dataset.
Suitable Charts: Histograms, box plots, density plots.
3. Trend Over Time: Analyzing patterns or changes over a period.
Suitable Charts: Line charts, area charts, waterfall charts.
4. Correlation: Examining the relationship between two or more variables.
Suitable Charts: Scatter plots, bubble charts.
Data Volume and Complexity
1. Small Data Sets: Emphasize individual data points or fine details.
Suitable Charts: Dot plots, bar charts.
2. Large Data Sets: Offer a general overview or highlight trends.
Suitable Charts: Line charts, heatmaps, histograms.
3. Hierarchical Data: Data with inherent structure or nested categories.
Suitable Charts: Treemaps, dendrograms, sunburst charts.
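A histogram suits large data sets because it reduces many values to a handful of counts. The sketch below shows the binning step only; the function name `histogram`, the equal-width bins, and the deterministic sample data are assumptions for illustration.

```python
def histogram(values, bins, lo, hi):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # The top edge belongs to the last bin.
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    return counts


values = [(i * 37) % 100 for i in range(1000)]  # made-up deterministic data
counts = histogram(values, bins=4, lo=0, hi=100)
print(len(values), "values summarised as", counts)
```

A thousand points become four numbers, which gives the general overview that individual marks could not.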
Audience and Context
1. Level of Expertise
General Audience: Opt for simpler, more intuitive charts.
Expert Audience: Can utilize more complex visualizations.
2. Presentation Medium
Digital Displays: Interactive visualizations like dynamic dashboards.
Print: Static visualizations with high clarity.
3. Storytelling vs. Exploration
Storytelling: Use charts that drive home a specific point or narrative.
Exploration: Use charts that allow users to delve into the data and discover insights.
Aesthetics and Design
1. Clarity and Simplicity
Avoid clutter and prioritize readability.
2. Color Considerations
Ensure color choices are meaningful and accessible.
3. Consistency
If presenting multiple charts, maintain a consistent design language.