2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 1
U21CS303 – Data Visualization
UNIT II: SHAPES OF DATA AND VISUALIZATION IDIOMS
Input for Visualization – Data and Tasks – Loading and Parsing Data – Data Cleansing and its importance
– Reusable Dynamic Components using the General Update Pattern – Reusable Scatter Plot – Common
Visualization Idioms – Bar Chart – Vertical Pie Chart – Horizontal Pie Chart – Coxcomb Plot – Line Chart
– Area Chart – Criteria to choose charts
CO2: Apply visualization techniques for various data analysis tasks
INPUT FOR VISUALIZATION (DATA AND TASKS)
• Data is the core of any visualization. Understanding the nature of the data is vital before attempting to
visualize it.
• Tasks refer to the goals or objectives of the visualization. Understanding the task at hand helps in
choosing the right visualization technique.
Nature of Data:
1. Numerical Data
a. Continuous Data
b. Discrete Data
2. Categorical Data
3. Ordinal Data
4. Interval Data
1. Numerical Data
Numerical data represents quantities or measurements.
a. Continuous Data
Data that can take any value within a given range.
Examples:
Height of individuals (someone can be 5.7 feet tall or 5.71 feet tall), Temperature measurements.
Visualization Techniques: Histograms, line graphs, scatter plots.
b. Discrete Data
Data that can take only specific, separate values.
Examples:
Number of cars in a household (can't have 2.5 cars), Number of students in a class.
Visualization Techniques: Bar charts, pie charts.
2. Categorical Data
Categorical data represents categories or labels.
Examples:
Colors (red, blue, green), Genders (male, female).
Visualization Techniques:
Bar charts, pie charts, stacked bars.
3. Ordinal Data
Ordinal data represents categories with a specific order or rank.
Examples:
Education level (high school, bachelor's, master's, PhD).
Survey responses (strongly disagree, disagree, neutral, agree, strongly agree).
Visualization Techniques:
Bar charts (ordered), line graphs.
4. Interval Data
Interval data is ordered like ordinal data and has consistent differences between values, but it lacks a true
zero point.
Examples:
• Temperature in Celsius (0°C doesn't mean absence of temperature).
• IQ scores.
Visualization Techniques:
Line graphs, scatter plots.
Volume of Data:
The amount of data impacts the choice of visualization. Small datasets can be visualized in detail, while
large datasets often require aggregation or sampling to simplify the data and make it understandable.
1. Small Datasets
2. Large Datasets
3. Techniques for Handling Large Datasets
A. Aggregation
B. Sampling
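Aggregation and sampling can be sketched in a few lines of Python. The sales records, region names, and helper names below are hypothetical and chosen only for illustration; a real project would more likely use a library such as pandas.

```python
import random
from collections import defaultdict

# Hypothetical transaction records: (region, sale amount).
sales = [("North", 120.0), ("South", 80.0), ("North", 200.0),
         ("East", 50.0), ("South", 100.0), ("East", 70.0)]

def aggregate_mean(records):
    """Aggregation: collapse many rows into one mean value per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}

def sample_rows(records, k, seed=0):
    """Sampling: keep k randomly chosen rows (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, k)

mean_by_region = aggregate_mean(sales)   # one value per region instead of one per sale
subset = sample_rows(sales, 3)           # 3 rows drawn from the 6 available
```

Aggregation reduces the number of marks to draw while keeping every row's contribution; sampling keeps individual rows but may miss rare values.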
Sources of Data:
• In the modern digital era, data is abundant and can be sourced from a variety of channels.
• Data can come from:
  • Databases
  • Spreadsheets
  • APIs
  • Web scraping
  • Manual input
1. Databases
A structured collection of data, often managed by a database management system (DBMS).
Databases are typically used by organizations to store vast amounts of transactional or operational data.
Characteristics:
• Structured, often relational in nature.
• Large volumes.
• Can be queried using SQL or other query languages.
2. Spreadsheets
Files that allow users to organize, format, and calculate data using a grid of cells.
Examples: Microsoft Excel and Google Sheets.
Characteristics:
• Tabular format.
• Easily accessible and modifiable.
• Contains formulas and calculations.
3. APIs (Application Programming Interfaces)
A set of rules and protocols that allow different software entities to communicate with each other. APIs
often serve as gateways to retrieve data from web servers or platforms.
Characteristics:
• Automated data retrieval.
• Real-time or periodic updates.
• Structured, often in XML or JSON format.
4. Web Scraping
The process of extracting data from websites. This is done using scripts or bots that fetch web pages and
extract necessary information.
Characteristics:
• Data can be unstructured or semi-structured.
• Prone to breaking when a website's layout changes.
5. Manual Input
Data entry done by individuals manually, often into systems, applications, or forms.
Characteristics:
• Time-consuming.
• Highly susceptible to human error.
Data Quality:
• In any data-driven decision-making process, the quality of the data used is paramount.
• Poor data quality can lead to incorrect conclusions, flawed business decisions, and resource wastage.
1. Accuracy
2. Consistency
3. Completeness
4. Reliability
5. Timeliness of data
1. Accuracy
Definition: Accuracy refers to the closeness of a data value to its true or actual value.
Key Points:
Errors: Result from incorrect data entry, faulty sensors, or misinformation.
Validation: Use validation rules or checks to ensure data accuracy.
Verification: Cross-reference with trusted data sources.
Real-world Scenario:
Imagine an e-commerce platform recording product prices. If a product's price is incorrectly listed as $10
instead of $100, it could lead to significant revenue loss.
2. Consistency
Definition: Consistency means that data is uniform across all data sources and does not contradict other
recorded data.
Key Points:
Standardization: Use uniform units, formats, and conventions.
Deduplication: Ensure no repetitive or contradicting records exist.
Harmonization: Make sure data from different sources aligns well when integrated.
Real-world Scenario:
Consider a multinational company's employee database. If one branch records dates in "DD/MM/YYYY"
and another in "MM/DD/YYYY", the same entry is read differently: 03/04/2024 means 3 April in the first
branch and 4 March in the second.
3. Completeness
Definition: Completeness ensures that no necessary data is missing.
Key Points:
Null Values: Identify and address missing data points.
Default Values: Ensure they don't misrepresent actual data.
Data Imputation: Techniques can be used to fill in missing data based on existing data patterns.
Real-world Scenario:
In a medical trial, if patient outcomes are missing, the results of the trial could be skewed or inconclusive.
4. Reliability
Definition: Reliability pertains to the trustworthiness and consistency of the data over time.
Key Points:
Source: Data from reputable sources is often more reliable.
Duplication: Repeated or duplicate entries can compromise reliability.
Audit Trails: Maintain logs to trace back data changes or updates.
Real-world Scenario:
If a weather sensor occasionally malfunctions and records extreme values, the overall climate data
becomes less reliable.
5. Timeliness
Definition: Timeliness relates to the age of the data and its relevance to the current scenario.
Key Points:
Update Frequency: Ensure data is updated at appropriate intervals.
Real-time Data: Necessary for certain applications, like stock trading.
Historical Data: While not "timely", it's crucial for trend analysis.
Real-world Scenario:
In stock market analytics, using outdated stock prices can result in financial misjudgements.
Tasks
• Tasks refer to the goals or objectives of the visualization.
• Understanding the task at hand helps in choosing the right visualization technique.
• Some common tasks include:
  • Comparison: Comparing two or more variables to see how they relate.
  • Trends: Observing data over a period of time.
  • Distribution: Understanding the spread and distribution of data.
  • Relationship: Identifying correlation between variables.
  • Geospatial: Mapping data to geographical locations.
1. Comparison
Definition: Comparison involves assessing two or more variables to understand their similarities,
differences, or relationships.
Key Points:
• Helps in benchmarking and relative judgment.
• Can be used for both categorical and numerical data.
Visualization Techniques:
• Bar Charts: Useful for comparing quantities across categories.
• Grouped or Stacked Bar Charts: Compare multiple variables across categories.
• Dot Plots: Good for comparing individual data points.
Real-world Scenario:
Imagine a company wants to compare sales of different products across regions. A grouped bar chart can
show each product's sales in each region, allowing for a clear comparison.
2. Trends
Definition: Trends involve observing data changes over a continuous interval, usually time.
Key Points:
• Reveals patterns, fluctuations, and cycles.
• Helps in forecasting or predicting future values.
Visualization Techniques:
• Line Charts: Ideal for showing trends over time.
• Area Charts: Similar to line charts but with the area under the curve shaded.
Real-world Scenario:
Consider stock market data. Investors often use line charts to track the price of a stock over time and
identify its upward or downward trend.
3. Distribution
Definition: Distribution refers to understanding how data is spread across different categories or ranges.
Key Points:
• Helps identify central tendencies, spread, and outliers.
• Reveals the overall shape of data distribution.
Visualization Techniques:
• Histograms: Show the distribution of a single variable.
• Box Plots: Highlight median, quartiles, and potential outliers.
• Violin Plots: Combine aspects of box plots and density plots.
Real-world Scenario:
In a manufacturing unit, understanding the distribution of product sizes can help identify anomalies or
deviations from standards.
4. Relationship
Definition: This task focuses on determining if, how, and to what extent two or more variables are
related.
Key Points:
• Helps in identifying correlations or potential causations.
• Reveals how variables interact with each other.
Visualization Techniques:
• Scatter Plots: Plot two variables to see their relationship.
• Heat Maps: Great for showing correlations in larger datasets.
• Pair Plots: Visualize pairwise relationships in a dataset.
Real-world Scenario:
In healthcare, researchers might use scatter plots to see if there's a relationship between calorie intake and
weight gain among participants.
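A scatter plot suggests a relationship visually; the Pearson correlation coefficient is one common way to put a number on a linear relationship. The sketch below computes it by hand. The calorie and weight values are made up for illustration and are not real study data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values (daily calories, weekly weight gain in kg).
calories = [1800, 2000, 2200, 2400, 2600]
weight_gain = [0.1, 0.3, 0.4, 0.7, 0.9]
r = pearson_r(calories, weight_gain)   # close to +1: strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one; correlation alone does not establish causation.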
5. Geospatial
Definition: Geospatial visualization pertains to data that has a geographical or spatial component.
Key Points:
• Maps real-world data to geographical locations.
• Helps in spatial analysis and location-based insights.
Visualization Techniques:
• Choropleth Maps: Regions colored based on a variable.
• Point Maps: Individual data points plotted on a map.
• Heat Maps: Show density or intensity of data points on a map.
Real-world Scenario:
During an epidemic, health agencies might use choropleth maps to show the spread and intensity of the
disease across regions.
Shapes of Data
• In the world of data analysis and visualization, the format or "shape" in which data is presented plays a
crucial role in determining how it can be processed, analyzed, and visualized.
• Data can be represented in various shapes or structures.
The most common ones include:
• Tabular Data: Data in rows and columns, like a spreadsheet.
• Hierarchical Data: Data in tree structures, like JSON.
• Network Data: Data representing connections, like social networks.
• Geospatial Data: Data with geographic information, like maps.
1. Tabular Data
Definition: Tabular data is structured data that's organized in rows and columns, much like a table or
spreadsheet.
Characteristics:
Each row typically represents a single record or entity.
Each column represents a specific attribute or field of the entity.
Contains headers to label each column.
Visualization Techniques:
Bar charts, line charts, and scatter plots for various columns.
Heatmaps for visualizing patterns within the table.
Real-world Example:
A company's sales record where each row represents a transaction, and columns may include date,
product ID, quantity sold, and total price.
2. Hierarchical Data
Definition:
Hierarchical data represents relationships in a tree-like structure, where entities can have parent-child
relationships.
Characteristics:
Data is nested.
Each entity (except the root) has one parent and can have multiple children.
Commonly represented in formats like JSON or XML.
Visualization Techniques:
Tree diagrams.
Dendrograms in hierarchical clustering.
Real-world Example:
A company's organizational chart where the CEO is at the top, followed by VPs, managers, and individual
contributors in a tree structure.
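As a rough illustration, the organizational chart can be held as nested dictionaries, which is the shape a JSON file would produce once loaded. The role names and the `walk` helper are hypothetical.

```python
# Hypothetical org chart as nested dictionaries (the same shape JSON would give).
org = {
    "name": "CEO",
    "children": [
        {"name": "VP Engineering", "children": [
            {"name": "Manager A", "children": []},
            {"name": "Manager B", "children": []},
        ]},
        {"name": "VP Sales", "children": []},
    ],
}

def walk(node, depth=0):
    """Yield (depth, name) for every node, parents before children."""
    yield depth, node["name"]
    for child in node["children"]:
        yield from walk(child, depth + 1)

rows = list(walk(org))
for depth, name in rows:
    print("  " * depth + name)   # indentation reflects the level in the tree
```

The depth computed during the walk is the kind of information a tree diagram layout uses to place each node.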
3. Network Data
Definition:
Network data, also known as graph data, represents entities (nodes) and the relationships or connections
(edges) between them.
Characteristics:
Nodes represent entities.
Edges represent relationships.
Can be directed (with a starting and ending point) or undirected.
Visualization Techniques:
Network graphs.
Force-directed layouts.
Matrix representation.
Real-world Example:
A social media platform where individuals are nodes and their friendships are edges connecting them.
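One simple way to hold network data in code is an adjacency mapping from each node to its neighbours. The sketch below assumes undirected friendships; the names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical undirected friendships: each pair is one edge.
edges = [("Asha", "Bala"), ("Asha", "Chen"), ("Bala", "Chen"), ("Chen", "Dev")]

def build_adjacency(edge_list):
    """Map each node to the set of nodes it is connected to."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)   # undirected: store the edge in both directions
    return adjacency

adjacency = build_adjacency(edges)
# Degree (number of connections) is often mapped to node size in a network graph.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
```

For a directed network, only the `adjacency[a].add(b)` line would be kept.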
4. Geospatial Data
Definition:
Geospatial data pertains to information associated with a specific geographic location.
Characteristics:
Contains spatial coordinates (like latitude and longitude).
Can also include altitude, distance, or other geographical metrics.
Often used in conjunction with maps.
Visualization Techniques:
Choropleth maps (regions colored based on data).
Point maps (points placed based on coordinates).
Heatmaps (density visualization).
Real-world Example:
A city's public health department mapping reported cases of a disease to identify hotspots.
LOADING AND PARSING DATA
• Loading data involves importing data from its source into the environment where the visualization
will be created.
• Once data is loaded, it might not be in a ready-to-use state.
• Parsing involves converting or arranging data into a usable format.
Loading Data
• Loading data is a foundational step in the process of data visualization. Common sources include:
1. File Formats
2. Databases
3. Application Programming Interfaces
1. File formats:
One of the primary aspects to consider is the format of the data.
• Comma-Separated Values (CSV)
• Microsoft Excel
• JavaScript Object Notation (JSON), a hierarchical and key-value pair format
• XML, another hierarchical format, is also used in various applications, from web services to
document storage.
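A minimal sketch of loading two of these formats with Python's standard `csv` and `json` modules. Inline strings stand in for file contents so the example runs on its own; the column and key names are hypothetical.

```python
import csv
import io
import json

# Inline stand-ins for file contents; the column names are hypothetical.
csv_text = "product,quantity,price\nPen,3,1.50\nBook,1,12.00\n"
json_text = '{"store": "Main", "items": [{"product": "Pen", "quantity": 3}]}'

# CSV: every row becomes a dict keyed by the header row; all values are strings.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested objects and arrays become nested dicts and lists, with numbers typed.
doc = json.loads(json_text)

print(rows[0]["quantity"], type(rows[0]["quantity"]).__name__)
print(doc["items"][0]["quantity"], type(doc["items"][0]["quantity"]).__name__)
```

Note the difference in types: CSV carries no type information, so `quantity` arrives as a string and must be parsed, whereas JSON preserves the number.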
2. Databases:
Beyond file formats, databases play a crucial role in the world of data.
• Relational databases organize data into tables, allowing for complex queries that can join, filter,
and aggregate data.
• NoSQL databases, like MongoDB or Cassandra, offer more flexible storage solutions, especially
for unstructured or semi-structured data.
3. Application Programming Interfaces:
• APIs allow applications to communicate with each other, often fetching real-time data.
• For instance, a visualization of global weather patterns might fetch
data in real time from a weather API.
Parsing Data:
• Data parsing is the process of converting or arranging data into a format or structure that's more
usable and insightful for our specific needs.
1. Data Cleaning
2. Data Transformation
3. Feature Engineering
4. Date and Time Parsing
Data Cleaning: The first and most fundamental step in data parsing is data cleaning.
Common Issues in Raw Data
• Missing values.
• Duplicate entries.
• Incorrect data types.
• Outliers or anomalies.
Importance of Data Cleaning:
• Eliminate errors and inconsistencies.
• Enhance accuracy of data analysis.
• Save time and resources in the long run.
Real-world Example
Scenario: Imagine an e-commerce company collecting data on customer purchases. Incorrect data
might show a product being purchased before it was even launched. Data cleaning will identify and
rectify such anomalies.
Outcome: Clean data ensures accurate customer insights and better business decisions.
Data Transformation: Once our data is clean, it often needs to be transformed into a more suitable format
or scale.
Data types might need conversion, like turning a number stored as a string into an actual numeric value.
Such transformations are essential for mathematical operations and comparisons.
Examples of Data Transformation
• Numeric scaling: Normalizing or standardizing values.
• Encoding: Converting categorical data into numeric format.
• Date formatting: Ensuring a consistent date format across data.
• Aggregation: Summarizing data into broader categories or metrics.
Real-world Example
Scenario: Consider a global e-commerce platform collecting data on customer purchases. Prices are stored
in various currencies. Before analyzing global sales, data needs to be transformed to a standard currency, like
USD. Clean data ensures correct exchange rates are applied consistently.
Outcome: Accurate insights on global sales and profit margins.
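The currency scenario can be sketched as a type conversion followed by a unit conversion. The exchange rates below are hypothetical placeholders, not real rates, and the record layout is assumed for illustration.

```python
# Hypothetical exchange rates to USD; real rates change and must be looked up.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "INR": 0.012}

# Raw purchases with prices stored as strings in mixed currencies.
purchases = [
    {"price": "100.00", "currency": "USD"},
    {"price": "50.00", "currency": "EUR"},
    {"price": "2000", "currency": "INR"},
]

def to_usd(record):
    """Convert the string price to a float, then to USD (rounded to cents)."""
    amount = float(record["price"])            # type conversion: str -> float
    return round(amount * RATES_TO_USD[record["currency"]], 2)

usd_prices = [to_usd(p) for p in purchases]
total_usd = round(sum(usd_prices), 2)          # aggregation after conversion
```

Summing the raw `price` values without this step would add dollars, euros, and rupees together, which is exactly the kind of error the transformation prevents.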
Feature Engineering: Feature engineering is the process of transforming raw data into features that better
represent the underlying patterns in the data, making them more suitable for modeling.
Parsing: Breaking down data into understandable and usable components.
Feature Engineering: Optimizing these components (features) for analysis or modelling.
Common Feature Engineering Techniques
• Normalization: Scaling features to a standard scale (e.g., 0 to 1) so they contribute equally to model
performance.
• One-hot encoding: Converting each category of a categorical variable into its own binary (0/1) column so
that ML algorithms can use it.
• Binning: Converting continuous variables into discrete bins to handle outliers or make data more
interpretable.
• Feature extraction: Transforming high-dimensional data into a lower-dimensional form, retaining the most
important information.
• Feature interaction: Creating new features by combining two or more features, capturing interactions
between them.
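Three of these techniques are small enough to sketch directly. The helper names, bin edges, and labels below are assumptions made for illustration; libraries such as scikit-learn provide production versions.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1.

    Assumes the values are not all identical (otherwise the range is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]   # values at or above the last edge fall in the final bin

ages = [20, 30, 40, 60]
scaled = min_max_normalize(ages)
categories, encoded = one_hot(["red", "blue", "red"])
age_groups = [bin_value(a, [30, 50], ["young", "middle", "senior"]) for a in ages]
```

Each helper returns a new list and leaves the input untouched, so the original values remain available for comparison.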
Date and Time Parsing: The process of converting textual data representing dates and times into a
standardized and usable format.
• Time-based data is common in various domains (e.g., finance, healthcare, logistics). Proper parsing ensures
it's utilized effectively for analysis and modeling.
• Proper date and time parsing is foundational for many data analysis tasks.
Common Formats & Challenges
Standard Date Formats: YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY, etc.
Time Formats: HH:MM:SS, HH:MM AM/PM, etc.
Time Zones: UTC, GMT, PST, etc.
Daylight Saving Time: Adjustments can shift times.
Ambiguities: 03/04/2020 (Is it March 4th or April 3rd?)
Real-world Scenarios & Case Studies
Healthcare: Parsing appointment and treatment dates for patient data analysis
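The ambiguity of 03/04/2020 can be demonstrated with `datetime.strptime` from Python's standard library: the same text parses to two different dates depending on the format string supplied. Which format is correct has to come from knowledge of the data source; the code cannot decide it.

```python
from datetime import datetime

raw = "03/04/2020"

# The same text yields two different dates depending on the assumed format.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # 3 April 2020
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # 4 March 2020

# Once the source format is known, store dates in unambiguous ISO 8601 form.
iso = as_day_first.isoformat()
print(as_day_first, as_month_first, iso)
```

Converting to the YYYY-MM-DD form early in the pipeline avoids carrying the ambiguity into later analysis steps.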
DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
Data Cleansing is a systematic approach to ensuring the accuracy, consistency, and reliability of data by
identifying and rectifying errors and inconsistencies.
Every step, from data collection to visualization, can introduce errors. Without cleansing, these errors can
propagate and mislead.
Why is Data Cleansing Essential for Data Visualization?
• Ensuring Accuracy
• Maintaining Consistency
• Building Trust
• Enhancing Efficiency
Common Data Quality Issues
• Missing Data
• Duplicate Data
• Inaccurate Data
• Outliers
• Inconsistent Formats
Steps in Data Cleansing
Data cleansing can be explained as a five-step process.
• Step 1. Data Auditing
• Step 2. Data Cleaning
• Step 3. Data Verification
• Step 4. Data Reporting
• Step 5. Data Maintenance
Step 1. Data Auditing:
• Data auditing is the initial phase where data is meticulously examined to detect any anomalies,
inconsistencies, or inaccuracies.
• This process typically employs both descriptive statistics and visual methods.
• Descriptive statistics, such as means, medians, modes, standard deviations, and variance, can provide
insights into the central tendencies and dispersions within the data.
• Visual methods involve representing the data graphically to detect outliers or patterns that might not be
evident in tabulated data.
Tools:
• Histograms
• Scatter Plots
• Frequency Distributions
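A first audit pass can be sketched with Python's standard `statistics` module. The sensor readings below are hypothetical, with one deliberately suspicious value included to show how it surfaces in the summary.

```python
import statistics
from collections import Counter

# Hypothetical sensor readings; 250.0 is a deliberately suspicious value.
readings = [21.5, 22.0, 21.8, 22.4, 250.0, 21.9, 22.0]

summary = {
    "mean": statistics.mean(readings),
    "median": statistics.median(readings),
    "stdev": statistics.stdev(readings),   # sample standard deviation
    "min": min(readings),
    "max": max(readings),
}

# Frequency distribution: how often each rounded value occurs.
frequencies = Counter(round(r) for r in readings)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print(dict(frequencies))
```

A mean far above the median, together with a maximum far from the rest of the values, is the kind of signal that prompts a closer look during auditing.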
Step 2. Data Cleaning:
Once data auditing has identified problems, data cleaning rectifies these issues. It is the actual process of
amending or removing data in a dataset that is incorrect, incomplete, improperly formatted, or duplicated.
Methods:
Imputation: Missing data can distort analysis. One way to address this is by filling in the gaps with imputed
values.
Outlier Treatment: Not all outliers are errors; some might be genuine extreme values. The key is to ascertain
whether to keep them (if they're valid), adjust them (if they've been recorded incorrectly), or remove them (if
they're anomalies without relevance).
Deduplication: Duplicate data can arise due to various reasons, such as data merge errors or multiple data
entries. This step involves identifying and removing these redundant records to ensure each data point is
unique and accurate.
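The three methods can be sketched on a tiny hypothetical dataset. Median imputation and the 1.5 × IQR rule are only one choice each among several; the record layout is assumed for illustration, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# Hypothetical records: (customer_id, age); None marks a missing value.
records = [(1, 34), (2, None), (3, 29), (3, 29), (4, 41), (5, 230)]

# Deduplication: keep the first occurrence of each identical record.
deduped = list(dict.fromkeys(records))

# Imputation: replace missing ages with the median of the known ages.
known = [age for _, age in deduped if age is not None]
median_age = statistics.median(known)
imputed = [(cid, age if age is not None else median_age) for cid, age in deduped]

# Outlier flagging with the 1.5 * IQR rule; flagged rows still need human review.
q1, _, q3 = statistics.quantiles([age for _, age in imputed], n=4, method="inclusive")
iqr = q3 - q1
outliers = [(cid, age) for cid, age in imputed
            if age < q1 - 1.5 * iqr or age > q3 + 1.5 * iqr]
```

The code only flags the outlier rather than deleting it, since deciding whether to keep, adjust, or remove it depends on whether the value is a genuine extreme or a recording error.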
Step 3. Data Verification:
After cleaning, it's vital to revisit the data to ensure no new errors were introduced during the cleaning
process. Additionally, data verification ensures that no essential information was inadvertently removed.
• Methods: This might involve cross-referencing with other trusted datasets, re-running the initial auditing
techniques, or even sampling and manually reviewing records.
Step 4. Data Reporting:
Documenting the cleansing process is crucial for several reasons. It maintains transparency, ensures
replicability, and builds trustworthiness.
• Methods: This involves creating a detailed report outlining what anomalies were identified, the
techniques used to address them, and any potential implications of the cleaning process. This
documentation can be invaluable for future reference, for stakeholders to understand the data's quality, or
for other data professionals working on the dataset.
Step 5. Data Maintenance:
Data doesn't remain static. As new data is added or as existing data evolves, it's crucial to ensure that it
maintains its quality.
• Methods: Implementing routine checks, validations, and periodic audits can help in identifying and
rectifying new issues promptly. Automated scripts, data quality software, or even manual reviews (for
smaller datasets) can be employed to maintain data cleanliness over time.
Data Cleansing Tools and Techniques
Manual Review:
• Pros: Full control, detailed understanding.
• Cons: Time-consuming, not feasible for large datasets.
Automated Cleaning Tools:
DataWrangler, OpenRefine, and Trifacta.
• Pros: Faster, can handle large datasets.
• Cons: May involve a learning curve, can miss domain-specific nuances.
Programming Libraries:
Pandas in Python, dplyr in R.
• Pros: Highly customizable, can integrate with other data processing steps.
• Cons: Requires programming knowledge.
Data Quality Software:
Tools like Talend and Informatica offer end-to-end data management solutions.
• Pros: Comprehensive, often have built-in workflows.
• Cons: Can be expensive, might have a steeper learning curve.
Reusable Dynamic Components using the General Update Pattern
Dynamic visualizations need a mechanism for keeping on-screen elements in step with changing data. The General
Update Pattern in D3.js (a popular data visualization library for JavaScript) provides such a mechanism,
facilitating the creation of dynamic and efficient visualizations.
What is the General Update Pattern?
The General Update Pattern is a methodology used in D3.js to handle data changes in a visualization. It breaks the
data-binding process into three distinct phases:
• Enter: New data items that have no element yet; an element must be added for each.
• Update: Existing elements that are still matched to a data item and may need their attributes refreshed.
• Exit: Elements that no longer have associated data and need to be removed.
The Importance of Reusability
• Efficiency: Avoids re-writing code, saving time and reducing errors.
• Consistency: Ensures visualizations maintain a consistent look and feel across different parts of an application or
different applications.
• Modularity: Allows for easier debugging, testing, and understanding, as each component can be developed and
maintained separately.
Implementing the General Update Pattern
• Binding Data
Start by binding a dataset to a selection. This binding process pairs data points from the dataset with DOM
elements in the visualization.
• Handling the Enter Selection
  • Create new elements for each data point that has no matching element.
  • Define attributes and styles for these new elements based on their data.
• Handling the Update Selection
  • Adjust attributes and styles of the existing elements to reflect any changes in their bound data.
• Handling the Exit Selection
  • Remove elements that no longer have data, or transition them out first, e.g., by fading or shrinking them.
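The enter, update, and exit sets can be modelled without a browser. The sketch below is a plain-Python model of the idea, not D3 code and not D3's API: a dictionary stands in for the elements on screen, and the `data_join` and `render` helpers are hypothetical names. In D3 itself the same split comes from binding data to a selection and then working with its enter and exit selections.

```python
def data_join(rendered, new_data, key):
    """Split a re-render into enter, update, and exit lists, matched by key.

    rendered: dict mapping key -> datum currently on screen
    new_data: list of datums that should be on screen after this render
    """
    incoming = {key(d): d for d in new_data}
    enter = [d for k, d in incoming.items() if k not in rendered]
    update = [d for k, d in incoming.items() if k in rendered]
    exit_ = [d for k, d in rendered.items() if k not in incoming]
    return enter, update, exit_

def render(rendered, new_data, key):
    """Apply one render pass in place and report what happened."""
    enter, update, exit_ = data_join(rendered, new_data, key)
    for d in exit_:
        del rendered[key(d)]        # exit: remove elements without data
    for d in enter + update:
        rendered[key(d)] = d        # enter: create; update: refresh attributes
    return enter, update, exit_

by_id = lambda d: d["id"]
screen = {}
first = render(screen, [{"id": "a", "v": 1}, {"id": "b", "v": 2}], by_id)
second = render(screen, [{"id": "b", "v": 5}, {"id": "c", "v": 3}], by_id)
```

On the first render everything enters; on the second, "c" enters, "b" updates with its new value, and "a" exits. Matching by a key rather than by position is what lets an element keep its identity across renders.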
Making Components Reusable
 Encapsulation
Wrap your visualization code inside a function. This function can then be called whenever you want to create a
new instance of the visualization.
 Customizable Properties
Allow users of your component to customize its properties. This could be achieved using getter-setter functions
for each property.
 Event Handling
For interactivity, allow users to define custom event handlers. This ensures that the component can respond to
user interactions in a way that suits the specific application it's being used in.
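These three ideas can be sketched in Python with a small class. This is an illustrative analogue rather than D3 code: the class name, default sizes, and the `click` and `describe` helpers are all hypothetical. Each getter-setter returns the current value when called without an argument and returns the component itself when setting, which allows calls to be chained.

```python
class ReusableScatterPlot:
    """Hypothetical reusable component: chainable getter-setters plus a handler."""

    def __init__(self):
        self._width = 400
        self._height = 300
        self._on_click = None

    def width(self, value=None):
        """With no argument, return the width; otherwise set it and return self."""
        if value is None:
            return self._width
        self._width = value
        return self

    def height(self, value=None):
        """With no argument, return the height; otherwise set it and return self."""
        if value is None:
            return self._height
        self._height = value
        return self

    def on_click(self, handler):
        """Register a caller-supplied handler for click events."""
        self._on_click = handler
        return self

    def click(self, datum):
        """Simulate a user click by forwarding the datum to the custom handler."""
        if self._on_click is not None:
            return self._on_click(datum)
        return None

    def describe(self, data):
        """Stand-in for drawing: report what would be rendered."""
        return f"{len(data)} points in a {self._width}x{self._height} plot"

clicked = []
plot = ReusableScatterPlot().width(640).height(480).on_click(clicked.append)
plot.click({"x": 1, "y": 2})
small = ReusableScatterPlot().width(200)   # a second, independent instance
```

Because every instance carries its own settings and handler, the same component can be configured differently in different parts of an application.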
REUSABLE SCATTER PLOT
A scatter plot, one of the most versatile visualization tools, displays individual data points on a two-dimensional
plane.
Components of a Scatter Plot
1. Axes and Scale
 X and Y Axes: Represent the domains of the two variables being compared.
 Scale: The mechanism that translates data values into pixel values. Common scales include linear,
logarithmic, and time-based.
2. Data Points
 Typically represented as circles, though other symbols can be used.
 Position determined by the data value for each variable.
3. Optional Components
 Gridlines: Enhance readability.
 Labels: Provide context or additional data about specific points.
 Trend Line: Indicates a general direction of data points.
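To make the idea of a scale concrete, here is a minimal sketch of a linear scale that maps a data domain onto a pixel range. The function name `linear_scale` and the domain and range values are illustrative assumptions.

```python
def linear_scale(domain, range_):
    """Return a function mapping values in `domain` linearly onto `range_`."""
    d0, d1 = domain
    r0, r1 = range_
    if d0 == d1:
        raise ValueError("domain must span more than a single value")

    def scale(value):
        t = (value - d0) / (d1 - d0)   # position of value within the domain, 0..1
        return r0 + t * (r1 - r0)      # same relative position within the range

    return scale


x = linear_scale((0, 100), (0, 500))
y = linear_scale((0, 50), (300, 0))    # inverted range: pixel y grows downward

print(x(25))   # 125.0
print(y(50))   # 0.0
print(y(0))    # 300.0
```

The inverted range for the y scale reflects the common screen convention that the origin is at the top-left corner, so larger data values must map to smaller pixel values.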
REUSABLE SCATTER PLOT
 A reusable scatter plot is a type of data visualization that displays individual data points as markers on a
graph.
 The term "reusable" implies that the scatter plot is designed in a way that allows it to be easily used and
applied to different datasets or situations without requiring extensive modification.
 In a scatter plot, each data point is represented by a marker, typically a dot, and the position of the dot is
determined by the values of two variables (usually one on the x-axis and one on the y-axis).
 This allows for a visual representation of the relationship, if any, between the two variables.
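One way to read "reusable" is that the plotting logic never names a specific column: the caller passes accessor functions that pick out the x and y values. The sketch below computes marker positions only (no drawing); the function name `scatter_points`, the default size, and both sample datasets are hypothetical.

```python
def scatter_points(data, x_value, y_value, width=400, height=300):
    """Compute pixel positions for each data point.

    x_value and y_value are accessor functions, so the same code works
    for any dataset shape.
    """
    xs = [x_value(d) for d in data]
    ys = [y_value(d) for d in data]

    def to_pixels(v, lo, hi, size):
        # Map v from the data extent [lo, hi] onto [0, size].
        return 0.0 if hi == lo else (v - lo) / (hi - lo) * size

    return [
        (to_pixels(x, min(xs), max(xs), width),
         height - to_pixels(y, min(ys), max(ys), height))  # flip y for screen space
        for x, y in zip(xs, ys)
    ]


# Dataset 1: dictionaries with named fields.
people = [{"height": 150, "weight": 50}, {"height": 180, "weight": 80}]
print(scatter_points(people, lambda d: d["height"], lambda d: d["weight"]))
# [(0.0, 300.0), (400.0, 0.0)]

# Dataset 2: plain tuples, same function, different accessors.
pairs = [(1, 10), (2, 20), (3, 30)]
print(scatter_points(pairs, lambda d: d[0], lambda d: d[1]))
# [(0.0, 300.0), (200.0, 150.0), (400.0, 0.0)]
```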
COMMON VISUALIZATION IDIOMS
Bar Chart
 A bar chart represents data using vertical or horizontal rectangular bars.
 The length of each bar is proportional to the value it represents.
 Bar charts are also known as bar graphs. They have three major characteristics:
 Bar charts are used to compare data among different groups.
 Bar charts relate two axes: one axis represents the categories and the other represents the
discrete values.
 Bar charts show major changes in the data over a period of time.
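The proportionality rule can be illustrated with a small text-based sketch. The function name `bar_lengths`, the category names, and the values are made up for the example, and positive values are assumed.

```python
def bar_lengths(values, max_length=20):
    """Scale each value so that the largest bar has length max_length."""
    largest = max(values.values())
    if largest <= 0:
        raise ValueError("this sketch assumes at least one positive value")
    return {name: round(v / largest * max_length) for name, v in values.items()}


# Hypothetical category totals.
sales = {"North": 10, "South": 20, "East": 40}
lengths = bar_lengths(sales)

for name, length in lengths.items():
    print(f"{name:<6}{'#' * length} {sales[name]}")
```

Doubling a value doubles the bar length, which is what lets a reader compare categories by eye.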
COMMON VISUALIZATION IDIOMS
Types of Bar Chart
The grouped bar chart is also referred to as the clustered bar chart (graph).
The stacked bar chart is also known as the composite bar chart.
COMMON VISUALIZATION IDIOMS
Pie Chart
 A pie chart is a pictorial representation of data in the form of a circular chart or pie where the slices of
the pie show the relative size of each part of the data.
 A pie chart needs a categorical variable together with a numerical value for each category.
Example: The whole pie represents a value of 100. It is divided into 10 slices or sectors. The various colors
represent the ingredients used to prepare the cake. What would be the exact quantity of each of the
ingredients represented in specific colors in the following pie chart?
Quantity of Flour: 30
Quantity of Sugar: 20
Quantity of Egg: 40
Quantity of Butter: 10
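The quantities in this example can be turned into slice sizes with a short calculation: each ingredient's share of the total, expressed as an angle out of 360 degrees and as a number of the 10 sectors. The function name `pie_slices` is a hypothetical helper; the quantities are the ones listed above.

```python
quantities = {"Flour": 30, "Sugar": 20, "Egg": 40, "Butter": 10}


def pie_slices(values, sectors=10):
    """Return the angle and the number of equal sectors for each category."""
    total = sum(values.values())
    return {
        name: {
            "angle": value * 360 / total,        # degrees of the full circle
            "sectors": value * sectors / total,  # how many of the equal sectors
        }
        for name, value in values.items()
    }


slices = pie_slices(quantities)
for name, s in slices.items():
    print(f"{name}: {s['angle']:.0f} degrees, {s['sectors']:.0f} of 10 sectors")
# Flour: 108 degrees, 3 of 10 sectors
# Sugar: 72 degrees, 2 of 10 sectors
# Egg: 144 degrees, 4 of 10 sectors
# Butter: 36 degrees, 1 of 10 sectors
```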
COMMON VISUALIZATION IDIOMS
Line Graph
 Line graphs, also called line charts, are used to represent quantitative data collected about a specific subject
over a specific time interval.
 All the data points are connected by a line.
 Data points represent the observations collected from a survey or research study.
 The line graph has an x-axis and a y-axis.
COMMON VISUALIZATION IDIOMS
Line Graph:
Shows continuous data over a period of time, set against a common scale, with individual data points connected
together. It is ideal for showing growth rates or trends at even intervals.
Scatter Plot
It works best when comparing large numbers of data points without regard to time. This tool is very
powerful when we are trying to show the relationship between two variables (x and y-axis), for example, a
person's weight and height.
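The relationship a scatter plot reveals visually can also be summarized as a single number. The sketch below computes the Pearson correlation coefficient by hand; the function name `pearson` and the height and weight values are invented for illustration and are not real measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 65, 72, 80]

r = pearson(heights, weights)
print(f"r = {r:.3f}")  # close to 1: the points lie near a rising straight line
```

A value near +1 or -1 corresponds to points that fall close to a straight line in the scatter plot, while a value near 0 corresponds to a cloud with no linear trend.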
COMMON VISUALIZATION IDIOMS
AREA CHART
 An area chart, also known as a mountain chart, is a data visualization type that combines the
appearance of a line chart and a bar chart.
 It is commonly used to show how numerical values change based on a second variable, usually a time
period.
COMMON VISUALIZATION IDIOMS
TYPES OF AREA CHART
 Basic area chart: shows a single data series, e.g., tracking the number of weekly profile views on a business's
LinkedIn page.
 Overlapping area chart: compares two or more data groups based on a specific variable.
 Stacked area chart (cumulative area chart): goes beyond comparing values to show how each group
contributes to a total.
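What distinguishes a stacked area chart is that each series is drawn on top of the running total of the series before it. A minimal sketch of that stacking step follows; the function name `stack` and the series values are hypothetical, and all series are assumed to have the same length.

```python
def stack(series):
    """Return, for each series, a list of (lower, upper) bounds per time step."""
    length = len(next(iter(series.values())))
    baseline = [0] * length
    bands = {}
    for name, values in series.items():
        top = [b + v for b, v in zip(baseline, values)]
        bands[name] = list(zip(baseline, top))  # area is filled between the two
        baseline = top                          # next series starts where this ends
    return bands


# Hypothetical weekly values for two groups.
views = {"A": [1, 2, 3], "B": [2, 2, 2]}
bands = stack(views)
print(bands["A"])  # [(0, 1), (0, 2), (0, 3)]
print(bands["B"])  # [(1, 3), (2, 4), (3, 5)]
```

The upper bound of the last series is the total at each time step, which is why the chart shows both the individual contributions and the overall sum.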
DIFFERENT CRITERIA TO CHOOSE CHARTS
 Choosing the right chart type is crucial in data visualization.
 It ensures that data is presented clearly, accurately, and effectively.
 Selecting an inappropriate chart can lead to misinterpretation of data or obscure critical insights.
Nature of Data
1. Qualitative vs. Quantitative
Qualitative Data (Categorical): Data that can be divided into categories.
Suitable Charts: Pie charts, bar charts, treemaps.
Quantitative Data (Numerical): Data that can be measured.
Suitable Charts: Line charts, scatter plots, histograms.
2. Time Series vs. Non-Time Series
Time Series Data: Data points indexed in time order.
Suitable Charts: Line charts, area charts, stacked area charts.
Non-Time Series Data: Data without a chronological order.
Suitable Charts: Bar charts, pie charts, scatter plots.

DIFFERENT CRITERIA TO CHOOSE CHARTS
Number of Variables
1. Univariate Data
Data focusing on a single variable.
Suitable Charts: Histograms, pie charts, line charts (for time series).
2. Bivariate Data
Data exploring the relationship between two variables.
Suitable Charts: Scatter plots, line charts, bar charts.
3. Multivariate Data
Data involving more than two variables.
Suitable Charts: Scatter plot matrices, parallel coordinate plots, radar charts.
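These rules can be written down as a small lookup. The sketch below classifies a dataset by how many variables it has and returns the chart types listed above; the function names and the sample variable names are assumptions for illustration.

```python
SUGGESTIONS = {
    "univariate": ["histogram", "pie chart", "line chart (for time series)"],
    "bivariate": ["scatter plot", "line chart", "bar chart"],
    "multivariate": ["scatter plot matrix", "parallel coordinate plot", "radar chart"],
}


def classify(variables):
    """Classify a dataset by the number of variables it contains."""
    count = len(variables)
    if count == 0:
        raise ValueError("at least one variable is required")
    if count == 1:
        return "univariate"
    if count == 2:
        return "bivariate"
    return "multivariate"


def suggest_charts(variables):
    """Return the chart types suggested for the given variables."""
    return SUGGESTIONS[classify(variables)]


print(classify(["height"]))                    # univariate
print(suggest_charts(["height", "weight"]))    # bivariate suggestions
print(classify(["height", "weight", "age"]))   # multivariate
```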
DIFFERENT CRITERIA TO CHOOSE CHARTS
Relationships and Comparisons
1. Comparing Categories: Viewing the difference in values across different categories.
Suitable Charts: Bar charts, column charts, dot plots.
2. Distribution of Data: Observing the spread and shape of a dataset.
Suitable Charts: Histograms, box plots, density plots.
3. Trend Over Time: Analyzing patterns or changes over a period.
Suitable Charts: Line charts, area charts, waterfall charts.
4. Correlation: Examining the relationship between two or more variables.
Suitable Charts: Scatter plots, bubble charts.
DIFFERENT CRITERIA TO CHOOSE CHARTS
Data Volume and Complexity
1. Small Data Sets: Emphasize individual data points or fine details.
Suitable Charts: Dot plots, bar charts.
2. Large Data Sets: Offer a general overview or highlight trends.
Suitable Charts: Line charts, heatmaps, histograms.
3. Hierarchical Data: Data with inherent structure or nested categories.
Suitable Charts: Treemaps, dendrograms, sunburst charts.
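For large data sets, a histogram gives an overview by replacing individual values with counts per bin. The following is a minimal sketch of equal-width binning; the function name `histogram` is hypothetical and the input is synthetic (the integers 0 to 99).

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    if bins < 1 or not values:
        raise ValueError("need at least one value and one bin")
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1          # avoid zero width when all values match
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts


values = list(range(100))       # synthetic data
counts = histogram(values, 4)
print(counts)  # [25, 25, 25, 25]
```

One hundred points are reduced to four numbers here; the same reduction is what keeps a chart readable when the input has millions of rows.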
DIFFERENT CRITERIA TO CHOOSE CHARTS
Audience and Context
1. Level of Expertise
General Audience: Opt for simpler, more intuitive charts.
Expert Audience: Can utilize more complex visualizations.
2. Presentation Medium
Digital Displays: Interactive visualizations like dynamic dashboards.
Print: Static visualizations with high clarity.
3. Storytelling vs. Exploration
Storytelling: Use charts that drive home a specific point or narrative.
Exploration: Use charts that allow users to delve into the data and discover insights.
DIFFERENT CRITERIA TO CHOOSE CHARTS
Aesthetics and Design
1. Clarity and Simplicity
Avoid clutter, prioritize readability.
2. Color Considerations
Ensure color choices are meaningful and accessible.
3. Consistency
If presenting multiple charts, maintain a consistent design language.

Data Visualization

  • 1. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 1 U21CS303 –Data Visualization
  • 2. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 2 UNIT II : SHAPES OF DATAAND VISUALIZATION IDIOMS Input for Visualization – Data and Tasks – Loading and Parsing Data – Data Cleansing and its importance – Reusable Dynamic Components using the General Update Pattern – Reusable Scatter Plot – Common Visualization Idioms – Bar Chart – Vertical Pie Chart – Horizontal Pie Chart – Coxcomb Plot Line Chart – Area Chart – Criteria to choose charts CO2: Apply visualization techniques for various data analysis tasks
  • 3. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 3 INPUT FOR VISUALIZATION (DATAAND TASKS)  Data is the core of any visualization. Understanding the nature of the data is vital before attempting to visualize it.  Tasks refer to the goals or objectives of the visualization. Understanding the task at hand helps in choosing the right visualization technique. Nature of Data: 1. Numerical Data a. Continuous Data b. Discrete Data 2. Categorical Data 3. Ordinal Data 4. Interval Data
  • 4. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 4 INPUT FOR VISUALIZATION (DATAAND TASKS) Nature of Data: 1. Numerical Data Numerical data represents quantities or measurements. a. Continuous Data Data that can take any value within a given range. Examples: Height of individuals (someone can be 5.7 feet tall or 5.71 feet tall), Temperature measurements. Visualization Techniques: Histograms, line graphs, scatter plots. b. Discrete Data Data that can take only specific, separate values. Examples: Number of cars in a household (can't have 2.5 cars), Number of students in a class. Visualization Techniques: Bar charts, pie charts.
  • 5. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 5 INPUT FOR VISUALIZATION (DATAAND TASKS) 2. Categorical Data Categorical data represents categories or labels. Examples: Colors (red, blue, green), Genders (male, female). Visualization Techniques: Bar charts, pie charts, stacked bars. 3. Ordinal Data Ordinal data represents categories with a specific order or rank. Examples: Education level (high school, bachelor's, master's, PhD). Survey responses (strongly disagree, disagree, neutral, agree, strongly agree). Visualization Techniques: Bar charts (ordered), line graphs.
  • 6. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 6 INPUT FOR VISUALIZATION (DATAAND TASKS) 4. Interval Data Interval data is similar to ordinal data but with consistent differences between values but lacks a true zero point. Examples:  Temperature in Celsius (0°C doesn't mean absence of temperature).  IQ scores. Visualization Techniques: Line graphs, scatter plots.
  • 7. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 7 INPUT FOR VISUALIZATION (DATAAND TASKS) Hands-on Exercise:
  • 8. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 8 INPUT FOR VISUALIZATION (DATAAND TASKS) The amount of data impacts the choice of visualization. Large datasets might require aggregation or sampling. Small datasets can be visualized in detail, while large datasets often require techniques to simplify, aggregate, or sample the data to make it understandable. 1.Small Datasets 2. Large Datasets 3. Techniques for Handling Large Datasets A. Aggregation B. Sampling Volume of Data:
  • 9. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 9 INPUT FOR VISUALIZATION (DATAAND TASKS) Volume of Data:
  • 10. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 10 INPUT FOR VISUALIZATION (DATAAND TASKS) Volume of Data: Techniques for Handling Large Datasets:
  • 11. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 11 INPUT FOR VISUALIZATION (DATAAND TASKS) Sources of Data:  In the modern digital era, data is abundant and can be sourced from a variety of channels.  Data can come from  Databases  Spreadsheets  APIs  Web scraping  Manual input 1. Databases A structured collection of data, often managed by a database management system (DBMS). Databases are typically used by organizations to store vast amounts of transactional or operational data. Characteristics:  Structured, often relational in nature.  Large volumes.  Can be queried using SQL or other query languages.
  • 12. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 12 INPUT FOR VISUALIZATION (DATAAND TASKS) 2. Spreadsheets Files that allow users to organize, format, and calculate data using a grid of cells. Examples: Microsoft Excel and Google Sheets. Characteristics:  Tabular format.  Easily accessible and modifiable.  Contains formulas and calculations. 3. APIs (Application Programming Interfaces) A set of rules and protocols that allow different software entities to communicate with each other. APIs often serve as gateways to retrieve data from web servers or platforms.  Characteristics:  Automated data retrieval.  Real-time or periodic updates.  Structured, often in XML or JSON format.
  • 13. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 13 INPUT FOR VISUALIZATION (DATAAND TASKS) 4. Web Scraping The process of extracting data from websites. This is done using scripts or bots that fetch web pages and extract necessary information. Characteristics:  Data can be unstructured or semi-structured.  Prone to changes based on website layout updates. 5. Manual Input Data entry done by individuals manually, often into systems, applications, or forms. Characteristics:  Time-consuming.  Highly susceptible to human error.
  • 14. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 14 INPUT FOR VISUALIZATION (DATAAND TASKS) Data Quality:  In any data-driven decision-making process, the quality of the data used is paramount.  Poor data quality can lead to incorrect conclusions, flawed business decisions, and resource wastage. 1. Accuracy 2. Consistency 3. Completeness 4. Reliability 5. Timeliness of data
  • 15. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 15 INPUT FOR VISUALIZATION (DATAAND TASKS) 1. Accuracy Definition: Accuracy refers to the closeness of a data value to its true or actual value. Key Points: Errors: Result from incorrect data entry, faulty sensors, or misinformation. Validation: Use validation rules or checks to ensure data accuracy. Verification: Cross-reference with trusted data sources. Real-world Scenario: Imagine an e-commerce platform recording product prices. If a product's price is incorrectly listed as $10 instead of $100, it could lead to significant revenue loss.
  • 16. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 16 INPUT FOR VISUALIZATION (DATAAND TASKS) 2. Consistency Definition: Consistency ensures that data is uniform and consistent across all data sources and does not contradict other recorded data. Key Points: Standardization: Use uniform units, formats, and conventions. Deduplication: Ensure no repetitive or contradicting records exist. Harmonization: Make sure data from different sources aligns well when integrated. Real-world Scenario: Consider a multinational company's employee database. If one branch records dates in "DD/MM/YYYY" and another in "MM/DD/YYYY", inconsistencies can arise, causing confusion
  • 17. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 17 INPUT FOR VISUALIZATION (DATAAND TASKS) 3. Completeness Definition: Completeness ensures that no necessary data is missing. Key Points: Null Values: Identify and address missing data points. Default Values: Ensure they don't misrepresent actual data. Data Imputation: Techniques can be used to fill in missing data based on existing data patterns. Real-world Scenario: In a medical trial, if patient outcomes are missing, the results of the trial could be skewed or inconclusive.
  • 18. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 18 INPUT FOR VISUALIZATION (DATAAND TASKS) 4. Reliability Definition: Reliability pertains to the trustworthiness and consistency of the data over time. Key Points: Source: Data from reputable sources is often more reliable. Duplication: Repeated or duplicate entries can compromise reliability. Audit Trails: Maintain logs to trace back data changes or updates. Real-world Scenario: If a weather sensor occasionally malfunctions and records extreme values, the overall climate data becomes less reliable.
  • 19. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 19 INPUT FOR VISUALIZATION (DATAAND TASKS) 5. Timeliness Definition: Timeliness relates to the age of the data and its relevance to the current scenario. Key Points: Update Frequency: Ensure data is updated at appropriate intervals. Real-time Data: Necessary for certain applications, like stock trading. Historical Data: While not "timely", it's crucial for trend analysis. Real-world Scenario: In stock market analytics, using outdated stock prices can result in financial misjudgements.
  • 20. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 20 INPUT FOR VISUALIZATION (DATAAND TASKS) Tasks  Tasks refer to the goals or objectives of the visualization.  Understanding the task at hand helps in choosing the right visualization technique.  Some common tasks include:  Comparison: Comparing two or more variables to see how they relate.  Trends: Observing data over a period of time.  Distribution: Understanding the spread and distribution of data.  Relationship: Identifying correlation between variables.  Geospatial: Mapping data to geographical locations.
  • 21. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 21 INPUT FOR VISUALIZATION (DATAAND TASKS) 1. Comparison Definition: Comparison involves assessing two or more variables to understand their similarities, differences, or relationships. Key Points:  Helps in benchmarking and relative judgment.  Can be used for both categorical and numerical data. Visualization Techniques:  Bar Charts: Useful for comparing quantities across categories.  Grouped or Stacked Bar Charts: Compare multiple variables across categories.  Dot Plots: Good for comparing individual data points. Real-world Scenario: Imagine a company wants to compare sales of different products across regions. A grouped bar chart can show each product's sales in each region, allowing for a clear comparison.
  • 22. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 22 INPUT FOR VISUALIZATION (DATAAND TASKS) 2. Trends Definition: Trends involve observing data changes over a continuous interval, usually time. Key Points:  Reveals patterns, fluctuations, and cycles.  Helps in forecasting or predicting future values. Visualization Techniques:  Line Charts: Ideal for showing trends over time.  Area Charts: Similar to line charts but with the area under the curve shaded. Real-world Scenario: Consider stock market data. Investors often use line charts to track the price of a stock over time and identify its upward or downward trend.
  • 23. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 23 INPUT FOR VISUALIZATION (DATAAND TASKS) 3. Distribution Definition: Distribution refers to understanding how data is spread across different categories or ranges. Key Points:  Helps identify central tendencies, spread, and outliers.  Reveals the overall shape of data distribution. Visualization Techniques:  Histograms: Shows the distribution of a single variable.  Box Plots: Highlights median, quartiles, and potential outliers.  Violin Plots: Combines aspects of box plots and density plots. Real-world Scenario: In a manufacturing unit, understanding the distribution of product sizes can help identify anomalies or deviations from standards.
  • 24. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 24 INPUT FOR VISUALIZATION (DATAAND TASKS) 4. Relationship Definition: This task focuses on determining if, how, and to what extent two or more variables are related. Key Points:  Helps in identifying correlations or potential causations.  Reveals how variables interact with each other. Visualization Techniques:  Scatter Plots: Plot two variables to see their relationship.  Heat Maps: Great for showing correlations in larger datasets.  Pair Plots: Visualize pairwise relationships in a dataset. Real-world Scenario: In healthcare, researchers might use scatter plots to see if there's a relationship between calorie intake and weight gain among participants.
  • 25. 2 January 2024 KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India 25 INPUT FOR VISUALIZATION (DATAAND TASKS) 5. Geospatial Definition: Geospatial visualization pertains to data that has a geographical or spatial component. Key Points:  Maps real-world data to geographical locations.  Helps in spatial analysis and location-based insights. Visualization Techniques:  Choropleth Maps: Regions colored based on a variable.  Point Maps: Individual data points plotted on a map.  Heat Maps: Shows density or intensity of data points on a map. Real-world Scenario: During an epidemic, health agencies might use choropleth maps to show the spread and intensity of the disease across regions.
  • 26. INPUT FOR VISUALIZATION (DATA AND TASKS)
    Shapes of Data
      • In data analysis and visualization, the format or "shape" in which data is presented plays a crucial role in determining how it can be processed, analyzed, and visualized.
      • Data can be represented in various shapes or structures. The most common ones include:
          • Tabular Data: Data in rows and columns, like a spreadsheet.
          • Hierarchical Data: Data in tree structures, like JSON.
          • Network Data: Data representing connections, like social networks.
          • Geospatial Data: Data with geographic information, like maps.
  • 27. INPUT FOR VISUALIZATION (DATA AND TASKS)
    1. Tabular Data
    Definition: Tabular data is structured data that's organized in rows and columns, much like a table or spreadsheet.
    Characteristics:
      • Each row typically represents a single record or entity.
      • Each column represents a specific attribute or field of the entity.
      • Contains headers to label each column.
    Visualization Techniques:
      • Bar charts, line charts, and scatter plots for various columns.
      • Heatmaps for visualizing patterns within the table.
    Real-world Example: A company's sales record where each row represents a transaction, and columns may include date, product ID, quantity sold, and total price.
  • 28. INPUT FOR VISUALIZATION (DATA AND TASKS)
    2. Hierarchical Data
    Definition: Hierarchical data represents relationships in a tree-like structure, where entities can have parent-child relationships.
    Characteristics:
      • Data is nested.
      • Each entity (except the root) has one parent and can have multiple children.
      • Commonly represented in formats like JSON or XML.
    Visualization Techniques:
      • Tree diagrams.
      • Dendrograms in hierarchical clustering.
    Real-world Example: A company's organizational chart where the CEO is at the top, followed by VPs, managers, and individual contributors in a tree structure.
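To make the nesting concrete, here is a small sketch of the organizational-chart example as a nested Python structure (the same shape JSON would give), with a recursive walk that recovers each entity's depth in the tree. The role names and the `walk` helper are illustrative assumptions.

```python
# Hypothetical org chart: every node has a name and a list of children.
org_chart = {
    "name": "CEO",
    "children": [
        {"name": "VP Engineering", "children": [
            {"name": "Engineering Manager", "children": []},
        ]},
        {"name": "VP Sales", "children": []},
    ],
}

def walk(node, depth=0):
    """Yield (name, depth) pairs in depth-first order."""
    yield node["name"], depth
    for child in node["children"]:
        yield from walk(child, depth + 1)

rows = list(walk(org_chart))
for name, depth in rows:
    print("  " * depth + name)  # indentation mirrors the tree level
```

A tree diagram layout does essentially this traversal, then assigns each node a position from its depth and its order among siblings.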
  • 29. INPUT FOR VISUALIZATION (DATA AND TASKS)
    3. Network Data
    Definition: Network data, also known as graph data, represents entities (nodes) and the relationships or connections (edges) between them.
    Characteristics:
      • Nodes represent entities.
      • Edges represent relationships.
      • Can be directed (with a starting and ending point) or undirected.
    Visualization Techniques:
      • Network graphs.
      • Force-directed layouts.
      • Matrix representation.
    Real-world Example: A social media platform where individuals are nodes and their friendships are edges connecting them.
  • 30. INPUT FOR VISUALIZATION (DATA AND TASKS)
    4. Geospatial Data
    Definition: Geospatial data pertains to information associated with a specific geographic location.
    Characteristics:
      • Contains spatial coordinates (like latitude and longitude).
      • Can also include altitude, distance, or other geographical metrics.
      • Often used in conjunction with maps.
    Visualization Techniques:
      • Choropleth maps (regions colored based on data).
      • Point maps (points placed based on coordinates).
      • Heatmaps (density visualization).
    Real-world Example: A city's public health department mapping reported cases of a disease to identify hotspots.
  • 31. LOADING AND PARSING DATA
      • Loading data involves importing data from its source into the environment where the visualization will be created.
      • Once data is loaded, it might not be in a ready-to-use state.
      • Parsing involves converting or arranging data into a usable format.
    Loading Data
      • Loading data is a foundational step in the process of data visualization.
      • Common data sources fall into three groups:
          1. File Formats
          2. Databases
          3. Application Programming Interfaces (APIs)
  • 32. LOADING AND PARSING DATA
    1. File Formats: One of the primary aspects to consider is the format of the data.
      • Comma-Separated Values (CSV)
      • Microsoft Excel
      • JavaScript Object Notation (JSON), a hierarchical, key-value pair format
      • XML, another hierarchical format, used in various applications from web services to document storage
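As a minimal illustration of loading two of these formats, the sketch below reads CSV and JSON with Python's standard library. The data is held in in-memory strings (invented for this example) so the snippet is self-contained; with real files, `open(path)` would take the place of `io.StringIO`.

```python
import csv
import io
import json

# Hypothetical CSV content: a header row followed by two records.
csv_text = "date,product_id,quantity\n2024-01-02,P100,3\n2024-01-03,P200,1\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical JSON content: a small hierarchy.
json_text = '{"name": "root", "children": [{"name": "leaf"}]}'
tree = json.loads(json_text)

print(rows[0])                      # every CSV field arrives as a string
print(tree["children"][0]["name"])  # JSON keeps its nested structure
```

Note that `csv.DictReader` returns every value as text, so `quantity` is `"3"` rather than `3`. That gap between loaded and usable data is what the parsing step addresses.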
  • 33. LOADING AND PARSING DATA
    2. Databases: Beyond file formats, databases are a major source of data.
      • Relational databases organize data into tables, allowing for complex queries that can join, filter, and aggregate data.
      • NoSQL databases, like MongoDB or Cassandra, offer more flexible storage solutions, especially for unstructured or semi-structured data.
    3. Application Programming Interfaces (APIs)
      • APIs allow applications to communicate with each other, often fetching real-time data.
      • For instance, a visualization of global weather patterns might fetch its data in real time from a weather API.
  • 34. LOADING AND PARSING DATA
    Parsing Data:
      • Data parsing is the process of converting or arranging data into a format or structure that's more usable and insightful for our specific needs.
          1. Data Cleaning
          2. Data Transformation
          3. Feature Engineering
          4. Date and Time Parsing
  • 35. LOADING AND PARSING DATA
    Data Cleaning: The first and most fundamental step in data parsing is data cleaning.
    Common Issues in Raw Data
      • Missing values.
      • Duplicate entries.
      • Incorrect data types.
      • Outliers or anomalies.
    Importance of Data Cleaning:
      • Eliminates errors and inconsistencies.
      • Enhances the accuracy of data analysis.
      • Saves time and resources in the long run.
    Real-world Example
    Scenario: Imagine an e-commerce company collecting data on customer purchases. Incorrect data might show a product being purchased before it was even launched. Data cleaning will identify and rectify such anomalies.
    Outcome: Clean data ensures accurate customer insights and better business decisions.
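The e-commerce scenario can be sketched as a simple validation pass. Everything below (field names, product ID, dates) is hypothetical; the point is only to show two checks from the slide: dropping duplicate entries and flagging purchases dated before the product's launch.

```python
from datetime import date

# Hypothetical reference data: when each product launched.
launch_dates = {"P100": date(2023, 6, 1)}

# Hypothetical raw purchase records.
purchases = [
    {"order_id": 1, "product_id": "P100", "purchased": date(2023, 7, 15)},
    {"order_id": 2, "product_id": "P100", "purchased": date(2023, 5, 20)},  # before launch
    {"order_id": 1, "product_id": "P100", "purchased": date(2023, 7, 15)},  # duplicate
]

seen_ids = set()
clean, rejected = [], []
for record in purchases:
    if record["order_id"] in seen_ids:
        continue  # duplicate entry: keep only the first occurrence
    seen_ids.add(record["order_id"])
    if record["purchased"] < launch_dates[record["product_id"]]:
        rejected.append(record)  # impossible date: set aside for review
    else:
        clean.append(record)

print(len(clean), "clean,", len(rejected), "rejected")
```

Setting suspicious records aside rather than silently deleting them keeps the decision reviewable.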
  • 36. LOADING AND PARSING DATA
    Data Transformation: Once our data is clean, it often needs to be transformed into a more suitable format or scale. Data types might need conversion, like turning a number stored as a string into an actual numeric value. Such transformations are essential for mathematical operations and comparisons.
    Examples of Data Transformation
      • Numeric scaling: normalizing or standardizing values.
      • Encoding: converting categorical data into numeric format.
      • Date formatting: ensuring a consistent date format across data.
      • Aggregation: summarizing data into broader categories or metrics.
    Real-world Example
    Scenario: Consider a global e-commerce platform collecting data on customer purchases. Prices are stored in various currencies. Before analyzing global sales, the data needs to be transformed to a standard currency, like USD, with exchange rates applied consistently.
    Outcome: Accurate insights on global sales and profit margins.
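A short sketch of the currency scenario follows. It combines two transformations from the slide: converting a number stored as a string into a numeric value, and rescaling prices to one currency. The exchange rates are placeholders chosen for illustration, not real market rates, and `to_usd` is a hypothetical helper.

```python
# Placeholder exchange rates (NOT real market values).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "INR": 0.012}

# Hypothetical orders: prices arrive as strings in mixed currencies.
orders = [
    {"price": "100.00", "currency": "EUR"},
    {"price": "5000", "currency": "INR"},
    {"price": "20.50", "currency": "USD"},
]

def to_usd(order):
    """Convert one order's price to USD, rounded to cents."""
    amount = float(order["price"])  # type conversion: string -> number
    return round(amount * RATES_TO_USD[order["currency"]], 2)

usd_prices = [to_usd(order) for order in orders]
print(usd_prices, "total:", round(sum(usd_prices), 2))
```

Keeping the rates in one table means every record is converted with the same rate.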
  • 37. LOADING AND PARSING DATA
    Feature Engineering: Feature engineering is the process of transforming raw data into features that better represent the underlying patterns in the data, making them more suitable for modeling.
      • Parsing: Breaking down data into understandable and usable components.
      • Feature Engineering: Optimizing these components (features) for analysis or modeling.
    Common Feature Engineering Techniques
      • Normalization: Scaling features to a standard scale (e.g., 0 to 1) so they contribute equally to model performance.
      • One-hot encoding: Converting categorical variables into a numeric format that can be provided to ML algorithms.
      • Binning: Converting continuous variables into discrete bins to handle outliers or make data more interpretable.
      • Feature extraction: Transforming high-dimensional data into a lower-dimensional form, retaining the most important information.
      • Feature interaction: Creating new features by combining two or more features, capturing interactions between them.
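Three of the techniques above are simple enough to sketch in a few lines of plain Python. The helper names (`min_max`, `one_hot`, `bin_value`), the sample values, and the bin edges are all assumptions made for this illustration; libraries such as pandas or scikit-learn provide production versions.

```python
def min_max(values):
    """Normalization: rescale values linearly onto the range 0 to 1.
    Assumes the values are not all identical (otherwise hi - lo is 0)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """One-hot encoding: one 0/1 column per category, in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def bin_value(value, edges, labels):
    """Binning: return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # at or above the last edge

ages = [18, 30, 42, 66]
scaled = min_max(ages)
encoded = one_hot(["red", "green", "red"])  # columns: green, red
groups = [bin_value(a, edges=[25, 60], labels=["young", "adult", "senior"])
          for a in ages]
print(scaled, encoded, groups)
```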
  • 38. LOADING AND PARSING DATA
    Date and Time Parsing: The process of converting textual data representing dates and times into a standardized and usable format.
      • Time-based data is common in various domains (e.g., finance, healthcare, logistics). Proper parsing ensures it is utilized effectively for analysis and modeling.
      • Proper date and time parsing is foundational for many data analysis tasks.
    Common Formats & Challenges
      • Standard Date Formats: YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY, etc.
      • Time Formats: HH:MM:SS, HH:MM AM/PM, etc.
      • Time Zones: UTC, GMT, PST, etc.
      • Daylight Saving Time: Adjustments can shift times.
      • Ambiguities: 03/04/2020 (Is it March 4th or April 3rd?)
    Real-world Scenarios & Case Studies
      • Healthcare: Parsing appointment and treatment dates for patient data analysis.
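The ambiguity on this slide is easy to reproduce. The sketch below parses the same string under two explicit formats with Python's `datetime.strptime`; which reading is right depends on where the data came from, so the format is something the analyst must state rather than let a tool guess.

```python
from datetime import datetime

raw = "03/04/2020"

# The same text parsed under two different, explicitly stated conventions.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # MM/DD/YYYY reading
day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # DD/MM/YYYY reading

print(month_first.isoformat())  # 2020-03-04 (March 4th)
print(day_first.isoformat())    # 2020-04-03 (April 3rd)
```

Once parsed, storing dates in the unambiguous YYYY-MM-DD form (as `isoformat()` produces) avoids the problem downstream.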
  • 39. DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
    Data cleansing is a systematic approach to ensuring the accuracy, consistency, and reliability of data by identifying and rectifying errors and inconsistencies. Every step, from data collection to visualization, can introduce errors. Without cleansing, these errors can propagate and mislead.
    Why is data cleansing essential for data visualization?
      • Ensuring Accuracy
      • Maintaining Consistency
      • Building Trust
      • Enhancing Efficiency
    Common Data Quality Issues
      • Missing Data
      • Duplicate Data
      • Inaccurate Data
      • Outliers
      • Inconsistent Formats
  • 40. DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
    Steps in Data Cleansing
    Data cleansing can be described as a five-step process:
      • Step 1. Data Auditing
      • Step 2. Data Cleaning
      • Step 3. Data Verification
      • Step 4. Data Reporting
      • Step 5. Data Maintenance
  • 41. DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
    Step 1. Data Auditing:
      • Data auditing is the initial phase, where data is meticulously examined to detect any anomalies, inconsistencies, or inaccuracies.
      • This process typically employs both descriptive statistics and visual methods.
      • Descriptive statistics, such as means, medians, modes, standard deviations, and variance, can provide insights into the central tendencies and dispersions within the data.
      • Visual methods involve representing the data graphically to detect outliers or patterns that might not be evident in tabulated data.
    Tools:
      • Histograms
      • Scatter Plots
      • Frequency Distributions
  • 42. DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
    Step 2. Data Cleaning: Once data auditing has identified problems, data cleaning rectifies these issues. It is the actual process of amending or removing data in a dataset that is incorrect, incomplete, improperly formatted, or duplicated.
    Methods:
      • Imputation: Missing data can distort analysis. One way to address this is by filling in the gaps with imputed values.
      • Outlier Treatment: Not all outliers are errors; some might be genuine extreme values. The key is to ascertain whether to keep them (if they're valid), adjust them (if they've been recorded incorrectly), or remove them (if they're anomalies without relevance).
      • Deduplication: Duplicate data can arise for various reasons, such as data merge errors or multiple data entries. This step involves identifying and removing these redundant records so that each record appears only once.
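As one concrete instance of imputation, the sketch below fills missing values with the median of the observed ones. Median imputation is only one of several strategies (mean, mode, or model-based estimates are others), and the ages here are invented for illustration.

```python
import statistics

# Hypothetical column with gaps; None marks a missing value.
ages = [34, None, 29, 41, None, 38]

# Compute the fill value from the observed entries only.
observed = [a for a in ages if a is not None]
fill_value = statistics.median(observed)

# Replace each gap, leaving observed values untouched.
imputed = [fill_value if a is None else a for a in ages]
print(imputed)
```

The median is often preferred over the mean when the column may contain outliers, since a few extreme values barely move it.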
  • 43. DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
    Step 3. Data Verification: After cleaning, it's vital to revisit the data to ensure no new errors were introduced during the cleaning process. Additionally, data verification ensures that no essential information was inadvertently removed.
      • Methods: This might involve cross-referencing with other trusted datasets, re-running the initial auditing techniques, or even sampling and manually reviewing records.
    Step 4. Data Reporting: Documenting the cleansing process is crucial for several reasons. It maintains transparency, ensures replicability, and builds trustworthiness.
      • Methods: This involves creating a detailed report outlining what anomalies were identified, the techniques used to address them, and any potential implications of the cleaning process. This documentation can be invaluable for future reference, for stakeholders to understand the data's quality, or for other data professionals working on the dataset.
  • 44. DATA CLEANSING AND ITS IMPORTANCE IN DATA VISUALIZATION
    Step 5. Data Maintenance: Data doesn't remain static. As new data is added or as existing data evolves, it's crucial to ensure that it maintains its quality.
      • Methods: Implementing routine checks, validations, and periodic audits can help in identifying and rectifying new issues promptly. Automated scripts, data quality software, or even manual reviews (for smaller datasets) can be employed to maintain data cleanliness over time.
  • 45. Data Cleansing Tools and Techniques
    Manual Review:
      • Pros: Full control, detailed understanding.
      • Cons: Time-consuming, not feasible for large datasets.
    Automated Cleaning Tools: DataWrangler, OpenRefine, and Trifacta.
      • Pros: Faster, can handle large datasets.
      • Cons: Involve a learning curve, can miss domain-specific nuances.
    Programming Libraries: Pandas in Python, dplyr in R.
      • Pros: Highly customizable, can integrate with other data processing steps.
      • Cons: Require programming knowledge.
    Data Quality Software: Tools like Talend and Informatica offer end-to-end data management solutions.
      • Pros: Comprehensive, often have built-in workflows.
      • Cons: Can be expensive, might have a steeper learning curve.
  • 46. Reusable Dynamic Components using the General Update Pattern
    The General Update Pattern in D3.js (a popular data visualization library for JavaScript) provides a mechanism for keeping a visualization in sync with changing data, facilitating the creation of dynamic and efficient visualizations.
    What is the General Update Pattern?
    The General Update Pattern is a methodology used in D3.js to handle data changes in a visualization. It breaks the data-binding process into three distinct phases:
      • Enter: Data points that have no element yet; new elements need to be added to the visualization for them.
      • Update: Existing elements that still have matching data; their attributes are adjusted to reflect any changes in that data.
      • Exit: Elements that no longer have associated data and need to be removed.
  • 47. Reusable Dynamic Components using the General Update Pattern
    The Importance of Reusability
      • Efficiency: Avoids re-writing code, saving time and reducing errors.
      • Consistency: Ensures visualizations maintain a consistent look and feel across different parts of an application or different applications.
      • Modularity: Allows for easier debugging, testing, and understanding, as each component can be developed and maintained separately.
  • 48. Reusable Dynamic Components using the General Update Pattern
    Implementing the General Update Pattern
      • Binding Data: Start by binding a dataset to a selection. This binding process pairs data points from the dataset with DOM elements in the visualization.
      • Handling the Enter Selection
          • Create new elements for each data point that has no element yet.
          • Define attributes and styles for these new elements based on their data.
      • Handling the Update Selection
          • Adjust attributes and styles of the existing elements to reflect any changes in their bound data.
      • Handling the Exit Selection
          • Remove elements whose data is gone, or adjust them as necessary, e.g., by fading them out or shrinking them before removal.
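The three selections are the result of a keyed comparison between what is currently rendered and the new dataset. The sketch below reproduces that comparison in plain Python so the logic can be run without a browser. It is an analogue, not D3's API: in D3 itself the corresponding calls are `selection.data(data, key)`, `.enter()`, and `.exit()`. The function name `data_join` and the sample records are assumptions.

```python
def data_join(rendered, data, key):
    """Split items into enter / update / exit groups by matching keys.

    rendered: data currently bound to on-screen elements
    data:     the new dataset
    key:      function returning a stable identifier for an item
    """
    old = {key(d): d for d in rendered}
    new = {key(d): d for d in data}
    enter = [d for k, d in new.items() if k not in old]   # needs a new element
    update = [d for k, d in new.items() if k in old]      # element exists: refresh it
    exit_ = [d for k, d in old.items() if k not in new]   # element is stale: remove it
    return enter, update, exit_

# Hypothetical state: two points on screen, then a new dataset arrives.
rendered = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
incoming = [{"id": 2, "value": 25}, {"id": 3, "value": 30}]

enter, update, exit_ = data_join(rendered, incoming, key=lambda d: d["id"])
print("enter:", enter)
print("update:", update)
print("exit:", exit_)
```

Matching by a stable key rather than by list position is what lets an existing element keep its identity (and animate smoothly) when the data is reordered or partially replaced.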
  • 49. Making Components Reusable
      • Encapsulation: Wrap your visualization code inside a function. This function can then be called whenever you want to create a new instance of the visualization.
      • Customizable Properties: Allow users of your component to customize its properties. This can be achieved using getter-setter functions for each property.
      • Event Handling: For interactivity, allow users to define custom event handlers. This ensures that the component can respond to user interactions in a way that suits the specific application it's being used in.
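The encapsulation and getter-setter ideas can be sketched independently of any drawing library. The Python closure below is an analogue of the pattern, written under stated assumptions: the property names (`width`, `height`, `on_click`) are hypothetical, and `render` returns a description of what would be drawn instead of drawing.

```python
def scatter_plot():
    """Return a configurable render function (a closure over its own settings)."""
    config = {"width": 400, "height": 300, "on_click": None}

    def render(data):
        # A real component would draw here; this sketch returns a description.
        return {"width": config["width"], "height": config["height"],
                "points": list(data)}

    def accessor(name):
        def get_or_set(*value):
            if not value:            # no argument: behave as a getter
                return config[name]
            config[name] = value[0]  # one argument: behave as a setter
            return render            # return the component so calls chain
        return get_or_set

    # Attach one getter-setter per property to the render function.
    for name in config:
        setattr(render, name, accessor(name))
    return render

# Each call creates an independent, separately configured instance.
chart = scatter_plot().width(600).height(200)
result = chart([(1, 2), (3, 4)])
print(result)
```

Because each setter returns the component itself, configuration reads as a single chained expression, and two charts created from the same function never share settings.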
  • 50. REUSABLE SCATTER PLOT
    A scatter plot, one of the most versatile visualization tools, displays individual data points on a two-dimensional plane.
    Components of a Scatter Plot
    1. Axes and Scale
      • X and Y Axes: Represent the domains of the two variables being compared.
      • Scale: The mechanism that translates data values into pixel values. Common scales include linear, logarithmic, and time-based.
    2. Data Points
      • Typically represented as circles, though other symbols can be used.
      • Position determined by the data value for each variable.
    3. Optional Components
      • Gridlines: Enhance readability.
      • Labels: Provide context or additional data about specific points.
      • Trend Line: Indicates the general direction of the data points.
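A linear scale is just a linear interpolation from a data domain onto a pixel range. The sketch below shows the idea in plain Python; the helper name `linear_scale` and the domain and range values are assumptions chosen for illustration.

```python
def linear_scale(domain, pixel_range):
    """Return a function mapping data values in `domain` onto `pixel_range`."""
    d0, d1 = domain
    r0, r1 = pixel_range

    def scale(value):
        t = (value - d0) / (d1 - d0)  # fraction of the way through the domain
        return r0 + t * (r1 - r0)     # same fraction of the pixel range
    return scale

# Hypothetical 500 x 300 pixel plotting area.
x = linear_scale((0, 100), (0, 500))
# The y range is inverted because screen coordinates grow downward.
y = linear_scale((0, 50), (300, 0))

data = [(0, 0), (50, 25), (100, 50)]
pixels = [(x(dx), y(dy)) for dx, dy in data]
print(pixels)
```

Keeping the scales as separate, swappable functions is part of what makes a scatter plot reusable: a new dataset only needs new domains, not new drawing code.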
  • 51. REUSABLE SCATTER PLOT
      • A reusable scatter plot is a type of data visualization that displays individual data points as markers on a graph.
      • The term "reusable" implies that the scatter plot is designed so that it can be applied to different datasets or situations without requiring extensive modification.
      • In a scatter plot, each data point is represented by a marker, typically a dot, and the position of the dot is determined by the values of two variables (usually one on the x-axis and one on the y-axis).
      • This allows for a visual representation of the relationship, if any, between the two variables.
  • 52. COMMON VISUALIZATION IDIOMS
    Bar Chart
      • A bar chart represents data using vertical or horizontal rectangular bars.
      • The length of each bar is proportional to the value it represents.
      • Bar charts are also known as bar graphs.
    Bar charts have three major characteristics:
      • They are used to compare data among different groups.
      • They show the relationship with the help of two axes: one axis represents the categories and the other represents the discrete values.
      • They can show major changes in data over a period of time.
  • 53. COMMON VISUALIZATION IDIOMS
    Types of Bar Chart
      • The grouped bar chart is also referred to as the clustered bar chart (graph).
      • The stacked bar chart is also known as the composite bar chart.
  • 54. COMMON VISUALIZATION IDIOMS
    Pie Chart
      • A pie chart is a pictorial representation of data in the form of a circular chart, or pie, where the size of each slice shows the size of the corresponding data value.
      • A list of numerical values, each paired with a category, is needed to represent data in the form of a pie chart.
    Example: The whole pie represents a value of 100 and is divided into 10 slices or sectors. The various colors represent the ingredients used to prepare a cake. What is the exact quantity of each ingredient represented by the colors in the pie chart?
      • Quantity of Flour: 30
      • Quantity of Sugar: 20
      • Quantity of Egg: 40
      • Quantity of Butter: 10
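Each quantity in the cake example maps to a slice angle by taking its share of the total and multiplying by 360 degrees. A minimal sketch of that arithmetic, using the four quantities from the slide:

```python
# Ingredient quantities from the cake example (they sum to 100).
quantities = {"Flour": 30, "Sugar": 20, "Egg": 40, "Butter": 10}

total = sum(quantities.values())

# Slice angle = share of the whole * 360 degrees.
angles = {name: 360 * q / total for name, q in quantities.items()}

for name, angle in angles.items():
    print(f"{name}: {quantities[name] / total:.0%} of the pie, {angle:.0f} degrees")
```

With the pie cut into 10 equal sectors of 36 degrees, Flour therefore covers 3 sectors, Sugar 2, Egg 4, and Butter 1.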
  • 55. COMMON VISUALIZATION IDIOMS
    Line Graph
      • Line graphs, also called line charts, are used to represent quantitative data collected on a specific subject over a specific time interval.
      • All the data points are connected by a line.
      • Data points represent the observations collected in a survey or research study.
      • The line graph has an x-axis and a y-axis.
  • 56. COMMON VISUALIZATION IDIOMS
    Line Graph: Shows continuous data over a period, set against a common scale, with individual data points connected together. It is ideal for showing growth rates or trends at even intervals.
    Scatter Plot: Works best when comparing large numbers of data points without regard to time. It is very powerful for showing the relationship between two variables (on the x- and y-axes), for example, a person's weight and height.
  • 57. COMMON VISUALIZATION IDIOMS
    Area Chart
      • An area chart, also known as a mountain chart, is a data visualization type that combines the appearance of a line chart and a bar chart.
      • It is commonly used to show how numerical values change based on a second variable, usually a time period.
  • 58. COMMON VISUALIZATION IDIOMS
    Types of Area Chart
      • Basic area chart: Shows a single series, for example the number of weekly profile views of a business's LinkedIn page.
      • Overlapping area chart: Compares two or more data groups based on a specific variable.
      • Stacked area chart (cumulative area chart): Goes beyond comparing values to show how each group contributes to a total.
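What distinguishes a stacked area chart is that each series is drawn on top of the running total of the series below it. The sketch below computes those lower and upper boundaries for three hypothetical traffic series (names and values invented for illustration).

```python
# Hypothetical weekly views by source, three weeks each.
series = {
    "organic": [10, 12, 15],
    "paid": [5, 7, 6],
    "referral": [2, 3, 4],
}

baseline = [0, 0, 0]  # running total of the layers drawn so far
layers = {}
for name, values in series.items():
    top = [b + v for b, v in zip(baseline, values)]
    layers[name] = list(zip(baseline, top))  # (lower, upper) edge per week
    baseline = top                           # the next layer starts here

print(layers["paid"])  # sits on top of "organic"
print(baseline)        # outline of the full stack = weekly totals
```

The height of each band still equals that group's own value, while the top edge of the whole stack traces the total.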
  • 59. DIFFERENT CRITERIA TO CHOOSE CHARTS
      • Choosing the right chart type is crucial in data visualization.
      • It ensures that data is presented clearly, accurately, and effectively.
      • Selecting an inappropriate chart can lead to misinterpretation of data or obscure critical insights.
    Nature of Data
    1. Qualitative vs. Quantitative
      • Qualitative Data (Categorical): Data that can be divided into categories. Suitable Charts: Pie charts, bar charts, treemaps.
      • Quantitative Data (Numerical): Data that can be measured. Suitable Charts: Line charts, scatter plots, histograms.
    2. Time Series vs. Non-Time Series
      • Time Series Data: Data points indexed in time order. Suitable Charts: Line charts, area charts, stacked area charts.
      • Non-Time Series Data: Data without a chronological order. Suitable Charts: Bar charts, pie charts, scatter plots.
  • 60. DIFFERENT CRITERIA TO CHOOSE CHARTS
    Number of Variables
    1. Univariate Data: Data focusing on a single variable. Suitable Charts: Histograms, pie charts, line charts (for time series).
    2. Bivariate Data: Data exploring the relationship between two variables. Suitable Charts: Scatter plots, line charts, bar charts.
    3. Multivariate Data: Data involving more than two variables. Suitable Charts: Scatter plot matrices, parallel coordinate plots, radar charts.
  • 61. DIFFERENT CRITERIA TO CHOOSE CHARTS
    Relationships and Comparisons
    1. Comparing Categories: Viewing the difference in values across different categories. Suitable Charts: Bar charts, column charts, dot plots.
    2. Distribution of Data: Observing the spread and shape of a dataset. Suitable Charts: Histograms, box plots, density plots.
    3. Trend Over Time: Analyzing patterns or changes over a period. Suitable Charts: Line charts, area charts, waterfall charts.
    4. Correlation: Examining the relationship between two or more variables. Suitable Charts: Scatter plots, bubble charts.
  • 62. DIFFERENT CRITERIA TO CHOOSE CHARTS
    Data Volume and Complexity
    1. Small Data Sets: Emphasize individual data points or fine details. Suitable Charts: Dot plots, bar charts.
    2. Large Data Sets: Offer a general overview or highlight trends. Suitable Charts: Line charts, heatmaps, histograms.
    3. Hierarchical Data: Data with inherent structure or nested categories. Suitable Charts: Treemaps, dendrograms, sunburst charts.
  • 63. DIFFERENT CRITERIA TO CHOOSE CHARTS
    Audience and Context
    1. Level of Expertise
      • General Audience: Opt for simpler, more intuitive charts.
      • Expert Audience: Can utilize more complex visualizations.
    2. Presentation Medium
      • Digital Displays: Interactive visualizations like dynamic dashboards.
      • Print: Static visualizations with high clarity.
    3. Storytelling vs. Exploration
      • Storytelling: Use charts that drive home a specific point or narrative.
      • Exploration: Use charts that allow users to delve into the data and discover insights.
  • 64. DIFFERENT CRITERIA TO CHOOSE CHARTS
    Aesthetics and Design
    1. Clarity and Simplicity: Avoid clutter, prioritize readability.
    2. Color Considerations: Ensure color choices are meaningful and accessible.
    3. Consistency: If presenting multiple charts, maintain a consistent design language.