This chapter discusses different techniques for exploring and visualizing data to better understand its characteristics. It describes the different types of data objects and attributes as well as basic statistical measures like mean, median, and standard deviation that can characterize a dataset's central tendency and dispersion. Visualization techniques covered include histograms, boxplots, scatterplots, parallel coordinates, Chernoff faces, and landscapes that can reveal patterns, relationships, and outliers in the data.
Data mining techniques in data mining with examples - mqasimsheikh5
This document provides an overview of data mining concepts and techniques for understanding data. It discusses different types of data sets and attributes, basic statistical descriptions for analyzing data distributions and outliers, various data visualization techniques for exploring patterns and relationships, and measures for determining data similarity and dissimilarity.
This chapter discusses getting to know your data through data mining concepts and techniques. It covers data objects and attribute types, basic statistical descriptions of data like mean and standard deviation, visualizing data through histograms and scatter plots, measuring data similarity, and different types of data sets. The goal is to provide qualitative overviews and insights into data to find patterns, trends, relationships and irregularities.
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
This chapter discusses getting to know data through analysis and visualization. It covers data objects and attribute types, statistical descriptions of data including measures of central tendency and dispersion, visualization techniques like histograms and scatter plots, and measuring similarity between data objects. The goal is to better understand data characteristics before applying more advanced mining techniques.
This chapter discusses getting to know your data through descriptive analysis. It covers data objects and attribute types, including nominal, binary, numeric, discrete and continuous attributes. Statistical descriptions like mean, median, mode, variance and standard deviation are explained. Visualization techniques explored include histograms, boxplots, quantile plots, scatter plots and Chernoff faces. Dimensionality, sparsity, distribution and other data characteristics are also covered.
Data Mining: Concepts and Techniques - Chapter 2 - Salah Amean
The presentation contains the following:
-Data Objects and Attribute Types.
-Basic Statistical Descriptions of Data.
-Data Visualization.
-Measuring Data Similarity and Dissimilarity.
-Summary.
This chapter discusses different techniques for understanding and visualizing data. It describes the different types of data objects and attributes as well as basic statistical methods for describing data, such as measures of central tendency and dispersion. It also covers various data visualization techniques, including pixel-oriented, geometric projection, and icon-based visualization methods to gain insight into data patterns and relationships.
Getting to Know Your Data Some sources from where you can access datasets for... - AkshayRF
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points, objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
This document discusses data and data preprocessing in data mining. It defines what data is, including data objects and attributes. It describes different attribute types like nominal, binary, ordinal, interval-scaled and ratio-scaled numeric attributes. It also discusses measuring the central tendency of data using the mean, median and mode. Additionally, it covers measuring data distribution through variance, standard deviation and z-scores. Finally, it briefly introduces measuring data similarity and dissimilarity, as well as an overview of data preprocessing.
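As a rough illustration of the z-score mentioned in this summary, the sketch below standardizes a small list of numbers. The function name `z_scores` and the sample data are assumptions made for this example, and it uses the population standard deviation; a sample standard deviation would be an equally reasonable choice.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / standard deviation for each value.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / stdev for x in values]


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample data
    print("mean:", statistics.fmean(data))
    print("median:", statistics.median(data))
    print("mode:", statistics.mode(data))
    print("z-scores:", z_scores(data))
```

A z-score says how many standard deviations a value lies from the mean, which makes attributes measured on different scales comparable.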
The document discusses data preprocessing techniques. It covers why preprocessing is important for obtaining quality data and mining results. The major tasks covered include data cleaning, integration, transformation, reduction, and discretization. Data cleaning techniques discussed include handling missing data, noisy data, and inconsistencies. Data integration aims to combine data from multiple sources. Data reduction obtains a reduced representation while maintaining analytical results. Discretization is a type of data reduction important for numerical data.
This document provides an overview of data mining concepts and techniques discussed in Chapter 2 of the textbook "Data Mining: Concepts and Techniques". It defines key terms like data objects, attributes, attribute types, statistical descriptions of data, and different methods of data visualization. Various techniques are described for understanding the characteristics of data sets through statistical measures, histograms, quantile plots, scatter plots and other approaches. Different styles of data visualization like pixel plots, geometric projections, icons and hierarchies are also summarized.
This document discusses various techniques for analyzing and visualizing data to gain insights. It covers data attribute types, basic statistical descriptions to understand data distribution and outliers, different visualization methods to discover patterns and relationships, and various ways to measure similarity between data objects, including distances, coefficients, and cosine similarity for text. The goal is to preprocess and understand data at a high level before applying more advanced analytics.
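The similarity measures named in this summary can be sketched in a few lines of Python. The function names below are chosen for this example only, and the Jaccard coefficient stands in for "coefficients" as one common choice; the summarized document may cover others.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, e.g. term-frequency vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(a & b) / len(a | b)
```

Cosine similarity ignores vector length, which is why it is often preferred for text: two documents with the same word proportions score as similar even if one is much longer.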
This document provides an overview of key concepts in statistics, including:
- Descriptive statistics involves collecting and analyzing data without inferences, while inferential statistics analyzes a subset of data to make inferences about the whole.
- Parameters describe populations and statistics describe samples.
- Levels of measurement include nominal, ordinal, interval, and ratio.
- Measures of location summarize data distribution and include minimum, maximum, percentiles, deciles, and quartiles.
- Measures of variation describe data spread and include range, inter-quartile range, variance, standard deviation, and coefficient of variation. Variance and standard deviation are particularly important measures.
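The measures of variation listed above could be computed as in the following sketch, which relies only on Python's standard `statistics` module. The function name `spread_summary` is an assumption for this example, and the quartiles use the module's default interpolation method; other quartile conventions give slightly different values.

```python
import statistics


def spread_summary(values):
    """Return common measures of variation for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation
    }


if __name__ == "__main__":
    print(spread_summary([2, 4, 4, 4, 5, 5, 7, 9]))  # hypothetical data
```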
Data Mining Steps Problem Definition Market Analysis C - sharondabriggs
Data Mining Steps
Problem Definition
Market Analysis
Customer Profiling, Identifying Customer Requirements, Cross Market Analysis, Target Marketing, Determining Customer purchasing pattern
Corporate Analysis and Risk Management
Finance Planning and Asset Evaluation, Resource Planning, Competition
Fraud Detection
Customer Retention
Production Control
Science Exploration
> Data Preparation
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is good practice to start with an initial dataset to get familiar with the data, to discover first insights into it, and to gain a good understanding of any possible data quality issues. The datasets provided in these projects were obtained from kaggle.com.
Variable selection and description
Numerical – Ratio, Interval
Categorical – Ordinal, Nominal
Simplifying variables: From continuous to discrete
Formatting the data
Basic data integrity checks: missing data, outliers
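One way to simplify a variable "from continuous to discrete", as listed above, is equal-width binning. The sketch below is a minimal illustration under that assumption; the function name `equal_width_bins` is invented for this example, and other schemes such as equal-frequency binning are just as valid.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1.

    The range [min, max] is split into n_bins intervals of equal width.
    The maximum value is placed in the last bin.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so they all share the first bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    ages = [0, 5, 10, 15, 20]  # hypothetical values
    print(equal_width_bins(ages, 4))
```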
> Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques.
Data Visualization:
Univariate analysis explores variables (attributes) one by one. Variables could be either categorical or numerical.
Univariate Analysis - Categorical
Statistics | Visualization | Description
Count | Bar Chart | The number of values of the specified variable.
Count% | Pie Chart | The percentage of values of the specified variable.
Univariate Analysis - Numerical
Statistics | Visualization | Equation | Description
Count | Histogram | N | The number of values (observations) of the variable.
Minimum | Box Plot | Min | The smallest value of the variable.
Maximum | Box Plot | Max | The largest value of the variable.
Mean | Box Plot | | The sum of the values divided by the count.
Median | Box Plot | | The middle value. Below and above median lies an equal number of values.
Mode | Histogram | | The most frequent value. There can be more than one mode.
Quantile | Box Plot | | A set of 'cut points' that divide a set of data into groups containing equal numbers of values (Quartile, Quintile, Percentile, ...).
Range | Box Plot | Max-Min | The difference between maximum and minimum.
Variance | Histogram | | A measure of data dispersion.
Standard Deviation | Histogram | | The square root of variance.
Coefficient of Deviation | Histogram | | A measure of data dispersion divided by mean.
Skewness | Histogram | | A measure of symmetry or asymmetry in the distribution of data.
Kurtosis | Histogram | | A measure of whether the data are peaked or flat relative to a normal distribution.
Note: There are two types of numerical variables, interval and ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good example is temperature in Centigrade degrees. Data on an int ...
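Skewness and kurtosis from the numerical table above can be illustrated with simple moment-based formulas. The sketch below assumes the population (biased) definitions and reports kurtosis as excess kurtosis, so a normal distribution scores about 0; the original material may use a different variant, and the function names are chosen for this example.

```python
import statistics


def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (the value for a normal distribution)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return central_moment(values, 4) / m2 ** 2 - 3.0
```

Positive excess kurtosis suggests a sharper peak and heavier tails than a normal distribution, and negative values suggest a flatter shape.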
The document discusses exploratory data analysis techniques, including univariate and multivariate graphical and non-graphical analysis. It covers tools like R and Python for EDA. Key EDA methods covered are estimating location and variability through measures like the mean, median, and standard deviation. Distribution analysis techniques like histograms, density plots, and boxplots are also discussed.
The document provides examples of using basic R commands like assigning values to objects (x, y), printing objects, removing objects, and checking what objects are in the workspace. It also demonstrates using the c() function to combine values into a vector and shows some errors that can occur from typos in names. Additionally, it discusses entering data with c function, population versus sample, and levels of measurement for variables.
The document provides an overview of populations, samples, and key concepts in descriptive statistics. It discusses how samples are used to make inferences about populations. Key points include:
- Samples are subsets of populations used for study due to constraints on time and resources.
- Descriptive statistics like means, medians, and histograms are calculated from samples to learn about characteristics of interest in populations.
- Categorical data can be summarized using frequency distributions and sample proportions.
- Different measures of center like the mean, median, and trimmed mean are used to summarize data, with the choice dependent on factors like outliers and distribution shape.
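The trimmed mean mentioned in the last point can be sketched as follows. The function name and the convention of cutting the same fraction from each end are assumptions for this example; the summarized document may define the trimming proportion differently.

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping the given fraction of values from each end.

    proportion=0.1 drops the lowest 10% and the highest 10% of the sorted data.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in the range [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values cut from each end
    return statistics.fmean(data[k:len(data) - k])


if __name__ == "__main__":
    incomes = [1, 2, 3, 4, 100]  # hypothetical data with one outlier
    print("mean:", statistics.fmean(incomes))
    print("median:", statistics.median(incomes))
    print("20% trimmed mean:", trimmed_mean(incomes, 0.2))
```

With an extreme value present, the plain mean is pulled toward the outlier while the median and trimmed mean stay near the bulk of the data.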
This document provides an introduction to biostatistics. It discusses topics such as collecting and presenting quantitative and qualitative data through tables, charts, and diagrams. It also covers descriptive statistics like measures of central tendency (mean, median, mode) and dispersion (range, standard deviation). Inferential statistics such as probability distributions, hypothesis testing, and tests of significance are introduced. Examples provided include the normal, binomial, and Poisson distributions as well as chi-square, t-tests, z-tests, and ANOVA for hypothesis testing.
Statistics is a mathematical science including methods of collecting, organizing, and analyzing data in such a way that meaningful conclusions can be drawn from them. In general, its investigations and analyses fall into two broad categories called descriptive and inferential statistics.
Module 2_ Introduction to Data Mining, Data Exploration and Data Pre-processi... - nikshaikh786
The document discusses data mining architecture, tasks, and data exploration and preprocessing techniques. It describes the KDD process and issues in data mining. It also covers data types, attributes, statistical descriptions of data, and various data visualization techniques like histograms, boxplots, scatter plots and quantile plots to explore patterns in data. Data preprocessing steps discussed are data cleaning, integration, transformation, reduction and discretization.
This document provides an outline and introduction to statistical tools and SPSS used in social research. It covers topics such as data presentation, measures of central tendency, skewness and kurtosis, measures of dispersion, correlation, and regression. The document defines key statistical concepts and terms and provides examples of how to calculate statistics like the mean, median, mode, and percentiles for both ungrouped and grouped data sets. Formulas and calculation methods are presented.
This document provides an introduction to key concepts in statistics including measures of central tendency, variation, distributions, and linear regression. It defines the mean, median, and mode as measures of central tendency. Measures of variation described include range, variance, and standard deviation. Common distributions like the normal distribution are explained and its key properties outlined. Hypothesis testing and p-values are also introduced. Finally, the concepts of covariance, correlation, and simple linear regression models are summarized.
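Covariance, correlation, and simple linear regression are closely linked, which the sketch below tries to make visible. It is a plain-Python illustration with function names chosen for this example; it uses sample (n - 1) formulas and an ordinary least-squares fit, which is an assumption about what the summarized document presents.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    slope = covariance(xs, ys) / statistics.variance(xs)
    intercept = statistics.fmean(ys) - slope * statistics.fmean(xs)
    return slope, intercept


if __name__ == "__main__":
    hours = [1, 2, 3, 4]   # hypothetical predictor
    score = [3, 5, 7, 9]   # hypothetical response
    print("correlation:", correlation(hours, score))
    print("slope, intercept:", fit_line(hours, score))
```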
Data preprocessing involves several key steps:
1) Data cleaning to fill in missing values, identify and remove outliers, and resolve inconsistencies
2) Data integration to combine multiple data sources and resolve conflicts and redundancies
3) Data reduction techniques like discretization, dimensionality reduction, and aggregation to obtain a reduced representation of the data for mining and analysis.
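The data cleaning step above can be illustrated with a small sketch that fills missing values and flags outliers. Representing missing values as `None`, imputing with the median, and using the 1.5 x IQR rule are all assumptions made for this example; the summarized document may use other techniques.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute when every value is missing")
    median = statistics.median(present)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    raw = [10, 12, None, 11, 13, 12, 11, 95]  # hypothetical readings
    cleaned = fill_missing_with_median(raw)
    print("filled:", cleaned)
    print("flagged as outliers:", iqr_outliers(cleaned))
```

Flagged values are only candidates for removal; whether to drop, cap, or keep them depends on the data and the analysis goal.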
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf - Suchita Rawat
This document discusses key concepts in research methodology and statistics. It defines statistics as dealing with the collection, analysis, and interpretation of quantitative and qualitative data. It then discusses various types of graphs used to visually represent data, such as bar graphs, pie charts, histograms, boxplots, and scatterplots. It also defines common measures of central tendency (mean, median, mode), dispersion (range, variance, standard deviation, IQR), and skewness.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
The Ipsos - AI - Monitor 2024 Report.pdf - Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Learn SQL from basic queries to Advance queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
2. 2
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
3. 3
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix,
crosstabs
Document data: text documents: term-
frequency vector
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data:
Document-term matrix (term-frequency vectors):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
4. 4
Important Characteristics of Structured
Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
5. 5
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points,
objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
6. 6
Attributes
Attribute (or dimension, feature, variable):
a data field, representing a characteristic or feature
of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
7. 7
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
8. 8
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of a value as being a multiple of
another value
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts,
monetary quantities
9. 9
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a
collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete
attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented as
floating-point variables
10. 10
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
11. 11
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
12. 12
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of
the middle two values otherwise
Estimated by interpolation (for grouped data):
$median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width$
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
$mean - mode = 3 \times (mean - median)$
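The measures of central tendency on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names `weighted_mean` and `trimmed_mean` and the salary values are assumptions chosen for the example, and the trimmed mean here simply drops an equal share of values from each end.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping the lowest and highest `fraction` of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped from each end
    return mean(ordered[k:len(ordered) - k])

# Illustrative salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))
print("median:", median(salaries))        # even n: average of the two middle values
print("modes:", multimode(salaries))      # all most-frequent values (bimodal data)
print("10% trimmed mean:", trimmed_mean(salaries, 0.1))
print("weighted mean:", weighted_mean([1, 2, 3], [3, 1, 1]))
```

On this data the trimmed mean sits below the plain mean because the extreme value 110 is removed before averaging.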
13. September 22, 2023 Data Mining: Concepts and Techniques 13
Symmetric vs. Skewed
Data
Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]
14. 14
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
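Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
Standard deviation s (or σ) is the square root of variance s² (or σ²)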
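The phrase "algebraic, scalable computation" refers to the fact that variance can be obtained from running sums in a single pass over the data. The sketch below compares the two-pass definition with a one-pass version; the function names and sample data are assumptions for illustration. Note that the one-pass form can lose precision through cancellation when the mean is large relative to the spread.

```python
import math

def sample_variance(xs):
    """Two-pass definition: s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """Same quantity from the running sums of x_i and x_i^2 only."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in xs:  # a single scan; the sums could also be merged across partitions
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

def population_variance(xs):
    """sigma^2 = (1/N) * sum(x_i^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_variance_one_pass(data))
print(population_variance(data), math.sqrt(population_variance(data)))
```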
15. 15
Boxplot Analysis
Five-number summary of a distribution
Minimum, Q1, Median, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the
box
Whiskers: two lines outside the box extended
to Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually
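A five-number summary and the usual 1.5 x IQR outlier rule might be computed as follows. This is a sketch under one assumption worth stating: quartile conventions differ between textbooks and tools, and the choice of `method="inclusive"` in `statistics.quantiles` is only one of them. The function names and the data are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))
print(iqr_outliers(salaries))
```

In a boxplot drawn from these numbers, the box would span Q1 to Q3 and any value returned by `iqr_outliers` would be plotted as an individual point.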
16. 16
Visualization of Data Dispersion: 3-D Boxplots
17. 17
Properties of Normal Distribution Curve
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
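The three percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the helper name `coverage_within` below is an assumption for illustration.

```python
import math

def coverage_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage_within(k):.4f}")
```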
18. 18
Graphic Displays of Basic Statistical
Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis shows values, y-axis represents frequencies
Quantile plot: each value xi is paired with fi, indicating
that approximately 100 fi % of the data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
19. 19
Histogram Analysis
Histogram: Graph display of
tabulated frequencies, shown as
bars
It shows what proportion of cases
fall into each of several categories
Differs from a bar chart in that it is
the area of the bar that denotes the
value, not the height as in bar
charts, a crucial distinction when the
categories are not of uniform width
The categories are usually specified
as non-overlapping intervals of
some variable. The categories (bars)
must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
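The tabulation behind an equal-width histogram is a simple bucket count. The sketch below assumes adjacent, non-overlapping intervals of the form [low + i*width, low + (i+1)*width); the function name, bin edges, and values are illustrative and are not taken from the slide's figure.

```python
def equal_width_histogram(values, low, width, bins):
    """Count how many values fall into each of `bins` adjacent intervals of equal width."""
    counts = [0] * bins
    for v in values:
        idx = int((v - low) // width)  # index of the interval containing v
        if 0 <= idx < bins:            # values outside the covered range are ignored
            counts[idx] += 1
    return counts

prices = [12000, 25000, 31000, 47000, 52000, 58000, 91000]
print(equal_width_histogram(prices, low=10000, width=20000, bins=5))
```

With equal-width intervals, bar height and bar area are proportional, so either can encode the count; the area-based reading only matters once the intervals differ in width.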
20. 20
Histograms Often Tell More than Boxplots
The two histograms
shown on the left may
have the same boxplot
representation
The same values
for: min, Q1,
median, Q3, max
But they have rather
different data
distributions
21. 21
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For data xi sorted in increasing order, fi
indicates that approximately 100 fi% of the data are
below or equal to the value xi
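The points of a quantile plot can be generated by sorting the data and pairing each value with its fraction fi. The slide does not say how fi is computed; the sketch below assumes the common convention fi = (i - 0.5) / n, and the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

for f, x in quantile_plot_points([40, 10, 30, 20]):
    print(f"{f:.3f} -> {x}")
```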
22. 22
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
View: Is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be lower
than those at Branch 2.
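A q-q comparison of two samples pairs their quantiles at the same cut points. The sketch below is illustrative only: the branch price lists are made up, the function name is hypothetical, and quartiles from `statistics.quantiles` with `method="inclusive"` are an assumed convention.

```python
from statistics import quantiles

def qq_points(sample_a, sample_b, n=4):
    """Pair the n-1 cut points of sample_a with the corresponding cut points of sample_b."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

branch1 = [40, 50, 60, 70, 80]         # hypothetical unit prices
branch2 = [50, 60, 75, 90, 100, 110]   # hypothetical unit prices
points = qq_points(branch1, branch2)
print(points)
# Every point lying above the line y = x means Branch 1 prices are lower at each quantile.
print(all(a < b for a, b in points))
```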
23. 23
Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
24. 24
Positively and Negatively Correlated Data
The left half fragment is positively
correlated
The right half is negatively correlated
26. 26
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
27. 27
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto graphical
primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships among
data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
28. 28
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the screen, one
for each dimension
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
The colors of the pixels reflect the corresponding values
(a) Income (b) Credit Limit (c) transaction volume (d) age
29. 29
Laying Out Pixels in Circle Segments
To save space and show the connections among multiple dimensions,
space filling is often done in a circle segment
(a) Representing a data record
in circle segment
(b) Laying out pixels in circle segment
30. 30
Geometric Projection Visualization
Techniques
Visualization of geometric transformations and projections
of the data
Methods
Direct visualization
Scatterplot and scatterplot matrices
Landscapes
Projection pursuit technique: Help users find meaningful
projections of multidimensional data
Prosection views
Hyperslice
Parallel coordinates
31. 31
Direct Data Visualization
Ribbons with Twists Based on Vorticity
32. 32
Scatterplot Matrices
Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k² − k)/2 distinct scatterplots]
Used by permission of M. Ward, Worcester Polytechnic Institute
33. 33
Landscapes
Visualization of the data as perspective landscape
The data needs to be transformed into a (possibly artificial) 2D
spatial representation which preserves the characteristics of the data
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
34. 34
Parallel Coordinates
[Figure: parallel vertical axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]
n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
The axes are scaled to the [minimum, maximum] range of the
corresponding attribute
Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute
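The mapping from a record to its polygonal line amounts to min-max scaling each attribute onto its own axis. The sketch below is a minimal illustration; the function name, the sample record, and the attribute ranges are assumptions, and each range is assumed to have maximum > minimum.

```python
def parallel_coordinates_polyline(record, ranges):
    """Return the polyline vertices (axis index, position in [0, 1]) for one record.

    `ranges` holds one (minimum, maximum) pair per attribute.
    """
    return [
        (axis, (value - lo) / (hi - lo))
        for axis, (value, (lo, hi)) in enumerate(zip(record, ranges))
    ]

# Hypothetical record with two attributes and their observed ranges.
print(parallel_coordinates_polyline((5, 20), [(0, 10), (10, 30)]))
```

Drawing consecutive vertices as connected line segments, one record at a time, produces the familiar bundle of lines across the axes.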
36. 36
Icon-Based Visualization Techniques
Visualization of the data values as features of icons
Typical visualization methods
Chernoff Faces
Stick Figures
General techniques
Shape coding: Use shape to represent certain
information encoding
Color icons: Use color icons to encode more information
Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval
37. 37
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics (head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening), each assigned one of 10 possible values, generated using
Mathematica (S. Dickson)
REFERENCE: Gonick, L. and Smith, W. The
Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
38. 38
Stick Figure
A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs. Look at texture pattern
39. 39
Hierarchical Visualization Techniques
Visualization of the data using a hierarchical
partitioning into subspaces
Methods
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube
40. 40
Dimensional Stacking
attribute 1
attribute 2
attribute 3
attribute 4
Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
Partitioning of the attribute value ranges into classes. The
important attributes should be used on the outer levels.
Adequate for data with ordinal attributes of low cardinality
But, difficult to display more than nine dimensions
Important to map dimensions appropriately
41. 41
Dimensional Stacking
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Used by permission of M. Ward, Worcester Polytechnic Institute
42. 42
Worlds-within-Worlds
Assign the function and two most important parameters to innermost
world
Fix all other parameters at constant values; draw other (1-, 2-, or
3-dimensional) worlds choosing these as the axes
Software that uses this paradigm
N-vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
Auto Visual: Static interaction by means of queries
43. 43
Tree-Map
Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
MSR Netscan Image
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
45. 45
InfoCube
A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
46. 46
Three-D Cone Trees
3D cone tree visualization technique works
well for up to a thousand nodes or so
First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
Cannot avoid overlaps when projected to
2D
G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
47. Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
The importance of
tag is represented
by font size/color
Besides text data,
there are also
methods to visualize
relationships, such as
visualizing social
networks
Newsmap: Google News Stories in 2005
48. 48
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
49. 49
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects
are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
50. 50
Data Matrix and Dissimilarity Matrix
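Data matrix
n data points with p dimensions
Two modes
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$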
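Going from a data matrix to a triangular dissimilarity matrix can be sketched as below. Euclidean distance (`math.dist`) is used as the default measure purely as an assumption; any distance function taking two points could be passed instead. The function name and the four sample points are illustrative.

```python
import math

def dissimilarity_matrix(points, distance=math.dist):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i-1) followed by 0."""
    return [
        [distance(points[i], points[j]) for j in range(i)] + [0.0]
        for i in range(len(points))
    ]

# Data matrix with n = 4 objects and p = 2 attributes (illustrative values).
data = [(1, 2), (3, 5), (2, 0), (4, 5)]
for row in dissimilarity_matrix(data):
    print([round(d, 2) for d in row])
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.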
51. 51
Proximity Measure for Nominal Attributes
Can take 2 or more states, e.g., red, yellow, blue,
green (generalization of a binary attribute)
Method 1: Simple matching
m: # of matches, p: total # of variables
$d(i, j) = \frac{p - m}{p}$
Method 2: Use a large number of binary attributes
creating a new binary attribute for each of the
M nominal states
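Simple matching counts the attributes on which two objects agree and reports the mismatching fraction, (p - m) / p. A minimal sketch follows; the function name and the example objects are assumptions.

```python
def nominal_dissimilarity(a, b):
    """Simple matching: (p - m) / p, with m matches out of p nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

# Two hypothetical objects described by (color, shape, size).
print(nominal_dissimilarity(("red", "round", "small"), ("red", "square", "small")))
```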
52. 52
Proximity Measure for Binary Attributes
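A contingency table for binary data
                 Object j
                 1        0        sum
Object i   1     q        r        q + r
           0     s        t        s + t
           sum   q + s    r + t    p
Distance measure for symmetric binary variables:
$d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables:
$d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Note: Jaccard coefficient is the same as “coherence”:
$coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$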
53. 53
Dissimilarity between Binary Variables
Example
Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N 0
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
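The worked example can be reproduced with a few lines of Python. The sketch encodes Y and P as 1 and N as 0 for the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) and leaves out Gender, as the example does. The function names are assumptions, and the asymmetric measures are undefined when two objects share no 1 at all (q + r + s = 0), which this sketch does not handle.

```python
def binary_counts(a, b):
    """Contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t

def symmetric_dissimilarity(a, b):
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(a, b):
    q, r, s, _ = binary_counts(a, b)  # joint absences (t) are ignored
    return (r + s) / (q + r + s)

def jaccard_similarity(a, b):
    q, r, s, _ = binary_counts(a, b)
    return q / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_dissimilarity(jack, mary), 2))
print(round(asymmetric_dissimilarity(jack, jim), 2))
print(round(asymmetric_dissimilarity(jim, mary), 2))
```

By this measure Jack and Mary are the most alike, while Jim and Mary are the least alike.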
54. 54
Standardizing Numeric Data
Z-score: $z = \frac{x - \mu}{\sigma}$
x: raw score to be standardized, μ: mean of the population, σ:
standard deviation
the distance between the raw score and the population mean in
units of the standard deviation
negative when the raw score is below the mean, positive when above
An alternative way: Calculate the mean absolute deviation
$s_f = \frac{1}{n} (|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|)$
where
$m_f = \frac{1}{n} (x_{1f} + x_{2f} + \dots + x_{nf})$
standardized measure (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
Using mean absolute deviation is more robust than using standard
deviation
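Standardizing one attribute with the mean absolute deviation can be sketched as follows. The function name and sample values are assumptions; the sketch also assumes the values are not all identical, since the deviation would then be zero.

```python
def standardize_mad(values):
    """z_if = (x_if - m_f) / s_f, where s_f is the mean absolute deviation of the attribute."""
    n = len(values)
    m = sum(values) / n                       # m_f: mean of the attribute
    s = sum(abs(x - m) for x in values) / n   # s_f: mean absolute deviation
    return [(x - m) / s for x in values]

print(standardize_mad([2, 4, 6, 8]))
```

Because deviations are not squared, a single extreme value inflates the mean absolute deviation less than it inflates the standard deviation, so the z-score of an outlier stays more clearly separated from the rest.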
56. 56
Distance on Numeric Data: Minkowski
Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
A distance that satisfies these properties is a metric
57. 57
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
E.g., the Hamming distance: the number of bits that are
different between two binary vectors
h = 2: Euclidean (L2 norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
h → ∞: “supremum” (Lmax norm, L∞ norm) distance.
This is the maximum difference between any component
(attribute) of the vectors
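The special cases can be compared side by side with one general function for finite h and a separate one for the supremum distance. This is a minimal sketch; the function names and the two sample points are assumptions.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h (h >= 1) between two equal-length vectors."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    """Limit of the Minkowski distance as h grows: the largest per-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

p1, p2 = (1, 2), (3, 5)
print("Manhattan:", minkowski(p1, p2, 1))
print("Euclidean:", round(minkowski(p1, p2, 2), 3))
print("Supremum:", supremum(p1, p2))

# On binary vectors the Manhattan distance equals the Hamming distance.
print("Hamming:", minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1))
```

For any pair of points the three values are ordered supremum ≤ Euclidean ≤ Manhattan.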
59. 59
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace xif by their rank $r_{if} \in \{1, \dots, M_f\}$
map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
compute the dissimilarity using methods for interval-
scaled variables
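The rank-based mapping onto [0, 1] can be sketched as below. The function name is hypothetical, and the states are assumed to be supplied in their meaningful order, from lowest to highest, with at least two states.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (r - 1) / (M - 1), where r is its 1-based rank."""
    rank = ordered_states.index(value) + 1
    m = len(ordered_states)
    return (rank - 1) / (m - 1)

sizes = ["small", "medium", "large"]
z = [ordinal_to_unit_interval(s, sizes) for s in sizes]
print(z)
# Dissimilarity is then computed as for numeric attributes, e.g. |z_i - z_j|.
print(abs(z[0] - z[2]))
```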
60. 60
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Treat $z_{if}$ as interval-scaled
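A combined dissimilarity over attributes of mixed type might look like the sketch below. Several details are assumptions, since the slide does not spell them out: the indicator for an attribute is taken to be 0 when either value is missing (`None`) or when an asymmetric binary attribute is 0 in both objects, and 1 otherwise; the normalized numeric distance is taken to be |x - y| divided by the attribute's observed range. The schema format, function name, and sample objects are hypothetical.

```python
def mixed_dissimilarity(a, b, schema):
    """Average the per-attribute dissimilarities over the attributes that can be compared.

    `schema` lists one (kind, info) pair per attribute, where kind is one of
    "nominal", "binary", "asymmetric", "numeric" (info = (min, max)),
    or "ordinal" (info = ordered list of states).
    """
    total, compared = 0.0, 0
    for x, y, (kind, info) in zip(a, b, schema):
        if x is None or y is None:
            continue  # missing value: attribute contributes nothing
        if kind == "asymmetric" and x == 0 and y == 0:
            continue  # joint absence is not counted for asymmetric binary attributes
        if kind in ("nominal", "binary", "asymmetric"):
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            lo, hi = info
            d = abs(x - y) / (hi - lo)
        elif kind == "ordinal":
            steps = len(info) - 1
            d = abs(info.index(x) - info.index(y)) / steps  # |z_i - z_j|
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += d
        compared += 1
    if compared == 0:
        raise ValueError("no comparable attributes")
    return total / compared

schema = [
    ("nominal", None),
    ("ordinal", ["fair", "good", "excellent"]),
    ("numeric", (22, 64)),
]
obj1 = ("code A", "excellent", 45)
obj2 = ("code B", "fair", 22)
print(round(mixed_dissimilarity(obj1, obj2, schema), 2))
```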
61. 61
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Other vector objects: gene features in micro-arrays, …
Applications: information retrieval, biological taxonomy, gene feature
mapping, ...
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates vector dot product, and $\|d\|$ is the length of vector d
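Cosine similarity between two term-frequency vectors can be computed directly from the definition. The sketch below is illustrative: the function name and the two ten-term vectors are assumptions, and both vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2)

# Term-frequency vectors of two hypothetical documents over the same ten terms.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(doc1, doc2), 2))
```

Because the measure depends only on the angle between the vectors, a document and a copy of it repeated twice have similarity 1 even though their raw term counts differ.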
63. 63
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
64. Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
Many methods have been developed but still an active area of research.