1 Dvijesh Shastri
Day-3: Understanding Data
Course Agenda
Week | Day | Theory Topic | Hands-on Lab
1 | Day-1 | Introduction to Data Analytics | Lab-1 (Introduction to Orange)
1 | Day-2 | Exploratory Data Analysis (EDA); Visual Analytics | Lab-2 (EDA)
1 | Day-3 | Data Understanding; Data Preprocessing; Data Science Ethics (Afternoon Session) | Lab-3 (Data Preprocessing)
2 | Day-4 | Supervised Machine Learning (Decision Tree); Model Evaluation | Lab-4 (Decision Tree)
2 | Day-5 | Ensemble methods (Bagging, Random Forest, and Boosting) | Lab-5 (Ensemble methods for Classification)
2 | Day-6 | Unsupervised Machine Learning (k-Means and Hierarchical); Data Science Ethics | Lab-6 (Clustering)
2 | Day-7 | Kaggle Competition |
Today’s Agenda (Day-3)
# | Time Period (HH:MM) | Time Length (minutes) | Activities
1 | 09:00 – 09:20 | 20 | Understanding Data
2 | 09:20 – 10:20 | 60 | Data Preprocessing (1/2)
3 | 10:20 – 10:30 | 10 | Break
4 | 10:30 – 11:00 | 30 | Data Preprocessing (2/2)
5 | 11:00 – 12:00 | 60 | Lab-3 (Data Preprocessing)
What is Data?
Data → Collection of Objects → Collection of Attributes
Examples of objects: students, customers, movies, etc.
Data: a collection of objects (examples) and their attributes (features), usually arranged in matrix format.
Attribute: a property or characteristic of an object.
An attribute is also known as a variable, field, characteristic, or feature.
Examples: eye color of a person, temperature, etc.
Object: a collection of attributes that together describe one entity.
An object is also known as a record, data point, sample, entity, observation, or instance.
[Figure: data matrix with attributes as columns and objects as rows]
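The matrix view can be sketched in a few lines of Python. This is only an illustration: the student records and attribute names below are invented, not taken from the slides.

```python
# Each object (row) is described by the same set of attributes (columns).
# The student data below is invented purely for illustration.
attributes = ["name", "eye_color", "age", "gpa"]

objects = [
    ["Asha", "brown", 21, 3.6],
    ["Ben", "blue", 23, 3.1],
    ["Chen", "brown", 22, 3.9],
]

def column(matrix, attribute_names, attribute):
    """Return all values of one attribute, i.e. one column of the data matrix."""
    index = attribute_names.index(attribute)
    return [row[index] for row in matrix]

n_objects = len(objects)        # number of rows
n_attributes = len(attributes)  # number of columns
ages = column(objects, attributes, "age")
print(n_objects, n_attributes, ages)
```

Reading along a row gives one object; reading down a column gives one attribute across all objects.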
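Types of Attributes
Nominal (or Categorical)
• Values can only be tested for equality (= or ≠)
• Examples: ID numbers, eye color, zip codes
Ordinal
• Values obey an order (<) relationship
• Examples: rankings (e.g., taste of potato chips on a scale from 1-5), height in {tall, medium, short}
Interval
• Differences between values are meaningful, but there is no true zero point
• Examples: calendar dates, temperature in Celsius or Fahrenheit
Ratio
• All arithmetic on values is meaningful, including ratios
• Examples: length, time, counts, temperature in Kelvin
Nominal and ordinal attributes are qualitative; interval and ratio attributes are quantitative.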
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
• Distinctness: = ≠
• Order: < >
• Addition: + −
• Multiplication: * /
Properties by attribute type:
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties
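One way to make the four properties concrete is a small lookup table plus a temperature example. This is a sketch under assumptions: the dictionary layout, the function name `supports`, and the temperature readings are illustrative choices, not part of the slides.

```python
# Properties supported by each attribute type (hypothetical lookup table).
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}

def supports(attribute_type, prop):
    """Return True if the given attribute type possesses the given property."""
    return prop in PROPERTIES[attribute_type]

# Interval example: differences in Celsius are meaningful, ratios are not.
celsius_yesterday, celsius_today = 10.0, 20.0
difference = celsius_today - celsius_yesterday  # 10 degrees warmer

# On the Kelvin scale (a ratio scale) the same readings are far from "twice as hot".
kelvin_ratio = (celsius_today + 273.15) / (celsius_yesterday + 273.15)

print(supports("interval", "multiplication"), difference, round(kelvin_ratio, 3))
```

The Celsius readings differ by a factor of two, yet the Kelvin ratio is only slightly above 1, which is why multiplication is reserved for ratio attributes.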
Why do I care?
POLL 1: Why does the attribute type matter?
Different types of attributes may need different preprocessing (noise cleaning, missing-value handling, normalization).
• Ex: Missing values in qualitative attributes → use the mode
• Ex: Missing values in quantitative attributes → use the median, mean, or linear interpolation
ML algorithms may work better on certain kinds of attributes.
• Ex: Qualitative data → Decision Tree
• Ex: Quantitative data → kNN
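The missing-value rule above can be sketched with Python's standard `statistics` module. The helper name `impute`, the use of `None` to mark a missing value, and the sample values are assumptions made for illustration.

```python
from statistics import median, mode

def impute(values, attribute_kind):
    """Fill None entries: mode for qualitative, median for quantitative attributes."""
    observed = [v for v in values if v is not None]
    if attribute_kind == "qualitative":
        fill = mode(observed)
    elif attribute_kind == "quantitative":
        fill = median(observed)
    else:
        raise ValueError(f"unknown attribute kind: {attribute_kind}")
    return [fill if v is None else v for v in values]

# Illustrative columns, each with one missing entry.
marital_status = ["Single", "Married", None, "Married", "Divorced"]
income_k = [125, 100, None, 120, 95]

filled_status = impute(marital_status, "qualitative")
filled_income = impute(income_k, "quantitative")
print(filled_status)
print(filled_income)
```

The median is used here for the quantitative column; the mean or linear interpolation would slot into the same place.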
POLL 2: Identify the attribute type.
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Types of data sets
1. Record Data
– Data Matrix
– Document Data
– Transaction Data
2. Graph-based Data
– Social Network
– World Wide Web
3. Ordered Data
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Example of record data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
Example of a data matrix:
Projection of x Load | Projection of y load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
Example of document data:
Document | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1
Document 2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0
Example of transaction data:
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
[Figure: example of graph-based data]
Example of genetic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
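Document data of the kind shown on this slide is typically produced by turning each text into a vector over a fixed vocabulary. The sketch below uses the ten terms from the slide; the function name `term_vector`, the whitespace tokenization, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

# The ten terms used as columns in the document data example.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab, binary=True):
    """Map a text to one row of a document-term matrix.

    With binary=True each entry is 1 if the term occurs and 0 otherwise;
    with binary=False each entry is the number of occurrences.
    """
    counts = Counter(text.lower().split())
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocab]
    return [counts[term] for term in vocab]

# Invented sentence, not one of the documents on the slide.
sentence = "The coach called a timeout before the team could score"
vector = term_vector(sentence, vocabulary)
print(vector)
```

Once every document is a row of numbers, document data can be handled with the same tools as any other record data.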
Why do I care?
POLL 3: Why do the types of data sets matter?
The type of data set determines which tools and techniques can be used to analyze the data.
• Record data → Classification
• Transaction data → Association Analysis
• Graph data → Network Analysis
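As an illustration of a technique tied to one data set type, the sketch below counts how often pairs of items occur together in the five transactions from the slides. The measure used, support (the fraction of transactions containing an itemset), is the usual starting point for association analysis but is not defined in the slides; the function name `pair_support` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the transaction data example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
print(support[("Diaper", "Milk")], support[("Beer", "Bread")])
```

This counting only makes sense for transaction data; a record table with a class label such as Cheat would instead call for a classifier.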
Structured vs. Unstructured Data
A. Structured Data
Structured data conforms to a data model or schema and is often stored in tabular form.
It is used to capture relationships between different entities and is therefore most often stored in a relational database.
Because many tools and databases natively support structured data, it rarely requires special consideration with regard to processing or storage.
Examples of this type of data include banking transactions, invoices, and customer records.
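What "conforms to a schema" means in practice can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table name, columns, and rows are invented for illustration.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the columns, their types, and basic constraints.
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        balance     REAL NOT NULL CHECK (balance >= 0)
    )
    """
)

# A record that conforms to the schema is accepted.
conn.execute("INSERT INTO customers VALUES (1, 'Asha', 250.0)")

# A record that violates the schema (missing name) is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (2, NULL, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT customer_id, name, balance FROM customers").fetchall()
print(rows, rejected)
```

The database enforces the structure itself, which is part of why structured data needs little special handling downstream.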
B. Unstructured Data
Data that does not conform to a data model or data schema is known as unstructured data.
A commonly cited estimate is that unstructured data makes up about 80% of the data within a typical enterprise.
Unstructured data has a faster growth rate than structured data.
Slides presented are Intellectual Property of Dr. Dvijesh Shastri and usage rights belong to him.
