Some Key Questions
about Your Data

                  Brian Mac Namee
Brendan Tierney
            Damian Gordon
The Data
   Not every project is concerned with large datasets,
    but if data is the key consideration in your
    research, it is important to consider several
    questions about it.
Overview
   How suitable is the data?
   What is the type of the data?
   Where will you get it from?
   What size is the dataset?
   What format is it in?
   How much cleaning is required?
   What is the quality of the data?
   How do you deal with missing data?
   How will you evaluate your analysis?
   etc.
Suitability: Dataset
   Determining the suitability of the data is a vital
    consideration. It is not sufficient to simply locate a
    dataset that is thematically linked to your research
    question; it must be appropriate for exploring the
    questions that you want to ask.
   For example, just because you want to do Credit
    Card Fraud detection and you have a dataset that
    contains Credit Card transactions or was used in
    another Credit Card Fraud project, does not mean
    that it will be suitable for your project.
Suitability: Labelling
   Is the data already labelled?

   This is very important for supervised learning
    problems.
   To take the credit card fraud example again, you
    can probably get as many credit card transactions
    as you like but you probably won't be able to get
    them marked up as fraudulent and non-fraudulent.
Suitability: Labelling
   The same thing goes for a lot of text analytics
    problems - can you get people to label thousands of
    documents as being interesting or non-interesting to
    them so that you can train a predictive model?
   The availability of labelled data is a key
    consideration for any supervised learning problem.
   The areas of semi-supervised learning and active
    learning try to address this problem and have some
    very interesting open research questions.
Suitability: Labelling
   Two important considerations:

       The Curse of Dimensionality – When the dimensionality
        increases, the volume of the space increases so fast that
        the available data becomes sparse. In order to obtain a
        statistically sound result, the amount of data you need
        often grows exponentially with the dimensionality.

       The No Free Lunch Theorem - Classifier performance
        depends greatly on the characteristics of the data to be
        classified. There is no single classifier that works best on
        all given problems.
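
       The growth described under the Curse of Dimensionality can
        be made concrete with a little arithmetic. The sketch below
        is only an illustration and assumes data spread uniformly
        over a unit hypercube; the function names are hypothetical
        and not from the slides.

```python
def samples_needed(points_per_axis: int, dims: int) -> int:
    """Points required to keep the same per-axis grid density in `dims` dimensions."""
    return points_per_axis ** dims


def edge_for_fraction(fraction: float, dims: int) -> float:
    """Edge length of a sub-cube of the unit hypercube that contains
    `fraction` of uniformly distributed data."""
    return fraction ** (1.0 / dims)


if __name__ == "__main__":
    # With 10 points per axis, the required sample count grows exponentially,
    # and a "local" neighbourhood holding 1% of the data stops being local.
    for d in (1, 2, 5, 10):
        print(d, samples_needed(10, d), round(edge_for_fraction(0.01, d), 3))
```

       Under these assumptions, capturing 1% of the data in ten
        dimensions needs a sub-cube spanning well over half of each
        axis, which is why neighbourhood-based methods tend to
        struggle as features are added.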
Suitability: Labelling
   Also remember for labelling, you might be aiming
    for one of three goals:

       Binary classification – assigning each data item to one
        of two categories.

       Multiclass classification – assigning each data item to
        one of more than two categories.

       Multi-label classification – assigning each data item
        any number of target labels at once.
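
       One way to tell these three goals apart is to look at the
        shape of the target column. The helper below is a minimal
        sketch; `target_kind` is a hypothetical name, and it assumes
        multi-label targets are stored as a collection of labels
        per item.

```python
def target_kind(labels):
    """Guess the classification goal from a list of target values."""
    # A collection of labels per item means several labels can apply at once.
    if any(isinstance(y, (set, frozenset, list, tuple)) for y in labels):
        return "multi-label"
    distinct = set(labels)
    if len(distinct) == 2:
        return "binary"
    if len(distinct) > 2:
        return "multiclass"
    raise ValueError("need at least two distinct classes to train a classifier")


if __name__ == "__main__":
    print(target_kind(["fraud", "ok", "ok"]))
    print(target_kind(["sport", "politics", "tech"]))
    print(target_kind([{"sport", "tech"}, {"politics"}]))
```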
Types of Data
   Federated data
   High dimensional data
   Descriptive data
   Longitudinal data
   Streaming data
   Web (scraped) data
   Numeric vs. categorical vs. text data
   etc.
Locating Datasets
   http://researchmethodsdataanalysis.blogsp

   e.g.
   http://www.kdnuggets.com/datasets/
   http://www.google.com/publicdata/directory
   http://opendata.ie/
   http://lib.stat.cmu.edu/datasets/
Size of the Dataset
   What is a reasonable size of a dataset?

   Obviously it varies a lot from problem to problem,
    but in general we would recommend at least 10
    features (columns) in the dataset, and we’d like to
    see thousands of instances.
Format of the Data
   TXT (Text file)
   MIME (Multipurpose Internet Mail Extensions)
   XML (Extensible Markup Language)
   CSV (Comma-Separated Values)
   ASCII (American Standard Code for Information
    Interchange)
   etc.
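
   Whatever the format, the aim is usually to get the data into
    one common in-memory shape. The sketch below uses only the
    Python standard library; the sample CSV and XML snippets and
    the function names are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two records in two formats.
CSV_TEXT = "id,amount\n1,9.99\n2,120.00\n"
XML_TEXT = "<rows><row id='1' amount='9.99'/><row id='2' amount='120.00'/></rows>"


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_xml(text):
    """Parse <row .../> elements into dicts built from their attributes."""
    return [dict(element.attrib) for element in ET.fromstring(text).iter("row")]


if __name__ == "__main__":
    print(rows_from_csv(CSV_TEXT))
    print(rows_from_xml(XML_TEXT))
```

   Note that both parsers return every value as a string, so
    numeric columns still need converting before analysis.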
Cleaning of Data
   Parsing
   Correcting
   Standardizing
   Matching
   Consolidating
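
   Several of the cleaning steps above can be sketched on a toy
    contact list. This is an illustration only: the "name, email"
    record layout, the choice of email as the matching key, and
    all function names are assumptions, not part of the slides.

```python
def parse_record(line):
    """Parsing: split a raw 'name, email' line into named fields."""
    name, email = (part.strip() for part in line.split(","))
    return {"name": name, "email": email}


def standardize(record):
    """Standardizing: consistent casing and spacing for each field."""
    name = " ".join(word.capitalize() for word in record["name"].split())
    return {"name": name, "email": record["email"].lower()}


def consolidate(records):
    """Matching and consolidating: records with the same email are treated
    as the same person, and the first one seen is kept."""
    merged = {}
    for record in records:
        merged.setdefault(record["email"], record)
    return list(merged.values())


if __name__ == "__main__":
    raw = [
        "  alice SMITH , Alice@Example.com ",
        "Alice Smith,alice@example.com",
        "Bob Jones, bob@example.com",
    ]
    print(consolidate(standardize(parse_record(line)) for line in raw))
```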
Quality of the Data
   Frequency counts
   Descriptive statistics (mean, standard deviation,
    median)
   Normality (skewness, kurtosis, frequency
    histograms, normal probability plots)
   Associations (correlations, scatter plots)
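
   Most of these quality checks take only a few lines to compute.
    The sketch below uses the Python standard library; it assumes
    the population forms of standard deviation and skewness, and
    the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean, median, pstdev


def frequency_counts(values):
    """How often each distinct value occurs."""
    return dict(Counter(values))


def skewness(values):
    """Population skewness: 0 for symmetric data, positive for a long right tail."""
    m, sd = mean(values), pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def profile(values):
    """Descriptive statistics for one numeric column."""
    return {
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),
        "skewness": skewness(values),
    }


if __name__ == "__main__":
    print(frequency_counts(["a", "b", "a"]))
    print(profile([1, 2, 3, 4, 5]))
    print(pearson([1, 2, 3], [2, 4, 6]))
```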
Missing Data?
   Imputation
   Partial imputation
   Partial deletion
   Full analysis

   Also consider database nullology
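
   Two of the simplest strategies, deleting incomplete rows and
    filling gaps with the column mean, can be sketched as follows.
    The slides do not prescribe a particular method; this example
    assumes rows stored as dictionaries with None marking a missing
    value, and the function names are hypothetical.

```python
from statistics import mean


def drop_incomplete(rows):
    """Deletion: keep only rows with no missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_mean(rows, column):
    """Imputation: replace missing values in `column` with the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


if __name__ == "__main__":
    rows = [
        {"age": 20, "income": 100},
        {"age": None, "income": 200},
        {"age": 40, "income": 300},
    ]
    print(drop_incomplete(rows))
    print(impute_mean(rows, "age"))
```

   Deletion shrinks the dataset, while mean imputation keeps every
    row but understates the variance of the filled column, so the
    choice depends on how much data is missing and why.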
Evaluating the Analysis
   How confident are you in the outcomes of your
    analysis?

   Area under the Curve
   Misclassification Error
   Confusion Matrix
   N-fold Cross Validation
   Test predictions using the real-world
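
   The measures listed above can be sketched for a binary
    classifier in plain Python. This is a minimal illustration
    that assumes labels coded as 1 (positive) and 0 (negative);
    the function names are hypothetical, and a real project would
    more likely use an established library.

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def misclassification_error(y_true, y_pred):
    """Fraction of items whose predicted label is wrong."""
    c = confusion_matrix(y_true, y_pred)
    return (c["fp"] + c["fn"]) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the chance that a random
    positive item scores higher than a random negative one (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k):
    """Split item indices into k folds; return (train, test) index lists per fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    return [
        ([j for fold in folds[:i] + folds[i + 1:] for j in fold], folds[i])
        for i in range(k)
    ]


if __name__ == "__main__":
    print(confusion_matrix([1, 0, 1, 0], [1, 1, 0, 0]))
    print(misclassification_error([1, 0, 1, 0], [1, 1, 0, 0]))
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(k_fold_indices(6, 3))
```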
The Data
   Other questions?

More Related Content

What's hot

2 Data-mining process
2   Data-mining process2   Data-mining process
2 Data-mining process
Mahmoud Alfarra
 
Data analytics
Data analyticsData analytics
Data analytics
Bhanu Pratap
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Unit 2
Unit 2Unit 2
Data analytics
Data analyticsData analytics
Data analytics
Dr.Bhuvaneswari Velumani
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
hktripathy
 
Datamining
DataminingDatamining
Datamining
sumit621
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Edureka!
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
Jordan Open Source Association
 
Data analytics
Data analyticsData analytics
Data mining
Data miningData mining
Data mining
Daminda Herath
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
Mahmoud Alfarra
 
Research trends in data warehousing and data mining
Research trends in data warehousing and data miningResearch trends in data warehousing and data mining
Research trends in data warehousing and data mining
Er. Nawaraj Bhandari
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & Applications
Fazle Rabbi Ador
 
Data analytics
Data analyticsData analytics
Data mining
Data miningData mining
Data mining
Murniana Shazwen
 
BigData Analytics_1.7
BigData Analytics_1.7BigData Analytics_1.7
BigData Analytics_1.7
Rohit Mittal
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ANOOP V S
 
Data Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill SetData Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill Set
IDEAS - Int'l Data Engineering and Science Association
 

What's hot (19)

2 Data-mining process
2   Data-mining process2   Data-mining process
2 Data-mining process
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Unit 2
Unit 2Unit 2
Unit 2
 
Data analytics
Data analyticsData analytics
Data analytics
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Datamining
DataminingDatamining
Datamining
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data mining
Data miningData mining
Data mining
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
Research trends in data warehousing and data mining
Research trends in data warehousing and data miningResearch trends in data warehousing and data mining
Research trends in data warehousing and data mining
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & Applications
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data mining
Data miningData mining
Data mining
 
BigData Analytics_1.7
BigData Analytics_1.7BigData Analytics_1.7
BigData Analytics_1.7
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill SetData Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill Set
 

Viewers also liked

Analysis of Interviews
Analysis of InterviewsAnalysis of Interviews
Analysis of Interviews
Damian T. Gordon
 
Interviews and Surveys
Interviews and SurveysInterviews and Surveys
Interviews and Surveys
Damian T. Gordon
 
Introduction to Interviewing
Introduction to InterviewingIntroduction to Interviewing
Introduction to Interviewing
Damian T. Gordon
 
Doing a Literature Review - Part 3
Doing a Literature Review - Part 3Doing a Literature Review - Part 3
Doing a Literature Review - Part 3
Damian T. Gordon
 
Doing a Literature Review - Part 4
Doing a Literature Review - Part 4Doing a Literature Review - Part 4
Doing a Literature Review - Part 4
Damian T. Gordon
 
Introduction to Statistics - Part 2
Introduction to Statistics - Part 2Introduction to Statistics - Part 2
Introduction to Statistics - Part 2
Damian T. Gordon
 
Doing a Literature Review - Part 1
Doing a Literature Review - Part 1Doing a Literature Review - Part 1
Doing a Literature Review - Part 1
Damian T. Gordon
 
HEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative Studies
HEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative StudiesHEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative Studies
HEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative Studies
Dr. Khaled OUANES
 
Experimental study of precast portal frame
Experimental study of precast portal frameExperimental study of precast portal frame
Experimental study of precast portal frame
Satish Kambaliya
 
Introduction To The Research Method
Introduction To The Research MethodIntroduction To The Research Method
Introduction To The Research Method
Prof (Dr.) Chamaru De Alwis
 
Qualitative Research Methods by Paulino Silva - ECSM2015
Qualitative Research Methods by Paulino Silva - ECSM2015Qualitative Research Methods by Paulino Silva - ECSM2015
Qualitative Research Methods by Paulino Silva - ECSM2015
Paulino Silva
 
CCAO Presentation
CCAO PresentationCCAO Presentation
CCAO Presentation
Edward Cameron
 
Sri lanka tracer study and impact assessment synthesis
Sri lanka   tracer study and impact assessment synthesisSri lanka   tracer study and impact assessment synthesis
Sri lanka tracer study and impact assessment synthesis
imecommunity
 
Lao pdr tracer study and impact assessment synthesis
Lao pdr   tracer study and impact assessment synthesisLao pdr   tracer study and impact assessment synthesis
Lao pdr tracer study and impact assessment synthesis
imecommunity
 
[Japanese] Style validator-html5etcstudy20151125
[Japanese] Style validator-html5etcstudy20151125[Japanese] Style validator-html5etcstudy20151125
[Japanese] Style validator-html5etcstudy20151125
Takeharu Igari
 
Steel sm
Steel smSteel sm
Steel sm
imecommunity
 
Eziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESIS
Eziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESISEziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESIS
Eziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESISPAHUPDATE
 
02 indonesia tracer study and impact assessment synthesis
02 indonesia   tracer study and impact assessment synthesis02 indonesia   tracer study and impact assessment synthesis
02 indonesia tracer study and impact assessment synthesis
imecommunity
 
Plat 05
Plat 05Plat 05
Plat 05
imecommunity
 

Viewers also liked (20)

Analysis of Interviews
Analysis of InterviewsAnalysis of Interviews
Analysis of Interviews
 
Interviews and Surveys
Interviews and SurveysInterviews and Surveys
Interviews and Surveys
 
Introduction to Interviewing
Introduction to InterviewingIntroduction to Interviewing
Introduction to Interviewing
 
Doing a Literature Review - Part 3
Doing a Literature Review - Part 3Doing a Literature Review - Part 3
Doing a Literature Review - Part 3
 
Doing a Literature Review - Part 4
Doing a Literature Review - Part 4Doing a Literature Review - Part 4
Doing a Literature Review - Part 4
 
Introduction to Statistics - Part 2
Introduction to Statistics - Part 2Introduction to Statistics - Part 2
Introduction to Statistics - Part 2
 
Doing a Literature Review - Part 1
Doing a Literature Review - Part 1Doing a Literature Review - Part 1
Doing a Literature Review - Part 1
 
HEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative Studies
HEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative StudiesHEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative Studies
HEALTHCARE RESEARCH METHODS: Experimental Studies and Qualitative Studies
 
Experimental study of precast portal frame
Experimental study of precast portal frameExperimental study of precast portal frame
Experimental study of precast portal frame
 
Introduction To The Research Method
Introduction To The Research MethodIntroduction To The Research Method
Introduction To The Research Method
 
Qualitative Research Methods by Paulino Silva - ECSM2015
Qualitative Research Methods by Paulino Silva - ECSM2015Qualitative Research Methods by Paulino Silva - ECSM2015
Qualitative Research Methods by Paulino Silva - ECSM2015
 
CCAO Presentation
CCAO PresentationCCAO Presentation
CCAO Presentation
 
Sri lanka tracer study and impact assessment synthesis
Sri lanka   tracer study and impact assessment synthesisSri lanka   tracer study and impact assessment synthesis
Sri lanka tracer study and impact assessment synthesis
 
Lao pdr tracer study and impact assessment synthesis
Lao pdr   tracer study and impact assessment synthesisLao pdr   tracer study and impact assessment synthesis
Lao pdr tracer study and impact assessment synthesis
 
[Japanese] Style validator-html5etcstudy20151125
[Japanese] Style validator-html5etcstudy20151125[Japanese] Style validator-html5etcstudy20151125
[Japanese] Style validator-html5etcstudy20151125
 
Introduction to HTML
Introduction to HTMLIntroduction to HTML
Introduction to HTML
 
Steel sm
Steel smSteel sm
Steel sm
 
Eziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESIS
Eziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESISEziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESIS
Eziopatogenesi Ipertensione Polmonare Arteriosa-PAH ETIOPATHOGENESIS
 
02 indonesia tracer study and impact assessment synthesis
02 indonesia   tracer study and impact assessment synthesis02 indonesia   tracer study and impact assessment synthesis
02 indonesia tracer study and impact assessment synthesis
 
Plat 05
Plat 05Plat 05
Plat 05
 

Similar to Some Questions About Your Data

introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
Johnson Ubah
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
IvanHo572682
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
PothyeswariPothyes
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
sumit621
 
Analytics for actuaries cia
Analytics for actuaries ciaAnalytics for actuaries cia
Analytics for actuaries cia
Kevin Pledge
 
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Kevin Pledge
 
Data mining
Data miningData mining
Data mining
Daminda Herath
 
Part1
Part1Part1
Part1
sumit621
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
Precisely
 
Introduction of Data Science and Data Analytics
Introduction of Data Science and Data AnalyticsIntroduction of Data Science and Data Analytics
Introduction of Data Science and Data Analytics
VrushaliSolanke
 
Data mining
Data miningData mining
Data mining
Ujjwal Kumar
 
Technical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfTechnical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdf
Shristi Shrestha
 
data mining
data miningdata mining
data mining
manasa polu
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)
NikitaRajbhoj
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and Issues
Karan Deep Singh
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
Albert Anthony Gavino, MBA
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
ranjit banshpal
 
BDA 2012 Big data why the big fuss?
BDA 2012 Big data why the big fuss?BDA 2012 Big data why the big fuss?
BDA 2012 Big data why the big fuss?
Christopher Bradley
 
Data Mining
Data MiningData Mining
Data Mining
Gary Stefan
 
Talk
TalkTalk
Talk
sumit621
 

Similar to Some Questions About Your Data (20)

introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
Analytics for actuaries cia
Analytics for actuaries ciaAnalytics for actuaries cia
Analytics for actuaries cia
 
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
 
Data mining
Data miningData mining
Data mining
 
Part1
Part1Part1
Part1
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
 
Introduction of Data Science and Data Analytics
Introduction of Data Science and Data AnalyticsIntroduction of Data Science and Data Analytics
Introduction of Data Science and Data Analytics
 
Data mining
Data miningData mining
Data mining
 
Technical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfTechnical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdf
 
data mining
data miningdata mining
data mining
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and Issues
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
BDA 2012 Big data why the big fuss?
BDA 2012 Big data why the big fuss?BDA 2012 Big data why the big fuss?
BDA 2012 Big data why the big fuss?
 
Data Mining
Data MiningData Mining
Data Mining
 
Talk
TalkTalk
Talk
 

More from Damian T. Gordon

Universal Design for Learning, Co-Designing with Students.
Universal Design for Learning, Co-Designing with Students.Universal Design for Learning, Co-Designing with Students.
Universal Design for Learning, Co-Designing with Students.
Damian T. Gordon
 
Introduction to Microservices
Introduction to MicroservicesIntroduction to Microservices
Introduction to Microservices
Damian T. Gordon
 
REST and RESTful Services
REST and RESTful ServicesREST and RESTful Services
REST and RESTful Services
Damian T. Gordon
 
Serverless Computing
Serverless ComputingServerless Computing
Serverless Computing
Damian T. Gordon
 
Cloud Identity Management
Cloud Identity ManagementCloud Identity Management
Cloud Identity Management
Damian T. Gordon
 
Containers and Docker
Containers and DockerContainers and Docker
Containers and Docker
Damian T. Gordon
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
Damian T. Gordon
 
Introduction to ChatGPT
Introduction to ChatGPTIntroduction to ChatGPT
Introduction to ChatGPT
Damian T. Gordon
 
How to Argue Logically
How to Argue LogicallyHow to Argue Logically
How to Argue Logically
Damian T. Gordon
 
Evaluating Teaching: SECTIONS
Evaluating Teaching: SECTIONSEvaluating Teaching: SECTIONS
Evaluating Teaching: SECTIONS
Damian T. Gordon
 
Evaluating Teaching: MERLOT
Evaluating Teaching: MERLOTEvaluating Teaching: MERLOT
Evaluating Teaching: MERLOT
Damian T. Gordon
 
Evaluating Teaching: Anstey and Watson Rubric
Evaluating Teaching: Anstey and Watson RubricEvaluating Teaching: Anstey and Watson Rubric
Evaluating Teaching: Anstey and Watson Rubric
Damian T. Gordon
 
Evaluating Teaching: LORI
Evaluating Teaching: LORIEvaluating Teaching: LORI
Evaluating Teaching: LORI
Damian T. Gordon
 
Designing Teaching: Pause Procedure
Designing Teaching: Pause ProcedureDesigning Teaching: Pause Procedure
Designing Teaching: Pause Procedure
Damian T. Gordon
 
Designing Teaching: ADDIE
Designing Teaching: ADDIEDesigning Teaching: ADDIE
Designing Teaching: ADDIE
Damian T. Gordon
 
Designing Teaching: ASSURE
Designing Teaching: ASSUREDesigning Teaching: ASSURE
Designing Teaching: ASSURE
Damian T. Gordon
 
Designing Teaching: Laurilliard's Learning Types
Designing Teaching: Laurilliard's Learning TypesDesigning Teaching: Laurilliard's Learning Types
Designing Teaching: Laurilliard's Learning Types
Damian T. Gordon
 
Designing Teaching: Gagne's Nine Events of Instruction
Designing Teaching: Gagne's Nine Events of InstructionDesigning Teaching: Gagne's Nine Events of Instruction
Designing Teaching: Gagne's Nine Events of Instruction
Damian T. Gordon
 
Designing Teaching: Elaboration Theory
Designing Teaching: Elaboration TheoryDesigning Teaching: Elaboration Theory
Designing Teaching: Elaboration Theory
Damian T. Gordon
 
Universally Designed Learning Spaces: Some Considerations
Universally Designed Learning Spaces: Some ConsiderationsUniversally Designed Learning Spaces: Some Considerations
Universally Designed Learning Spaces: Some Considerations
Damian T. Gordon
 

More from Damian T. Gordon (20)

Universal Design for Learning, Co-Designing with Students.
Universal Design for Learning, Co-Designing with Students.Universal Design for Learning, Co-Designing with Students.
Universal Design for Learning, Co-Designing with Students.
 
Introduction to Microservices
Introduction to MicroservicesIntroduction to Microservices
Introduction to Microservices
 
REST and RESTful Services
REST and RESTful ServicesREST and RESTful Services
REST and RESTful Services
 
Serverless Computing
Serverless ComputingServerless Computing
Serverless Computing
 
Cloud Identity Management
Cloud Identity ManagementCloud Identity Management
Cloud Identity Management
 
Containers and Docker
Containers and DockerContainers and Docker
Containers and Docker
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
 
Introduction to ChatGPT
Introduction to ChatGPTIntroduction to ChatGPT
Introduction to ChatGPT
 
How to Argue Logically
How to Argue LogicallyHow to Argue Logically
How to Argue Logically
 
Evaluating Teaching: SECTIONS
Evaluating Teaching: SECTIONSEvaluating Teaching: SECTIONS
Evaluating Teaching: SECTIONS
 
Evaluating Teaching: MERLOT
Evaluating Teaching: MERLOTEvaluating Teaching: MERLOT
Evaluating Teaching: MERLOT
 
Evaluating Teaching: Anstey and Watson Rubric
Evaluating Teaching: Anstey and Watson RubricEvaluating Teaching: Anstey and Watson Rubric
Evaluating Teaching: Anstey and Watson Rubric
 
Evaluating Teaching: LORI
Evaluating Teaching: LORIEvaluating Teaching: LORI
Evaluating Teaching: LORI
 
Designing Teaching: Pause Procedure
Designing Teaching: Pause ProcedureDesigning Teaching: Pause Procedure
Designing Teaching: Pause Procedure
 
Designing Teaching: ADDIE
Designing Teaching: ADDIEDesigning Teaching: ADDIE
Designing Teaching: ADDIE
 
Designing Teaching: ASSURE
Designing Teaching: ASSUREDesigning Teaching: ASSURE
Designing Teaching: ASSURE
 
Designing Teaching: Laurilliard's Learning Types
Designing Teaching: Laurilliard's Learning TypesDesigning Teaching: Laurilliard's Learning Types
Designing Teaching: Laurilliard's Learning Types
 
Designing Teaching: Gagne's Nine Events of Instruction
Designing Teaching: Gagne's Nine Events of InstructionDesigning Teaching: Gagne's Nine Events of Instruction
Designing Teaching: Gagne's Nine Events of Instruction
 
Designing Teaching: Elaboration Theory
Designing Teaching: Elaboration TheoryDesigning Teaching: Elaboration Theory
Designing Teaching: Elaboration Theory
 
Universally Designed Learning Spaces: Some Considerations
Universally Designed Learning Spaces: Some ConsiderationsUniversally Designed Learning Spaces: Some Considerations
Universally Designed Learning Spaces: Some Considerations
 

Recently uploaded

How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
Celine George
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
Celine George
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
PECB
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
eBook.com.bd (প্রয়োজনীয় বাংলা বই)
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
ak6969907
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
Assessment and Planning in Educational technology.pptx
Assessment and Planning in Educational technology.pptxAssessment and Planning in Educational technology.pptx
Assessment and Planning in Educational technology.pptx
Kavitha Krishnan
 


Some Questions About Your Data

  • 1. Some Key Questions about your Data. Brian Mac Namee, Brendan Tierney, Damian Gordon
  • 2. The Data: If data is the key consideration in your research, it is important to consider several questions about it (although not all projects will necessarily be concerned with large datasets).
  • 3. Overview: How suitable is the data? What is the type of the data? Where will you get it from? What size is the dataset? What format is it in? How much cleaning is required? What is the quality of the data? How do you deal with missing data? How will you evaluate your analysis? etc.
  • 4. Suitability: Dataset. Determining the suitability of the data is a vital consideration. It is not sufficient to simply locate a dataset that is thematically linked to your research question; it must be appropriate for exploring the questions that you want to ask. For example, just because you want to do credit card fraud detection and you have a dataset that contains credit card transactions, or that was used in another credit card fraud project, does not mean that it will be suitable for your project.
  • 5. Suitability: Labelling. Is the data already labelled? This is very important for supervised learning problems. To take the credit card fraud example again, you can probably get as many credit card transactions as you like, but you probably won't be able to get them marked up as fraudulent and non-fraudulent.
  • 6. Suitability: Labelling. The same thing goes for a lot of text analytics problems: can you get people to label thousands of documents as interesting or non-interesting to them so that you can train a predictive model? The availability of labelled data is a key consideration for any supervised learning problem. The areas of semi-supervised learning and active learning try to address this problem and have some very interesting open research questions.
  • 7. Suitability: Labelling. Two important considerations: (1) The Curse of Dimensionality: when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. In order to obtain a statistically sound result, the amount of data you need often grows exponentially with the dimensionality. (2) The No Free Lunch Theorem: classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems.
  • 8. Suitability: Labelling. Also remember that, for labelling, you might be aiming for one of three goals: (1) Binary classification: assigning each data item to one of two categories. (2) Multiclass classification: assigning each data item to one of more than two categories. (3) Multi-label classification: assigning each data item multiple target labels.
  • 9. Types of Data: Federated data; High dimensional data; Descriptive data; Longitudinal data; Streaming data; Web (scraped) data; Numeric vs. categorical vs. text data; etc.
  • 10. Locating Datasets: http://researchmethodsdataanalysis.blogsp e.g. http://www.kdnuggets.com/datasets/ ; http://www.google.com/publicdata/directory ; http://opendata.ie/ ; http://lib.stat.cmu.edu/datasets/
  • 11. Size of the Dataset: What is a reasonable size for a dataset? Obviously it varies a lot from problem to problem, but in general we would recommend at least 10 features (columns) in the dataset, and we'd like to see thousands of instances.
  • 12. Format of the Data: TXT (Text file); MIME (Multipurpose Internet Mail Extensions); XML (Extensible Markup Language); CSV (Comma-Separated Values); ASCII (American Standard Code for Information Interchange); etc.
  • 13. Cleaning of Data: Parsing; Correcting; Standardizing; Matching; Consolidating
  • 14. Quality of the Data: Frequency counts; Descriptive statistics (mean, standard deviation, median); Normality (skewness, kurtosis, frequency histograms, normal probability plots); Associations (correlations, scatter plots)
  • 15. Missing Data? Imputation; Partial imputation; Partial deletion; Full analysis. Also consider database nullology.
  • 16. Evaluating the Analysis: How confident are you in the outcomes of your analysis? Area under the Curve; Misclassification Error; Confusion Matrix; N-fold Cross Validation; Test predictions against real-world data
  • 17. The Data: Other questions?
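
As a rough illustration of the checks named on slides 14 to 16, the sketch below computes frequency counts and descriptive statistics, fills missing values, builds a confusion matrix with its misclassification error, and produces N-fold cross-validation splits. It uses only the Python standard library. The function names and the sample values are hypothetical and do not come from the slides, and mean imputation is only one simple form of imputation, chosen here as an assumption for brevity.

```python
from collections import Counter
from statistics import mean, median, stdev


def describe(values):
    """Frequency counts and descriptive statistics, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "counts": Counter(present),
        "mean": mean(present),
        "median": median(present),
        "stdev": stdev(present),
    }


def impute_mean(values):
    """Replace each missing (None) value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def misclassification_error(actual, predicted):
    """Fraction of items whose predicted label differs from the actual label."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def k_fold_indices(n_items, n_folds):
    """Yield (train, test) index lists; each item lands in exactly one test fold."""
    for fold in range(n_folds):
        test = list(range(fold, n_items, n_folds))
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test


if __name__ == "__main__":
    ages = [23, None, 31, 27, None, 31]  # hypothetical column with gaps
    print(describe(ages))
    print(impute_mean(ages))

    actual = ["fraud", "ok", "ok", "fraud"]  # hypothetical labels
    predicted = ["fraud", "ok", "fraud", "ok"]
    print(confusion_matrix(actual, predicted))
    print(misclassification_error(actual, predicted))

    for train, test in k_fold_indices(6, 3):
        print(train, test)
```

In a real project a library such as pandas or scikit-learn would normally replace these helpers; the point here is only to make the terms on the slides concrete.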