Some Questions About Your Data

Published in: Education, Technology

  1. Some Key Questions about Your Data
     Brian Mac Namee, Brendan Tierney, Damian Gordon
  2. The Data
     If the data is the key consideration in your research, there are several questions it is important to consider. (Not all projects will necessarily be concerned with large datasets; these questions apply to those that are.)
  3. Overview
     • How suitable is the data?
     • What is the type of the data?
     • Where will you get it from?
     • What size is the dataset?
     • What format is it in?
     • How much cleaning is required?
     • What is the quality of the data?
     • How do you deal with missing data?
     • How will you evaluate your analysis?
     • etc.
  4. Suitability: Dataset
     Determining the suitability of the data is a vital consideration. It is not sufficient to simply locate a dataset that is thematically linked to your research question; it must be appropriate for exploring the questions that you want to ask. For example, just because you want to do credit card fraud detection and you have a dataset that contains credit card transactions, or that was used in another credit card fraud project, does not mean that it will be suitable for your project.
  5. Suitability: Labelling
     Is the data already labelled? This is very important for supervised learning problems. To take the credit card fraud example again, you can probably get as many credit card transactions as you like, but you probably won't be able to get them marked up as fraudulent or non-fraudulent.
  6. Suitability: Labelling
     The same thing goes for a lot of text analytics problems: can you get people to label thousands of documents as being interesting or non-interesting to them so that you can train a predictive model? The availability of labelled data is a key consideration for any supervised learning problem. The areas of semi-supervised learning and active learning try to address this problem and have some very interesting open research questions.
  7. Suitability: Labelling
     Two important considerations:
     • The Curse of Dimensionality: When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. In order to obtain a statistically sound result, the amount of data you need often grows exponentially with the dimensionality.
     • The No Free Lunch Theorem: Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems.
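
The exponential growth mentioned under the Curse of Dimensionality can be made concrete with a small sketch. It assumes, purely for illustration, that data are spread uniformly over a unit hypercube; the function names and the choice of 10 grid points per axis are not from the slides.

```python
# Illustrative sketch only: assumes data spread uniformly over a unit hypercube.

def samples_needed(points_per_axis: int, dimensions: int) -> int:
    """Grid points needed to cover the unit hypercube at a fixed resolution."""
    return points_per_axis ** dimensions


def neighbourhood_edge(fraction: float, dimensions: int) -> float:
    """Edge length of a sub-cube that contains `fraction` of uniform data."""
    return fraction ** (1.0 / dimensions)


if __name__ == "__main__":
    # As the dimensionality rises, the required sample count explodes and a
    # "local" neighbourhood holding 1% of the data spans most of each axis.
    for d in (1, 2, 5, 10):
        print(d, samples_needed(10, d), round(neighbourhood_edge(0.01, d), 3))
```

With a fixed number of instances, each added feature therefore leaves the data sparser, which is one reason to check the ratio of instances to features before committing to a dataset.
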
  8. Suitability: Labelling
     Also remember for labelling, you might be aiming for one of three goals:
     • Binary classification: classifying each data item into one of two categories.
     • Multiclass classification: classifying each data item into one of more than two categories.
     • Multi-label classification: assigning each data item one or more target labels.
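
The difference between the three goals shows up in the shape of the target you need to collect. The toy labels below are hypothetical (the class names are invented for illustration), and the indicator-matrix encoding is one common convention rather than the only one.

```python
# Hypothetical toy targets for three data items under each labelling goal.
binary_labels = [0, 1, 0]                          # one of two categories
multiclass_labels = ["sport", "politics", "tech"]  # exactly one of several categories
multilabel_labels = [{"sport"}, {"politics", "tech"}, set()]  # zero or more labels each


def to_indicator_matrix(label_sets, classes):
    """Encode multi-label targets as one 0/1 column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


classes = ["politics", "sport", "tech"]
indicator = to_indicator_matrix(multilabel_labels, classes)
```

A multi-label dataset needs one judgement per item per label, so the labelling effort discussed above grows with the number of labels as well as the number of items.
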
  9. Types of Data
     • Federated data
     • High dimensional data
     • Descriptive data
     • Longitudinal data
     • Streaming data
     • Web (scraped) data
     • Numeric vs. categorical vs. text data
     • etc.
  10. Locating Datasets
      • http://researchmethodsdataanalysis.blogsp
      • e.g.
      • http://www.kdnuggets.com/datasets/
      • http://www.google.com/publicdata/directory
      • http://opendata.ie/
      • http://lib.stat.cmu.edu/datasets/
  11. Size of the Dataset
      What is a reasonable size for a dataset? Obviously it varies a lot from problem to problem, but in general we would recommend at least 10 features (columns) in the dataset, and we'd like to see thousands of instances.
  12. Format of the Data
      • TXT (Text file)
      • MIME (Multipurpose Internet Mail Extensions)
      • XML (Extensible Markup Language)
      • CSV (Comma-Separated Values)
      • ASCII (American Standard Code for Information Interchange)
      • etc.
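
Whatever format the data arrives in, it usually helps to load it into one common in-memory shape early on. The sketch below uses Python's standard `csv` and `xml.etree.ElementTree` modules; the field names and the two inline samples are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical samples of the same two transactions in CSV and XML form.
csv_text = "id,amount\n1,10.5\n2,99.0\n"
xml_text = (
    "<transactions>"
    "<t id='1' amount='10.5'/>"
    "<t id='2' amount='99.0'/>"
    "</transactions>"
)

# Parse both sources into the same list-of-dicts shape with typed values.
csv_rows = [
    {"id": int(row["id"]), "amount": float(row["amount"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]
xml_rows = [
    {"id": int(t.get("id")), "amount": float(t.get("amount"))}
    for t in ET.fromstring(xml_text)
]
```

Once both sources share a representation, the later cleaning and quality checks only need to be written once.
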
  13. Cleaning of Data
      • Parsing
      • Correcting
      • Standardizing
      • Matching
      • Consolidating
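
A minimal sketch of how some of these cleaning steps might fit together is shown below. The records, the `;` separator and the two date formats are invented for illustration, and the correcting step (for example, fixing invalid values) is left out to keep the example short.

```python
from datetime import datetime

# Hypothetical raw records; the first two describe the same event.
raw_records = [
    "  Mary O'Brien ;DUBLIN; 2013-01-05",
    "mary o'brien;Dublin ;05/01/2013",
    "John Smith;Cork;2013-02-10",
]


def parse(line):
    """Parsing: split a raw line into named, whitespace-trimmed fields."""
    name, city, date = (field.strip() for field in line.split(";"))
    return {"name": name, "city": city, "date": date}


def standardize(record):
    """Standardizing: consistent letter case and ISO 8601 dates."""
    date = record["date"]
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            date = datetime.strptime(record["date"], fmt).date().isoformat()
            break
        except ValueError:
            continue  # try the next assumed format; keep the raw value if none fit
    return {"name": record["name"].title(), "city": record["city"].title(), "date": date}


def consolidate(records):
    """Matching and consolidating: keep one record per standardized key."""
    merged = {}
    for record in records:
        key = (record["name"], record["city"], record["date"])
        merged.setdefault(key, record)
    return list(merged.values())


clean = consolidate(standardize(parse(line)) for line in raw_records)
```

Exact matching on standardized fields is the simplest possible rule; real datasets often need fuzzier matching, which is a project decision in its own right.
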
  14. Quality of the Data
      • Frequency counts
      • Descriptive statistics (mean, standard deviation, median)
      • Normality (skewness, kurtosis, frequency histograms, normal probability plots)
      • Associations (correlations, scatter plots)
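
Several of these checks can be sketched with the Python standard library alone. The sample values below are invented, and the skewness and kurtosis functions use the population (biased) moment definitions, which is one of several conventions in use.

```python
from collections import Counter
from statistics import mean, median, pstdev


def skewness(xs):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def excess_kurtosis(xs):
    """Population excess kurtosis: mean fourth-power z-score minus 3."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical columns: one numeric with an outlier, one categorical.
amounts = [10.0, 12.0, 11.0, 13.0, 95.0]
categories = ["food", "food", "travel", "food", "travel"]

frequency = Counter(categories)                       # frequency counts
summary = (mean(amounts), pstdev(amounts), median(amounts))
```

Here the single large value pulls the mean well above the median and gives a strongly positive skewness, the kind of signal that should prompt a closer look before modelling.
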
  15. Missing Data?
      • Imputation
      • Partial imputation
      • Partial deletion
      • Full analysis
      • Also consider database nullology
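
Two of the simplest strategies can be sketched as follows: listwise deletion (usually counted as a form of partial deletion) and mean imputation. The rows and column names are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]


def listwise_delete(rows):
    """Deletion: drop every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows, column):
    """Imputation: fill missing values in `column` with the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    # Build new dicts so the original rows are left untouched.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


complete_rows = listwise_delete(rows)
imputed_rows = mean_impute(rows, "age")
```

Deletion halves this toy dataset, while mean imputation keeps every row at the cost of understating the variance of the imputed column; which trade-off is acceptable depends on the analysis.
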
  16. Evaluating the Analysis
      How confident are you in the outcomes of your analysis?
      • Area under the Curve
      • Misclassification Error
      • Confusion Matrix
      • N-fold Cross Validation
      • Test predictions using the real world
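
The first four measures can be sketched in plain Python for a binary problem. The labels and scores below are invented; the area under the curve is taken to mean the area under the ROC curve and is computed by pairwise comparison, and the fold split is a simple interleaved one without shuffling.

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives/negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == 1 and p == 1 for t, p in pairs),
        "fp": sum(t == 0 and p == 1 for t, p in pairs),
        "fn": sum(t == 1 and p == 0 for t, p in pairs),
        "tn": sum(t == 0 and p == 0 for t, p in pairs),
    }


def misclassification_error(y_true, y_pred):
    """Fraction of items whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve: the chance a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical true labels and classifier scores, thresholded at 0.5.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
```

Reporting the confusion matrix alongside a single-number measure matters most when the classes are imbalanced, as in the fraud example, where a low error rate can hide a classifier that never predicts the rare class.
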
  17. The Data
      Other questions?