MadridJUG Mineria de Datos-Data Mining.09.may.2013

Introduction to data mining for the Madrid Java User Group

Published in: Technology

  1. IT [1] @gsantosgo Information Technology
     Madrid JUG - Data Mining with Weka (Minería de Datos sobre Weka)
     9 May 2013
     Jose María Gómez Hidalgo (@jmgomez)
     Guillermo Santos García (@gsantosgo)
     DATA MINING
  2. INDEX
     Madrid JUG - Data Mining with Weka
     1. Artificial Intelligence. Conceptual Map
        1.1 Knowledge Based System vs. Machine Learning System
     2. Data Mining Process
        2.1 Machine Learning
            2.1.1 Supervised Machine Learning
            2.1.2 Unsupervised Machine Learning
            2.1.3 The Top Ten Algorithms in Data Mining
     3. Tools
        3.1 WEKA (Waikato Environment for Knowledge Analysis)
        3.2 R (#RStats)
        3.3 RapidMiner
        3.4 KNIME Desktop
        3.5 Orange
        3.6 Polls
            3.6.1 What programming/statistics languages you used for analytics / data mining in the past 12 months? [579 voters] (Aug 2012)
            3.6.2 What Analytics, Data mining, Big Data software you used in the past 12 months for a real project? (May 2012)
     4. Examples
        4.1 Predicting House Prices
        4.2 Lending Club
        4.3 Spam or Ham Email
        4.4 Handwritten Digit Recognition
  3.    4.5 Human Activity Recognition using Smartphones
        4.6 Inventory
        4.7 Image Classification
        4.8 Clustering
     5. Supervised Machine Learning
     6. Evaluation
        6.1 Random Subsampling
        6.2 Cross Validation (K-FOLD)
        6.3 Confusion Matrix
     A.1 What is a DATASET?
     A.2 Types of variables
  4. 1. Artificial Intelligence. Conceptual Map
     Link: http://en.wikipedia.org/wiki/Artificial_intelligence
     DATA MINING. LEARN FROM DATA
     Concept map nodes: Artificial Intelligence, Problem Solving, Search Methods, Logic, Agents, Fuzzy Logic, Automatic Classification, Information Retrieval, Filtering, Automatic Categorization, Knowledge Based System, Expert System, Knowledge Representation, Data Mining, Data Acquisition, Machine Learning (Supervised, Unsupervised), Natural Language Processing (Statistical NLP, Knowledge Based NLP), Robotics
  5. 1.1 Knowledge Based System vs. Machine Learning System
     Knowledge Based System (Expert System)
     - Rules are codified manually (they represent knowledge).
     - Experts (an expert is a person with extensive knowledge about the domain).
     - Cost.
     Expert System (Credit Expert System)
     If (Annual Income > 3 * Annual Debt) Then CREDIT = YES
     Annual Income   Annual Debt   Credit
     42,000 €        15,000 €      NO
     37,000 €        12,000 €      YES
     80,000 €        40,500 €      NO
     150,000 €       45,000 €      YES
     Machine Learning System
     - The manual process is automated.
     - No experts are involved.
     - We take advantage of data classified manually over the years.
     - Training phase and testing phase.
     - At first, machine learning systems aren't as accurate as knowledge based systems; however, they can evolve and get better over time (e.g. spam detection).
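The credit rule on this slide can be checked against its own table with a few lines of code. This is only a sketch: the function and variable names are illustrative and do not come from the slides.

```python
# Rows from the slide's table: (annual income, annual debt, credit decision).
APPLICANTS = [
    (42_000, 15_000, "NO"),
    (37_000, 12_000, "YES"),
    (80_000, 40_500, "NO"),
    (150_000, 45_000, "YES"),
]


def credit_rule(annual_income, annual_debt):
    """Hand-coded expert rule: grant credit if income exceeds 3x debt."""
    return "YES" if annual_income > 3 * annual_debt else "NO"


# Apply the rule to every row and compare with the decision on the slide.
decisions = [credit_rule(income, debt) for income, debt, _ in APPLICANTS]
for (income, debt, expected), decision in zip(APPLICANTS, decisions):
    print(f"{income:>7} {debt:>6} -> {decision} (slide says {expected})")
```

A machine learning system would instead try to recover such a threshold from the labeled rows rather than having an expert write it down.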
  6. 2. Data Mining Process
     KDD (Knowledge Discovery in Databases)
     Source: From Data Mining to Knowledge Discovery in Databases (Fayyad, 1997)
     1. Selection. Select the relevant data.
     2. Preprocessing.
     3. Transformation.
     4. Data Mining. Building models and patterns (MODELLING).
     5. Interpretation/Evaluation. Evaluation and results.
     The term DATA MINING sometimes refers to the complete KDD process, and sometimes only to the MODELLING phase (4). This phase mainly applies Machine Learning algorithms.
  7. 2.1 Machine Learning
     Aim. Build programs capable of generalizing behavior from weakly structured information.
     2.1.1 Supervised Machine Learning
     Aim. Predict the value of a variable based on a number of input variables.
     - Regression Problem.
     - Classification Problem.
     Result: PREDICTIVE MODELS or CLASSIFIERS.
     Figure labels: DATA, ATTRIBUTES, PREDICTIVE MODELS, DESCRIPTIVE MODELS
  8. 2.1.2 Unsupervised Machine Learning
     Aim. Describe patterns or associations among a set of input measures.
     - Patterns or Associations
     - Clustering
     Result: DESCRIPTIVE MODELS
  9. 2.1.3 The Top Ten Algorithms in Data Mining
     IEEE International Conference on Data Mining (ICDM). http://www.cs.uvm.edu/~icdm/
     The most influential algorithms used in the Data Mining community:
     1. C4.5 (Decision Tree).
     2. K-Means.
     3. Support Vector Machine (SVM). The best generalization ability.
     4. Apriori. Finds frequent itemsets in a transaction dataset and derives association rules.
     5. EM (Expectation-Maximization). Pattern recognition.
     6. PageRank. Link-based ranking algorithm, which also powers the Google search engine.
     7. AdaBoost.
     8. k-Nearest Neighbors (k-NN).
     9. Naïve Bayes.
     10. CART. Classification and Regression Trees.
     Source: http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
  10. 3. Tools
      3.1 WEKA (Waikato Environment for Knowledge Analysis)
      http://www.cs.waikato.ac.nz/ml/weka/
      - Data mining software in Java.
      - Implemented in Java
      - Multi-platform
      - GUI (with limitations)
      - GPL License.
      - University of Waikato, New Zealand
      3.2 R (#RStats)
      http://www.r-project.org/
      R is a language and environment for statistical computing and graphics.
      - S Language (Bell Laboratories)
      - Implemented in C/C++
      - Highly extensible. R can be extended via packages.
      - R Environment. Uses a command line interface (no GUI).
      - RStudio. Graphical User Interface (GUI)
      - GPL License.
      - Created at the University of Auckland, New Zealand, and currently developed by the R Development Core Team
      Links: How R grows
      Books: Machine Learning for Hackers; The Elements of Statistical Learning: Data Mining, Inference and Prediction; OpenIntro Statistics
      Enterprises: Revolution Analytics, Oracle R Enterprise, …
      Downloads: R for Linux, R for Mac OSX, R for Windows, RWeka
  11. 3.3 RapidMiner
      http://rapid-i.com/content/view/181/190/lang,en/
      - Open-source data mining and analysis system
      - Implemented in Java
      - Multi-platform
      - Weka machine learning library fully integrated.
      - Access to data sources: Excel, MySQL, Oracle
      - ETL
      - Reporting
      - Data Analysis
      - AGPL License
      - Created by Dortmund University of Technology
      3.4 KNIME Desktop
      http://www.knime.org/knime
      - Data analytics (data access, data transformation, predictive analytics, visualization and reporting).
      - Implemented in Java (based on the Eclipse Platform)
      - Reporting
      - ETL
      - KNIME Extensions: Excel support, R integration, Weka
      - GPL License
      - Konstanz University, Germany
      Links: R Extension for RapidMiner, R Extension for KNIME
  12. 3.5 Orange
      http://orange.biolab.si/
      - A component-based data mining and machine learning software suite
      - A visual programming front-end for explorative data analysis and visualization
      - Multi-platform.
      - Python
      - GPL License
      - University of Ljubljana, Slovenia
  13. 3.6 Polls
      3.6.1 What programming/statistics languages you used for analytics / data mining in the past 12 months? [579 voters] (Aug 2012)
      Source: http://www.kdnuggets.com/polls/2012/analytics-data-mining-programming-languages.html
      3.6.2 What Analytics, Data mining, Big Data software you used in the past 12 months for a real project? (May 2012)
  14. Source: http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
  15. 4. Examples
      4.1 Predicting House Prices
      Size   Price (K)
      80     70
      90     83
      100    74
      110    93
      140    89
      140    58
      150    85
      160    114
      180    95
      200    100
      240    138
      250    111
      270    124
      320    161
      350    172
      Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/predictHousePrice.md
      Regression Problem
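As a rough illustration of what "regression problem" means for this table, the sketch below fits a straight line to the size/price pairs by ordinary least squares in plain Python. The choice of a single-variable linear model and all function names are assumptions made for illustration; the linked analysis may proceed differently.

```python
# Size and price (in thousands) pairs as read from the slide's table.
SIZES = [80, 90, 100, 110, 140, 140, 150, 160, 180, 200, 240, 250, 270, 320, 350]
PRICES = [70, 83, 74, 93, 89, 58, 85, 114, 95, 100, 138, 111, 124, 161, 172]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(SIZES, PRICES)


def predict_price(size):
    """Predicted price (in thousands) for a house of the given size."""
    return intercept + slope * size


print(f"price ~ {intercept:.2f} + {slope:.3f} * size")
print(f"predicted price for size 200: {predict_price(200):.1f}")
```

The predicted value is continuous, which is what separates this example from the classification examples that follow.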
  16. 4.2 Lending Club
      Peer-to-peer lending company.
      What are the variables associated with the interest rate of a loan? Multivariate.
      Links:
      http://www.lendingclub.com/
      http://en.wikipedia.org/wiki/Lending_Club
      https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/loansLendingClub.md
      Regression Problem
  17. 4.3 Spam or Ham Email
      Links: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/spam.md
      Classification Problem
  18. 4.4 Handwritten Digit Recognition
      Identify the digits in a handwritten ZIP code from a digitized image.
      001 002 003 004 ... 015 016
      017 018 019 020 ... 031 032
      033 034 035 036 ... 047 048
       |   |   |   |  ...  |   |
      209 210 211 212 ... 223 224
      225 226 227 228 ... 239 240
      241 242 243 244 ... 255 256
      Each image is a 16 x 16 (256 pixels) 8-bit grayscale representation of a handwritten digit.
      http://www.kaggle.com/c/digit-recognizer
      Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/handwritten.md
      Classification Problem
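The pixel numbering in the grid (1 to 256, row by row) maps to a row and column with simple integer arithmetic. A minimal sketch, assuming 1-based numbering as drawn on the slide; the function names are illustrative.

```python
WIDTH = 16  # the slide describes 16 x 16 images, i.e. 256 pixels


def pixel_position(number):
    """Map a 1-based pixel number (1..256) to a 1-based (row, column)."""
    if not 1 <= number <= WIDTH * WIDTH:
        raise ValueError("pixel number out of range")
    row, col = divmod(number - 1, WIDTH)
    return row + 1, col + 1


def pixel_number(row, col):
    """Inverse mapping: 1-based (row, column) back to the pixel number."""
    return (row - 1) * WIDTH + col


print(pixel_position(17))   # first pixel of the second row
print(pixel_number(16, 16)) # last pixel of the image
```

Flattening each image this way gives 256 numeric attributes per digit, which is the form a classifier takes as input.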
  19. 4.5 Human Activity Recognition using Smartphones
      We used data obtained from the accelerometer and gyroscope sensor signals of smartphones:
      - 3-axial linear acceleration
      - 3-axial angular velocity
      We can monitor acceleration, position, rotation and angular motion.
      Activities: Laying, Sitting, Standing, Walk, WalkDown, WalkUp
  20. DataSet: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
      Source: Activity Recognition using Cell Phone Accelerometers
      http://www.cis.fordham.edu/wisdm/public_files/sensorKDD-2010.pdf
      Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/handwritten.md
      Classification Problem
      4.6 Inventory
      A large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
      Regression Problem
  21. 4.7 Image Classification
      Computer Vision (C.V.)
      Haralick texture features. Haralick described 14 statistics that can be calculated from the co-occurrence matrix with the intent of describing the texture of the image:
      - Angular Second Moment
      - Contrast
      - Correlation
      - ...
      Source: https://github.com/gsantosgo/RStats/tree/master/MadridJUG-DataMining/data/faces.arff
      Image labels: Alessandra Ambrosio, Jessica Alba, Megan Fox
      Links: http://murphylab.web.cmu.edu/publications/boland/boland_node26.html
      Classification Problem
  22. 4.8 Clustering
      Google News. News clustering.
      Source: http://news.google.es/
      Clustering Problem
  23. 5. Supervised Machine Learning
      Guide for Supervised Machine Learning
      Training Phase: Training DataSet -> Attributes Selection and Extraction -> Filtered DataSet -> Learning or Training -> Predictive Model or Classifier
      Testing Phase: Testing DataSet (real data) -> Filtering Attributes -> Filtered DataSet -> Classification -> Classified Data
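The two phases can be sketched end to end on toy data. Everything specific here is an assumption made for illustration: the attribute names (`links`, `exclamations`, `id`), the made-up rows, and the nearest-centroid learner, which stands in for whatever algorithm a real project would choose.

```python
# Hypothetical toy training rows (made-up values): attributes plus a class label.
TRAINING = [
    ({"id": 1, "links": 0, "exclamations": 0}, "ham"),
    ({"id": 2, "links": 1, "exclamations": 0}, "ham"),
    ({"id": 3, "links": 5, "exclamations": 4}, "spam"),
    ({"id": 4, "links": 7, "exclamations": 6}, "spam"),
]
# Hypothetical unlabeled rows for the testing phase.
TESTING = [
    {"id": 5, "links": 0, "exclamations": 1},
    {"id": 6, "links": 6, "exclamations": 3},
]

SELECTED = ["links", "exclamations"]  # attribute selection: "id" is dropped


def filter_attributes(row):
    """Keep only the selected attributes, in a fixed order."""
    return [row[name] for name in SELECTED]


def train(rows_with_labels):
    """Training phase: the model is one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for row, label in rows_with_labels:
        vector = filter_attributes(row)
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(model, row):
    """Testing phase: same filtering, then pick the nearest class centroid."""
    vector = filter_attributes(row)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(TRAINING)
classified = [classify(model, row) for row in TESTING]
print(model)
print(classified)
```

The point of the sketch is the shape of the pipeline: the same attribute filter is applied in both phases, and the model built in training is the only thing carried over to classification.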
  24. 6. Evaluation
      STATE OF THE ART
      6.1 Random Subsampling
         1. Use the training set.
         2. Randomly split it into a training set (66.66%) and a testing set (33.33%).
         3. Build a model on the training set.
         4. Evaluate on the test set.
      6.2 Cross Validation (K-FOLD)
         1. Use the training set.
         2. Split it into training/test sets.
         3. Build a model on the training set.
         4. Evaluate on the test set.
         5. Repeat and average the estimates.
      Never overlap!
      Figure: K-FOLD diagram for K = 1, K = 2, ..., K = 10, showing the Test Data and Training Data blocks of each fold.
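Both evaluation schemes reduce to how the data is split. The sketch below shows one way to do it in plain Python; the function names, the fixed seed, and the `evaluate` callback are assumptions for illustration, not something taken from the slides or from Weka.

```python
import random


def random_subsample(items, train_fraction=2 / 3, seed=0):
    """Random subsampling: shuffle once, then cut into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(items, k=10, seed=0):
    """K-fold: yield (train, test) pairs whose test parts never overlap."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, evaluate, k=10, seed=0):
    """Build and evaluate on each fold, then average the estimates."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(items, k, seed)]
    return sum(scores) / len(scores)


data = list(range(30))
train_part, test_part = random_subsample(data)
print(len(train_part), len(test_part))
for train, test in k_fold_splits(data, k=10):
    print(sorted(test))
```

Here `evaluate(train, test)` is a hypothetical callback that would train a model on `train` and return a score measured on `test`. Each item lands in exactly one test fold, which is the "never overlap" condition.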
  25. Link: https://es.wikipedia.org/wiki/Validaci%C3%B3n_cruzada
      6.3 Confusion Matrix
      - Accuracy. The rate of correct predictions.
      - Error rate. The rate of incorrect predictions.
      - Performance (efficiency). Whether the algorithm is fast or not in the training phase and in the testing phase.
      Actual/Real Class   Predicted: Yes                   Predicted: No                    Total
      Yes (1)             True Positive (TP)               False Negative (FN)              Total Positive Real (TPR)
      No (0)              False Positive (FP)              True Negative (TN)               Total Negative Real (TNR)
      Total               Total Positive Predicted (TPP)   Total Negative Predicted (TNP)   Total
      Link: http://en.wikipedia.org/wiki/Confusion_matrix
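Accuracy and error rate follow directly from the four cells of the matrix: accuracy is (TP + TN) / Total and error rate is (FP + FN) / Total. A small sketch on made-up labels (the label lists are hypothetical; 1 means Yes and 0 means No, as in the table):

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Rate of correct predictions."""
    return (tp + tn) / (tp + fn + fp + tn)


def error_rate(tp, fn, fp, tn):
    """Rate of incorrect predictions."""
    return (fp + fn) / (tp + fn + fp + tn)


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts), error_rate(*counts))
```

The two rates always sum to 1, so reporting either one is enough for a quick comparison between classifiers.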
  26. A.1 What is a DATASET?
      Example: Dataset email50
      A row represents a case, a unit of observation, an observational unit, an instance. OBSERVATIONS. EXAMPLE OR EXEMPLAR.
      A column represents an attribute, a variable, a feature (it represents a characteristic).
      Special column: the class, the class label (two-valued or multi-valued).
      For example: email 4, which is not spam, contains 2454 characters and 61 line breaks, is written in text format (0=text, 1=html), and contains only small numbers.
      Variable      Description
      spam          Specifies whether the message was spam
      num_char      The number of characters in the email
      line_breaks   The number of line breaks in the email (not including text wrapping)
      format        Indicates if the email contained special formatting, such as bolding, tables or links, which would indicate the message is in HTML format
      number        Indicates whether the email contained no number, a small number (under 1 million) or a large number
      A dataset is a data matrix (data frame). Each row of a data matrix corresponds to a unique case (example), and each column corresponds to a variable.
      A.2 Types of variables
  27. num_char and line_breaks: QUANTITATIVE, NUMERICAL, CONTINUOUS VARIABLES.
      spam: QUANTITATIVE, NUMERICAL, DISCRETE VARIABLE.
      number: indicates whether the email contained no number, a small number (under 1 million) or a large number. It takes the values none, small and big. The levels have a natural ordering. QUALITATIVE, CATEGORICAL, ORDINAL VARIABLE.
      Figure: Variables -> Numerical (Continuous, Discrete) and Categorical (Regular Categorical, Ordinal)
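One practical consequence of the ordinal type is that its levels can be mapped to integers that preserve their order. A minimal sketch using the email 4 values described in the appendix; the dictionary layout and the helper name are illustrative assumptions.

```python
# Levels of the ordinal variable "number", in their natural order.
NUMBER_LEVELS = ["none", "small", "big"]

# Variable types as classified on the slide.
VARIABLE_TYPES = {
    "num_char": "numerical, continuous",
    "line_breaks": "numerical, continuous",
    "spam": "numerical, discrete",
    "number": "categorical, ordinal",
}

# Email 4 as described in the appendix: one row (case) of the dataset.
email_4 = {"spam": 0, "num_char": 2454, "line_breaks": 61, "format": 0, "number": "small"}


def ordinal_code(level):
    """Encode an ordinal level as its position in the natural ordering."""
    return NUMBER_LEVELS.index(level)


print(ordinal_code(email_4["number"]))
print(sorted(["big", "none", "small"], key=ordinal_code))
```

A regular (unordered) categorical variable should not be encoded this way, because the integers would suggest an ordering that does not exist.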
