Predictive modeling DBs


These are suggested databases for the predictive modeling certification at DataVita.

  1. Predictive Modeling: Research Tasks. Nilitis, LLC. © 2012
  2. Dataset 1: Netflix Database. Netflix, Inc. is an American provider of on-demand Internet streaming media and flat-rate DVD-by-mail. Training data set: 100,480,507 ratings; 480,189 users; 17,770 movies. Data set entry: <user (ID), movie (ID), date of grade (yyyy-mm-dd), grade (1-5)>. Solutions listed: the BellKor Solution, the Big Chaos Solution, the Pragmatic Theory Solution.
  3. Dataset 1: Netflix Database. User-based collaborative filtering: look for users who share the same rating patterns, then use the ratings from those users to calculate a prediction. Item-based collaborative filtering: build an item-item matrix determining relationships between pairs of items, then, using the matrix and the data on the current user, infer the user's taste… A note from the donor regarding Netflix data: "Thank you for your interest in the Netflix Prize dataset. The dataset is no longer available." Related reading: "Robust De-anonymization of Large Sparse Datasets".
  4. Dataset 2: EEG Database Data Set. Data from a large study examining EEG correlates of genetic predisposition to alcoholism. 64 electrodes were placed on the subjects' scalps and sampled at 256 Hz for 1 second. There were two groups of subjects: alcoholic and control. Each subject was exposed to either a single stimulus (S1) or to two stimuli (S1 and S2). 122 subjects; each subject completed 120 trials in which different stimuli were shown. EEG / ERP data are available for free public download.
  5. Dataset 2: EEG Database Data Set. [Figure: example plots of a control subject and an alcoholic subject, from the webpage of Lester Ingber.] Open question: use Ingber's Canonical Momentum Indicator, something else, or the raw data?
  6. Dataset 3: Berlin Database of Emotional Speech. Basic emotions: anger, joy, sadness, fear, disgust and boredom, plus neutral speech. Ten professional native German actors (5 female and 5 male) simulated these emotions, producing 10 utterances (5 short and 5 longer sentences). Emotion was recognized by at least 80% of the listeners.
  7. Dataset 3: Berlin Database of Emotional Speech. Voice emotion recognition pipeline: Audio Stream -> Feature Extraction -> Classifier -> Emotion. Feature extraction: openEAR, with settings taken from the openEAR "emobase" config files and articles, possibly adding some feature selection steps (state of the art: sequential feature selection). Classifier: the state of the art is an SVM with a polynomial or RBF kernel (libSVM is included in the openEAR package).
  8. Dataset 4: Wikipedia page-to-page link database. Pages: 5,716,808. Total links: 130,160,392. Google PageRank technology: 85% likelihood of choosing a random link from the page; 15% likelihood of jumping to a page chosen at random from the entire web.
  9. Dataset 5: Detecting Malicious URLs. About 2.4 million URLs; 3.2 million features. Estimating the covariance matrix for high-dimensional data. Linear implementation of SVM (LIBLINEAR).
  10. Dataset 6: Pseudo Periodic Synthetic Time Series Data Set, plus Branch and Bound evaluation. Related reading: "An Indexing Scheme for Fast Similarity Search in Large Time Series Databases".
  11. Other Datasets: Individual household electric power consumption Data Set; Marketing Data Set; Flare Data Set; Fires Data Set; Data Set and Crime Data Set; Income Data Set.
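
The item-based collaborative filtering steps on slide 3 (build an item-item matrix, then combine it with the current user's ratings) could be sketched roughly as follows. This is only a minimal illustration, not any of the Netflix Prize solutions: the slide does not name a similarity measure, so cosine similarity is an assumption here, and the user IDs, movie IDs, grades and function names are all made up.

```python
from math import sqrt

# Toy ratings shaped like the data set entry <user, movie, grade> (dates omitted).
# Every ID and grade below is invented for illustration.
ratings = [
    ("u1", "m1", 5), ("u1", "m2", 4), ("u1", "m3", 1),
    ("u2", "m1", 4), ("u2", "m2", 5), ("u2", "m3", 2),
    ("u3", "m1", 1), ("u3", "m2", 2), ("u3", "m3", 5),
    ("u4", "m1", 5), ("u4", "m3", 1),
]


def by_item(ratings):
    """Group grades as {movie: {user: grade}}."""
    items = {}
    for user, movie, grade in ratings:
        items.setdefault(movie, {})[user] = grade
    return items


def cosine(a, b):
    """Cosine similarity over the users who graded both movies (assumed measure)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def item_item_matrix(items):
    """Similarity between every pair of distinct movies."""
    return {i: {j: cosine(items[i], items[j]) for j in items if j != i}
            for i in items}


def predict(user, movie, items, sim):
    """Similarity-weighted average of the user's grades for the other movies."""
    num = den = 0.0
    for other, s in sim[movie].items():
        if user in items[other]:
            num += s * items[other][user]
            den += abs(s)
    return num / den if den else None


items = by_item(ratings)
sim = item_item_matrix(items)
prediction = predict("u4", "m2", items, sim)  # u4 has not graded m2
```

On this toy data, `m2` is graded much like `m1`, which `u4` liked, so the predicted grade leans toward the high end of the 1-5 scale. A realistic version would at least subtract per-user or per-movie means before computing similarities.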
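
For the classifier on slide 7, the two kernel choices can be written out directly. The sketch below follows the polynomial and RBF kernel formulas as LIBSVM defines them, with its `degree`, `gamma` and `coef0` parameters; the default values and the feature vectors are placeholders, not openEAR "emobase" features or tuned settings.

```python
from math import exp


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * sq_dist)


# Placeholder feature vectors (made up for illustration).
x = [1.0, 2.0]
y = [3.0, 4.0]
poly_xy = polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0)
rbf_xx = rbf_kernel(x, x)
rbf_xy = rbf_kernel(x, y, gamma=0.5)
```

The RBF kernel depends only on the distance between the two vectors, so it equals 1 for identical inputs and decays toward 0 as they move apart; `gamma` controls how fast.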
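
The 85% / 15% random-surfer rule on slide 8 can be illustrated with a small power-iteration sketch. The four-page link graph, the function name and the iteration count are assumptions for illustration; spreading the rank of pages without outgoing links evenly over all pages is one common convention that the slide does not specify.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability 1 - damping (15%), jump to a uniformly random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # With probability damping (85%), follow a random outgoing link.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages
                # (an assumed convention, see the lead-in).
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Made-up toy graph; the real link database has millions of pages.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
```

In the toy graph, page `D` has no incoming links, so its rank comes only from the random-jump term, while `C`, which every other page links to, ends up ranked highest. At the scale of the Wikipedia link database, a sparse-matrix implementation would be needed instead of nested dictionaries.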