Predictive modeling DBs

These are suggested databases for the predictive modeling certification at DataVita.

  1. Predictive Modeling: Research Tasks
     Nilitis, LLC. © 2012
  2. 1. Netflix Database
     http://cms.uhd.edu/faculty/chenp/class/4319/project/netflixfiles.html
     Netflix, Inc.: American provider of on-demand Internet streaming media and flat-rate DVD-by-mail.
     Training data set:
       - 100,480,507 ratings
       - 480,189 users
       - 17,770 movies
     Data set entry: <user (ID), movie (ID), date of grade (yyyy-mm-dd), grade (1-5)>
     The BellKor Solution: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
     The Big Chaos Solution: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf
     The Pragmatic Theory Solution: http://www.netflixprize.com/assets/GrandPrize2009_BPC_PragmaticTheory.pdf
  3. 1. Netflix Database
     User-based collaborative filtering
       - Look for users who share the same rating patterns
       - Use the ratings from those users to calculate a prediction
     Item-based collaborative filtering
       - Build an item-item matrix determining relationships between pairs of items
       - Using the matrix, and the data on the current user, infer his taste…
     A note from the donor regarding Netflix data: "Thank you for your interest in the Netflix Prize dataset. The dataset is no longer available."
     Robust De-anonymization of Large Sparse Datasets: http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  4. 2. EEG Database Data Set
     http://archive.ics.uci.edu/ml/datasets/EEG+Database
     This data comes from a large study examining EEG correlates of genetic predisposition to alcoholism.
     64 electrodes placed on the subjects' scalps were sampled at 256 Hz for 1 second.
     There were two groups of subjects: alcoholic and control.
     Each subject was exposed to either a single stimulus (S1) or to two stimuli (S1 and S2).
     122 subjects; each subject completed 120 trials in which different stimuli were shown.
     EEG / ERP data available for free public download: http://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html
  5. 2. EEG Database Data Set
     [Figure: example plots of a control and an alcoholic subject]
     http://www.ingber.com/ - webpage of Lester Ingber
     Use Ingber's Canonical Momentum Indicator or something else? Or raw data?
  6. 3. Berlin Database of Emotional Speech
     http://database.syntheticspeech.de/
     6 basic emotions: anger, joy, sadness, fear, disgust and boredom, plus neutral speech.
     Ten professional native German actors (5 female and 5 male) simulated these emotions, producing 10 utterances (5 short and 5 longer sentences).
     Each emotion was recognized by at least 80% of the listeners.
  7. 3. Berlin Database of Emotional Speech
     Voice emotion recognition pipeline: Audio Stream -> Feature Extraction -> Classifier -> Emotion
     Feature extraction: "openEAR" http://sourceforge.net/projects/openart/?source=dlp
       - Take settings from the openEAR "emobase" config files and articles
       - Possibly add some feature selection steps (state of the art: sequential feature selection)
     Classifier: the state of the art is an SVM with a polynomial or RBF kernel (libSVM is included in the openEAR package)
  8. 4. Wikipedia page-to-page link database
     http://haselgrove.id.au/wikipedia.htm
     Total pages: 5,716,808
     Total links: 130,160,392
     Google PageRank technology: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427
       - 85% likelihood of choosing a random link from the page
       - 15% likelihood of jumping to a page chosen at random from the entire web
  9. 5. Detecting Malicious URLs
     http://sysnet.ucsd.edu/projects/url/
       - About 2.4 million URLs
       - 3.2 million features
     Estimating the covariance matrix for high-dimensional data
     Linear implementation of SVM (LIBLINEAR)
  10. 6. Pseudo Periodic Synthetic Time Series Data Set
      http://archive.ics.uci.edu/ml/datasets/Pseudo+Periodic+Synthetic+Time+Series
      + Branch and Bound evaluation
      An Indexing Scheme for Fast Similarity Search in Large Time Series Databases: http://www.cs.rutgers.edu/~pazzani/Publications/ssdb99.pdf
  11. Other Datasets
      - Individual household electric power consumption Data Set: http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
      - Bank Marketing Data Set: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
      - Solar Flare Data Set: http://archive.ics.uci.edu/ml/datasets/Solar+Flare
      - Forest Fires Data Set: http://archive.ics.uci.edu/ml/datasets/Forest+Fires
      - Arrhythmia Data Set: http://archive.ics.uci.edu/ml/datasets/Arrhythmia
      - Communities and Crime Data Set: http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
      - Census Income Data Set: http://archive.ics.uci.edu/ml/datasets/Census+Income
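
The item-based collaborative filtering steps listed on slide 3 (build an item-item matrix, then combine it with the current user's ratings) can be sketched roughly as follows. This is only an illustrative sketch, not the method of any Netflix Prize team: the toy rating matrix, the choice of cosine similarity, the convention that 0 marks a missing rating, and the helper names `item_similarity` and `predict_rating` are all assumptions made for the example.

```python
import numpy as np

def item_similarity(ratings):
    """Item-item cosine similarity; rows are users, columns are items.

    Assumption: a rating of 0 means "not rated".
    """
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unrated items
    sim = (ratings.T @ ratings) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)         # an item should not vote for itself
    return sim

def predict_rating(ratings, sim, user, item):
    """Similarity-weighted average of the user's existing ratings."""
    rated = ratings[user] > 0
    weights = sim[item, rated]
    if weights.sum() == 0:
        # No similar rated items: fall back to the global mean rating.
        return float(ratings[ratings > 0].mean())
    return float(weights @ ratings[user, rated] / weights.sum())

# Hypothetical toy data: 4 users x 4 movies, grades 1-5, 0 = missing.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

sim = item_similarity(ratings)
pred = predict_rating(ratings, sim, user=0, item=2)
print(f"Predicted grade for user 0, movie 2: {pred:.2f}")
```

A dense matrix like this would not scale to the roughly 100 million ratings mentioned on slide 2; a sparse representation would be needed there.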

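The random-surfer model on slide 8 (85% chance of following a random link on the page, 15% chance of jumping to a random page) can be sketched with a simple power iteration. This is a minimal sketch under stated assumptions, not Google's implementation: the four-page link list is made up, the function name `pagerank` and its parameters are hypothetical, and pages without outgoing links are assumed to spread their rank uniformly.

```python
import numpy as np

def pagerank(links, n_pages, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for the random-surfer model.

    links: list of (source, target) page-index pairs, assumed free of duplicates.
    damping: probability of following a link (0.85 on the slide).
    """
    out_degree = np.zeros(n_pages)
    for src, _ in links:
        out_degree[src] += 1

    rank = np.full(n_pages, 1.0 / n_pages)
    for _ in range(max_iter):
        # 15% of the time the surfer jumps to a uniformly random page.
        new_rank = np.full(n_pages, (1.0 - damping) / n_pages)
        # 85% of the time the surfer follows one of the page's links.
        for src, dst in links:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        # Assumption: pages with no outgoing links share their rank evenly.
        dangling = rank[out_degree == 0].sum()
        new_rank += damping * dangling / n_pages
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical toy graph with 4 pages.
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(links, n_pages=4)
print(ranks)
```

For a graph the size of the Wikipedia link database (about 130 million links), the inner Python loop would be too slow; a sparse matrix-vector product would be the usual replacement.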