Ga final report

Presentation Transcript

  • GA_data_science/final_project: dan knox, co-founder, science exchange
  • how has the language of machine learning changed since 2003?
  • motivation
  • data
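  • # fragment from the download loop: `a` is one link tag from the course page;
    # `year`, `counter` and `directory` are set by the surrounding code (not shown),
    # and `requests` must already be imported
    if "pdf" in a["href"]:
        href = a.get("href")
        name = "cs229-%s-%s.pdf" % (year, counter)
        fullname = "%s/%s" % (directory, name)
        pdf_url = "http://cs229.stanford.edu/" + str(href)
        r = requests.get(pdf_url)
        # save the response body as a binary file
        with open(fullname, "wb") as pdf:
            pdf.write(r.content)
        print("...downloading %s as %s" % (href, name))
        counter += 1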
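The fragment above does not show where `a`, `year`, `counter` and `directory` come from. The following is only a sketch of one way the surrounding loop could look, using the standard-library HTML parser instead of whatever parser the slides used. `PdfLinkCollector`, `plan_downloads`, the starting counter value of 1 and the sample `href` are all assumptions made for illustration; the function only plans the downloads and does not touch the network.

```python
from html.parser import HTMLParser

BASE_URL = "http://cs229.stanford.edu/"  # taken from the slide


class PdfLinkCollector(HTMLParser):
    """Collects the href of every <a> tag whose href contains "pdf"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Links without an href are skipped instead of raising KeyError.
        href = dict(attrs).get("href") or ""
        if "pdf" in href:
            self.hrefs.append(href)


def plan_downloads(html, year, directory):
    """Return (pdf_url, local_path) pairs using the slide's naming scheme."""
    collector = PdfLinkCollector()
    collector.feed(html)
    plan = []
    # Assumption: the counter starts at 1; the slide does not show its initial value.
    for counter, href in enumerate(collector.hrefs, start=1):
        name = "cs229-%s-%s.pdf" % (year, counter)
        plan.append((BASE_URL + href, "%s/%s" % (directory, name)))
    return plan
```

Each pair in the returned list can then be fetched with `requests.get` and written to disk exactly as in the fragment above.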
  • 3,030 papers: 913 papers from the Stanford CS229 class (2005-12) and 2,117 papers from the N.I.P.S. Annual Conference (2005-12)
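  • # fragment: `input_file` and `output_file` are set by the surrounding loop (not shown),
    # and `os` must already be imported
    if input_file.endswith(".pdf"):
        print("extracting %s as %s" % (input_file, output_file))
        try:
            # shell out to pdfminer's pdf2txt.py and write utf-8 text
            os_call = "pdf2txt.py -c utf-8 -o %s %s" % (output_file, input_file)
            os.system(os_call)
        except Exception:
            # skip files that cannot be converted
            pass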
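One caveat with the extraction step above: `os.system` reports failure through its return value rather than by raising, so the `try`/`except` will not notice a failed conversion, and file names containing spaces break the command string. A hedged alternative using `subprocess` with an argument list might look like this. `build_pdf2txt_command` and `extract_text` are hypothetical helper names, not part of the original project.

```python
import subprocess


def build_pdf2txt_command(input_file, output_file):
    """Build the same pdf2txt.py call as the slide, as an argument list."""
    return ["pdf2txt.py", "-c", "utf-8", "-o", output_file, input_file]


def extract_text(input_file, output_file):
    """Run pdf2txt.py on one PDF; return True if it exited with status 0."""
    if not input_file.endswith(".pdf"):
        return False
    try:
        result = subprocess.run(build_pdf2txt_command(input_file, output_file))
    except OSError:
        # pdf2txt.py is not installed or not on PATH
        return False
    return result.returncode == 0
```

Passing a list avoids shell quoting entirely, and the boolean result makes it possible to count how many of the papers failed to convert.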
  • from nltk.stem import PorterStemmer

    def stem_word(word):
        """Takes a single word input and returns the stem."""
        stemmer = PorterStemmer()
        stem = stemmer.stem
        stemmed_word = stem(word)
        return stemmed_word
  • ['use', 'atom', 'action', 'control', 'snake', 'robot', 'locomotion', 'joseph', 'kooemail', 'jckoo', 'stanfordedu', 'introduction', 'locomot', 'problem', 'gener', 'difficult', 'because', 'larg', 'amount', 'coordin', 'requir', 'make', 'sure', 'desir', 'movement', 'occur', 'undesir', 'action', 'avoid', 'snake', 'locomot', 'exception', 'rule', 'project', 'show', 'simplifi', 'control', 'snake', 'robot', 'comput', 'tractable', 'still', 'produc', 'interest', 'locomot', 'strategi', 'particular', 'applic', 'snake', 'robot', 'navigate', 'complex', 'terrain', 'order', 'reach', 'goal', 'traversing', 'greatest', 'distanc', 'possibl', 'scenario', 'could', 'aris', 'exampl', 'applic', 'robot', 'need', 'look', 'through', 'rubbl', 'build', 'collaps', 'order', 'search', 'survivor', 'situat', 'wellknown', 'strategi', 'work', 'strategi', 'highlyspeci', 'particular', 'terrain', 'case', 'exampl', 'accept', 'movement', 'produc', 'terrain', 'enough', 'contact', 'snake', 'surfac', 'occur', 'regular', 'interv', 'along', 'snake', 'length', 'unfortun', 'hard', 'come', 'rubbl', 'snake', 'robot', 'use', 'project', 'shown', 'figur', 'consist', 'bodi', 'segment', 'connect', 'chain', 'consecut', 'bodi', 'segment', 'connect', 'pair', 'joint', 'give', 'degre', 'freedom', 'link', 'total', 'degre', 'freedom', 'entir', 'precis', 'control', 'snake', 'build']
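The slides do not show the cleaning step that turns extracted text into a token list like the one above. Tokens such as 'stanfordedu' and 'wellknown' suggest that punctuation was deleted rather than replaced by spaces, and short words and stopwords appear to be missing. The sketch below reproduces that behaviour under those assumptions. `clean_tokens`, `STOPWORDS` and `min_length` are hypothetical, and the `stem` argument is where a function like `stem_word` would be plugged in.

```python
import re

# Hypothetical, deliberately tiny stopword list; the list actually used is not shown.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "for", "we", "that"}


def clean_tokens(raw_text, stem=lambda word: word, min_length=3):
    """Lowercase, strip non-letters, drop short words and stopwords, then stem."""
    tokens = []
    for chunk in raw_text.lower().split():
        # Deleting punctuation in place turns "stanford.edu" into "stanfordedu".
        word = re.sub(r"[^a-z]", "", chunk)
        if len(word) < min_length or word in STOPWORDS:
            continue
        tokens.append(stem(word))
    return tokens
```

For example, `clean_tokens("The well-known snake robot, at stanford.edu")` yields `['wellknown', 'snake', 'robot', 'stanfordedu']` when no stemmer is supplied.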
  • full_corpus.txt: 5,650,772 tokens
  • analysis
  • from collections import Counter

    def calc_unigrams(text, number=100):
        """Returns frequency of unigrams from a text input."""
        words = [w for w in text]  # text is an iterable of tokens
        count = Counter(words).most_common(number)
        return count
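  • from nltk.collocations import BigramCollocationFinder

    def calc_bigrams(text, min_freq=100):
        """Returns frequency of bigrams from a text input."""
        words = [w.lower() for w in text]
        bcf = BigramCollocationFinder.from_words(words)
        # keep only bigrams seen at least min_freq times
        bcf.apply_freq_filter(min_freq)
        # the slide appended to a `bigram_list` that is never defined in the snippet;
        # returning the (bigram, count) pairs directly is an assumption about the intent
        return list(bcf.ngram_fd.items())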
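  • from nltk.collocations import TrigramCollocationFinder

    def calc_trigrams(text, min_freq=50):
        """Returns frequency of trigrams from a text input."""
        words = [w.lower() for w in text]
        tcf = TrigramCollocationFinder.from_words(words)
        # keep only trigrams seen at least min_freq times
        tcf.apply_freq_filter(min_freq)
        # the slide appended to a `trigram_list` that is never defined in the snippet;
        # returning the (trigram, count) pairs directly is an assumption about the intent
        return list(tcf.ngram_fd.items())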
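The three counting functions above share one idea: slide a window of n tokens across the corpus and tally each window. As an illustration only, and not the code used in the project, the same counts can be produced with the standard library. `count_ngrams` is a hypothetical name.

```python
from collections import Counter


def count_ngrams(tokens, n=2, min_freq=1):
    """Return (ngram, count) pairs seen at least min_freq times, most frequent first."""
    words = [w.lower() for w in tokens]
    # zip over n shifted copies of the token list yields every window of n tokens
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(ngrams)
    return [(gram, c) for gram, c in counts.most_common() if c >= min_freq]
```

With `n=1` this behaves like the unigram count, and `min_freq` plays the role of `apply_freq_filter` in the NLTK versions.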
  • results
  • papers per year [chart: cs229 and nips, 2005 to 2012, y-axis 0 to 700]
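The per-year figures behind this chart are not given in the transcript. If the downloaded files follow the `cs229-<year>-<counter>.pdf` naming scheme from the download step, the counts could be recovered from the file names alone. The sketch below assumes four-digit years and assumes the NIPS files use an analogous `nips-` prefix, which the slides do not confirm; `papers_per_year` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumption: names look like "<source>-<4-digit year>-<counter>.pdf".
FILENAME_PATTERN = re.compile(r"^(?P<source>[a-z0-9]+)-(?P<year>\d{4})-\d+\.pdf$")


def papers_per_year(filenames):
    """Count files per (source, year); names that do not match are ignored."""
    counts = Counter()
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match:
            counts[(match.group("source"), int(match.group("year")))] += 1
    return counts
```

Feeding it the result of `os.listdir` on the download directory would give one count per source and year, ready to plot.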
  • HOT :) PCA: 3.4x; gibbs sampling: 2.9x; logistic regression: 2.7x; naïve bayes: 2.7x; cross validation: 2.3x; random forest: 2.1x
  • NOT :( nonlinear dimensionality reduction: 0.2x
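The transcript does not say how the multipliers above were calculated. One plausible reading, offered here purely as an assumption, is the relative frequency of a phrase in a later slice of the corpus divided by its relative frequency in an earlier slice. `relative_frequency` and `change_ratio` are hypothetical names.

```python
def relative_frequency(tokens, phrase):
    """Share of n-token windows in `tokens` that match `phrase` exactly."""
    words = phrase.lower().split()
    n = len(words)
    lowered = [t.lower() for t in tokens]
    total = len(lowered) - n + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(total) if lowered[i:i + n] == words)
    return hits / total


def change_ratio(early_tokens, late_tokens, phrase):
    """Late relative frequency divided by early; None if the phrase is absent early."""
    early = relative_frequency(early_tokens, phrase)
    late = relative_frequency(late_tokens, phrase)
    if early == 0:
        return None
    return late / early
```

Under this reading, a value of 2.0 means the phrase is twice as common per window in the later slice, and a value below 1 means it has fallen out of use. Dividing by the number of windows rather than comparing raw counts matters here because the number of papers differs from year to year.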
  • extensions
  • more data
  • more analysis
  • better visualizations
  • classification model, i.e. see if I can predict which year a paper is from based on the language used