
Ga final report


Published in: Education, Technology

  1. 1. GA_data_science/final_project: dan knox, co-founder, science exchange
  2. 2. how has the language of machine learning changed since 2003?
  3. 3. motivation
  4. 4. data
  5. 5. # Runs inside a loop over the page's link tags `a` (needs `import requests`);
        # `year`, `counter` and `directory` come from the surrounding code.
        href = a.get("href")
        if href and "pdf" in href:
            name = "cs229-%s-%s.pdf" % (year, counter)
            fullname = "%s/%s" % (directory, name)

            pdf_url = "http://cs229.stanford.edu/" + str(href)
            r = requests.get(pdf_url)
            with open(fullname, "wb") as pdf:
                pdf.write(r.content)

            print("...downloading %s as %s" % (href, name))
            counter += 1
  6. 6. 3,030 papers: 913 papers from Stanford CS229 class (2005-12) and 2,117 papers from N.I.P.S. Annual Conference (2005-12)
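  7. 7. import subprocess

        if input_file.endswith(".pdf"):
            print("extracting %s as %s" % (input_file, output_file))
            # An argument list avoids shell-quoting problems with odd file names.
            status = subprocess.call(
                ["pdf2txt.py", "-c", "utf-8", "-o", output_file, input_file])
            if status != 0:
                # Skip papers that fail to convert and carry on with the rest.
                print("...could not extract %s" % input_file)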
  8. 8. from nltk.stem.porter import PorterStemmer  # assuming NLTK's Porter stemmer

        def stem_word(word):
            """Takes a single word input and returns the stem."""
            stemmer = PorterStemmer()
            return stemmer.stem(word)
  9. 9. ['use', 'atom', 'action', 'control', 'snake', 'robot', 'locomotion', 'joseph', 'kooemail', 'jckoo', 'stanfordedu', 'introduction', 'locomot', 'problem', 'gener', 'difficult', 'because', 'larg', 'amount', 'coordin', 'requir', 'make', 'sure', 'desir', 'movement', 'occur', 'undesir', 'action', 'avoid', 'snake', 'locomot', 'exception', 'rule', 'project', 'show', 'simplifi', 'control', 'snake', 'robot', 'comput', 'tractable', 'still', 'produc', 'interest', 'locomot', 'strategi', 'particular', 'applic', 'snake', 'robot', 'navigate', 'complex', 'terrain', 'order', 'reach', 'goal', 'traversing', 'greatest', 'distanc', 'possibl', 'scenario', 'could', 'aris', 'exampl', 'applic', 'robot', 'need', 'look', 'through', 'rubbl', 'build', 'collaps', 'order', 'search', 'survivor', 'situat', 'wellknown', 'strategi', 'work', 'strategi', 'highlyspeci', 'particular', 'terrain', 'case', 'exampl', 'accept', 'movement', 'produc', 'terrain', 'enough', 'contact', 'snake', 'surfac', 'occur', 'regular', 'interv', 'along', 'snake', 'length', 'unfortun', 'hard', 'come', 'rubbl', 'snake', 'robot', 'use', 'project', 'shown', 'figur', 'consist', 'bodi', 'segment', 'connect', 'chain', 'consecut', 'bodi', 'segment', 'connect', 'pair', 'joint', 'give', 'degre', 'freedom', 'link', 'total', 'degre', 'freedom', 'entir', 'precis', 'control', 'snake', 'build']
  10. 10. full_corpus.txt: 5,650,772 tokens
  11. 11. analysis
  12. 12. from collections import Counter

          def calc_unigrams(text, number=100):
              """Returns the `number` most frequent unigrams from a list of tokens."""
              # `text` must already be tokenized; iterating a raw string would count characters.
              return Counter(text).most_common(number)
  13. 13. from nltk.collocations import BigramCollocationFinder

          def calc_bigrams(text, min_freq=100):
              """Returns (bigram, count) pairs seen at least `min_freq` times in a list of tokens."""
              words = [w.lower() for w in text]
              bcf = BigramCollocationFinder.from_words(words)
              bcf.apply_freq_filter(min_freq)  # drop bigrams rarer than min_freq
              return list(bcf.ngram_fd.items())
  14. 14. from nltk.collocations import TrigramCollocationFinder

          def calc_trigrams(text, min_freq=50):
              """Returns (trigram, count) pairs seen at least `min_freq` times in a list of tokens."""
              words = [w.lower() for w in text]
              tcf = TrigramCollocationFinder.from_words(words)
              tcf.apply_freq_filter(min_freq)  # drop trigrams rarer than min_freq
              return list(tcf.ngram_fd.items())
  15. 15. results
  16. 16. papers per year (chart: cs229 vs. nips, 2005 to 2012; y-axis from 0 to 700)
  17. 17. HOT :) PCA: 3.4x; gibbs sampling: 2.9x; logistic regression: 2.7x; naïve bayes: 2.7x; cross validation: 2.3x; random forest: 2.1x. NOT :( nonlinear dimensionality reduction: 0.2x
  18. 18. extensions
  19. 19. more data; more analysis; better visualizations; classification model, i.e. see if I can predict which year a paper is from based on the language used
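
The slides give the HOT/NOT multipliers (for example "PCA: 3.4x") without showing how they were computed. One plausible reading, and it is only an assumption, is that each multiplier compares how often a term appears in later papers with how often it appears in earlier ones, relative to corpus size. The sketch below illustrates that reading with the standard library only. The function names `bigram_counts`, `relative_frequency` and `usage_ratio` are hypothetical, and the token lists are toy data invented for the example, not the project's corpus.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs in a list of (already stemmed) tokens."""
    return Counter(zip(tokens, tokens[1:]))


def relative_frequency(counts, ngram):
    """Share of all counted n-grams that are `ngram` (0.0 for an empty Counter)."""
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0


def usage_ratio(early_tokens, late_tokens, ngram):
    """Late-period relative frequency divided by early-period relative frequency.

    Assumed definition, not taken from the slides. Returns None when the
    bigram never appears in the early period, since the ratio is undefined.
    """
    early = relative_frequency(bigram_counts(early_tokens), ngram)
    late = relative_frequency(bigram_counts(late_tokens), ngram)
    return late / early if early else None


# Toy token lists, invented for illustration only.
early = "logist regress use data logist regress fit data well".split()
late = "random forest beat logist regress random forest win".split()

for term in [("logist", "regress"), ("random", "forest")]:
    print(term, usage_ratio(early, late, term))
```

Under this definition a value above 1 would mean a term became more common over time and a value below 1 would mean it faded. Dividing by the total n-gram count matters here because the number of papers per year is not constant, so raw counts alone would overstate growth.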
