GA final report

Published in: Education, Technology

  1. GA_data_science/final_project. Dan Knox, co-founder, Science Exchange
  2. how has the language of machine learning changed since 2003?
  3. motivation
  4. data
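  5.  # Runs once per link `a` on a course page (requires `import requests`).
      # `year`, `counter`, and `directory` are set by the surrounding loop,
      # which the slide omits.
      href = a.get("href")
      if href and "pdf" in href:
          name = "cs229-%s-%s.pdf" % (year, counter)
          fullname = "%s/%s" % (directory, name)
          pdf_url = "http://cs229.stanford.edu/" + str(href)
          r = requests.get(pdf_url)
          # Write the raw response bytes to disk.
          with open(fullname, "wb") as pdf:
              pdf.write(r.content)
          print("...downloading %s as %s" % (href, name))
          counter += 1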
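
The fragment on slide 5 runs once per link on a course page, but the slides do not show how those links were gathered. A minimal sketch of one way to collect them, using only the standard library's `html.parser`: the `PdfLinkParser` class, the `find_pdf_links` helper, and the sample HTML are hypothetical, and the project may well have used a different parser.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collects href values of <a> tags that mention 'pdf' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same test as the slide: keep any link whose href contains "pdf".
        if href and "pdf" in href:
            self.pdf_links.append(href)


def find_pdf_links(html):
    """Returns the PDF hrefs found in an HTML string, in document order."""
    parser = PdfLinkParser()
    parser.feed(html)
    return parser.pdf_links


# Made-up page content, for illustration only.
sample_html = """
<a href="proj2005/report1.pdf">Report 1</a>
<a href="index.html">Home</a>
<a>no href</a>
<a href="proj2005/report2.pdf">Report 2</a>
"""
links = find_pdf_links(sample_html)
```

Each entry in `links` would then play the role of `href` in the download fragment.
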
  6.  3,030 papers: 913 papers from Stanford CS229 class (2005-12) and 2,117 papers from N.I.P.S. Annual Conference (2005-12)
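  7.  # Requires `import os`; pdf2txt.py is the pdfminer command-line script.
      if input_file.endswith(".pdf"):
          print("extracting %s as %s" % (input_file, output_file))
          os_call = "pdf2txt.py -c utf-8 -o %s %s" % (output_file, input_file)
          # os.system signals failure through its return status, not an
          # exception, so check the status instead of wrapping it in try/except.
          if os.system(os_call) != 0:
              print("could not extract %s" % input_file)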
  8.  from nltk.stem import PorterStemmer  # assumed source of PorterStemmer: NLTK

      def stem_word(word):
          """Takes a single word input and returns the stem."""
          stemmer = PorterStemmer()
          return stemmer.stem(word)
  9.  ['use', 'atom', 'action', 'control', 'snake', 'robot', 'locomotion', 'joseph', 'kooemail', 'jckoo', 'stanfordedu', 'introduction', 'locomot', 'problem', 'gener', 'difficult', 'because', 'larg', 'amount', 'coordin', 'requir', 'make', 'sure', 'desir', 'movement', 'occur', 'undesir', 'action', 'avoid', 'snake', 'locomot', 'exception', 'rule', 'project', 'show', 'simplifi', 'control', 'snake', 'robot', 'comput', 'tractable', 'still', 'produc', 'interest', 'locomot', 'strategi', 'particular', 'applic', 'snake', 'robot', 'navigate', 'complex', 'terrain', 'order', 'reach', 'goal', 'traversing', 'greatest', 'distanc', 'possibl', 'scenario', 'could', 'aris', 'exampl', 'applic', 'robot', 'need', 'look', 'through', 'rubbl', 'build', 'collaps', 'order', 'search', 'survivor', 'situat', 'wellknown', 'strategi', 'work', 'strategi', 'highlyspeci', 'particular', 'terrain', 'case', 'exampl', 'accept', 'movement', 'produc', 'terrain', 'enough', 'contact', 'snake', 'surfac', 'occur', 'regular', 'interv', 'along', 'snake', 'length', 'unfortun', 'hard', 'come', 'rubbl', 'snake', 'robot', 'use', 'project', 'shown', 'figur', 'consist', 'bodi', 'segment', 'connect', 'chain', 'consecut', 'bodi', 'segment', 'connect', 'pair', 'joint', 'give', 'degre', 'freedom', 'link', 'total', 'degre', 'freedom', 'entir', 'precis', 'control', 'snake', 'build']
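
The token list on slide 9 looks like the result of lowercasing, stripping punctuation (hence 'wellknown' and 'stanfordedu'), dropping short or common words, and stemming. The slides do not show that step, so the following is only a guess at its shape: `clean_tokens`, the `min_length` cutoff, and the tiny stopword set are all assumptions. The stemmer is passed in as a parameter so that `stem_word` from slide 8 could be plugged in.

```python
import re

# Tiny illustrative stopword set; the project's real list is not shown in the slides.
STOPWORDS = {"the", "of", "and", "is", "to", "in", "for", "that", "this", "with"}


def clean_tokens(text, stem=lambda w: w, min_length=3):
    """Lowercases, strips non-letters, drops short/stop words, then stems."""
    tokens = []
    for raw in text.lower().split():
        # Remove everything except a-z, so "well-known" becomes "wellknown".
        word = re.sub(r"[^a-z]", "", raw)
        if len(word) < min_length or word in STOPWORDS:
            continue
        tokens.append(stem(word))
    return tokens


# With the default identity stemmer, words are cleaned but not stemmed.
tokens = clean_tokens("The well-known strategy is to search the rubble.")
```

Calling `clean_tokens(text, stem=stem_word)` would apply the Porter stemmer to each surviving word.
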
  10. full_corpus.txt: 5,650,772 tokens
  11. analysis
  12. from collections import Counter

      def calc_unigrams(text, number=100):
          """Returns the `number` most frequent unigrams in a token list, with counts."""
          return Counter(text).most_common(number)
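  13. from nltk.collocations import BigramCollocationFinder

      def calc_bigrams(text, min_freq=100):
          """Returns frequency of bigrams from a text input."""
          words = [w.lower() for w in text]
          bcf = BigramCollocationFinder.from_words(words)
          # Drop bigrams that occur fewer than min_freq times.
          bcf.apply_freq_filter(min_freq)
          # The slide appended to a `bigram_list` that is never defined in the
          # function; returning the filtered (bigram, count) pairs avoids that.
          return list(bcf.ngram_fd.items())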
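  14. from nltk.collocations import TrigramCollocationFinder

      def calc_trigrams(text, min_freq=50):
          """Returns frequency of trigrams from a text input."""
          words = [w.lower() for w in text]
          tcf = TrigramCollocationFinder.from_words(words)
          # Drop trigrams that occur fewer than min_freq times.
          tcf.apply_freq_filter(min_freq)
          # The slide appended to a `trigram_list` that is never defined in the
          # function; returning the filtered (trigram, count) pairs avoids that.
          return list(tcf.ngram_fd.items())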
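
For readers without NLTK, the counting on slides 12 to 14 can be approximated with `collections.Counter` alone. This is a sketch of the same idea rather than the project's code: `count_ngrams` is a hypothetical name, and it counts runs of adjacent tokens and then applies a minimum-frequency cutoff, which is roughly what `apply_freq_filter` does.

```python
from collections import Counter


def count_ngrams(tokens, n, min_freq=1):
    """Counts adjacent n-grams and keeps those seen at least min_freq times."""
    words = [w.lower() for w in tokens]
    # Zipping n staggered copies of the list yields every run of n adjacent tokens.
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(ngrams)
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy input built from words on slide 9; not real corpus data.
toy = ["snake", "robot", "locomotion", "snake", "robot", "control"]
bigrams = count_ngrams(toy, 2, min_freq=2)
trigrams = count_ngrams(toy, 3)
```

With `n=1` the same function gives unigram counts keyed by one-element tuples.
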
  15. results
  16. papers per year (chart comparing cs229 and nips for 2005 to 2012; y-axis from 0 to 700)
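
Per-year counts like the ones charted on slide 16 can be recovered from filenames alone, since the download step on slide 5 names each file `cs229-<year>-<counter>.pdf`. A small sketch under that assumption: the `nips-` prefix, the `papers_per_year` helper, and the sample filenames are hypothetical, as the slides only show the CS229 naming.

```python
import re
from collections import Counter

# "cs229-" follows slide 5; the "nips-" prefix is an assumed parallel convention.
FILENAME_PATTERN = re.compile(r"^(cs229|nips)-(\d{4})-\d+\.pdf$")


def papers_per_year(filenames):
    """Counts files per (source, year), ignoring names that do not match."""
    counts = Counter()
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match:
            source, year = match.group(1), int(match.group(2))
            counts[(source, year)] += 1
    return counts


# Made-up filenames, for illustration only.
sample = ["cs229-2005-0.pdf", "cs229-2005-1.pdf", "nips-2006-0.pdf", "notes.txt"]
counts = papers_per_year(sample)
```
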
  17. HOT :) PCA: 3.4x; gibbs sampling: 2.9x; logistic regression: 2.7x; naïve bayes: 2.7x; cross validation: 2.3x; random forest: 2.1x
      NOT :( nonlinear dimensionality reduction: 0.2x
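
Slide 17 does not say how the multipliers were computed. One plausible reading is a ratio of relative frequencies between a later and an earlier period, and the sketch below illustrates that reading only. The function name, the split into two periods, and the toy counts are all assumptions, not the author's method or data.

```python
def usage_ratio(term, early_counts, late_counts):
    """Relative frequency of `term` in the late period divided by the early one."""
    early_total = sum(early_counts.values())
    late_total = sum(late_counts.values())
    early_rate = early_counts.get(term, 0) / early_total
    late_rate = late_counts.get(term, 0) / late_total
    if early_rate == 0:
        # Term never appears in the early period, so no finite ratio exists.
        return float("inf")
    return late_rate / early_rate


# Toy counts: the term's share of all tokens triples between the two periods.
early = {"pca": 10, "other": 990}
late = {"pca": 30, "other": 970}
ratio = usage_ratio("pca", early, late)
```

Dividing by each period's total matters because the number of papers per year is not constant, so raw counts alone would mostly reflect corpus size.
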
  18. extensions
  19. • more data
      • more analysis
      • better visualizations
      • classification model, i.e. see if I can predict which year a paper is from based on the language used
