Mining the social web ch8 - 1
Transcript

  • 1. Mining The Social Web, Ch. 8 — Blogs et al.: Natural Language Processing (and Beyond) Ⅰ. Presenter: Kim Yeon-gi (김연기), "People Dreaming of Becoming Naver Architects" study group, http://Cafe.naver.com/architect1
  • 2. Natural Language Processing • Let's split sentences on periods!
  • 3. Natural Language Processing • Let's split sentences on periods!
  • 4. NLP Pipeline With NLTK: find sentence ends (EOS detection) → tokenize words → tag parts of speech (POS tagging) → assign word meanings (chunking) → extraction
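The four pipeline stages above can be sketched with toy stand-ins. Everything below (the splitter, tokenizer, and the tiny LEXICON) is an illustrative placeholder, not NLTK's API; in practice nltk.sent_tokenize, nltk.word_tokenize, and nltk.pos_tag do this work with trained models.

```python
# Toy stand-ins for the pipeline stages on the slide: EOS detection,
# tokenization, and POS tagging. LEXICON and the noun fallback are
# hypothetical; NLTK's trained models handle real text far better.

def detect_sentences(text):
    # Naive EOS detection: split on periods (the approach slides 2-3 propose)
    return [s.strip() for s in text.split('.') if s.strip()]

def tokenize(sentence):
    # Naive tokenization: split on whitespace
    return sentence.split()

LEXICON = {'John': 'NNP', 'wrote': 'VBD', 'a': 'DT'}  # hypothetical tiny tagger

def tag(tokens):
    # Lexicon lookup with a noun (NN) fallback, standing in for nltk.pos_tag
    return [(t, LEXICON.get(t, 'NN')) for t in tokens]

text = 'John wrote a book. Mr. Smith read it.'
for sentence in detect_sentences(text):
    print(tag(tokenize(sentence)))
```

Note how "Mr. Smith" gets broken into two pieces: this is exactly why naive period-splitting fails and why NLTK ships a trained sentence detector instead.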
  • 5. Natural Language Processing • Finding the end of a sentence (EOS Detection)
  • 6. Natural Language Processing • Finding the end of a sentence (EOS Detection)
  • 7. Natural Language Processing • Part-of-speech tagging (POS Tagging)
  • 8. Natural Language Processing
  • 9. Natural Language Processing • Extraction
  • 10. Natural Language Processing
  • 11. Natural Language Processing
  • 12. Natural Language Processing

      import feedparser
      from nltk import clean_html
      from BeautifulSoup import BeautifulStoneSoup

      def cleanHtml(html):
          # Strip markup and decode HTML entities
          return BeautifulStoneSoup(clean_html(html),
              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

      fp = feedparser.parse(FEED_URL)

      print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

      blog_posts = []
      for e in fp.entries:
          blog_posts.append({'title': e.title,
                             'content': cleanHtml(e.content[0].value),
                             'link': e.links[0].href})
  • 13. Natural Language Processing

      # Basic stats
      num_words = sum([i[1] for i in fdist.items()])
      num_unique_words = len(fdist.keys())

      # Hapaxes are words that appear only once
      num_hapaxes = len(fdist.hapaxes())

      top_10_words_sans_stop_words = [w for w in fdist.items()
                                      if w[0] not in stop_words][:10]

      print post['title']
      print '\tNum Sentences:'.ljust(25), len(sentences)
      print '\tNum Words:'.ljust(25), num_words
      print '\tNum Unique Words:'.ljust(25), num_unique_words
      print '\tNum Hapaxes:'.ljust(25), num_hapaxes
      print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
            '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                           for w in top_10_words_sans_stop_words])
      print
  • 14. Natural Language Processing
  • 15. Natural Language Processing

      # Summarization Approach 1:
      # Filter out non-significant sentences by using the average score
      # plus a fraction of the std dev as a filter
      avg = numpy.mean([s[1] for s in scored_sentences])
      std = numpy.std([s[1] for s in scored_sentences])
      mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                     if score > avg + 0.5 * std]

      # Summarization Approach 2:
      # Another approach would be to return only the top N ranked sentences
      top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
      top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
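To see what the two filters above actually keep, here is a self-contained run on made-up (index, score) pairs. The scores and TOP_SENTENCES value are illustrative only, and the statistics module stands in for numpy (pstdev matches numpy.std's population standard deviation).

```python
import statistics

# Hypothetical (sentence_index, score) pairs standing in for scored_sentences
scored_sentences = [(0, 1.0), (1, 4.0), (2, 2.5), (3, 6.0), (4, 0.5)]
TOP_SENTENCES = 2

# Approach 1: keep sentences scoring above mean + 0.5 * std dev
avg = statistics.mean(s[1] for s in scored_sentences)
std = statistics.pstdev([s[1] for s in scored_sentences])
mean_scored = [(idx, score) for (idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Approach 2: keep the top N by score, then restore document order
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

print(mean_scored)   # [(1, 4.0), (3, 6.0)]
print(top_n_scored)  # [(1, 4.0), (3, 6.0)]
```

Both filters happen to pick the same two sentences here; on real posts they can differ, since Approach 1's output size depends on the score distribution while Approach 2's is fixed at N.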
  • 16. Natural Language Processing
  • 17. Natural Language Processing – Luhn's Summarization Algorithm • Score = (number of significant words in the sentence)^2 / (total number of words in the sentence)
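The slide's ratio can be written as a small function. The example words and significant-word set below are made up; note that Luhn's original algorithm scores the densest cluster of significant words within a sentence, while the slide shows this simplified whole-sentence ratio.

```python
def luhn_score(sentence_words, significant_words):
    # (number of significant words in the sentence)^2 / (total words)
    hits = sum(1 for w in sentence_words if w in significant_words)
    return float(hits ** 2) / len(sentence_words)

words = ['nltk', 'makes', 'natural', 'language', 'processing', 'easy']
significant = {'nltk', 'language', 'processing'}
print(luhn_score(words, significant))  # 3**2 / 6 = 1.5
```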
  • 18. Natural Language Processing– Luhn’s Summarization Algorithm • Score =
