Mining The Social Web
Ch8 Blogs et al.: Natural Language Processing (and Beyond) Ⅰ

Presenter: 김연기
Naver cafe: 아키텍트를 꿈꾸는 사람들 (People Who Dream of Becoming Architects)
http://Cafe.naver.com/architect1
Natural Language Processing
• Let's split text into sentences at periods!
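Splitting at every period is the simplest possible rule. The sketch below (Python 3, with a made-up sample text and a hypothetical helper name) shows how it behaves when the text contains abbreviations such as "Mr.", which is why a dedicated sentence-boundary step is worth having.

```python
def naive_sentences(text):
    """Split on every '.' and drop empty pieces (a deliberately naive rule)."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]


# Illustrative sample text, not taken from the slides
text = "Mr. Green killed Colonel Mustard in the study. Mr. Green is not a nice fellow."

for piece in naive_sentences(text):
    print(repr(piece))
```

The two sentences come back as four pieces, because each "Mr." is treated as the end of a sentence.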
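NLP Pipeline With NLTK (stages in order)
• Finding the end of sentences (EOS detection)
• Splitting into words (tokenization)
• Pairing words with parts of speech (POS tagging)
• Assigning meaning to words
• Extraction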
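Tokenization has no slide of its own in this deck, so here is a minimal sketch of that stage in Python 3. It uses a regular expression as a stand-in for a real tokenizer such as NLTK's `nltk.word_tokenize`; the pattern and the function name are assumptions chosen for illustration.

```python
import re

# Assumed pattern: a word (optionally with an apostrophe part such as "n't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(sentence):
    """Return the words and punctuation marks of one sentence as a list."""
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("Mr. Green isn't a nice fellow."))
```

Punctuation becomes its own token, which keeps the later stages from having to strip it off words.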
Natural Language Processing
• Finding the end of sentences (EOS Detection)
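A rough sketch of rule-based EOS detection in Python 3: split after ".", "!" or "?" followed by whitespace, unless the period closes a known abbreviation. The abbreviation list and function name are hypothetical, and this is not the algorithm behind NLTK's `nltk.sent_tokenize`; it only illustrates the idea.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real text needs far more
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof"}


def split_sentences(text):
    """Split text into sentences, skipping periods that end an abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Mr. Green killed Colonel Mustard in the study. Mr. Green is not a nice fellow."
for sentence in split_sentences(sample):
    print(sentence)
```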
Natural Language Processing
• Pairing words with parts of speech (POS Tagging)
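To make the idea concrete, here is a toy lookup tagger in Python 3. The mini-lexicon, the capitalization guess, and the function name are all assumptions for illustration; a real tagger such as NLTK's `nltk.pos_tag` is trained on a corpus and is far more accurate. Tags follow the Penn Treebank convention.

```python
# Hypothetical mini-lexicon mapping lowercase words to Penn Treebank tags
LEXICON = {
    "killed": "VBD", "is": "VBZ", "in": "IN", "the": "DT", "a": "DT",
    "not": "RB", "nice": "JJ", "study": "NN", "fellow": "NN", ".": ".",
}


def lookup_tag(tokens):
    """Pair each token with a tag: lexicon first, then simple fallbacks."""
    tagged = []
    for token in tokens:
        if token.lower() in LEXICON:
            tag = LEXICON[token.lower()]
        elif token[:1].isupper():
            tag = "NNP"  # crude guess: unknown capitalized word is a proper noun
        else:
            tag = "NN"   # default tag for everything else
        tagged.append((token, tag))
    return tagged


tokens = ["Mr.", "Green", "killed", "Colonel", "Mustard", "in", "the", "study", "."]
print(lookup_tag(tokens))
```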
Natural Language Processing
• Extraction
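One simple way to sketch extraction in Python 3 is to group consecutive proper-noun tokens into candidate entities. The input below is hand-tagged sample data, and the function name and the choice of the "NNP" prefix are assumptions; the slides do not show which extraction rule was used.

```python
from itertools import groupby


def extract_entities(tagged_tokens, prefix="NNP"):
    """Join each run of consecutive tokens whose tag starts with prefix."""
    entities = []
    runs = groupby(tagged_tokens, key=lambda pair: pair[1].startswith(prefix))
    for is_match, run in runs:
        if is_match:
            entities.append(" ".join(token for token, _tag in run))
    return entities


# Hand-tagged sample sentence (illustrative data)
tagged = [
    ("Mr.", "NNP"), ("Green", "NNP"), ("killed", "VBD"),
    ("Colonel", "NNP"), ("Mustard", "NNP"), ("in", "IN"),
    ("the", "DT"), ("study", "NN"), (".", "."),
]
print(extract_entities(tagged))  # ['Mr. Green', 'Colonel Mustard']
```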
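Natural Language Processing
# Python 2 code; relies on the legacy BeautifulSoup 3 and NLTK 2 APIs
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

def cleanHtml(html):
    # Strip the tags, then convert HTML entities to plain characters
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

# FEED_URL is assumed to be defined earlier as the URL of the blog feed
fp = feedparser.parse(FEED_URL)
# Report how many entries were fetched
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})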
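The HTML-cleaning step can also be sketched with the Python 3 standard library alone. This is an assumed alternative, not the code from the slides: the class and function names are hypothetical, and the sketch keeps all text nodes (including any script or style content), so it is only suitable for simple post bodies.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(clean_html_text("<p>Tom &amp; Jerry</p>\n<p>Hello <b>world</b>!</p>"))
```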
Natural Language Processing
# Basic stats for one post (Python 2). fdist (an nltk.FreqDist of the post's
# words), stop_words, sentences, and post are assumed to be defined earlier.
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

# Relies on NLTK 2, where fdist.items() is sorted by decreasing frequency
top_10_words_sans_stop_words = [w for w in fdist.items()
                                if w[0] not in stop_words][:10]

print post['title']
print '\tNum Sentences:'.ljust(25), len(sentences)
print '\tNum Words:'.ljust(25), num_words
print '\tNum Unique Words:'.ljust(25), num_unique_words
print '\tNum Hapaxes:'.ljust(25), num_hapaxes
print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
    '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                   for w in top_10_words_sans_stop_words])
print
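The same statistics can be sketched in Python 3 with `collections.Counter` in place of NLTK's frequency distribution. The function name, the lowercasing of words, and the sample data are assumptions made for this illustration.

```python
from collections import Counter


def basic_stats(words, stop_words, top_n=10):
    """Compute word count, unique words, hapaxes, and top non-stop words."""
    counts = Counter(w.lower() for w in words)  # lowercasing is an assumption
    hapaxes = [w for w, c in counts.items() if c == 1]
    # most_common() yields (word, count) pairs by decreasing count
    top = [(w, c) for w, c in counts.most_common() if w not in stop_words]
    return {
        "num_words": sum(counts.values()),
        "num_unique_words": len(counts),
        "num_hapaxes": len(hapaxes),
        "top_words": top[:top_n],
    }


# Illustrative sample data
words = "the cat sat on the mat and the dog sat too".split()
stats = basic_stats(words, stop_words={"the", "on", "and"})
for name, value in stats.items():
    print(name.ljust(20), value)
```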
Natural Language Processing
import numpy

# scored_sentences is assumed to be a list of (sentence index, score) pairs,
# and TOP_SENTENCES the number of sentences to keep; both are defined earlier.

# Summarization Approach 1:
# Filter out non-significant sentences by using the average score plus a
# fraction of the std dev as a filter

avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences

top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
# Put the selected sentences back into their original order
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
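Both selection strategies can be tried on a small example without NumPy. The sketch below is Python 3 and uses `statistics.pstdev`, the population standard deviation, which matches the default behavior of `numpy.std`; the function names and the sample scores are made up for illustration.

```python
from statistics import mean, pstdev


def select_by_mean(scored_sentences, k=0.5):
    """Approach 1: keep sentences scoring above mean + k * std dev."""
    scores = [score for _idx, score in scored_sentences]
    threshold = mean(scores) + k * pstdev(scores)
    return [(idx, score) for idx, score in scored_sentences if score > threshold]


def select_top_n(scored_sentences, n):
    """Approach 2: keep the n best sentences, returned in document order."""
    if n <= 0:
        return []
    top = sorted(scored_sentences, key=lambda s: s[1])[-n:]
    return sorted(top, key=lambda s: s[0])


# (sentence index, score) pairs; illustrative values
scored = [(0, 1.0), (1, 4.0), (2, 2.0), (3, 9.0), (4, 0.0)]
print(select_by_mean(scored))   # [(3, 9.0)]
print(select_top_n(scored, 2))  # [(1, 4.0), (3, 9.0)]
```

Approach 1 adapts the summary length to how the scores are spread, while approach 2 always returns a fixed number of sentences.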
Natural Language Processing
– Luhn's Summarization Algorithm
  • Score = (number of important words in the sentence)^2 / (total number of words in the sentence)
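The formula can be written as a small Python 3 function. This sketch follows the simplified per-sentence formula shown on the slide; Luhn's original method scores clusters of significant words inside a sentence, which is not reproduced here. The function name and sample data are hypothetical.

```python
def luhn_score(sentence_words, important_words):
    """Score = (important words in the sentence)^2 / (total words in it)."""
    if not sentence_words:
        return 0.0  # avoid dividing by zero for an empty sentence
    significant = sum(1 for w in sentence_words if w.lower() in important_words)
    return significant ** 2 / len(sentence_words)


# Illustrative data: 2 important words out of 5, so the score is 4 / 5
sentence = ["Mr", "Green", "killed", "Colonel", "Mustard"]
important = {"green", "mustard"}
print(luhn_score(sentence, important))  # 0.8
```

Squaring the count rewards sentences that are dense in important words over long sentences that mention only one.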
