SlideShare a Scribd company logo
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
A Distributed Framework for NLP‐Based 
Keyword and Keyphrase Extraction From 
Web Pages and Documents
Paolo Nesi, Gianni Pantaleo and Gianmarco Sanesi
Department of Information Engineering, DINFO 
University of Florence
Via S. Marta 3, 50139, Firenze, Florence, Italy
Tel: +39‐055‐2758511,       fax: +39‐055‐2758516
DISIT Lab
http://www.disit.dinfo.unifi.it alias http://www.disit.org
paolo.nesi@unifi.it , gianni.pantaleo@unifi.it
21st International Conference on Distributed Multimedia Systems – DMS2015
DISIT Lab (DINFO UNIFI), 31° August 2015 1
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
2
Problem Focus & Main Issues
 The WWW continuously growing represents a massive source of knowledge,
which is embedded for the most part in the textual content of web pages,
documents, social media etc...
 Problem: Online web resources, pages and documents are mainly represented
as unstructured natural language text data. Instead, structured data are
required for the extraction and management of higher level knowledge.
 For automatic retrieval and production of structured information there is the
need to ingest, process and analyze huge amounts (Big Data problem) of natural
language data (NLP, Statistical, Machine Learning & AI problems).
DISIT Lab (DINFO UNIFI), 31° August 2015
100 Petabytes per day processed 
on 3 million servers in 2014
300 Petabytes stored
in the first half of 2014
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
3
Application Areas
 Many application fields and areas:
 Keywords & keyphrase extraction for Content/Topic extraction and
summarization.
 Comprehension of natural language text documents.
 Production of machine‐readable corpora (integration with semantic
resources and ontologies…).
 Indexing for search engine functionalities: Content‐based, multi‐faceted
search queries…
 Design of expert systems (e.g.: Recommendation Tools, DSS, Question‐
Answer systems, personal assistants, etc.)
 Social media mining and user behavior analysis for target‐marketing and
customized services.
 Applications and Interests in: e‐commerce, target marketing and
advertising, business web services, Smart City platforms, e‐Healthcare,
scientific research, ICT technologies etc… DISIT Lab (DINFO UNIFI), 31° August 2015
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
4
Distributed Architecture & Parallel Computing Frameworks
DISIT Lab (DINFO UNIFI), 31° August 2015
 Currently, single‐machine, non‐distributed architectures are proving to be
inefficient for tasks like Big Data mining and intensive text processing and NLP
analysis.
 Distributed Architectures and Parallel Computing techniques have been
increasingly adopted in literature for automatic keyword extraction on Big Data
sets.
 First attempts of employing parallel computing frameworks for NLP tasks in
the middle 90s: Chung and Moldovan [Chung and Moldovan, 1995]
proposed a parallel memory‐based parser called PARALLEL implemented
on a parallel computer, the Semantic Network Array Processor (SNAP).
 Ogmios [Hamon et al., 2007] is a platform for annotation enrichment and
NLP analysis of specialized domain documents within a distributed corpus.
 Exner and Hugues recently presented Koshik [Exner and Hugues, 2014], a
multi‐language NLP processing framework for large scale‐processing and
querying of unstructured natural language documents distributed upon a
Hadoop‐based cluster.
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
5
The Open Source Apache Hadoop Framework
 The proposed system has been designed to run on the Apache Hadoop open source
framework:
 A significant advantage of using the Hadoop
distributed architecture is the capability to efficiently
and easily scale by adding inexpensive commodity
hardware to the cluster.
DISIT Lab (DINFO UNIFI), 31° August 2015[Figures reference: Holmes, A., “Hadoop in Practice”, Ed. Manning (2012).]
Hadoop File System: HDFS Hadoop MapReduce paradigm
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
6
General Architecture
DISIT Lab (DINFO UNIFI), 31° August 2015
Hadoop HDFS
Web Pages 
& 
Documents
Web 
Crawler
Crawler DB
Gate .XGAPP 
Application
3rd Party
Plugins &  
Resources
Keyword DB
Namenode 
(Master)
Datanode 
(Slave) #1
Datanode 
(Slave) #2
Datanode 
(Slave) #3
Datanode 
(Slave) #4
HDFS 
Storage
Map
Reduce
Map
Reduce
Map
… …
Map
Reduce
Map
Reduce
Map
… …
Map
Reduce
Map
Reduce
Map
… …
Map
Reduce
Map
Reduce
Map
… …
Get Domain
Get Domain
Get Domain
Get Domain
Execute GATE,
TF‐IDF
Execute GATE,
TF‐IDF
Execute GATE,
TF‐IDF
Execute GATE,
TF‐IDF
Keywords & 
Keyphrases 
Extractor
Job 
Partitioner
DB Storage 
(Sqoop)
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
7
General Architecture
DISIT Lab (DINFO UNIFI), 31° August 2015
The system architecture is composed by 3 main modules:
 The Web Crawler module, based on the open source Apache Nutch tool;
 To retrieves and parse the textual content of web pages.
 The Keywords / Keyphrases Extractor module is responsible for:
 The execution of our NLP application (based on GATE) in a Hadoop
MapReduce environment for text annotation and keywords / keyphrases
extraction and storage in the HDFS.
 The keywords and keyphrases relevance estimation, obtained by
computing the TF‐IDF function for each extracted keywords / keyphrases,
performed to assess their relevance with respect to their whole corpus.
 The DB Storage module which finally stores designed keywords and
keyphrases into an external SQL database, using the Apache Sqoop open
source tool (specifically designed for data transfer between Hadoop HDFS and
structured datastores).
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
8
Keywords & Keyphrases Extractor Module, .xgapp
DISIT Lab (DINFO UNIFI), 31° August 2015
 The Map function that associates key/value record pairs where the key is the URL
of the single web page, and the value is the corresponding web domain (grouping
web pages of the same domain).
 The Reduce function, in turn, fulfills the following operations:
 Setup, launch and execution of a multi‐corpora GATE application for keywords
/ keyphrases extraction:
 Estimation of extracted keywords / keyphrases relevance at web domain level,
by implementing the TF‐IDF function:
∙
, log .
Doc
Doc
Corpus 1
Corpus N
. . .  
Splitter, 
Tokenizer
…
GATE Application
TreeTagger
(POS‐Tagging)
. . .   . . .  
JAPE
(+Syntactical 
Annotation Rules)
. . .   . . .  
Keywords
List 1
Keywords
List N
. . .  
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
9
Validation – Test Dataset and Configurations
 Test Dataset: 10000 web pages and documents ingested by the Web Crawler from a
seed list of commercial companies, services and research institutes web domains.
 Test Configurations: Different test configurations for the Hadoop Cluster: from 2 to 5
active nodes.
Nodes Running Daemons
Master Namenode, JobTracker,
TaskTracker
Slave #1 SecondaryNamenode
Datanode, TaskTracker
Slave #2 Datanode, TaskTracker
Slave #3 Datanode, TaskTracker
Slave #4 Datanode, TaskTracker
Test Config. 1: 5‐Nodes Cluster
Nodes Running Daemons
Master Namenode, JobTracker,
TaskTracker
Slave #1 SecondaryNamenode
Datanode, TaskTracker
Slave #2 Datanode, TaskTracker
Slave #3 Datanode, TaskTracker
Nodes Running Daemons
Master Namenode, JobTracker,
TaskTracker
Slave #1 SecondaryNamenode
Datanode, TaskTracker
Slave #2 Datanode, TaskTracker
Nodes Running Daemons
Master Namenode, JobTracker,
TaskTracker
Slave #1 SecondaryNamenode
Datanode, TaskTracker
Test Config. 2: 4‐Nodes Cluster Test Config. 3: 3‐Nodes Cluster
Test Config. 4: 2‐Nodes Cluster
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
10
Results and Discussion
 Results: For each cluster configuration, processing times for extracting a total of
about 2.5 Million keywords and keyphrases have been assessed and compared.
 The scaling capabilities observed confirm the nearly linear growth trend of
the Hadoop architecture.
 Significant performance improvements can be achieved only for a larger
number of nodes in the cluster.
 Single cpu/thread execution in 60 hours DISIT Lab (DINFO UNIFI), 31° August 2015
Configuration
Processing 
Time 
(hh:mm:ss)
Speed ‐
Up
HDFS ‐ single node 07:17:01 ‐
HDFS ‐ 2 nodes 05:21:53 1.36
HDFS ‐ 3 nodes 04:08:00 1.76
HDFS ‐ 4 nodes 03:39:42 1,99
HDFS ‐ 5 nodes 03:20:09 2.18 0
1
2
3
4
5
6
2 3 4 5
SpeedUp
Nodes #
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
11
A Current Application
 Integration in the Twitter Vigilance tool (http://www.disit.org/tv/), developed at
DISIT Lab, University of Florence.
 Twitter Vigilance is a multi user tool for making analysis of "Twitter
channels". Each channel can be tuned to monitor one or more Search
Queries on Twitter with a sophisticated and expressive syntax.
 Some public channels are publically accessible as examples.
DISIT Lab (DINFO UNIFI), 31° August 2015
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
12
A Current Application
 Currently more than 12 Millions Twitters have been processed, at a rate of
approximately 200000 New TW per day, so a distributed architecture is required.
 Main Channel focus:
 City Service Assessment: smart city application
 Weather conditions and measures
 Weather communication channel assessment and qualification
 Events appreciation assessment
 Product market appreciation assessment
 Drugs vs people appreciation assessment
DISIT Lab (DINFO UNIFI), 31° August 2015
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
13
Sentiment and Measure Analysis of Social Media (Twitter)
DISIT Lab (DINFO UNIFI), 31° August 2015
Twitter 
API
Twitter
Ingested 
Tweets DB
External
Knowledge Base
(e.g. SentiWordnet)
Analysis
Sentiment‐Score
Keywords DB
Context &
Weights
Channel 
or selec.
Affective 
Classifier 
Twitter
affective 
DB
Twitter
Vigilance
Backend
Indexer
Twitter
Vigilance
User Interface
Tweets
http://tvsolr.disit.org/
http://www.disit.org/tv/
Measure 
Keywords DB
Keyword & 
Keyphrase 
Extractor
Dependency
Parsing, RST …
Measure 
Classifier 
Twitter
Measure 
DB
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
Conclusions
• A NLP application to enforce the NLP capabilities 
into Hadoop has been designed and built
• Advantages with respect to the state of the art are: 
– usage of GATE, instead of creating a new one
– Different kind of text sources and not only documents
– Larger flexibility
• The resulting solution 
– has been assessed in terms of performance
– Is currently in further development and use for Twitter 
Vigilance Affective and Measuring analyses in multiple 
projects with different institutions: CNR IBINET, LAMMA, 
NEUROFARBA
DISIT Lab (DINFO UNIFI), 31° August 2015 14
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
15
Sentiment and Measure Analysis of Social Media (Twitter)
 Extracted keywords could be monitored by different POS (nouns, adjectives
and verbs)
or by special characters for user mentions (@) and hashtags (#), in order to
asses temporal trends and perform statistical analysis.
 By exploiting external knowledge bases and lexicons specifically designed for
Sentiment Analysis (such as SentiWordnet), sentiment scores can be assigned
to each extracted keyword and keyphrases (as compositions of keywords) in
order to estimate the general sentiment polarity of collected tweets.
 An advanced Sentiment Analysis could be eventually performed by employing
more sophisticated techniques, such as dependency parsing, Rhetorical
Structure Theory (RST) and Discourse Following, in order to perform
syntactical sentence parsing and extract relations and dependencies among
keywords and syntagmas.
 These resources can be then used to weight sentiment coefficients previously
assigned to extracted keywords, in order to improve the estimation of sentiment.
DISIT Lab (DINFO UNIFI), 31° August 2015
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
16
References
DISIT Lab (DINFO UNIFI), 31° August 2015
[Chung and Moldovan, 1995] M. Chung and D. I. Moldovan, “Parallel Natural Language Processing
on a Semantic Network Array Processor”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 7(3), pp. 391‐404, 1995.
[Exner and Nugues, 2014] Exner, P. and Nugues, P., “KOSHIK ‐ A Large‐scale Distributed Computing
Framework for NLP“, in Proc. of the International Conference on Pattern Recognition
Applications and Methods (ICPRAM 2014), pp. 463‐470, 2014.
[Hamon et al., 2007] T. Hamon, J. Deriviere and Nazarenko, “Ogmios: a scalable NLP platform for
annotating large web document collections”, in Proc. of Corpus Linguistics, Birmingham,
United Kingdom, 2007.
[Holmes, 2012] Holmes, A., “Hadoop in Practice”, Ed. Manning (2012).
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
17
Thanks for your 
attention !
DISIT Lab (DINFO UNIFI), 31° August 2015

More Related Content

What's hot

Top 11 python frameworks for machine learning and deep learning
Top 11 python frameworks for machine learning and deep learningTop 11 python frameworks for machine learning and deep learning
Top 11 python frameworks for machine learning and deep learning
ThinkTanker Technosoft PVT LTD
 
Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...
Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...
Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...
IntelliSemantic
 
What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...
What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...
What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...
Blancco
 
Introduction to eCloud
Introduction to eCloudIntroduction to eCloud
Introduction to eCloud
TU Delft, Netherlands
 
Multimedia data mining using deep learning
Multimedia data mining using deep learningMultimedia data mining using deep learning
Multimedia data mining using deep learning
Peter Wlodarczak
 

What's hot (6)

Top 11 python frameworks for machine learning and deep learning
Top 11 python frameworks for machine learning and deep learningTop 11 python frameworks for machine learning and deep learning
Top 11 python frameworks for machine learning and deep learning
 
Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...
Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...
Alberto Ciaramella: "Linked patent data: opportunities and challenges for pat...
 
What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...
What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...
What One Digital Forensics Expert Found on Hundreds of Hard Drives, iPhones a...
 
CV
CVCV
CV
 
Introduction to eCloud
Introduction to eCloudIntroduction to eCloud
Introduction to eCloud
 
Multimedia data mining using deep learning
Multimedia data mining using deep learningMultimedia data mining using deep learning
Multimedia data mining using deep learning
 

Viewers also liked

Roonm 17 Lunch Audit
Roonm 17 Lunch AuditRoonm 17 Lunch Audit
Roonm 17 Lunch AuditHelenOfTroy
 
SES Astra satellite monitor survey - Europe and Ukraine, 2010
SES Astra satellite monitor survey - Europe and Ukraine, 2010SES Astra satellite monitor survey - Europe and Ukraine, 2010
SES Astra satellite monitor survey - Europe and Ukraine, 2010Grayling Ukraine
 
Km4City Simple access to open and aggregated data for Public Administratio...
 Km4City   Simple access to open and aggregated data for Public Administratio... Km4City   Simple access to open and aggregated data for Public Administratio...
Km4City Simple access to open and aggregated data for Public Administratio...
Paolo Nesi
 
Km4City: Smart City Ontology Building for Effective Erogation of Services
Km4City: Smart City Ontology Building for Effective Erogation of ServicesKm4City: Smart City Ontology Building for Effective Erogation of Services
Km4City: Smart City Ontology Building for Effective Erogation of Services
Paolo Nesi
 
How to create a Pick a Path Story using Powerpoint
How to create a Pick a Path Story using PowerpointHow to create a Pick a Path Story using Powerpoint
How to create a Pick a Path Story using Powerpoint
HelenOfTroy
 
Takahe by Jacob, Claire, Yuan and Karen L
Takahe by Jacob, Claire, Yuan and Karen LTakahe by Jacob, Claire, Yuan and Karen L
Takahe by Jacob, Claire, Yuan and Karen L
HelenOfTroy
 
Km4City Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...
Km4City   Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...Km4City   Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...
Km4City Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...
Paolo Nesi
 
Km4City: how to make smart and resilient your city, beginner document
Km4City: how to make smart and resilient your city, beginner documentKm4City: how to make smart and resilient your city, beginner document
Km4City: how to make smart and resilient your city, beginner document
Paolo Nesi
 
Smart Cities e Città Metropolitane
Smart Cities e Città MetropolitaneSmart Cities e Città Metropolitane
Smart Cities e Città Metropolitane
Paolo Nesi
 
A2 Media Evaluation Proper
A2  Media  Evaluation ProperA2  Media  Evaluation Proper
A2 Media Evaluation Proper
benbriscoe2010
 
Ukraine credentials grayling (2010)
Ukraine credentials   grayling (2010)Ukraine credentials   grayling (2010)
Ukraine credentials grayling (2010)
Grayling Ukraine
 
Our language pictures
Our language picturesOur language pictures
Our language pictures
HelenOfTroy
 
Connecting With Citizens Over The Web
Connecting With Citizens Over The WebConnecting With Citizens Over The Web
Connecting With Citizens Over The Web
Niyam Bhushan
 
Grayling Ukraine - Vaccination Thinkpiece (UKR)
Grayling Ukraine - Vaccination Thinkpiece (UKR)Grayling Ukraine - Vaccination Thinkpiece (UKR)
Grayling Ukraine - Vaccination Thinkpiece (UKR)Grayling Ukraine
 
Grayling Ukraine - Vaccination Thinkpiece
Grayling Ukraine - Vaccination ThinkpieceGrayling Ukraine - Vaccination Thinkpiece
Grayling Ukraine - Vaccination ThinkpieceGrayling Ukraine
 
Km4city: open flexible scalable city platform
Km4city: open flexible scalable city platformKm4city: open flexible scalable city platform
Km4city: open flexible scalable city platform
Paolo Nesi
 
SAMR Presentation by Helen Prescott
SAMR Presentation by Helen PrescottSAMR Presentation by Helen Prescott
SAMR Presentation by Helen Prescott
HelenOfTroy
 

Viewers also liked (18)

Roonm 17 Lunch Audit
Roonm 17 Lunch AuditRoonm 17 Lunch Audit
Roonm 17 Lunch Audit
 
SES Astra satellite monitor survey - Europe and Ukraine, 2010
SES Astra satellite monitor survey - Europe and Ukraine, 2010SES Astra satellite monitor survey - Europe and Ukraine, 2010
SES Astra satellite monitor survey - Europe and Ukraine, 2010
 
Km4City Simple access to open and aggregated data for Public Administratio...
 Km4City   Simple access to open and aggregated data for Public Administratio... Km4City   Simple access to open and aggregated data for Public Administratio...
Km4City Simple access to open and aggregated data for Public Administratio...
 
Km4City: Smart City Ontology Building for Effective Erogation of Services
Km4City: Smart City Ontology Building for Effective Erogation of ServicesKm4City: Smart City Ontology Building for Effective Erogation of Services
Km4City: Smart City Ontology Building for Effective Erogation of Services
 
How to create a Pick a Path Story using Powerpoint
How to create a Pick a Path Story using PowerpointHow to create a Pick a Path Story using Powerpoint
How to create a Pick a Path Story using Powerpoint
 
A2 Media Evaluation
A2 Media EvaluationA2 Media Evaluation
A2 Media Evaluation
 
Takahe by Jacob, Claire, Yuan and Karen L
Takahe by Jacob, Claire, Yuan and Karen LTakahe by Jacob, Claire, Yuan and Karen L
Takahe by Jacob, Claire, Yuan and Karen L
 
Km4City Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...
Km4City   Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...Km4City   Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...
Km4City Accesso Semplice a Open Data e Dati Aggregati per le Pubbliche Ammi...
 
Km4City: how to make smart and resilient your city, beginner document
Km4City: how to make smart and resilient your city, beginner documentKm4City: how to make smart and resilient your city, beginner document
Km4City: how to make smart and resilient your city, beginner document
 
Smart Cities e Città Metropolitane
Smart Cities e Città MetropolitaneSmart Cities e Città Metropolitane
Smart Cities e Città Metropolitane
 
A2 Media Evaluation Proper
A2  Media  Evaluation ProperA2  Media  Evaluation Proper
A2 Media Evaluation Proper
 
Ukraine credentials grayling (2010)
Ukraine credentials   grayling (2010)Ukraine credentials   grayling (2010)
Ukraine credentials grayling (2010)
 
Our language pictures
Our language picturesOur language pictures
Our language pictures
 
Connecting With Citizens Over The Web
Connecting With Citizens Over The WebConnecting With Citizens Over The Web
Connecting With Citizens Over The Web
 
Grayling Ukraine - Vaccination Thinkpiece (UKR)
Grayling Ukraine - Vaccination Thinkpiece (UKR)Grayling Ukraine - Vaccination Thinkpiece (UKR)
Grayling Ukraine - Vaccination Thinkpiece (UKR)
 
Grayling Ukraine - Vaccination Thinkpiece
Grayling Ukraine - Vaccination ThinkpieceGrayling Ukraine - Vaccination Thinkpiece
Grayling Ukraine - Vaccination Thinkpiece
 
Km4city: open flexible scalable city platform
Km4city: open flexible scalable city platformKm4city: open flexible scalable city platform
Km4city: open flexible scalable city platform
 
SAMR Presentation by Helen Prescott
SAMR Presentation by Helen PrescottSAMR Presentation by Helen Prescott
SAMR Presentation by Helen Prescott
 

Similar to NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents

Smart Cloud Engine and Solution based on Knowledge Base
Smart Cloud Engine and Solution based on Knowledge BaseSmart Cloud Engine and Solution based on Knowledge Base
Smart Cloud Engine and Solution based on Knowledge Base
Paolo Nesi
 
DISIT Lab overview: smart city, big data, semantic computing, cloud
DISIT Lab overview: smart city, big data, semantic computing, cloudDISIT Lab overview: smart city, big data, semantic computing, cloud
DISIT Lab overview: smart city, big data, semantic computing, cloud
Paolo Nesi
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
Gérard Dupont
 
Knowledge mining and Semantic Models: from Cloud to Smart City
Knowledge mining and Semantic Models: from Cloud to Smart CityKnowledge mining and Semantic Models: from Cloud to Smart City
Knowledge mining and Semantic Models: from Cloud to Smart City
Paolo Nesi
 
Benchmarking open source deep learning frameworks
Benchmarking open source deep learning frameworksBenchmarking open source deep learning frameworks
Benchmarking open source deep learning frameworks
IJECEIAES
 
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...
Paolo Nesi
 
Hughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesHughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication Repositories
ASIS&T
 
Data management plans – EUDAT Best practices and case study | www.eudat.eu
Data management plans – EUDAT Best practices and case study | www.eudat.euData management plans – EUDAT Best practices and case study | www.eudat.eu
Data management plans – EUDAT Best practices and case study | www.eudat.eu
EUDAT
 
Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Oladokun Sulaiman
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Mark Goldstein
 
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Paolo Nesi
 
Frankfurt Big Data Lab & Refugee Projeect
Frankfurt Big Data Lab & Refugee ProjeectFrankfurt Big Data Lab & Refugee Projeect
Frankfurt Big Data Lab & Refugee Projeect
Goethe Univeristy
 
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city ServicesOntology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Paolo Nesi
 
Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...
Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...
Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...
Paolo Nesi
 
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
FIWARE
 
UCIAD overview
UCIAD overviewUCIAD overview
UCIAD overview
Mathieu d'Aquin
 
Samsung SDS OpeniT - The possibility of Python
Samsung SDS OpeniT - The possibility of PythonSamsung SDS OpeniT - The possibility of Python
Samsung SDS OpeniT - The possibility of Python
Insuk (Chris) Cho
 
Summer school bz_fp7research_20100708
Summer school bz_fp7research_20100708Summer school bz_fp7research_20100708
Summer school bz_fp7research_20100708
Sandro D'Elia
 
Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_
Jeffrey Shomaker
 
IOT-2016 7-9 Septermber, 2016, Stuttgart, Germany
IOT-2016  7-9 Septermber, 2016, Stuttgart, GermanyIOT-2016  7-9 Septermber, 2016, Stuttgart, Germany
IOT-2016 7-9 Septermber, 2016, Stuttgart, Germany
Charith Perera
 

Similar to NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents (20)

Smart Cloud Engine and Solution based on Knowledge Base
Smart Cloud Engine and Solution based on Knowledge BaseSmart Cloud Engine and Solution based on Knowledge Base
Smart Cloud Engine and Solution based on Knowledge Base
 
DISIT Lab overview: smart city, big data, semantic computing, cloud
DISIT Lab overview: smart city, big data, semantic computing, cloudDISIT Lab overview: smart city, big data, semantic computing, cloud
DISIT Lab overview: smart city, big data, semantic computing, cloud
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Knowledge mining and Semantic Models: from Cloud to Smart City
Knowledge mining and Semantic Models: from Cloud to Smart CityKnowledge mining and Semantic Models: from Cloud to Smart City
Knowledge mining and Semantic Models: from Cloud to Smart City
 
Benchmarking open source deep learning frameworks
Benchmarking open source deep learning frameworksBenchmarking open source deep learning frameworks
Benchmarking open source deep learning frameworks
 
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analyt...
 
Hughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesHughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication Repositories
 
Data management plans – EUDAT Best practices and case study | www.eudat.eu
Data management plans – EUDAT Best practices and case study | www.eudat.euData management plans – EUDAT Best practices and case study | www.eudat.eu
Data management plans – EUDAT Best practices and case study | www.eudat.eu
 
Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
 
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
 
Frankfurt Big Data Lab & Refugee Projeect
Frankfurt Big Data Lab & Refugee ProjeectFrankfurt Big Data Lab & Refugee Projeect
Frankfurt Big Data Lab & Refugee Projeect
 
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city ServicesOntology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
 
Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...
Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...
Rights Enforcement and Licensing Understanding for RDF Stores Aggregating Ope...
 
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
 
UCIAD overview
UCIAD overviewUCIAD overview
UCIAD overview
 
Samsung SDS OpeniT - The possibility of Python
Samsung SDS OpeniT - The possibility of PythonSamsung SDS OpeniT - The possibility of Python
Samsung SDS OpeniT - The possibility of Python
 
Summer school bz_fp7research_20100708
Summer school bz_fp7research_20100708Summer school bz_fp7research_20100708
Summer school bz_fp7research_20100708
 
Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_
 
IOT-2016 7-9 Septermber, 2016, Stuttgart, Germany
IOT-2016  7-9 Septermber, 2016, Stuttgart, GermanyIOT-2016  7-9 Septermber, 2016, Stuttgart, Germany
IOT-2016 7-9 Septermber, 2016, Stuttgart, Germany
 

Recently uploaded

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 

Recently uploaded (20)

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 

NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents

  • 1. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it A Distributed Framework for NLP‐Based  Keyword and Keyphrase Extraction From  Web Pages and Documents Paolo Nesi, Gianni Pantaleo and Gianmarco Sanesi Department of Information Engineering, DINFO  University of Florence Via S. Marta 3, 50139, Firenze, Florence, Italy Tel: +39‐055‐2758511,       fax: +39‐055‐2758516 DISIT Lab http://www.disit.dinfo.unifi.it alias http://www.disit.org paolo.nesi@unifi.it , gianni.pantaleo@unifi.it 21st International Conference on Distributed Multimedia Systems – DMS2015 DISIT Lab (DINFO UNIFI), 31° August 2015 1
  • 2. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 2 Problem Focus & Main Issues  The WWW continuously growing represents a massive source of knowledge, which is embedded for the most part in the textual content of web pages, documents, social media etc...  Problem: Online web resources, pages and documents are mainly represented as unstructured natural language text data. Instead, structured data are required for the extraction and management of higher level knowledge.  For automatic retrieval and production of structured information there is the need to ingest, process and analyze huge amounts (Big Data problem) of natural language data (NLP, Statistical, Machine Learning & AI problems). DISIT Lab (DINFO UNIFI), 31° August 2015 100 Petabytes per day processed  on 3 million servers in 2014 300 Petabytes stored in the first half of 2014
  • 3. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 3 Application Areas  Many application fields and areas:  Keywords & keyphrase extraction for Content/Topic extraction and summarization.  Comprehension of natural language text documents.  Production of machine‐readable corpora (integration with semantic resources and ontologies…).  Indexing for search engine functionalities: Content‐based, multi‐faceted search queries…  Design of expert systems (e.g.: Recommendation Tools, DSS, Question‐ Answer systems, personal assistants, etc.)  Social media mining and user behavior analysis for target‐marketing and customized services.  Applications and Interests in: e‐commerce, target marketing and advertising, business web services, Smart City platforms, e‐Healthcare, scientific research, ICT technologies etc… DISIT Lab (DINFO UNIFI), 31° August 2015
  • 4. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 4 Distributed Architecture & Parallel Computing Frameworks DISIT Lab (DINFO UNIFI), 31° August 2015  Currently, single‐machine, non‐distributed architectures are proving to be inefficient for tasks like Big Data mining and intensive text processing and NLP analysis.  Distributed Architectures and Parallel Computing techniques have been increasingly adopted in literature for automatic keyword extraction on Big Data sets.  First attempts of employing parallel computing frameworks for NLP tasks in the middle 90s: Chung and Moldovan [Chung and Moldovan, 1995] proposed a parallel memory‐based parser called PARALLEL implemented on a parallel computer, the Semantic Network Array Processor (SNAP).  Ogmios [Hamon et al., 2007] is a platform for annotation enrichment and NLP analysis of specialized domain documents within a distributed corpus.  Exner and Hugues recently presented Koshik [Exner and Hugues, 2014], a multi‐language NLP processing framework for large scale‐processing and querying of unstructured natural language documents distributed upon a Hadoop‐based cluster.
  • 5. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 5 The Open Source Apache Hadoop Framework  The proposed system has been designed to run on the Apache Hadoop open source framework:  A significant advantage of using the Hadoop distributed architecture is the capability to efficiently and easily scale by adding inexpensive commodity hardware to the cluster. DISIT Lab (DINFO UNIFI), 31° August 2015[Figures reference: Holmes, A., “Hadoop in Practice”, Ed. Manning (2012).] Hadoop File System: HDFS Hadoop MapReduce paradigm
  • 6. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 6 General Architecture DISIT Lab (DINFO UNIFI), 31° August 2015 Hadoop HDFS Web Pages  &  Documents Web  Crawler Crawler DB Gate .XGAPP  Application 3rd Party Plugins &   Resources Keyword DB Namenode  (Master) Datanode  (Slave) #1 Datanode  (Slave) #2 Datanode  (Slave) #3 Datanode  (Slave) #4 HDFS  Storage Map Reduce Map Reduce Map … … Map Reduce Map Reduce Map … … Map Reduce Map Reduce Map … … Map Reduce Map Reduce Map … … Get Domain Get Domain Get Domain Get Domain Execute GATE, TF‐IDF Execute GATE, TF‐IDF Execute GATE, TF‐IDF Execute GATE, TF‐IDF Keywords &  Keyphrases  Extractor Job  Partitioner DB Storage  (Sqoop)
  • 7. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 7 General Architecture DISIT Lab (DINFO UNIFI), 31° August 2015 The system architecture is composed by 3 main modules:  The Web Crawler module, based on the open source Apache Nutch tool;  To retrieves and parse the textual content of web pages.  The Keywords / Keyphrases Extractor module is responsible for:  The execution of our NLP application (based on GATE) in a Hadoop MapReduce environment for text annotation and keywords / keyphrases extraction and storage in the HDFS.  The keywords and keyphrases relevance estimation, obtained by computing the TF‐IDF function for each extracted keywords / keyphrases, performed to assess their relevance with respect to their whole corpus.  The DB Storage module which finally stores designed keywords and keyphrases into an external SQL database, using the Apache Sqoop open source tool (specifically designed for data transfer between Hadoop HDFS and structured datastores).
  • 8. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 8 Keywords & Keyphrases Extractor Module, .xgapp DISIT Lab (DINFO UNIFI), 31° August 2015  The Map function that associates key/value record pairs where the key is the URL of the single web page, and the value is the corresponding web domain (grouping web pages of the same domain).  The Reduce function, in turn, fulfills the following operations:  Setup, launch and execution of a multi‐corpora GATE application for keywords / keyphrases extraction:  Estimation of extracted keywords / keyphrases relevance at web domain level, by implementing the TF‐IDF function: ∙ , log . Doc Doc Corpus 1 Corpus N . . .   Splitter,  Tokenizer … GATE Application TreeTagger (POS‐Tagging) . . .   . . .   JAPE (+Syntactical  Annotation Rules) . . .   . . .   Keywords List 1 Keywords List N . . .  
  • 9. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 9 Validation – Test Dataset and Configurations  Test Dataset: 10000 web pages and documents ingested by the Web Crawler from a seed list of commercial companies, services and research institutes web domains.  Test Configurations: Different test configurations for the Hadoop Cluster: from 2 to 5 active nodes. Nodes Running Daemons Master Namenode, JobTracker, TaskTracker Slave #1 SecondaryNamenode Datanode, TaskTracker Slave #2 Datanode, TaskTracker Slave #3 Datanode, TaskTracker Slave #4 Datanode, TaskTracker Test Config. 1: 5‐Nodes Cluster Nodes Running Daemons Master Namenode, JobTracker, TaskTracker Slave #1 SecondaryNamenode Datanode, TaskTracker Slave #2 Datanode, TaskTracker Slave #3 Datanode, TaskTracker Nodes Running Daemons Master Namenode, JobTracker, TaskTracker Slave #1 SecondaryNamenode Datanode, TaskTracker Slave #2 Datanode, TaskTracker Nodes Running Daemons Master Namenode, JobTracker, TaskTracker Slave #1 SecondaryNamenode Datanode, TaskTracker Test Config. 2: 4‐Nodes Cluster Test Config. 3: 3‐Nodes Cluster Test Config. 4: 2‐Nodes Cluster
  • 10. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 10 Results and Discussion  Results: For each cluster configuration, processing times for extracting a total of about 2.5 Million keywords and keyphrases have been assessed and compared.  The scaling capabilities observed confirm the nearly linear growth trend of the Hadoop architecture.  Significant performance improvements can be achieved only for a larger number of nodes in the cluster.  Single cpu/thread execution in 60 hours DISIT Lab (DINFO UNIFI), 31° August 2015 Configuration Processing  Time  (hh:mm:ss) Speed ‐ Up HDFS ‐ single node 07:17:01 ‐ HDFS ‐ 2 nodes 05:21:53 1.36 HDFS ‐ 3 nodes 04:08:00 1.76 HDFS ‐ 4 nodes 03:39:42 1,99 HDFS ‐ 5 nodes 03:20:09 2.18 0 1 2 3 4 5 6 2 3 4 5 SpeedUp Nodes #
  • 11. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 11 A Current Application  Integration in the Twitter Vigilance tool (http://www.disit.org/tv/), developed at DISIT Lab, University of Florence.  Twitter Vigilance is a multi user tool for making analysis of "Twitter channels". Each channel can be tuned to monitor one or more Search Queries on Twitter with a sophisticated and expressive syntax.  Some public channels are publically accessible as examples. DISIT Lab (DINFO UNIFI), 31° August 2015
  • 12. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 12 A Current Application  Currently more than 12 Millions Twitters have been processed, at a rate of approximately 200000 New TW per day, so a distributed architecture is required.  Main Channel focus:  City Service Assessment: smart city application  Weather conditions and measures  Weather communication channel assessment and qualification  Events appreciation assessment  Product market appreciation assessment  Drugs vs people appreciation assessment DISIT Lab (DINFO UNIFI), 31° August 2015
  • 13. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 13 Sentiment and Measure Analysis of Social Media (Twitter) DISIT Lab (DINFO UNIFI), 31° August 2015 Twitter  API Twitter Ingested  Tweets DB External Knowledge Base (e.g. SentiWordnet) Analysis Sentiment‐Score Keywords DB Context & Weights Channel  or selec. Affective  Classifier  Twitter affective  DB Twitter Vigilance Backend Indexer Twitter Vigilance User Interface Tweets http://tvsolr.disit.org/ http://www.disit.org/tv/ Measure  Keywords DB Keyword &  Keyphrase  Extractor Dependency Parsing, RST … Measure  Classifier  Twitter Measure  DB
  • 14. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it Conclusions • A NLP application to enforce the NLP capabilities  into Hadoop has been designed and built • Advantages with respect to the state of the art are:  – usage of GATE, instead of creating a new one – Different kind of text sources and not only documents – Larger flexibility • The resulting solution  – has been assessed in terms of performance – Is currently in further development and use for Twitter  Vigilance Affective and Measuring analyses in multiple  projects with different institutions: CNR IBINET, LAMMA,  NEUROFARBA DISIT Lab (DINFO UNIFI), 31° August 2015 14
  • 15. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 15 Sentiment and Measure Analysis of Social Media (Twitter)  Extracted keywords could be monitored by different POS (nouns, adjectives and verbs) or by special characters for user mentions (@) and hashtags (#), in order to asses temporal trends and perform statistical analysis.  By exploiting external knowledge bases and lexicons specifically designed for Sentiment Analysis (such as SentiWordnet), sentiment scores can be assigned to each extracted keyword and keyphrases (as compositions of keywords) in order to estimate the general sentiment polarity of collected tweets.  An advanced Sentiment Analysis could be eventually performed by employing more sophisticated techniques, such as dependency parsing, Rhetorical Structure Theory (RST) and Discourse Following, in order to perform syntactical sentence parsing and extract relations and dependencies among keywords and syntagmas.  These resources can be then used to weight sentiment coefficients previously assigned to extracted keywords, in order to improve the estimation of sentiment. DISIT Lab (DINFO UNIFI), 31° August 2015
  • 16. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 16 References DISIT Lab (DINFO UNIFI), 31° August 2015 [Chung and Moldovan, 1995] M. Chung and D. I. Moldovan, “Parallel Natural Language Processing on a Semantic Network Array Processor”, IEEE Transactions on Knowledge and Data Engineering, Vol. 7(3), pp. 391‐404, 1995. [Exner and Nugues, 2014] Exner, P. and Nugues, P., “KOSHIK ‐ A Large‐scale Distributed Computing Framework for NLP“, in Proc. of the International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014), pp. 463‐470, 2014. [Hamon et al., 2007] T. Hamon, J. Deriviere and Nazarenko, “Ogmios: a scalable NLP platform for annotating large web document collections”, in Proc. of Corpus Linguistics, Birmingham, United Kingdom, 2007. [Holmes, 2012] Holmes, A., “Hadoop in Practice”, Ed. Manning (2012).
  • 17. DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it 17 Thanks for your  attention ! DISIT Lab (DINFO UNIFI), 31° August 2015