Abstract—The recent growth of the World Wide Web and of the number of resources available online represents a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely represented in unstructured natural language formats. To automatically ingest and process such huge amounts of data, single-machine, non-distributed architectures are proving inefficient for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and their computational power requirements have increased significantly, calling for solutions such as distributed frameworks and parallel computing programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widespread open source GATE NLP platform in a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated on a real corpus of web pages and documents.
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents
1. A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents
Paolo Nesi, Gianni Pantaleo and Gianmarco Sanesi
DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO), University of Florence
Via S. Marta 3, 50139, Firenze, Italy
Tel: +39-055-2758511, fax: +39-055-2758516
http://www.disit.dinfo.unifi.it alias http://www.disit.org
paolo.nesi@unifi.it, gianni.pantaleo@unifi.it
21st International Conference on Distributed Multimedia Systems – DMS2015, 31 August 2015
2. Problem Focus & Main Issues
The continuously growing WWW represents a massive source of knowledge, embedded for the most part in the textual content of web pages, documents, social media, etc.
Problem: online web resources, pages and documents are mainly represented as unstructured natural language text, whereas structured data are required for the extraction and management of higher-level knowledge.
The automatic retrieval and production of structured information therefore requires ingesting, processing and analyzing huge amounts of natural language data (a Big Data problem, addressed with NLP, statistical, machine learning and AI techniques).
[Slide figures: 100 Petabytes per day processed on 3 million servers in 2014; 300 Petabytes stored in the first half of 2014.]
3. Application Areas
Many application fields and areas:
• Keyword and keyphrase extraction for content/topic extraction and summarization.
• Comprehension of natural language text documents.
• Production of machine-readable corpora (integration with semantic resources and ontologies, etc.).
• Indexing for search engine functionalities: content-based, multi-faceted search queries.
• Design of expert systems (e.g., recommendation tools, DSS, question-answering systems, personal assistants, etc.).
• Social media mining and user behavior analysis for target marketing and customized services.
Applications and interests in: e-commerce, target marketing and advertising, business web services, Smart City platforms, e-Healthcare, scientific research, ICT technologies, etc.
4. Distributed Architecture & Parallel Computing Frameworks
Currently, single-machine, non-distributed architectures are proving inefficient for tasks like Big Data mining and intensive text processing and NLP analysis.
Distributed architectures and parallel computing techniques have been increasingly adopted in the literature for automatic keyword extraction on Big Data sets:
• First attempts at employing parallel computing frameworks for NLP tasks date back to the mid-1990s: Chung and Moldovan [Chung and Moldovan, 1995] proposed a parallel memory-based parser called PARALLEL, implemented on a parallel computer, the Semantic Network Array Processor (SNAP).
• Ogmios [Hamon et al., 2007] is a platform for annotation enrichment and NLP analysis of specialized domain documents within a distributed corpus.
• Exner and Hugues recently presented Koshik [Exner and Hugues, 2014], a multi-language NLP framework for large-scale processing and querying of unstructured natural language documents distributed over a Hadoop-based cluster.
5. The Open Source Apache Hadoop Framework
The proposed system has been designed to run on the open source Apache Hadoop framework.
A significant advantage of the Hadoop distributed architecture is the capability to scale efficiently and easily by adding inexpensive commodity hardware to the cluster.
[Figures: the Hadoop File System (HDFS) and the Hadoop MapReduce paradigm; source: [Holmes, 2012].]
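To make the MapReduce paradigm concrete, below is the classic word-count example written against the Hadoop Java API; this is a minimal illustrative sketch of the paradigm, not code from the proposed system. The map phase emits one (word, 1) pair per token, and the reduce phase sums the counts for each distinct word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: tokenize each input line and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }
  // Reduce: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Scaling out then amounts to adding Datanodes/TaskTrackers: the framework re-partitions the map and reduce tasks over the new hardware with no change to the job code.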
6. General Architecture
[Figure: General architecture. A Web Crawler ingests web pages & documents into a Crawler DB and the Hadoop HDFS; the Keywords & Keyphrases Extractor runs a GATE .xgapp application (with 3rd-party plugins & resources) as MapReduce jobs across the Namenode (master) and Datanodes (slaves #1 to #4), each with a "Get Domain" map phase and an "Execute GATE, TF-IDF" reduce phase; a Job Partitioner distributes the jobs, and results are stored in a Keyword DB through DB Storage (Sqoop).]
7. General Architecture
The system architecture is composed of 3 main modules:
• The Web Crawler module, based on the open source Apache Nutch tool, which retrieves and parses the textual content of web pages.
• The Keywords/Keyphrases Extractor module, which is responsible for:
– the execution of our NLP application (based on GATE) in a Hadoop MapReduce environment, for text annotation and keyword/keyphrase extraction and storage in the HDFS;
– the keyword and keyphrase relevance estimation, obtained by computing the TF-IDF function for each extracted keyword/keyphrase to assess its relevance with respect to the whole corpus.
• The DB Storage module, which finally stores the extracted keywords and keyphrases into an external SQL database using the open source Apache Sqoop tool, specifically designed for data transfer between Hadoop HDFS and structured datastores (an example invocation is sketched below).
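To illustrate the DB Storage step, here is a hedged sketch of a Sqoop export invocation; the JDBC URL, credentials, table name and HDFS path are hypothetical placeholders, not the project's actual configuration:

# Hypothetical example: export keyword/TF-IDF records from HDFS to a SQL DB.
sqoop export \
  --connect jdbc:mysql://db.example.org/keywords_db \
  --username extractor -P \
  --table keywords \
  --export-dir /user/hadoop/keywords_output \
  --input-fields-terminated-by '\t'

Here --export-dir points at the HDFS directory written by the reduce phase, and -P prompts for the database password.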
8. Keywords & Keyphrases Extractor Module (.xgapp)
The Map function emits key/value record pairs in which the key is the web domain and the value is the URL of the single web page, so that web pages of the same domain are grouped into the same reduce call (MapReduce grouping is performed on the key).
The Reduce function, in turn, fulfills the following operations:
• setup, launch and execution of a multi-corpora GATE application for keyword/keyphrase extraction;
• estimation of the relevance of the extracted keywords/keyphrases at web domain level, by computing the TF-IDF function:

tf-idf(t, d, D) = tf(t, d) · log( N / |{ d ∈ D : t ∈ d }| )

where tf(t, d) is the frequency of term t in document d, N is the total number of documents in the corpus D, and |{d ∈ D : t ∈ d}| is the number of documents in which t appears.
[Figure: GATE application pipeline. Documents from Corpus 1 … Corpus N pass through a Splitter and Tokenizer, TreeTagger (POS tagging), and JAPE (+ syntactical annotation rules), producing Keywords List 1 … Keywords List N.]
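The following is a minimal sketch of the map and reduce steps just described, assuming the crawler output reaches the job as (URL, page text) records (e.g. via KeyValueTextInputFormat). In the real system the GATE .xgapp pipeline supplies the keyword/keyphrase candidates; plain word tokens stand in for them here, and all class names are illustrative:

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: re-key each (URL, page text) record by its web domain, so that all
// pages of one domain reach the same reduce call.
class DomainMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text url, Text pageText, Context context)
      throws IOException, InterruptedException {
    String domain = URI.create(url.toString().trim()).getHost();
    if (domain != null) context.write(new Text(domain), pageText);
  }
}

// Reduce: treat the domain's pages as one corpus and score terms with TF-IDF.
class TfIdfReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text domain, Iterable<Text> pages, Context context)
      throws IOException, InterruptedException {
    List<Map<String, Integer>> termFreqs = new ArrayList<>();
    Map<String, Integer> docFreq = new HashMap<>();
    for (Text page : pages) {
      Map<String, Integer> tf = new HashMap<>();
      for (String t : page.toString().toLowerCase().split("\\W+"))
        if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
      for (String t : tf.keySet()) docFreq.merge(t, 1, Integer::sum);
      termFreqs.add(tf);
    }
    int n = termFreqs.size();
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)), as in the formula above.
    for (Map<String, Integer> tf : termFreqs)
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        double score = e.getValue() * Math.log((double) n / docFreq.get(e.getKey()));
        context.write(domain, new Text(e.getKey() + "\t" + score));
      }
  }
}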
9. Validation – Test Dataset and Configurations
Test dataset: 10,000 web pages and documents ingested by the Web Crawler from a seed list of web domains of commercial companies, services and research institutes.
Test configurations: different Hadoop cluster configurations, from 2 to 5 active nodes (see the tables and the sample configuration sketch below).
Test Config. 1: 5-Nodes Cluster
Nodes     Running Daemons
Master    Namenode, JobTracker, TaskTracker
Slave #1  SecondaryNamenode, Datanode, TaskTracker
Slave #2  Datanode, TaskTracker
Slave #3  Datanode, TaskTracker
Slave #4  Datanode, TaskTracker

Test Config. 2: 4-Nodes Cluster
Nodes     Running Daemons
Master    Namenode, JobTracker, TaskTracker
Slave #1  SecondaryNamenode, Datanode, TaskTracker
Slave #2  Datanode, TaskTracker
Slave #3  Datanode, TaskTracker

Test Config. 3: 3-Nodes Cluster
Nodes     Running Daemons
Master    Namenode, JobTracker, TaskTracker
Slave #1  SecondaryNamenode, Datanode, TaskTracker
Slave #2  Datanode, TaskTracker

Test Config. 4: 2-Nodes Cluster
Nodes     Running Daemons
Master    Namenode, JobTracker, TaskTracker
Slave #1  SecondaryNamenode, Datanode, TaskTracker
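For reference, in Hadoop 1.x (the JobTracker/TaskTracker generation used in these tests) such a layout is typically declared in the conf/masters and conf/slaves files; the hostnames below are hypothetical:

# conf/masters: host where start-dfs.sh launches the SecondaryNamenode
slave1.example.org

# conf/slaves: hosts where Datanode and TaskTracker daemons are started
slave1.example.org
slave2.example.org
slave3.example.org
slave4.example.org

Dropping hosts from conf/slaves (and restarting the daemons) yields the smaller 4-, 3- and 2-node configurations.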
10. Results and Discussion
Results: for each cluster configuration, the processing time needed to extract a total of about 2.5 million keywords and keyphrases has been measured and compared.
• The observed scaling behavior confirms the nearly linear speed-up trend of the Hadoop architecture.
• Significantly larger performance improvements would require a larger number of nodes in the cluster.
• For reference, a single-CPU, single-thread execution takes about 60 hours.
Configuration        Processing Time (hh:mm:ss)   Speed-Up
HDFS – single node   07:17:01                     –
HDFS – 2 nodes       05:21:53                     1.36
HDFS – 3 nodes       04:08:00                     1.76
HDFS – 4 nodes       03:39:42                     1.99
HDFS – 5 nodes       03:20:09                     2.18

[Figure: speed-up vs. number of nodes (2 to 5).]
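As a quick sanity check on the table, speed-up is the single-node time divided by the N-node time:

Speed-Up(N) = T(1 node) / T(N nodes)
Speed-Up(5) = 26221 s / 12009 s ≈ 2.18    (07:17:01 = 26221 s; 03:20:09 = 12009 s)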
11. A Current Application
Integration in the Twitter Vigilance tool (http://www.disit.org/tv/), developed at DISIT Lab, University of Florence.
• Twitter Vigilance is a multi-user tool for analyzing "Twitter channels". Each channel can be tuned to monitor one or more search queries on Twitter, with a sophisticated and expressive syntax.
• Some public channels are publicly accessible as examples.
12. A Current Application
Currently more than 12 million tweets have been processed, at a rate of approximately 200,000 new tweets per day, so a distributed architecture is required.
Main channel focuses:
• City service assessment: smart city application
• Weather conditions and measures
• Weather communication channel assessment and qualification
• Event appreciation assessment
• Product market appreciation assessment
• Drugs vs. people appreciation assessment
13. Sentiment and Measure Analysis of Social Media (Twitter)
[Figure: Twitter Vigilance processing architecture. Tweets are ingested through the Twitter API into an Ingested Tweets DB; the Keyword & Keyphrase Extractor populates a Sentiment-Score Keywords DB, backed by an external knowledge base (e.g. SentiWordNet) with context & weights, and a Measure Keywords DB via dependency parsing, RST, …; per channel or selection, the Affective Classifier and the Measure Classifier feed the Twitter Affective DB and the Twitter Measure DB, which the Twitter Vigilance back-end indexer exposes through the user interface at http://tvsolr.disit.org/ and http://www.disit.org/tv/.]
14. Conclusions
• An NLP application bringing NLP capabilities into Hadoop has been designed and built.
• Advantages with respect to the state of the art are:
– the use of GATE, instead of creating a new NLP engine;
– support for different kinds of text sources, not only documents;
– greater flexibility.
• The resulting solution:
– has been assessed in terms of performance;
– is currently under further development and in use for the Twitter Vigilance affective and measure analyses, in multiple projects with different institutions: CNR IBIMET, LAMMA, NEUROFARBA.
15. Sentiment and Measure Analysis of Social Media (Twitter)
• Extracted keywords can be monitored by POS (nouns, adjectives and verbs) or by the special characters for user mentions (@) and hashtags (#), in order to assess temporal trends and perform statistical analyses.
• By exploiting external knowledge bases and lexicons specifically designed for sentiment analysis (such as SentiWordNet), sentiment scores can be assigned to each extracted keyword and keyphrase (as compositions of keywords) in order to estimate the overall sentiment polarity of the collected tweets (a sketch of this scoring step follows below).
• More advanced sentiment analysis can eventually be performed by employing more sophisticated techniques, such as dependency parsing, Rhetorical Structure Theory (RST) and discourse following, which provide syntactical sentence parsing and extract relations and dependencies among keywords and syntagmas. These resources can then be used to weight the sentiment coefficients previously assigned to the extracted keywords, improving the sentiment estimation.
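A minimal sketch of the lexicon-based scoring step, assuming a prior-polarity lexicon has already been loaded into a map from lemma to a score in [-1, 1]; the class and the lexicon-loading step are hypothetical illustrations (SentiWordNet itself assigns separate positivity/negativity scores per synset, which would first be collapsed into a single prior per lemma):

import java.util.List;
import java.util.Map;

public class LexiconSentimentScorer {
  // Prior polarity per lemma in [-1, 1], e.g. distilled from SentiWordNet by
  // averaging (positivity - negativity) over the synsets of each lemma.
  private final Map<String, Double> priorPolarity;

  public LexiconSentimentScorer(Map<String, Double> priorPolarity) {
    this.priorPolarity = priorPolarity;
  }

  // Average the priors of the keywords extracted from one tweet; keywords
  // missing from the lexicon are treated as neutral (0.0).
  public double score(List<String> keywords) {
    if (keywords.isEmpty()) return 0.0;
    double sum = 0.0;
    for (String k : keywords) {
      sum += priorPolarity.getOrDefault(k.toLowerCase(), 0.0);
    }
    return sum / keywords.size();
  }
}

A positive average marks a tweet's keywords as overall positive, a negative one as overall negative; the dependency/RST weights mentioned above would then rescale the individual priors before averaging.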
16. References
[Chung and Moldovan, 1995] M. Chung and D. I. Moldovan, "Parallel Natural Language Processing on a Semantic Network Array Processor", IEEE Transactions on Knowledge and Data Engineering, Vol. 7(3), pp. 391–404, 1995.
[Exner and Nugues, 2014] P. Exner and P. Nugues, "KOSHIK - A Large-scale Distributed Computing Framework for NLP", in Proc. of the International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014), pp. 463–470, 2014.
[Hamon et al., 2007] T. Hamon, J. Derivière and A. Nazarenko, "Ogmios: a scalable NLP platform for annotating large web document collections", in Proc. of Corpus Linguistics, Birmingham, United Kingdom, 2007.
[Holmes, 2012] A. Holmes, "Hadoop in Practice", Manning, 2012.
17. Thanks for your attention!