The document discusses open source enterprise content management and how it can be enhanced by integrating semantic web technologies. It describes how semantic technologies can help extract meaning from unstructured content, connect information to form knowledge, reason about the knowledge, and present it in an actionable way. The document also provides an overview of Nuxeo's work on semantic ECM through various research projects and their semantic engine which extracts metadata from content.
ECM Meets the Semantic Web - Nuxeo World 2011
8. A few figures
• 50% more data / content / information
produced every year
• 1.8 zettabytes of data produced in 2011
(= 1.8 billion terabytes)
• Employees are drowning in a sea of email,
status messages, etc., and spend on average
more than 6 hours / week unsuccessfully
searching for or recreating lost documents
Thursday, October 20, 2011
10. A Brief History of the Web
• Web 1.0 (1990-now): web of sites and pages,
aka the World Wide Web
• Web 2.0 (2000-now): web of people and of
participation, aka the Social Web (Blogs, RSS,
tags, Facebook, Wikipedia, etc.)
• Web 3.0 (2010-now): web of data, of meaning
and connected knowledge, aka the Semantic
Web
12. “To a computer, then, the web is a flat,
boring world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
13. “This is a pity, as in fact documents on the
web describe real objects and imaginary
concepts, and give particular relationships
between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
14. “Adding semantics to the web involves two things:
allowing documents which have information in
machine-readable forms, and allowing links to be
created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
15. “The Semantic Web is not a separate Web but an
extension of the current one, in which information
is given well-defined meaning, better enabling
computers and people to work in cooperation.”
Tim Berners-Lee, James Hendler and Ora Lassila, “The Semantic Web”, Scientific American, May 2001
17. 4 stages
• Extract meaning from raw data / content
• Connect information to form knowledge
• Reason about this knowledge
• Present this knowledge in actionable form
18. Extracting
• Leverage metadata embedded in or associated with
documents (when they exist)
• Or use machine learning, NLP (Natural Language
Processing) and image processing algorithms to
extract meaning from text / images
• Examples include: named entities extraction,
automatic categorization / tagging, sentiment
analysis, etc.
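
As a rough illustration of the named entity extraction mentioned above, here is a minimal Python sketch. It assumes a tiny hand-written gazetteer (the `GAZETTEER` table and the sample sentence are hypothetical) and uses plain lookup where a real engine would use trained statistical models.

```python
import re

# Hypothetical gazetteer: surface form -> entity type. A real system would
# rely on trained statistical models rather than a fixed lookup table.
GAZETTEER = {
    "Tim Berners-Lee": "Person",
    "Paris": "Location",
    "Nuxeo": "Organization",
}


def extract_entities(text):
    """Return (surface form, type, start offset) for each gazetteer match."""
    found = []
    for name, entity_type in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((name, entity_type, match.start()))
    # Sort by position so the entities come out in reading order.
    return sorted(found, key=lambda item: item[2])


sample = "Nuxeo is based in Paris. Tim Berners-Lee invented the web."
entities = extract_entities(sample)
for surface, entity_type, offset in entities:
    print(offset, entity_type, surface)
```

The offsets matter for later stages: they let a user interface highlight each occurrence in the original text.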
19. Interlude:
Linked Open Data
20. [Figure: four panels labeled 2007, 2008, 2009 and 2010]
22. Linking
• Many Linked Open Data repositories have been
made available over the last 10 years
• RDF and graph database systems are now available
to manage this huge mass of information (billions of
triples)
• Match information extracted from content with
these public (or internal) data/knowledge bases
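
The matching step can be pictured with a toy triple store. The sketch below is only an assumption about how such a lookup might work: the triples are hand-written in the style of DBpedia identifiers, and `link_mention` is a hypothetical helper, not part of any RDF library.

```python
# Toy triple store: (subject, predicate, object). The identifiers imitate
# DBpedia naming, but the data is hand-written for illustration only.
RDFS_LABEL = "rdfs:label"
RDF_TYPE = "rdf:type"

TRIPLES = [
    ("dbpedia:Paris", RDFS_LABEL, "Paris"),
    ("dbpedia:Paris", RDF_TYPE, "dbo:Place"),
    ("dbpedia:Paris_Hilton", RDFS_LABEL, "Paris Hilton"),
    ("dbpedia:Paris_Hilton", RDF_TYPE, "dbo:Person"),
]


def link_mention(mention, expected_type, triples=TRIPLES):
    """Return subjects whose label equals the mention and whose type matches."""
    labelled = {s for s, p, o in triples if p == RDFS_LABEL and o == mention}
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == expected_type}
    return sorted(labelled & typed)


candidates = link_mention("Paris", "dbo:Place")
print(candidates)
```

Filtering on the expected type is one simple way to disambiguate mentions that share a label with several entities.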
23. Reasoning
• When you are working on reliable metadata (ex:
RDFa embedded in web pages), you can use rule /
inference engines to infer actionable knowledge
from your content (ex: shopping recommendation
engine)
• Rules can also be used to clean up / flag errors
when working with unreliable (e.g. automatically
extracted) information
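
Both uses of rules can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not how a production rule engine works: the predicate names (`locatedIn`, `type`), the facts and the two helper functions are all hypothetical.

```python
def infer_transitive(facts, predicate):
    """Forward-chain one rule: (a p b) and (b p c) implies (a p c)."""
    facts = set(facts)
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for a, p1, b in list(facts):
            for b2, p2, c in list(facts):
                derived = (a, predicate, c)
                if p1 == p2 == predicate and b == b2 and derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


def flag_conflicts(facts, disjoint=("Person", "Place")):
    """Flag subjects typed with two classes assumed to be disjoint."""
    first = {s for s, p, o in facts if p == "type" and o == disjoint[0]}
    second = {s for s, p, o in facts if p == "type" and o == disjoint[1]}
    return first & second


known = {
    ("Louvre", "locatedIn", "Paris"),
    ("Paris", "locatedIn", "France"),
    ("Paris", "type", "Place"),
    ("Paris", "type", "Person"),  # e.g. an extraction error
}
inferred = infer_transitive(known, "locatedIn")
suspicious = flag_conflicts(known)
print(("Louvre", "locatedIn", "France") in inferred, suspicious)
```

The first function derives new knowledge from reliable facts; the second only flags a suspicious subject for review and leaves the decision to a person.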
24. Presenting
• Allow the users of your system to interact with the
knowledge thus extracted or produced, in a way
that allows them to do their jobs better
• A smart presentation system solves the information
overload issue by contextualizing the information,
i.e. presenting only information relevant to what the
user is currently doing
25. R&D Projects
Involving Nuxeo
26. IKS project
• European R&D project under the FP7, with 13
partners (6 SMEs) and an 8.5M EUR budget
• Goal: create a semantic software “stack” that
will be used by CMS vendors to add semantic
features to their products
• Started in Jan. 2009, will last until Dec. 2012
• First tangible result: Apache Stanbol
(more about this later)
27. SAMAR project
• French collaborative R&D project with 10
partners, and a 4.5M EUR budget
• Goal: create a platform for managing
multimedia content in Arabic, for news agencies
such as AFP
• Will include: automated translation, named
entities extraction, content classification
• First results: integration between Nuxeo and
Temis (more later)
28. State of the Art
Semantic ECM at Nuxeo
29. The Semantic Engine
• From unstructured content to Knowledge
• Language guessing
• Topic classification (Business, Sports, Media, ...)
• Named Entities extraction and linking
• Relationships and properties extraction
37. = Semantic Engines (Apache OpenNLP)
+ Fast Linked Data local index (Apache Solr)
+ Semantic Rule Engine (Apache Jena)
38. Apache Stanbol
[Diagram labels: Nuxeo DM addon; Engine 1, Engine 2, Engine 3; LDAP; DBpedia; Freebase; Geonames; Local IT infrastructure (LAN); numbered steps 1 to 3]
39. How to build engines?
40. Training statistical models for NER with
Wikipedia and DBpedia
• Extract sentences with link positions in Wikipedia articles
• DBpedia to find the type of the target entity (Person,
Location, Organization)
• Apache Pig scripts to compute the join + format the result as
training files for OpenNLP
• Apache OpenNLP to build and evaluate the models
• Apache Hadoop for distributed processing
• Apache Whirr for deployment and management on Amazon
EC2 cluster
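
To make the target of this pipeline concrete, the sketch below turns one wiki-linked sentence into a line in the OpenNLP name finder training format (`<START:type> ... <END>` around whitespace-separated tokens). It is a single-machine Python stand-in for the join that the slide does with Pig scripts; the `[[Target|anchor]]` input markup, the `ENTITY_TYPES` table and the sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical lookup standing in for the join against DBpedia types.
ENTITY_TYPES = {
    "Tim_Berners-Lee": "person",
    "World_Wide_Web_Consortium": "organization",
}

# Matches links written as [[Target page|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")


def to_opennlp(sentence, types=ENTITY_TYPES):
    """Rewrite wiki links as OpenNLP name finder annotations."""

    def replace(match):
        target, anchor = match.group(1), match.group(2)
        entity_type = types.get(target.replace(" ", "_"))
        if entity_type is None:
            return anchor  # unknown type: keep the text, drop the link
        return "<START:%s> %s <END>" % (entity_type, anchor)

    return LINK.sub(replace, sentence)


line = to_opennlp(
    "[[Tim Berners-Lee|Tim Berners-Lee]] founded the "
    "[[World Wide Web Consortium|W3C]] in [[1994|1994]] ."
)
print(line)
```

Links whose target has no known type are kept as plain text, so the sentence still contributes negative examples to the training file.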
45. Training statistical models for topic
classification from Wikipedia and DBpedia
• Filter category tree from DBpedia SKOS entries (~500k)
• Pig scripts to compute the joins with article abstracts for all
the articles categorized in Wikipedia
• Export as 2.8GB TSV file to be indexed in Apache Solr
• Use Solr MoreLikeThisHandler to find the top 3 most related
Wikipedia categories for any kind of text
• Apache Whirr & Hadoop for deployment and management on
Amazon EC2 cluster
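
The idea of ranking categories by textual similarity can be illustrated without a Solr server. The sketch below swaps in a plain bag-of-words cosine similarity as a stand-in for Solr's MoreLikeThis scoring, which works differently; the category names and abstracts in `CATEGORY_ABSTRACTS` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical category "abstracts"; the real index holds Wikipedia text.
CATEGORY_ABSTRACTS = {
    "Sports": "football match team players score goal league",
    "Business": "company market shares revenue profit investors",
    "Computing": "software open source server database web",
    "Music": "band album song concert guitar",
}


def bag_of_words(text):
    """Count lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_categories(text, k=3):
    """Return the k category names most similar to the text."""
    query = bag_of_words(text)
    scored = [
        (cosine(query, bag_of_words(abstract)), name)
        for name, abstract in CATEGORY_ABSTRACTS.items()
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


best = top_categories(
    "An open source software company released a new database server"
)
print(best)
```

A search index makes the same kind of query practical over hundreds of thousands of categories, which a linear scan like this one would not.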
46. Wrap Up on Recent Work
• Full offline mode: Stanbol EntityHub
• Multi-lingual Indexes
• New UI for occurrences reviews
• Temis Luxid Annotation Factory integration
47. What’s next?
• Stanbol and Temis connection in Admin Center
• Embedded Stanbol mode for easy deployment
• More OpenNLP models for more languages
• Finalize topic classification - handle hierarchy
• Tight integration with Nuxeo DM search features
48. Thank you for your attention!