Building a global listening platform with Solr presents both technical and global challenges. The speaker demonstrates a platform built in three months using Solr and Basis Technology products, covering content acquisition, content analysis (including language identification and entity extraction), search, and visualization. Key aspects include a distributed processing pipeline for analysis, language-specific indexing, and dashboard interfaces that go beyond basic search results.
Beyond document retrieval using semantic annotations Roi Blanco
Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and use additional information to display and interact with search results are becoming increasingly popular and have been adopted by all major search engines. In addition, recent advances in text extraction allow semantic information to be inferred over particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
If your user can't find it, they can't buy it, right? In this talk, Apache Lucene and Solr committer Grant Ingersoll will discuss architecture, techniques, and tips for successfully deploying search tools like Lucene, Solr, and LucidWorks Enterprise in eCommerce environments.
Kitenga's ZettaVox and ZettaSearch products support the Solr and Lucene ecosystems both at the ingestion point and for the search user. In this talk, I will show how ZettaVox, our professional content mining platform on Hadoop, can be used to index content and rich metadata into a LucidWorks Enterprise installation. Being built on Hadoop, ZettaVox scales up by scaling out. I will then create an end-user search and analytics experience using our ZettaSearch solution that leverages the faceted metadata to enhance information discovery and analysis. All in about 20 minutes.
Trulia is a real estate search company that helps customers find homes for sale or to rent and provides them with information to help them make better decisions in the process. It is also a hub for real estate professionals to market their listings, view real estate data and promote their services.
Maybe you’ve heard pundits say that in the next year, humans will create more data than in all of human history. The problem with those predictions, Stephen O’Grady of Redmonk said in his keynote to Day 2 of Lucene Revolution, is that they’re true.
Extending Solr: Building a Cloud-like Knowledge Discovery Platform Lucidworks (Archived)
For CareerBuilder, a 1% deviance in search relevancy can mean millions of missed job opportunities for our users. When CareerBuilder moved to Solr from an expensive, proprietary search vendor, our top priorities were maintaining the quality of our search results and drastically improving our agility.
Facets and Pivoting for Flexible and Usable Linked Data Exploration Roberto García
The success of Open Data initiatives has increased the amount of data available on the Web. Unfortunately, most of this data is only available in raw tabular form, which makes analysis and reuse quite difficult for non-experts. Linked Data principles allow for a more sophisticated approach by making both the structure and the semantics of the data explicit. However, from the end-user viewpoint, datasets remain monolithic files that are opaque and difficult to explore without tedious semantic queries. Our objective is to help users grasp what kinds of entities are in a dataset, how they are interrelated, what their main properties and values are, and so on. Rhizomer is a tool for data publishing whose interface provides a set of components borrowed from Information Architecture (IA) that build awareness of the dataset at hand. It automatically generates navigation menus and facets based on the kinds of things in the dataset and how they are described through metadata properties and values. Moreover, motivated by recent tests with end users, it also makes it possible to pivot among the faceted views created for each class of resources in the dataset.
10 Things I Like in SharePoint 2013 Search SPC Adriatics
Speaker: Agnes Molnar;
Based on my SharePoint and FAST Search experience, I'll demonstrate my "Research Path" through SharePoint 2013 Search: what's new, what improvements we can find there, and how to use our existing Search knowledge and experience in SharePoint 2013 Search.
You will learn:
Config options in SharePoint 2013 Search – Central Admin vs. PowerShell
Crawled and Managed Properties across Content Sources
Ranking and Relevancy
The demands, techniques, paradigms, and languages of modern software development have been changing at an ever increasing pace. This has left many software engineers bewildered, confused, and feeling like they're never going to be able to keep up, let alone be good enough! In the Java/JVM space there is a lot of tooling, and there are many libraries and APIs available to try to bridge the gaps of TDD, BDD, CI, build, static code analysis, and so forth, but many, many challenges still remain, such as developing with RxJava. For example:
“Rx.Observable.prototype.flatMapLatest(selector, [thisArg])
Projects each element of an observable sequence into a new sequence of observable sequences by incorporating the element’s index and then transforms an observable sequence of observable sequences into an observable sequence producing values only from the most recent observable sequence.”
WAT? Err get me out of here…..
The Diabolical Developer will take you through a journey of what problems are deemed to be ‘solved’, what problems are currently being worked on and the gaps that are likely to remain (which he fervently hopes you’re going to solve for him). He’ll share his dystopian view of the future where the modern challenges are tackled by sharp tools and clever thinking.
Couchbase Connect 2014: Lucidworks CEO Will Hayes takes you on a fantastic voyage through the hope and the hype of big data and why the future is search-centric.
LucidWorks SiLK is an open source stack that combines Lucene/Solr with best in class open source data ingestion and analytics tools such as Flume, LogStash and Kibana. This webinar will explore the features of SiLK, and provide attendees with valuable information on how they can benefit from the following:
- A powerful UI to analyze time series data stored in Lucene/Solr
- Creating and sharing visualizations, dashboards and reports
- Discovery and analysis of data coming from servers, applications, devices and more
- Exploration of click, geospatial and social data in ways previously unimaginable
LucidWorks App for Splunk Enterprise is the first of its kind, specifically designed to allow companies to analyze and manage the health and availability of their Solr deployments in Splunk software. The solution integrates multi-structured data indexed by Solr directly into Splunk® Enterprise, giving system administrators the ability to look at the intersection of documents, customer records or other unstructured data sources as they relate to machine data. This enables companies to optimize their Solr applications, glean insights from search and usage patterns and spot security concerns to improve end user experiences and derive more business value from data-driven applications.
This webinar will explore the features of the App, and provide attendees with valuable information on the following key components:
Solr Monitor: Monitor the health, availability, and utilization of LucidWorks and/or Solr deployments with pre-defined data inputs, dashboards and reports
Search Analytics: Perform user behavior and click-stream analysis with pre-built search analytics reports and fields
NoSQL Lookups: Using Splunk's lookup facility, enrich your Splunk reports with data of any structure from Solr's fully indexed and searchable NoSQL datastore
Search Time Joins: Join Splunk data with human generated and other unstructured data sources stored in Solr at search time for developing data-driven applications
1. Building a Global Listening Platform with Solr
Steve Kearns
Rosette Product Manager
Basis Technology
October 7, 2010
Monday, October 04, 2010
2. Agenda
• Agenda
• Who Am I?
• What is a “Listening Platform”?
• Challenges (Technical & Global)
• Details
• Demonstration
3. About me
• Product Manager at Basis Technology
– Rosette linguistics platform
• Language ID
• Language Support for Search
• Entity Extraction
• Entity Translation/Search
•…
• Related history
– Media Monitoring at BBN Technologies
• Video, Web content extraction: STT, MT, Search
4. What is a Listening Platform?
• Content aggregator for online media
• Targets:
– Social/Brand monitoring
– Government OSINT
• Functions:
– Content acquisition
– Content analysis
– Search indexing
– Search (UI)
– Visualization
5. Content Acquisition
• What:
– News Articles
– Social Media
• How:
– Web Crawler
• Nutch!
– RSS feed reader/aggregator
• ROME/Curn
– Pay a 3rd party aggregator
• Good option, if you can afford it.
7. Content Acquisition: How
• CURN – Customizable Utilitarian RSS Notifier
• Pipeline (reconstructed from the slide's flow diagram):
– CURN reads the RSS feed list and checks each story against the CURN history
– If a story is not new, processing ends
– For new stories, an output plug-in downloads the story URL, extracts the content (BoilerPipe || Readability), and creates a Solr message that is sent to Solr
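The acquisition flow on this slide (read feeds, skip stories already seen, download, extract, send to Solr) can be sketched as a small polling loop. This is only an illustration; the actual platform used CURN itself with a custom output plug-in, and all function names here are hypothetical:

```python
# Illustrative sketch of the acquisition loop described above. The
# fetch/extract/send steps are injected so any implementation can plug in.
def poll_feeds(feed_urls, seen, fetch, extract, send_to_solr):
    """Index only stories that are not already in the `seen` history."""
    for feed_url in feed_urls:
        for story_url in fetch(feed_url):      # story links listed in the feed
            if story_url in seen:              # CURN-style history check
                continue                       # not new: stop here
            seen.add(story_url)
            text = extract(story_url)          # BoilerPipe/Readability step
            send_to_solr({"id": story_url, "text": text})  # the Solr message
```

The history check is what keeps repeated polling of the same feed from re-indexing old stories.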
10. Content Analysis: How?
• Preprocessor to Solr
– Custom
– OpenPipeline
• Solr UpdateRequestProcessors
– Chain of URPs defined in solrconfig.xml
– Add, edit, and remove fields
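In later Solr versions (3.5+), a language-identification URP can sit in such a chain; a minimal solrconfig.xml sketch (the chain name and field names are illustrative, not from the talk):

```xml
<!-- Illustrative URP chain: detect language, log, then index -->
<updateRequestProcessorChain name="listening">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,body</str>      <!-- fields to examine -->
    <str name="langid.langField">language</str> <!-- field to add -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```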
11. Content Analysis: How
• Custom distributed processing pipeline
– Complexity of components
– Number of components
– Some components require their own data storage
• Solr Indexing is the final processing step
12. Content Analysis Details
• Language Identification for:
– Indexing
– Faceting/Searching
– Entity Extraction
• Language-specific indexing for:
– Improved recall with high precision
• Entity Extraction for:
– Faceting
– Entity search
– Input to relationship extraction
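The combination above can be sketched as a preprocessing step that stamps the detected language on the document and routes its text into a per-language field, so each field can get its own analyzer. Field names like text_en are illustrative, not the platform's actual schema:

```python
# Route document text into a language-specific field so each field
# can have its own analysis chain (stemming, decompounding, ...).
SUPPORTED = {"en", "de", "sv", "ar"}

def route_fields(doc: dict, detected_lang: str) -> dict:
    """Return a Solr-style field dict for one document."""
    lang = detected_lang if detected_lang in SUPPORTED else "xx"  # unknown
    return {
        "id": doc["id"],
        "language": lang,              # enables a language facet
        f"text_{lang}": doc["text"],   # e.g. text_en, text_de, text_xx
    }
```

Storing the language in its own field is what makes the faceting/searching bullet above possible.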
14. Language-Specific Indexing
• Every language has unique challenges:
– Tokenization
• Morphological Analysis vs. N-Gram
– Stemming vs. Lemmatization
• All European and Middle Eastern languages
– Compound words
• Swedish, Danish, Norwegian, Dutch, German
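For the compounding languages listed, decompounding can be illustrated with a greedy dictionary split. The vocabulary here is a toy stand-in for a real lexicon:

```python
# Toy greedy decompounder: split a compound into known vocabulary words.
VOCAB = {"samstag", "morgen", "flug", "hafen"}

def decompound(word: str, vocab=VOCAB) -> list:
    """Return the parts if the whole word splits into vocab words, else [word]."""
    word = word.lower()
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]  # cannot fully split: keep the word intact
    return parts
```

Indexing both the compound and its parts is what lets a query for "Morgen" find "Samstagmorgen".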
16. Stemming vs. Lemmatization
• Stemming:
– Set of rules for removing characters from words
– Increased recall at the expense of precision
– Example EN rule: Remove trailing “ing” or “al”
• Lemmatization:
– Complex set of approaches for producing the
dictionary form of a word
– Increased recall without hurting precision
– Uses context to disambiguate candidates
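The contrast can be made concrete with the slide's example rule; a toy sketch, where the lemma table is a stand-in for a real dictionary-based lemmatizer:

```python
def naive_stem(word: str) -> str:
    """Rule-based stemming: strip the trailing suffixes named on the slide."""
    for suffix in ("ing", "al"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps inflected forms to dictionary forms, ideally using
# context; here a tiny lookup table stands in for that machinery.
LEMMAS = {"flew": "fly", "better": "good"}

def naive_lemma(word: str) -> str:
    return LEMMAS.get(word, word)
```

Note that without the length guard the stemming rule would mangle short words like "ring", which is exactly the precision loss the slide describes; the lemma lookup only ever returns valid dictionary forms.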
18. Stemming vs. Lemmatization
• German: “Am Samstagmorgen fliege ich zurueck nach Boston.” (“On Saturday morning I fly back to Boston.”)
• Stemming:
• Lemmatization (and decompounding!)
19. Stemming and Lemmatization Challenges
• Can I index text from many languages into
the same field?
– Yes, but it’s not always a good idea!
• Query language ID is not accurate.
– You need a custom Query Analyzer that does
stemming/lemmatization in many languages for
the same query.
• How do I query text in multiple fields?
– Dismax parser allows you to specify multiple
fields to search.
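Querying several per-language fields with the dismax parser comes down to listing them in the qf parameter (defType and qf are real Solr parameters; the field names are illustrative):

```python
from urllib.parse import urlencode

def dismax_params(query: str, lang_fields: list) -> str:
    """Build Solr dismax request parameters that search every
    language-specific field at once."""
    return urlencode({
        "q": query,
        "defType": "dismax",
        "qf": " ".join(lang_fields),  # e.g. "text_en text_de text_sv"
    })
```

Per-field boosts (e.g. text_en^2) can then favor the language the query was most likely written in.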
20. Entity Extraction
• Process of identifying people, places, organizations, dates,
times, etc. in unstructured text.
• Methods:
– List-based
– Rules-based
– Statistical-based
• Define your goals upfront!
– Some extraction methods work better for certain entity types
• Rules work well for dates, email addresses, and URLs, but not people
• Lists work well for titles, but not locations
• Statistical extractors work well for ambiguous entities like people,
locations, organizations
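The rules-based method is the easiest to sketch; a toy extractor for the entity types the slide says rules handle well (the patterns are deliberately simplified, not production-grade):

```python
import re

# Simplified patterns; real rule-based extractors are far more thorough.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "url": re.compile(r"https?://\S+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO dates only
}

def extract(text: str) -> dict:
    """Return every match for each entity type, keyed by type name."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```

People, locations, and organizations resist this approach because they are open-ended and ambiguous, which is why the slide assigns them to statistical extractors.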
27. Architecture
• CURN (RSS Harvester) feeds documents into the Rosette analysis components:
– Lang ID
– Entity Extraction
– Relationship Extraction
• Analyzed documents go on to document classification and clustering
• MySQL acts as the long-term datastore
• A name indexer loads content into Solr, which backs the indexing/query service and the user interface
28. Demo
• Listening Platform built on Solr
• I built this version in 3 months using Solr
and products from Basis Technology
• I would be happy to show you the Solr
config and let you try it out