NoSQL (Not Only SQL) is usually seen as a superset of relational SQL databases, or sometimes as a set that intersects with them. The concept is still taking shape, but one thing is already clear: NoSQL addresses the task of storing and retrieving large volumes of data in systems under high load. There is another important angle on the concept:
NoSQL systems can store and efficiently search unstructured or semi-structured data, such as completely raw or preprocessed documents. Using one world-class document retrieval system, Apache Solr (a performant HTTP wrapper around Apache Lucene), as a reference, we will look at its use cases, horizontal and vertical scalability, faceted search, distribution and load balancing, crawling, extensibility, linguistic support, integration with relational databases and much more.
Dmitry Kan will briefly touch upon the *hot* topic of cloud computing using the well-known Apache Hadoop project and will help the audience see whether Solr shines through the cloud.
3. •The term NoSQL was coined in 1998 by Carlo Strozzi. The NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." (Wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google
•Data storage: billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data structures are handled easily (in SQL they require multiple relational tables)
•Performance: the more data an SQL database holds, the more likely its performance is to degrade
•NoSQL is not:
•… SQL, and not relational
•… a replacement for SQL, but a complement to it
•… There is no fixed schema and no joins
•… It does not "scale up" (RDBMS, vertical scaling), but rather "scales out" (spreading the load over many commodity systems): horizontal scaling
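The "scale out" bullet above can be illustrated with a minimal sketch of hash-based partitioning: each key is routed to one of several nodes, so data and load are spread across machines. The node names and the `node_for` and `distribute` helpers are hypothetical, and real systems typically use more elaborate schemes (for example consistent hashing) than this simple modulo routing.

```python
import hashlib

# Hypothetical pool of commodity machines; names are illustrative only.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key: str, nodes=NODES) -> str:
    """Route a key to a node using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes=NODES):
    """Group keys by the node responsible for them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(12)]
    for node, owned in distribute(keys).items():
        print(node, owned)
```

Adding a machine means extending `NODES`; with plain modulo routing most keys would then move to a different node, which is one reason production systems prefer consistent hashing.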
4. NoSQL Categories
•Key-value stores: a big hash table with caching mechanisms
•Column family stores: keys point to multiple columns (Google’s BigTable)
•Document databases: documents are collections of other key-value collections
•Graph databases: nodes, relationships between nodes, and node properties
Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google’s proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB
Querying NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs
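To make the four categories on this slide concrete, here is a rough sketch of how the same blogger record might look under each data model, using plain Python structures. All keys, names, and values are invented for illustration, and no real database API is involved.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"blogger:42": '{"name": "Jon Foobar"}'}

# Column family store: a row key points to named columns grouped in families.
column_family = {
    "blogger:42": {
        "profile": {"name": "Jon Foobar"},
        "stats": {"posts": 17},
    }
}

# Document database: a document is a nested collection of key-value collections.
document = {
    "_id": "blogger:42",
    "name": "Jon Foobar",
    "posts": [{"title": "Hello", "tags": ["intro"]}],
}

# Graph database: nodes, relationships between nodes, and node properties.
nodes = {
    "n1": {"label": "Blogger", "name": "Jon Foobar"},
    "n2": {"label": "Blog", "url": "http://example.org/blog"},
}
relationships = [("n1", "WRITES", "n2")]


def neighbours(node_id, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [nodes[dst] for src, rel, dst in relationships
            if src == node_id and rel == rel_type]


if __name__ == "__main__":
    print(neighbours("n1", "WRITES"))
```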
5. SPARQL Protocol and RDF Query Language
(courtesy of about.com and IBM)
Example of retrieving the URL of a blogger
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
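As a rough illustration of what the query above asks for, the following sketch evaluates the same two triple patterns over a tiny in-memory list of triples. It uses plain Python rather than a SPARQL engine; the sample triples, the `_:b1` style node identifiers, and the `weblog_of` helper are all invented for the example.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical contents of bloggers.rdf as (subject, predicate, object) triples.
TRIPLES = [
    ("_:b1", FOAF + "name", "Jon Foobar"),
    ("_:b1", FOAF + "weblog", "http://example.org/jon"),
    ("_:b2", FOAF + "name", "Jane Doe"),
    ("_:b2", FOAF + "weblog", "http://example.org/jane"),
]


def weblog_of(name, triples=TRIPLES):
    """Return every ?url such that some ?contributor has the given
    foaf:name and a foaf:weblog of ?url (the join is on ?contributor)."""
    contributors = {s for s, p, o in triples
                    if p == FOAF + "name" and o == name}
    return [o for s, p, o in triples
            if p == FOAF + "weblog" and s in contributors]


if __name__ == "__main__":
    print(weblog_of("Jon Foobar"))
```

The shared variable `?contributor` in the query is what ties the two patterns together; in the sketch that role is played by the `contributors` set.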
stats!
6. Some stats from Information Week via about.com (2010):
•44% of business IT professionals haven’t heard of NoSQL
•1% consider NoSQL a strategic direction
Some stats from NerdCamp (April 2011):
•10% have heard of and used NoSQL
•Many more people know about the cloud, which may increasingly become a driving platform behind NoSQL
Does the world of NoSQL have enough mass to appeal to IT now?
7. “Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.”
Created by Yonik Seeley at CNET
Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich document handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs
Links:
http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html
10. Overview of current state (April 2011)
Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•GEO search: filter and sort by distance
•Spellcheck and auto suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible
Operating system support
All with a Java VM, including: Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants
App-server support
Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more
Java version requirement
Java JDK 1.5 or later
Client API support
Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++
11. Faceted search
•A technique for refining search results
•Concept composition:
• Article + in English + about nerdcamp
• Finnish rap + < 1 minute + released in 2001
•Types:
• Standard facets (list of facets with values)
• Hierarchical facet values (taxonomy of facet
values)
• Range / query facets: by date, by price, by
alphabet, by interval
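The facet types above can be illustrated without a running Solr instance. The following Python sketch counts standard and range facets over a small in-memory result list and applies concept composition as a filter. The documents and field names are hypothetical, and this is only an illustration of the idea, not how Solr computes facets internally.

```python
from collections import Counter

# Hypothetical search results; field names are for illustration only.
RESULTS = [
    {"title": "Solr intro", "lang": "en", "type": "article", "price": 0},
    {"title": "NoSQL basics", "lang": "en", "type": "article", "price": 15},
    {"title": "Finnish rap track", "lang": "fi", "type": "song", "price": 1},
    {"title": "Lucene internals", "lang": "en", "type": "book", "price": 40},
]


def field_facet(docs, field):
    """Standard facet: number of documents per distinct field value."""
    return Counter(doc[field] for doc in docs)


def range_facet(docs, field, start, end, gap):
    """Range facet: number of documents per [lower, lower + gap) interval."""
    counts = {lower: 0 for lower in range(start, end, gap)}
    for doc in docs:
        value = doc[field]
        if start <= value < end:
            counts[start + (value - start) // gap * gap] += 1
    return counts


def refine(docs, **constraints):
    """Concept composition: keep documents matching every chosen facet value."""
    return [d for d in docs if all(d[f] == v for f, v in constraints.items())]


print(field_facet(RESULTS, "lang"))
print(range_facet(RESULTS, "price", 0, 60, 20))
print([d["title"] for d in refine(RESULTS, lang="en", type="article")])
```

Each call to `refine` corresponds to the user clicking one more facet value, such as "Article" and then "in English".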
12. Spatial Search
Combines location data with text data
•Represent spatial data in the index
•Filter by some spatial concept such as a bounding box or other shape
•Sort by distance
•Score/boost by distance
<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->
•bbox: bounding box filter (bbox is a range of lats and lons that
encompasses the circle of radius d)
•geodist: the distance function
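As a rough illustration of how a bounding box and a distance function work together, the Python sketch below first applies a cheap box test and then an exact great-circle check, sorting the hits by distance. It uses the haversine formula with a mean Earth radius of 6371 km and ignores the poles and the 180th meridian; these are assumptions of the sketch, not a description of Solr's implementation. The query point near Minneapolis is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

# Store locations taken from the field examples above.
STORES = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}


def geodist(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bbox(lat, lon, d):
    """(min_lat, max_lat, min_lon, max_lon) enclosing the circle of radius d km.

    Not valid near the poles or across the 180th meridian.
    """
    r = d / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def stores_within(lat, lon, d):
    """Names of stores within d km of the point, nearest first."""
    min_lat, max_lat, min_lon, max_lon = bbox(lat, lon, d)
    hits = []
    for name, (slat, slon) in STORES.items():
        # Cheap rectangular pre-filter, then the exact distance check.
        if min_lat <= slat <= max_lat and min_lon <= slon <= max_lon:
            dist = geodist(lat, lon, slat, slon)
            if dist <= d:
                hits.append((dist, name))
    return [name for dist, name in sorted(hits)]


# Hypothetical query point near Minneapolis.
print(stores_within(44.98, -93.27, 100))
```

The box test only needs four comparisons per document, which is why it is useful as a filter before the more expensive distance computation.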
14. Spellcheck and autosuggest
Spellcheck:
•Query suggestion for a misspelled query term
http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>dell</str>
      </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion">
        <str>ultrasharp</str>
      </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
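A client typically builds the request URL and then pulls the suggestions and the collation out of the response. The Python sketch below does both with the standard library only. It assumes the `/spell` handler on `localhost:8983` from the example above and parses a copy of the response fragment rather than calling a live server; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Copy of the <lst name="spellcheck"> fragment from the example response.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""


def spellcheck_url(query, base="http://localhost:8983/solr/spell"):
    """Build the request URL; urlencode takes care of the space in the query."""
    params = [
        ("q", query),
        ("spellcheck", "true"),
        ("spellcheck.collate", "true"),
        ("spellcheck.build", "true"),
    ]
    return base + "?" + urlencode(params)


def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query)."""
    root = ET.fromstring(xml_text)
    suggestions = root.find("lst[@name='suggestions']")
    per_term = {}
    for entry in suggestions.findall("lst"):
        per_term[entry.get("name")] = [
            s.text for s in entry.findall("arr[@name='suggestion']/str")
        ]
    collation = suggestions.findtext("str[@name='collation']")
    return per_term, collation


print(spellcheck_url("hell ultrashar"))
print(parse_spellcheck(RESPONSE))
```

The collation is the ready-to-use corrected query, which is what a "Did you mean" link would normally show.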
Autosuggest:
Example with solr and jquery
15. Advanced sorting, ranking and searching
•sort=score+asc
•sort=Author+desc,score+desc
•boosting single documents
•Term frequency (tf)
•Inverse document frequency (idf)
•Co-ordination factor (coord): the more of the queried terms match, the greater the score
•Field length (fieldNorm): the shorter the matching field is in number of indexed terms, the greater the document’s score
•AND, OR, NOT, NEAR, fuzzy search
•Smashing~0.7 yields more results than just Smashing
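The ranking factors listed above can be combined in a few lines. The Python sketch below follows the shape of Lucene's classic TF-IDF similarity (tf as the square root of the term count, idf as 1 + ln(N / (df + 1)), fieldNorm as 1 / sqrt(field length)) but leaves out query normalisation and boosts, so its numbers will not match real Solr scores. The three documents are hypothetical.

```python
import math

# Hypothetical, already tokenised documents.
DOCS = {
    "short": ["dell", "ultrasharp"],
    "long": ["dell", "laptop", "with", "a", "large", "screen"],
    "other": ["solr", "search"],
}


def document_frequencies(docs):
    """Map each term to the number of documents that contain it."""
    df = {}
    for terms in docs.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    return df


def score(query_terms, doc_terms, doc_freq, num_docs):
    """Simplified Lucene-style score of one document for a query."""
    if not doc_terms or not query_terms:
        return 0.0
    field_norm = 1.0 / math.sqrt(len(doc_terms))  # shorter field, higher score
    matched = 0
    total = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        matched += 1
        tf = math.sqrt(freq)  # term frequency
        idf = 1.0 + math.log(num_docs / (doc_freq[term] + 1))  # rarer term, higher weight
        total += tf * idf * idf * field_norm
    coord = matched / len(query_terms)  # share of query terms that matched
    return coord * total


df = document_frequencies(DOCS)
query = ["dell", "ultrasharp"]
ranking = sorted(DOCS, key=lambda name: score(query, DOCS[name], df, len(DOCS)), reverse=True)
print(ranking)
```

The short document wins on all counts: it matches both query terms (coord), one of them is rare (idf), and its field is short (fieldNorm).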
16. Distributed and replicated search
Before doing this:
•Consider vertical scaling (a faster and better machine)
•Rethink the data model (what data goes to which Solr index)
•Remove logging on updates (and / or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored (depending on your use cases)
•Check your Internet connection
17. Extendability
Plugins:
•Query parser: extend LuceneQParserPlugin
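public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        // Custom query handling goes here; for now delegate to the default Lucene parser.
        return super.createParser(qstr, localParams, params, req);
    }
}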
18. SOLR I/O
•Nutch (crawler)
•Input: CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, e.g. PDF), your own format
•Output: XML, JSON, Python, javabin, CSV…, your own format
19. SOLR Processing Pipeline
•On each step, a document gets transformed
•Stop words removal
•Stemming
•(smart) Tokenization
•Ngrams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
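The idea of a document being transformed at each step can be sketched as a chain of small functions. The Python example below is a toy analysis chain, not Solr's analyzer API: the stop list is tiny and hypothetical, the stemmer is a naive suffix stripper, and duplicate removal here drops every repeated token, which is simpler than what a real token filter does.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "is"}  # tiny hypothetical stop list


def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Very naive suffix stripping; real stemmers do much more."""
    out = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out


def remove_duplicates(tokens):
    """Keep the first occurrence of each token, preserving order."""
    return list(dict.fromkeys(tokens))


def char_ngrams(token, n):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


# Order matters: each step consumes the output of the previous one.
PIPELINE = [tokenize, lowercase, remove_stop_words, stem, remove_duplicates]


def analyze(text):
    data = text
    for step in PIPELINE:
        data = step(data)
    return data


print(analyze("The Searching of the indexes is Searching"))
print(char_ngrams("solr", 2))
```

Reordering the steps changes the result, for example stop word removal before lowercasing would miss "The".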
20. Solr on the cloud
Hadoop: MapReduce
ZooKeeper: at least 3 ZooKeepers, so that 1-2 are always managing your zoo
Batch indexing, no real-time search yet
Hadoop vital components: Core and API
•MapReduce: computation model
•HDFS
•I/O
•ZooKeeper
•Pig (adds a level of abstraction for processing large datasets)
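To make the MapReduce computation model concrete in the context of batch indexing, here is a single-process Python sketch that builds an inverted index in map, shuffle and reduce phases. It does not use Hadoop or any of its APIs; the function names and the two documents are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per word of a document."""
    return [(term, doc_id) for term in text.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Collapse all document ids of a term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run the three phases in one process over a {doc_id: text} mapping."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


DOCS = {1: "Solr on Hadoop", 2: "Hadoop batch indexing"}
print(build_index(DOCS))
```

Because each `map_phase` call looks at one document only and each `reduce_phase` call at one term only, both phases can be spread over many machines, which is what makes the model suitable for batch indexing.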
21. Solr on the cloud
Does it shine? Yes, but not fully
22. References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide
Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI
[2] "Dynamo: Amazon’s Highly Available Key-value Store".
http://www.cs.uwaterloo.ca/:
WATERLOO. p. 2/22. Retrieved 2011-04-05.
"Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
23. References
[14] Using Nutch with SOLR,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/