NoSQL, Apache SOLR and Apache Hadoop


NoSQL (Not Only SQL) is variously described as a superset of relational SQL databases or as a set that overlaps with them. The concept is still taking shape, but one thing is already clear: NoSQL addresses the task of storing and retrieving large volumes of data in systems under high load. There is another important angle on the concept:
NoSQL systems can store and efficiently search unstructured or semi-structured data, such as completely raw or preprocessed documents. Using Apache SOLR, a world-class document retrieval system (a performant HTTP wrapper around Apache Lucene), as a reference, we will look at its use cases, horizontal and vertical scalability, faceted search, distribution and load balancing, crawling, extendability, linguistic support, integration with relational databases and much more.

Dmitry Kan will briefly touch upon the *hot* topic of cloud computing, using the well-known Apache Hadoop project, and will help the audience see whether SOLR shines through the cloud.

Published in: Technology

NoSQL, Apache SOLR and Apache Hadoop

  1. NoSQL: Apache SOLR, Apache Hadoop. By Dmitry Kan for NerdCamp, April 23, 2011
  2. Dilbert: expert in NoSQL
  3. • The name NoSQL was coined in 1998 by Carlo Strozzi, who notes that the NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately NoREL, or something to that effect." (Wikipedia)
      • NoSQL = Not Only SQL
      • Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google
      • Data storage: billion gigabytes (GB) of data
      • Interconnected data: hyperlinks, blog pingbacks, social networks
      • Complex data structures: hierarchical nested data structures are represented easily (they need multiple relational tables in SQL)
      • Performance: the more data in a SQL database, the more likely performance is to degrade
      • NoSQL is not:
          • … SQL, and not relational
          • … a replacement for SQL, but a complement
          • … There is no fixed schema and no joins
          • … It does not "scale up" (RDBMS, vertical scaling), but rather "scales out" (spreading the load over many commodity systems): horizontal scaling
  4. NoSQL Categories
      • Key-value stores: a big hash table with caching mechanisms
      • Column family stores: keys point to multiple columns (Google's BigTable)
      • Document databases: documents are collections of other key-value collections
      • Graph databases: nodes, relationships between nodes, and node properties
     Major NoSQL players
      • Dynamo: key-value, used in Amazon S3 (Simple Storage Service)
      • Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
      • BigTable: Google's proprietary column-oriented DB (App Engine)
      • CouchDB: open-source document-oriented NoSQL DB (as well as MongoDB)
      • Neo4j: open-source graph DB
     Querying a NoSQL DB:
      • Data model specific
      • RESTful interfaces or query APIs
      • SPARQL: declarative query specification for graph DBs
  5. SPARQL Protocol and RDF Query Language (courtesy of and IBM)
     Example of retrieving the URL of a blogger:
         PREFIX foaf: <>
         SELECT ?url
         FROM <bloggers.rdf>
         WHERE {
           ?contributor foaf:name "Jon Foobar" .
           ?contributor foaf:weblog ?url .
         }
     stats!
  6. Some stats from Information Week (2010):
      • 44% of business IT professionals haven't heard of NoSQL
      • 1% say NoSQL is a strategic direction
     Some stats from NerdCamp (April 2011):
      • 10% have heard of and used NoSQL
      • Many more people know about the cloud, which may increasingly become a driving platform behind NoSQL
     Does the world of NoSQL have enough mass to appeal to IT now?
  7. "Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project."
     Created by Yonik Seeley at CNET
     Features:
      • Full-text search
      • Hit highlighting
      • Faceted search (dynamic clustering)
      • DB integration
      • Rich doc handling
      • Geospatial search
      • Distributed search
      • Replication
      • REST-like HTTP/XML & JSON APIs
     Books
  8. Companies using SOLR: Drupal
  9. Overview of current state, April 2011
      • Current version: Apache Solr 3.1 (March 31, 2011)
      • License: ASL 2.0
      • Features:
          • Faceted navigation
          • Hit highlighting
          • GEO search: filter and sort by distance
          • Spellcheck and auto suggest
          • Advanced ranking and sorting
          • Distributed and replicated search
          • Structured / unstructured search
          • Rich plugin architecture, extensible
      • Operating system support: all with a Java VM, including Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants
      • App-server support: Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more
      • Java version requirement: Java JDK 1.5 or later
      • Client API support: Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++
  10. Faceted search
      • A technique for refining search results
      • Concept composition:
          • Article + in English + about nerdcamp
          • Finnish rap + < 1 minute + released in 2001
      • Types:
          • Standard facets (list of facets with values)
          • Hierarchical facet values (taxonomy of facet values)
          • Range / query facets: by date, by price, by alphabet, by interval
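The "concept composition" idea on the faceted search slide can be illustrated with a small in-memory sketch. This is not how Solr computes facets internally; it only shows the principle of filtering by chosen facet values and counting the remaining values. The documents and field names (`type`, `lang`, `topic`) are made up for illustration.

```python
from collections import Counter

# Hypothetical toy documents; the field names are assumptions for illustration.
DOCS = [
    {"type": "article", "lang": "en", "topic": "nerdcamp"},
    {"type": "article", "lang": "fi", "topic": "nerdcamp"},
    {"type": "article", "lang": "en", "topic": "hadoop"},
    {"type": "song", "lang": "fi", "topic": "rap"},
]

def filter_docs(docs, **constraints):
    """Keep the documents that match every field=value constraint."""
    return [d for d in docs if all(d.get(f) == v for f, v in constraints.items())]

def facet_counts(docs, field):
    """Count how many of the given documents carry each value of `field`."""
    return Counter(d[field] for d in docs if field in d)

# "Article" narrows the result set; the language facet shows what is left to refine by.
articles = filter_docs(DOCS, type="article")
print(facet_counts(articles, "lang"))

# "Article + in English": compose a second constraint and facet on the topic.
english_articles = filter_docs(DOCS, type="article", lang="en")
print(facet_counts(english_articles, "topic"))
```

Each refinement is just another constraint, which is why facets compose naturally: the counts shown to the user are always computed over the current result set.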
  11. Spatial Search
      Combines location data with text data
      • Represent spatial data in the index
      • Filter by some spatial concept such as a bounding box or other shape
      • Sort by distance
      • Score/boost by distance
      • Example fields:
          <field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
          <field name="store">40.7143,-74.006</field> <!-- NYC store -->
          <field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->
      • bbox: bounding box filter (bbox is a range of lats and lons that encompasses the circle of radius d)
      • geodist: the distance function
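To make the bounding box and distance ideas from the spatial search slide concrete, here is a minimal pure-Python sketch. It does not call Solr; it reimplements the two concepts (a box of latitudes and longitudes that encloses the circle of radius d, and a great-circle distance) under simplifying assumptions: a spherical Earth, no handling of the poles or the 180° meridian. The store coordinates come from the slide; the query point and radius are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

# Store coordinates copied from the slide.
STORES = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}

def geodist_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def bounding_box(lat, lon, d_km):
    """Return (min_lat, min_lon, max_lat, max_lon) enclosing the circle of radius d_km."""
    ang = d_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return lat - dlat, lon - dlon, lat + dlat, lon + dlon

def in_box(point, box):
    """Cheap range check: is the point inside the lat/lon box?"""
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

CENTER = (45.15, -93.85)  # hypothetical query point
RADIUS_KM = 5.0           # hypothetical search radius

box = bounding_box(*CENTER, RADIUS_KM)
nearby = [name for name, pt in STORES.items() if in_box(pt, box)]
by_distance = sorted(STORES, key=lambda name: geodist_km(*CENTER, *STORES[name]))
print(nearby)
print(by_distance)
```

The box test is only two range comparisons per document, which is why a bounding box is attractive as a filter; the exact distance is then needed only for sorting or boosting.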
  12. Hit highlighting
      Example from the Solr admin
  13. Spellcheck and autosuggest
      Spellcheck:
      • Query suggestion for a misspelled query term
        http://localhost:8983/solr/spell?q=hell+ultrashar&spellcheck=true&spellcheck.collate=true&
          <lst name="spellcheck">
            <lst name="suggestions">
              <lst name="hell">
                <int name="numFound">1</int>
                <int name="startOffset">0</int>
                <int name="endOffset">4</int>
                <arr name="suggestion"> <str>dell</str> </arr>
              </lst>
              <lst name="ultrashar">
                <int name="numFound">1</int>
                <int name="startOffset">5</int>
                <int name="endOffset">14</int>
                <arr name="suggestion"> <str>ultrasharp</str> </arr>
              </lst>
              <str name="collation">dell ultrasharp</str>
            </lst>
          </lst>
      Autosuggest:
      • Example with Solr and jQuery
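A client has to pull the suggestions and the collated query out of a spellcheck response like the one on the slide. A minimal sketch using only the Python standard library follows; the XML is the slide's response with whitespace restored, and the function name `parse_spellcheck` is a made-up helper, not part of any Solr client.

```python
import xml.etree.ElementTree as ET

# The spellcheck section of the response shown on the slide.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""

def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    root = ET.fromstring(xml_text.strip())
    node = root.find("lst[@name='suggestions']")
    suggestions = {}
    for term in node.findall("lst"):
        words = [s.text for s in term.findall("arr[@name='suggestion']/str")]
        suggestions[term.get("name")] = words
    collation = node.findtext("str[@name='collation']")
    return suggestions, collation

suggestions, collation = parse_spellcheck(RESPONSE)
print(suggestions)
print(collation)
```

The collation is the ready-made "did you mean" query, so a search page can usually show it directly and fall back to the per-term suggestions only when it is absent.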
  14. Advanced sorting, ranking and searching
      • sort=score+asc
      • sort=Author+desc,score+desc
      • Boosting single documents
      • Term Frequency: tf
      • Inverse Document Frequency: idf
      • Co-ordination Factor: coord (the more of the query terms match, the greater the score)
      • Field Length: fieldNorm (the shorter the matching field is in number of indexed terms, the greater the document's score)
      • AND, OR, NOT, NEAR, fuzzy search
      • Smashing~0.7 yields more results than just Smashing
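How the four ranking factors on the slide interact can be sketched in a few lines. The formulas below follow the classic Lucene default similarity as commonly documented (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), fieldNorm = 1 / sqrt(number of terms), coord = matched terms / query terms). This is a simplified sketch: query normalization, boosts and Lucene's lossy norm encoding are left out, and the toy corpus is hypothetical.

```python
import math

def tf(freq):
    """Term frequency factor: grows with repeated occurrences, but sub-linearly."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rare terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field length factor: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)

def coord(matched, total):
    """Co-ordination factor: fraction of query terms that matched."""
    return matched / total

def score(query_terms, doc_terms, doc_freqs, num_docs):
    """Simplified score: coord * sum(tf * idf^2 * fieldNorm) over matching terms."""
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0
    norm = field_norm(len(doc_terms))
    total = sum(
        tf(doc_terms.count(t)) * idf(doc_freqs[t], num_docs) ** 2 * norm
        for t in matched
    )
    return coord(len(matched), len(query_terms)) * total

# Hypothetical three-document corpus.
CORPUS = {
    "short": ["dell", "ultrasharp", "monitor"],
    "long": ["dell", "ultrasharp", "monitor", "with", "a", "very", "long", "description"],
    "partial": ["dell", "laptop", "bag"],
}
QUERY = ["dell", "ultrasharp"]
DOC_FREQS = {t: sum(t in terms for terms in CORPUS.values()) for t in QUERY}

scores = {name: score(QUERY, terms, DOC_FREQS, len(CORPUS)) for name, terms in CORPUS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy corpus the short document that matches both terms ranks first, the long one second (same matches, smaller fieldNorm), and the partial match last (lower coord).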
  15. Distributed and replicated search
      Before doing this:
      • Consider vertical scaling (a faster and better machine)
      • Rethink the data model (what data goes to which Solr index)
      • Remove logging on updates (and / or searches)
      • Redesign your index: make as many fields as possible non-indexed and non-stored (depending on use cases)
      • Check your Internet connection
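  16. Extendability
      Plugins:
      • Query parser: extend LuceneQParserPlugin
          import org.apache.solr.common.params.SolrParams;
          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.search.LuceneQParserPlugin;
          import org.apache.solr.search.QParser;
          public class NerdCampQParserPlugin extends LuceneQParserPlugin {
              @Override
              public QParser createParser(String qstr, SolrParams localParams,
                                          SolrParams params, SolrQueryRequest req) {
                  // Placeholder: delegate to the stock Lucene parser; custom parsing goes here.
                  return super.createParser(qstr, localParams, params, req);
              }
          }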
  17. SOLR I/O
      • Nutch (crawler)
      • CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, such as PDF), your format
      • Output: xml, json, python, javabin, csv…, your format
  18. SOLR Processing Pipeline
      • On each step, a document gets transformed
      • Stop word removal
      • Stemming
      • (Smart) tokenization
      • N-grams (letter level and word level)
      • Regular expressions
      • Lowercasing
      • Reversed wildcard
      • Duplicate removal
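The pipeline idea, where each step transforms the output of the previous one, can be sketched as a chain of small functions. This is a toy stand-in, not Solr's analyzer chain: the stop word list is a tiny made-up sample, the "stemmer" is naive suffix stripping rather than a real algorithm such as Porter's, and only some of the steps listed on the slide are covered.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "is"}  # tiny illustrative list

def tokenize(text):
    """Split raw text into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(tokens):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    out = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

def char_ngrams(token, n):
    """Letter-level n-grams of a single token."""
    return [token[i : i + n] for i in range(len(token) - n + 1)]

def word_ngrams(tokens, n):
    """Word-level n-grams over a token list."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

# The order matters: stop word removal here assumes lowercased tokens.
PIPELINE = [tokenize, lowercase, remove_stop_words, stem]

def analyze(text):
    """Run the text through every pipeline step in order."""
    data = text
    for step in PIPELINE:
        data = step(data)
    return data

terms = analyze("The Searching of the indexes and Indexes")
print(terms)
print(char_ngrams("solr", 2))
print(word_ngrams(terms, 2))
```

Because every step has the same shape (tokens in, tokens out), steps can be reordered, dropped or added per field, which is the property the slide's list relies on.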
  19. Solr on the cloud
      • Hadoop: MapReduce
      • ZooKeeper: at least 3 ZooKeepers to have 1-2 managing your zoo
      • Batch indexing, no realtime search yet
      • Hadoop vital components:
          • Core and API
          • MapReduce: computation model
          • HDFS I/O
          • ZooKeeper
          • Pig (adds a level of abstraction for processing large datasets)
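The MapReduce computation model behind batch indexing can be sketched without Hadoop at all. The single-process simulation below builds a tiny inverted index: the map step emits (term, document id) pairs, a shuffle groups them by term, and the reduce step turns each group into a posting list. It uses no Hadoop API; the function names and documents are hypothetical, and a real job would run the map and reduce steps on many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, doc_ids):
    """Reduce: collapse the values for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    """Run map, shuffle and reduce in sequence over a dict of {doc_id: text}."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical documents.
DOCS = {1: "solr on hadoop", 2: "hadoop mapreduce", 3: "solr search solr"}
index = build_index(DOCS)
print(index["solr"])
print(index["hadoop"])
```

Since each map call looks at one document and each reduce call at one term, both phases parallelize trivially, which suits batch indexing; it also hints at why the slide notes that realtime search is not there yet.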
  20. Solr on the cloud
      Does it shine? Yes, but not fully