The Power of Elasticsearch

A horizontally-scalable, distributed database built on Apache’s Lucene that delivers a full-featured search experience across terabytes of data with a simple yet powerful API.

Learn more at http://infochimps.com

Published in: Technology

  1. The Power of Elasticsearch

     What is Search?

     Search is a feature that can make or break an application or service. The ability to do full-text search across a body of text, to narrow search queries by values or ranges of specific fields, and to use advanced features like faceted search, geo queries, or similarity searching can add a wealth of functionality and act as a differentiator for your product. When search fails, whether because of inflexible query parameters, irrelevant results, or an inability to scale to meet demands of volume or usage, users notice immediately, and they will be upset.

     What is Elasticsearch?

     Elasticsearch is a new database built to handle huge volumes of data with very high availability and to distribute itself across many machines to be fault-tolerant and scalable, all the while maintaining a simple but powerful API that gives applications written in any language or framework access to the database.

     © 2012 Infochimps, Inc. All rights reserved.
  2. Built on Lucene

     Apache’s Lucene is an open-source Java library for text search. The Lucene project has been growing for more than a decade and has become the standard reference for how to build a powerful yet easy-to-integrate open-source search library. Its feature set includes but is not limited to:

     Performance/Scalability
       • Can index over 95 GB/hr/node
       • Low RAM overhead (1MB heap)

     Search Features
       • Ranked/relevance searching with fine-grained control
       • Allows querying on phrases, wildcards, geographical proximity, variable ranges, &c.

     Accessibility
       • Completely open-source
       • Implemented in Java so inherently cross-platform

     Wrapping Lucene for Big Data

     Lucene, as a search library, must be wrapped with an interface to allow its features to be used by an application. Many such interfaces have been built for different platforms and use cases. One of the most popular is Apache’s own Solr project, which creates an interface around Lucene tailored for something like a traditional web application.

     An interface like Solr, however, is designed for a world in which a single server can handle the full workload of indexing and querying the data. When the data volume grows past this limit, Solr (and similar interfaces to Lucene) becomes unwieldy to use: the same problems of sharding, replication, and query dispatching that occur in RDBMS systems begin to occur again in this context. And just as various methods exist for dealing with these difficulties in the RDBMS world, various tools exist for shard creation and distribution around Solr.

     But just as the right solution to big data databases means moving away from RDBMS into NoSQL technologies, the right solution to scaling Lucene is to move away from tools like Solr and use a tool built from the ground up to work with terabytes of data in a horizontally scalable, distributed, and fault-tolerant way: Elasticsearch!
  3. Elasticsearch Features

     Elasticsearch is best thought of as an interface to Lucene designed for big data from the ground up. The complex feature set that Lucene provides for searching data is directly available through Elasticsearch, as Lucene is ultimately the library that’s used for indexing and querying data. This also means that plugins that work with Lucene will work with Elasticsearch out of the box.

     The features that Elasticsearch itself provides around Lucene are designed to make it the perfect tool for full-text search on big data:

     Performance/Scalability
       • An 8-node cluster can provide sub-200ms response latency when performing complex searches on 10B+ records!
       • Add or subtract nodes on the fly to dynamically scale the cluster to the current load.
       • Ability to independently scale the indexing and querying performance of the cluster to deal with different sorts of use cases.

     Robustness
       • No single point of failure.
       • Automatically back up all data in the cluster to local disk or permanent, remote storage (like AWS’ S3 service).
       • Tune the replication factor of data on a per-index level.
       • Data will automatically be migrated through the cluster if a node fails, to maintain performance and replication factor.

     Ease of Use
       • Simple, JSON-based REST API means any language can index or query records in an Elasticsearch cluster.
       • Java and Thrift APIs exist for finer-grained or more performant access.
       • Flexible schemas allow for complex treatments of types like dates without forcing all documents in a table to be identical.
       • Multiple indices enable multi-tenancy out of the box.
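The per-index replication factor mentioned above can be illustrated with the request an application might send over the REST API. This is a minimal sketch, not a definitive client: the index name `tweets` and the helper names are hypothetical, the shard and replica counts are arbitrary, and the code only builds the method, path, and JSON body without contacting a cluster. The `number_of_shards` and `number_of_replicas` settings are the standard index settings; other details may vary by Elasticsearch version.

```python
import json


def create_index_request(index, shards, replicas):
    """Return (method, path, body) for creating an index with explicit settings."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,      # fixed when the index is created
                "number_of_replicas": replicas,  # can be changed later
            }
        }
    }
    return "PUT", f"/{index}", body


def set_replicas_request(index, replicas):
    """Return (method, path, body) for changing the replica count of one index."""
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index}/_settings", body


# Hypothetical index name and sizes, chosen only for illustration.
method, path, body = create_index_request("tweets", shards=5, replicas=2)
print(method, path, json.dumps(body))
print(*set_replicas_request("tweets", replicas=1))
```

Because the replica count is set per index, a heavily queried index can carry more copies than a rarely used one in the same cluster.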
  4. Example Use Cases

     There’s a lot you can do with Elasticsearch besides just searching for phrases. The following examples act as a quick guide to just a few of the features Elasticsearch provides.

     Powerful Query Syntax

     The simplest way to interface with Elasticsearch is also one of the most powerful: the query string. Elasticsearch exposes the full Lucene query syntax through query strings that can be passed from a user in an application directly to the database to be evaluated.

     Feature, query string, and notes:
       • Boolean logic: (coke OR pepsi) AND health
       • Wildcards: apple AND ip*d
         Wildcards can be applied for a single character (?) or for groups of characters (*).
       • Specific search fields: coffee AND author:Smith
         Can search on deeply nested fields like “author.lastName” as well.
       • Search within a range: apple AND date:[20100101 TO 20100201]
       • Boost results in relevance: taxicab AND ("New York"^2 OR "San Francisco")
         Boosting can also be configured at index time.
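As a rough illustration of how an application might hand one of the query strings above to Elasticsearch, the sketch below wraps a Lucene-syntax string in a `query_string` search body and, alternatively, encodes it into a URI search path. The index name `posts` and the helper names are assumptions made for this example; nothing is sent over the network.

```python
import json
from urllib.parse import urlencode


def query_string_body(query, default_field=None):
    """Wrap a Lucene-syntax query string in a query_string search body."""
    clause = {"query": query}
    if default_field is not None:
        # Field searched when the query string names no field explicitly
        clause["default_field"] = default_field
    return {"query": {"query_string": clause}}


def uri_search_path(index, query):
    """Build a URI search path that passes the query string as the q parameter."""
    return f"/{index}/_search?" + urlencode({"q": query})


# "posts" is a hypothetical index name.
print(json.dumps(query_string_body("(coke OR pepsi) AND health"), indent=2))
print(uri_search_path("posts", "coffee AND author:Smith"))
```

Passing user input straight through is what makes the query string convenient, but an application exposed to untrusted users may want to validate or restrict it first.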
  5. Records can be Complex Documents

     A record in Elasticsearch doesn’t have to be flat like a record in a traditional RDBMS. Elasticsearch allows documents to be hierarchical, and for sub-fields within a document to themselves have hierarchical structure. This makes data modeling very flexible. An example of how one might store a blog post:

         {
           "id": 1001,
           "author": {
             "name": "Alexander Hamilton",
             "id": 3874
           },
           "date": "1787-10-07 12:31:00 -0600 CST",
           "title": "The Federalist Papers",
           "subtitle": "Paper #1",
           "text": "AFTER an unequivocal experience of the inefficiency...",
           "similar_posts": [1002, 1003, 1005],
           "comments": [
             {
               "author": "John Adams",
               "text": "I must beg to differ..."
             },
             …
           ]
         }

     We could query these records using “author.name” or even “comments.text”, giving us a great deal of flexibility in how we choose to denormalize and access the data in the database.
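To make the dotted field paths concrete, here is a small sketch in plain Python. It is an illustration of the addressing idea only, not how Elasticsearch is implemented: `values_at` is a hypothetical helper that walks a dotted path through an abbreviated copy of the blog post above, treating arrays as transparent. The `match` query body at the end shows how such a path might appear in a search request.

```python
import json

# Abbreviated copy of the blog post document shown above.
post = {
    "id": 1001,
    "author": {"name": "Alexander Hamilton", "id": 3874},
    "title": "The Federalist Papers",
    "subtitle": "Paper #1",
    "similar_posts": [1002, 1003, 1005],
    "comments": [{"author": "John Adams", "text": "I must beg to differ..."}],
}


def values_at(doc, path):
    """Collect every value reachable through a dotted path, fanning out over lists."""
    nodes = [doc]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            # A list is stepped through element by element
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes
    # Flatten list-valued leaves so the result is a flat list of values
    flat = []
    for node in nodes:
        flat.extend(node if isinstance(node, list) else [node])
    return flat


print(values_at(post, "author.name"))
print(values_at(post, "comments.text"))

# A search body that targets the nested field by its dotted path.
print(json.dumps({"query": {"match": {"author.name": "Hamilton"}}}))
```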
  6. Geo Queries

     Elasticsearch understands geography. Geolocations can be stored within records as (latitude, longitude) pairs or as geohashes. In either case, Elasticsearch provides the ability to query using a variety of geo-methods:
       • Geo queries defined with a bounding box
       • Geo queries defined by distance range from a given point
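The two geo-methods listed above might be expressed as search bodies along the following lines. This is a sketch under stated assumptions: the field name `location`, the coordinates, and the helper names are hypothetical, and the `bool`/`filter` wrapper follows recent Elasticsearch versions (older releases wrapped filters differently). The code builds the JSON bodies only and does not contact a cluster.

```python
import json


def bounding_box_query(field, top_left, bottom_right):
    """Match documents whose geo point lies inside a (lat, lon) bounding box."""
    return {"query": {"bool": {"filter": {"geo_bounding_box": {field: {
        "top_left": {"lat": top_left[0], "lon": top_left[1]},
        "bottom_right": {"lat": bottom_right[0], "lon": bottom_right[1]},
    }}}}}}


def distance_query(field, center, distance):
    """Match documents whose geo point lies within `distance` of `center`."""
    return {"query": {"bool": {"filter": {"geo_distance": {
        "distance": distance,  # e.g. "10km"
        field: {"lat": center[0], "lon": center[1]},
    }}}}}


# Hypothetical field name and coordinates (roughly around Austin, TX).
print(json.dumps(bounding_box_query("location", (30.5, -98.0), (30.0, -97.5))))
print(json.dumps(distance_query("location", (30.27, -97.74), "10km")))
```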
  7. Time Series

     Things change. It’s important to see how. Elasticsearch understands dates and times and can return time series data which represent an aggregation of the search results binned by time interval.

     Raw tweets stored in Elasticsearch can be binned into a time series on the fly at query time.
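To show what "binned by time interval" means, the sketch below first bins a handful of made-up timestamps by hour in plain Python, then builds a search body that asks Elasticsearch to do the same with a `date_histogram` aggregation. The field name `created_at`, the aggregation name `over_time`, and the sample timestamps are assumptions for illustration; the `calendar_interval` parameter is the one used by recent versions, and older releases exposed this feature under different names.

```python
from collections import Counter
from datetime import datetime


def bin_by_hour(timestamps):
    """Count ISO-style timestamps per hour, returning (hour_start, count) pairs in order."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
        # Truncate to the start of the hour to find the bin
        counts[dt.replace(minute=0, second=0)] += 1
    return sorted(counts.items())


def date_histogram_body(field, interval):
    """Search body that returns only time bins, not the matching documents."""
    return {
        "size": 0,  # skip the hits, keep the aggregation
        "aggs": {"over_time": {"date_histogram": {
            "field": field,
            "calendar_interval": interval,
        }}},
    }


# Made-up tweet timestamps.
sample = ["2012-06-01T10:05:00", "2012-06-01T10:45:30", "2012-06-01T11:01:00"]
for hour, count in bin_by_hour(sample):
    print(hour.isoformat(), count)
print(date_histogram_body("created_at", "hour"))
```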
  8. Application Support

     Elasticsearch isn’t just a search engine; it’s a full-fledged database, and you can build an entire frontend application on top of it.

     Elasticsearch supports multiple indices (databases) and multiple mappings (tables) per index. This feature, combined with the complex document structure Elasticsearch allows, lets you build the complex data models that support applications.

     And, in addition to being able to execute rich search queries across the data, Elasticsearch allows the more “traditional” operations that define an application database: listing records, creating records, updating records, and deleting records. These features give you what you need to build a traditional database-driven, read/write application on top of the same database that lets you do full-text search and complex queries, all with horizontal scalability built in from the ground up.

     Administration & Monitoring

     Elasticsearch also exposes a complete administrative and monitoring interface over the same API that powers the indexing, retrieval, and search of data.

     Creating indices, updating their indexing or storage properties, defining rules for dealing with specific fields in specific mappings, &c. can all be accomplished via this same API.

     Getting detailed information about the cluster’s availability state, health, individual nodes’ memory footprint, &c. is also available through this API, making monitoring of Elasticsearch easy.
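The "traditional" record operations and the monitoring calls described above all travel over the same REST API. The sketch below maps each operation to an HTTP method and path. It is illustrative only: the index name `posts`, the document id, and the helper names are hypothetical, the paths follow the layout of recent Elasticsearch versions (older releases used a different path layout), and no request is actually sent.

```python
def document_requests(index, doc_id):
    """Map record operations to (HTTP method, path) pairs for one document."""
    return {
        "create": ("PUT", f"/{index}/_doc/{doc_id}"),
        "read": ("GET", f"/{index}/_doc/{doc_id}"),
        "update": ("POST", f"/{index}/_update/{doc_id}"),
        "delete": ("DELETE", f"/{index}/_doc/{doc_id}"),
        "list": ("GET", f"/{index}/_search"),  # listing is a search over the index
    }


def monitoring_requests():
    """Map monitoring tasks to (HTTP method, path) pairs on the same API."""
    return {
        "cluster_health": ("GET", "/_cluster/health"),
        "node_stats": ("GET", "/_nodes/stats"),
    }


# Hypothetical index name and document id.
for name, (method, path) in document_requests("posts", 1001).items():
    print(f"{name:8s} {method:6s} {path}")
for name, (method, path) in monitoring_requests().items():
    print(f"{name:14s} {method} {path}")
```

Since data access and administration share one interface, the same HTTP client code in an application can serve both purposes.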
  9. About Infochimps

     Our mission is to make the world’s data more accessible.

     Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.

     Contact Us

     Infochimps, Inc.
     1214 W 6th St. Suite 202
     Austin, TX 78703
     1-855-DATA-FUN (1-855-328-2386)
     www.infochimps.com
     info@infochimps.com
     Twitter: @infochimps

     Get a free Big Data consultation

     Let’s talk Big Data in the enterprise! Get a free conference with the leading big data experts regarding your enterprise big data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, etc. Find out how other companies are solving similar problems. Learn best practices and get recommendations, free.
