Deep Dive on ElasticSearch Meetup event on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NOSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands on session on ElasticSearch
2. Agenda
• What is Big Data?
• Why is NOSQL required?
• What are the different types of NOSQL databases?
• ElasticSearch - Introduction
• ElasticSearch - Features
• Hands on
http://www.meetup.com/abctalks
3. Big Data
Any collection of data sets so large and complex that it becomes difficult to
process using traditional data processing applications.
Such data sets require "massively parallel software running on tens, hundreds, or even
thousands of servers".
4. Factors of growth, challenges and opportunities of big data
Volume – the quantity of data that is generated.
Variety – the category (type) of data that the Big Data belongs to.
Velocity – how fast the data is generated and processed to meet demand.
5. Horizontal & Vertical scaling
Horizontal scaling - scale by adding more machines to your pool of resources.
Vertical scaling - scale by adding more power (CPU, RAM, etc.) to your existing
machine.
Horizontal scaling makes it easier to scale dynamically, by adding more machines into
the existing pool.
Vertical scaling is often limited to the capacity of a single machine.
Examples of horizontally scaled systems are cloud data stores such as DynamoDB,
Cassandra and MongoDB.
An example of a vertically scaled system is MySQL on Amazon RDS (the cloud version of MySQL).
6. NOSQL
Basically a large serialized object store
Doesn’t have a structured schema
Recommends de-normalization
Designed to be distributed (cloud-scale) out of the box
Because of this, typically relaxes the ACID requirements
Any node of the database can answer any query
Any write query can operate against any node and will “eventually” propagate to the other
distributed servers
7. Why NOSQL?
Today, data is becoming easier to access and capture through third parties such as Facebook,
Google+ and others.
Personal user information, social graphs, geo-location data, user-generated content and
machine logging data are just a few examples where the data has been increasing
exponentially.
Using the above services properly requires processing huge amounts of data, which
traditional SQL databases were never designed for.
NoSQL databases have evolved to handle data at this scale.
8. CAP Theorem
Consistency - This means that all nodes see the same
data at the same time.
Availability - This means that the system is always on,
no downtime.
Partition Tolerance - This means that the system
continues to function even if the communication
among the servers is unreliable
Distributed systems must be partition tolerant, so when a
partition occurs we have to choose between Consistency and Availability.
9. Different types of NOSQL
Column Store
Column data is saved together, as opposed to row data
Super useful for data analytics
HBase (on Hadoop), Cassandra, Hypertable
Key-Value Store
A key that refers to a payload
MemcacheDB, Azure Table Storage, Redis
Document / XML / Object Store
Key (and possibly other indexes) point at a serialized object
DB can operate against values in document
MongoDB, CouchDB, RavenDB, ElasticSearch
Graph Store
Nodes are stored independently, and the relationships between nodes (edges) are
stored with the data
10. RDBMS vs NOSQL
RDBMS                                    | NoSQL
Structured and organized data            | Semi-structured or unorganized data
Structured Query Language (SQL)          | No declarative query language
Tight consistency                        | Eventual consistency
ACID transactions                        | BASE transactions
Data and relationships stored in tables  | No predefined schema
11. What is ElasticSearch?
ElasticSearch is a free and open source distributed inverted index created by Shay Banon.
Built on top of Apache Lucene.
Lucene is the most popular Java-based full text search index implementation.
First public release, version v0.4, in February 2010.
Developed in Java, so inherently cross-platform.
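To make the term "inverted index" concrete, here is a minimal Python sketch. It is a toy illustration only and does not reflect how Lucene is implemented; the function names `build_inverted_index` and `search`, the whitespace tokenizer and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents, keyed by document id.
docs = {
    1: "ElasticSearch is built on Lucene",
    2: "Lucene is a full text search library",
    3: "NoSQL stores scale horizontally",
}
index = build_inverted_index(docs)
print(sorted(search(index, "lucene is")))  # prints [1, 2]
```

The lookup goes from term to documents rather than from document to terms, which is why a search does not need to scan every document.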
12. Why ElasticSearch?
Easy to scale (Distributed)
Everything is one JSON call away (RESTful API)
Unleashed power of Lucene under the hood
Excellent Query DSL
Multi-tenancy
Support for advanced search features (Full Text)
Configurable and Extensible
Document Oriented
Schema free
Conflict management
Active community
13. Easy to Scale (Distributed)
ElasticSearch is built to scale horizontally out of the box. Whenever
you need to increase capacity, just add more nodes and let the
cluster reorganize itself to take advantage of the extra hardware.
One server can hold one or more parts of one or more indexes, and
whenever new nodes are introduced to the cluster they simply join
the pool. Each such part of an index is called a shard, and
ElasticSearch shards can be moved around the cluster very easily.
RESTful API
ElasticSearch is API driven. Almost any action can be performed using a
simple RESTful API using JSON over HTTP.
Responses are always in JSON format.
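As a rough illustration of "JSON over HTTP", the sketch below prepares a request that would store one document, using only the Python standard library. It assumes a node at http://localhost:9200 and the index/type/id URL layout used elsewhere in these slides (`test`, `cities`); the helper name `build_index_request`, the document id and the field values are hypothetical. The request is built but not sent, so no running node is needed.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Prepare (but do not send) an HTTP PUT that stores a JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical document; the field names are made up for illustration.
request = build_index_request(
    "http://localhost:9200", "test", "cities", 1,
    {"name": "Pune", "country": "India"},
)
print(request.get_method(), request.full_url)

# To actually send it to a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```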
14. Built on top of Apache Lucene
Apache Lucene is a high performance, full-featured Information
Retrieval library, written in Java. ElasticSearch uses Lucene internally to
build its state of the art distributed search and analytics capabilities.
Lucene is a stable, proven technology that is continuously gaining
features and best practices, which makes it a solid underlying engine
to power ElasticSearch.
Excellent Query DSL
The REST API exposes a rich and capable query DSL that is
easy to use. Every query is just a JSON object that can contain practically
any type of query, or even several of them combined.
Using filtered queries, with some queries expressed as Lucene filters,
helps leverage caching and thus speeds up common queries, or complex
queries with parts that can be reused.
Faceting, another very common search feature, is returned alongside
the search results on request, ready for you to use.
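A query of this kind might look like the following sketch, which builds the JSON body in Python. It assumes the 1.x-era request syntax that was current at the time of this talk (a `filtered` query plus `facets`); later ElasticSearch versions replaced both. The helper name `filtered_search` and the field names `description` and `country` are hypothetical.

```python
import json


def filtered_search(text_field, text, filter_field, filter_value, facet_field):
    """Build a search body combining a full-text query, a filter and a facet."""
    return {
        "query": {
            "filtered": {
                # Scored full-text part of the search.
                "query": {"match": {text_field: text}},
                # Unscored yes/no restriction, which is the cacheable part.
                "filter": {"term": {filter_field: filter_value}},
            }
        },
        # Term counts returned alongside the hits.
        "facets": {
            facet_field: {"terms": {"field": facet_field}},
        },
    }


body = filtered_search("description", "port city", "country", "india", "country")
print(json.dumps(body, indent=2))
```

The body would be sent to a `_search` endpoint; separating the filter from the scored query is what lets the filter result be reused across requests.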
15. Multi-tenancy
Multiple indexes can be stored on one ElasticSearch installation
- node or cluster. Each index can have multiple "types", which
are essentially completely different indexes.
The nice thing is you can query multiple types and multiple
indexes with one simple query.
Support for advanced search features (Full Text)
ElasticSearch uses Lucene under the covers to provide the most powerful
full text search capabilities available in any open source product.
Search comes with multi-language support, a powerful query language,
support for geolocation, context aware did-you-mean suggestions,
autocomplete and search snippets.
Script support in filters and scorers
16. Configurable and Extensible
Many ElasticSearch configuration settings can be changed while ElasticSearch is running, but some require a restart (and in
some cases re-indexing). Most settings can also be changed using the REST API.
ElasticSearch has several extension points, namely site plugins (which let you serve static content from ES, such as JavaScript
monitoring apps), rivers (for feeding data into ElasticSearch), and plugins that add modules or components within ElasticSearch
itself. This allows you to switch almost every part of ElasticSearch, if you so choose, fairly easily.
Document Oriented
Store complex real world entities in ElasticSearch as structured JSON
documents. All fields are indexed by default, and all the indices can be
used in a single query, to return results at breathtaking speed.
Per-operation Persistence
ElasticSearch's primary motto is data safety. Document changes are recorded
in transaction logs on multiple nodes in the cluster to minimize the chance
of any data loss.
17. Schema free
ElasticSearch allows you to get started easily. Send a JSON
document and it will try to detect the data structure, index the
data and make it searchable.
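The idea of detecting the data structure from a document can be sketched as below. This is an illustration under stated assumptions, not ElasticSearch's actual detection logic: the function name `guess_field_type` is hypothetical, the type names follow the 1.x-era field types (`string`, `long`, `double`, `boolean`), and details such as date detection are left out.

```python
import json


def guess_field_type(value):
    """Guess a field type name from a parsed JSON value (illustrative only)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        # Nested objects are described field by field.
        return {name: guess_field_type(item) for name, item in value.items()}
    if isinstance(value, list):
        # Assumption: an array takes the type of its first element.
        return guess_field_type(value[0]) if value else "unknown"
    return "unknown"


# Made-up sample document.
document = json.loads(
    '{"name": "Pune", "population": 1000, "coastal": false,'
    ' "location": {"lat": 18.5, "lon": 73.8}}'
)
mapping = guess_field_type(document)
print(mapping)
```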
Conflict management
Optimistic version control can be used where needed to ensure that data
is never lost due to conflicting changes from multiple processes.
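Optimistic version control can be sketched with a small in-memory store: every write bumps a version number, and a writer that passes the version it last read is rejected if someone else has written in between. The class names `VersionedStore` and `VersionConflict` are hypothetical, and this is a simplified model of the technique rather than ElasticSearch's own implementation.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of a document."""


class VersionedStore:
    """In-memory store that rejects writes made against an outdated version."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        """Return the (version, document) pair for an existing document."""
        return self._docs[doc_id]

    def put(self, doc_id, document, expected_version=None):
        """Store the document; fail if expected_version is not the current one."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version


store = VersionedStore()
store.put("1", {"stock": 10})                           # creates version 1
version, doc = store.get("1")                           # two writers read version 1
store.put("1", {"stock": 9}, expected_version=version)  # first writer wins
try:
    store.put("1", {"stock": 8}, expected_version=version)  # second writer is stale
except VersionConflict as error:
    print("conflict:", error)
```

The losing writer is expected to re-read the document and retry, so neither change is silently overwritten.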
Active community
The community, besides creating nice tools and plugins, is very helpful and supportive. The overall vibe is really great, and
this is an important metric of any OSS project.
There are also some books currently being written by community members, and many blog posts around the net sharing
experiences and knowledge.
30. Searching
Search across all indexes and all types
http://localhost:9200/_search
Search across all types in the test index.
http://localhost:9200/test/_search
Search explicitly for documents of type cities within the test index.
http://localhost:9200/test/cities/_search
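The three URL forms above follow one pattern, which the small helper below tries to capture. The function name `search_url` is hypothetical; the base URL, index `test` and type `cities` are taken from the examples on this slide.

```python
def search_url(base_url, index=None, doc_type=None):
    """Build a _search URL for all indexes, one index, or one type in an index."""
    if doc_type is not None and index is None:
        raise ValueError("a type can only be given together with an index")
    parts = [base_url.rstrip("/")]
    if index is not None:
        parts.append(index)
    if doc_type is not None:
        parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)


print(search_url("http://localhost:9200"))
print(search_url("http://localhost:9200", "test"))
print(search_url("http://localhost:9200", "test", "cities"))
```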
There are 3 different types of search queries:
Full Text Search (query string)
Structured Search (filter)
Analytics (facets)
31. Routing
All the data lives in primary shards in the cluster, and you may have ‘N’ shards. Routing is the
process of determining which shard a document will reside in.
Without a routing value, a search has no way of knowing which shard holds the matching documents, so ElasticSearch
broadcasts the request to all shards. This is a non-negligible overhead and can easily impact performance.
Routing ensures that all documents with the same routing value are located on the same shard, eliminating the
need to broadcast searches and improving performance.
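The routing rule is commonly described as "hash of the routing value, modulo the number of primary shards". The sketch below shows that idea; the function name `shard_for` is hypothetical, and CRC32 is only a stand-in, since the hash function ElasticSearch uses internally is different.

```python
import zlib


def shard_for(routing_value, number_of_primary_shards):
    """Pick a shard as hash(routing value) modulo the number of primary shards."""
    # Assumption: CRC32 stands in for ElasticSearch's internal hash function.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


shards = 5
for user in ["alice", "bob", "alice"]:
    # The same routing value always lands on the same shard.
    print(user, "-> shard", shard_for(user, shards))
```

Because the result depends on the shard count, the same formula also suggests why the number of primary shards is normally fixed when an index is created.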
32. Data Synchronization
ElasticSearch supports rivers. A river is a pluggable service that runs within the ElasticSearch cluster and pulls data (or is
pushed data) that is then indexed into the cluster (https://github.com/jprante/ElasticSearch-river-jdbc).
Rivers are available for MongoDB, CouchDB, RabbitMQ, Twitter, Wikipedia, MySQL, etc.
The relational data is internally transformed into structured JSON objects for the schema-less
indexing model of ElasticSearch documents.
The plugin can fetch data from different RDBMS sources in parallel, and multithreaded bulk mode
ensures high throughput when indexing to ElasticSearch.
Typically, a worker role is implemented as a layer within the application to push data/entities to ElasticSearch.
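The push approach, and the transformation of relational rows into JSON documents, can be sketched as follows. This is not the river plugin's code: the helper name `rows_to_bulk_body`, the sample rows and the column names are hypothetical. The output follows the newline-delimited layout of the bulk API (one action line, then one document line), using the 1.x-era `_type` field and the `test`/`cities` names from the earlier slides.

```python
import json


def rows_to_bulk_body(rows, index, doc_type, id_column):
    """Turn relational rows (dicts) into a newline-delimited bulk request body."""
    lines = []
    for row in rows:
        action = {
            "index": {"_index": index, "_type": doc_type, "_id": str(row[id_column])}
        }
        # The id travels in the action line, the remaining columns form the document.
        source = {name: value for name, value in row.items() if name != id_column}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format expects every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up rows, as they might come back from a SQL query.
rows = [
    {"id": 1, "name": "Pune", "country": "India"},
    {"id": 2, "name": "Oslo", "country": "Norway"},
]
body = rows_to_bulk_body(rows, "test", "cities", "id")
print(body)
```

An application-side worker would send this body to the `_bulk` endpoint in batches instead of issuing one request per row.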