Deep Dive on ElasticSearch Meetup event on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NoSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands-on session on ElasticSearch
An introduction to Elasticsearch with a short demonstration in Kibana to present the search API. The slides cover:
- Quick overview of the Elastic stack
- Indexing
- Analyzers
- Relevance scoring
- One use case of Elasticsearch
The query used for the Kibana demonstration can be found here:
https://github.com/melvynator/elasticsearch_presentation
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train... (Edureka!)
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka Elasticsearch Tutorial will help you understand the fundamentals of Elasticsearch along with its practical usage, and help you build a strong foundation in the ELK Stack. This video covers the following topics:
1. What Is Elasticsearch?
2. Why Elasticsearch?
3. Elasticsearch Advantages
4. Elasticsearch Installation
5. API Conventions
6. Elasticsearch Query DSL
7. Mapping
8. Analysis
9. Modules
ElasticSearch in Production: lessons learned (BeyondTrees)
With Proquest Udini, we have created the world's largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M Solr cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and challenges in aligning with our continuous deployment processes.
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers that are used to build php/mysql apps to broaden their horizon when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
From Lucene to Elasticsearch, a short explanation of horizontal scalability (Stéphane Gamard)
What makes Elasticsearch "horizontally" scalable while Lucene is not? How does the technology of one affect the other? How does Elasticsearch scale over Lucene, and what are the limiting factors?
Building a Large Scale SEO/SEM Application with Apache Solr (Rahul Jain)
Slides from my talk on "Building a Large Scale SEO/SEM Application with Apache Solr" at Lucene/Solr Revolution 2014, where I discuss how we handle indexing/search of 40 billion records (documents)/month in Apache Solr with 4.6 TB of compressed index data.
Abstract: We are building a SEO/SEM application where an end user searches for a "keyword" or a "domain" and gets all the insights about these, including search engine ranking, CPC/CPM, search volume, number of ads, competitor details, etc., in a couple of seconds. To gather this intelligence, we pull huge web data from various sources; after intensive processing it amounts to 40 billion records/month in a MySQL database with 4.6 TB of compressed index data in Apache Solr.
Due to the large volume, we faced several challenges in improving indexing performance, search latency, and scaling the overall system. In this session, I will talk about our design approaches to import data faster from MySQL, tricks & techniques to improve indexing performance, distributed search, DocValues (a lifesaver), Redis, and the overall system architecture.
"ElasticSearch in action" by Thijs Feryn.
ElasticSearch is a really powerful search engine, NoSQL database & analytics engine. It is fast, it scales, and it's a child of the Cloud/BigData generation. This talk will show you how to get things done using ElasticSearch. The focus is on doing actual work, creating actual queries and achieving actual results. Topics that will be covered:
- Filters and queries
- Cluster, shard and index management
- Data mapping
- Analyzers and tokenizers
- Aggregations
- ElasticSearch as part of the ELK stack
- Integration in your code
Antidot Content Classifier - Get more value from your content (Antidot)
How can you semantically analyze and automatically classify millions of documents without having to read (or re-read) them?
Antidot makes the latest machine-learning technologies available to everyone, to:
- Automatically sort, classify and better organize your document management system or intranet: finding a document, or the information inside it, finally becomes possible.
- Recommend relevant documents, contextualized according to the user's profile.
- Finely segment paid content and deliver tailor-made subscriptions to your customers.
- Alert your users, in a highly targeted way, about new documents useful to their activity.
- Automatically route incoming requests according to their topic and level of urgency.
- Analyze social networks, tweets, e-mails and forum contributions to detect topics and respond in a targeted way.
- ... and many other use cases.
Take advantage of Antidot's innovations to boost your productivity and stay ahead of the pack!
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum... (DataStax)
Since the introduction of SASI in Cassandra 3.4, it is much easier than before to query data. Now you can create performant indices on your columns as well as benefit from full-text search capabilities with the introduction of the new `LIKE '%term%'` syntax.
This talk will show the architecture at a high level and expose all the trade-offs so you can choose and use SASI wisely.
We will also highlight some use cases where SASI is not a good fit and should be avoided (there is no magic, sorry).
To illustrate the talk, we'll use a sample database of 110,000 albums and artists and create indices on them.
About the Speaker
DuyHai DOAN, Apache Cassandra Evangelist, DataStax
DuyHai DOAN is an Apache Cassandra Evangelist at DataStax. He spends his time between technical presentations/meetups on Cassandra, coding on open source projects like Achilles or Apache Zeppelin to support the community and helping all companies using Cassandra to make their project successful. Previously he was working as a freelance Java/Cassandra consultant.
A brief presentation outlining the basics of Elasticsearch for beginners; it can be used to deliver a seminar on Elasticsearch (P.S. I used it for one). I would recommend that the presenter fiddle with Elasticsearch beforehand.
This presentation slide is a condensed theoretical overview of Elasticsearch, prepared by going through the official ES Definitive Guide and Practical Guide.
Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is a free and open-source distributed inverted index: a collection of indexed documents in a repository. It offers fast, incisive search against large volumes of data, with direct access to the data in its denormalized document storage. It is also, in general, a distributable and highly scalable database.
The importance of search for modern applications is evident, and nowadays it is higher than ever. A lot of projects use search forms as the primary interface for communication with a user. Still, implementing intelligent search functionality remains a challenge, and we need a good set of tools.
In this presentation, I will walk through the high-level architecture and benefits of Elasticsearch with some examples. Aside from that, we will also take a look at its existing competitors, their similarities, and their differences.
Deep dive into ElasticSearch (Ehsan Asgarian)
These slides cover the following topics:
an introduction to non-SQL (NoSQL) databases and search-engine fundamentals;
then an introduction to the Elasticsearch tool, its use cases, its overall architecture, and a comparison with similar tools;
and finally, adding a text analyzer and linking it with .NET.
A horizontally scalable, distributed database built on Apache Lucene that delivers a full-featured search experience across terabytes of data with a simple yet powerful API.
Learn more at http://infochimps.com
Elasticsearch, a distributed search engine with real-time analytics (Tiziano Fagni)
An overview of Elasticsearch: main features, architecture, limitations. It also includes a description of how to query data both using the REST API and using the elastic4s library, with specific attention to integrating the search engine with Apache Spark.
ElasticSearch - index server used as a document database (Robert Lujo)
Presentation held on 5.10.2014 at http://2014.webcampzg.org/talks/.
Although ElasticSearch's (ES) primary purpose is to be used as an index/search server, its feature set overlaps with that of a common NoSQL database, or more precisely, a document database.
Why could this be interesting, and how could it be used effectively?
Talk overview:
- ES - history, background, philosophy, feature-set overview, with a focus on indexing/search features
- a short presentation on how to get started - installation, indexing and search/retrieval
- a database should provide the following functions: store, search, retrieve -> differences between relational, document and search databases
- it is not unusual to additionally use ES as a document database (store and retrieve)
- a use case will be presented where ES can be used as the single database in the system (benefits and drawbacks)
- what happens if a relational database is introduced into the previously demonstrated system (benefits and drawbacks)
ES is a nice and, in reality, ready-to-use example that can change the perspective on developing some types of software systems.
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP (Sujit Pal)
Presented at Spark Summit EU 2015 at Amsterdam. Details the SoDA project, a micro-service accessible via Spark for naive annotation of large volumes of text against very large lexicons.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
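The baseline that these optimizations improve on can be sketched as a plain power-iteration PageRank. This is a minimal illustration only; the dict-based graph representation, damping factor and tolerance are illustrative choices, not details from the STICD work.

```python
# Minimal power-iteration PageRank (the baseline the optimizations
# above build on). graph: dict mapping node -> list of out-neighbors.
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:  # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        done = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if done:
            break
    return rank
```

Every optimization in the paragraph above (skipping converged vertices, collapsing in-identical vertices and chains, per-component ordering) targets the inner loop or the iteration count of exactly this kind of computation.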
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graphs: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms like PageRank commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation that is compact and cache-friendly. The experiments below compare primitive operations in different execution modes:
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
2. Agenda
Who am I?
Text searching
Full text based
Term based
Databases vs. Search engines
Why not simple SQL?
Why need Lucene?
Elasticsearch
Concepts/APIs
Network/Discovery
Split-brain Issue
Solutions
Data Structure
Inverted Index
SOLR – Dataverse’s Search
Why not SOLR for Consilience?
Elasticsearch – Consilience’s Search
Language integration
Python
Java
Scala
SPARK
Why Spark?
Where Spark?
When Spark?
Language support
Conclusion and Questions
3. Who am I?
Animesh Pandey
Computer Science grad student @
Northeastern University, Boston
Intern for Project Consilience for
Summer 2015
Job: integration of Elasticsearch and
Spark into the existing project
4. Text Searching
Text – a form of data
Text – available from various resources:
Internet, books, articles etc.
We are concerned with digital text, or with converting traditional text to digital
Digital text – internet, news articles, blogs, research papers
Traditional text – any text from a physical book, manuscript, typed papers,
newspapers etc.
Traditional text conversion to digital text:
Automatic - Optical Character Recognition (OCR), e.g. Tesseract by Google Inc.
Manual - type into a system
5. Full text based vs. Term based
Full text based search
Most general kind of search
Used everyday when using
Google, Bing or Yahoo
In the background it is much more
than a simple character-by-
character match
A lot of pre-processing is involved
in a full-text search
Term based search
Generally consists of exact term
matching
You can think of it as a SQL query
where you try to find documents that
contain the exact match of a
specified word
6. Databases vs. Search Engines
Both have unique strengths but also have overlapping capabilities
Similarities:
Both can act as data stores
Basic updates and modifications can be done using both
Differences:
Search Engines
Used for both structured as well
as unstructured data
The results are ordered as per
the relevance of the result to
the query
Databases
Used for structured data
There is no relevance
matching between the
query and results
7. Why not simple SQL?
MySQL provides us some ways to perform a full text search along with term
based searches BUT…
Needs the MyISAM storage engine. It was the default storage engine of MySQL.
MyISAM is optimized for read operations with few write operations, or maybe
none.
But you cannot avoid write (update/modify) operations.
MyISAM creates one index per table.
No. of tables = No. of indexes => more tables, more complexity.
Relational DBs have locks. They won't allow read/write operations if one
operation is already being executed.
8. How does a search engine help?
Efficient indexing of data
You don't need multiple indices as you did with databases
Index is on all fields/combinations of fields
Analyzing data
Text search
Tokenizing => splitting of text
Stemming => converting words to their root forms
Filtering => removal of certain words
Relevance Scoring
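The analysis steps named above (tokenizing, stemming, filtering) can be sketched in a few lines of Python. This is a toy pipeline with a made-up suffix-stripping stemmer and stopword list, not Elasticsearch's actual analyzers:

```python
# Toy text-analysis pipeline: tokenize -> filter stopwords -> stem.
STOPWORDS = {"the", "a", "an", "of", "to"}

def tokenize(text):
    # Splitting of text (whitespace tokenizer, lowercased).
    return text.lower().split()

def stem(token):
    # Crude root-form reduction via suffix stripping (illustrative only).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # Filtering (stopword removal), then stemming each surviving token.
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]
```

For example, `analyze("The searching of documents")` drops "the" and "of", then reduces the remaining tokens to their crude root forms.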
9. In order to solve the problems mentioned before, there are several
open source search engines…
10. Information Retrieval Software Library
Free/Open Source
Supported by Apache Foundation
Created by Doug Cutting
Since 1999
In order to use it, there are two Java libraries available…
APACHE LUCENE
11. Built on Lucene
Perfect for single server search
Part of the Lucene project (Lucene comes with Solr)
Large user and developer base
This is Dataverse’s search engine. Later we will discuss why using
Elasticsearch here won’t make a big difference
APACHE SOLR
12. {
"status" : 200,
"name" : "Fafnir",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.4.2",
"build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
"build_timestamp" : "2014-12-16T14:11:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}
ELASTICSEARCH
Free/Open source
Built on top of Lucene
Created by Shay Banon @kimchy
Current stable version is 1.6.0
Has wrappers in many languages
13. RESTful Service
JSON API over HTTP
Chrome Plugins – Marvel Sense and POSTman
Can be used from Java, Python and many other languages
High availability and clustering is very easy to set up
Long term persistence
What does Elasticsearch add to Lucene?
14. Elasticsearch is a “download and use” distro
Executables
Log files
Node Configs
Data Storage
├── bin
│ ├── elasticsearch
│ ├── elasticsearch.in.sh
│ └── plugin
├── config
│ ├── elasticsearch.yml
│ └── logging.yml
├── data
│ └── cluster1
├── lib
│ ├── elasticsearch-x.y.z.jar
│ ├── ...
│ └──
└── logs
├── elasticsearch.log
└── elasticsearch_index_search_slowlog.log
└── elasticsearch_index_indexing_slowlog.log
Jar
Distributions
15. Here we can initialize the basic configuration
required to start an ES node. The following
config settings are generally changed:
cluster.name – the cluster to which it’ll join
node.name – specify name of the node
node.master – whether the node is a master
node.data – whether this node will hold data
path.data – path of the index
path.conf – path of the config folder (scripts or
any file put in this folder)
path.logs – path of the logs
elasticsearch.yml – Config file of Elasticsearch
curl -XPUT "http://localhost:9200/social_media/" -d'
{
"settings": {
"node": {
"master": true
},
"path": {
"conf": "D:/social_media/config/"
},
"index": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
}'
16. Underlying Lucene Inverted Index
This is a term-to-document mapping
The inverted index contains terms mapped to
all documents in which they occur
Every document is paired with the term
frequency of the term being considered
Sum all term frequencies to get the corpus
frequency of the term
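The structure described above can be sketched in a few lines of Python. This is an illustrative in-memory version, not Lucene's actual on-disk format:

```python
from collections import Counter, defaultdict

# Inverted index: each term maps to its postings, i.e. the documents
# it occurs in, each paired with the term frequency in that document.
def build_inverted_index(docs):
    """docs: dict of doc_id -> list of analyzed tokens."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index

def corpus_frequency(index, term):
    # Sum of term frequencies across all documents containing the term.
    return sum(index.get(term, {}).values())
```

Looking up a term then returns exactly the document/frequency pairs the slide describes, and summing them gives the corpus frequency.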
17. Shards and Replicas
Primary Shard
Created when indexing
Index has 1..N primary shards
Persistent
This is the actual data
Replica Shard
Index has 0..N replica shards
Not persistent
This is a copy of the data
Promoted to primary shard if the node fails
18. Nodes discovery
Node discovery in ES uses multicast by default
Unicast is also possible
Can be modified by changing elasticsearch.yml
In multicast, the master node will send requests to all nodes to check
which are waiting for connection
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]
19. Split-brain Issue
Suppose we have a three-node cluster which has 1 master and 2 slaves
Suppose that, for some reason, the connection to NODE 2 fails
NODE 2 will promote its replica shards to primary shards and will convert itself to a
Master
The cluster will be in an inconsistent state
Indexing requests to NODE 2 won’t be reflected on NODE 1 and NODE 3
This will result in two different indices => different results
20. Solving the Split-brain issue
Specify the minimum number of master-eligible nodes in a cluster:
discovery.zen.minimum_master_nodes = (N/2 + 1), where N is the number of nodes in a
cluster
In the three-node cluster, the side with only one node will fail to elect a master, and production will
come to know about the issue
discovery.zen.ping.timeout should be increased in a slow network so that nodes get
extra time to ping each other
Default value is 3 seconds
21. Elasticsearch APIs
There are a number of APIs provided by Elasticsearch. We will
be covering the ones useful to us:
INDEX API
SETTING API
MAPPING API
TERMVECTOR/MTERMVECTOR API
BULK API
SEARCH API
22. Processing of Text using Analyzers (Settings API)
Analyzers help in manipulating the
text that is to be indexed.
Tokenizers, stemmers and token filters
are the most used analysis components.
Analyzers are usually given a name/id
so that they can be reused later with
any type of text.
There are other analyzers as well that
are based on term replacement,
regular-expression patterns and
punctuation characters.
Custom analyzers can also be
created in ES.
curl -XPUT
"http://localhost:9200/social_media/tweet/_settings" -d'
{
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"analysis": {
"analyzer": {
"my_english": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload",
"cust_stop"
]
}
},
"filter": {
"cust_stop": {
"type": "stop",
"stopwords_path": "stoplist.txt"
}
}
}
}
}'
23. Mapping of Documents to be indexed (Mappings API)
curl -XPUT
"http://localhost:9200/social_media/tweet/_mapping" -d
'{
"tweet": {
"properties": {
"_id": {
"type": "string",
"store": true,
"index": "not_analyzed"
},
"text": {
"type": "multi_field",
"fields": {
"text": {
"include_in_all": false,
"type": "string",
"store": false,
"index": "not_analyzed"
},
"_analyzed": {
"type": "string",
"store": true,
"index": "analyzed",
"term_vector":
"with_positions_offsets_payloads",
"analyzer": "my_english"
}
}
}
}}}
Elasticsearch auto-maps fields but we
can also specify the types.
Data types provided by ES:
String
Number
Boolean
Date-time
Geo-point (coordinates)
Attachment (requires plugin)
Consilience uses this for indexing PDF
files
24. Creation of Index
Specifying settings and mappings and sending a PUT request to
Elasticsearch initializes the index
Now the task is to send documents to Elasticsearch
We have to keep in mind the mappings of each field in the document
Document Metadata fields
_id : identifier of the document
_index : index name
_type : mapping type
_source : enabled/disabled
_timestamp
_ttl
_size : size of uncompressed _source
_version
25. Indexing a document (Index API)
curl -XPOST
"http://localhost:9200/social_media/tweet/616272192
012165183" -d '{
"_source": {
"text": "random text",
"exact_text": "random text"
}
}'
For ES 1.6.0+
curl -XPOST
"http://localhost:9200/social_media/tweet/616272192
012165183" -d '{
"text": "random text",
"exact_text": "random text"
}'
{
'_index': 'social_media',
'_type': 'tweet',
'_id': '616272192012165120',
'_source': {
'text': '@bshor Thanks for the info; this will
help us. Are these the 2 datasets you were
uploading? https://t.co/W1M4vrQUEI
https://t.co/ITRycQnPKz',
'exact_text': '@bshor Thanks for the info; this
will help us. Are these the 2 datasets you were
uploading? https://t.co/W1M4vrQUEI
https://t.co/ITRycQnPKz'
}
}
Document structure | Indexing a new document
27. Processing independent documents
This can be done by using the Analyze API
The analyzer my_english was defined earlier (in the Settings API slide)
The DSL below produces the following result, where the analyzed text was
"Text to analyze"
curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english&text=Text to analyze"
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "analyze",
"start_offset": 8,
"end_offset": 15,
"type": "word",
"position": 3
}
]
}
28. Working with Shingles
Shingles are a way to index groups of
tokens, like unigrams, bigrams etc.
"shingle_filter" : {
"type" : "shingle",
"min_shingle_size" : 2, // for bigrams
"max_shingle_size" : 2,
"output_unigrams": true
}
curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english_shingle&text=Text to analyze"
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "text _",
"start_offset": 0,
"end_offset": 8,
"type": "shingle",
"position": 1
},
{
"token": "_ analyze",
"start_offset": 8,
"end_offset": 15,
"type": "shingle",
"position": 2
},
{
"token": "analyze",
"start_offset": 8,
"end_offset": 15,
"type": "word",
"position": 3
}
]
}
This filter can be used in the
termvector API to get
vectors containing both
unigrams and bigrams
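The shingle output above, including the "_" filler where the stopword "to" was removed, can be sketched in a few lines. The function below is an illustration of the filter's behavior on position-annotated tokens, not the actual Lucene implementation; the filler token "_" matches what the ES response shows.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True, filler="_"):
    """Sketch of the shingle token filter. `tokens` is a list of
    (token, position) pairs; a position with no token (a removed
    stopword) is replaced by the filler, as in the ES output above."""
    by_pos = dict((pos, tok) for tok, pos in tokens)
    last = max(by_pos)
    out = []
    if output_unigrams:
        out.extend(tok for tok, _ in tokens)
    for start in range(min(by_pos), last + 1):
        for size in range(min_size, max_size + 1):
            if start + size - 1 > last:
                continue  # window would run past the final position
            window = [by_pos.get(p, filler) for p in range(start, start + size)]
            out.append(" ".join(window))
    return out

# "text" at position 1, stopword gap at 2, "analyze" at position 3
result = shingles([("text", 1), ("analyze", 3)])
```

Running this on the analyzed tokens yields the same four tokens as the curl call: the unigrams "text" and "analyze" plus the gapped bigrams "text _" and "_ analyze".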
29. Searching in Index (Search API)
Default search
curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
{
  "query": {
    "match": {
      "text._analyzed": "some Texts" // will search for "some text", "some" and "text"
    }
  },
  "explain": true
}'
Exact phrase matching
curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
{
  "query": {
    "match_phrase": {
      "text": "some Texts" // will search for "some Texts" as a phrase
    }
  },
  "explain": true
}'
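The difference between the two queries can be sketched as follows. This is a deliberately naive model: the stand-in analyzer only lowercases and splits (real ES would also stem "Texts" to "text"), and there is no scoring. It captures the core contrast: match succeeds if any analyzed term occurs, match_phrase requires the terms consecutively and in order.

```python
def analyze(text):
    """Stand-in for the my_english analyzer: lowercase + whitespace split."""
    return text.lower().split()

def match(doc, query):
    """match query: true if ANY analyzed query term occurs in the document."""
    doc_terms = analyze(doc)
    return any(term in doc_terms for term in analyze(query))

def match_phrase(doc, query):
    """match_phrase query: all terms must appear consecutively, in order."""
    doc_terms, q = analyze(doc), analyze(query)
    return any(doc_terms[i:i + len(q)] == q
               for i in range(len(doc_terms) - len(q) + 1))

doc = "some text about Elasticsearch"
```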
30. Recommended Design Patterns
Keep the number of nodes odd
Take precautions to avoid the split-brain issue
Regularly refresh indices
Add refresh_interval to the settings
Manage heap size
ES_HEAP_SIZE <= ½ of the system’s RAM, but not more than 32GB
export ES_HEAP_SIZE=10g
./bin/elasticsearch -Xmx10g -Xms10g
Use Aliases
Searches are made through an alias that points to the underlying index
This prevents cluster downtime or delays that may occur while updating or modifying the index
Delete aliases when they become old and create new ones
You can create time-based aliases as well
Use Routing
A way to know which shard contains which document
Reduces the lookup time during searches
When bulk indexing
Allow a timeout after every push
Each push should be at most 2-3MB
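The routing point above can be sketched numerically. Elasticsearch picks a shard as hash(routing_value) modulo the number of primary shards (it uses murmur3 internally); crc32 and the shard count of 5 below are stand-ins for illustration only.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # fixed at index-creation time; assumed value

def shard_for(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Sketch of ES routing: shard = hash(routing) % number_of_primary_shards.
    Real Elasticsearch hashes with murmur3; crc32 just illustrates the idea."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards
```

Because all documents indexed with the same routing value land on the same shard, a search that supplies that routing value only has to query one shard instead of all of them, which is where the lookup-time saving comes from.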
31. Why not SOLR?
SOLR is a better search engine than Elasticsearch
But we require term vectors and analysis more than plain search
ES provides better APIs for analytics
termvector with field and term statistics
mtermvector
search with explain enabled
function_scoring (Didn’t mention before)
If you need only a search engine, go for SOLR. If you need something more
than that, Elasticsearch is the best choice.
32. Language Support
We have
Java wrapper : org.elasticsearch.*
Python wrapper : py-elasticsearch
Scala wrapper : elastic4s
Domain Specific Language (DSL) : cURL/JSON, as shown in every
previous example
33. Let’s add some SPARK to ES…
Apache Spark is an engine for large-scale data processing
It can run programs up to 100 times faster than Hadoop MapReduce (in memory)
Has language support for Python, Java, Scala and R
For Project Consilience:
Earlier, the plan was to make Spark both the starting and end point of the whole
application
i.e. read files using Spark, index them using Elasticsearch and apply clustering
using Spark’s MLlib
Flat-file reading is very direct in Spark
sc.textFile() => parallel reading of a file in chunks, as an RDD of lines
sc.wholeTextFiles() => loads complete files as (filename, content) pairs
34. Let’s add some SPARK to ES…
Earlier experiments were done in Scala
Scala gave us the advantage of functional programming along with parallel processing
Now Java 8 also provides functional programming, so Scala and Java won’t make much difference

import org.elasticsearch.spark._ // ES-Spark connector
val conf = new SparkConf()
  .setAppName("super_spark")
  .setMaster("local[2]")
  .set("spark.executor.memory", "1g")
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "localhost:9200")
// other configurations can be added as well
val sc = new SparkContext(conf)

// parallel reading for arrays. Same syntax in Java and Python
val data = sc.parallelize(1 to 10000).collect().filter(_ < 100)
data.foreach(println)

val textFile = sc.textFile("/home/cloudera/Documents/pg2265.txt")
val counts = textFile
  .flatMap(line => line.split(" "))           // all tokens in an array
  .filter(_.nonEmpty)                         // remove all empty tokens
  .map(word => (word.replaceAll("\\p{P}", "") // remove punctuation
  .toLowerCase(), 1))                         // convert to lower case
  .reduceByKey(_ + _)                         // add up counts per key
val thing = counts.collect()
sc.makeRDD(<put a Mapping here>).saveToEs("spark/docs")
35. Let’s add some SPARK to ES…
Tried the Spark-Hadoop-Elasticsearch connector but noticed some
overhead and unnecessary computations
The project currently won’t receive large volumes of data, nor receive
data frequently, so fast computation isn’t really required
What we want are features for clustering, and those features can easily be
provided by Elasticsearch
Maybe in the future, Spark will be added in the first phase of the project
As of now, Spark will be used for clustering of the documents; the
MLlib library provides APIs for this