In this presentation, we are going to discuss how elasticsearch handles the various operations like insert, update, delete. We would also cover what is an inverted index and how segment merging works.
Session 8 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
This technical session for Local Experts in Data Sharing (LEBDs), this session will explain how to create data processing services that are key to i4Trust.
ElasticSearch introduction talk. Overview of the API, functionality, use cases. What can be achieved, how to scale? What is Kibana, how it can benefit your business.
In this presentation, we are going to discuss how elasticsearch handles the various operations like insert, update, delete. We would also cover what is an inverted index and how segment merging works.
Session 8 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
This technical session for Local Experts in Data Sharing (LEBDs), this session will explain how to create data processing services that are key to i4Trust.
ElasticSearch introduction talk. Overview of the API, functionality, use cases. What can be achieved, how to scale? What is Kibana, how it can benefit your business.
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers that are used to build php/mysql apps to broaden their horizon when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
Deep Dive on ElasticSearch Meetup event on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NOSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands on session on ElasticSearch
This presentation contains differences between Elasticsearch and relational Databases. Along with that it also has some Glossary Of Elasticsearch and its basic operation.
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka Elasticsearch Tutorial will help you in understanding the fundamentals of Elasticsearch along with its practical usage and help you in building a strong foundation in ELK Stack. This video helps you to learn following topics:
1. What Is Elasticsearch?
2. Why Elasticsearch?
3. Elasticsearch Advantages
4. Elasticsearch Installation
5. API Conventions
6. Elasticsearch Query DSL
7. Mapping
8. Analysis
9 Modules
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka tutorial on What Is ELK Stack will help you in understanding the fundamentals of Elasticsearch, Logstash, and Kibana together and help you in building a strong foundation in ELK Stack. Below are the topics covered in this ELK tutorial for beginners:
1. Need for Log Analysis
2. Problems with Log Analysis
3. What is ELK Stack?
4. Features of ELK Stack
5. Companies Using ELK Stack
A brief presentation outlining the basics of elasticsearch for beginners. Can be used to deliver a seminar on elasticsearch.(P.S. I used it) Would Recommend the presenter to fiddle with elasticsearch beforehand.
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
요즘 Hadoop 보다 더 뜨고 있는 Spark.
그 Spark의 핵심을 이해하기 위해서는 핵심 자료구조인 Resilient Distributed Datasets (RDD)를 이해하는 것이 필요합니다.
RDD가 어떻게 동작하는지, 원 논문을 리뷰하며 살펴보도록 합시다.
http://www.cs.berkeley.edu/~matei/papers/2012/sigmod_shark_demo.pdf
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Presented by Adrien Grand, Software Engineer, Elasticsearch
Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries make Lucene a perfect fit for analytics applications and, for some use-cases, even a credible replacement for a primary data-store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, give insights into how Lucene stores data on disk and how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.
Kong, Keyrock, Keycloak, i4Trust - Options to Secure FIWARE in ProductionFIWARE
This training camp teaches you how FIWARE technologies and iSHARE, brought together under the umbrella of the i4Trust initiative, can be combined to provide the means for creation of data spaces in which multiple organizations can exchange digital twin data in a trusted and efficient manner, collaborating in the development of innovative services based on data sharing and creating value out of the data they share. SMEs and Digital Innovation Hubs (DIHs) will be equipped with the necessary know-how to use the i4Trust framework for creating data spaces!
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
Who doesn't love building high-available, scalable systems holding multiple Terabytes of data? Recently we had the pleasure to crack some tough nuts to solve the problems and we'd love to share our findings designing, building up and operating a 120 Node, 6TB Elasticsearch (and Hadoop) cluster with the community.
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers that are used to build php/mysql apps to broaden their horizon when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
Deep Dive on ElasticSearch Meetup event on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NOSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands on session on ElasticSearch
This presentation contains differences between Elasticsearch and relational Databases. Along with that it also has some Glossary Of Elasticsearch and its basic operation.
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka Elasticsearch Tutorial will help you in understanding the fundamentals of Elasticsearch along with its practical usage and help you in building a strong foundation in ELK Stack. This video helps you to learn following topics:
1. What Is Elasticsearch?
2. Why Elasticsearch?
3. Elasticsearch Advantages
4. Elasticsearch Installation
5. API Conventions
6. Elasticsearch Query DSL
7. Mapping
8. Analysis
9 Modules
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka tutorial on What Is ELK Stack will help you in understanding the fundamentals of Elasticsearch, Logstash, and Kibana together and help you in building a strong foundation in ELK Stack. Below are the topics covered in this ELK tutorial for beginners:
1. Need for Log Analysis
2. Problems with Log Analysis
3. What is ELK Stack?
4. Features of ELK Stack
5. Companies Using ELK Stack
A brief presentation outlining the basics of elasticsearch for beginners. Can be used to deliver a seminar on elasticsearch.(P.S. I used it) Would Recommend the presenter to fiddle with elasticsearch beforehand.
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
요즘 Hadoop 보다 더 뜨고 있는 Spark.
그 Spark의 핵심을 이해하기 위해서는 핵심 자료구조인 Resilient Distributed Datasets (RDD)를 이해하는 것이 필요합니다.
RDD가 어떻게 동작하는지, 원 논문을 리뷰하며 살펴보도록 합시다.
http://www.cs.berkeley.edu/~matei/papers/2012/sigmod_shark_demo.pdf
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Presented by Adrien Grand, Software Engineer, Elasticsearch
Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries make Lucene a perfect fit for analytics applications and, for some use-cases, even a credible replacement for a primary data-store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, give insights into how Lucene stores data on disk and how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.
Kong, Keyrock, Keycloak, i4Trust - Options to Secure FIWARE in ProductionFIWARE
This training camp teaches you how FIWARE technologies and iSHARE, brought together under the umbrella of the i4Trust initiative, can be combined to provide the means for creation of data spaces in which multiple organizations can exchange digital twin data in a trusted and efficient manner, collaborating in the development of innovative services based on data sharing and creating value out of the data they share. SMEs and Digital Innovation Hubs (DIHs) will be equipped with the necessary know-how to use the i4Trust framework for creating data spaces!
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
Who doesn't love building high-available, scalable systems holding multiple Terabytes of data? Recently we had the pleasure to crack some tough nuts to solve the problems and we'd love to share our findings designing, building up and operating a 120 Node, 6TB Elasticsearch (and Hadoop) cluster with the community.
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
Slides from my talk during ApacheCon EU 2012 - "Battle of the giants: Apache Solr vs ElasticSearch". Video available at http://player.vimeo.com/video/55645629
From zero to hero - Easy log centralization with Logstash and ElasticsearchRafał Kuć
Presentation I gave during DevOps Days Warsaw 2014 about combining Elasticsearch, Logstash and Kibana together or use our Logsene solution instead of Elasticsearch.
ElasticSearch in Production: lessons learnedBeyondTrees
With Proquest Udini, we have created the worlds largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M solr cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and challenges aligning with our continuous deployment processes.
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
Search faceting presentation and comparison of BigData textual content indexing/search analytics solutions. Presentation focuses on the comparison of the open source solutions provided by Apache Solr to those of ElasticSearch.
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPSujit Pal
Presented at Spark Summit EU 2015 at Amsterdam. Details the SoDA project, a micro-service accessible via Spark for naive annotation of large volumes of text against very large lexicons.
Solr Exchange: Introduction to SolrCloudthelabdude
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will provide an architectural overview of SolrCloud and highlight its most important features. Specifically, Tim covers topics such as: sharding, replication, ZooKeeper fundamentals, leaders/replicas, and failure/recovery scenarios. Any discussion of a complex distributed system would not be complete without a discussion of the CAP theorem. Mr. Potter will describe why Solr is considered a CP system and how that impacts the design of a search application.
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...Dataconomy Media
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the Author:
Currently Vadim is a Senior Machine Learning Engineer at source{d} where he works on deep neural networks that aim to understand all of the world's developers through their code. Vadim is one of the creators of the distributed deep learning platform Veles (https://velesnet.ml) while working at Samsung. Afterwards Vadim was responsible for the machine learning efforts to fight email spam at Mail.Ru. In the past Vadim was also a visiting associate professor at Moscow Institute of Physics and Technology, teaching about new technologies and conducting ACM-like internal coding competitions. Vadim is also a big fan of GitHub (vmarkovtsev) and HackerRank (markhor), as well as likes to write technical articles on a number of web sites.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Market overview of Docker orchestrators. A detailed architecture's comparison of Kubernetes and Docker Swarm, including benefits and issues. Which orchestrator works better for microservice and highly available applications?
Scality CTO Giorgio Regni and Software Engineer Lauren Spiegel talk about the open source S3 clone, written in Node.js. This presentation was given at a meetup on September 1, 2016 in San Francisco.
These slides were presented at the Great Indian Developer Summit 2014 at Bangalore. See http://www.developermarch.com/developersummit/session.html?insert=ShalinMangar2
"SolrCloud" is the name given to Apache Solr's feature set for fault tolerant, highly available, and massively scalable capabilities. SolrCloud has enabled organizations to scale, impressively, into the billions of documents with sub-second search!
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
2. Who Am I
• „Solr 3.1 Cookbook” author
• Sematext software engineer
• Solr.pl co-founder
• Father and husband :-)
Copyright 2012 Sematext Int’l. All rights reserved
3. What Will I Talk About ?
• ElasticSearch scaling
• Indexing thousands of documents per second
• Performing queries in tens of milliseconds
• Controling shard and replica placement
• Handling multilingual content
• Performance testing
• Cluster monitoring
Copyright 2012 Sematext Int’l. All rights reserved
4. The Challenge
• More than 50 millions of documents a day
• Real time search
• Less than 200ms average query latency
• Throughput of at least 1000 QPS
• Multilingual indexing
• Multilingual querying
Copyright 2012 Sematext Int’l. All rights reserved
5. Why ElasticSearch ?
• Written with NRT and cloud support in mind
• Uses Lucene and all its goodness
• Distributed indexing with document
distribution control out of the box
• Easy index, shard and replicas creation on live
cluster
Copyright 2012 Sematext Int’l. All rights reserved
6. Index Design
• Several indices (at least one index for each day
of data)
• Indices divided into multiple shards
• Multiple replicas of a single shard
• Real-time, synchronous replication
• Near-real-time index refresh (1 to 30 seconds)
Copyright 2012 Sematext Int’l. All rights reserved
7. Shard Deployment Problems
• Multiple shards per node
• Replicas on the same nodes as shards
• Not evenly distributed shards and replicas
• Some nodes being hot, while others are cold
Copyright 2012 Sematext Int’l. All rights reserved
9. What Can We Do With Shards Then ?
• Contol shard placement with node tags:
– index.routing.allocation.include.tag
– index.routing.allocation.exclude.tag
• Control shard placement with nodes IP
addresses:
– cluster.routing.allocation.include._ip
– cluster.routing.allocation.exclude._ip
• Specified on index or cluster level
• Can be changed on live cluster !
Copyright 2012 Sematext Int’l. All rights reserved
11. Number of Shards Per Node
• Allows one to specify number of shards per
node
• Specified on index level
• Can be changed on live indices
• Example:
curl -XPUT localhost:9200/sematext -d '{
"index.routing.allocation.total_shards_per_node" : 2
}'
Copyright 2012 Sematext Int’l. All rights reserved
13. Does Routing Matters ?
• Controls target shard for each document
• Defaults to hash of a document identifier
• Can be specified explicitly (routing parameter) or
as a field value (a bit less performant)
• Can take any value
• Example:
curl -XPUT localhost:9200/sematext/test/1?routing=1234 -d '{
"title" : "Test routing document"
}'
Copyright 2012 Sematext Int’l. All rights reserved
14. Indexing the Data
Shard Replica Shard Replica
1 2 3 1
Node 1 Node 2
Shard Replica
2 3
Node 3
ElasticSearch Cluster
Indexing application
Copyright 2012 Sematext Int’l. All rights reserved
15. How We Indexed Data
Shard 1 Shard 2
Node 1 Node 2
Node 3
ElasticSearch Cluster
Indexing application
Copyright 2012 Sematext Int’l. All rights reserved
16. Nodes Without Data
• Nodes used only to route data and queries to
other nodes in the cluster
• Such nodes don’t suffer from I/O waits (of
course Data Nodes don’t suffer from I/O waits
all the time)
• Not default ElasticSearch behavior
• Setup by setting node.data to false
Copyright 2012 Sematext Int’l. All rights reserved
17. Multilingual Indexing
• Detection of document's language before
sending it for indexing
• With, e.g. Sematext LangID or Apache Tika
• Set known language analyzers in configuration
or mappings
• Set analyzer during indexing (_analyzer field)
Copyright 2012 Sematext Int’l. All rights reserved
19. Multilingual Queries
• Identify language of query before its execution
(can be problematic)
• Query analyzer can be specified per query
(analyzer parameter):
curl -XGET
localhost:9200/sematext/_search?q=let+AND+me&analyzer=english
Copyright 2012 Sematext Int’l. All rights reserved
20. Query Performance Factors – Lucene
level
• Refresh interval
– Defaults to 1 second
– Can be specified on cluster or index level
– curl -XPUT localhost:9200/_settings -d '{ "index" : {
"refresh_interval" : "600s" } }'
• Merge factor
– Defaults to 10
– Can be specified on cluster or index level
– curl -XPUT localhost:9200/_settings -d '{ "index" : {
"merge.policy.merge_factor" : 30 } }'
Copyright 2012 Sematext Int’l. All rights reserved
21. Let’s Talk About Routing Once Again
• Routes a query to a particular shard
• Speeds up queries depending on number of
shards for a given index
• Have to be specified manualy with routing
parameter during query
• routing parameter can take any value:
curl -XGET
'localhost:9200/sematext/_search?q=test&routing=2012-02-16'
Copyright 2012 Sematext Int’l. All rights reserved
22. Querying ElasticSearch – No Routing
Shard 1 Shard 2 Shard 3 Shard 4
Shard 5 Shard 6 Shard 7 Shard 8
ElasticSearch Index
Application
Copyright 2012 Sematext Int’l. All rights reserved
23. Querying ElasticSearch – With Routing
Shard 1 Shard 2 Shard 3 Shard 4
Shard 5 Shard 6 Shard 7 Shard 8
ElasticSearch Index
Application
Copyright 2012 Sematext Int’l. All rights reserved
24. Performance Numbers
Queries without routing (200 shards, 1 replica)
#threads Avg response time Throughput 90% line Median CPU Utilization
1 3169ms 19,0/min 5214ms 2692ms 95 – 99%
Queries with routing (200 shards, 1 replica)
#threads Avg response time Throughput 90% line Median CPU Utilization
10 196ms 50,6/sec 642ms 29ms 25 – 40%
20 218ms 91,2/sec 718ms 11ms 10 – 15%
Copyright 2012 Sematext Int’l. All rights reserved
25. Scaling Query Throughput – What Else ?
• Increasing the number of shards for data
distribution
• Increasing the number of replicas
• Using routing
• Avoid always hitting the same node and
hotspotting it
Copyright 2012 Sematext Int’l. All rights reserved
26. FieldCache and OutOfMemory
• ElasticSearch default setup doesn’t limit field
data cache size
Copyright 2012 Sematext Int’l. All rights reserved
27. FieldCache – What We Can do With It ?
• Keep its default type and set:
– Maximum size (index.cache.field.max_size)
– Expiration time (index.cache.field.expire)
• Change its type:
– soft (index.cache.field.type)
• Change your data:
– Make your fields less precise (ie: dates)
– If you sort or facet on fields think if you can reduce
fields granularity
• Buy more servers :-)
Copyright 2012 Sematext Int’l. All rights reserved
29. Additional Problems We Encountered
• Rebalancing after full cluster restarts
– cluster.routing.allocation.disable_allocation
– cluster.routing.allocation.disable_replica_allocation
• Long startup and initialization
• Faceting with strings vs faceting on numbers on
high cardinality fields
Copyright 2012 Sematext Int’l. All rights reserved
30. JVM Optimization
• Remember to leave enough memory to OS for
cache
• Make GC frequent ans short vs. rare and long
– -XX:+UseParNewGC
– -XX:+UseConcMarkSweepGC
– -XX:+CMSParallelRemarkEnabled
• -XX:+AlwaysPreTouch (for short performance
tests)
Copyright 2012 Sematext Int’l. All rights reserved
31. Performance Testing
• Data
– How much data do I need ?
– Choosing the right queries
• Make changes
– One change at a time
– Understand the impact of the change
• Monitor your cluster (jstat, dstat/vmstat,
SPM)
• Analyze your results
Copyright 2012 Sematext Int’l. All rights reserved
32. ElasticSearch Cluster Monitoring
• Cluster health
• Indexing statistics
• Query rate
• JVM memory and garbage collector work
• Cache usage
• Node memory and CPU usage
Copyright 2012 Sematext Int’l. All rights reserved
33. Cluster Health
Node restart
Copyright 2012 Sematext Int’l. All rights reserved
38. CPU and Memory
Copyright 2012 Sematext Int’l. All rights reserved
39. Summary
• Controlling shard and replica placement
• Indexing and querying multilingual data
• How to use sharding and routing and not to
tear your hair out
• How to test your cluster performance to find
bottle-necks
• How to monitor your cluster and find
problems right away
Copyright 2012 Sematext Int’l. All rights reserved
40. We Are Hiring !
• Dig Search ?
• Dig Analytics ?
• Dig Big Data ?
• Dig Performance ?
• Dig working with and in open – source ?
• We’re hiring world – wide !
http://sematext.com/about/jobs.html
Copyright 2012 Sematext Int’l. All rights reserved
41. How to Reach Us
• Rafał Kuć
– Twitter: @kucrafal
– E-mail: rafal.kuc@sematext.com
• Sematext
– Twitter: @sematext
– Website: http://sematext.com
• Graphs used in the presentation are from:
– SPM for ElasticSearch (http://sematext.com/spm)
Copyright 2012 Sematext Int’l. All rights reserved