2017 02-07 - elastic & spark. building a search geo locatorAlberto Paro
Presentazione dell'evento EsInRome del 7 Febbraio 2017 - Integrazione Elasticsearch in architettura BigData e facilità di integrazione con Apache Spark.
Psycopg2 - Connect to PostgreSQL using Python ScriptSurvey Department
It's the presentation slides I prepared for my college workshop. This demonstrates how you can talk with PostgreSql db using python scripting.For queries, mail at dipeshsuwal@gmail.com
You're stuck on a basic Windows estate, you can't pull the data out, there's no SIEM, and you have 20GB of logs you've been tasked to turn into actionable intelligence. Powershell brings not just in-built tools for querying Windows event logs, but also extremely powerful text processing tools. This talk will give you a quick overview of these features and its notable quirks, allowing you to pull off tricks that are often thought to be only for *NIX environments.
Implementing Glacier's Tree Hash using recursive, functional programming in Perl5. With Keyword::Declare we get clean syntax for tail-call elimination. Result is a simple, fast, functional solution.
2017 02-07 - elastic & spark. building a search geo locatorAlberto Paro
Presentazione dell'evento EsInRome del 7 Febbraio 2017 - Integrazione Elasticsearch in architettura BigData e facilità di integrazione con Apache Spark.
Psycopg2 - Connect to PostgreSQL using Python ScriptSurvey Department
It's the presentation slides I prepared for my college workshop. This demonstrates how you can talk with PostgreSql db using python scripting.For queries, mail at dipeshsuwal@gmail.com
You're stuck on a basic Windows estate, you can't pull the data out, there's no SIEM, and you have 20GB of logs you've been tasked to turn into actionable intelligence. Powershell brings not just in-built tools for querying Windows event logs, but also extremely powerful text processing tools. This talk will give you a quick overview of these features and its notable quirks, allowing you to pull off tricks that are often thought to be only for *NIX environments.
Implementing Glacier's Tree Hash using recursive, functional programming in Perl5. With Keyword::Declare we get clean syntax for tail-call elimination. Result is a simple, fast, functional solution.
This describes a Functional Programming approach to computing AWS Glacier "tree hash" values, hiding the tail-call elimination in Perl5 with a keyword and also shows how to accomplish the same result in Perl6.
This was the talk actually given at YAPC::NA 2016 by Dr. Conway and myself.
Dr. Hsieh is teaching how to use the state-of-the-art libraries, Spark by Apache, to conduct data analysis on hadoop platform in ISSNIP 2015, Singapore. He started with teaching the basic operations like “map, reduce, flatten, and more,” followed by explaining the extension of Spark, including MLib, GraphX, and SparkSQL.
This presentation will demonstrate how you can use the aggregation pipeline with MongoDB similar to how you would use GROUP BY in SQL and the new stage operators coming 3.4. MongoDB’s Aggregation Framework has many operators that give you the ability to get more value out of your data, discover usage patterns within your data, or use the Aggregation Framework to power your application. Considerations regarding version, indexing, operators, and saving the output will be reviewed.
A talk in JSDC.tw 2014. I introduce the advantage and disadvantage to write JavaScript in functional style. It covers simple Functional Programming concepts, how JavaScript becomes more functional, and all the difficulties people may encounter.
As presented at Confoo 2013.
More than some arcane NoSQL tool, Redis is a simple but powerful swiss army knife you can begin using today.
This talk introduces the audience to Redis and focuses on using it to cleanly solve common problems. Along the way, we'll see how Redis can be used as an alternative to several common PHP tools.
From Zero to Application Delivery with NixOSSusan Potter
Managing configurations for different kinds of nodes and cloud resources in a microservice architecture can be an operational nightmare, especially if not managed with the application codebase. CI and CD job environments often tend to stray from production configuration yielding their results unpredictable at best, or producing false positives in the worst case. Code pushes to staging and production can have unintended consequences which often can’t be inspected fully on a dry run.
This session will show you a toolchain and immutable infrastructure principles that will allow you to define your infrastructure in code versioned alongside your application code that will give you repeatable configuration, ephemeral testing environments, consistent CI/CD environments, and diffable dependency transparency all before pushing changes to production.
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppet
James Sweeney presents on "PuppetDB: A Single Source for Storing Your Puppet Data" at Puppet User Group NYC.
Video: http://www.youtube.com/watch?v=HTr4b02aU7A
Puppet NYC: http://www.meetup.com/puppetnyc-meetings/
Meet Ramda, a functional programming helper library which can replace Lodash and Underscore in various use-cases. Ramda is all curried and adds various facilities for increasing code reuse.
Php 102: Out with the Bad, In with the GoodJeremy Kendall
In this session, we'll look at a typical PHP application, review a few of the horrible mistakes the fictional developer made, and then refactor the app according to some best practices. Along the way you might even learn a thing or two about PHP you don't already know.
Presented by Gregg Donovan, Senior Software Engineer, Etsy.com, Inc.
Understanding the impact of garbage collection, both at a single node and a cluster level, is key to developing high-performance, high-availability Solr and Lucene applications. After a brief overview of garbage collection theory, we will review the design and use of the various collectors in the JVM.
At a single-node level, we will explore GC monitoring -- how to understand GC logs, how to monitor what % of your Solr request time is spend on GC, how to use VisualGC, YourKit, and other tools, and what to log and monitor. We will review GC tuning and how to measure success.
At a cluster-level, we will review how to design for partial availability -- how to avoid sending requests to a GCing node and how to be resilient to mid-request GC pauses.For application development, we will review common memory leak scenarios in custom Solr and Lucene application code and how to detect them.
This describes a Functional Programming approach to computing AWS Glacier "tree hash" values, hiding the tail-call elimination in Perl5 with a keyword and also shows how to accomplish the same result in Perl6.
This was the talk actually given at YAPC::NA 2016 by Dr. Conway and myself.
Dr. Hsieh is teaching how to use the state-of-the-art libraries, Spark by Apache, to conduct data analysis on hadoop platform in ISSNIP 2015, Singapore. He started with teaching the basic operations like “map, reduce, flatten, and more,” followed by explaining the extension of Spark, including MLib, GraphX, and SparkSQL.
This presentation will demonstrate how you can use the aggregation pipeline with MongoDB similar to how you would use GROUP BY in SQL and the new stage operators coming 3.4. MongoDB’s Aggregation Framework has many operators that give you the ability to get more value out of your data, discover usage patterns within your data, or use the Aggregation Framework to power your application. Considerations regarding version, indexing, operators, and saving the output will be reviewed.
A talk in JSDC.tw 2014. I introduce the advantage and disadvantage to write JavaScript in functional style. It covers simple Functional Programming concepts, how JavaScript becomes more functional, and all the difficulties people may encounter.
As presented at Confoo 2013.
More than some arcane NoSQL tool, Redis is a simple but powerful swiss army knife you can begin using today.
This talk introduces the audience to Redis and focuses on using it to cleanly solve common problems. Along the way, we'll see how Redis can be used as an alternative to several common PHP tools.
From Zero to Application Delivery with NixOSSusan Potter
Managing configurations for different kinds of nodes and cloud resources in a microservice architecture can be an operational nightmare, especially if not managed with the application codebase. CI and CD job environments often tend to stray from production configuration yielding their results unpredictable at best, or producing false positives in the worst case. Code pushes to staging and production can have unintended consequences which often can’t be inspected fully on a dry run.
This session will show you a toolchain and immutable infrastructure principles that will allow you to define your infrastructure in code versioned alongside your application code that will give you repeatable configuration, ephemeral testing environments, consistent CI/CD environments, and diffable dependency transparency all before pushing changes to production.
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppet
James Sweeney presents on "PuppetDB: A Single Source for Storing Your Puppet Data" at Puppet User Group NYC.
Video: http://www.youtube.com/watch?v=HTr4b02aU7A
Puppet NYC: http://www.meetup.com/puppetnyc-meetings/
Meet Ramda, a functional programming helper library which can replace Lodash and Underscore in various use-cases. Ramda is all curried and adds various facilities for increasing code reuse.
Php 102: Out with the Bad, In with the GoodJeremy Kendall
In this session, we'll look at a typical PHP application, review a few of the horrible mistakes the fictional developer made, and then refactor the app according to some best practices. Along the way you might even learn a thing or two about PHP you don't already know.
Presented by Gregg Donovan, Senior Software Engineer, Etsy.com, Inc.
Understanding the impact of garbage collection, both at a single node and a cluster level, is key to developing high-performance, high-availability Solr and Lucene applications. After a brief overview of garbage collection theory, we will review the design and use of the various collectors in the JVM.
At a single-node level, we will explore GC monitoring -- how to understand GC logs, how to monitor what % of your Solr request time is spend on GC, how to use VisualGC, YourKit, and other tools, and what to log and monitor. We will review GC tuning and how to measure success.
At a cluster-level, we will review how to design for partial availability -- how to avoid sending requests to a GCing node and how to be resilient to mid-request GC pauses.For application development, we will review common memory leak scenarios in custom Solr and Lucene application code and how to detect them.
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
Who doesn't love building high-available, scalable systems holding multiple Terabytes of data? Recently we had the pleasure to crack some tough nuts to solve the problems and we'd love to share our findings designing, building up and operating a 120 Node, 6TB Elasticsearch (and Hadoop) cluster with the community.
The importance of search for modern applications is evident and nowadays it is higher than ever. A lot of projects use search forms as a primary interface for communication with a user. Though implementation of an intelligent search functionality is still a challenge and we need a good set of tools.
In this presentation, I will talk through the high-level architecture and benefits of Elasticsearch with some examples. Aside from that, we will also take a look at its existing competitors, their similarities, and differences.
Elasticsearch Introduction to Data model, Search & AggregationsAlaa Elhadba
An overview of Elasticsearch features and explains performing smart search, data aggregations, and relevancy through scoring functions. How Elasticsearch works as a distributed scalable data storage. Finally, showcasing some use cases that are currently becoming core functionalities in Zalando.
ElasticSearch in Production: lessons learnedBeyondTrees
With Proquest Udini, we have created the worlds largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M solr cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and challenges aligning with our continuous deployment processes.
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...Amazon Web Services
Elasticsearch is a fully featured search engine used for real-time analytics, and Amazon Elasticsearch Service makes it easy to deploy Elasticsearch clusters on AWS. With Amazon ES, you can ingest and process billions of events per day, and explore the data using Kibana to discover patterns. In this session, we use Apache web logs as example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data into it using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)Amazon Web Services
Elasticsearch has quickly become the leading open source technology for scaling search and building document services on. Many software providers have come to rely on it to serve the needs of high-performance, production applications.
In this talk, we’ll go deep on lessons learned from three years in production scaling from a few shards to more than 100 spread across 100s of nodes on AWS--to serve real-time queries against 100s of millions of documents.
Attendees will learn:
* How to capacity plan for ES on AWS
* How to scale and reshard on AWS with zero downtime
* What AWS and ES metrics to collect and alert on
* Tips on day to day ES operations
Session sponsored by SignalFx.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers that are used to build php/mysql apps to broaden their horizon when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
Elasticsearch sur Azure : Make sense of your (BIG) data !Microsoft
Sous licence Apache2, elasticsearch est un moteur de recherche puissant, distribué et scalable. Il fournit également des agrégations en temps réel en fonction de vos besoins. Couplé à Kibana, dashboard générique et hautement personnalisable, il vous permet de donner immédiatement du sens à vos données. En forte progression au niveau de son adhésion par les entreprises et les sites publics, découvrez ce que sont elasticsearch et Kibana et à quel point il est simple de les déployer facilement sur la plate-forme Windows Azure. Thomas et David illustreront à l'aide de cas clients les bénéfices obtenus à travers ces solutions.
Speakers : Thomas Conté (Microsoft), David Pilato (Elasticsearch)
Service discovery and configuration provisioningSource Ministry
Slides from our talk "Service discovery and configuration provisioning" presented by Mariusz Gil at PHP Benelux 2016
Apache Zookeeper or Consul are almost completely unknown in the PHP world, although its use solves a lot of typical problems. In a nutshell, they are a central services of provisioning configuration information, distributed synchronization and coordination of servers/processes. It simplifies the processes of application configuration management, so it is possible to change its settings and operation in real time (eg. feature flagging). During the presentation the typical cases of use of Zookeeper/Consul in PHP applications will be presented, both strictly web and workers running from the CLI.
Burn down the silos! Helping dev and ops gel on high availability websitesLindsay Holmwood
HA websites are where the rubber meets the road - at 200km/h. Traditional separation of dev and ops just doesn't cut it.
Everything is related to everything. Code relies on performant and resilient infrastructure, but highly performant infrastructure will only get a poorly written application so far. Worse still, root cause analysis in HA sites will more often than not identify problems that don't clearly belong to either devs or ops.
The two options are collaborate or die.
This talk will introduce 3 core principles for improving collaboration between operations and development teams: consistency, repeatability, and visibility. These principles will be investigated with real world case studies and associated technologies audience members can start using now. In particular, there will be a focus on:
- fast provisioning of test environments with configuration management
- reliable and repeatable automated deployments
- application and infrastructure visibility with statistics collection, logging, and visualisation
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...Codemotion
In the past few years I have helped a lot of customers optimising their elastic cluster. With each version elasticsearch has more options to track performance of your nodes and recently profiling your queries was added. In this talk I am going to discuss the steps you have to take when starting with elasticsearch. The choices you have to make for the size of your cluster, the amount of indexes, amount of shards, choosing the right mappings, and creating better queries. After the setup I'll continue showing how to monitor your cluster and profile your queries.
Introduction to automation in the cloud, why it's needed, what are the tools or ways of working, the processes, the best practises with some examples and takeaways.
DevOps Fest 2019. Сергей Марченко. Terraform: a novel about modules, provider...DevOps_Fest
В Dev-Pro DevOps-специалисты работают с Terraform в рамках Azure. Команда работает с множеством окружений и ресурсов, среди которых есть AKS (Kubernetes). Сергей поделится опытом успешного написания модулей и провайдеров для Terraform.
Attack monitoring using ElasticSearch Logstash and KibanaPrajal Kulkarni
With growing trend of Big data, companies are tend to rely on high cost SIEM solutions. However, with introduction of open source and lightweight cluster management solution like ElasticSearch this has been the highlight of the year. Similarly, the log aggregation has been simplified by logstash and kibana providing a visual look to the complex data structure. This presentation will exactly cater to this need of having a appropriate log analysis+Detecting Intrusion+Visualizing data in a powerful interface.
Elasticsearch (R)Evolution — You Know, for Search… by Philipp Krenn at Big Da...Big Data Spain
Elasticsearch is a distributed, RESTful search and analytics engine built on top of Apache Lucene. After the initial release in 2010 it has become the most widely used full-text search engine, but it is not stopping there. The revolution happened and now it is time for evolution. We dive into current improvements and new features — how to make a great product even better.
https://www.bigdataspain.org/2017/talk/elasticsearch-revolution-you-know-for-search
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
LUISS - Deep Learning and data analyses - 09/01/19Alberto Paro
My participation to the course "Data Analysis, Mobility, Proximity and App-based Marketing".
A new perspective on how data support companies on strategic decisions.
Elasticsearch in architetture Big Data - EsInADay-2017Alberto Paro
ElasticSearch è diventato una componente essenziale nelle architetture Big Data odierne (FastData), non solo per la sua funzione di motore di ricerca, ma soprattutto per il vantaggio competitivo che i suoi anaytics in real-time offrono. In questo breve talk vedremo il posizionamento di ElasticSearch all’interno del panorama NoSQL, esempi di architetture Big Data che sfruttano le sue caratteristiche e facilità di integrazione con tools come Apache Spark.
2017 02-07 - elastic & spark. building a search geo locatorAlberto Paro
Using Elasticsearch in a BigData environment is very simple. In this talk, we analyse what's Big Data and we show how it is easy integrating ElasticSearch with Apache Spark
2016 02-24 - Piattaforme per i Big DataAlberto Paro
Saper valutare la corretta soluzione NoSQL o soluzione Big Data per il proprio business è essenziale. Non tutti i datastore NoSQL sono uguali come non sono uguali le necessità di trattamento del dato nel proprio business. Cerchiamo di fare chiarezza sui temi principali del Big Data.
What's Big Data? - Big Data Tech - 2015 - FirenzeAlberto Paro
Big Data Tech - 2015 - Florence
Technologie Big Data spiegate al Management
Comprendere i concetti del bigdata e gli strumenti che esistono per affrontarli (Nosql, Hadoop/Spark) sono essenziali al management attuale per poter affrontare le sfide di domani.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
1. Roma – 8 Febbraio 2017
presenta Alberto Paro, Seacom
ElasticSearch 5.x
New Tricks
2. Alberto Paro
Laureato in Ingegneria Informatica (POLIMI)
Autore di 3 libri su ElasticSearch da 1 a 5.x + 6 Tech
review
Lavoro principalmente in Scala e su tecnologie BD
(Akka, Spray.io, Playframework, Apache Spark) e NoSQL
(Accumulo, Cassandra, ElasticSearch e MongoDB)
Evangelist linguaggio Scala e Scala.JS
3. Tip 1: Shrink - 1/5
Why?
The wrong number of shards during the initial
design sizing. Often sizing the shards without
knowing the correct data/text distribution tends to
oversize the number of shards
Reducing the number of shards to reduce memory
and resource usage
Reducing the number of shards to speed up
searching
4. Tip 1: Shrink - 2/5 - Where is your data?
We can retrieve it via the _nodes API:
curl -XGET 'http://localhost:9200/_nodes?pretty'
In the result there will be a similar section:
.... "nodes" : {
"5Sei9ip8Qhee3J0o9dTV4g" : {
"name" : "Gin Genie",
"transport_address" : "127.0.0.1:9300",
"host" : "127.0.0.1",
"ip" : "127.0.0.1",
"version" : "5.1.1",....
The name of my node is Gin Genie
5. Tip 1: Shrink - 3/5 - Relocate your data
We can change the index settings, forcing allocation to a single node for
our index, and disabling the writing for the index.
curl -XPUT 'http://localhost:9200/myindex/_settings' -d ’
{
"settings": {
"index.routing.allocation.require._name": "Gin Genie", "index.blocks.write":
true
}
}’
We can check for the green status:
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
6. Tip 1: Shrink - 4/5 – Shrink our shards
We need to disable the writing for the index via:
curl -XPUT 'http://localhost:9200/myindex/_settings?index.blocks.write=true'
The shrink call for creating the reduced_index, will be:
curl -XPOST 'http://localhost:9200/myindex/_shrink/reduced_index' -d '{
"settings": {
"index.number_of_replicas": 1,
"index.number_of_shards": 1,
"index.codec": "best_compression”
},
"aliases": {"my_search_indices": {}}
}'
7. Tip 1: Shrink - 5/5 – Post Shrinking
We can also wait for a yellow status if the index it is ready to work:
curl -XGET 'http://localhost:9200/_cluster/health? wait_for_status=yellow’
Now we can remove the read-only by changing the index settings:
curl -XPUT 'http://localhost:9200/myindex/_settings? index.blocks.write=true'
8. Tip 2: Reindex - 1/2
Why?
Changing an analyzer for a mapping
Adding a new subfield to a mapping and you need
to reprocess all the records to search for the new
subfield
Removing an unused mapping
Changing a record structure that requires a new
mapping
10. Tip 3: Update By Query with painless
Add a new Field
1. Create your mapping (i.e modified: date)
2. Call an update by query
curl -XPOST http://$server/$index/$mapping/_update_by_query -d '{
"script": {
"inline": "ctx._source.modified="2015-10-06T00:00:00.000+00:00"",
"lang": "painless”
},
"query": {
"bool": {"must_not":[{"exists":{"field":"modified"} }]}
}
}'
12. Tip 5: Reindex for a remote node – 1/2
Why?
The backup is a safe Lucene index copy, so it depends on the
Elasticsearch version used. If you are switching from a version
of Elastisearch that is prior to version 5.x, it's not possible to
restore old indices.
It's not possible to restore backups of a newer Elasticsearch
version in an older version. The restore is only forward-
compatible.
It's not possible to restore partial data from a backup.
13. Tip 5: Reindex for a remote node – 2/2
In config/elasticsearch.yml add:
reindex.remote.whitelist: ["192.168.1.227:9200"]
Then:
curl -XPOST "http://$server/_reindex" -d' {
"source": {
"remote": { "host": "http://192.168.1.227:9200" },
"index": "test-source”
},
"dest": {
"index": "test-dest”
}
}'
14. Tip 6: Ingest Pipeline – 1/2
Why
Adding/Removing fields without changing your code
Manipulate your records before ingesting
Computed fields
Also supports scripting