- The document discusses strategies for indexing and querying documents in multiple languages within Elasticsearch.
- Two main indexing strategies are described: having separate indices for each language, or having a single index with different mappings or fields for each language.
- When querying, the strategy used depends on the indexing approach, but all allow for multi-language searches.
- Custom analyzers and token filters are demonstrated for more accurate language analysis beyond the standard analyzer, such as for German text.
2. About me
● Bryan Warner - Developer @Traackr
○ bwarner@traackr.com
● I've worked with ElasticSearch since early 2012; before that I worked with Lucene & Solr
● Primary background is in Java back-end development
● Shifted focus to Scala development over the past year
3. About Traackr
● Influencer search engine
● We track content daily & in real-time for our database of influential people
● We leverage ElasticSearch parent/child (top-children) queries to search content (i.e. the children) to surface the influencers who've authored it (i.e. the parents)
● Some of our back-end stack includes: ElasticSearch, MongoDb, Java/Spring, Scala/Akka, etc.
4. Overview
● Indexing / querying strategies to support language-targeted searches within ES
● ES Analyzers / TokenFilters for language analysis
● Custom Analyzers / TokenFilters for ES
● A look at some open-source projects that assist in language detection & analysis
5. Use Case
● We have a database of articles written in many languages
● We want our users to be able to search articles written in a particular language
● We want that search to handle the nuances of that particular language
7. Indexing Strategies
Separate indices per language
- OR -
Same index for all languages
8. Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
○ IDF = log(numDocs/(docFreq+1)) + 1
CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
○ Same problem for Solr Joins
■ Maintain schema per index
9. Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine
CONS
■ Schema complexity grows
■ IDF values might be skewed
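The skew can be illustrated with the IDF formula from slide 8. This is a quick standalone calculation; the document counts below are invented purely for illustration:

```java
public class IdfSkew {
    // Lucene's classic IDF, as quoted on slide 8:
    // idf = log(numDocs / (docFreq + 1)) + 1
    public static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1;
    }

    public static void main(String[] args) {
        // Hypothetical term appearing in 100 French documents.
        // Dedicated French index: 10k docs total.
        double separate = idf(10_000, 100);
        // Shared index: 100k docs across all languages inflates numDocs,
        // so the same term scores as rarer than it really is within French content.
        double shared = idf(100_000, 100);
        System.out.println(separate < shared); // true
    }
}
```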
10. Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
a. At indexing time, we set the right mapping based on the article's language
2. Create different fields per language-analyzed field
a. At indexing time, we populate the correct text field based on the article's language
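Option 2 can be sketched in the index mapping like this (the field names text_en / text_fr / text_de match the fields used in the querying slides; "string" is the pre-2.x ES field type in use at the time, and the analyzer names are the built-in ES language analyzers):

```json
"mappings": {
  "article": {
    "properties": {
      "text_en": { "type": "string", "analyzer": "english" },
      "text_fr": { "type": "string", "analyzer": "french" },
      "text_de": { "type": "string", "analyzer": "german" }
    }
  }
}
```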
13. Querying Strategies
How do we execute a language-targeted search?
... all based on our indexing strategy.
14. Querying Strategies
(1) Separate indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
    .setTypes("article");
QueryStringQueryBuilder query = QueryBuilders.queryString("boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french", "german": match the target index's language
request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
15. Querying Strategies
(2a) Same index for all languages - diff. mappings
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
    .setTypes(targetMapping);
QueryStringQueryBuilder query = QueryBuilders.queryString("boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french", "german": match the target mapping's language
request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
16. Querying Strategies
(2b) Same index for all languages - diff. fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
    .setTypes("article");
QueryStringQueryBuilder query = QueryBuilders.queryString("boston elasticsearch");
query.field("text_en"); // or "text_fr", "text_de": pick the target language field
query.analyzer("english"); // or "french", "german": match the chosen field
request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
17. Querying Strategies
● Will these strategies support a multi-language search?
○ E.g. search by French and German
○ E.g. search against all languages
● Yes! *
● In the same SearchRequest:
○ We can search against multiple indices
○ We can search against multiple "mapping" types
○ We can search against multiple fields
* Need to give thought to which query analyzer to use
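For example, with strategy (2b) a single request can cover several languages by listing multiple fields. A sketch of the query body (text_en / text_fr / text_de are the per-language fields from the indexing slides; note the caveat above about choosing a query analyzer still applies):

```json
"query": {
  "query_string": {
    "query": "boston elasticsearch",
    "fields": ["text_en", "text_fr", "text_de"]
  }
}
```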
18. Language Analysis
● What do ElasticSearch and/or Lucene offer us for analyzing various languages?
● Is there a one-size-fits-all solution?
○ e.g. StandardAnalyzer
● Or do we need custom analyzers for each language?
19. Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you 95% of the way there
● Each language analyzer provides its own flavor on top of the StandardAnalyzer
● FrenchAnalyzer
○ Adds an ElisionFilter (l'avion -> avion)
○ Adds a French stop-words filter
○ Adds a FrenchLightStemFilter
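The elision behavior can be illustrated with a toy stand-in (this is not Lucene's implementation; the article list here is a small assumed subset of what the real ElisionFilter handles):

```java
import java.util.Set;

public class ElisionDemo {
    // Common French elided articles (assumed subset for illustration)
    private static final Set<String> ARTICLES =
            Set.of("l", "d", "j", "m", "n", "s", "t", "c", "qu");

    // Strip a leading elided article such as l' or qu' from a token
    public static String elide(String token) {
        int apos = token.indexOf('\'');
        if (apos > 0 && ARTICLES.contains(token.substring(0, apos).toLowerCase())) {
            return token.substring(apos + 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(elide("l'avion")); // avion
        System.out.println(elide("qu'il"));   // il
        System.out.println(elide("paris"));   // paris
    }
}
```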
20. Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there
● German makes heavy use of compound words
■ das Vaterland => the fatherland
■ Rechtsanwaltskanzleien => law firms
● For best search results, these compound words should produce index terms for their individual parts
● The GermanAnalyzer lacks a word-compound token filter
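A toy sketch of the dictionary-based decompounding idea (the DICT word list is an invented stand-in; real compound token filters load large dictionaries and perform proper segmentation rather than substring checks):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Decompounder {
    // Tiny stand-in dictionary; a real filter loads thousands of German word parts
    private static final Set<String> DICT =
            Set.of("recht", "anwalt", "kanzlei", "vater", "land");

    // Emit the original token plus any dictionary words found inside it,
    // mimicking the extra index terms a compound word token filter produces
    public static List<String> split(String token) {
        String lower = token.toLowerCase();
        List<String> parts = new ArrayList<>();
        parts.add(lower);
        for (String word : DICT) {
            if (lower.contains(word)) {
                parts.add(word);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // "Vaterland" yields the whole token plus "vater" and "land"
        System.out.println(split("Vaterland"));
    }
}
```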
21. Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not get you far
● Using a standard tokenizer to extract tokens from Chinese text will not produce accurate terms
○ Some 3rd-party Chinese analyzers will extract bigrams from Chinese text and index those as if they were words
● Need to do your research
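The bigram approach mentioned above can be sketched in a few lines (a toy illustration, not any particular analyzer's code):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigrams {
    // Emit overlapping character bigrams, the way CJK-style analyzers
    // tokenize Han text when no dictionary-based segmenter is available
    public static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("我爱北京")); // [我爱, 爱北, 北京]
    }
}
```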
22. Language Analysis
You should also know about...
● ASCII Folding Token Filter
○ über => uber
● ICU Analysis Plugin
○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
○ Allows for unicode normalization, collation and folding
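A rough approximation of what ASCII folding does for accented Latin characters, using only the JDK (Lucene's real ASCIIFoldingFilter covers far more mappings, e.g. ligatures like ß and œ, which this sketch does not):

```java
import java.text.Normalizer;

public class AsciiFold {
    // Decompose to NFD, then drop the combining diacritical marks,
    // so "ü" becomes "u" + U+0308 and then plain "u"
    public static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("über")); // uber
        System.out.println(fold("café")); // cafe
    }
}
```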
23. Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German text (e.g. remove stemming)
● How do we go about doing this?
○ One way is to leverage ElasticSearch's flexible schema definitions
25. Custom Analyzer / Token Filter
Create a custom German analyzer in our schema:
"settings" : {
....
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
}
},
"analyzer": {
"custom_text_german": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "german_stop", "german_normalization"]
}
}
}
....
}
("german_normalization" refers to the custom TokenFilter described on the next slide)
26. Custom Analyzer / Token Filter
1. Declare a schema filter for German stop words
2. We'll also need to create a custom TokenFilter class to wrap Lucene's org.apache.lucene.analysis.de.GermanNormalizationFilter
a. It does not come as a pre-defined ES TokenFilter
b. German text needs to be normalized on certain character sequences, e.g. 'ae' and 'oe' are replaced by 'a' and 'o', respectively
3. Declare a schema filter for the custom GermanNormalizationFilter
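Step 2b can be illustrated with a deliberately simplified sketch (the real GermanNormalizationFilter applies these substitutions with more context-sensitive rules, e.g. 'ue' is kept after 'q'; this toy version just applies them blindly):

```java
public class GermanNormalize {
    // Simplified German normalization: fold umlauts and their
    // two-letter spellings to base vowels, and ß to ss
    public static String normalize(String s) {
        return s.toLowerCase()
                .replace("ß", "ss")
                .replace("ä", "a").replace("ö", "o").replace("ü", "u")
                .replace("ae", "a").replace("oe", "o");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Fußball")); // fussball
        System.out.println(normalize("Mädchen")); // madchen
    }
}
```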
30. OS Projects
Language Detection
● https://code.google.com/p/language-detection/
○ Written in Java
○ Provides language profiles with unigram, bigram, and trigram character frequencies
○ The detector provides an accuracy % for each language detected
PROS
■ Very fast (~4k pieces of text per second)
■ Very reliable for text longer than 30-40 characters
CONS
■ Unreliable & inconsistent for small text samples (<30 characters), i.e. short tweets
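The character n-gram profiles such detectors rely on are easy to sketch (a toy profile builder for illustration, not the library's actual code; a real detector compares these counts against per-language models):

```java
import java.util.HashMap;
import java.util.Map;

public class TrigramProfile {
    // Build a character-trigram frequency map for a piece of text
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String t = text.toLowerCase();
        for (int i = 0; i + 3 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "banana" yields ban=1, ana=2, nan=1
        System.out.println(profile("banana"));
    }
}
```

The CON above falls out of this directly: a text shorter than ~30 characters yields too few trigrams for the counts to be distinctive.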
31. OS Projects
German Word Decompounder
● https://github.com/jprante/elasticsearch-analysis-decompound
● Lucene offers two compound word token filters, a dictionary- and a hyphenation-based variant
○ Not bundled with Lucene due to licensing issues
○ They require loading a word list into memory before they are run
● The decompounder uses prebuilt Compact Patricia Tries for efficient word segmentation, provided by the ASV toolbox
○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm