In this presentation we'll go through the required contrib modules, as well as the resources and special configurations SOLR needs to not only index, but also find multi-language content.
2. SOLR? What’s that and
why do I care?
§ SOLR is an open source search platform,
optimized for full-text searching, hit highlighting,
faceted search and a lot more
§ Far ahead of Drupal’s internal search; it
blows you away when you compare the two
§ Integrates with Drupal in many ways and can be
used in many ways – we’re focusing on the
actual search functionality
3. SOLR
§ Since it’s written in Java, it needs a Java-capable web
server and ships with one, Jetty
§ Very easy to configure and start, even for a
Drupal developer used to drush etc.
§ Integrates with Drupal for searching through the
“Apache SOLR search integration” module, sponsored by
Acquia
4.
5. How does Drupal integrate
with SOLR?
§ Basically the module replaces Drupal’s internal
search indexing and instead uses a SOLR
schema (schema.xml) that ships with the
module
§ It defines the mandatory node fields in Drupal
and uses SOLR’s cool dynamic field definitions
to accommodate all your FieldAPI fields
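To make the dynamic field idea concrete, here is a minimal Python sketch of how a field name can be matched against wildcard patterns to pick a field type. The prefixes and type names are hypothetical and chosen only for illustration; the real patterns are whatever the schema.xml shipped with the module defines.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field patterns (illustration only, not the module's actual schema).
DYNAMIC_FIELDS = [
    ("ts_*", "text"),     # single-valued text, fully analyzed
    ("ss_*", "string"),   # single-valued string, stored verbatim
    ("is_*", "integer"),  # single-valued integer
]


def resolve_field_type(field_name):
    """Return the type of the first pattern that matches, or None."""
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


print(resolve_field_type("ts_field_summary"))
```

The point of the design is that a new FieldAPI field only needs a name that fits an existing pattern; the schema itself does not have to change.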
6. So, what does SOLR do?
§ First it looks at the type of the field; the
behavior differs for different field types
§ For text it does a lot, it makes your text
searchable by first processing it in many ways
and then indexing it
§ The behavior differs in different languages – and
we’ll come to that later – but here’s the basic
process for a popular language example:
English
7. SOLR processing
§ First it tokenizes the text by whitespace
§ Then it removes the stop words (words not to
index, e.g. “and” or “or”)
§ Then it splits words by case change, numerics
and by a couple of other rules, e.g. “PowerShot”
=> indexed as “Power” and “Shot”
§ Then it stems the words, reducing inflected
words to their stems, e.g. “stemming” => “stem”
§ Then it removes duplicate tokens
8. SOLR processing
FreeAir X500 Wireless Router is a powerful wireless solution
well suited for the home or office.
9. SOLR processing
Separated by whitespace.
FreeAir X500 Wireless Router is a powerful wireless solution
well suited for the home or office
10. SOLR processing
Stop words removed.
FreeAir X500 Wireless Router powerful wireless solution
suited home office
11. SOLR processing
Words split, but not FreeAir, since it’s on the protected words list.
FreeAir X 500 Wireless Router powerful wireless solution
suited home office
12. SOLR processing
Everything in lowercase.
freeair x 500 wireless router powerful wireless solution
suited home office
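The walkthrough on slides 8 to 12 can be reproduced with a small Python sketch. It is a toy illustration, not SOLR's implementation: the stop word and protected word lists are hypothetical and chosen so that the output matches the slides, and stemming and duplicate removal are left out because the slides above stop at lowercasing.

```python
import re

# Hypothetical word lists, chosen so that the sketch reproduces the slides.
STOP_WORDS = {"a", "and", "for", "is", "or", "the", "well"}
PROTECTED_WORDS = {"FreeAir"}


def tokenize(text):
    """Split the text on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that are on the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def split_words(tokens):
    """Split on case changes and digit runs, except for protected words."""
    parts = []
    for token in tokens:
        if token in PROTECTED_WORDS:
            parts.append(token)
        else:
            # Capitalized word, lowercase run, uppercase run or digit run;
            # punctuation is dropped as a side effect.
            parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", token))
    return parts


def lowercase(tokens):
    return [t.lower() for t in tokens]


def analyze(text):
    """Run the stages in the order the slides show them."""
    return lowercase(split_words(remove_stop_words(tokenize(text))))


SENTENCE = ("FreeAir X500 Wireless Router is a powerful wireless solution "
            "well suited for the home or office.")
print(" ".join(analyze(SENTENCE)))
# prints: freeair x 500 wireless router powerful wireless solution suited home office
```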
14. Searching from SOLR
§ Now when you search from SOLR, it does parts
of the same magic to your query text
§ This way you’ll match the indexed document
even if you wrote it a bit differently
§ “Office capable wireless routers” will match our
indexed document just nicely: not on every
word, but on enough words close to each other
that it’ll be a good match and rank high on
SOLR’s relevance score
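A rough Python sketch of why the query still matches: both the document and the query go through the same analysis, and the analyzed terms are then compared. The stop word list and the plural-stripping "stemmer" are deliberately naive assumptions, and the overlap ratio only stands in for SOLR's real relevance score, which also weighs things like term frequency and proximity.

```python
import re

STOP_WORDS = {"a", "and", "for", "is", "or", "the"}  # hypothetical list


def naive_stem(word):
    """Toy stemmer: strip a plural 's' (real stemmers do far more)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def analyze(text):
    """Lowercase, tokenize, drop stop words and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def matched_terms(query, document):
    """Return the analyzed query terms that also occur in the document."""
    return sorted(set(analyze(query)) & set(analyze(document)))


DOCUMENT = ("FreeAir X500 Wireless Router is a powerful wireless solution "
            "well suited for the home or office.")
QUERY = "Office capable wireless routers"

hits = matched_terms(QUERY, DOCUMENT)
print(hits, len(hits) / len(set(analyze(QUERY))))
```

Three of the four query terms match here; "routers" only matches "Router" because both sides were stemmed and lowercased the same way.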
15. Apache SOLR integration
§ All the special configurations you need for SOLR to
run a site (in English) ship with the Apache
SOLR search integration module
§ Just copy them to SOLR and you’re good to go
§ The rest of the presentation will presume you’re
using this module to connect to SOLR; if you’re
using Search API Solr search, you’re out of luck and
will have to do a lot more manual work,
check out http://drupal.org/node/1210810
16. SO, MY SOLR SEARCH
WORKS WELL WITH MY
ENGLISH CONTENT
17. But, then, this is Europe
We do use a lot of other languages here too
… and then, things get a bit more complicated
18. SOLR schema has to be
language-aware
§ Stemming, stopwords, compound words and
such are all language dependent
§ The SOLR main indexing and querying
configuration, schema.xml, needs to be
language specific
§ Schema.xml is a long, complicated XML
document and any errors in it will prevent SOLR
from starting
19. Here’s an example
schema.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
This is the Solr schema file. This file should be named "schema.xml" and
should be in the conf directory under the solr home
(i.e. ./solr/conf/schema.xml by default)
or located where the classloader for the Solr webapp can find it.

For more information, on how to customize this file, please see
http://wiki.apache.org/solr/SchemaXml
-->
<schema name="drupal-3.0-0-solr3" version="1.3">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
Applications should change this to reflect the nature of the search collection.
version="1.2" is Solr's version number for the schema syntax and semantics. It should
not normally be changed by applications.
1.0: multiValued attribute did not exist, all fields are multiValued by nature
1.1: multiValued attribute introduced, false by default
1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
1.3: removed optional field compress feature
-->
<types>
<!-- field type definitions. The "name" attribute is
just a label to be used by field definitions. The "class"
attribute and any other attributes determine the real
behavior of the fieldType.
Class names starting with "solr" refer to java classes in the
org.apache.solr.analysis package.
-->

<!-- The StrField type is not analyzed, but indexed/stored verbatim.
- StrField and TextField support an optional compressThreshold which
limits compression (if enabled in the derived fields) to values which
exceed a certain size (in characters).
-->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

22. There’s help available
§ There are two modules on Drupal.org to make
your life easier, Apache SOLR Multilingual and
Apache SOLR config generator
§ Combined, they will enable you to
§ Have a multi-language site with SOLR search
optimized for each language
§ Generate configuration for such a multi-language site,
or even a site with one non-English language
23. Apache SOLR multilingual
§ Apache SOLR multilingual will separate the Drupal
node fields per language and store them into SOLR
in different fields
§ That way you can have different configuration setup
for the same Drupal field in different languages
§ It’ll handle the spell checking too
§ Apache SOLR config generator will then generate
a suitable starting point for your SOLR
configuration files
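The per-language separation can be pictured with a short Python sketch. The naming scheme (field name plus a language suffix) and the node structure are assumptions made for illustration; the field names the module actually generates may differ.

```python
def to_solr_document(node):
    """Map a node's text fields to language-suffixed index field names."""
    language = node["language"]
    document = {"id": node["id"], "language": language}
    for name, value in node["fields"].items():
        # Hypothetical naming: "title" in Finnish becomes "title_fi".
        document[f"{name}_{language}"] = value
    return document


finnish_node = {
    "id": 1,
    "language": "fi",
    "fields": {"title": "Tomaattikeitto"},
}
print(to_solr_document(finnish_node))
```

Because "title_fi" and "title_en" are different index fields, each can be given its own stop words, stemmer and other analysis settings.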
24. … but it doesn’t do
everything
§ It ships with stop word lists for the most common
languages, the ISO Latin mapping list for German
(the module author speaks German) and some
other files
§ Most of the language-specific word lists, such
as protwords (usually site-specific anyway), ISO
mappings, synonyms and compound word lists,
you’ll have to provide yourself
§ Some languages need a different stemmer to work
properly; the configuration generator uses
SnowballPorterFilterFactory
25. Stop words
§ Every language needs a stop word list; these
are the “and, or, then” words you don’t index at
all
§ Needless to say, they are language specific
§ Luckily you’ll find most of them either in the
Apache SOLR multilingual module or
somewhere online
26. ISO mapping
§ This covers the special letters in some languages
and how to convert them for better matching
§ This is usually done for accents and such, which
guide the pronunciation of the word and
don’t change the meaning (e.g. café => cafe,
in both indexing and querying)
§ Umlauts and similar letters (ä, ö, å) do change the
meaning and usually are NOT replaced
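A minimal Python sketch of such a mapping, applied the same way at index and query time. The mapping table is a hypothetical example, not the contents of any shipped mapping file; note that ä, ö and å are deliberately absent from it.

```python
import unicodedata

# Hypothetical mapping: only accents that guide pronunciation are folded.
ACCENT_MAP = {"é": "e", "è": "e", "ê": "e", "á": "a", "à": "a", "â": "a"}


def fold_accents(text):
    """Replace mapped characters and leave everything else untouched."""
    text = unicodedata.normalize("NFC", text)  # use precomposed characters
    return "".join(ACCENT_MAP.get(ch, ch) for ch in text)


print(fold_accents("café"))              # indexed and queried as "cafe"
print(fold_accents("häränhäntäkeitto"))  # ä is not in the map, so unchanged
```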
27. Protected words
§ As stated earlier, protected words are the words
you don’t want the indexer to deform
§ Usually trademarks, product names and such
§ These are usually site-specific – for obvious
reasons
§ This also means you’ll have to be writing this list
yourself – not a long list usually though
28. Synonyms
§ Synonyms are good if you want to make sure
your results are found even if the users don’t
use the same word
§ Also language specific and not easy to find for
smaller languages
§ Here’s an example:
GB,gib,gigabyte,gigabytes
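How a comma-separated line like the one above could be used is sketched below in Python: every term in a group expands to the whole group, so any of the words finds documents containing any of the others. This is a simplified illustration under the assumption of case-insensitive matching, not SOLR's synonym filter.

```python
SYNONYM_LINES = ["GB,gib,gigabyte,gigabytes"]


def build_synonym_map(lines):
    """Map every term of a comma-separated group to the whole group."""
    synonym_map = {}
    for line in lines:
        group = [term.strip().lower() for term in line.split(",") if term.strip()]
        for term in group:
            synonym_map[term] = group
    return synonym_map


def expand(tokens, synonym_map):
    """Replace each token with its synonym group, if it has one."""
    expanded = []
    for token in tokens:
        token = token.lower()
        expanded.extend(synonym_map.get(token, [token]))
    return expanded


SYNONYMS = build_synonym_map(SYNONYM_LINES)
print(expand(["4", "gigabytes"], SYNONYMS))
```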
29. Compound words
§ There’s also a file to split up compound words
§ For a lot of languages you don’t even need it
and for most only a small one is needed
§ But then there are some languages where you can’t
do without one, like German or Finnish
§ Let’s look at an example
30. Compound words
example
§ We did a Drupal site that is about food recipes
§ In English, searching for ‘soup’ would result in all
the soups
§ Oxtail soup
§ Lentil soup
§ Goulash soup
§ Tomato soup
… and so on
31. Compound words
example
§ By searching with soup in Finnish, ‘keitto’, you’d
normally get none of the following:
§ Häränhäntäkeitto
§ Linssikeitto
§ Gulassikeitto
§ Tomaattikeitto
… see why?
32. Compound words
§ See, SOLR doesn’t do infix indexes, that means
it doesn’t find words “within” other words*
§ So you’ll have to split compound words to make
their parts searchable
* There is a way to do infix indexes in SOLR, but that’s so complicated that it’s not even
funny. You’ll have to have two indexes, one the normal way and one in reverse and
then reverse the query to search from the reverse index.
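Splitting compounds with a word list can be sketched in Python as follows. The dictionary is a hypothetical three-word list and the algorithm is a plain substring lookup, meant only to show why ‘keitto’ starts matching once the parts are indexed next to the original word.

```python
# Hypothetical compound word list for a recipe site.
COMPOUND_PARTS = {"keitto", "linssi", "tomaatti"}


def decompound(token, dictionary, min_length=3):
    """Return the token plus every dictionary word found inside it."""
    token = token.lower()
    parts = [token]
    for start in range(len(token)):
        for end in range(start + min_length, len(token) + 1):
            part = token[start:end]
            if part in dictionary and part != token:
                parts.append(part)
    return parts


for word in ["Tomaattikeitto", "Linssikeitto", "Häränhäntäkeitto"]:
    print(decompound(word, COMPOUND_PARTS))
```

Each of the three soups now carries a separate "keitto" token, so a search for ‘keitto’ can find them all.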
33. Some special languages
§ Chinese, Japanese and Korean have their own
different approach to indexing with SOLR,
basically you don’t have to stem, but only cut the
words out of the sentences (whitespace doesn’t
work like in the European languages)
§ For some languages, you can’t even find the
basic stuff (try Mongolian for instance)
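One common way of cutting text that has no whitespace between words is to index overlapping character pairs (bigrams). The Python sketch below shows that idea only; it is an assumption for illustration, and the slides do not say which tokenizer was used.

```python
def bigrams(text):
    """Split text into overlapping two-character tokens."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(bigrams("東京都"))
```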
34. Multilingual SOLR search
§ After adding all those word lists and retuning your
search according to examples in SOLR’s wiki and
example configurations, you’ll have a working multi-
language SOLR search
§ Let native users of that language use it and you’ll
have some more tuning to do and words to add to
those lists
§ Eventually your site will be the benchmark for
functional searching – working multi-language
searches are that rare
36. Apache SOLR integration
§ Apache SOLR integration is a module for
integrating your search with SOLR from Drupal
§ It works well for English, even better if you tune
the SOLR configuration a bit
§ Apache SOLR multilingual and config generator
enable you to index content in multiple languages
§ If you’re using Search API Solr search, you’re in for
a lot of manual labor
37. Apache SOLR multilingual
§ But you need to tune your settings by hand and
you need the word lists
§ Word lists for stop words are easy to find for
common languages
§ Other word lists you can only find for really
popular languages
§ Protected words you’ll have to craft yourself