Full text search in PostgreSQL is a flexible and powerful facility for searching a collection of documents using natural-language queries. We will discuss several FTS improvements in the PostgreSQL 9.6 release, such as phrase search, better dictionary support, and tsvector editing functions. We will also present new features currently in development: RUM index support, which accelerates some important kinds of full text queries; a new and better ranking function for relevance search; loading dictionaries into shared memory; and support for searching multilingual content.
End-to-end Streaming Between gRPC Services Via Kafka with John Fallows — Hosted by Confluent
You have built and deployed your gRPC microservices architecture, and now you need to expand beyond request-response to add streaming. Helpfully, gRPC has built-in support for bidirectional streaming. You probably already have the event streams in Kafka and have been tasked with figuring out the integration. You might even build a prototype to better understand the performance, security, and scalability challenges.
I will discuss some of these challenges in detail and also describe solutions to help meet them, including a straightforward approach to bridge gRPC streams to and from Apache Kafka.
This session is targeted at developers interested in learning how to integrate gRPC with Kafka event streaming: securely, reliably, and scalably.
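The bridging approach described above can be sketched with a toy model. In the following Python sketch an in-memory queue and a plain list stand in for the Kafka topic and the gRPC streams, and the name `bridge_stream` is illustrative rather than part of any real gRPC or Kafka API:

```python
from queue import Queue

def bridge_stream(request_iter, topic_out: Queue, topic_in):
    """Toy bridge: forward each incoming gRPC request to a Kafka-like
    topic, then stream back every record waiting on the response topic.
    A real bridge would use a Kafka client and a gRPC stream handler."""
    for req in request_iter:      # client -> server stream
        topic_out.put(req)        # "produce" to Kafka
    for msg in topic_in:          # "consume" from Kafka
        yield msg                 # server -> client stream

outbound = Queue()
responses = list(bridge_stream(iter(["a", "b"]), outbound, ["x", "y"]))
print(responses)  # records streamed back from the response topic
```

In production the queue and list would be replaced by a real producer/consumer (e.g. a Kafka client library) and a gRPC bidirectional stream, but the shape of the loop stays the same.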
Ariel Waizel discusses the Data Plane Development Kit (DPDK), an API for developing fast packet processing code in user space.
* Who needs this library? Why bypass the kernel?
* How does it work?
* How good is it? What are the benchmarks?
* Pros and cons
Ariel worked on kernel development at the IDF, Ben Gurion University, and several companies. He is interested in networking, security, machine learning, and basically everything except UI development. Currently a Solution Architect at ConteXtream (an HPE company), which specializes in SDN solutions for the telecom industry.
Traditional virtualization technologies have been used by cloud infrastructure providers for many years to provide isolated environments for hosting applications. These technologies use full-blown operating system images to create virtual machines (VMs). In this architecture, each VM needs its own guest operating system to run application processes. More recently, with the introduction of the Docker project, the Linux Container (LXC) virtualization technology became popular and attracted attention. Unlike VMs, containers do not need a dedicated guest operating system to provide OS-level isolation; rather, they can provide the same level of isolation on top of a single operating system instance.
An enterprise application may need to run a server cluster to handle high request volumes. Running an entire server cluster in Docker containers on a single Docker host introduces a single point of failure. Google started the Kubernetes project to solve this problem. Kubernetes manages Docker containers across a cluster of Docker hosts, providing an API on top of the Docker API for managing containers on multiple hosts, with many more features.
Kernel Recipes 2019 - Suricata and XDP — Anne Nicolas
Suricata is a network threat detection engine that captures network packets to reconstruct traffic up to the application layer and find threats on the network, using rules that define the behavior to detect. This task is really CPU intensive, and discarding uninteresting traffic is one way to scale Suricata to 40 Gbps and beyond.
This talk will present the latest evolution of Suricata, which now uses eBPF and XDP to bypass traffic. Suricata 5.0 supports hardware XDP to provide bypass with network cards such as Netronome. It also takes advantage of pinned maps to make bypassed flows persistent. The talk will cover the different usages of XDP and eBPF in Suricata and show how they impact performance and usability. If development time permits, it will also cover AF_XDP and the impact of this new capture method on Suricata.
Eric Leblond
While Go is the language-of-choice in the cloud-native world, Python has a huge community and makes it really easy to extend Kubernetes in only a few lines of code.
This talk shows examples on how to use Python to query the Kubernetes API, how to write simple controllers in only 10 lines of Python, how to build complete web UIs, and how to test everything with py.test and Kind.
Some of the open-source projects which will be covered: pykube-ng, Kubernetes Web View, kube-janitor, and Kopf (Kubernetes Operator Pythonic Framework).
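To give a flavor of how small such a controller can be, here is a hypothetical reconcile loop in plain Python. The dict-based desired/observed states stand in for real Kubernetes API objects, so this is an illustration of the pattern, not the actual pykube-ng or Kopf API:

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired
    state. Keys are resource names; values are replica counts."""
    actions = []
    for name, replicas in desired.items():
        if observed.get(name) != replicas:
            actions.append(("scale", name, replicas))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

# A real controller would apply these actions via the API server,
# then watch for changes and repeat.
print(reconcile({"web": 3}, {"web": 1, "old-job": 1}))
# [('scale', 'web', 3), ('delete', 'old-job')]
```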
Talk held in Prague on 2019-09-05:
https://www.meetup.com/Cloud-Native-Prague/events/263802447/
HTTP/3 is designed to improve in areas where HTTP/2 still has some shortcomings, primarily by changing the transport layer. HTTP/3 is the first major protocol to step away from TCP and instead it uses QUIC.
HTTP/3 is the designated name for the coming next version of the protocol that is currently under development within the QUIC working group in the IETF.
Daniel Stenberg presents HTTP/3 and QUIC: why the new protocols are deemed necessary, how they work, how they change what is sent over the network, and what some of the coming deployment challenges will be.
ROS2 on VxWorks - Challenges in porting a modern software framework to an RTOS — Andrei Kholodnyi
A slim ROS2 design with minimal external dependencies makes it a promising candidate for direct deployment in the mobile robot environment. Such embedded systems are the area where VxWorks RTOS historically plays a leading role. This presentation describes the motivation and challenges of porting the Robot Operating System to VxWorks, gives some recommendations to the ROS2 community and discusses possible next steps.
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa…) — DataStax
Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. To enable this, we built a time machine that computes features for any arbitrary time in the recent past for offline experimentation. We also built a real-time stream processing system to capture the interests of members during different times of the day and to quickly adapt to changes in the collective interests of members as it happens in case of real-world events.
Building the time machine for offline experimentation and the real-time infrastructure for online recommendations with Apache Spark (Streaming) and Apache Cassandra empowered us both to scale up the data size by an order of magnitude and to train and validate the models in less time. We will delve into the architecture, use-case details, and the data models used for Cassandra, and share our learnings.
About the Speakers
Prasanna Padmanabhan Engineering Manager, Netflix
Prasanna leads the Data Systems for Personalization team at Netflix. His primary focus is on building various big data infrastructure components that help their algorithmic engineers innovate faster and improve personalization for Netflix members. In the past, he has built distributed data systems that leverage both batch and stream processing.
Roopa Tangirala Engineering Manager, Netflix
Roopa Tangirala is an experienced engineering leader with extensive background in databases, be they distributed or relational. She manages the database engineering team at Netflix responsible for operating cloud persistent and semipersistent runtime stores for Netflix, which includes Cassandra, Elasticsearch, Dynomite and MySQL databases, by ensuring data availability, durability, and scalability to meet the growing business needs.
On demand recording: https://www.nginx.com/resources/webinars/nginx-http2-server-push-grpc/
We discuss new NGINX support for HTTP/2 server push and proxying gRPC traffic.
Check out this webinar to learn:
- About NGINX HTTP/2 support
- How to use HTTP/2 server push with NGINX
- How to proxy gRPC traffic using NGINX
- How to configure both features, with live demonstrations
DPDK IPSec performance benchmark ~ Georgii Tkachuk — Intel
IPSec and cryptodev overview and performance numbers by the Intel benchmarking team.
Part of 2 day SDN/NFV/DPDK dev lab
https://www.meetup.com/Out-Of-The-Box-Network-Developers/events/237028223/
Optimizing Spark jobs through a true understanding of Spark Core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
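As a worked example of the partition-sizing questions above, here is a back-of-the-envelope calculation in Python. The 128 MB target and the round-up-to-a-multiple-of-the-core-count rule are common conventions, not Spark defaults:

```python
import math

def shuffle_partitions(total_bytes: int,
                       target_bytes: int = 128 * 1024**2,
                       cores: int = 8) -> int:
    """Pick a partition count: enough to keep each partition near the
    target size, rounded up to a multiple of the core count so every
    core stays busy in each wave of tasks."""
    by_size = math.ceil(total_bytes / target_bytes)
    return max(cores, math.ceil(by_size / cores) * cores)

# e.g. 100 GB of shuffle data on an 8-core executor pool
print(shuffle_partitions(100 * 1024**3))  # 800
```

The same arithmetic answers the "too many small output files" question in reverse: fewer, larger partitions before the write mean fewer output files.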
Accelerating Envoy and Istio with Cilium and the Linux Kernel — Thomas Graf
This talk will provide an introduction to injection options of Envoy and then deep dive into ongoing Linux kernel work that enables injecting Envoy while introducing as little latency as possible.
The service mesh and the sidecar proxy model are on a steep trajectory to redefine many networking and security use cases. This talk explains and demos a new socket-redirect Linux kernel technology that allows running Envoy with performance similar to the sidecar being linked to the application over a UNIX domain socket. The talk will also give an outlook on how Envoy can use the recently merged kernel TLS functionality to transparently access the clear-text payload of end-to-end encrypted applications, without having to decrypt and re-encrypt any data, further reducing overhead and latency.
The BPF, or Berkeley Packet Filter, mechanism was first introduced in Linux in 1997, in version 2.1.75. It has seen a number of extensions over the years. Recently, in versions 3.15-3.19, it received a major overhaul which drastically expanded its applicability. This talk will cover how the instruction set looks today and why: its architecture, capabilities, interface, and just-in-time compilers. We will also talk about how it is being used in different areas of the kernel, like tracing and networking, and about future plans.
In the Cloud Native community, eBPF is gaining popularity and can often be the best solution for challenges that require deep observability of the system. eBPF is currently being embraced by major players.
Mydbops co-founder Kabilesh P.R (MySQL and MongoDB consultant) illustrates debugging Linux issues with eBPF: a brief overview of BPF and eBPF, BPF internals, and the tools in action for faster resolution.
Full text search | Speech by Matteo Durighetto | PGDay.IT 2013 — Miriade Spa
Slides from Matteo Durighetto's talk at PGDay.IT 2013, Prato, 25 October 2013.
Full text search grew out of the need to find words, or their derived forms, within a document. The problem cannot always be solved with regular expressions: think of irregular plurals (where matching requires a dictionary) or of computing the similarity of a word (for example, to find the most relevant topic and rank the results).
In this talk we will explore PostgreSQL's peculiarities and its potential in this area.
There are a number of players that provide full text search, from embedded search to dedicated search servers (Solr, Sphinx, Elasticsearch, etc.), but setting them up and configuring them is a time-consuming process and requires considerable knowledge of the tools.
What if we could get comparable search results using the full text search capabilities of Postgres? Developers already have working knowledge of the database, so this should come naturally. In addition, it will be one less tool to manage.
Code: https://github.com/Syerram/postgres_search
New full text search capabilities in PostgreSQL / Oleg Bartunov (Postgr…) — Ontico
I will talk about the new full text search capabilities that made it into the latest PostgreSQL release: phrase search support and a set of functions for manipulating the full text data type (tsvector). Beyond that, we improved support for morphological dictionaries, which significantly increased the number of supported languages, optimized how dictionaries are handled, and developed a new index access method, RUM, which considerably speeds up a number of queries with full text operators.
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course — PROIDEA
While it loses to Oracle on features, it loses only marginally. While not so long on the market, it is still second best. While not as funky and shiny as the new NoSQL DBs, it is arguably the shiniest of all relational DBs, and it has a colourful history. So let me tell you about PostgreSQL's architecture and internals, walk you through the query path and optimization, and hint about no hinting, and how and why; in another thread we'll talk about MVCC and vacuum, and if there is time for more, we'll have a round of questions.
A guide for what to pay attention to when it comes to monitoring and managing your PostgreSQL database. The content is targeted at application developers as opposed to the typical DBA.
[Session given at Engage 2019, Brussels, 15 May 2019]
In this session, Tim Davis (Technical Director at The Turtle Partnership Ltd) takes you through the new Domino Query Language (DQL), how it works, and how to use it in LotusScript, in Java, and in the new domino-db Node.js module. Introduced in Domino 10, DQL provides a simple, efficient and powerful search facility for accessing Domino documents. Originally only used in the domino-db Node.js module, with 10.0.1 DQL also became available to both LotusScript and Java. This presentation will provide code examples in all three languages, ensuring you will come away with a good understanding of DQL and how to use it in your projects.
Full text search in PostgreSQL / Alexander Alekseev (Postgres Professional) — Ontico
RIT++ 2017, Backend Conf
São Paulo hall, 6 June, 11:00
Abstract:
http://backendconf.ru/2017/abstracts/2780.html
Not everyone knows that PostgreSQL has full text search. Moreover, unlike in some other RDBMSs, the search in PostgreSQL is fully featured and able to compete in speed and quality with specialized solutions. No less interesting, by using PostgreSQL's full text search you can get rid of data duplication, thereby saving disk space and traffic while keeping the data consistent.
From this talk you will learn how to use GIN, GiST, and the new RUM index for full text search in PostgreSQL, what the advantages and disadvantages of these indexes are, and how to use them to build search over documents or, for example, suggesters, and more.
2015-12-05 Alexander Korotkov, Ivan Panchenko - Semi-structured data… — HappyDev
The rise of a large number of NoSQL DBMSs was driven by the requirements of modern information systems that most traditional relational databases do not satisfy. One such requirement is support for data whose structure is not defined in advance. However, when choosing a NoSQL database for the sake of schemaless data, you can lose a number of advantages that mature SQL solutions provide, namely transactions and fast row reads from tables. PostgreSQL, a leading relational DBMS, supported semi-structured data long before NoSQL appeared, and this support gained a second wind in the latest release with the jsonb data type, which not only supports the JSON standard but also offers performance comparable to, or even exceeding, the most popular NoSQL DBMSs.
DataStax: An Introduction to DataStax Enterprise Search — DataStax Academy
1) Why We Built DSE Search
2) Basics of the Read and Write Paths
3) Fault-tolerance and Adaptive Routing
4) Analytics with Search and Spark
5) Live Indexing
A brief, but action-packed introduction to DataStax Enterprise Search. In this deck, we'll get an overview of DSE Search's value proposition, see some example CQL search queries, and dive into the details of the indexing and query paths.
Adjusting primitives for graph: SHORT REPORT / NOTES — Subhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
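The CSR layout can be illustrated with a minimal sketch (pure Python; the helper name `to_csr` is made up for this example):

```python
def to_csr(n, edges):
    """Build CSR arrays (offsets, targets) for a directed graph with
    n vertices from an edge list of (src, dst) pairs."""
    offsets = [0] * (n + 1)
    for s, _ in edges:
        offsets[s + 1] += 1
    for i in range(n):
        offsets[i + 1] += offsets[i]      # prefix sum -> row starts
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()          # next free slot per row
    for s, d in edges:
        targets[cursor[s]] = d
        cursor[s] += 1
    return offsets, targets

# The out-neighbors of vertex v are targets[offsets[v]:offsets[v+1]].
offsets, targets = to_csr(3, [(0, 1), (0, 2), (2, 0)])
print(offsets, targets)   # [0, 2, 2, 3] [1, 2, 0]
```

Packing all adjacency lists into two flat arrays is what makes CSR cache-friendly for the per-iteration scans that PageRank performs.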
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computation and thus could reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be calculated easily; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
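The idea of skipping computation on already-converged vertices can be sketched as a plain power-iteration PageRank with a per-vertex convergence flag. This is a toy illustration of that one optimization, not the STICD algorithm itself, and it drops dangling-node mass for simplicity:

```python
def pagerank(adj, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank on an adjacency list adj[v] = out-neighbors.
    Vertices whose rank moved less than tol are marked converged, and
    their rank update is skipped in later iterations."""
    n = len(adj)
    rank = [1.0 / n] * n
    converged = [False] * n
    for _ in range(max_iter):
        incoming = [0.0] * n
        for v, outs in enumerate(adj):
            share = rank[v] / len(outs) if outs else 0.0  # dangling mass dropped
            for u in outs:
                incoming[u] += share
        moved = False
        for v in range(n):
            if converged[v]:
                continue                  # skip settled vertices
            new = (1 - damping) / n + damping * incoming[v]
            if abs(new - rank[v]) < tol:
                converged[v] = True
            else:
                moved = True
            rank[v] = new
        if not moved:
            break
    return rank
```

On a 3-cycle (`[[1], [2], [0]]`) every vertex settles at 1/3 immediately, so the loop exits after one pass; on skewed graphs the cheap vertices settle early and stop being recomputed.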
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
1. Better Full Text Search in PostgreSQL
Oleg Bartunov
Postgres Professional, Moscow University
Artur Zakirov
Postgres Professional
PGConf.EU-2016, Tallinn
2. FTS in PostgreSQL
● FTS is a powerful built-in text search engine
● No new features since 2006!
● Popular complaints:
• Slow ranking
• No phrase search
• No efficient alternate ranking
• Working with dictionaries is tricky
• Dictionaries are stored in the backend memory
• FTS is flexible, but not enough
3. FTS in PostgreSQL
● tsvector – data type for document optimized for search
● tsquery – textual data type for rich query language
● Full text search operator: tsvector @@ tsquery
● SQL interface to FTS objects (CREATE, ALTER)
• Configuration: {tokens, {dictionaries}}
• Parser: {tokens}
• Dictionary: tokens → lexeme{s}
● Additional functions and operators
● Indexes: GiST, GIN, RUM
http://www.postgresql.org/docs/current/static/textsearch.html
to_tsvector('english','a fat cat sat on a mat and ate a fat rat')
@@
to_tsquery('english','(cats | rat) & ate & !mice');
4. Some FTS problems: #1
156676 Wikipedia articles:
● Search is fast, ranking is slow.
SELECT docid, ts_rank(text_vector, to_tsquery('english', 'title')) AS rank
FROM ti2
WHERE text_vector @@ to_tsquery('english', 'title')
ORDER BY rank DESC
LIMIT 3;
Limit (actual time=476.106..476.107 rows=3 loops=1)
Buffers: shared hit=149804 read=87416
-> Sort (actual time=476.104..476.104 rows=3 loops=1)
Sort Key: (ts_rank(text_vector, '''titl'''::tsquery)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=149804 read=87416
-> Bitmap Heap Scan on ti2 (actual time=6.894..469.215 rows=47855 loops=1)
Recheck Cond: (text_vector @@ '''titl'''::tsquery)
Heap Blocks: exact=4913
Buffers: shared hit=149804 read=87416
-> Bitmap Index Scan on ti2_index (actual time=6.117..6.117 rows=47855 loops
Index Cond: (text_vector @@ '''titl'''::tsquery)
Buffers: shared hit=1 read=12
Planning time: 0.255 ms
Execution time: 476.171 ms
(15 rows)
HEAP IS SLOW
470 ms !
5. Some FTS problems: #2
● No phrase search
● 'A & B' is equivalent to 'B & A'
● There are only 92 posts in -hackers mentioning the person
'Tom Good', but FTS finds 34039 posts
● FTS + regex is slow and can be used only for
simple queries.
6. Some FTS problems: #3
● Slow FTS with ordering by timestamp («fresh» results)
● Bitmap index scan by GIN (fts)
● Bitmap index scan by Btree (date)
● BitmapAND
● Bitmap Heap scan
● Sort
● Limit
● 10 ms
SELECT sent, subject from pglist
WHERE fts @@ to_tsquery('english', 'server & crashed')
and sent < '2000-01-01'::timestamp
ORDER BY sent desc
LIMIT 5;
7. Inverted Index in PostgreSQL
[diagram: entry tree, with each entry pointing to a posting list or posting tree]
No positions in the index!
8. Improving GIN
● Improve GIN index
• Store additional information in posting tree, for
example, lexemes positions or timestamps
• Use this information to order results
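As a toy model of why the additional info helps, suppose each posting carries the word positions. This Python sketch is only an illustration of the idea, not RUM's on-disk format, and the scoring formula here is made up for the example: results can be scored and ordered straight from the index, without heap access.

```python
# (doc_id, positions) pairs stored per lexeme, positions kept as "addinfo"
posting_list = {
    'titl': [(101, [3]), (205, [1, 17]), (310, [44])],
}

def top_k(lexeme, k):
    # a crude illustrative rank: more (and earlier) occurrences score higher
    postings = posting_list.get(lexeme, [])
    scored = [(len(pos) / (1 + min(pos)), doc) for doc, pos in postings]
    return [doc for score, doc in sorted(scored, reverse=True)[:k]]

print(top_k('titl', 2))   # [205, 101]
```

Because the positions live next to the doc ids, the top-k answer needs no visit to the table pages; this is exactly what makes the RUM plans below index-only for ordering.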
10. CREATE INDEX ... USING RUM
● Use positions to calculate rank and order results
● Introduce distance operator tsvector <=> tsquery
CREATE INDEX ti2_rum_fts_idx ON ti2 USING rum(text_vector rum_tsvector_ops);
SELECT docid, ts_rank(text_vector, to_tsquery('english', 'title')) AS rank
FROM ti2
WHERE text_vector @@ to_tsquery('english', 'title')
ORDER BY
text_vector <=> plainto_tsquery('english','title') LIMIT 3;
QUERY PLAN
----------------------------------------------------------------------------------------
Limit (actual time=54.676..54.735 rows=3 loops=1)
Buffers: shared hit=355
-> Index Scan using ti2_rum_fts_idx on ti2 (actual time=54.675..54.733 rows=3 loops=1)
Index Cond: (text_vector @@ '''titl'''::tsquery)
Order By: (text_vector <=> '''titl'''::tsquery)
Buffers: shared hit=355
Planning time: 0.225 ms
Execution time: 54.775 ms VS 476 ms !
(8 rows)
11. CREATE INDEX ... USING RUM
● Top-10 (out of 222813) postings with «Tom Lane»
• GIN index — 1374.772 ms
SELECT subject, ts_rank(fts,plainto_tsquery('english', 'tom lane')) AS rank
FROM pglist WHERE fts @@ plainto_tsquery('english', 'tom lane')
ORDER BY rank DESC LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------------
Limit (actual time=1374.277..1374.278 rows=10 loops=1)
-> Sort (actual time=1374.276..1374.276 rows=10 loops=1)
Sort Key: (ts_rank(fts, '''tom'' & ''lane'''::tsquery)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on pglist (actual time=98.413..1330.994 rows=222813 loops=1)
Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
Heap Blocks: exact=105992
-> Bitmap Index Scan on pglist_gin_idx (actual time=65.712..65.712
rows=222813 loops=1)
Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
Planning time: 0.287 ms
Execution time: 1374.772 ms
(11 rows)
12. CREATE INDEX ... USING RUM
● Top-10 (out of 222813) postings with «Tom Lane»
• RUM index — 216 ms vs 1374 ms !!!
create index pglist_rum_fts_idx on pglist using rum(fts rum_tsvector_ops);
SELECT subject FROM pglist WHERE fts @@ plainto_tsquery('tom lane')
ORDER BY fts <=> plainto_tsquery('tom lane') LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------
Limit (actual time=215.115..215.185 rows=10 loops=1)
-> Index Scan using pglist_rum_fts_idx on pglist (actual time=215.113..215.183 rows=10 lo
Index Cond: (fts @@ plainto_tsquery('tom lane'::text))
Order By: (fts <=> plainto_tsquery('tom lane'::text))
Planning time: 0.264 ms
Execution time: 215.833 ms
(6 rows)
13. CREATE INDEX ... USING RUM
● RUM uses a new ranking function (ts_score) —
a combination of ts_rank and ts_rank_cd
• ts_rank doesn't support logical operators
• ts_rank_cd works poorly with OR queries
SELECT ts_rank(fts,plainto_tsquery('english', 'tom lane')) AS rank,
ts_rank_cd (fts,plainto_tsquery('english', 'tom lane')) AS rank_cd ,
fts <=> plainto_tsquery('english', 'tom lane') as score, subject
FROM pglist WHERE fts @@ plainto_tsquery('english', 'tom lane')
ORDER BY fts <=> plainto_tsquery('english', 'tom lane') LIMIT 10;
rank | rank_cd | score | subject
----------+---------+----------+------------------------------------------------------------
0.999637 | 2.02857 | 0.487904 | Re: ATTN: Tom Lane
0.999224 | 1.97143 | 0.492074 | Re: Bug #866 related problem (ATTN Tom Lane)
0.99798 | 1.97143 | 0.492074 | Tom Lane
0.996653 | 1.57143 | 0.523388 | happy birthday Tom Lane ...
0.999697 | 2.18825 | 0.570404 | For Tom Lane
0.999638 | 2.12208 | 0.571455 | Re: Favorite Tom Lane quotes
0.999188 | 1.68571 | 0.593533 | Re: disallow LOCK on a view - the Tom Lane remix
0.999188 | 1.68571 | 0.593533 | Re: disallow LOCK on a view - the Tom Lane remix
0.999188 | 1.68571 | 0.593533 | Re: disallow LOCK on a view - the Tom Lane remix
0.999188 | 1.68571 | 0.593533 | Re: [HACKERS] disallow LOCK on a view - the Tom Lane remix
(10 rows)
14. Phrase Search ( 8 years old!)
● Queries 'A & B'::tsquery and 'B & A'::tsquery
produce the same result
● Phrase search preserves the order of words in a
query: results for 'A & B' and 'B & A' should be
different!
● Introduce a new FOLLOWED BY (<->) operator:
• Guarantees the order of operands
• Specifies the distance between operands
a <n> b == a & b & (∃ i,j : pos(b)_i - pos(a)_j = n)
15. Phrase search - definition
● FOLLOWED BY operator returns:
• false
• true and array of positions of the right
operand, which satisfy distance condition
● FOLLOWED BY operator requires positions
select 'a b c'::tsvector @@ 'a <-> b'::tsquery; – false: there are no positions
?column?
----------
f
(1 row)
select 'a:1 b:2 c'::tsvector @@ 'a <-> b'::tsquery;
?column?
----------
t
(1 row)
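The distance condition can be sketched as follows (a hypothetical helper for illustration, not the PostgreSQL source): 'a <n> b' is true iff some position of b lies exactly n to the right of some position of a, and the matching positions of the right operand are returned.

```python
def followed_by(pos_a, pos_b, n=1):
    # positions of b that are exactly n to the right of some position of a
    return [j for j in pos_b if any(j - i == n for i in pos_a)]

# 'a:1 b:2 c'::tsvector @@ 'a <-> b'  ->  b at position 2 matches
print(followed_by([1], [2], 1))   # [2]
# 'a b c'::tsvector carries no positions  ->  nothing can match
print(followed_by([], [], 1))     # []
```

Returning the surviving positions (not just true/false) is what lets nested <-> operators be chained, as the push-down transformation below requires.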
16. Phrase search - properties
● 'A <-> B' = 'A<1>B'
● 'A <0> B' matches a word indexed with two
different forms at the same position (e.g., infinitives)
=# SELECT ts_lexize('ispell','bookings');
ts_lexize
----------------
{booking,book}
to_tsvector('bookings') @@ 'booking <0> book'::tsquery
17. Phrase search - properties
● Precedence of tsquery operators: '! <-> & |'
Use parentheses to control nesting in tsquery
select 'a & b <-> c'::tsquery;
tsquery
-------------------
'a' & 'b' <-> 'c'
select 'b <-> c & a'::tsquery;
tsquery
-------------------
'b' <-> 'c' & 'a'
select 'b <-> (c & a)'::tsquery;
tsquery
---------------------------
'b' <-> 'c' & 'b' <-> 'a'
18. Phrase search - example
● TSQUERY phraseto_tsquery([CFG,] TEXT)
Stop words are taken into account.
● It’s possible to combine tsqueries
select phraseto_tsquery('PostgreSQL can be extended by the user in many ways');
phraseto_tsquery
-----------------------------------------------------------
'postgresql' <3> 'extend' <3> 'user' <2> 'mani' <-> 'way'
(1 row)
select phraseto_tsquery('PostgreSQL can be extended by the user in many ways') ||
to_tsquery('oho<->ho & ik');
?column?
-----------------------------------------------------------------------------------
'postgresql' <3> 'extend' <3> 'user' <2> 'mani' <-> 'way' | 'oho' <-> 'ho' & 'ik'
(1 row)
19. Phrase search - internals
● Phrase search has overhead, since it requires access
to, and operations on, posting lists
( (A <-> B) <-> (C | D) ) & F
● We want to avoid slowing down the FTS
operators (& |), which do not need
positions.
● Rewrite the query so that all <-> operators are pushed down in the query
tree, and call the phrase executor for the top <-> operator.
[diagram: query tree of ( (A <-> B) <-> (C | D) ) & F]
20. Phrase search - transformation
( (A <-> B) <-> (C | D) ) & F
[diagrams: the regular query tree, and the rewritten phrase tree with <->
pushed down; "Phrase top" marks the topmost <-> operator]
Result:
( 'A' <-> 'B' <-> 'C' | 'A' <-> 'B' <-> 'D' ) & 'F'
21. Phrase search - push down
a <-> (b&c) => a<->b & a<->c
(a&b) <-> c => a<->c & b<->c
a <-> (b|c) => a<->b | a<->c
(a|b) <-> c => a<->c | b<->c
a <-> !b => a & !(a<->b)
there is no position of A followed by B
!a <-> b => !(a<->b) & b
there is no position of B preceded by A
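The distributive rules over & and | above can be sketched on a tiny tuple-based model of the tsquery tree. This is an illustration only: the negation rules are omitted, and this is not PostgreSQL's implementation.

```python
# a node is a lexeme string, or a tuple (op, left, right) with op in '&', '|', '<->'
def push_down(node):
    if isinstance(node, str):
        return node
    op, l, r = node
    l, r = push_down(l), push_down(r)
    if op == '<->':
        if isinstance(r, tuple) and r[0] in ('&', '|'):
            # a <-> (b OP c)  =>  (a <-> b) OP (a <-> c)
            return (r[0], push_down(('<->', l, r[1])),
                          push_down(('<->', l, r[2])))
        if isinstance(l, tuple) and l[0] in ('&', '|'):
            # (a OP b) <-> c  =>  (a <-> c) OP (b <-> c)
            return (l[0], push_down(('<->', l[1], r)),
                          push_down(('<->', l[2], r)))
    return (op, l, r)

# a <-> (b & c)  =>  (a <-> b) & (a <-> c)
print(push_down(('<->', 'a', ('&', 'b', 'c'))))
```

Applied to '( A | B ) <-> ( D | C )', this reproduces the four-way OR shown on the next slide.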
22. Phrase search - transformation
# select '( A | B ) <-> ( D | C )'::tsquery;
tsquery
-----------------------------------------------
'A' <-> 'D' | 'B' <-> 'D' | 'A' <-> 'C' | 'B' <-> 'C'
# select 'A <-> ( B & ( C | ! D ) )'::tsquery;
tsquery
-------------------------------------------------------
'A' <-> 'B' & ( 'A' <-> 'C' | 'A' & !( 'A' <-> 'D' ) )
23. Phrase search
1.1 mln postings (postgres mailing lists)
● Phrase search has overhead
select count(*) from pglist where fts @@ to_tsquery('english','tom <-> lane');
count
--------
222777
(1 row)
                | <-> (s) | & (s)
----------------+---------+-------
Sequential Scan |   2.6   | 2.2
GIN index       |   1.1   | 0.48   – significant overhead
RUM index       |   0.5   | 0.48   – solves the problem!
24. Some FTS problems: #3
● Slow FTS with ordering by timestamp («fresh» results)
● Store timestamps in additional information in timestamp order !
● Index Scan by RUM (fts, sent)
● Limit
● 0.08 ms vs 10 ms !
create index pglist_fts_ts_order_rum_idx on pglist using
rum(fts rum_tsvector_timestamp_ops, sent) WITH (attach =
'sent', to ='fts', order_by_attach = 't');
select sent, subject from pglist
where fts @@ to_tsquery('server & crashed')
order by sent <=| '2000-01-01'::timestamp limit 5;
25. RUM vs GIN
● 6 mln classifieds, real FTS queries, concurrency 24, duration
1 hour
• GIN — 258087 qph
• RUM — 1885698 qph ( 7x speedup )
● RUM has no pending list (not implemented) and stores
more data.
Insert 1 mln messages shows no significant overhead:
Time(min): GiST(10), GIN(10), GIN_no_fast(21), RUM(34)
WAL(GB): GiST(3.5), GIN(7.5), GIN_no_fast(24), RUM(29)
26. RUM vs GIN
● CREATE INDEX
• GENERIC WAL (9.6) generates too much WAL traffic
[diagram: a page with used and free space; on insert, the whole new data
region goes to the generic WAL]
27. Inverse FTS (FQS)
● Find queries, which match given document
● Automatic text classification
SELECT * FROM queries;
q | tag
-----------------------------------+-------
'supernova' & 'star' | sn
'black' | color
'big' & 'bang' & 'black' & 'hole' | bang
'spiral' & 'galaxi' | shape
'black' & 'hole' | color
(5 rows)
SELECT * FROM queries WHERE
to_tsvector('black holes never exists before we think about them')
@@ q;
q | tag
------------------+-------
'black' | color
'black' & 'hole' | color
(2 rows)
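The matching step of this inverse search can be sketched as a boolean evaluation of each stored query against the document's lexeme set (illustrative Python only; it does not show how the RUM index accelerates the lookup).

```python
def matches(query, lexemes):
    # query: a lexeme string, ('!', sub), or ('&'|'|', left, right)
    if isinstance(query, str):
        return query in lexemes
    if query[0] == '!':
        return not matches(query[1], lexemes)
    op, left, right = query
    l, r = matches(left, lexemes), matches(right, lexemes)
    return (l and r) if op == '&' else (l or r)

# lexeme set of "black holes never exists before we think about them"
doc = {'black', 'hole', 'never', 'exist', 'think'}
queries = {
    ('&', 'supernova', 'star'): 'sn',
    'black': 'color',
    ('&', 'black', 'hole'): 'color',
}
print([tag for q, tag in queries.items() if matches(q, doc)])  # ['color', 'color']
```

The RUM trick above is to index the stored queries themselves (branches of each query tree go into the addinfo), so only candidate queries are evaluated this way.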
28. Inverse FTS (FQS)
● RUM index supported – store branches of query tree in addinfo
Find queries for the first message in postgres mailing lists
\d pg_query
Table "public.pg_query"
Column | Type | Modifiers
--------+---------+-----------
q | tsquery |
count | integer |
Indexes:
"pg_query_rum_idx" rum (q) 33818 queries
select q from pg_query pgq, pglist where q @@ pglist.fts and pglist.id=1;
q
--------------------------
'one' & 'one'
'postgresql' & 'freebsd'
(2 rows)
29. Inverse FTS (FQS)
● RUM index supported – store branches of query tree in addinfo
Find queries for the first message in postgres mailing lists
create index pg_query_rum_idx on pg_query using rum(q);
select q from pg_query pgq, pglist where q @@ pglist.fts and pglist.id=1;
QUERY PLAN
--------------------------------------------------------------------------
Nested Loop (actual time=0.719..0.721 rows=2 loops=1)
-> Index Scan using pglist_id_idx on pglist
(actual time=0.013..0.013 rows=1 loops=1)
Index Cond: (id = 1)
-> Bitmap Heap Scan on pg_query pgq
(actual time=0.702..0.704 rows=2 loops=1)
Recheck Cond: (q @@ pglist.fts)
Heap Blocks: exact=2
-> Bitmap Index Scan on pg_query_rum_idx
(actual time=0.699..0.699 rows=2 loops=1)
Index Cond: (q @@ pglist.fts)
Planning time: 0.212 ms
Execution time: 0.759 ms
(10 rows)
30. Inverse FTS (FQS)
● RUM index supported – store branches of query tree in addinfo
Monstrous postings
select id, t.subject, count(*) as cnt into pglist_q from pg_query,
(select id, fts, subject from pglist) t where t.fts @@ q
group by id, subject order by cnt desc limit 1000;
select * from pglist_q order by cnt desc limit 5;
id | subject | cnt
--------+-----------------------------------------------+------
248443 | Packages patch | 4472
282668 | Re: release.sgml, minor pg_autovacuum changes | 4184
282512 | Re: release.sgml, minor pg_autovacuum changes | 4151
282481 | release.sgml, minor pg_autovacuum changes | 4104
243465 | Re: [HACKERS] Re: Release notes | 3989
(5 rows)
31. RUM vs GIN
● CREATE INDEX
• GENERIC WAL (9.6) generates too much WAL traffic.
It currently doesn't support shift.
rum(fts, ts+order) generates 186 GB of WAL!
• RUM writes WAL AFTER creating the index
+------------+---------+-------+----------+-----------+------------------+
|            |  table  |  gin  | rum(fts) |rum(fts,ts)|rum(fts,ts+order) |
+------------+---------+-------+----------+-----------+------------------+
|Create time |         | 147 s |   201    |   209     |   215            |
|Size (MB)   |2167/1302|  534  |   980    |  1531     |  1921            |
|WAL (GB)    |         |  0.9  |  0.68    |   1.1     |   1.5            |
+------------+---------+-------+----------+-----------+------------------+
32. RUM Todo
● Allow multiple additional info
(lexemes positions + timestamp)
● Add support for arrays
● Improve ranking function to support TF/IDF
● Improve insert time (pending list ?)
● Improve GENERIC WAL to support shift
Availability:
● 9.6+ only: https://github.com/postgrespro/rum
33. Better FTS configurability
● The problem
• Searching a multilingual collection requires processing by several
language-specific dictionaries. Currently, the logic of processing is
hidden from the user, and the first example below wouldn't work.
● Logic of tokens processing in FTS configuration
• Example: German-English collection
ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH unaccent THEN (german_ispell AND english_ispell) OR simple;
ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH unaccent, german_ispell, english_ispell, simple;
34. Some FTS problems #4
● Working with dictionaries can be difficult and slow
● Installing dictionaries can be complicated
● Dictionaries are loaded into memory for every session
(slow first query symptom) and eat memory.
time for i in {1..10}; do echo $i; psql postgres -c "select
ts_lexize('english_hunspell', 'evening')" > /dev/null; done
1
2
3
4
5
6
7
8
9
10
real 0m0.656s
user 0m0.015s
sys 0m0.031s
For the russian hunspell dictionary:
real 0m3.809s
user 0m0.015s
sys 0m0.029s
Each session «eats» 20MB of RAM !
35. Dictionaries in shared memory
● Now it's easy (Artur Zakirov, Postgres Professional + Tomas
Vondra)
https://github.com/postgrespro/shared_ispell
CREATE EXTENSION shared_ispell;
CREATE TEXT SEARCH DICTIONARY english_shared (
TEMPLATE = shared_ispell,
DictFile = en_us,
AffFile = en_us,
StopWords = english
);
CREATE TEXT SEARCH DICTIONARY russian_shared (
TEMPLATE = shared_ispell,
DictFile = ru_ru,
AffFile = ru_ru,
StopWords = russian
);
time for i in {1..10}; do echo $i; psql postgres -c "select ts_lexize('russian_shared', 'туши')" > /dev/null; done
1
2
…..
10
real 0m0.170s        VS   real 0m3.809s
user 0m0.015s             user 0m0.015s
sys  0m0.027s             sys  0m0.029s
40. FTS demo
● How to configure and use FTS with PostgreSQL?
https://github.com/select-artur/apod_fts
ToDo:
● Example of fixing misspelled user queries
● Query suggestions for users