OSA Con 2022: Building Event Collection SDKs and Data Models
Paul Boocock - Snowplow
In this talk we'll go through how we have designed and built over 20 different SDKs to collect events from all sorts of applications (from web & mobile to IoT to server-side), allowing users to collect a rich event stream of data. Then we'll dive into, and demonstrate, the cross-warehouse downstream data models which aggregate the event stream into easy-to-consume data products for analytics, AI, composable CDP, recommendation engines, and many other use cases.
Last year we decided to build an in-house solution for funnel analysis that would be accessible to our business users through our BI tool. The backend runs on Apache Spark, and since the BI tool can only run SQL queries, the solution is a pure Spark SQL implementation of funnel analysis. In this talk we will cover the Spark SQL features we used to optimize query performance and implement the filters that enable end users to get actionable insights. KEY TAKEAWAYS: – a single-query approach to funnel analysis (applicable to any funnel-like problem) – using window functions to ensure ordering of the events in the funnel – examples of higher-order functions to calculate funnel metrics
Designing The Right Schema To Power Heap (PGConf Silicon Valley 2016) - Dan Robinson
Heap's analytics infrastructure is built around PostgreSQL. The most important choice to make when building a system this way is the schema you'll use to represent your data. This foundation will determine your write throughput, what sorts of read queries will be fast, what indexing strategies will be available to you, and what data inconsistencies will be possible. With the wrong choice, you won't be able to leverage PostgreSQL's most powerful features.
This talk walks through the different schemas we've used to power Heap over the last three years, their relative strengths and weaknesses, and the mistakes we've made.
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015) - Dan Robinson
At Heap, we lean on PostgreSQL for all our backend heavy lifting. We support an expressive set of queries — conversion funnels with filtering and grouping, retention analysis, and behavioral cohorting to name a few — across billions of users and tens of billions of events. Results need to come back in a matter of seconds and reflect up-to-the-minute data.
This talk will discuss these challenges, with a particular focus on:
- Using CitusDB for interactive analysis across 50 terabytes of data and counting.
- PostgreSQL and Kafka: two great tastes that taste great together.
- UDFs in C and PL/pgSQL, partial indexes for pre-aggregation, and other tricks up our sleeves.
We have all been deceived! Automatic memory management will solve all your problems, they said. In managed environments such as the CLR and the JVM there will be no memory leaks, they said! Besides, memory is cheap and you never have to worry about it again. They all lied. Automatic memory management is a convenient abstraction, and very often it works well. But like every abstraction, sooner or later it "leaks", and usually at the least expected and least pleasant moment. In this session I will try to open your eyes to the fact that blissful ignorance of this abstraction can be costly. I will show how a careless attitude to memory can manifest itself, and what we can gain by writing code with the awareness that memory is not, after all, infinite, cheap, and always equally fast.
Beyond PHP - It's not (just) about the code - Wim Godden
Most PHP developers focus on writing code. But creating web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Mixpanel is the most advanced analytics platform for mobile & web. Instead of measuring pageviews, it helps you analyze the actions people take in your application. An action can be anything: someone uploading a picture, playing a video, or sharing a post, for example.
A story about developing an application for an online store, persisting all the data as JSON.
Gives an overview of JSON functionality in Oracle Database 19c.
Snowplow: evolve your analytics stack with your business - yalisassoon
A deep dive into how digital analytics stacks need to evolve with businesses, and how self-describing data and event data modeling are the key elements that enable Snowplow data pipelines to elegantly evolve over time.
[WSO2Con EU 2018] Patterns for Building Streaming Apps - WSO2
This presentation explains how to enable digital transformation through streaming analytics and how easily streaming applications can be implemented. We look at the following:
- The Architecture of WSO2 Stream Processor
- Understanding streaming constructs
- Patterns of processing data in real-time, incrementally and with intelligence
- Applying patterns when building streaming apps
- Deployment patterns
At Onebip we developed a reporting system based on CQRS (Command Query Responsibility Segregation) and Event Sourcing using MongoDB.
In this talk I will introduce CQRS and Event Sourcing concepts, talk about our path and the technical and conceptual challenges we faced, the strengths of our solution, and the parts where there's room for improvement.
This talk is focused on tuning, analysing, and optimizing MongoDB queries and indexes with the Database Profiler and the "explain()" function.
Database performance can also be improved by configuring the underlying (Linux) OS with some recommended settings that do not come by default.
Streaming Analytics for Financial Enterprises - Databricks
Streaming Analytics (or Fast Data processing) is becoming an increasingly popular subject in the financial sector. There are two main reasons for this development. First, more and more data has to be analyzed in real time to prevent fraud; all transactions being processed by banks have to pass an ever-growing number of tests to make sure that the money is coming from and going to legitimate sources. Second, customers want frictionless mobile experiences while managing their money, such as immediate notifications and personal advice based on their online behavior and other users’ actions.
A typical streaming analytics solution follows a ‘pipes and filters’ pattern that consists of three main steps: detecting patterns on raw event data (Complex Event Processing), evaluating the outcomes with the aid of business rules and machine learning algorithms, and deciding on the next action. At the core of this architecture is the execution of predictive models that operate on enormous amounts of never-ending data streams.
In this talk, I’ll present an architecture for streaming analytics solutions that covers many use cases that follow this pattern: actionable insights, fraud detection, log parsing, traffic analysis, factory data, the IoT, and others. I’ll go through a few architecture challenges that arise when dealing with streaming data, such as latency issues, event time vs server time, and exactly-once processing. The solution is built on the KISSS stack: Kafka, Ignite, and Spark Structured Streaming. The solution is open source and available on GitHub.
PyCon SG x Jublia - Building a simple-to-use Database Management tool - Crea Very
Jublia is a business matching SaaS solution for the events industry. A common challenge faced by us and this industry is managing different dynamic datasets of participant databases. That’s why we built Jublia DATASYNC, an easy-to-use data management tool via Google Sheets. I will be sharing how you can also build simple-to-use tools for techies and non-techies alike to manage databases.
Introduction to Segment, Analytics API and Customer Data Platform. (Demo: Segment, AWS Redshift, Redash, Segment and GTM Alternatives) (Frontend Fighters Edition)
Recommended links:
https://segment.com/ - Analytics API and Customer Data Platform
https://open.segment.com/ - Open Source Projects of Segment
https://segment.com/docs/ - Documentation of Segment
https://redash.io/ - Open Source Data Dashboard
https://aws.amazon.com/redshift/ - Data Warehouse Solution
https://quicksight.aws/ - Business Analytics Service
https://www.ghostery.com/ - Tracker Detector
Keywords: business agility, tag managers, data-driven
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptx - Altinity Ltd
Building an Analytic Extension to MySQL with ClickHouse and Open Source
In this webinar Percona and Altinity offer suggestions and tips on how to recognize when MySQL is overburdened with analytics and can benefit from ClickHouse’s unique capabilities.
Also, they will walk you through important patterns for integrating MySQL and ClickHouse which will enable the building of powerful and cost-efficient applications that leverage the strengths of both databases.
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022... - Altinity Ltd
Over the last few years Kubernetes has transitioned from an object of curiosity and fear to a robust platform for big data. Watch this webinar and you will learn how the Altinity Kubernetes Operator for ClickHouse enables users to run high-performance analytics on ClickHouse. You will see a simple installation and learn how to scale it into a cluster that can analyze hundreds of terabytes of data. Along the way we’ll share our lessons from ClickHouse on Kubernetes in Altinity.Cloud, which we built using the Altinity Operator and where we now run hundreds of clusters. You can too!
More Related Content
Similar to OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock - Snowplow.pdf
Building an Analytic Extension to MySQL with ClickHouse and Open Source - Altinity Ltd
This is a joint Percona-Altinity webinar.
In this webinar we will discuss suggestions and tips on how to recognize when MySQL is overburdened with analytics and can benefit from ClickHouse’s unique capabilities.
We will then walk through important patterns for integrating MySQL and ClickHouse which will enable the building of powerful and cost-efficient applications that leverage the strengths of both databases.
Fun with ClickHouse Window Functions-2021-08-19.pdf - Altinity Ltd
Fun with ClickHouse Window Functions | Altinity Webinar
Window functions have arrived in ClickHouse!
Our webinar will start with an introduction to standard window function syntax and show how it is implemented in ClickHouse. We’ll next show you problems that you can now solve easily using window functions. Finally, we’ll compare window functions to arrays, another powerful ClickHouse feature.
There will be time for questions with our SQL experts.
Join us for a complete overview of this long-awaited feature!
Speakers:
Robert Hodges, CEO @Altinity
Vitaliy Zakaznikov, QA Manager and Architect @Altinity
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf - Altinity Ltd
Cloud Native Data Warehouses: A Gentle Introduction to Running ClickHouse on Kubernetes | Altinity Webinar
Kubernetes is a powerful platform for big data and is particularly well-suited for ClickHouse.
If you have been wondering about trying Kubernetes, this webinar is for you. The first half introduces Kubernetes basics, building up to operators, which manage cloud-native applications. The second half focuses on ClickHouse and shows how to deploy data warehouses using the ClickHouse Operator. You’ll learn everything you need to start grappling with big data on Kubernetes.
Speaker: Robert Hodges, CEO @Altinity
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A... - Altinity Ltd
Altinity Stable Builds offer a ClickHouse distribution that is ready for production use and with 3 years of maintenance. Our webinar introduces the special features of Stable Builds and describes how we build them from ClickHouse Long-Term Support (LTS) releases. We’ll show you how to find them and install them yourself, then guide you through the important topic of upgrading. We’ll also walk through how to use Altinity Stable Builds in Altinity.Cloud, our managed ClickHouse platform for high-performance analytics.
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo... - Altinity Ltd
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHouse Webinar Slides
Monitoring is the key to the successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open-source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast-growing, high-performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Presented by:
Roman Khavronenko, Co-Founder at VictoriaMetrics
Robert Hodges, CEO at Altinity
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf - Altinity Ltd
Altinity.Cloud is a managed ClickHouse platform for high-performance analytics.
But what if you want to run ClickHouse in your own cloud account? Altinity.Cloud Anywhere does exactly that.
In this webinar, we’ll explain how Altinity.Cloud Anywhere works, then walk through the simple setup procedure to get full cloud management of ClickHouse clusters in your VPCs. This webinar teaches you how to have cloud management for your real-time analytic stack while meeting requirements for compliance, control of data, and freedom from lock-in. Have your cake and eat it too!
ClickHouse ReplacingMergeTree in Telecom Apps - Altinity Ltd
Alexandr Dubovikov of QXIP explains how to use ClickHouse ReplacingMergeTree engine for an important Telecom use case: tracking state of calls from incoming call detail records aka CDRs. (https://www.meetup.com/san-francisco-bay-area-clickhouse-meetup/events/289605843/)
Adventures with the ClickHouse ReplacingMergeTree Engine - Altinity Ltd
Presentation on ReplacingMergeTree by Robert Hodges of Altinity at the 14 December 2022 SF Bay Area ClickHouse Meetup (https://www.meetup.com/san-francisco-bay-area-clickhouse-meetup/events/289605843/)
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot - Altinity Ltd
While the demands for real-time analytics are growing in leaps and bounds, the analytics software must rely on streaming platforms for ingesting high volumes of data traveling at lightning speed down the pipeline. We will take a look at two powerful open-source Apache platforms, Pulsar and Pinot, which work hand in hand to deliver the analytical results that bring great value to your systems.
Presenters: Mary Grygleski - Streaming Developer Advocate &
Mark Needham - Developer Relations Engineer at StarTree
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf - Altinity Ltd
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data - Presentation Slides
Altinity.Cloud is a fully automated cloud service for ClickHouse that is optimized for real-time analytics.
In this webinar, we’ll explain how Altinity.Cloud works, then show how to set up your first ClickHouse cluster. We’ll then tour important features like scale-up, scale-out, uptime schedules, and DBA tools to analyze your tables.
You’ll learn everything necessary to start working on real-time analytics today.
Bring your questions!
Presenters: Robert Hodges & Alexander Zaitsev
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe... - Altinity Ltd
OSA Con 2022: What Data Engineering Can Learn from Frontend Engineering
Pete Hunt - Elementl
Frontend engineering went through a revolution in the last decade. I'll recap what happened, and how a similar revolution started in data engineering.
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf - Altinity Ltd
OSA Con 2022: Welcome to OSA CON Version 2022
Robert Hodges - Altinity
Join us as we guide you through the conference and highlight the many presenters who are contributing talks.
We'll also include a few tips about how to use the conference platform.
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga... - Altinity Ltd
OSA Con 2022: Using ClickHouse Database to Power Analytics and Customer Engagement Platform
Prafulla Gupta - Times Internet
This talk covers how we empowered Product Managers and Editors at Times Internet by developing an in-house product, GrowthRx, using the open-source ClickHouse database to track and analyze user behavior to increase user retention and customer engagement. Times Internet is India's largest digital news publisher, managing leading brands like Times of India, Economic Times, Navbharat Times, etc., and tracking more than 10 billion events per month in ClickHouse.
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou... - Altinity Ltd
OSA Con 2022: Tips and Tricks to Keep Your Queries under 100ms with ClickHouse
Javi Santana - Tinybird
ClickHouse is fast as hell by default, but when you want to query a 1B-row table with latency under 100ms without spending huge amounts of money on hardware, you need to follow some simple rules.
The talk is a bunch of small tricks we learned over 4 years working with ClickHouse.
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge... - Altinity Ltd
OSA Con 2022: The Open Source Analytic Universe, Version 2022
Robert Hodges - Altinity
Every generation builds new cathedrals. For many of us, this means implementing analytic applications built on a foundation of open source.
We'll survey developments in analytics since the last OSA Con and highlight new technologies that developers should be watching as we head into the mid-2020s.
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A... - Altinity Ltd
OSA Con 2022: Switching Jaeger Distributed Tracing to ClickHouse to Enable Advanced Performance Management
Satbir Chahal - OpsVerse
Our team switched our Jaeger (open source project used for distributed tracing) storage backend to ClickHouse (from Cassandra), which opened the door to a world of advanced analytics that we can run and provide our users. This talk will describe the journey from the switch, the learning curve, the challenges, and the eventual wins.
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St... - Altinity Ltd
OSA Con 2022: Streaming Data Made Easy
Tim Spann & David Kjerrumgaard - StreamNative
Click into new streaming applications the easy way with Apache Pulsar, Clickhouse, and Open Source. A quick introduction to how to build modern data streaming applications.
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf - Altinity Ltd
OSA Con 2022 - State of Open Source Databases
Peter Zaitsev - Percona
It has been an exciting year in the open-source database industry, with more choices, more cloud, and key changes in the industry. We will dive into the key developments over 2022, including the most important open-source database software releases in general, the significance of cloud-native solutions in a multi-vendor multi-cloud world, the new criticality of security challenges, and the evolution of the open-source software industry.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
- Why do we need yet another (open-source) Copilot?
- How can we build one?
- Architecture and evaluation
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
2. Table of contents
_01 What is Snowplow? A Quick Introduction
_02 Tracking SDKs: How we build them and our decisions
_03 Data Models: Considerations for modeling the raw data
2022 www.snowplow.io
6. But first, a small detour…
www.snowplow.io

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "JSON schema for a button click event",
  "self": {
    "vendor": "com.acme",
    "name": "click",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "button": {
      "type": ["string", "null"],
      "maxLength": 255
    }
  },
  "additionalProperties": false
}
7. www.snowplow.io
The data created looks a bit like this:
Default fields: user_id, page_url, timestamp, device, city, mkt_campaign
Custom entity: content_id, content_title, author, date_created
Custom event: button_click
9. www.snowplow.io
The real benefit of that approach is how the data then looks:

Default fields (1/130)    Custom event data
event_name                unstruct_event_com_acme_click_1
click                     {"button": "open_article"}
click                     {"button": "open_article", "index": "2"}
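As a rough illustration of querying that layout (a sketch only: the Snowflake-style JSON access and the atomic.events table name are assumptions here, and the exact column type varies by warehouse), the custom property comes straight back out with ordinary SQL:

-- minimal sketch, assuming a Snowflake-style warehouse where the custom
-- event lands in its own typed column beside the ~130 default fields
select
    event_name,
    unstruct_event_com_acme_click_1:button::varchar as button
from atomic.events
where event_name = 'click'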
13. Multi-language Support
1. We quickly developed beyond a web-only platform. The Web Tracker is unique, as the web poses different challenges than other platforms. So, what do we do?
2. It is tempting to auto-generate trackers:
- Works well for some trackers (server)
- Less well for others (client)
https://quicktype.io/?
www.snowplow.io
19. Sensible defaults
Whilst the SDKs are incredibly configurable, opting for sensible defaults keeps users happy.
Where possible we now try to be more opinionated and don’t offer configuration.
Works well until someone asks us if they can configure it on GitHub!
www.snowplow.io
21. Data modeling is the process of using business logic to aggregate or otherwise transform raw data.
www.snowplow.io
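For instance, "aggregate or otherwise transform" can be as small as rolling raw events up into daily counts. A toy dbt model (the event_name and derived_tstamp columns follow the slides; everything else is illustrative):

-- business logic (what counts as a page view, which timestamp to trust)
-- applied to raw events to produce an aggregate table
select
    date_trunc('day', derived_tstamp) as day,
    count(*) as page_views
from {{ ref('events') }}
where event_name = 'page_view'
group by 1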
22. Modeling is not an afterthought
1. Process clickstream data from raw events to create aggregate tables of views, sessions and users, reducing the volume of data massively while adding quality
2. Deal with user stitching across sessions and devices
3. Compatible with schema evolution, so when you add new fields to your contexts they immediately show up in your data models
Rows relative to raw events: views 20%, sessions 6%, users 3%
www.snowplow.io
23. This end-to-end approach results in:
● Clean, structured data always available in your warehouse
● Consistent business logic defined once, meaning no qualms about what metrics mean
● More time spent building out ML/AI models instead of toiling with data problems
www.snowplow.io
28. Consolidation
Similarities between queries:
● Common levels of aggregation
● Repeated logic, like joins
Consolidate ad-hoc queries into a set of derived tables. These generalised tables can be used to solve a variety of use cases (a sketch follows below).
Rows relative to raw events: page views 20%, sessions 6%, users 3%
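A sketch of one such derived table, with the session logic defined once and reused everywhere (assumed columns, building on the page_views model shown later):

-- sessions as a generalised derived table rolled up from page views
select
    session_id,
    min(derived_tstamp) as session_start,
    max(derived_tstamp) as session_end,
    count(*) as page_views_in_session
from {{ ref('page_views') }}
group by session_id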
30. One Step Further - Incremental Models
As the business continues to grow:
● The size of the events table grows.
● Behavioural data is more business critical and needs to be processed more regularly.
Meaning:
● Running the derived tables in a drop-and-recompute manner is becoming increasingly costly and taking considerable time.
● Processing events in an incremental manner becomes a necessity.
31. Page views - Incremental

Drop & Recompute:

#page_views.sql
{{ config(
    materialized='table'
) }}

select
    page_view_id,
    ...
    row_number() over (
        partition by session_id
        order by derived_tstamp
    ) as page_view_in_session_index
from {{ ref('events') }}
where event_name = 'page_view'

Incremental:

#page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and derived_tstamp > (
            select max(derived_tstamp) from {{ this }}
        )
    {% endif %}
)

select
    e.page_view_id,
    ...
    row_number() over (
        partition by e.session_id
        order by e.derived_tstamp
    ) as page_view_in_session_index
from {{ ref('events') }} e
inner join sessions_with_new_events s
    on e.session_id = s.session_id
where e.event_name = 'page_view'
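Worth noting (standard dbt behaviour rather than anything specific to this model): on the first run, or when invoked with dbt run --full-refresh, is_incremental() evaluates to false and the table is rebuilt from scratch; on subsequent runs only sessions containing new events are recomputed and upserted on the unique_key.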
32. www.snowplow.io
Process the least data possible

#page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and derived_tstamp > (
            select max(derived_tstamp) from {{ this }}
        )
    {% endif %}
)

Reduce the amount of data by:
● Ensure you filter on the partition/sort keys of the source
33. www.snowplow.io
Process the least data possible

#page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and collector_tstamp > (
            select max(collector_tstamp) from {{ this }}
        )
    {% endif %}
)

Reduce the amount of data by:
● Ensure you filter on the partition/sort keys of the source
34. www.snowplow.io
Process the least data possible

#page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select ...
)

select
    e.page_view_id,
    ...
    row_number() over (
        partition by e.session_id
        order by e.derived_tstamp
    ) as page_view_in_session_index
from {{ ref('events') }} e
inner join sessions_with_new_events s
    on e.session_id = s.session_id
where e.event_name = 'page_view'

Reduce the amount of data by:
● Ensure you filter on the partition/sort keys of the source
● Restrict table scans on all source tables if possible
35. www.snowplow.io
Process the least data possible

#page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select ...
)

select
    e.page_view_id,
    ...
from {{ ref('events') }} e
inner join sessions_with_new_events s
    on e.session_id = s.session_id
where e.event_name = 'page_view'
-- limit table scans
{% if is_incremental() %}
    and collector_tstamp > (
        select dateadd(day, -3, max(collector_tstamp))
        from {{ this }}
    )
{% endif %}

Reduce the amount of data by:
● Ensure you filter on the partition/sort keys of the source
● Restrict table scans on all source tables if possible
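The 3-day subtraction in the dateadd is a buffer rather than a hard watermark: presumably it lets late-arriving events, and the sessions they belong to, be picked up and recomputed on later runs, at the cost of rescanning a few extra days of the source table.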
36. www.snowplow.io
Process the least data possible

Incremental:

#page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and derived_tstamp > (
            select max(derived_tstamp) from {{ this }}
        )
    {% endif %}
)

Reduce the amount of data by:
● Ensure you filter on the partition/sort keys of the source
● Restrict table scans on all source tables if possible
● Understand your warehouse
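What "understand your warehouse" means in practice varies by target: the column worth filtering on is typically the sort key on Redshift, the partitioning column on BigQuery, and the clustering key on Snowflake, so the same incremental predicate may need to change per warehouse.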
37. Wrap up
Considered the Snowplow Tracking SDKs
- Why we hand craft them
- Differences in Server vs Client SDKs
How do we model billions of rows of atomic data?
- Incrementally!
- Aggregation brings many benefits for analysis
Configuration is painful for everyone
- Easy to get carried away with configurable trackers, but then Data Models need to support it too
Schema’d events make it easy to create type-safe objects for use with the Tracking SDKs
- Great for tracking engineers
Understand your full pipeline to extract the most
- How the data is tracked and processed impacts how you can model the data