At Stitch Fix, we hire Full Stack Data Scientists (145+) and expect them to perform diverse functions: from conception to modeling to implementation to measurement. Since Kafka is how we get event data, this inevitably means that a Data Scientist will need to write a Kafka consumer to complete their implementation work, for example to transform some client data into features, perform a model prediction, or allocate someone to an A/B test. In this talk I’ll go over how we built an opinionated Kafka client that makes it easy for Data Scientists to deploy and own production Kafka consumers, by letting them focus on writing Python functions rather than fighting Kafka’s pitfalls.
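The talk does not publish the client’s actual API; the following is a minimal sketch of the underlying idea using the kafka-python library, where the platform owns the consumer loop and the Data Scientist supplies only a plain Python function. The wrapper, topic, and handler names are all hypothetical.

```python
# A minimal sketch (not Stitch Fix's client): the platform owns the Kafka
# plumbing; the data scientist only writes `allocate_to_ab_test`.
import json
from kafka import KafkaConsumer  # pip install kafka-python


def run_handler(topic: str, group_id: str, handler) -> None:
    """Consume `topic` forever and call `handler` on each decoded event."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",  # assumed broker address
        group_id=group_id,
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        enable_auto_commit=True,
    )
    for message in consumer:      # blocks, polling the broker
        handler(message.value)    # the only code a data scientist writes


def allocate_to_ab_test(event: dict) -> None:
    """Example data-science logic: bucket a client into an experiment cell."""
    cell = hash(event["client_id"]) % 2
    print(f"client {event['client_id']} -> cell {cell}")


if __name__ == "__main__":
    run_handler("client-events", "ds-ab-allocation", allocate_to_ab_test)
```

Because the handler is just a function, it can be unit tested without a broker, which is the kind of ergonomics the abstract describes.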
[open source] hamilton, a micro framework for creating dataframes, and its ap... – Stefan Krawczyk
At Stitch Fix, we have 130+ “Full Stack Data Scientists” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand team, was in a bind. Their data generation process was causing them iteration & operational frustrations in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open source Python micro framework, that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler paradigm for Data Science & Data Engineering teams to create, maintain, and execute code for generating dataframes, especially when there are lots of inter-column dependencies. Hamilton does this by building a DAG of dependencies directly from Python functions defined in a special manner, which also makes unit testing and documentation easy; tune into the talk to find out how. I’ll also cover our experience migrating to it, our best practices in using it in production for over two years, along with planned extensions to make it a general purpose framework.
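For readers unfamiliar with the paradigm, here is a minimal sketch of Hamilton-style functions and a driver run. The column names are invented, and the exact Driver API may differ slightly between Hamilton versions.

```python
# Minimal Hamilton-style sketch: each function's name is the output column,
# and its parameter names declare the columns (or functions) it depends on.
import sys
import pandas as pd
from hamilton import driver  # pip install sf-hamilton


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing cost of acquiring one signup."""
    return spend / signups


def spend_zero_mean(spend: pd.Series) -> pd.Series:
    """Spend with the mean removed, e.g. for use as a model feature."""
    return spend - spend.mean()


if __name__ == "__main__":
    inputs = {
        "spend": pd.Series([10.0, 10.0, 20.0, 40.0]),
        "signups": pd.Series([1, 10, 50, 100]),
    }
    # The Driver builds the DAG from the functions in the given module(s).
    dr = driver.Driver(inputs, sys.modules[__name__])
    df = dr.execute(["spend_per_signup", "spend_zero_mean"])
    print(df)
```

Because each column is a plain function, it can be unit tested and documented (via its docstring) in isolation, which is what makes testing and documentation straightforward in this paradigm.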
Structured logs provide more context and are easier to analyze than traditional logs. This document discusses why one should use structured logs and how to implement structured logging in Python. Key points include:
- Structured logs add context like metadata, payloads and stack traces to log messages. This makes logs more searchable, reusable and easier to debug.
- Benefits of structured logs include easier developer onboarding, improved debugging and monitoring, and the ability to join logs from different systems.
- Python's logging module can be used to implement structured logging. This involves customizing the LogRecord and Formatter classes to output log messages as JSON strings (a minimal sketch follows this list).
- Considerations for structured logs include potential performance impacts from serialization.
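As a rough illustration of the Formatter-based approach described above (the field names are illustrative, not taken from the slides), a custom Formatter can render every record as one JSON line:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each LogRecord as a single JSON line with extra context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed via `extra=` lands on the record as attributes.
        for key in ("client_id", "event"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order placed", extra={"client_id": 42, "event": "order_placed"})
```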
Data Day Texas 2017: Scaling Data Science at Stitch Fix – Stefan Krawczyk
At Stitch Fix we have a lot of Data Scientists. Around eighty at last count. One reason why I think we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they’re end-to-end responsible for their work; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor and debug everything and anything required to get the desired output. They’re full data-stack data scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!
They’re also quite prolific at what they do -- we are approaching 4500 job definitions at last count. So one might wonder: how have we enabled them to get their jobs done without getting in each other’s way?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose the way they do their work; the onus then falls on the Data Platform team to convince Data Scientists to use their tools, and the easiest way to do that is by designing the tools well.
In regard to scaling Data Science, the Data Platform team has helped establish some patterns and infrastructure that help alleviate contention. Contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute (think prototype, iterate, workspace)
  - Production compute (think where things are executed once they’re needed regularly)
For the talk (and this post) I only focused on how we reduced contention on Access to Data and Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.
Video and slides synchronized, mp3 and slide download available at http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up and keep schemas in sync between Hive, Presto, Redshift & Spark, and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
In this talk, we provide an introduction to Python Luigi via real-life case studies showing you how you can break a large, multi-step data processing task into a graph of smaller sub-tasks that are aware of the state of their interdependencies.
Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients.
In the past, this data was collected in a somewhat haphazard fashion: a combination of manual effort and ad hoc scripting and processing that was difficult to maintain. In order to streamline the data flows, we’re using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline, and providing frameworks for common batch processing tasks.
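The talk’s actual pipelines aren’t reproduced here; the following is a generic sketch of how Luigi expresses a two-step dependency graph, with task names, file paths, and the "feature engineering" all invented for illustration.

```python
import luigi


class ExtractCompanies(luigi.Task):
    """Pull raw company records for a given day (stand-in for a real source)."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/companies-{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("company_id,filing\n1,abc\n")


class BuildFeatures(luigi.Task):
    """Turn raw records into a standardised feature set."""
    date = luigi.DateParameter()

    def requires(self):
        return ExtractCompanies(date=self.date)   # declares the dependency edge

    def output(self):
        return luigi.LocalTarget(f"data/features/features-{self.date}.csv")

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            out.write(raw.read())                  # stand-in for real feature logic


# Run with (module name is a stand-in):
#   python -m luigi --module my_pipeline BuildFeatures --date 2015-01-01 --local-scheduler
```

Each task declares what it requires and what it outputs, so Luigi can skip work whose output already exists and re-run only the missing pieces.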
The Mechanics of Testing Large Data Pipelines (QCon London 2016) – Mathieu Bastian
A talk about testing large data pipelines, mostly inspired by my experience at LinkedIn working on relevancy and recommender system pipelines.
Abstract: Applied machine learning data pipelines are being developed at a very fast pace and often exceed traditional web/business application codebases in terms of scale and complexity. The algorithms and processes these data workflows implement fulfill business-critical applications which require robust and scalable architectures. But how do we make these data pipelines robust? When the number of developers and data jobs grows, while at the same time the underlying data changes, how do we test that everything works as expected?
In software development we divide things into clean, independent modules and use unit and integration testing to prevent bugs and regressions. So why is it more complicated with big data workflows? Partly because these workflows usually pull data from dozens of sources out of our control and have a large number of interdependent data processing jobs. And partly because we don’t yet know how to do it, or lack the proper tools.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Machine Learning with Apache Flink at Stockholm Machine Learning Group – Till Rohrmann
This presentation covers Apache Flink's approach to scalable machine learning: composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library – Ilya Ganelin
In this talk I discuss my recent experience working with Spark DataFrames and the Spark time-series library. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like the intricacies of creating DataFrames, adding or manipulating individual columns, and doing quick and dirty analytics. For the time-series library, I dive into the kinds of use cases it supports and why it’s actually super useful.
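Not from the slides, but as a small generic example of the DataFrame operations the talk covers (creating a frame from local data, adding and renaming columns, quick aggregations), something like the following:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("frustration-reduced").getOrCreate()

# Creating a DataFrame from local data (one of the "intricacies" the talk covers).
df = spark.createDataFrame(
    [("a", 1, 10.0), ("b", 2, 20.0), ("a", 3, 30.0)],
    schema=["key", "count", "value"],
)

# Adding and manipulating individual columns.
df = df.withColumn("value_per_count", F.col("value") / F.col("count"))
df = df.withColumnRenamed("count", "n")

# Quick and dirty analytics.
df.groupBy("key").agg(
    F.sum("value").alias("total_value"),
    F.avg("value_per_count").alias("avg_rate"),
).show()
```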
Introducing Arc: A Common Intermediate Language for Unified Batch and Stream... – Flink Forward
Today's end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensor transformations, and graphs. For each of these workload types there exist several frontends (e.g., DataFrames/SQL, Beam, Keras) based on different programming languages, as well as different runtimes (e.g., Spark, Flink, Tensorflow) that target a particular frontend and possibly a hardware architecture (e.g., GPUs). Putting all the pieces of a data pipeline together simply leads to excessive data materialisation, type conversions and hardware utilisation, as well as mismatches of processing guarantees.
Our research group at RISE and KTH in Sweden has created Arc, an intermediate language that bridges the gap between any frontend and a dataflow runtime (e.g., Flink) through a set of fundamental building blocks for expressing data pipelines. Arc incorporates Flink- and Beam-inspired stream semantics such as windows, state and out-of-order processing, as well as concepts found in batch computation models. With Arc, we can cross-compile and optimise diverse tasks written in any programming language into a unified dataflow program. Arc programs can run efficiently on various hardware backends, as well as allowing seamless, distributed execution on dataflow runtimes. To that end, we showcase Arcon, a concept runtime built in Rust that can execute Arc programs natively, and present a minimal set of extensions to make Flink an Arc-ready runtime.
Slides for my talk at Hadoop Summit Dublin, April 2016.
The talk motivates how streaming can subsume batch use cases, using the example of continuous counting.
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL – Flink Forward
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
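Samsara itself is a Scala DSL, so its code isn’t shown here; as a plain-Python stand-in for the normal-equations step the summary mentions, the same computation looks like this in NumPy (the data is made up):

```python
import numpy as np

# Made-up feature matrix X (with an intercept column) and target y.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Normal equations: solve (X^T X) beta = X^T y for the regression coefficients.
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)  # close to [2.0, 0.5, -1.0, 3.0]
```

In the distributed setting Samsara targets, X^T X and X^T y are the pieces computed across the cluster; the small resulting linear system is then solved locally.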
More Data, More Problems: Evolving big data machine learning pipelines with S... – Alex Sadovsky
These are the slides from the Denver/Boulder Spark meet-up on February 24th, 2016. (deck build animations are all broken here... sorry!)
This talk provides an evaluation of existing machine learning pipelines in the eyes of different key stakeholders in the data science ecosystem. Focus is placed upon the entire process from data to product (and keeping everyone in between happy). Ultimately I explore how to utilize Spotify’s Luigi pipeline tool in combination with Spark to produce batch-processing machine learning pipelines that have operational insights and redundancy built in.
William Vambenepe – Google Cloud Dataflow and Flink, Stream Processing by De... – Flink Forward
1. Google Cloud Dataflow is a fully managed service that allows users to define data processing pipelines that can run batch or streaming computations.
2. The Dataflow programming model defines pipelines as directed graphs of transformations on collections of data elements. This provides flexibility in how computations are defined across batch and streaming workloads.
3. The Dataflow service handles graph optimization, scaling of workers, and monitoring of jobs to efficiently execute user-defined pipelines on Google Cloud Platform.
Distributed Stream Processing - Spark Summit East 2017 – Petr Zapletal
The document discusses distributed stream processing frameworks. It provides an overview of frameworks like Storm, Spark Streaming, Samza, Flink, and Kafka Streams. It compares aspects of different frameworks like programming models, delivery guarantees, fault tolerance, and state management. General guidelines are given for choosing a framework based on needs like latency requirements and state needs. Storm and Trident are recommended for low latency tasks while Spark Streaming and Flink are more full-featured but have higher latency. The document provides code examples for word count in different frameworks.
Ufuc Celebi – Stream & Batch Processing in one System – Flink Forward
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin... – ucelebi
Ufuk Celebi presented on the architecture and execution of Apache Flink's streaming data flow engine. Flink allows for both stream and batch processing using a common runtime. It translates APIs into a directed acyclic graph (DAG) called a JobGraph. The JobGraph is distributed across TaskManagers which execute parallel tasks. Communication between components like the JobManager and TaskManagers uses an actor system to coordinate scheduling, checkpointing, and monitoring of distributed streaming data flows.
Flink 0.10 @ Bay Area Meetup (October 2015) – Stephan Ewen
Flink 0.10 focuses on operational readiness with improvements to high availability, monitoring, and integration with other systems. It provides first-class support for event time processing and refines the DataStream API to be both easy to use and powerful for stream processing tasks.
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15 – Vasia Kalavri
Apache Flink is a general-purpose platform for batch and streaming distributed data processing. This talk describes how Flink’s powerful APIs, iterative operators and other unique features make it a competitive alternative for large-scale graph processing as well. We take a close look at how one can elegantly express graph analysis tasks, using common Flink operators and how different graph processing models, like vertex-centric, can be easily mapped to Flink dataflows. Next, we get a sneak preview into Flink's upcoming Graph API, Gelly, which further simplifies graph application development in Flink. Finally, we show how to perform end-to-end data analysis, mixing common Flink operators and Gelly, without having to build complex pipelines and combine different systems. We go through a step-by-step example, demonstrating how to perform loading, transformation, filtering, graph creation and analysis, with a single Flink program.
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018 – Seunghyun Lee
Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Spark Streaming allows processing of live data streams using Spark. It works by receiving data streams, chopping them into batches, and processing the batches using Spark. This presentation covered Spark Streaming concepts like the lifecycle of a streaming application, best practices for aggregations, operationalization through checkpointing, and achieving high throughput. It also discussed debugging streaming jobs and the benefits of combining streaming with batch, machine learning, and SQL processing.
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
RDBMSs have their own data modeling methodology and diagrams. What about Cassandra? Let's discover the key principles of Cassandra data modeling with the Chebotko methodology. Have a look at KDM, a Chebotko modeling tool. And finally, let's talk about the time dimension in Cassandra.
This presentation was made for the Lyon Cassandra Users meetup (France).
Leveraging the Power of Solr with Spark – QAware GmbH
Lucene Revolution 2016, Boston: Talk by Johannes Weigend (@JohannesWeigend, CTO at QAware).
Abstract: Solr is a distributed NoSQL database with impressive search capabilities. Spark is the new megastar in the distributed computing universe. In this code-intense session we show you how to combine both to solve real-time search and processing problems. We will show you how to set up a Solr/Spark combination from scratch and develop first jobs that run distributed over shared Solr data. We will also show you how to use this combination for your next-generation BI platform.
Kafka 102: Streams and Tables All the Way Down | Kafka Summit San Francisco 2019 – Michael Noll
Talk URL: https://kafka-summit.org/sessions/kafka-102-streams-tables-way/
Video recording: https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-102-streams-and-tables-all-the-way-down
Abstract: Streams and Tables are the foundation of event streaming with Kafka, and they power nearly every conceivable use case, from payment processing to change data capture, from streaming ETL to real-time alerting for connected cars, and even the lowly WordCount example. Tables are something that most of us are familiar with from the world of databases, whereas Streams are a rather new concept. Trying to leverage Kafka without understanding tables and streams is like building a rocket ship without understanding the laws of physics: a mission bound to fail. In this session for developers, operators, and architects alike, we take a deep dive into these two fundamental primitives of Kafka’s data model. We discuss how streams and tables, including global tables, relate to each other and to topics, partitioning, compaction, serialization (Kafka’s storage layer), and how they interplay to process data, react to data changes, and manage state in an elastic, scalable, fault-tolerant manner (Kafka’s compute layer). Developers will understand better how to use streams and tables to build event-driven applications with Kafka Streams and KSQL, and we answer questions such as “How can I query my tables?” and “What is data co-partitioning, and how does it affect my join?”. Operators will better understand how these applications will run in production, with questions such as “How do I scale my application?” and “When my application crashes, how will it recover its state?”. At a higher level, we will explore how Kafka uses streams and tables to turn the database inside out and put it back together.
This document provides an overview of how to contribute to the cPython source code. It discusses running benchmarks to understand performance differences between loops inside and outside functions. It encourages contributing to improve coding skills and help the open source community. The steps outlined are to clone the cPython source code repository, resolve any dependencies during building, review open issues on bugs.python.org, and work on resolving issues - starting with easier ones. Tips are provided such as commenting when taking ownership of an issue, reproducing bugs before working on them, writing tests for code changes, and updating documentation.
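The summary mentions benchmarking loops inside versus outside functions; a minimal way to observe that effect yourself (not the deck’s exact benchmark) is to time the same loop at module level, where the names are global dict lookups, and inside a function, where they are fast local slots:

```python
import time

N = 10_000_000

# Loop at module level: every access to `total` and `i` is a global (dict) lookup.
start = time.perf_counter()
total = 0
for i in range(N):
    total += i
module_time = time.perf_counter() - start


def loop_in_function(n: int) -> float:
    """Same loop, but `total` and `i` are locals (fast array slots)."""
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i
    return time.perf_counter() - start


print(f"module level : {module_time:.3f}s")
print(f"in a function: {loop_in_function(N):.3f}s")
```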
Updating materialized views and caches using Kafka – Zach Cox
The document discusses using Apache Kafka to publish data changes and updates to materialized views and caches. It describes building a user information service that currently queries normalized data from multiple tables in a database. This has issues like slow response times and stale data. The document proposes using Kafka streams to read from topics containing normalized data changes. This allows building and storing materialized views locally using RocksDB for low latency access without stale data.
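The talk’s implementation uses Kafka Streams and RocksDB on the JVM; as a rough Python sketch of the same idea, with topic names and message shapes assumed (and a plain dict standing in for RocksDB), a consumer can fold change events into a local view:

```python
# Sketch only: maintain a denormalized user view from normalized change topics.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "users", "user-emails",                      # assumed change topics
    bootstrap_servers="localhost:9092",
    key_deserializer=lambda k: k.decode("utf-8"),   # assumes keyed messages
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

user_view: dict[str, dict] = {}                  # user_id -> denormalized record

for message in consumer:
    record = user_view.setdefault(message.key, {"user_id": message.key})
    if message.topic == "users":
        record["name"] = message.value["name"]
    else:
        record.setdefault("emails", []).append(message.value["email"])
    # `user_view` can now serve low-latency reads without querying the database.
```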
Apache Kafka, and the Rise of Stream Processing – Guozhang Wang
For a long time, a substantial portion of the data processing that companies did ran as big batch jobs. But businesses operate in real time, and the software they run is catching up. Today, processing data in a streaming fashion is becoming more and more popular in many companies, compared with the more "traditional" way of batch-processing big data sets available as a whole.
Hagen Toennies from Gaikai Inc. presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"In this talk we will present how we enable distributed, Unix style programming using Docker and Apache Kafka. We will show how we can take the famous Unix Pipe Pattern and apply it to a Distributed Computing System. We will demonstrate the development of two simple applications with the focus on "Do One Thing and Do It Well." Afterwards we demonstrate how we make these two programs work to together using Apache Kafka. By encapsulating our applications in containers we will also show how that enables us to go from the limited resources of a development machine to cluster of computers in a data center without changing our applications or containers."
Watch the video: http://wp.me/p3RLHQ-goG
Learn more: http://www.hpcadvisorycouncil.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
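The original demo uses containerized programs; as a rough sketch of the pipe idea in Python with kafka-python (topic and broker names assumed), each stage does one thing and hands its output to the next over a topic, much like `grep | wc` over a Unix pipe:

```python
# Sketch: two small "do one thing well" programs connected by Kafka topics.
import sys
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

BROKER = "localhost:9092"  # assumed broker address


def stage_one_tokenize() -> None:
    """Read raw lines from one topic, emit one word per message to the next."""
    consumer = KafkaConsumer("raw-lines", bootstrap_servers=BROKER,
                             value_deserializer=lambda m: m.decode("utf-8"))
    producer = KafkaProducer(bootstrap_servers=BROKER,
                             value_serializer=lambda v: v.encode("utf-8"))
    for message in consumer:
        for word in message.value.split():
            producer.send("words", word)


def stage_two_count() -> None:
    """Read words and keep a running count: the `wc` of the pipeline."""
    consumer = KafkaConsumer("words", bootstrap_servers=BROKER,
                             value_deserializer=lambda m: m.decode("utf-8"))
    counts: dict[str, int] = {}
    for message in consumer:
        counts[message.value] = counts.get(message.value, 0) + 1
        print(counts[message.value], message.value)


if __name__ == "__main__":
    # e.g. `python pipe.py tokenize` in one container, `python pipe.py count` in another
    stage_one_tokenize() if sys.argv[1] == "tokenize" else stage_two_count()
```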
This document provides an overview of Weather.com's analytics architecture using Apache Cassandra and Spark. It summarizes Weather.com's initial attempts using Cassandra, lessons learned, and its improved architecture. The improved architecture uses Cassandra for streaming event data with time-window compaction, stores all other data in Amazon S3 for batch processing in Spark, and replaces Kafka with Amazon SQS for event ingestion. It discusses best practices for data modeling in Cassandra including partitioning, secondary indexes, and avoiding wide rows and nulls. The document also highlights how Weather.com uses Apache Zeppelin notebooks for data exploration and visualization.
Slides that were presented during the webrtc Qt Cmake tutorial at IIT-RTC in October 2017 in Chicago. The slides are not yet complete, and will be updated later.
Making your Life Easier with MongoDB and Kafka (Robert Walters, MongoDB) Kafk... – HostedbyConfluent
Kafka Connect makes it possible to easily integrate data sources like MongoDB! In this session we will first explore how MongoDB enables developers to rapidly innovate through the use of the document model. We will then put the document model to life and showcase how to integrate MongoDB and Kafka through the use of the MongoDB Connector with Apache Kafka. Finally, we will explore the different ways of using the connector including the new Confluent Cloud integration.
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D... – HostedbyConfluent
"Kafka Connect, the framework for building scalable and reliable data pipelines, has gained immense popularity in the data engineering landscape. This session will provide a comprehensive guide to creating Kafka connectors using Kotlin, a language known for its conciseness and expressiveness.
In this session, we will explore a step-by-step approach to crafting Kafka connectors with Kotlin, from inception to deployment, using a simple use case. The process includes the following key aspects:
Understanding Kafka Connect: We'll start with an overview of Kafka Connect and its architecture, emphasizing its importance in real-time data integration and streaming.
Connector Design: Delve into the design principles that govern connector creation. Learn how to choose between source and sink connectors and identify the data format that suits your use case.
Building a Source Connector: We'll start with building a Kafka source connector, exploring key considerations, such as data transformations, serialization, deserialization, error handling and delivery guarantees. You will see how Kotlin's concise syntax and type safety can simplify the implementation.
Testing: Learn how to rigorously test your connector to ensure its reliability and robustness, utilizing best practices for testing in Kotlin.
Connector Deployment: Go through the process of deploying your connector in a Kafka Connect cluster, and discuss strategies for monitoring and scaling.
Real-World Use Cases: Explore real-world examples of Kafka connectors built with Kotlin.
By the end of this session, you will have a solid foundation for creating and deploying Kafka connectors using Kotlin, equipped with practical knowledge and insights to make your data integration processes more efficient and reliable. Whether you are a seasoned developer or new to Kafka Connect, this guide will help you harness the power of Kafka and Kotlin for seamless data flow in your applications."
The document discusses how a company called HBC evolved their architecture from a monolithic application to a microservices architecture with streams. It describes how they introduced Kafka and Kafka Streams to share data between microservices in real-time, avoid common antipatterns, simplify development, and improve resilience and performance. The talk outlines how HBC uses Kafka Streams within their microservices to process streaming data, perform aggregations and joins, enable interactive queries, and power their search functionality.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... – Guido Schmutz
Spark Streaming and Kafka Streams are two popular stream processing platforms. Spark Streaming uses micro-batching and allows for code reuse between batch and streaming jobs. Kafka Streams is embedded directly into Apache Kafka and leverages Kafka as its internal messaging layer. Both platforms support stateful stream processing operations like windowing, aggregations, and joins through distributed state stores. A demo application is shown that detects dangerous driving by joining truck position data with driver data using different streaming techniques.
The document describes a presentation about data processing with CDK (Cloud Development Kit). It includes an agenda that covers CDK and Projen, serverless ETL with Glue, Databrew with continuous integration/continuous delivery (CICD), and using Amazon Comprehend with S3 object lambdas. Constructs are demonstrated for building architectures with CDK across multiple programming languages. Examples are provided of using CDK to implement Glue workflows, Databrew CICD pipelines, and combining Comprehend with S3 object lambdas for PII detection and redaction.
Kafka Connect & Streams - the ecosystem around KafkaGuido Schmutz
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it into a Kafka topic. On the other hand, Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only reliably and scalably transport events and messages through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a “normal” Java application, inside a web container or on a more modern containerized (cloud) infrastructure, such as Mesos, Kubernetes or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia – GoDataDriven
Matei Zaharia is an assistant professor of computer science at Stanford University, Chief Technologist and Co-founder of Databricks. He started the Spark project at UC Berkeley and continues to serve as its vice president at Apache. Matei also co-started the Apache Mesos project and is a committer on Apache Hadoop. Matei’s research work on datacenter systems was recognized through two Best Paper awards and the 2014 ACM Doctoral Dissertation Award.
Apache Pulsar Development 101 with Python – Timothy Spann
Apache Pulsar Development 101 with Python PS2022_Ecosystem_v0.0
There is always the fear that a speaker cannot make it. Since I was the MC for the ecosystem track, I put together this talk just in case.
Here it is. Never seen or presented.
Stream Processing using Apache Spark and Apache Kafka – Abhinav Singh
This document provides an agenda for a session on Apache Spark Streaming and Kafka integration. It includes an introduction to Spark Streaming, working with DStreams and RDDs, an example of word count streaming, and steps for integrating Spark Streaming with Kafka including creating topics and producers. The session will also include a hands-on demo of streaming word count from Kafka using CloudxLab.
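The session’s demo uses the classic DStream API on CloudxLab; a comparable word count with the newer Structured Streaming Kafka source looks roughly like this (the broker address, topic name, and the spark-sql-kafka package setup are assumptions, not from the session):

```python
from pyspark.sql import SparkSession, functions as F

# Requires the spark-sql-kafka package on the classpath, e.g. via --packages.
spark = SparkSession.builder.appName("kafka-wordcount").getOrCreate()

# Read the Kafka topic as an unbounded DataFrame.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
         .option("subscribe", "sentences")                      # assumed topic
         .load()
         .selectExpr("CAST(value AS STRING) AS line"))

# Split lines into words and count occurrences of each word.
words = lines.select(F.explode(F.split(F.col("line"), r"\s+")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # re-emit the full count table each trigger
         .format("console")
         .start())
query.awaitTermination()
```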
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming – Yaroslav Tkachenko
The Activision Data team has been running a data pipeline for a variety of Activision games for many years. Historically we used a mix of micro-batch microservices coupled with classic Big Data tools like Hadoop and Hive for ETL. As a result, it could take up to 4-6 hours for data to be available to the end customers.
In the last few years, the adoption of data in the organization skyrocketed. We needed to de-legacy our data pipeline and provide near-realtime access to data in order to improve reporting, gather insights faster, power web and mobile applications. I want to tell a story about heavily leveraging Kafka Streams and Kafka Connect to reduce the end latency to minutes, at the same time making the pipeline easier and cheaper to run. We were able to successfully validate the new data pipeline by launching two massive games just 4 weeks apart.
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin... – HostedbyConfluent
This document summarizes Activision Data's transition from a batch data pipeline to a real-time streaming data pipeline using Apache Kafka and Kafka Streams. Some key points:
- The new pipeline ingests, processes, and stores game telemetry data from over 200k messages per second and over 5PB of data across 9 years of games.
- Kafka Streams is used to transform the raw streaming data through multiple microservices with low 10-second end-to-end latency, compared to 6-24 hours previously.
- Kafka Connect integrates the streaming data with data stores like AWS S3, Cassandra, and Elasticsearch.
- The new pipeline provides real-time and historical access to structured
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020 – confluent
Apache Kafka sits at the center of a technology ecosystem that can be a bit overwhelming to someone just getting started. Fortunately, Apache Kafka is also at the heart of an amazing community that is able and eager to help! So, if you are new, or relatively new, to Apache Kafka, welcome! I’d like to introduce you to the Kafka ecosystem, and present you with a plan for how to learn and be productive with it. I’d also like to introduce you to one of the most helpful and welcoming software communities I’ve ever encountered.
I’ll take you through the basics of Kafka—the brokers, the partitions, the topics—and then on and up into the different APIs and tools that are available to work with it. Consider it a Kafka 101, if you will. We’ll stay at a high level, but we’ll cover a lot of ground, with an emphasis on where and how you can dig in deeper.
I am still learning myself, so I will share with you what and who have helped me in my journey, and then I’ll invite you to continue that journey with me. It’s going to be a great adventure!
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Mario Molina, Datio, Software Engineer
Kafka Streams is an open source JVM library for building event streaming applications on top of Apache Kafka. Its goal is to allow programmers to create efficient, real-time, streaming applications and perform analysis and operations on the incoming data.
In this presentation we’ll cover the main features of Kafka Streams and do a live demo!
This demo will be partially on Confluent Cloud, if you haven’t already signed up, you can try Confluent Cloud for free. Get $200 every month for your first three months ($600 free usage in total) get more information and claim it here: https://cnfl.io/cloud-meetup-free
https://www.meetup.com/Mexico-Kafka/events/271972045/
Enabling Data Scientists to easily create and own Kafka Consumers
1. Enabling Data Scientists to easily
create and own Kafka Consumers
Stefan Krawczyk
Mgr. Data Platform - Model Lifecycle
@stefkrawczyk
linkedin.com/in/skrawczyk
Try out Stitch Fix → goo.gl/Q3tCQ3
2. 2
- What is Stitch Fix?
- Data Science @ Stitch Fix
- Stitch Fix’s opinionated Kafka consumer
- Learnings & Future Directions
Agenda
4. Kafka Summit 2021 4
Stitch Fix is a personal styling service.
Shop at your personally curated store. Check out what you like.
5. Kafka Summit 2021 5
Data Science is behind everything we do.
algorithms-tour.stitchfix.com
Algorithms Org.
- 145+ Data Scientists and Platform Engineers
- 3 main verticals + platform
Data Platform
6. Kafka Summit 2021 6
whoami
Stefan Krawczyk
Mgr. Data Platform - Model Lifecycle
Pre-covid look
8. Kafka Summit 2021 8
Most common approach to Data Science
Typical organization:
● Horizontal teams
● Hand off
● Coordination required
DATA SCIENCE /
RESEARCH TEAMS
ETL TEAMS
ENGINEERING TEAMS
9. Kafka Summit 2021 9
At Stitch Fix:
● Single Organization
● No handoff
● End to end ownership
● We have a lot of them!
● Built on top of data
platform tools &
abstractions.
Full Stack Data Science
See https://cultivating-algos.stitchfix.com/
DATA SCIENCE
ETL
ENGINEERING
10. Kafka Summit 2021 10
Full Stack Data Science
A typical DS flow at Stitch Fix
Typical flow:
● Idea / Prototype
● ETL
● “Production”
● Eval/Monitoring/Oncall
● Start on next iteration
11. Kafka Summit 2021 11
Full Stack Data Science
A typical DS flow at Stitch Fix
Production can mean:
● Web service
● Batch job / Table
● Kafka consumer
Heavily biased towards Python.
12. Kafka Summit 2021 12
Example use cases DS have built kafka consumers for
Example Kafka Consumers
● A/B testing bucket allocation
● Transforming raw inputs into features
● Saving data into feature stores
● Event driven model prediction
● Triggering workflows
15. Kafka Summit 2021 15
To run this:
> pip install sf_kafka
> python -m sf_kafka.server hello_world_consumer
A simple example
Hello world consumer
hello_world_consumer.py
from typing import List
import json

import sf_kafka
@sf_kafka.register(kafka_topic='some.topic', output_schema={})
def hello_world(messages: List[str]) -> dict:
"""Hello world example
:param messages: list of strings, which are JSON objects.
:return: empty dict, as we don't need to emit any events.
"""
list_of_dicts = [json.loads(m) for m in messages]
print(f'Hello world I have consumed the following {list_of_dicts}')
return {}
16. Kafka Summit 2021 16
import sf_kafka
@sf_kafka.register(kafka_topic='some.topic', output_schema={})
def hello_world(messages: List[str]) -> dict:
"""Hello world example
:param messages: list of strings, which are JSON objects.
:return: empty dict, as we don't need to emit any events.
"""
list_of_dicts = [json.loads(m) for m in messages]
print(f'Hello world I have consumed the following {list_of_dicts}')
return {}
A simple example
Hello world consumer
hello_world_consumer.py
So what is this doing?
To run this:
> pip install sf_kafka
> python -m sf_kafka.server hello_world_consumer
17. Kafka Summit 2021 17
import sf_kafka
@sf_kafka.register(kafka_topic='some.topic', output_schema={})
def hello_world(messages: List[str]) -> dict:
"""Hello world example
:param messages: list of strings, which are JSON objects.
:return: empty dict, as we don't need to emit any events.
"""
list_of_dicts = [json.loads(m) for m in messages]
print(f'Hello world I have consumed the following {list_of_dicts}')
return {}
1. Python function that takes in a list of strings called messages.
A simple example
Hello world consumer
hello_world_consumer.py
18. Kafka Summit 2021 18
import sf_kafka
@sf_kafka.register(kafka_topic='some.topic', output_schema={})
def hello_world(messages: List[str]) -> dict:
"""Hello world example
:param messages: list of strings, which are JSON objects.
:return: empty dict, as we don't need to emit any events.
"""
list_of_dicts = [json.loads(m) for m in messages]
print(f'Hello world I have consumed the following {list_of_dicts}')
return {}
1. Python function that takes in a list of strings called messages.
2. We’re processing the messages into dictionaries.
A simple example
Hello world consumer
hello_world_consumer.py
19. Kafka Summit 2021 19
import sf_kafka
@sf_kafka.register(kafka_topic='some.topic', output_schema={})
def hello_world(messages: List[str]) -> dict:
"""Hello world example
:param messages: list of strings, which are JSON objects.
:return: empty dict, as we don't need to emit any events.
"""
list_of_dicts = [json.loads(m) for m in messages]
print(f'Hello world I have consumed the following {list_of_dicts}')
return {}
1. Python function that takes in a list of strings called messages.
2. We’re processing the messages into dictionaries.
3. Printing them to console. (DS would replace this with a call to their function())
A simple example
Hello world consumer
hello_world_consumer.py
20. Kafka Summit 2021 20
import sf_kafka
@sf_kafka.register(kafka_topic='some.topic', output_schema={})
def hello_world(messages: List[str]) -> dict:
"""Hello world example
:param messages: list of strings, which are JSON objects.
:return: empty dict, as we don't need to emit any events.
"""
list_of_dicts = [json.loads(m) for m in messages]
print(f'Hello world I have consumed the following {list_of_dicts}')
return {}
1. Python function that takes in a list of strings called messages.
2. We’re processing the messages into dictionaries.
3. Printing them to console. (DS would replace this with a call to their function())
4. We are registering this function to consume from ‘some.topic’ with no output.
A simple example
Hello world consumer
hello_world_consumer.py
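The decorator plus the sf_kafka.server entry point is the whole integration surface, but sf_kafka itself is internal to Stitch Fix. The snippet below is therefore only a hedged sketch of how such a registration decorator could work: a module-level registry that a server process later reads to wire functions to topics. The registry name, structure, and runner behaviour are assumptions for illustration, not the actual library.

# Hedged sketch only -- not the real sf_kafka internals.
from typing import Callable, Dict, List

_REGISTRY: Dict[str, dict] = {}  # hypothetical module-level registry

def register(kafka_topic: str, output_schema: dict) -> Callable:
    """Record the decorated function together with its topic and output schema."""
    def decorator(func: Callable[[List[str]], dict]) -> Callable:
        _REGISTRY[func.__name__] = {
            'function': func,
            'kafka_topic': kafka_topic,
            'output_schema': output_schema,
        }
        return func  # the user's function is returned unchanged
    return decorator

# A server process (e.g. `python -m sf_kafka.server hello_world_consumer`) could
# then import the user's module, read _REGISTRY, subscribe to each kafka_topic,
# and call each registered function with batches of raw message strings.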
29. Kafka Summit 2021 29
Platform Concerns DS Concerns
● Kafka consumer operation:
○ What python kafka client to use
○ Kafka client configuration
○ Processing assumptions
■ At least once or at most once
○ How to write back to kafka
■ Direct to cluster or via a proxy?
■ Message serialization format
What does each side own
Platform Concerns vs DS Concerns
30. Kafka Summit 2021 30
Platform Concerns DS Concerns
● Kafka consumer operation:
○ What python kafka client to use
○ Kafka client configuration
○ Processing assumptions
■ At least once or at most once
○ How to write back to kafka
■ Direct to cluster or via a proxy?
■ Message serialization format
● Production operations:
○ Topic partitioning
○ Deployment vehicle for consumers
○ Monitoring hooks & tools
What does each side own
Platform Concerns vs DS Concerns
31. Kafka Summit 2021 31
Platform Concerns DS Concerns
● Kafka consumer operation:
○ What python kafka client to use
○ Kafka client configuration
○ Processing assumptions
■ At least once or at most once
○ How to write back to kafka
■ Direct to cluster or via a proxy?
■ Message serialization format
● Production operations:
○ Topic partitioning
○ Deployment vehicle for consumers
○ Monitoring hooks & tools
● Configuration:
○ App name [required]
○ Which topic(s) to consume from [required]
○ Process from beginning/end? [optional]
○ Processing “batch” size [optional]
○ Number of consumers [optional]
● Python function that operates over a list
● Output topic & message [if any]
● Oncall
What does each side own
Platform Concerns vs DS Concerns
32. Kafka Summit 2021 32
Platform Concerns DS Concerns
● Kafka consumer operation:
○ What python kafka client to use
○ Kafka client configuration
○ Processing assumptions
■ At least once or at most once
○ How to write back to kafka
■ Direct to cluster or via a proxy?
■ Message serialization format
● Production operations:
○ Topic partitioning
○ Deployment vehicle for consumers
○ Monitoring hooks & tools
● Configuration:
○ App name [required]
○ Which topic(s) to consume from [required]
○ Process from beginning/end? [optional]
○ Processing “batch” size [optional]
○ Number of consumers [optional]
● Python function that operates over a list
● Output topic & message [if any]
● Oncall
What does each side own
Platform Concerns vs DS Concerns
Can change without DS involvement -- just need to rebuild their app.
33. Kafka Summit 2021 33
Platform Concerns DS Concerns
● Kafka consumer operation:
○ What python kafka client to use
○ Kafka client configuration
○ Processing assumptions
■ At least once or at most once
○ How to write back to kafka
■ Direct to cluster or via a proxy?
■ Message serialization format
● Production operations:
○ Topic partitioning
○ Deployment vehicle for consumers
○ Topic monitoring hooks & tools
● Configuration:
○ App name [required]
○ Which topic(s) to consume from [required]
○ Process from beginning/end? [optional]
○ Processing “batch” size [optional]
○ Number of consumers [optional]
● Python function that operates over a list
● Output topic & message [if any]
● Oncall
What does each side own
Platform Concerns vs DS Concerns
Requires coordination with DS
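The slides list which knobs the DS owns but not how they are expressed; as a purely hypothetical illustration (field names are assumed, not the actual sf_kafka format), the DS-owned configuration could be as small as:

# Hypothetical shape of the DS-owned configuration; keys are assumptions.
consumer_config = {
    'app_name': 'client-signup-predictor',   # required
    'kafka_topics': ['client.signed_up'],    # required
    'start_from': 'end',                     # optional: 'beginning' or 'end'
    'batch_size': 100,                       # optional: messages per function call
    'num_consumers': 2,                      # optional: parallel consumer instances
}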
34. Kafka Summit 2021 34
Platform Concern Choice Benefit
Kafka Client
Processing assumption
Salient choices we made on Platform
35. Kafka Summit 2021 35
Platform Concern Choice Benefit
Kafka Client python confluent-kafka
(librdkafka)
librdkafka is very performant & stable.
Processing assumption
Salient choices we made on Platform
36. Kafka Summit 2021 36
Platform Concern Choice Benefit
Kafka Client python confluent-kafka
(librdkafka)
librdkafka is very performant & stable.
Processing assumption At least once; functions
should be idempotent.
Enables very easy error recovery strategy:
● Consumer app breaks until it is fixed; can usually
wait until business hours.
● No loss of events.
● Monitoring trigger is consumer lag.
Salient choices we made on Platform
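To make the at-least-once choice concrete, here is a minimal sketch of the pattern using confluent-kafka directly: auto-commit is disabled and offsets are committed only after the (idempotent) function succeeds, so a crash simply replays the batch and surfaces as consumer lag. This is not Stitch Fix's engine code; the broker address, group id, and topic are placeholders.

# Minimal at-least-once loop with confluent-kafka (librdkafka); placeholders throughout.
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',   # placeholder
    'group.id': 'hello_world_consumer',      # placeholder
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,             # manual commit => at least once
})
consumer.subscribe(['some.topic'])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())          # fail loudly; lag-based monitoring fires
        hello_world([msg.value().decode('utf-8')])   # the registered, idempotent function
        consumer.commit(asynchronous=False)          # offsets advance only after success
finally:
    consumer.close()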
37. Kafka Summit 2021 37
Platform Concern Choice Benefit
Message serialization
format
Do we want to write back
to kafka directly?
Salient choices we made on Platform
38. Kafka Summit 2021 38
Platform Concern Choice Benefit
Message serialization
format
JSON Easy mapping to and from python dictionaries.
Easy to grok for DS.
* python support for other formats wasn’t great.
Do we want to write back
to kafka directly?
Salient choices we made on Platform
39. Kafka Summit 2021 39
Platform Concern Choice Benefit
Message serialization
format
JSON Easy mapping to and from python dictionaries.
Easy to grok for DS.
* python support for other formats wasn’t great.
Do we want to write back
to kafka directly?
Write via proxy service first. Enabled:
● Not having producer code in the engine.
● Ability to validate/introspect all messages.
● Ability to augment/change minor format
structure without having to redeploy all
consumers.
Salient choices we made on Platform
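What "write back via a proxy service" looks like in practice isn't shown in the deck, so the following is only an assumed illustration: output events are JSON-serialized and POSTed to an internal HTTP endpoint that validates and produces to Kafka on the caller's behalf. The endpoint URL and payload contract here are invented for the sketch.

# Illustration only -- the proxy URL and payload shape are assumptions.
import requests

PROXY_URL = 'http://event-proxy.internal/publish'   # hypothetical endpoint

def emit_via_proxy(topic: str, events: list) -> None:
    """POST JSON output events to a write-proxy instead of producing to Kafka directly."""
    for event in events:
        resp = requests.post(PROXY_URL, json={'topic': topic, 'message': event})
        resp.raise_for_status()   # the proxy can validate/augment before producing to Kafka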
41. Kafka Summit 2021 41
Completing the production story
Consumer Code ✅
Architecture ✅
Mechanics ✅
What’s missing? ⍰
42. Kafka Summit 2021 42
Completing the production story
Consumer Code ✅
Architecture ✅
Mechanics ✅
What’s missing? ⍰
Self-service
^
43. Kafka Summit 2021 43
The self-service story of how a DS gets a consumer to production
Completing the production story
Example Use Case: Event Driven Model Prediction
1. Client signs up & fills out profile.
2. Event is sent - client.signed_up.
3. Predict something about the client.
4. Emit predictions back to kafka.
5. Use this for email campaigns.
-> $$
44. Kafka Summit 2021 44
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to
consume.
45. Kafka Summit 2021 45
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code:
a. Create a function & decorate it to
process events
b. If outputting an event, write a
schema
c. Commit code to a git repository
Event
46. Kafka Summit 2021 46
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code:
a. Create a function & decorate it to
process events
b. If outputting an event, write a
schema
c. Commit code to a git repository
Event
def create_prediction(client: dict) -> dict:
# DS would write side effects or fetches here.
# E.g. grab features, predict,
# create output message;
prediction = ...
return make_output_event(client, prediction)
@sf_kafka.register(
kafka_topic='client.signed_up',
output_schema={'predict.topic': schema})
def predict_foo(messages: List[str]) -> dict:
"""Predict XX about a client. ..."""
clients = [json.loads(m) for m in messages]
predictions = [create_prediction(c)
for c in clients]
return {'predict.topic': predictions}
my_prediction.py
47. Kafka Summit 2021 47
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code:
a. Create a function & decorate it to
process events
b. If outputting an event, write a
schema
c. Commit code to a git repository
Event
def create_prediction(client: dict) -> dict:
# DS would write side effects or fetches here.
# E.g. grab features, predict,
# create output message;
prediction = ...
return make_output_event(client, prediction)
@sf_kafka.register(
kafka_topic='client.signed_up',
output_schema={'predict.topic': schema})
def predict_foo(messages: List[str]) -> dict:
"""Predict XX about a client. ..."""
clients = [json.loads(m) for m in messages]
predictions = [create_prediction(c)
for c in clients]
return {'predict.topic': predictions}
my_prediction.py
48. Kafka Summit 2021 48
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code:
a. Create a function & decorate it to
process events
b. If outputting an event, write a
schema
c. Commit code to a git repository
Event
def create_prediction(client: dict) -> dict:
# DS would write side effects or fetches here.
# E.g. grab features, predict,
# create output message;
prediction = ...
return make_output_event(client, prediction)
@sf_kafka.register(
kafka_topic='client.signed_up',
output_schema={'predict.topic': schema})
def predict_foo(messages: List[str]) -> dict:
"""Predict XX about a client. ..."""
clients = [json.loads(m) for m in messages]
predictions = [create_prediction(c)
for c in clients]
return {'predict.topic': predictions}
my_prediction.py
49. Kafka Summit 2021 49
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code:
a. Create a function & decorate it to
process events
b. If outputting an event, write a
schema
c. Commit code to a git repository
Event
def create_prediction(client: dict) -> dict:
# DS would write side effects or fetches here.
# E.g. grab features, predict,
# create output message;
prediction = ...
return make_output_event(client, prediction)
@sf_kafka.register(
kafka_topic='client.signed_up',
output_schema={'predict.topic': schema})
def predict_foo(messages: List[str]) -> dict:
"""Predict XX about a client. ..."""
clients = [json.loads(m) for m in messages]
predictions = [create_prediction(c)
for c in clients]
return {'predict.topic': predictions}
my_prediction.py
50. Kafka Summit 2021 50
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code:
a. Create a function & decorate it to
process events
b. If outputting an event, write a
schema
c. Commit code to a git repository
"""Schema that we want to validate against."""
schema = {
'metadata': {
'timestamp': str,
'id': str,
'version': str
},
'payload': {
'some_prediction_value': float,
'client': int
}
}
Event
my_prediction.py
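The schema is just nested dicts mapping field names to Python types; how sf_kafka actually validates against it isn't shown, but one plausible sketch of such a validator (assumed, not the real implementation) is:

# Assumed validator for a type-based schema like the one above.
def validate(event: dict, schema: dict, path: str = '') -> None:
    """Recursively check that `event` has every schema field with the right type."""
    for key, expected in schema.items():
        if key not in event:
            raise ValueError(f'missing field {path}{key}')
        if isinstance(expected, dict):
            validate(event[key], expected, path=f'{path}{key}.')
        elif not isinstance(event[key], expected):
            raise TypeError(f'{path}{key} should be {expected.__name__}, '
                            f'got {type(event[key]).__name__}')

# e.g. validate(output_event, schema) would raise before a malformed prediction
# message is written back to Kafka.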
51. Kafka Summit 2021 51
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to consume.
2. Write code
3. Deploy via command line:
a. Handles python environment creation
b. Builds docker container
c. Deploys
52. Kafka Summit 2021 52
Self-service deployment via command line
Completing the production story
53. Kafka Summit 2021 53
Self-service deployment via command line
Completing the production story
1
54. Kafka Summit 2021 54
Self-service deployment via command line
Completing the production story
2
1
55. Kafka Summit 2021 55
Self-service deployment via command line
Completing the production story
2
3
1
56. Kafka Summit 2021 56
Self-service deployment via command line
Completing the production story
DS Touch Points
Can be in production in < 1 hour
Self-service!
57. Kafka Summit 2021 57
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to
consume.
2. Write code
3. Deploy via command line
4. Oncall:
a. Small runbook
1
58. Kafka Summit 2021 58
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to
consume.
2. Write code
3. Deploy via command line
4. Oncall:
a. Small runbook
1
2
59. Kafka Summit 2021 59
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to
consume.
2. Write code
3. Deploy via command line
4. Oncall:
a. Small runbook
1
2
3
60. Kafka Summit 2021 60
The self-service story of how a DS gets a consumer to production
Completing the production story
1. Determine the topic(s) to
consume.
2. Write code
3. Deploy via command line
4. Oncall:
a. Small runbook
1
2
3
4
63. Kafka Summit 2021 63
What? Learning
Do they use it? ✅ 👍
Learnings - DS Perspective
64. Kafka Summit 2021 64
What? Learning
Do they use it? ✅ 👍
Focusing on the function 1. All they need to know about kafka is that it’ll give
them a list of events.
2. Leads to better separation of concerns:
a. Can split driver code versus their logic.
b. Test driven development is easy.
Learnings - DS Perspective
65. Kafka Summit 2021 65
What? Learning
Do they use it? ✅ 👍
Focusing on the function 1. All they need to know about kafka is that it’ll give
them a list of events.
2. Leads to better separation of concerns:
a. Can split driver code versus their logic.
b. Test driven development is easy.
At least once processing 1. They enjoy easy error recovery; gives DS time to fix
things.
2. Idempotency requirement not an issue.
Learnings - DS Perspective
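Because the unit of deployment is "a function over a list of strings", testing it needs no Kafka at all. A minimal pytest-style sketch (the test name and sample payload are assumed) against the earlier hello world function:

import json

def test_hello_world_returns_no_output_events():
    messages = [json.dumps({'client_id': 123})]
    assert hello_world(messages) == {}   # hello_world is the function from the earlier slides

A DS testing their own consumer would do the same: hand-craft a few JSON strings, call the decorated function directly, and assert on the returned {output_topic: events} dict.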
67. Kafka Summit 2021 67
What? Learning
Writing back via proxy service 1. Helped early on with some minor message format
adjustments & validation.
2. Would recommend writing back directly if we were to
start again.
a. Writing back directly leads to better performance.
Learnings - Platform Perspective (1/2)
68. Kafka Summit 2021 68
What? Learning
Writing back via proxy service 1. Helped early on with some minor message format
adjustments & validation.
2. Would recommend writing back directly if we were to
start again.
a. Writing back directly leads to better performance.
Central place for all things kafka Very useful to have a central place to:
1. Understand topics & topic contents.
2. Offer an “off the shelf” ability to materialize a stream to a
datastore, which removed the need for DS to manage/optimize this
process. E.g. elasticsearch, data warehouse, feature
store.
Learnings - Platform Perspective (1/2)
69. Kafka Summit 2021 69
What? Learning
Using internal async libraries Using internal asyncio libs is cumbersome for DS.
Native asyncio framework would feel better.*
Learnings - Platform Perspective (2/2)
* we ended up creating a very narrowly focused micro-framework addressing these two issues using aiokafka.
70. Kafka Summit 2021 70
What? Learning
Using internal async libraries Using internal asyncio libs is cumbersome for DS.
Native asyncio framework would feel better.*
Lineage & Lineage Impacts If there is a chain of consumers*, didn’t have easy
introspection into:
● Processing speed of full chain
● Knowing what the chain was
Learnings - Platform Perspective (2/2)
* we ended up creating a very narrowly focused micro-framework addressing these two issues using aiokafka.
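The footnote mentions aiokafka; for context, standard aiokafka usage (not the Stitch Fix micro-framework, and with placeholder broker/topic/group values) looks roughly like the sketch below, which is what a "native asyncio" consumer loop buys.

import asyncio
from aiokafka import AIOKafkaConsumer

async def consume():
    consumer = AIOKafkaConsumer(
        'some.topic',
        bootstrap_servers='localhost:9092',   # placeholder
        group_id='hello_world_consumer',      # placeholder
        enable_auto_commit=False,             # keep the at-least-once behaviour
    )
    await consumer.start()
    try:
        async for msg in consumer:
            hello_world([msg.value.decode('utf-8')])   # the registered function
            await consumer.commit()                    # commit after successful processing
    finally:
        await consumer.stop()

# To run: asyncio.run(consume())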
71. Kafka Summit 2021 71
What? Why?
Being able to replace different
subcomponents & assumptions of the
system more easily.
Cleaner abstractions & modularity:
● Want to remove leaking business logic into
engine.
● Making parts pluggable means we can easily
change/swap out e.g. schema validation, or
serialization format, or how we write back to
kafka, processing assumptions, support asyncio,
etc.
Future Directions
72. Kafka Summit 2021 72
What? Why?
Being able to replace different
subcomponents & assumptions of the
system more easily.
Cleaner abstractions & modularity:
● Want to remove leaking business logic into
engine.
● Making parts pluggable means we can easily
change/swap out e.g. schema validation, or
serialization format, or how we write back to
kafka, processing assumptions, support asyncio,
etc.
Exploring stream processing like kafka
streams & faust.
Stream processing over windows is slowly becoming something more DS ask about.
Future Directions
73. Kafka Summit 2021 73
What? Why?
Being able to replace different
subcomponents & assumptions of the
system more easily.
Cleaner abstractions & modularity:
● Want to remove leaking business logic into
engine.
● Making parts pluggable means we can easily
change/swap out e.g. schema validation, or
serialization format, or how we write back to
kafka, processing assumptions, support asyncio,
etc.
Exploring stream processing like kafka
streams & faust.
Stream processing over windows is slowly becoming something more DS ask about.
Writing an open source version Hypothesis that this is valuable and that the
community would be interested; would you be?
Future Directions
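One concrete way to read "making parts pluggable": define small interfaces for things like serialization so the engine never hard-codes JSON. The sketch below is an assumption about what such an abstraction could look like, not the actual sf_kafka design.

# Assumed example of a pluggable serialization seam (typing.Protocol, Python 3.8+).
import json
from typing import Protocol

class MessageSerializer(Protocol):
    def loads(self, raw: bytes) -> dict: ...
    def dumps(self, event: dict) -> bytes: ...

class JsonSerializer:
    """Default implementation; an Avro or Protobuf version could be swapped in."""
    def loads(self, raw: bytes) -> dict:
        return json.loads(raw)

    def dumps(self, event: dict) -> bytes:
        return json.dumps(event).encode('utf-8')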
75. Kafka Summit 2021 75
TL;DR:
Summary
Kafka + Data Scientists @ Stitch Fix:
● We have a self-service platform for Data Scientists to deploy kafka consumers
● We achieve self-service through a separation of concerns:
○ Data Scientists focus on functions to process events
○ Data Platform provides guardrails for kafka operations