"We will discuss how we at Pinterest transformed real time user engagement event consumption.
Every day, we log hundreds of billions of user engagement events across different domains to a few common Kafka topics, which are consumed by hundreds of real-time applications. These real-time applications were built on divergent frameworks (e.g. Spark Streaming, Storm, Flink, and internally developed frameworks using the Kafka Consumer API) without standardized processing logic. This led to repeated implementation of similar logic, multiple codebases to maintain, low data quality, and inconsistency with offline datasets. These issues negatively impacted the scalability, reliability, efficiency, and data accuracy of these applications, ultimately affecting real-time content recommendation quality and the user experience.
To address these challenges, we unified event consumption across our real-time applications by consolidating the compute engines onto Flink, splitting the events in those common topics by engagement type, and generating cleansed events with standardized processing aligned to business concepts. Through these efforts, we achieved multi-million-dollar infrastructure savings and double-digit engagement gains after applications adopted the cleansed events.
Moving forward, we are implementing frameworks to better track and govern our Kafka events and real-time use cases.
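The abstract contains no code, but a minimal sketch of the splitting step it describes might look like the following Flink job, which routes events from a shared topic into per-engagement-type streams via side outputs. The event shape, tag names, and prefix-based routing rule are hypothetical, not Pinterest's actual implementation.

```java
// Hypothetical sketch: splitting a shared engagement-event topic by
// engagement type with Flink side outputs.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class EngagementSplitJob {
    // One side-output tag per engagement type (illustrative).
    static final OutputTag<String> CLICKS = new OutputTag<String>("clicks") {};
    static final OutputTag<String> SAVES  = new OutputTag<String>("saves") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In practice this would be a KafkaSource over the shared topic.
        DataStream<String> rawEvents = env.fromElements("click:pin1", "save:pin2");

        SingleOutputStreamOperator<String> main = rawEvents.process(
            new ProcessFunction<String, String>() {
                @Override
                public void processElement(String event, Context ctx, Collector<String> out) {
                    if (event.startsWith("click:")) {
                        ctx.output(CLICKS, event);   // route to the clicks stream
                    } else if (event.startsWith("save:")) {
                        ctx.output(SAVES, event);    // route to the saves stream
                    } else {
                        out.collect(event);          // everything else stays on the main stream
                    }
                }
            });

        // Each side output would sink to its own per-engagement-type Kafka topic.
        main.getSideOutput(CLICKS).print("clicks");
        main.getSideOutput(SAVES).print("saves");
        env.execute("engagement-split");
    }
}
```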
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018 - Bowen Li
Gregory Fee presented on Lyft's use of streaming technologies like Kafka and Flink. Lyft uses streaming for real-time tasks like traffic updates and fraud detection. Previously they used Kinesis and Spark/Hive but are moving to Kafka and Flink for better scalability and developer experience. Lyft's Dryft platform provides consistent feature generation for machine learning using Flink SQL to process streaming and batch data. Dryft programs can backfill historical data and process real-time streams.
The Lyft data platform: Now and in the future - markgrover
- Lyft has grown significantly in recent years, providing over 1 billion rides to 30.7 million riders through 1.9 million drivers in 2018 across North America.
- Data is core to Lyft's business decisions, from pricing and driver matching to analyzing performance and informing investments.
- Lyft's data platform supports data scientists, analysts, engineers and others through tools like Apache Superset, change data capture from operational stores, and streaming frameworks.
- Key focuses for the platform include business metric observability, streaming applications, and machine learning while addressing challenges of reliability, integration and scale.
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure - Fei Chen
ML platform meetups are quarterly meetups, where we discuss and share advanced technology on machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.
Shaik Niyas Ahamed Mohamed Hajiyar has over 7 years of experience in data warehousing and business intelligence, specializing in Ab Initio ETL tool, Teradata, and UNIX scripting. He has worked on several projects for clients like Tata Consultancy Services, Citi Bank, JPMorgan Chase, and John Lewis, taking on roles like developer, team lead, and trainer. His skills include ETL design, development, testing, support, and performance tuning across various technologies.
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz... - HostedbyConfluent
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenzhong XU | Current 2022
If you are a data scientist or a platform engineer, you can probably relate to the pains of working with the current explosive growth of Data/ML technologies and tooling. With many overlapping options and steep learning curves for each, the landscape is increasingly challenging for data science teams to navigate. Many platform teams have started building an abstracted ML platform layer to support generalized ML use cases. But there are many complexities involved, especially as the underlying real-time data shifts into the mainstream.
In this talk, we’ll discuss why ML platforms can benefit from a simple and "invisible" abstraction. We’ll offer some evidence on why you should consider leveraging streaming technologies even if your use cases are not real-time yet. We’ll share learnings (combining both ML and Infra perspectives) about some of the hard complexities involved in building such simple abstractions, the design principles behind them, and some counterintuitive decisions you may come across along the way.
By the end of the talk, I hope data scientists will walk away with tips on how to evaluate ML platforms, and platform engineers will pick up a few architectural and design tricks.
This document provides an overview of Spring Cloud Data Flow, including what it is, its key components like Spring Batch and Spring Cloud Stream applications, how it can be used for batch jobs, tasks, and streams, and how it provides orchestration and deployment on platforms like Kubernetes. It also discusses Spring Cloud Data Flow's observability features and includes an interview discussing how one user implemented batch and stream processing using Spring Cloud Data Flow to ingest and process data in a more real-time and fault-tolerant manner.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... - Databricks
Bighead is Airbnb's machine learning infrastructure that was created to:
- Standardize and simplify the ML development workflow;
- Reduce the time and effort to build ML models from weeks/months to days/weeks; and
- Enable more teams at Airbnb to utilize ML.
It provides shared services and tools for data management, model training/inference, and model management to make the ML process more efficient and production-ready. This includes services like Zipline for feature storage, Redspot for notebook environments, Deep Thought for online inference, and the Bighead UI for model monitoring.
Bighead is built on open source technologies like Spark, TensorFlow, and Kubernetes but addresses gaps to fully support the end-to-end ML pipeline.
Chaitanya Lakshmi Chitrala has over 7 years of experience in data warehousing and ETL development using Ab Initio, with strong knowledge of data warehousing concepts like star schemas and dimensional models. Roles on multiple Citigroup projects have included requirement analysis, data modeling, and the design, development, and testing of complex ETL graphs that load large volumes of data from various source systems into data warehouses for reporting and analytics, using tools like Ab Initio, SQL Server, Oracle, and Unix shell scripts.
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...Nesma
This document discusses using Function Point Analysis (FPA) as a metric for Agile software projects. It provides context on replacing an existing trading system and outlines an architecture and development approach using Agile/Scrum. Metrics are proposed for use at the sprint level and cumulatively, including function points, story points, lines of code, and productivity rates. FPA is argued to provide benefits for scope management, benchmarking, and proving productivity and quality for Agile projects. Contracting based on function points is also discussed.
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t... - Flink Forward
CloudStream is a fully managed service in Huawei Cloud. It supports features such as on-demand billing, an easy-to-use online Stream SQL editor, real-time testing of Stream SQL, multi-tenancy, and security isolation. We chose Apache Flink as the streaming compute platform; inside a CloudStream cluster, Flink jobs can run on YARN, Mesos, or Kubernetes. We have also extended Apache Flink to meet the needs of IoT scenarios, and Flink's reliability has been specifically tested in cooperation with universities. We continuously improve the infrastructure around CloudStream, including open source projects and cloud services. CloudStream differs from other real-time analysis cloud services, and the talk also shares its development process, architecture, and principles.
The document contains the resume of Naveen Reddy Tamma, summarizing his work experience and qualifications. He has over 7 years of experience as an Associate at Cognizant Technology Solutions on projects involving Informatica ETL development, data quality testing, and report generation. He holds a B.Tech in Computer Science and has experience with technologies like Informatica, Teradata, Oracle, and Cognos.
Reducing Cost of Production ML: Feature Engineering Case Study - Venkata Pingali
Production machine learning feature engineering is complex, expensive, and a key activity. It involves generating thousands of features from data in a continuous process. A disciplined approach can deliver a 10x improvement by increasing development speed, ensuring correctness, facilitating evolution to scale, and controlling executions. Major companies have developed platforms to manage feature engineering workflows and reduce costs through standardization, reuse, and automation.
Accelerating Digital Transformation: It's About Digital Enablement - Joshua Gossett
Digital Transformation is a strategy that industries have been embracing over the past several years. Efforts are maturing, but organizations continue to struggle to capture new digital value and reflect it on the bottom line. Digital Transformation efforts at most legacy companies falter because they are treated as a technology problem.
Any "Transformational" strategy must address all the stakeholders involved as well as have a focus on delivering value to these stakeholders at multiple levels. Success can and has been delivered through the creation of Digital Transformation Enablement Programs that address the multiple stakeholder dimensions (people, process, and technology) and ultimately lead to digital being just how we do business.
In this discussion I will specifically outline the steps that we have leveraged to deliver Digital Transformation Enablement and as a byproduct change the way people work, how they approach problems with the application of technologies, and ultimately drive new value for their organization and customers.
Running Flink in Production: The good, The bad and The in Between - Lakshmi ... - Flink Forward
The streaming platform team at Lyft has been running Flink jobs in production for more than a year now, powering critical use cases like improving pickup ETA accuracy, dynamic pricing, generating machine learning features for fraud detection, real-time analytics among many others. Broadly, the jobs fall into two abstraction layers: applications (Flink jobs that run on the native platform) and analytics (that leverage Dryft, Lyft’s fully managed data processing engine). This talk will give an overview of the platform architecture, deployment model and user experience. The talk will also dive deeper into some of the challenges and the lessons that were learnt, running Flink jobs at scale, specifically around scaling Flink connectors, dealing with event time skew (source synchronization) and highlight common patterns of problems observed across several Flink jobs. Finally, the talk will give insights into how we are re-architecting the streaming platform @ Lyft using a Kubernetes based deployment.
Apache Spark has been rapidly gaining steam, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown into one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily owing to in-memory storage of data in its own processing framework. One of the top real-world industry use cases for Apache Spark is processing streaming data.
Towards Apache Flink 2.0 - Unified Data Processing and Beyond - Bowen Li
This talk was presented at the Scale By the Bay conference on Nov 14, 2019.
As the most popular and widely adopted stream processing framework, Apache Flink powers some of the world's largest stream processing use cases at companies like Netflix, Alibaba, Uber, Lyft, Pinterest, Yelp, etc.
In this talk, we will first go over use cases and basic (yet hard to achieve!) requirements of stream processing, and how Flink stands out with some of its unique core building blocks, like pipelined execution, native event time support, state support, and fault tolerance.
We will then take a look at how Flink is going beyond stream processing into areas like unified streaming/batch data processing, enterprise integration with Hive, AI/machine learning, and serverless computation; how Flink fits in with its distinct value; and what development is going on in the Flink community to close the gap.
• A competent professional with 3.5 years of experience in the data warehousing and investment banking domains.
• Expertise in end-to-end implementation of various projects, including the design, development, coding, and implementation of software applications.
Story of migrating event pipeline from batch to streaming - lohitvijayarenu
The document summarizes Twitter's migration of its 4 trillion event log pipeline from batch to streaming processing using Apache technologies. Key aspects include:
1. Twitter aggregated 10PB of event logs across millions of clients into categories stored hourly on HDFS.
2. They designed a log pipeline in Google Cloud Platform using PubSub for storage, Dataflow jobs to stream to destinations like BigQuery and GCS, and a client library for uniform event publishing.
3. The pipeline supports streaming 4+ trillion events per day between Twitter datacenters and Google Cloud at sub-second latency while ensuring data integrity.
My past-3 yeas-developer-journey-at-linkedin-by-iantsai - Kim Kao
Ian Tsai shared his three-year developer journey at LinkedIn. It covered migrating a monolith into microservices three years ago, the difficult challenges he faced, and the need for effective tools to support the change.
Sukanta Saha is a data warehousing and investment banking professional with over 3 years of experience implementing projects using Informatica PowerCenter. He has expertise in ETL processes and data modeling, working with databases like Oracle, SQL Server, and Teradata. Currently an Informatica developer at Tata Consultancy Services in Pune, his past projects include data migration for Barclays Bank and developing mappings for loading data into a Teradata data warehouse serving the banking and financial services domain. He has certifications in Oracle Database SQL, skills in Unix, SQL, and data warehousing concepts, and seeks to further contribute his skills in data integration.
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa... - HostedbyConfluent
"In this talk, attendees will be provided with an introduction to Kafka Connect and the basics of Single Message Transforms (SMTs) and how they can be used to transform data streams in a simple and efficient way. SMTs are a powerful feature of Kafka Connect that allow custom logic to be applied to individual messages as they pass through the data pipeline. The session will explain how SMTs work, the types of transformations they can be used for, and how they can be applied in a modular and composable way.
Further, the session will discuss where SMTs fit in with Kafka Connect and when they should be used. Examples will be provided of how SMTs can be used to solve common data integration challenges, such as data enrichment, filtering, and restructuring. Attendees will also learn about the limitations of SMTs and when it might be more appropriate to use other tools or frameworks.
Additionally, an overview of the alternatives to SMTs, such as Kafka Streams and KSQL, will be provided. This will help attendees make an informed decision about which approach is best for their specific use case.
Whether attendees are developers, data engineers, or data scientists, this talk will provide valuable insights into how Kafka Connect and SMTs can help streamline data processing workflows. Attendees will come away with a better understanding of how these tools work and how they can be used to solve common data integration challenges.
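As a taste of what the session covers, here is a minimal custom SMT sketch: it drops records whose key starts with a configurable prefix. The `Transformation` interface is Kafka Connect's real extension point; the class name and config key below are illustrative.

```java
// A minimal custom SMT sketch: drops records whose key starts with a
// configured prefix. Returning null from apply() drops the record.
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class DropByKeyPrefix<R extends ConnectRecord<R>> implements Transformation<R> {
    private String prefix;

    @Override
    public void configure(Map<String, ?> configs) {
        prefix = (String) configs.get("prefix");
    }

    @Override
    public R apply(R record) {
        // Null removes the record from the pipeline; otherwise pass through.
        if (record.key() != null && record.key().toString().startsWith(prefix)) {
            return null;
        }
        return record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef().define("prefix", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "Key prefix of records to drop");
    }

    @Override
    public void close() {}
}
```

Such a transform would be wired into a connector config via `transforms=dropPrefix` and `transforms.dropPrefix.type` pointing at the class, plus its `prefix` setting.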
"While Apache Kafka lacks native support for topic renaming, there are scenarios where renaming topics becomes necessary. This presentation will delve into the utilization of MirrorMaker 2.0 as a solution for renaming Kafka topics. It will illustrate how MirrorMaker 2.0 can efficiently facilitate the migration of messages from the old topic to the new one and how Kafka Connect Metrics can be employed to monitor the mirroring progress. The discussion will encompass the complexity of renaming Kafka topics, addressing certain limitations, and exploring potential workarounds when using MirrorMaker 2.0 for this purpose. Despite not being originally designed for topic renaming, MirrorMaker 2.0 has a suitable solution for renaming Kafka topics.
Blog post: https://engineering.hellofresh.com/renaming-a-kafka-topic-d6ff3aaf3f03
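One hedged sketch of the renaming approach, assuming a custom replication policy fits your setup: MirrorMaker 2 delegates remote topic naming to a `ReplicationPolicy`, so overriding `formatRemoteTopic` on the default policy can map an old topic name to a new one. The specific rename below is illustrative; the policy would be activated with the `replication.policy.class` setting in the MirrorMaker 2 properties.

```java
// A hedged sketch of renaming a topic during MirrorMaker 2 replication by
// overriding DefaultReplicationPolicy (part of Kafka's connect-mirror API).
import org.apache.kafka.connect.mirror.DefaultReplicationPolicy;

public class RenamingReplicationPolicy extends DefaultReplicationPolicy {
    @Override
    public String formatRemoteTopic(String sourceClusterAlias, String topic) {
        // Map the old name to the new one instead of the default
        // "<alias>.<topic>" prefixing scheme. Names are illustrative.
        if ("orders-v1".equals(topic)) {
            return "orders-v2";
        }
        return super.formatRemoteTopic(sourceClusterAlias, topic);
    }
}
```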
Evolution of NRT Data Ingestion Pipeline at Trendyol - HostedbyConfluent
"Trendyol, Turkey's leading e-commerce company, is committed to positively impacting the lives of millions of customers. Our decision-making processes are entirely driven by data. As a data warehouse team, our primary goal is to provide accurate and up-to-date data, enabling the extraction of valuable business insights.
We utilize the benefits provided by Kafka and Kafka Connect to facilitate the transfer of data from the source to our analytical environment. We recently transitioned our Kafka Connect clusters from on-premise VMs to Kubernetes. This shift was driven by our desire to effectively manage rapid growth (marked by a growing number of producers, consumers, and daily messages), ensuring proper monitoring and consistency. Consistency is crucial, especially in instances where we employ Single Message Transforms to manipulate records, like filtering based on their keys or converting a JSON object into a JSON string.
Monitoring our cluster's health is key and we achieve this through Grafana dashboards and alerts generated through kube-state-metrics. Additionally, Kafka Connect's JMX metrics, coupled with NewRelic, are employed for comprehensive monitoring.
The session will aim to explain our approach to NRT data ingestion, outlining the role of Kafka and Kafka Connect, our transition journey to K8s, and methods employed to monitor the health of our clusters.
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques - HostedbyConfluent
"Join our lightning talk to delve into the strategies vital for maintaining a resilient Kafka service.
While proactive monitoring is key for issue prevention, failures will still occur. Rapid detection tools will enable you to identify and resolve problems before they impact end-users. This session explores the techniques employed by Kafka cloud providers for this detection, many of which are also applicable if you are managing independent Kafka clusters or applications.
The talk focuses on health-checking, a powerful tool that encompasses an application and its monitoring to validate Kafka environment availability. The session navigates through Kafka health-check methods, sharing best practices, identifying common pitfalls, and highlighting the monitoring of critical performance metrics like throughput and latency for early issue detection.
Attendees will gain valuable insights into the art of health-checking their Kafka environment, equipping them with the tools to identify and address issues before they escalate into critical problems. We invite all Kafka enthusiasts to join us in this talk to foster a deeper understanding of Kafka health-checking and ensure the continued smooth operation of your Kafka environment.
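A minimal health check of the kind discussed might look like the sketch below, assuming that reachability of cluster metadata is an acceptable liveness signal; a fuller check would also do a produce/consume round trip and measure its latency. Endpoint and timeout values are illustrative.

```java
// A minimal liveness probe sketch using Kafka's AdminClient: if the cluster
// metadata call succeeds within the timeout, the brokers are reachable.
import java.util.Collection;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class KafkaHealthCheck {
    public static boolean isHealthy(String bootstrapServers) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 5000);
        try (Admin admin = Admin.create(props)) {
            // Fails fast if no broker responds in time.
            Collection<Node> nodes = admin.describeCluster().nodes().get(5, TimeUnit.SECONDS);
            return nodes != null && !nodes.isEmpty();
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHealthy("localhost:9092") ? "OK" : "UNHEALTHY");
    }
}
```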
Exactly-once Stream Processing with Arroyo and Kafka - HostedbyConfluent
"Stream processing systems traditionally gave their users the choice between at least once processing and at most once processing: accepting duplicate data or missing data. But ideally we would provide exactly-once processing, where every event in the input data is represented exactly once in the output.
Kafka provides a transaction API that enables exactly-once processing when using Kafka as your source and sink. But this API has turned out not to be well suited for use by high-level streaming systems, requiring various workarounds to still provide transactional processing.
In this talk, I’ll cover how the transaction API works, how systems like Arroyo and Flink have used it to build exactly-once support, and how improvements to the transactional API will enable better end-to-end support for consistent stream processing.
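For reference, here is a hedged sketch of the transaction API in a consume-transform-produce loop; topic names and the transformation itself are illustrative. Offsets are committed inside the same transaction, so output records and consumer progress commit or abort together.

```java
// A sketch of Kafka's transaction API: consume, transform, produce, and
// commit consumer offsets atomically within one transaction.
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoop {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "etl-1"); // stable per instance
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringSerializer");
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringSerializer");

        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "etl");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // only committed data
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(List.of("input"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("output", r.key(), r.value().toUpperCase()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                    new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Consumer offsets ride along in the same transaction.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // input will be reprocessed
                }
            }
        }
    }
}
```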
"In this talk, we will explore the exciting world of IoT and computer vision by presenting a unique project: Fish Plays Pokemon. Using an ESP Eye camera connected to an ESP32 and other IoT devices, to monitor fish's movements in an aquarium.
This project showcases the power of IoT and computer vision, demonstrating how even a fish can play a popular video game. We will discuss the challenges we faced during development, including real-time processing, IoT device integration, and Kafka message consumption.
By the end of the talk, attendees will have a better understanding of how to combine IoT, computer vision, and the usage of a serverless cloud to create innovative projects. They will also learn how to integrate IoT devices with Kafka to simulate keyboard behavior, opening up endless possibilities for real-time interactions between the physical and digital worlds.
What is tiered storage and what is it good for? After this session you will know how to leverage the tiered storage feature to enable longer retention than the storage attached to brokers allows. You will get acquainted with the different configuration options and know what to expect when you enable the feature, like for example when will the first upload to the remote object storage take place.
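As a concrete illustration (not from the session itself), creating a topic with tiered storage enabled might look like the sketch below, assuming a Kafka 3.6+ cluster with a remote storage plugin configured on the brokers; the topic name and retention values are illustrative.

```java
// A sketch of creating a topic with tiered storage enabled, assuming the
// brokers have remote storage configured (Kafka 3.6+).
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("engagement-events", 12, (short) 3)
                .configs(Map.of(
                    "remote.storage.enable", "true",   // offload closed segments
                    "local.retention.ms", "3600000",   // keep ~1h on broker disks
                    "retention.ms", "2592000000"));    // 30 days total, mostly remote
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```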
Building a Self-Service Stream Processing Portal: How And Why - HostedbyConfluent
"Real-time 24/7 monitoring and verification of massive data is challenging – even more so for the world’s second largest manufacturer of memory chips and semiconductors. Tolerance levels are incredibly small, any small defect needs to be identified and dealt with immediately. The goal of semiconductor manufacturing is to improve yield and minimize unnecessary work.
However, even with real-time data collection, the data was not easy to manipulate by users and it took many days to enable stream processing requests – limiting its usefulness and value to the business.
You’ll hear why SK hynix switched to Confluent and how we developed a self-service stream processing portal on top of it. Now users have an easy-to-use service to manipulate the data they want.
Results have been impressive: stream processing requests are available the same day, where previously they took 5 days! We were also able to drive down costs by 10%, as stream processing requests no longer require additional hardware.
What you’ll take away from our talk:
- What were the pain points in the previous environment
- How we transitioned to Confluent without service downtime
- Creating a self-service stream processing portal built on top of Connect and ksqlDB
- Use cases of the stream processing portal
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ... - HostedbyConfluent
"Discover how default configurations might impact ingestion times, especially when dealing with large files. We'll explore a real-world scenario with a 20,000,000+ line file, assessing metrics and exploring the bottleneck in the default setup. Understand the intricacies of batch size calculations and how to optimize them based on your unique data characteristics.
Walk away with actionable insights as we showcase a practical example, turning a 7-hour ingestion process into a mere 30 minutes for over 30,000,000 records in a Kafka topic. Uncover metrics, configurations, and best practices to elevate the performance of your Kafka Connect CSV source connectors. Don't miss this opportunity to optimize your data pipeline and ensure smooth, efficient data flow.
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ... - HostedbyConfluent
"In order to meet the current and ever-increasing demand for near-zero RPO/RTO systems, a focus on resiliency is critical. While Kafka offers built-in resiliency features, a perfect blend of client and cluster resiliency is necessary in order to achieve a highly resilient Kafka client application.
At Fidelity Investments, Kafka is used for a variety of event streaming needs such as core brokerage trading platforms, log aggregation, communication platforms, and data migrations. In this lightning talk, we will discuss the governance framework that has enabled producers and consumers to achieve their SLAs during unprecedented failure scenarios. We will highlight how we automated resiliency tests through chaos engineering and tightly integrated observability dashboards for Kafka clients to analyze and optimize client configurations. And finally, we will summarize the chaos test suite and the "test, test and test" mantra that are helping Fidelity Investments reach its goal of a future with zero down-time.
Navigating Private Network Connectivity Options for Kafka Clusters - HostedbyConfluent
"There are various strategies for securely connecting to Kafka clusters between different networks or over the public internet. Many cloud providers even offer endpoints that privately route traffic between networks and are not exposed to the internet. But, depending on your network setup and how you are running Kafka, these options ... might not be an option!
In this session, we’ll discuss how you can use SSH bastions or a self-managed PrivateLink endpoint to establish connectivity to your Kafka clusters without exposing brokers directly to the internet. We explain the required network configuration, and show how we at Materialize have contributed to librdkafka to simplify these scenarios and avoid fragile workarounds.
Apache Flink: Building a Company-wide Self-service Streaming Data Platform - HostedbyConfluent
"In my talk, we will examine all the stages of building our self-service Streaming Data Platform based on Apache Flink and Kafka Connect, from the selection of a solution for stateful streaming data processing, right up to the successful design of a robust self-service platform, covering the challenges that we’ve met.
I will share our experience in providing non-Java developers with a company-wide self-service solution, which allows them to quickly and easily develop their streaming data pipelines.
Additionally, I will highlight specific business use cases that would not have been implemented without our platform.
Explaining How Real-Time GenAI Works in a Noisy Pub - HostedbyConfluent
"Almost everyone has heard about large language models, and tens of millions of people have tried out OpenAI ChatGPT and Google Bard. However, the intricate architecture and underlying mathematics driving these remarkable systems remain elusive to many.
LLMs are fascinating - so let's grab a drink and find out how these systems are built and dive deep into their inner workings. In the length of time it takes to enjoy a round of drinks, you'll understand the inner workings of these models. We'll take our first sip of word vectors, enjoy the refreshing taste of the transformer, and drain a glass understanding how these models are trained on phenomenally large quantities of data.
Large language models for your streaming application - explained with a little maths and a lot of pub stories"
"Monitoring is a fundamental operation when running Kafka and Kafka applications in production. There are numerous metrics available when using Kafka, however the sheer number is overwhelming, making it challenging to know where to start and how to properly utilise them.
This session will introduce you to some of the key metrics that should be monitored and best practices in fine-tuning your monitoring. We will delve into which metrics are the key indicators of a cluster's availability and performance, and which are the most helpful when debugging client applications.
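As one illustration of the kind of metric the session points to, a client can read its own metrics in-process (the same values are exposed over JMX). The sketch below watches the consumer's standard `records-lag-max` fetch metric; the surrounding consumer wiring is assumed.

```java
// A sketch of reading Kafka client metrics in-process. The consumer is
// assumed to be an already-configured, actively polling KafkaConsumer.
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class LagProbe {
    public static void logMaxLag(KafkaConsumer<String, String> consumer) {
        Map<MetricName, ? extends Metric> metrics = consumer.metrics();
        for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
            // Max record lag across partitions: a key availability indicator.
            if (e.getKey().name().equals("records-lag-max")) {
                System.out.printf("%s = %s%n", e.getKey().name(), e.getValue().metricValue());
            }
        }
    }
}
```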
Kafka Streams relies on state restoration to maintain standby tasks as a failure-recovery mechanism, as well as to restore state after rebalance scenarios. When you are scaling your application instances up or down, you need to know the current state of the restoration process for each active and standby task in order to keep the restoration process as short as possible. During this presentation, you will see how KIP-869 provides valuable information about active task restoration after a rebalance and how KIP-988 opens a window into the continuous process of standby restoration. When you need to decide whether to scale your application instances up or down, both KIPs will be an invaluable ally.
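For orientation, the long-standing `StateRestoreListener` hook already surfaces active-task restoration progress, which KIP-869 and KIP-988 extend (KIP-988 adds a dedicated standby-update listener). A minimal sketch, with the topology and names illustrative:

```java
// A sketch of observing state restoration in Kafka Streams via the
// StateRestoreListener; KIP-869/KIP-988 extend this visibility further.
import java.util.Properties;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.StateRestoreListener;

public class RestoreVisibility {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "restore-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Any stateful topology; a table materializes a state store.
        builder.table("input-table", Consumed.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.setGlobalStateRestoreListener(new StateRestoreListener() {
            @Override
            public void onRestoreStart(TopicPartition tp, String store, long start, long end) {
                System.out.printf("restore %s[%s] offsets %d..%d%n", store, tp, start, end);
            }
            @Override
            public void onBatchRestored(TopicPartition tp, String store, long endOffset, long numRestored) {
                System.out.printf("restored %d records into %s%n", numRestored, store);
            }
            @Override
            public void onRestoreEnd(TopicPartition tp, String store, long totalRestored) {
                System.out.printf("restore of %s done: %d records%n", store, totalRestored);
            }
        });
        streams.start();
    }
}
```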
Mastering Kafka Producer Configs: A Guide to Optimizing Performance - HostedbyConfluent
"In this talk, we will dive into the world of Kafka producer configs and explore how to understand and optimize them for better performance. We will cover the different types of configs, their impact on performance, and how to tune them to achieve the best results. Whether you're new to Kafka or a seasoned pro, this session will provide valuable insights and practical tips for improving your Kafka producer performance.
- Introduction to Kafka producer internal and workflow
- Understanding the producer configs like linger.ms, batch.size, buffer.memory and their impact on performance
- Learning about producer configs like max.block.ms, delivery.timeout.ms, request.timeout.ms and retries to make the producer more resilient.
- Discussing configs like enable.idempotence, max.in.flight.requests.per.connection and transaction-related configs to achieve delivery guarantees.
- Q&A session with attendees to address specific questions and concerns."
Data Contracts Management: Schema Registry and Beyond - HostedbyConfluent
"Data contracts are one of the hottest topics in the data management community. A data contract is a formal agreement between a data producer and its consumers, aimed at reducing data downtime and improving data quality. Schemas are an important part of data contracts, but they are not the only relevant element.
In this talk, we’ll:
1. see why data contracts are so important but also difficult to implement;
2. identify the characteristics of a well-designed data contract: discuss the anatomy of a data contract, its main elements, and how to formally describe them;
3. show how to manage the lifecycle of a data contract leveraging Confluent Platform's services.
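Since schemas are "an important part of data contracts", here is one small, hedged illustration of enforcing that part on the producer side with Confluent's serializer settings; the URL is illustrative, and the flags shown are standard Confluent serializer configs.

```java
// A sketch of enforcing the schema part of a data contract: with
// auto-registration disabled, producers can only use schemas already
// agreed upon and registered for the subject in Schema Registry.
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ContractedProducerConfig {
    public static Properties build() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "io.confluent.kafka.serializers.KafkaAvroSerializer");
        p.put("schema.registry.url", "http://schema-registry:8081");
        // Fail fast instead of silently registering an off-contract schema.
        p.put("auto.register.schemas", false);
        p.put("use.latest.version", true);
        return p;
    }
}
```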
"In the realm of stateful stream processing, Apache Flink has emerged as a powerful and versatile platform. However, the conventional SQL-based approach often limits the full potential of Flink applications.
We will delve into the benefits of adopting a code-first approach, which provides developers with greater control over application logic, facilitates complex transformations, and enables more efficient handling of state and time. We will also discuss how the code-first approach can lead to more maintainable and testable code, ultimately improving the overall quality of your Flink applications.
Whether you're a seasoned Flink developer or just starting your journey, this talk will provide valuable insights into how a code-first approach can revolutionize your stream processing applications.
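To make the contrast concrete, here is a small, hedged DataStream sketch of per-key state logic that is direct in code but awkward to express in pure SQL; the event shape and threshold rule are illustrative.

```java
// A code-first sketch: per-key state with an arbitrary rule, written
// directly against Flink's DataStream API.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class CodeFirstExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("user1", "user2", "user1", "user1")
           .keyBy(user -> user)
           .process(new KeyedProcessFunction<String, String, String>() {
               private transient ValueState<Long> count;

               @Override
               public void open(Configuration parameters) {
                   count = getRuntimeContext().getState(
                       new ValueStateDescriptor<>("count", Long.class));
               }

               @Override
               public void processElement(String event, Context ctx, Collector<String> out)
                       throws Exception {
                   long c = count.value() == null ? 1 : count.value() + 1;
                   count.update(c);
                   if (c >= 3) { // arbitrary per-key rule, hard to express cleanly in SQL
                       out.collect(ctx.getCurrentKey() + " reached " + c + " events");
                   }
               }
           })
           .print();
        env.execute("code-first-flink");
    }
}
```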
Debezium vs. the World: An Overview of the CDC Ecosystem - HostedbyConfluent
"Change Data Capture (CDC) has become a commodity in data engineering, much in part due to the ever-rising success of Debezium [1]. But is that all there is? In this lightning talk, we’ll outline the current state of the CDC ecosystem, and understand why adopting a Debezium alternative is still a hard sell. If you’ve ever wondered what else is out there, but can’t keep up with the sprawling of new tools in the ecosystem; we’ll wrap it up for you!
[1] https://debezium.io/"
Beyond Tiered Storage: Serverless Kafka with No Local Disks - HostedbyConfluent
"Separation of compute and storage has become the de-facto standard in the data industry for batch processing.
The addition of tiered storage to open source Apache Kafka is the first step in bringing true separation of compute and storage to the streaming world.
In this talk, we'll discuss in technical detail how to take the concept of tiered storage to its logical extreme by building an Apache Kafka protocol compatible system that has zero local disks.
Eliminating all local disks in the system requires not only separating storage from compute, but also separating data from metadata. This is a monumental task that requires reimagining Kafka's architecture from the ground up, but the benefits are worth it.
This approach enables a stateless, elastic, and serverless deployment model that minimizes operational overhead and also drives inter-zone networking costs to almost zero.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
4. What is Pinterest?
Pinterest is the visual inspiration platform people around the world use to shop products personalized to their taste, find ideas to do offline and discover the most inspiring creators.
Pinterest’s mission is to bring everyone the inspiration to create a life that they love.
5. Who are we?
We are engineers from Pinterest Data Eng.
Data Eng’s mission is to create and run reliable, efficient and planet-scale data platforms and services to accelerate innovation and sustain Pinterest business.
13. Scaling challenges and solutions (2019 ~ 2020): Stability
Challenge
● outbound traffic = number of jobs × inbound traffic: with hundreds of consumer jobs each reading the full event topics, outbound traffic dwarfed inbound traffic
● Kafka clusters hosting the event topics had very high resource saturation
Observation
● each job only needs to process a few common event types (e.g. click, view)
● events of those common types are a small portion of all the events
[Diagram: one event topic (type 1, type 2, …, type M) fanned out to streaming job 1, streaming job 2, …, streaming job N]
14. Scaling challenges and solutions (2019 ~ 2020): Stability
Solution: Stream Splitter v1
● Flink DataStream API
● Job graph consists of source, filter and sink
● Filter operator keeps only events of the small set of types required by downstream
[Diagram: event (type 1, type 2, …, type M) → Stream Splitter v1 → event_core (type i, type j, type k)]
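The deck does not show the v1 job code; below is a minimal sketch of what a source → filter → sink splitter can look like with the Flink DataStream API. The string-encoded event, the type-extraction helper, and the topic / broker names are all illustrative assumptions, not Pinterest's actual code.

```scala
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.streaming.api.scala._

object StreamSplitterV1 {
  // Hypothetical: the small set of types required by downstream jobs.
  val keptTypes = Set("click", "view")

  // Hypothetical helper; a real job would deserialize the event schema.
  def eventType(raw: String): String = raw.split(',')(0)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = KafkaSource.builder[String]()
      .setBootstrapServers("kafka:9092")              // placeholder address
      .setTopics("event")                             // the common event topic
      .setGroupId("stream-splitter-v1")
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build()

    val sink = KafkaSink.builder[String]()
      .setBootstrapServers("kafka:9092")              // placeholder address
      .setRecordSerializer(KafkaRecordSerializationSchema.builder[String]()
        .setTopic("event_core")                       // the derived topic
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
      .build()

    env.fromSource(source, WatermarkStrategy.noWatermarks(), "event")
      .filter(raw => keptTypes.contains(eventType(raw))) // keep only needed types
      .sinkTo(sink)

    env.execute("stream-splitter-v1")
  }
}
```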
15. Scaling challenges and solutions (2019 ~ 2020): Stability
Win
● The derived topics were about 10% of the size of the original event topics, and the high Kafka cluster resource saturation issue was mitigated.
● Due to the smaller input QPS, jobs processing the derived topics required less CPU / memory, and AWS cross-AZ traffic cost was reduced. Infra savings!!!
[Diagram: event_core (type i, type j, type k) → streaming job 1, streaming job 2, …, streaming job N]
16. Scaling challenges and solutions (2021 ~ 2022): Efficiency
Challenge
● With new jobs requiring more event types, the derived topics grew larger and larger (10% → 30% of the original event topics)
● Infra cost grew significantly as new jobs onboarded
Observation
● The QPS for each job grew with the derived topics, so each job required more resources
● Each job still had to filter input events by type to get what it needed
17. Scaling challenges and solutions (2021 ~ 2022): Efficiency
Solution: Stream Splitter v2
● Flink SQL
● Job consists of a statement set of DML statements: insert into event_type_i (select * from event where type = type_i)
● one DML statement per per-type event topic
[Diagram: event (type 1, type 2, …, type M) → Stream Splitter v2 → event_type_i, event_type_j, event_type_k, …]
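A minimal sketch of the v2 pattern with a Flink SQL statement set; the DDL, simplified schema, connector options and type list are illustrative placeholders, not Pinterest's actual table definitions.

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object StreamSplitterV2 {
  def main(args: Array[String]): Unit = {
    val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

    // Source table over the common event topic (schema heavily simplified).
    tEnv.executeSql(
      """CREATE TABLE event (`type` STRING, payload STRING) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'event',
        |  'properties.bootstrap.servers' = 'kafka:9092',
        |  'scan.startup.mode' = 'latest-offset',
        |  'format' = 'json')""".stripMargin)

    val types = Seq("click", "view") // hypothetical type list
    val stmtSet = tEnv.createStatementSet()

    for (t <- types) {
      // One per-type sink table and one DML statement per derived topic.
      tEnv.executeSql(
        s"""CREATE TABLE event_type_$t (`type` STRING, payload STRING) WITH (
           |  'connector' = 'kafka',
           |  'topic' = 'event_type_$t',
           |  'properties.bootstrap.servers' = 'kafka:9092',
           |  'format' = 'json')""".stripMargin)
      stmtSet.addInsertSql(
        s"INSERT INTO event_type_$t SELECT * FROM event WHERE `type` = '$t'")
    }

    stmtSet.execute() // all DMLs run as a single job
  }
}
```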
18. Scaling challenges and solutions (2021 ~ 2022): Efficiency
Win
● Downstream jobs only process the few per-type event topics that they need
● Downstream jobs no longer need filter logic
● Downstream jobs require much less infra resources (infra savings!!!)
● Setting up a new pipeline only requires a new topic and a SQL statement
[Diagram: event_type_i, event_type_j, event_type_k, … → streaming job 1, streaming job 2, …, streaming job N]
19. Scaling challenges and solutions (2021 ~ 2022): Efficiency
Issues with Stream Splitter v2
● All the records coming out of the source operator are forwarded to every pipeline
● Stream Splitter v2 jobs cost twice as much as v1 jobs
Note: the job graph is generated by the internal SQL planner from the DML statements; other operators that do not affect the data transportation pattern are not shown for better visualization.
[Diagram: Kafka Source emits the full stream (M QPS) to every branch: filter i (type = type_i) → Kafka Sink i, filter j (type = type_j) → Kafka Sink j, filter k (type = type_k) → Kafka Sink k, …; each filter receives all M records but only emits Mi / Mj / Mk]
20. Scaling challenges and solutions (2021 ~ 2022): Data quality
Challenge
● Streaming and batch workflows generated inconsistent results
Observation
● Streaming jobs re-implemented many batch ETL logics without standardization
[Diagram: the same event topic feeds both streaming jobs and the batch workflows that build the DWH SOT tables, and the two report different counts, e.g. 100 vs. 70 impressions]
21. Scaling challenges and solutions (2021 ~ 2022): Data quality
Solution: Real time DWH streams
● Built with NRTG, a mini framework on top of a subset of the Flink DataStream API (the Flink state API is not supported)
● Job graph consists of source, filter, enrich, dedup and sink
● filter, enrich and dedup logics reuse those in batch ETL
● dedup keys are stored in off-heap memory (with a pre-configured memory size) via the third-party library ohc
[Diagram: event (type 1, type 2, …, type M) → dwh_event (enriched and deduped; type i, type j, type k)]
Caveat: dedup accuracy is compromised during task restart or job deployment, as the in-memory dedup keys are lost; it takes up to one day's raw events to rebuild the state.
22. Scaling challenges and solutions (2021 ~ 2022): Data quality
Improved solution: Real time DWH streams with native Flink state
● Native Flink state API support was added to NRTG
● The dedup operator was re-written using Flink MapState to store dedup keys with a 1-day TTL
● RocksDB state backend, with S3 storing active (read / write) keys and backups
● Savepoint size is tens of TB; the full state is preserved during task restart and job redeployment
[Diagram: event (type 1, type 2, …, type M) → dwh_event (enriched and deduped; type i, type j, type k)]
Dedup accuracy is now guaranteed during task restart or job redeployment with a specified checkpoint (from S3).
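A sketch of what the MapState declaration with a 1-day TTL can look like; the descriptor name is an assumption.

```scala
import org.apache.flink.api.common.state.{MapStateDescriptor, StateTtlConfig}
import org.apache.flink.api.common.time.Time

// Dedup keys live in keyed MapState; a 1-day TTL bounds state growth, and
// because the state is checkpointed (RocksDB + S3), it survives restarts.
val ttlConfig = StateTtlConfig
  .newBuilder(Time.days(1))                                       // 1d TTL
  .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
  .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
  .build()

val dedupKeysDescriptor = new MapStateDescriptor[String, java.lang.Boolean](
  "dedup-keys", classOf[String], classOf[java.lang.Boolean])
dedupKeysDescriptor.enableTimeToLive(ttlConfig)
```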
23. Scaling challenges and solutions (2021 ~ 2022): Data quality
Win
● Downstream jobs reading dwh_events can generate consistent results with the batch workflows; the computed real-time signals used in recommendation helped boost Pinterest engagement metrics by double digits.
● Downstream jobs no longer need to implement enrich and dedup logics, and job graphs are simplified to focus only on the business logic.
[Diagram: dwh_event feeds both streaming jobs and the batch workflows building the DWH SOT tables; both now report the same count, e.g. 70 impressions]
24. Scaling challenges and solutions (2021 ~ 2022): Data quality
Issues with the Real-time DWH streams job
● The generated dwh_event topic contains multiple types, so downstream jobs read unnecessary data and still implement filter logics
● The mini framework introduces extra overhead
● Supporting a new type is slow: the logics for processing different types are coupled together due to the mini framework's API requirements
25. Two solutions for pre-processing engagement events
Stream Splitter
● Pros: efficient downstream consumption; fast onboarding
● Cons: no data quality guarantees; repetitive processing logics in downstream jobs; inefficient job runtime (data duplication)
Realtime DWH
● Pros: data quality; simplified downstream job logic
● Cons: slow onboarding; inefficient downstream consumption; inefficient job runtime (framework overhead)
Downstream job developers are confused about which one to use, and both infra cost and KTLO cost double.
26. Unified Solution - Requirements
● Efficiency
○ Pre-processing jobs have an efficient runtime
○ Downstream jobs only read the events they need to process
● Data quality
○ Downstream jobs read enriched and deduped events that can generate consistent results with the batch workflows
● Dev velocity
○ Supporting a new type in the pre-processing jobs should be simple and easy to enable without affecting existing pipelines
○ Downstream jobs no longer port the filter-enrich logics from batch ETL and no longer implement deduplication logic on the data source
● KTLO
○ Maintain one unified solution rather than two
27. Unified solution - API choice
● Flink DataStream API
● Flink SQL
● Mini framework like NRTG
● Flink Table API - our final choice
○ It is more expressive than Flink SQL: complex logics can't be easily implemented as SQL
○ It is very flexible
■ sources and sinks are registered and accessed as Flink tables
■ easy to convert a Table to a DataStream when we want to leverage low-level features
○ It does not have any extra framework overhead like NRTG
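A sketch of the flexibility argument: sources are plain tables, and a Table can be converted to a DataStream whenever low-level features (state, side outputs) are needed. The table name and filter expression below are placeholders.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.types.Row

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

// Sources and sinks are registered as Flink tables (DDL elided).
val events: Table = tEnv.from("event").filter($"type".isNotNull)

// Drop down to the DataStream API to use state, side outputs, etc.
val stream: DataStream[Row] = tEnv.toDataStream(events)
```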
28. Unified solution - job framework
Framework design
● Each output stream is generated through a pipeline made up of filter, enrich, dedup and sink operators
● Pipelines are pluggable and independent from each other
● Classes from batch ETL are re-used to maintain consistent semantics
● Java reflection is leveraged to easily configure each pipeline
Job graph optimization - side outputs
● A job operator assigns every source event, based on its type, to the right pipeline through side outputs
● Essentially we are implementing "filter pushdown" to reduce unnecessary data transportation
31. Platinum Event Streams - What it offers
[Diagram: raw event → platinum event streams → streaming applications, providing Standardized Event Selection, Event Deduplication and Downstream Efficiency]
32. Platinum Event Streams - User Flow
Before: streaming app developers had to ask the logging / metric owners: "I want to use event A as one of my signals, what's the correct logic to process it from raw events?"
After: logging owners, metric definition owners and the data warehouse team contribute to the platinum event streams, and streaming app developers simply consume them. Faster onboarding w/ guaranteed quality and efficiency!
34. Platinum Event Stream - Flink Processing
[Diagram: Kafka Source Table (M QPS) → Splitter w/ Filters → Side outputs 1…N (M1, M2, …, MN QPS), each feeding Enrich i → Dedup i → Kafka Sink i]
35. Platinum Event Stream - Splitter w/ Filters
Splitter functionalities:
1. Filter out the events we don't need.
2. Split the stream into different sub-pipelines according to event types.
[Diagram: Kafka Source Table → Splitter (w/ filters) → Enrich 1 → Dedup 1 → Kafka Sink 1, …, Enrich N → Dedup N → Kafka Sink N]
36. Platinum Event Stream - Splitter w/ Filters
Standardized event selection, consistent w/ batch applications: the per-event / metric logic lives in the Metric Repository (shared by batch and streaming processing), e.g. for event / metric X:
def filter(event: Event): Boolean = ……
def createDedupKey(event: Event) = ……
[Diagram: Kafka Source Table → Splitter (w/ filters) → Enrich i → Dedup i → Kafka Sink i per pipeline]
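The slide only shows two method stubs; a fuller sketch of a shared metric definition might look like the following. The trait, the event fields and the impression example are assumptions, not Pinterest's actual repository code.

```scala
// Hypothetical event shape.
case class Event(userId: String, objectId: String, eventType: String, timestampMs: Long)

// One definition per event / metric, referenced by BOTH batch ETL and the
// streaming splitter, so selection and dedup semantics cannot diverge.
trait MetricDefinition {
  def filter(event: Event): Boolean        // standardized event selection
  def createDedupKey(event: Event): String // shared dedup semantics
}

// Illustrative metric: impressions.
object ImpressionMetric extends MetricDefinition {
  override def filter(e: Event): Boolean = e.eventType == "impression"
  override def createDedupKey(e: Event): String =
    s"${e.userId}:${e.objectId}:${e.timestampMs}"
}
```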
37. Platinum Event Stream - Splitter w/ Filters
Splitter functionalities: (1) filtering, (2) splitting streams
Solution 1 - FlatMapFunc: one FlatMapFunc per pipeline (FlatMapFunc-1 … FlatMapFunc-N), each reading the full raw event stream from the Kafka source table at M QPS (M = QPS of the input raw event stream). Severe back pressure and scalability issues when input traffic is high.
Solution 2 - Side Output: a single Splitter operator that initializes a Map<event type, pipeline tag> and, for each event, emits it with the corresponding pipeline tag, or throws it away if no pipeline needs it. Side output i carries only Mi QPS, the traffic pipeline i actually needs, and ΣMi << M. Scalability issue solved!
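A sketch of the side-output splitter (solution 2); the event shape, tag names and type list are assumptions.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(eventType: String /* , ... */)

class Splitter(tagsByType: Map[String, OutputTag[Event]])
    extends ProcessFunction[Event, Event] {
  override def processElement(e: Event,
      ctx: ProcessFunction[Event, Event]#Context,
      out: Collector[Event]): Unit =
    tagsByType.get(e.eventType) match {
      case Some(tag) => ctx.output(tag, e) // route to the pipeline's side output
      case None      => ()                 // throw away: no pipeline needs it
    }
}

// Each pipeline reads only its own side output (Mi QPS), never the full M:
// val tags = Map("click" -> OutputTag[Event]("click"), "view" -> OutputTag[Event]("view"))
// val splitted = rawEvents.process(new Splitter(tags))
// val clicks = splitted.getSideOutput(tags("click"))
```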
38. Platinum Event Stream - Enrich
The enrich operator adds:
● Decoded info: decodes some commonly used (e.g. BASE64-encoded) fields for downstream to use.
● Derived info: spam flags derived from a couple of different fields logged in the raw event data.
● Latency information (ms): additional latency info that helps latency-sensitive downstreams take different actions according to per-event latency.
[Diagram: Splitter (w/ filters) → Enrich i → Dedup i → Kafka Sink i per pipeline]
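A sketch of the three enrichment outputs; the field names and the spam heuristic are placeholders standing in for the real logic.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical raw and enriched event shapes.
case class RawEvent(payloadB64: String, clientTimestampMs: Long /* , ... */)
case class EnrichedEvent(decodedPayload: String, isSpam: Boolean, latencyMs: Long)

def enrich(e: RawEvent, nowMs: Long): EnrichedEvent = EnrichedEvent(
  // Decoded info: decode commonly used fields once, for all downstreams.
  decodedPayload =
    new String(Base64.getDecoder.decode(e.payloadB64), StandardCharsets.UTF_8),
  // Derived info: spam flag derived from raw fields (placeholder logic).
  isSpam = e.payloadB64.isEmpty,
  // Latency info: lets latency-sensitive consumers act per event.
  latencyMs = nowMs - e.clientTimestampMs)
```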
39. Platinum Event Stream - Dedup
Why do we need deduplication?
● Duplicate events exist in Pinterest's raw event data.
● In some cases, duplicate rates vary from ~10-40% depending on the type of event.
Causes of duplicates:
1. Repeated user actions when interacting with the Pinterest app.
2. Incorrect client logging implementations.
3. Clients resending logging messages.
Solution:
● Deduplicate in both batch and streaming pipelines before exporting to dashboards or flowing into ML systems.
[Diagram: Splitter (w/ filters) → Enrich i → Dedup i → Kafka Sink i per pipeline]
40. Platinum Event Stream - Dedup
Implemented with Flink stateful functions: the stream is keyed by UserID, and DedupKey(e) is kept in keyed state with a 24hr TTL (2-10 TB of state in total). If the key does not exist yet, the operator updates the state and outputs the event; otherwise the duplicate is dropped.
[Diagram: key by UserID → DedupKey(e) not exists? → update state & output]
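A sketch of the keyed dedup step: key by user id, keep the event's dedup keys in MapState with a 24h TTL, and emit only first occurrences. The event shape and field names are assumptions.

```scala
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor, StateTtlConfig}
import org.apache.flink.api.common.time.Time
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(userId: String, dedupKey: String /* , ... */)

class DedupFunction extends KeyedProcessFunction[String, Event, Event] {
  @transient private var seenKeys: MapState[String, java.lang.Boolean] = _

  override def open(parameters: Configuration): Unit = {
    val ttl = StateTtlConfig.newBuilder(Time.hours(24)).build() // 24hr TTL
    val desc = new MapStateDescriptor[String, java.lang.Boolean](
      "seen-dedup-keys", classOf[String], classOf[java.lang.Boolean])
    desc.enableTimeToLive(ttl)
    seenKeys = getRuntimeContext.getMapState(desc)
  }

  override def processElement(e: Event,
      ctx: KeyedProcessFunction[String, Event, Event]#Context,
      out: Collector[Event]): Unit =
    if (!seenKeys.contains(e.dedupKey)) { // "not exists"
      seenKeys.put(e.dedupKey, true)      // update state
      out.collect(e)                      // & output; later duplicates dropped
    }
}

// Usage: events.keyBy(_.userId).process(new DedupFunction)
```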
41. Platinum Event Stream - Dedup
Incremental checkpoints for the large dedup state (24hr TTL, 2-10 TB):
● Full state size: 2-10 TB
● Per-checkpoint size: tens of GB
Re-deployment:
● From savepoint: ~10-20 mins
● From checkpoint: < 2 mins
[Diagram: Splitter (w/ filters) → Enrich i → Dedup i → Kafka Sink i per pipeline]
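What enabling incremental checkpoints can look like in code; the checkpoint interval and S3 path are placeholders.

```scala
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// RocksDB with incremental checkpoints: each checkpoint only uploads
// changed SST files (tens of GB) instead of the full 2-10 TB state.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true))
env.enableCheckpointing(10 * 60 * 1000L) // e.g. every 10 minutes
env.getCheckpointConfig.setCheckpointStorage("s3://bucket/checkpoints") // placeholder
```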
42. Easy-to-Extend Framework with Java Reflection
Event definitions (event_definitions: EventA.scala, EventB.scala, EventC.scala) live in the Metric Repository, referenced by both online and offline processing. Each streaming pipeline is wired up with a one-line configuration in *.properties:
pipeline1.eventClass=A
pipeline2.eventClass=B
pipeline3.eventClass=C
Java reflection looks up the Event class by its name when building the job graph, and invokes the functions defined for each metric at runtime in each pipeline, e.g. MetricA.filter() and MetricA.createDedupKey().
1. Only a few configuration-line changes are needed to add a new streaming pipeline.
2. Batch and streaming logic are guaranteed consistent by referencing the same code repo.
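A sketch of the reflection-based wiring. The property-name convention follows the slide; the loading mechanics, resource path and the fully qualified class name in the comment are assumptions.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._

trait MetricDefinition // stub; see the shared metric definition sketch above

// Look up an event / metric class by name when building the job graph.
def loadMetric(className: String): MetricDefinition =
  Class.forName(className)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[MetricDefinition]

val props = new Properties()
props.load(getClass.getResourceAsStream("/pipelines.properties")) // placeholder path

// e.g. pipeline1.eventClass=com.example.metrics.EventA (illustrative name)
val pipelines: Map[String, MetricDefinition] =
  props.stringPropertyNames().asScala
    .filter(_.endsWith(".eventClass"))
    .map(k => k.stripSuffix(".eventClass") -> loadMetric(props.getProperty(k)))
    .toMap
```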
43. Platinum Event Stream - Data Quality Monitoring
Before: 30-40% discrepancies between streaming and batch applications.
After: >99% match rate between streaming and batch applications.
Daily comparison with the batch SOT dataset: the platinum event streams are dumped to offline tables (Kafka topic → S3 dump via an internal framework) and compared against the offline SOT tables by an internal offline data checker system, with alerts for match-rate violations and dashboards for continuous monitoring.
44. Platinum Event Streams - Cost Efficiency
Efficiency solution: 600 vcores. Data quality solution: 600 vcores. Unified solution (efficiency + data quality): 600 vcores. Both functionalities are achieved for the cost of a single copy, similar to each previous offering alone!
45. 5. Wins and Learns
1. User engagement boost brought by cleaner source data!
2. Highly simplified onboarding flow for downstream streaming applications!
3. Hundreds of thousands of dollars in infra savings, as well as maintenance cost savings!
47. Ongoing efforts - streaming governance
We are building streaming lineage & catalog, integrated with the batch lineage and catalog for unified data governance:
● a catalog of Flink tables, registered for all the external systems that interact with Flink jobs
● lineage between Flink jobs and external systems
48. Ongoing efforts - streaming and incremental ETL
We are building solutions on top of CDC, Kafka, Flink, Iceberg and Spark to
● ingest data in near real-time from online systems to the offline data lake
● incrementally process offline data in the data lake