This document discusses building resilient predictive data pipelines. It begins by distinguishing between ETL and predictive data pipelines, noting that predictive pipelines require high availability with downtimes of less than an hour. The document then outlines design goals for resilient data pipelines, including being scalable, available, instrumented/monitored/alert-enabled, and quickly recoverable. It proposes using AWS services like SQS, SNS, S3, and Auto Scaling Groups to build such pipelines. The document also recommends using Apache Airflow for workflow automation and scheduling to reliably manage pipelines as directed acyclic graphs. It presents an architecture using these techniques and assesses how well it meets the outlined design goals.
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo, by Sid Anand
Slides from "Cloud Native Data Pipelines" talk given @ QCon Tokyo 2016. The slides are in both English and Japanese. Thanks to Kiro Harada (https://jp.linkedin.com/in/haradakiro) for the translation.
4. Different Types of Data Pipelines
ETL
• used for: loading data related to business health into a Data Warehouse
  • user-engagement stats (e.g. social networking)
  • product success stats (e.g. e-commerce)
• audience: Business, BizOps
• downtime?: 1-3 days
vs.
Predictive
• used for:
  • building recommendation products (e.g. social networking, shopping)
  • updating fraud prevention endpoints (e.g. security, payments, e-commerce)
• audience: Customers
• downtime?: < 1 hour
5. Different Types of Data Pipelines
ETL
• used for: Data Warehouse —> Reports
  • user-engagement stats (e.g. social networking)
  • product success stats (e.g. e-commerce)
• audience: Business, BizOps
• downtime?: 1-3 days
vs.
Predictive
• used for:
  • building recommendation products (e.g. social networking, shopping)
  • updating fraud prevention endpoints (e.g. security, payments, e-commerce)
• audience: Customers
• downtime?: < 1 hour
6.-9. Why Do We Care About Resilience?
[Diagram, repeated with the flow advancing across slides 6-9: data items d1…d8 move from a DB into downstream Search Engines and Recommenders.]
12. Desirable Qualities of a Resilient Data Pipeline
• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
13. Desirable Qualities of a Resilient Data Pipeline
• Scalable
  • Build your pipelines using [infinitely] scalable components
  • The scalability of your system is determined by its least-scalable component
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
14. Desirable Qualities of a Resilient Data Pipeline
• Scalable
  • Build your pipelines using [infinitely] scalable components
  • The scalability of your system is determined by its least-scalable component
• Available
  • Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
15. Instrumented
Instrumentation must reveal SLA metrics at each stage of the pipeline!
What SLA metrics do we care about? Correctness & Timeliness
• Correctness
  • No Data Loss
  • No Data Corruption
  • No Data Duplication
  • A Defined Acceptable Staleness of Intermediate Data
• Timeliness
  • A late result == a useless result
  • Delayed processing of now()'s data may delay the processing of future data
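A minimal sketch of what stage-level instrumentation could look like, assuming CloudWatch as the metrics backend (the deck does not name one); the namespace, metric names, and region below are hypothetical:

    # Hypothetical sketch: publish correctness and timeliness metrics for one pipeline stage.
    import time
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def emit_stage_metrics(stage, records_in, records_out, batch_event_time):
        # Timeliness: how far behind event time is this stage right now?
        lag_seconds = time.time() - batch_event_time
        cloudwatch.put_metric_data(
            Namespace="DataPipeline",
            MetricData=[
                # Correctness proxies: in/out counts reveal loss or duplication between stages.
                {"MetricName": "RecordsIn", "Dimensions": [{"Name": "Stage", "Value": stage}],
                 "Value": records_in, "Unit": "Count"},
                {"MetricName": "RecordsOut", "Dimensions": [{"Name": "Stage", "Value": stage}],
                 "Value": records_out, "Unit": "Count"},
                {"MetricName": "ProcessingLagSeconds", "Dimensions": [{"Name": "Stage", "Value": stage}],
                 "Value": lag_seconds, "Unit": "Seconds"},
            ],
        )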
16. Instrumented, Monitored, & Alert-enabled
• Instrument
  • Instrument Correctness & Timeliness SLA metrics at each stage of the pipeline
• Monitor
  • Continuously monitor that SLA metrics fall within acceptable bounds (i.e. pre-defined SLAs)
• Alert
  • Alert when we miss SLAs
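Continuing the same assumption (CloudWatch as the metrics backend), alerting on a missed SLA could look like the sketch below; the alarm name, threshold, and SNS topic ARN are placeholders:

    # Hypothetical sketch: alarm when a stage's timeliness SLA is breached.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="pipeline-import-stage-lag",
        Namespace="DataPipeline",
        MetricName="ProcessingLagSeconds",
        Dimensions=[{"Name": "Stage", "Value": "import"}],
        Statistic="Maximum",
        Period=300,                    # evaluate in 5-minute windows
        EvaluationPeriods=3,           # require 3 consecutive breaches before alerting
        Threshold=3600,                # SLA: results older than 1 hour are late
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
    )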
17. Desirable Qualities of a Resilient Data Pipeline
• Scalable
  • Build your pipelines using [infinitely] scalable components
  • The scalability of your system is determined by its least-scalable component
• Available
  • Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
18. Quickly Recoverable
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR (Mean Time To Recovery)
21. SQS - Overview
AWS's low-latency, highly scalable, highly available message queue
• Infinitely Scalable Queue (though not FIFO)
• Low end-to-end latency (generally sub-second)
• Pull-based
How it Works!
• Producers publish messages, which can be batched, to an SQS queue
• Consumers:
  • consume messages, which can be batched, from the queue
  • commit message contents to a data store
  • ACK the messages as a batch
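A minimal sketch of that produce / consume / commit / ACK cycle using boto3; the queue URL is a placeholder and persist() is a hypothetical stand-in for the data-store write:

    # Sketch of the SQS cycle described above (boto3); queue URL and persist() are hypothetical.
    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"

    def persist(record):
        # Hypothetical stand-in for committing message contents to the data store.
        print("committing", record)

    # Producer: publish a batch of messages (SQS batches are capped at 10 entries).
    sqs.send_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{"Id": str(i), "MessageBody": json.dumps({"record": i})} for i in range(10)],
    )

    # Consumer: read a batch, commit contents to the store, then ACK (delete) as a batch.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10)
    messages = resp.get("Messages", [])
    for m in messages:
        persist(json.loads(m["Body"]))
    if messages:
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]} for m in messages],
        )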
22.-23. SQS - Typical Operation Flow
[Diagram, repeated on each slide: Producers publish messages m1…m5 to SQS; Consumers read from SQS and write to a DB; a visibility timer is attached to the message being read.]
• Step 1: A consumer reads a message from SQS. This starts a visibility timer!
• Step 2: The consumer persists the message contents to the DB.
25.-29. SQS - Time Out Example
[Same diagram as before: Producers, SQS (m1…m5), Consumers, DB, visibility timer.]
• Step 1: A consumer reads a message from SQS.
• Step 2: The consumer attempts to persist the message contents to the DB.
• Step 3: A visibility timeout occurs & the message becomes visible again.
• Step 4: Another consumer reads and persists the same message.
• Step 5: The consumer ACKs the message in SQS.
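The timeout example shows why the same message can legitimately be processed twice, which is also why the deck insists that every pipeline stage be idempotent (slide 38). A minimal sketch of an idempotent consumer write, using sqlite3 purely for illustration; any store with unique-key upserts behaves the same way:

    # Hypothetical sketch: dedupe on the SQS message id so re-delivery cannot create duplicates.
    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE IF NOT EXISTS events (message_id TEXT PRIMARY KEY, body TEXT)")

    def persist_idempotently(message_id, body):
        # Re-processing the same message is a no-op instead of a duplicate row.
        db.execute(
            "INSERT OR IGNORE INTO events (message_id, body) VALUES (?, ?)",
            (message_id, json.dumps(body)),
        )
        db.commit()

    # Delivering message m1 twice (as in the timeout example) leaves exactly one row.
    persist_idempotently("m1", {"record": 1})
    persist_idempotently("m1", {"record": 1})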
30. SQS - Dead Letter Queue
[Diagram: the main SQS queue (m2…m5) has a redrive rule of 2x; after the visibility timer expires twice for m1, the message is moved to the SQS DLQ instead of being redelivered to the Consumers/DB path.]
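A sketch of what the 2x redrive rule could look like in practice with boto3; the queue URL and DLQ ARN are placeholders:

    # Hypothetical sketch: after 2 failed receives, SQS moves the message to the DLQ.
    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")

    sqs.set_queue_attributes(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue",
        Attributes={
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:ingest-queue-dlq",
                "maxReceiveCount": "2",   # the "2x" redrive rule from the slide
            })
        },
    )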
32. SNS - Overview
Highly Scalable, Highly Available, Push-based Topic Service
• Whereas SQS ensures each message is seen by at least 1 consumer, SNS ensures that each message is seen by every consumer
  • Reliable Multi-Push
• Whereas SQS is pull-based, SNS is push-based
• There is no message retention & there is a finite retry count
  • No Reliable Message Delivery
• Can we work around this limitation?
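The deck's own architecture (slide 38) pairs SNS with SQS, which is the usual workaround: subscribe a durable SQS queue to the topic so each consumer gets retained, retriable delivery. A minimal sketch with hypothetical names (a queue policy allowing SNS to send to the queue is also required, but omitted here):

    # Sketch of the SNS -> SQS fan-out pattern; names are placeholders.
    import boto3

    sns = boto3.client("sns", region_name="us-east-1")
    sqs = boto3.client("sqs", region_name="us-east-1")

    topic_arn = sns.create_topic(Name="pipeline-events")["TopicArn"]
    queue_url = sqs.create_queue(QueueName="importer-events")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # SNS pushes every message into the queue; the queue supplies the retention
    # and redelivery that SNS alone does not provide.
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)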
38. Architectural Elements
• A schema-aware data format for all data (Avro)
• The entire pipeline is built from Highly-Available / Highly-Scalable components
  • S3, SNS, SQS, ASG, EMR Spark (exception: DB)
• The pipeline is never blocked, because we use a DLQ for messages we cannot process
• We use queue-based auto-scaling to get high on-demand ingest rates
• We manage everything with Airflow
• Every stage in the pipeline is idempotent
• Every stage in the pipeline is instrumented
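A minimal illustration of the "schema-aware data format" element, written against the fastavro library (an assumption; the record and field names are made up):

    # Hypothetical Avro schema shared by every stage of the pipeline.
    import io
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    schema = parse_schema({
        "type": "record",
        "name": "PipelineEvent",
        "fields": [
            {"name": "event_id", "type": "string"},
            {"name": "event_time", "type": "long"},
            {"name": "payload", "type": "string"},
        ],
    })

    # A record written by one stage can always be decoded by the next,
    # because both sides agree on the schema rather than on ad-hoc JSON.
    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"event_id": "e1", "event_time": 1461000000, "payload": "{}"})
    buf.seek(0)
    print(schemaless_reader(buf, schema))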
40. ASG - Overview
What is it?
• A means to automatically scale out/in clusters to handle variable load/traffic
• A means to keep a cluster/service always up
• Fulfills AWS's pay-per-use promise!
When to use it?
• Feed-processing, web traffic load balancing, zone outage, etc…
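A minimal sketch of creating such a group with boto3; the group name, launch configuration, and sizes are hypothetical:

    # Hypothetical importer fleet: always at least 1 instance, scaled out under load.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="importer-asg",
        LaunchConfigurationName="importer-launch-config",  # assumed to exist already
        MinSize=1,                  # keeps the service always up
        MaxSize=20,                 # cap for scale-out during ingest spikes
        DesiredCapacity=1,
        AvailabilityZones=["us-east-1a", "us-east-1b"],    # tolerate a zone outage
    )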
41. ASG - Data Pipeline
[Diagram: an Importer ASG containing multiple importer instances scales out/in as it pulls from SQS and writes to the DB.]
43. ASG - CPU-based
[Chart: Sent, Received, and CPU metrics over time, with the premature scale-in point marked.]
Premature Scale-in: The CPU drops to noise levels before all messages are consumed. This causes scale-in to occur while the last few messages are still being committed, resulting in a long time-to-drain for the queue!
44. ASG - Queue-based
• Scale-out: when Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
• Scale-in: when Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK'd). This causes the ASG to shrink.
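A sketch of wiring those two rules to SQS's CloudWatch metrics with simple scaling policies; all names are placeholders, and the thresholds and cooldowns would need tuning:

    # Hypothetical queue-based scaling: grow while messages are visible, shrink once nothing is in flight.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    scale_out = autoscaling.put_scaling_policy(
        AutoScalingGroupName="importer-asg", PolicyName="scale-out",
        AdjustmentType="ChangeInCapacity", ScalingAdjustment=1, Cooldown=120,
    )["PolicyARN"]
    scale_in = autoscaling.put_scaling_policy(
        AutoScalingGroupName="importer-asg", PolicyName="scale-in",
        AdjustmentType="ChangeInCapacity", ScalingAdjustment=-1, Cooldown=120,
    )["PolicyARN"]

    # Scale out: visible messages > 0 (queue depth is non-zero).
    cloudwatch.put_metric_alarm(
        AlarmName="importer-queue-has-backlog", Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": "ingest-queue"}],
        Statistic="Sum", Period=60, EvaluationPeriods=1,
        Threshold=0, ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[scale_out],
    )

    # Scale in: in-flight (invisible) messages = 0, i.e. the last message was ACK'd.
    cloudwatch.put_metric_alarm(
        AlarmName="importer-queue-drained", Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesNotVisible",
        Dimensions=[{"Name": "QueueName", "Value": "ingest-queue"}],
        Statistic="Sum", Period=60, EvaluationPeriods=1,
        Threshold=0, ComparisonOperator="LessThanOrEqualToThreshold",
        AlarmActions=[scale_in],
    )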
47.
Historical Context
• Our first cut at the pipeline used cron to schedule hourly runs of Spark
Problem
• We only knew if Spark succeeded. What if a downstream task failed?
Our Needs
• We needed something smarter than cron that:
  • Reliably managed a graph of tasks (a DAG - Directed Acyclic Graph)
  • Orchestrated hourly runs of that DAG
  • Retried failed tasks
  • Tracked the performance of each run and its tasks
  • Reported on failure/success of runs
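A minimal Airflow DAG sketch matching those needs: an hourly schedule, automatic retries, and an explicit downstream dependency so a failed follow-on task becomes visible. Task names and commands are hypothetical, and import paths vary across Airflow versions:

    # Hypothetical hourly DAG replacing the original cron job.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "data-platform",
        "retries": 2,                               # retry failed tasks automatically
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="hourly_predictive_pipeline",
        default_args=default_args,
        start_date=datetime(2016, 1, 1),
        schedule_interval="@hourly",                # orchestrate hourly runs of the DAG
    )

    spark_job = BashOperator(
        task_id="run_spark_scoring",
        bash_command="spark-submit score.py",       # hypothetical Spark step
        dag=dag,
    )

    load_results = BashOperator(
        task_id="load_results_to_db",
        bash_command="python load_results.py",      # hypothetical downstream task
        dag=dag,
    )

    spark_job >> load_results                       # downstream failures are now tracked and retried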
56. Airflow - Join the Community
• With >30 companies, >100 contributors, and >2500 commits, Airflow is growing rapidly!
• We are looking for more contributors to help support the community!
• Disclaimer: I'm a maintainer on the project.