Netflix manages petabyte-scale Apache Cassandra databases in the cloud through declarative infrastructure and tooling. New clusters are provisioned quickly using robust AMIs and pre-flight checks. Existing clusters are kept healthy by declarative control planes that monitor desired state and automatically remediate drift. Software and hardware migrations are handled seamlessly through immutable infrastructure, incremental backups to S3, and parallel data transfers. Netflix also applies lessons from failures, such as accidentally deleting real backup data through over-automation.
Cassandra Data Modeling - Practical Considerations @ Netflix
The Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics:
- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
This talk is from ApacheCon North America 2017 - Cassandra serving netflix @ scale - https://apachecon2017.sched.com/event/9zvG/cassandra-serving-netflix-scale-vinay-chella-netflix
https://www.youtube.com/watch?v=2l0_onmQsPI&index=3&t=284s&list=PL7uQt4PWyRW0XoVhEnNcSdCw5ufLEn9HA
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, GetInData
Did you like it? Check out our E-book: Apache NiFi - A Complete Guide
https://ebook.getindata.com/apache-nifi-complete-guide
Apache NiFi is one of the most popular services for running ETL pipelines, even though it is not the youngest technology. The talk describes all the details of migrating pipelines from an old Hadoop platform to Kubernetes, managing everything as code, monitoring all of NiFi's corner cases, and making it a robust solution that is user-friendly even for non-programmers.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience implementing Big Data projects for Polish as well as foreign companies, including, among others, Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle, and many others from the pharmaceutical, media, finance, and FMCG industries.
https://getindata.com
This talk is from Distributed Data Summit SF 2018 - http://distributeddatasummit.com/2018-sf/sessions#netflix2
Operating C* can involve a lot of manpower, complex automation, or both. Some of this complexity comes from operational and configuration activity on the underlying kernel and hardware, but much of it is operational complexity stemming from C* itself. Examples include restarting the database in a safe way, reliably backing up and restoring snapshots, monitoring the health of the datastore, and even ensuring eventual consistency through repair. As a result of these complexities, C* operators end up with complicated operational setups, which are expensive to build, manage, and monitor. In this talk, we will share lessons learned in managing such complexity via our Priam sidecar, including recent innovations in how our sidecar ensures the highest possible uptime and correctness of Cassandra. We then use this to motivate building the management sidecar directly into C* itself (CASSANDRA-14395).
What does a typical day as an SRE look like? In this presentation I will discuss the challenges we face while running a SaaS platform that is used 24/7/365 around the globe. In doing so, we have embraced the core principles described in the Google SRE handbook. While we are not Google by any means, most of the principles apply to our daily work one way or another. Having a fully distributed team running a distributed system can be quite challenging. In this talk I’ll be covering:
- Core SRE principles
- How Instana has applied them to our daily work
- Lessons learned along the way
Stephan Ewen - Experiences running Flink at Very Large Scale (Ververica)
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk explains what aspects currently make a job particularly demanding, shows how to configure and tune a large-scale Flink job, and outlines what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into:
- analyzing and tuning checkpointing
- selecting and configuring state backends
- understanding common bottlenecks
- understanding and configuring network parameters
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn about the different components in Spark and how Spark works, with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic-partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We also talk about the best practices involved in running a producer/consumer.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now supports authenticating users and controlling who can read and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
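The "how to pick topic-partitions" question from the talk is often answered with a throughput-based rule of thumb; the sketch below is that heuristic in Python (my illustration, not from the talk — the function name and the 1.2x headroom factor are assumptions):

```python
import math

def suggest_partition_count(target_mb_s, producer_mb_s_per_partition,
                            consumer_mb_s_per_partition, headroom=1.2):
    """Rule of thumb: partitions >= max(t/p, t/c), with headroom for growth.

    t = target topic throughput, p/c = measured per-partition producer and
    consumer throughput, all in MB/s.
    """
    needed = max(target_mb_s / producer_mb_s_per_partition,
                 target_mb_s / consumer_mb_s_per_partition)
    return math.ceil(needed * headroom)

# e.g. 100 MB/s target, 10 MB/s per partition producing, 5 MB/s consuming
print(suggest_partition_count(100, 10, 5))  # 24
```

Measuring p and c on your own hardware matters more than the exact formula: consumers are often the bottleneck, as in the example above.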
The evolution of Netflix's S3 data warehouse (Strata NY 2018) - Ryan Blue
In the last few years, Netflix’s S3 data warehouse has grown to more than 100 PB. In that time, the company has shared several techniques and released open source tools for working around S3’s quirks, including s3mper to work around eventual consistency, S3 multipart committers to commit data without renames, and the batchid pattern for cross-partition atomic commits.
Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3 that is replacing many of the company’s current tools. Iceberg enables a new generation of improvements, including:
* Snapshot isolation with no directory listing or file renames
* Distributed planning to relieve metastore bottlenecks
* Improved data layout for S3 performance
* Immediately available writes from streaming applications
* Opportunistic compaction and data optimization
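The snapshot-isolation bullet above can be illustrated with a toy model (my sketch, not Iceberg's actual implementation): readers pin an immutable snapshot, and a commit publishes a new one by swapping a single metadata pointer, so no directory listing or file rename is ever needed.

```python
class ToyTable:
    """Toy model of snapshot-isolated table metadata (not Iceberg's code).

    Each snapshot is an immutable tuple of data file names; `current` is the
    single pointer a commit swaps, so readers never see a half-written state.
    """
    def __init__(self):
        self.snapshots = [()]        # snapshot 0: empty table
        self.current = 0

    def scan(self):
        # A reader captures the current snapshot once and uses it for the
        # whole query, giving a consistent view while writers keep committing.
        return self.snapshots[self.current]

    def commit(self, added_files):
        new_snapshot = self.snapshots[self.current] + tuple(added_files)
        self.snapshots.append(new_snapshot)
        self.current = len(self.snapshots) - 1   # the atomic pointer swap

t = ToyTable()
reader_view = t.scan()             # pins snapshot 0
t.commit(["part-0001.parquet"])    # writer publishes snapshot 1
print(reader_view)                 # () - the pinned snapshot is unchanged
print(t.scan())                    # ('part-0001.parquet',)
```

Keeping old snapshots around is also what makes time travel and safe rollback cheap: they are just earlier entries in the snapshot list.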
Kafka for Real-Time Replication between Edge and Hybrid Cloud - Kai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
(BDT318) How Netflix Handles Up To 8 Million Events Per Second - Amazon Web Services
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenge of zero loss while processing over 400 billion events daily. It covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second at peak.
Modularized ETL Writing with Apache Spark - Databricks
Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse.
Since all our writes to the Data Warehouse go through Apache Spark, we took advantage of that to add modules that supplement ETL writing. Config-driven and purposeful, these modules perform tasks on a Spark DataFrame destined for a target Hive table.
The modules are organized as a sequence of transformations applied to the Spark DataFrame before it is written to the table. They include journalizing, a process that helps maintain a non-duplicated historical record of mutable data associated with different parts of our business.
Data quality, another such module, is enabled on the fly using Apache Spark. Using Spark we calculate metrics, and an adjacent service runs quality tests on the incoming data for a table.
Finally, we cleanse data based on provided configurations, then validate and write it into the warehouse. An internal versioning strategy in the Data Warehouse allows us to distinguish new data from old data for a table.
Having these modules at write time allows cleaning, validation, and testing of data before it enters the Data Warehouse, programmatically relieving us of most data problems. This talk focuses on ETL writing at Stitch Fix and describes the modules that help our Data Scientists on a daily basis.
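The config-driven modules described above can be sketched as a pipeline of named transformations applied before the write (a hypothetical, simplified illustration; the real system operates on Spark DataFrames and writes to Hive, and the module names here are invented):

```python
# Each "module" is a function over a list of row dicts; a per-table config
# names which modules run, in order, before the (here simulated) write.
def cleanse(rows):
    return [{k: (v.strip() if isinstance(v, str) else v) for k, v in r.items()}
            for r in rows]

def validate(rows):
    assert all(r.get("id") is not None for r in rows), "id must be non-null"
    return rows

MODULES = {"cleanse": cleanse, "validate": validate}

def write_with_modules(rows, table_config):
    for name in table_config["modules"]:   # order comes from config
        rows = MODULES[name](rows)
    return rows                            # real system: write to Hive table

config = {"modules": ["cleanse", "validate"]}
out = write_with_modules([{"id": 1, "name": " alice "}], config)
print(out)  # [{'id': 1, 'name': 'alice'}]
```

The appeal of this shape is that adding a new check or transform for a table means editing its config, not its ETL code.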
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ... (HostedbyConfluent)
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real-time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome this challenge by creating a real-time key-value store inside Pinot that can sustain half a million QPS across all financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
YOW2018 Cloud Performance Root Cause Analysis at Netflix - Brendan Gregg
Keynote by Brendan Gregg for YOW! 2018. Video: https://www.youtube.com/watch?v=03EC8uA30Pw . Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation on ETL and Hadoop.
Communication between microservices is inherently unreliable. These integration points may produce cascading failures, slow responses, and service outages. We will walk through stability patterns like timeouts, circuit breakers, and bulkheads, and discuss how they improve the stability of microservices.
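As an illustration of the circuit-breaker pattern mentioned above, here is a minimal sketch (my own, not a production implementation): after a threshold of consecutive failures, the breaker opens and fails fast instead of calling the flaky dependency again.

```python
class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive failures."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            # Open: fail immediately, sparing the struggling dependency.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0            # any success closes the circuit again
        return result

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise IOError("downstream service timed out")

for _ in range(2):                   # two real failures trip the breaker
    try:
        breaker.call(flaky)
    except IOError:
        pass
try:
    breaker.call(flaky)              # third call: no real call is made
except RuntimeError as e:
    print(e)                         # circuit open: failing fast
```

Production breakers (e.g. Hystrix-style) add a half-open state that probes the dependency after a cool-down period; this sketch omits that for brevity.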
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Large partitions shall no longer be a nightmare. That is the goal of CASSANDRA-11206.
100 MB and 100,000 cells per partition are the recommended limits for a single partition in Cassandra up to 3.5. Exceeding these limits can cause a lot of trouble: repairs and compactions can fail, and reads can cause out-of-memory failures.
This talk provides a deep dive into the reasons for the previous limitations, why exceeding them caused trouble, how the improvements in Cassandra 3.6 help with big partitions, and why you should still not blindly let your partitions grow huge.
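The quoted limits translate into a quick back-of-the-envelope check; this sketch is my own illustration (real partition size also depends on clustering structure, compression, and per-cell overhead):

```python
def partition_within_limits(rows, cells_per_row, avg_cell_bytes,
                            max_bytes=100 * 1024**2, max_cells=100_000):
    """Rough check of a partition against the pre-3.6 rule-of-thumb limits:
    ~100 MB and ~100,000 cells per partition."""
    cells = rows * cells_per_row
    size_bytes = cells * avg_cell_bytes
    return cells <= max_cells and size_bytes <= max_bytes

# 1,000 rows x 20 cells x 50 bytes: 20,000 cells, ~1 MB -> within limits
print(partition_within_limits(1_000, 20, 50))   # True
# 50,000 rows x 20 cells: 1,000,000 cells -> over the cell limit
print(partition_within_limits(50_000, 20, 50))  # False
```

Running this kind of estimate against your partition key design catches "one partition per customer forever" mistakes before they reach production.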
About the Speaker
Robert Stupp Solution Architect, DataStax
Robert works as a Solutions Architect at DataStax and is also a committer to Apache Cassandra. Before joining DataStax, he worked with his customers to architect and build distributed systems using Cassandra, and he has long experience building distributed backend systems, mostly using Java as the language of choice.
Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake’s transaction log design and how Hyperspace enables indexing support that seamlessly works with the former’s time travel queries.
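Hyperspace's "automatically selects the best index" step can be illustrated with a toy selector (my sketch, not Hyperspace's actual algorithm): among candidate indexes whose indexed columns cover the query's filter columns, prefer the one that covers the most projected columns, so the query can be answered from the index alone.

```python
def pick_index(indexes, filter_cols, project_cols):
    """Toy index selection: candidates must cover all filter columns; among
    those, pick the index covering the most projected columns."""
    candidates = [ix for ix in indexes
                  if set(filter_cols) <= set(ix["indexed_cols"])]
    if not candidates:
        return None    # no usable index: fall back to a full table scan
    return max(candidates,
               key=lambda ix: len(set(project_cols)
                                  & set(ix["indexed_cols"] + ix["included_cols"])))

indexes = [
    {"name": "ix_id",      "indexed_cols": ["id"], "included_cols": []},
    {"name": "ix_id_name", "indexed_cols": ["id"], "included_cols": ["name"]},
]
best = pick_index(indexes, filter_cols=["id"], project_cols=["id", "name"])
print(best["name"])   # ix_id_name
```

The real system of course also weighs costs and plugs the choice into Spark's optimizer transparently; the point here is only the shape of the decision.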
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Apache Kafka becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka
in production. How to Secure a Kafka Cluster, How to pick topic-partitions and upgrading to newer versions. Migrating to new Kafka Producer and Consumer API.
Also talk about the best practices involved in running a producer/consumer.
In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase open sourced Kafka REST API and an Admin UI that will help users in creating topics, re-assign partitions, Issuing
Kafka ACLs and monitoring Consumer offsets.
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
In the last few years, Netflix’s S3 data warehouse has grown to more than 100 PB. In that time, the company has shared several techniques and released open source tools for working around S3’s quirks, including s3mper to work around eventual consistency, S3 multipart committers to commit data without renames, and the batchid pattern for cross-partition atomic commits.
Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3 that is replacing many of the company’s current tools. Iceberg enables a new generation of improvements, including:
* Snapshot isolation with no directory listing or file renames
* Distributed planning to relieve metastore bottlenecks
* Improved data layout for S3 performance
* Immediately available writes from streaming applications
* Opportunistic compaction and data optimization
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of zero loss while processing over 400 billion events daily. The session covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.
Modularized ETL Writing with Apache SparkDatabricks
Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse.
Since all our writes to the Data Warehouse are through Apache Spark, we took advantage of that to add more modules that supplement ETL writing. Config driven and purposeful, these modules perform tasks onto a Spark Dataframe meant for a destination Hive table.
These are organized as a sequence of transformations on the Apache Spark dataframe prior to being written to the table.These include a process of journalizing. It is a process which helps maintain a non-duplicated historical record of mutable data associated with different parts of our business.
Data quality, another such module, is enabled on the fly using Apache Spark. Using Apache Spark we calculate metrics and have an adjacent service to help run quality tests for a table on the incoming data.
And finally, we cleanse data based on provided configurations, validate and write data into the warehouse. We have an internal versioning strategy in the Data Warehouse that allows us to know the difference between new and old data for a table.
Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the Data Warehouse thus relieving us, programmatically, of most of the data problems. This talk focuses on ETL writing in Stitch Fix and describes these modules that help our Data Scientists on a daily basis.
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
aFurthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real time key-value store inside Pinot that can sustain half million QPS with all the financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
Keynote by Brendan Gregg for YOW! 2018. Video: https://www.youtube.com/watch?v=03EC8uA30Pw . Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause
analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances
that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code
, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in
this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build t
o do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service
GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, f
lame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source
tools in the same way to find performance wins in your own environment."
More and more organizations are moving their ETL workloads to a Hadoop based ELT grid architecture. Hadoop`s inherit capabilities, especially it`s ability to do late binding addresses some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Areas such as pros and cons for different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, advantages of different ways of exchanging data and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Communication between Microservices is inherently unreliable. These integration points may produce cascading failures, slow responses, service outages. We will walk through stability patterns like timeouts, circuit breaker, bulkheads and discuss how they improve stability of Microservices.
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016DataStax
Large partitions shall no longer be a nightmare. That is the goal of CASSANDRA-11206.
100MB and 100,000 cells per partition are the recommended limits for a single partition in Cassandra up to version 3.5. Exceeding these limits can cause a lot of trouble: repairs and compactions can fail, and reads can cause out-of-memory failures.
This talk provides a deep dive into the reasons for the previous limitations, why exceeding them caused trouble, how the improvements in Cassandra 3.6 help with big partitions, and why you should not blindly let your partitions get huge.
About the Speaker
Robert Stupp Solution Architect, DataStax
Robert works as a Solutions Architect at DataStax and is also a Committer to Apache Cassandra. Before joining DataStax he worked with his customers to architect and build distributed systems using Cassandra, and he has long experience building distributed backend systems, mostly in Java.
Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake’s transaction log design and how Hyperspace enables indexing support that seamlessly works with the former’s time travel queries.
Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
Pull, Don't Push! Sensu Summit 2018 TalkJulian Dunn
Architectures for monitoring and configuration in a microservices era. A talk given by Julian Dunn and Fletcher Nichol at Sensu Summit 2018 in Portland, Oregon.
Pull, don’t push: Architectures for monitoring and configuration in a microse...Sensu Inc.
Applications today are increasingly being designed using a share-nothing, microservices architecture that is resilient to the failure of individual components, even when built atop cloud infrastructure that can suffer infrequent-but-massive outages. Yet we still see many supporting tools for application monitoring, observability, configuration management and release management using a centralized “orchestration” approach that depends on pushing changes to unreliable distributed systems.
In this Sensu Summit 2018 talk, Chef's Julian Dunn & Fletcher Nichol give you a primer on promise theory and the autonomous actor model that underlies the design of products like Sensu and Habitat, and why it leads not only to higher overall system reliability but also to better human comprehension for easier operations. They argue that you should consider designing all of your applications and supporting systems in this way. They may even show a demo or two to illustrate how inverting the design radically changes the notion of "application release orchestration", so that you can retain orchestration-type semantics even with an eventually-consistent system design.
This session will re-evaluate Cassandra’s relationship with runtime and build systems, pointing out ways that the existing systems fall down, and identifying avenues for improvement. Over the past few years, a number of platforms have emerged for running user code. Container runtimes like Docker, container orchestrators such as Kubernetes, and metrics collections agents like Prometheus and Spectator have all gained popularity and mind-share. Cassandra functionality such as metrics, bootstrapping, and monitoring integrates with the newer paradigms, but in an ad-hoc and improvised fashion. By taking a purposeful approach to integrating with these new methods of deployment, the Cassandra community can more fully benefit from their advertised strengths. The Cassandra build system based on Ant+Ivy dates to the early 2000’s, and reflects legacy complexity that could be avoided with modern build systems. Cassandra’s system package builds are not much better and often fail to integrate with industry standards such as systemd. Iterating on the existing systems is difficult, but this technical debt slows innovation in our build systems. In this talk, we propose solutions to make building, deploying and monitoring Cassandra easy and low overhead, while taking advantage of cloud advancements wherever possible.
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxData
In this InfluxDays NYC 2019 presentation, InfluxData VP of Products Tim Hall and Sales Engineer Sam Dillard discuss architecture patterns with InfluxEnterprise time series platform. They cover an overview of InfluxEnterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice. Presentation highlights include InfluxEnterprise cluster architecture and how to determine if you're ready for adopting InfluxEnterprise.
Infrastructure as Code to Maintain your SanityDewey Sasser
Infrastructure as Code is all the rage, but suffers the same problems as any other code: it can easily become an unmanageable plate of spaghetti. Organizing your IaC is critical but the methods are different than traditional program code. We present an organizational pattern for IaC that has proven itself across multiple technologies in multiple cloud systems to allow isolation of concerns, stability, and controlled rollouts and maintain your sanity while doing so.
The Architecture of Continuous Innovation - OSCON 2015Chip Childers
For many years, the gold standard of business strategy has been the mantra “Sustainable competitive advantage.” But the world has changed. Moving forward, the mantra for survival must be “Continuous innovation.”
In this talk, I will take the audience inside the architectural foundation of a modern cloud native platform. I’ll walk through the tools they’ll use to deliver on the promise of continuous innovation — tools such as Docker, Lattice, Puppet, and Cloud Foundry. And I’ll show examples of how to use those tools to deliver the speed and portability businesses need to thrive in a cloud native world.
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...confluent
Watch this talk here: https://www.confluent.io/online-talks/scaling-security-on-100s-of-millions-of-mobile-devices-using-kafka-and-scylla-on-demand
Join mobile cybersecurity leader Lookout as they talk through their data ingestion journey.
Lookout enables enterprises to protect their data by evaluating threats and risks at post-perimeter endpoint devices and providing access to corporate data after conditional security scans. Their continuous assessment of device health creates a massive amount of telemetry data, forcing new approaches to data ingestion. Learn how Lookout changed its approach in order to grow from 1.5 million devices to 100 million devices and beyond, by implementing Confluent Platform and switching to Scylla.
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that will conform to low-latency Big Data analysis requirements. Apache Kafka, and the Kappa Architecture in particular, are drawing more and more attention away from the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams tend to be a synergy in the Big Data world.
The realization of network softwarization, an overarching buzzword to encompass all software-centric developments from the Software-Defined Networking (SDN) and Network Function Virtualization (NFV) trends, is being enabled through a set of innovations in high-speed data plane design and implementation. Recent efforts include re-architecting the hardware-software interfaces and exposing programmatic interfaces (e.g., OpenFlow), programmable hardware-based pipelines (e.g., Protocol Independent Switch Architecture, PISA) along with suitable programming languages (e.g., P4), and multiple advances in low-overhead virtualization and fast packet processing libraries (e.g., DPDK, FD.io) for Linux-based general purpose processor platforms. This talk provides an overview of relevant ongoing work and discusses the trade-offs of each design and implementation choice of software-defined dataplanes regarding Programmability, Performance, and Portability.
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Spark Summit
Spark data processing is shifting from on-premises to cloud services to take advantage of horizontal resource scalability, better data accessibility, and easy manageability. However, fully utilizing the computational power, fast storage, and networking offered by cloud services can be challenging without a deep understanding of workload characterizations and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework, Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process of configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house developed data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue of the Java instanceof operator. The fix, written in Scala, hugely improves the performance of GATK4 and other Spark-based workloads.
Capturing live traffic in a typical micro-services setup is one of the industry standard strategies for various needs including testing, canary, and dark launches, but when it comes to databases, capturing and replaying live traffic is quite challenging yet equally critical. Cassandra is one of the few databases that supports a live traffic capture and replay feature out of the box, starting with 4.0. As part of this talk, Vinay Chella will dive deeper into the query logging framework that is introduced in Cassandra 4.0, and the features built on top of this framework, including full query logging, audit logging, and traffic replay. At the end of this talk, you will learn how to use audit logging, live traffic capture, and replay features in the upcoming release of Cassandra.
This talk is from Distributed Data Summit SF 2018 - http://distributeddatasummit.com/2018-sf/sessions#chella
Audit logging is one of the most critical features in an enterprise-ready database in terms of security compliance. Furthermore, live traffic troubleshooting is critical for operators to troubleshoot production issues quickly. While past versions have lacked these critical features, the Cassandra team understood the need for better solutions, and in the upcoming release of Cassandra both of these features now come out of the box, which makes Cassandra even more awesome to work with. Cassandra now supports audit logging and query logging as part of C* itself. As part of this talk, the audience will learn how to enable, configure, and tune audit logging for their C* clusters and how to log live traffic/queries for several needs, including troubleshooting or even live traffic replay.
These slides are from the recent meetup @ Uber - Apache Cassandra at Uber and Netflix on new features in 4.0.
Abstract:
A glimpse of Cassandra 4.0 features:
There are a lot of exciting features coming in 4.0, but this talk covers some of the features that we at Netflix are particularly excited about and looking forward to. In this talk, we present an overview of just some of the many improvements shipping soon in 4.0.
Apache Cassandra makes it possible to execute millions of operations per second in a scalable fashion. Harnessing the power of C* leaves many developers pondering the following:
- Is my data model appropriate and not going to end up as wide partition(s) causing heap pressure and other issues?
- How do I tune my connection pool configuration? What are the optimal settings for my environment?
- What is my C* cluster capacity in terms of number of IOPs for a given 95th and 99th latency?
- How do I perf-test my data access layer?
In this talk, Vinay Chella, Cloud Data Architect @ Netflix, will share the open source tools, techniques, and platform (NDBench) that Netflix uses to perf-test their C* fleet with simulations of millions of operations per second.
At millions of operations/sec, entropy is inevitable. How does Netflix deal with this data inconsistency? And what does it take for Netflix to repair their fleet of C* systems to keep entropy to a minimum?
In this talk, Vinay Chella, Cloud Data Architect @ Netflix, will shed light on Netflix's anti-entropy strategy and the problems that arise at scale. He will discuss repair strategies and how Netflix repairs petabytes of data while maintaining a strict SLA for client applications and giving a seamless movie experience to its customers. Come learn about different repair strategies, tools, and techniques to repair your C* clusters.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
How netflix manages petabyte scale apache cassandra in the cloud
1. How Netflix manages petabyte scale Apache Cassandra in the cloud
Joey Lynch, Vinay Chella
Netflix’s Distributed Database Engineers
2. Who are we?
Vinay Chella
Distributed Systems Engineer
Focusing on Apache Cassandra and Data Abstractions
Cloud Data Engineering
Netflix
Joey Lynch
Distributed Systems Engineer
Distributed system addict and data wrangler
Cloud Data Engineering
Netflix
3. Agenda
— Why use Cassandra?
— Scale of Cassandra
— Life of Cassandra Cluster
- Where does it start?
- Provisioning
- Keep it running
- Migration / Retiring
— Murphy’s law applied
4. Why Apache Cassandra
— Millions of operations per sec
— Global data replication
— Failure isolation at rack level
— Chaos ready database
— Tunable consistency
— Log structured storage engine
5. Scale
— 10’s of thousands of instances
— 100’s of global C* clusters
— >6 PB of data
— Millions of requests / second
— Replicating several GiB/sec of data across the globe
6. Story of Apache Cassandra and Netflix
Inception → Provision → Keep it running → Migrations
9. Inception
Service philosophy
— Context not control
◆ Education
◆ Tooling
— SLOs are key
◆ Size, rate, latency, availability
— Every party must be responsible
37. Keep it Running
Imperative Control Plane
Pros:
— Easy to implement
— Easy to understand
Cons:
— Fragile to failure
— Scaling issues
— Have to implement parallelism
Declarative Control Plane
Pros:
— Lower total cost of ownership
— Parallel out of the box
Cons:
— Requires different engineering mindset
— Hard to implement
— Have to stop parallelism
38. Keep it Running
Imperative Tools
— “Orchestrators”
◆ Jenkins (jenkins.io)
◆ Fabric (fabfile.org)
◆ Ansible (ansible.com)
— Imperative operation
Declarative Tools
— Distributed Databases
— Distributed operation
— Similar to Actor model
(puppet.com, consul.io)
39. Keep it Running
Declarative Control Plane
1. Obtain desires from datastore
2. Obtain state of world
3. If state != desire, act!
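The three-step loop above is the essence of a declarative control plane. Here is a minimal Python sketch of one reconciliation pass; the function names are illustrative and not Netflix's actual tooling.

```python
def reconcile_once(get_desired, get_actual, act):
    """One pass of a declarative control loop:
    1. obtain desires from the datastore,
    2. obtain the state of the world,
    3. if state != desire, act to converge.
    Returns True if an action was taken, False if already converged."""
    desired = get_desired()
    actual = get_actual()
    if actual != desired:
        act(desired, actual)
        return True
    return False

# A long-running control plane simply calls reconcile_once in a loop,
# sleeping between iterations; failures just mean the next pass tries again.
```

Because each pass is stateless and idempotent, many such controllers can run in parallel across a fleet, which is where the "parallel out of the box" advantage on the previous slide comes from.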
40. Keep it Running
Ring Health: Imperative Solution
— O(#nodes) commands
— Single network path
— Hard to synthesize global view
42. Keep it Running
Ring Health: Declarative Solution
— Desire: Healthy <30m
— Observe: Full cluster state
— Action: Remediate or page
43. Keep it Running
Hardware Health: Imperative Solution
— O(#nodes) messages
— Degraded hardware is hard to SSH into
◆ Failed ephemerals
◆ Failed network
44. Keep it Running
Hardware Health: Declarative Solution
— Desire: Degraded hardware should die <10m
— Observe: Local disks, creds, filesystems
— Action: Request termination
45. Keep it Running
JVM Health: Declarative Solution
— Desire: JVM throughput > 90%
— Observe: GC vs Runtime
— Action: core dump + terminate
https://github.com/Netflix-Skunkworks/jvmquake
46. Keep it Running
Backup: Declarative Solution
— Desire: Full snapshot every interval
— Observe:
◆ State of data
◆ State of S3
— Action:
◆ Upload diff
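The backup reconciliation above (observe local data and S3, upload only the difference) can be sketched as a set comparison. The name-to-hash manifests here are an illustrative simplification; the real tooling tracks SSTable manifests.

```python
def files_to_upload(local_files, s3_files):
    """Backup reconciliation sketch: given two manifests mapping
    file name -> content hash (local disk vs. what S3 already holds),
    return the names that still need uploading. Illustrative only."""
    return {
        name for name, digest in local_files.items()
        if s3_files.get(name) != digest
    }

# Because SSTables are immutable, a file whose name and hash already
# appear in S3 never needs re-uploading, so each pass ships only the
# incremental diff.
```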
50. Keep it Running
Repair: Imperative Solution
— O(lots) commands
— Remote JMX = :(
— Single point of failure = :(
— 1 month job duration = D:
51. Keep it Running
Repair: Declarative Solution
— Distributed Control Plane
— Makes progress when database is available
— Every node follows repair state machine
https://issues.apache.org/jira/browse/CASSANDRA-14346
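A per-node repair state machine of the kind described above might look like this minimal sketch. The state names and transitions are illustrative, not the actual CASSANDRA-14346 design; the point is that each node advances independently and retries from a well-defined state after failure.

```python
# Illustrative repair state machine each node walks through on its own.
TRANSITIONS = {
    "WAITING": "RUNNING",  # node's turn comes up: start repairing
    "RUNNING": "DONE",     # repair of this node's ranges succeeded
    "FAILED": "WAITING",   # retry: re-enter the queue from the top
}

def next_state(state, succeeded=True):
    """Advance one step. A failure while RUNNING diverts to FAILED;
    terminal or unknown states are left unchanged."""
    if state == "RUNNING" and not succeeded:
        return "FAILED"
    return TRANSITIONS.get(state, state)
```

Because the state lives with each node rather than in one central orchestrator, the cluster keeps making repair progress whenever the database is available, with no single point of failure.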
54. Keep it Running
C* Instance ecosystem
[Architecture diagram: C* instances run alongside the Priam and Bolt sidecars; pre-flight checks (including AWS IO detection) gate instances into service; the CDE service, Atlas, and Mantis drive alerting and paging; Winston and cluster metadata feed remediation.]
64. Cautionary Tale: Rarely run automation
— Auto remediation to clean up stale backups, repair snapshots
◆ Triggers only above FATAL disk threshold
◆ Assumption mismatch
65. Cautionary Tale: Rarely run automation
— Auto remediation to clean up stale backups, repair snapshots
◆ Triggers only above FATAL disk threshold
◆ Assumption mismatch
— Result:
◆ Automation ended up deleting real backup data