This is a talk about Netflix's path to Cassandra. The first few slides may look similar to those in previous presentations, but they are just to set the context. Most of the content is brand new!
3. Motivation
Circa late 2008, Netflix had a single data center
Single point of failure (a.k.a. SPOF)
Approaching limits on cooling, power, space, and traffic capacity
Alternatives
Build more data centers
Outsource the majority of capacity planning and scale-out
Allows us to focus on core competencies
4. Motivation
Winner: Outsource the majority of capacity planning and scale-out
Leverage a leading Infrastructure-as-a-Service provider: Amazon Web Services
5.
6. Cloud Migration Strategy
Components
Applications and Software Infrastructure
Data
Migration Considerations
Avoid sensitive data for now
PII and PCI DSS data stays in our DC; the rest can go to the cloud
Favor Web Scale applications & data
7. Cloud Migration Strategy
Examples of Data that can be moved
Video-centric data
Critics’ and Users’ reviews
Video Metadata (e.g. director, actors, plot description, etc…)
User-video-centric data – some of our largest data sets
Video Queue
Watched History
Video Ratings (i.e. a 5-star rating system)
Video Playback Metadata (e.g. streaming bookmarks, activity logs)
13. Pick a Data Store in the Cloud
An ideal storage solution should have the following features:
• Be hosted in AWS
• We wanted a database-as-a-service
• Be highly scalable and available and have acceptable latencies
• It should automatically scale with Netflix’s traffic growth
• It should be as available as television – i.e. zero downtime
• Support SQL
• Developers already familiar with the model
14. Pick a Data Store in the Cloud
We picked SimpleDB and S3
SimpleDB was targeted as the AP (c.f. CAP theorem) equivalent of our RDBMS databases in our data center
S3 was used for data sets where item or row data exceeded SimpleDB limits and could be looked up purely by a single key (i.e. does not require secondary indices and complex query semantics)
Video encodes
Streaming device activity logs (i.e. CLOB, BLOB, etc.)
Compressed (old) Rental History
16. Technology Overview : SimpleDB
Terminology (SimpleDB | Hash Table | Relational Databases)
Domain | Hash Table | Table
Item | Entry | Row
Item Name | Key | (Mandatory) Primary Key
Attribute | Part of the Entry Value | Column
17. Technology Overview : SimpleDB
Example domain: Soccer Players (Key | Value)
ab12ocs12v9 | First Name = Harold, Last Name = Kewell, Nickname = Wizard of Oz, Teams = Leeds United, Liverpool, Galatasaray
b24h3b3403b | First Name = Pavel, Last Name = Nedved, Nickname = Czech Cannon, Teams = Lazio, Juventus
cc89c9dc892 | First Name = Cristiano, Last Name = Ronaldo, Teams = Sporting, Manchester United, Real Madrid
SimpleDB’s salient characteristics
• SimpleDB offers a range of consistency options
• SimpleDB domains are sparse and schema-less
• The Key and all Attributes are indexed
• Each item must have a unique Key
• An item contains a set of Attributes
• Each Attribute has a name
• Each Attribute has a set of values
• All data is stored as UTF-8 character strings (i.e. no support for types such as numbers or dates)
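To make the data model concrete, here is a minimal sketch (plain Python, not the actual SimpleDB API) of a sparse, schema-less domain: each item maps a key to named attributes, each attribute holds a set of UTF-8 string values, and items need not share the same attributes. The contents mirror the Soccer Players example above.

    # Illustrative model of a SimpleDB domain: item key -> {attribute name -> set of string values}.
    # This is a conceptual sketch, not the AWS SimpleDB client.
    soccer_players = {
        "ab12ocs12v9": {
            "First Name": {"Harold"},
            "Last Name":  {"Kewell"},
            "Nickname":   {"Wizard of Oz"},
            "Teams":      {"Leeds United", "Liverpool", "Galatasaray"},  # multi-valued attribute
        },
        "cc89c9dc892": {
            "First Name": {"Cristiano"},
            "Last Name":  {"Ronaldo"},
            "Teams":      {"Sporting", "Manchester United", "Real Madrid"},
            # No "Nickname" attribute here -- items are sparse and schema-less.
        },
    }

    # All attributes are indexed, so a query can filter on any of them,
    # e.g. "players whose Teams attribute contains 'Liverpool'":
    liverpool_players = [
        key for key, attrs in soccer_players.items()
        if "Liverpool" in attrs.get("Teams", set())
    ]
    print(liverpool_players)   # ['ab12ocs12v9']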
18. Technology Overview : SimpleDB
What does the API look like?
Manage Domains
CreateDomain
DeleteDomain
ListDomains
DomainMetaData
Access Data
Retrieving Data
GetAttributes – returns a single item
Select – returns multiple items using SQL syntax
Writing Data
PutAttributes – put single item
BatchPutAttributes – put multiple items
Removing Data
DeleteAttributes – delete single item
BatchDeleteAttributes – delete multiple items
19. Technology Overview : SimpleDB
Options available on reads and writes
Consistent Read
Read the most recently committed write
May have lower throughput / higher latency / lower availability
Conditional Put/Delete
i.e. Optimistic Locking
Useful if you want to build a consistent multi-master data store – you will still require your own consistency checking
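A conditional put is essentially compare-and-set on one attribute: the write is applied only if that attribute still holds the value the writer last read. The sketch below is a hypothetical in-memory helper illustrating the optimistic-locking pattern, not the real SimpleDB call.

    # Conceptual conditional put: apply the update only if the item's "version"
    # attribute still matches what the writer last read (optimistic locking).
    class ConditionalCheckFailed(Exception):
        pass

    def conditional_put(domain, item_name, new_attrs, expected_name, expected_value):
        item = domain.setdefault(item_name, {})
        if item.get(expected_name) != expected_value:
            raise ConditionalCheckFailed(
                f"{expected_name} is {item.get(expected_name)!r}, expected {expected_value!r}")
        item.update(new_attrs)

    domain = {"user-42": {"plan": "2-discs", "version": "7"}}

    # Read, modify, and write back only if nobody else bumped the version in between.
    conditional_put(domain, "user-42",
                    {"plan": "streaming-only", "version": "8"},
                    expected_name="version", expected_value="7")
    print(domain["user-42"]["plan"])   # streaming-only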
21. Major Issues with SimpleDB
Manual data set partitioning was needed to work around size and throughput limits
Read & write latency variance could be large
SimpleDB multi-tenancy created both throughput and latency issues
Bad cost model – poor performance cost us more money
Not global – new requirement: global expansion
No (external) back-up and recovery available – new requirement: refresh Test DBs from Prod DBs
22.
23. Pick Another Data Store in the Cloud
An ideal storage solution should have the following features:
• New Requirements
• Support Global Use-cases
• Support Back-up and Recovery
• Retained Requirements
• Be highly scalable and available and have acceptable latencies
• Obsolete Requirements
• Be hosted in AWS
• Support SQL
24. Pick Another Data Store in the Cloud
We picked Cassandra (i.e. Dynamo + BigTable)
• Support Global Use-cases
• Easy to add new nodes to the cluster, even if the new nodes are on a different continent
• Be easy to own and operate
• Dynamo’s masterless design avoids SPOF
• Dynamo’s repair mechanisms (i.e. read repair, anti-entropy repair, and hinted handoff) promote self-management
25. Pick Another Data Store in the Cloud
We picked Cassandra (i.e. Dynamo + BigTable)
• Be highly scalable and available and have acceptable latencies
• Known to scale for writes
• Netflix achieved >1 million writes/sec
• Netflix leverages caches for reads
• Bonus
• Data model identical to SimpleDB – i.e. one less thing for developers to learn
27. Technology Overview : Cassandra
Features
• Consistent Hashing
• Tunable Consistency at the Request-level (like SimpleDB)
• Automatic Healing
• Read Repair
• Anti-entropy Repair
• Hinted Handoff
• Failure Detection
• Clusters are configurable & upgradeable without restart
• Infinite Incremental Write Scalability
28. Consistent Hashing
How does Consistent Hashing work in Cassandra?
• Take a number line from [0–159]
• Wrap it on itself, so you now have a number ring
29. Consistent Hashing
• Given a key k, map the key to the ring using:
• hash_func(k) % 160 = token = position on the ring
30. Consistent Hashing
• You can then manually map machines to the ring
• 8 machines are mapped to the ring here
• N1 → 0
• N2 → 20
• N3 → 40
31. Consistent Hashing
Now map key ranges to machines:
• Host N2 owns all tokens in the range (0, 20]
• In other words, “foo” is mapped to bucket 9 but assigned to server N2
• Host N5 owns all tokens in the range (60, 80]
• “bar” is mapped to bucket 79 and assigned to server N5
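The mapping above can be sketched in a few lines of Python. This uses the toy 160-position ring from the slides with 8 nodes at tokens 0, 20, 40, and so on; the MD5-based hash is a stand-in for Cassandra's partitioner, and a node at token T owns the range (previous token, T].

    import hashlib

    RING_SIZE = 160
    # 8 nodes manually assigned to evenly spaced tokens, as in the slides.
    NODES = {0: "N1", 20: "N2", 40: "N3", 60: "N4", 80: "N5", 100: "N6", 120: "N7", 140: "N8"}

    def token_for(key: str) -> int:
        # Stand-in hash; the real partitioner hashes into a much larger token space.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

    def owner_of(key: str) -> str:
        t = token_for(key)
        # A node at token T owns (previous_token, T], so walk clockwise to the
        # first node token >= t, wrapping around to the smallest token.
        for node_token in sorted(NODES):
            if t <= node_token:
                return NODES[node_token]
        return NODES[min(NODES)]          # wrapped past the largest token

    for k in ("foo", "bar", "baz"):
        print(k, "-> token", token_for(k), "-> node", owner_of(k))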
32. Consistent Hashing : Dead Node?
What happens if a node dies or becomes unresponsive?
With a replication factor of, say, 3, data is always written to 3 places
• Writes to tokens in N2’s primary token range are also written to N3 and N4
• Writes to tokens in N5’s primary token range are also written to N6 and N7
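Replication follows directly from the ring: the replicas for a token are the owning node plus the next RF-1 distinct nodes walking clockwise. A minimal, self-contained sketch of that walk (simple placement, ignoring racks and data centers):

    RING_SIZE = 160
    NODES = {0: "N1", 20: "N2", 40: "N3", 60: "N4", 80: "N5", 100: "N6", 120: "N7", 140: "N8"}

    def replicas_for(key_token: int, rf: int = 3):
        # The primary replica owns the token; the remaining rf-1 replicas are the
        # next nodes clockwise around the ring.
        tokens = sorted(NODES)
        idx = next((i for i, t in enumerate(tokens) if key_token <= t), 0)
        return [NODES[tokens[(idx + i) % len(tokens)]] for i in range(rf)]

    print(replicas_for(9))    # ['N2', 'N3', 'N4'] -- matches the slide
    print(replicas_for(79))   # ['N5', 'N6', 'N7']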
34. The Write Path
• Cassandra clients are not required to know the token-ring mapping
• A client can send a request to any node in the cluster
• The receiving node is called the “coordinator” for that request
• The coordinator will always execute <Replication Factor> number of writes (e.g. 3)
35. The Write Path
• Coordinators take care of:
• Key routing
• Executing the consistency level of the request
• e.g. “CL=1” means that the coordinator will wait till 1 node ACKs the write before sending a response to the client
• e.g. “CL=Quorum” means that the coordinator will wait till 2 of 3 nodes ACK the write
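The coordinator's job can be pictured as counting acknowledgements: send the mutation to every replica and answer the client once the requested consistency level is met. The sketch below is a simplified, synchronous illustration (a real coordinator sends the writes concurrently); send_write is a hypothetical stand-in for the inter-node RPC.

    class UnavailableError(Exception):
        pass

    def coordinate_write(replicas, key, value, consistency_level, send_write):
        # Send the mutation to every replica; succeed once `consistency_level`
        # replicas have acknowledged (e.g. 1 for CL=1, 2 of 3 for CL=Quorum).
        acks = 0
        for replica in replicas:
            try:
                send_write(replica, key, value)   # stand-in for the inter-node write RPC
                acks += 1
            except TimeoutError:
                pass   # a dead or slow replica does not fail the request by itself
        if acks < consistency_level:
            raise UnavailableError(f"only {acks} ack(s), needed {consistency_level}")
        return "ack to client"

    # Example: N3 is down, but CL=Quorum (2 of 3) still succeeds.
    def send_write(replica, key, value):
        if replica == "N3":
            raise TimeoutError(replica)

    print(coordinate_write(["N2", "N3", "N4"], "foo", "v1", consistency_level=2, send_write=send_write))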
36. The Write Path
• If node N3 is presumed dead by the failure detection algorithm, or if N3 is too busy to respond
• N5 will log the write to N3 and deliver it as soon as N3 comes back
• This is called Hinted Handoff
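Hinted handoff can be modeled as a per-target queue of mutations the coordinator could not deliver, replayed when the target is reachable again. A rough in-memory sketch (real hints are persisted locally on the coordinator):

    from collections import defaultdict

    hints = defaultdict(list)          # target node -> mutations it missed while down

    def write_or_hint(target, key, value, node_is_up, send_write):
        if node_is_up(target):
            send_write(target, key, value)
        else:
            hints[target].append((key, value))   # store a hint instead of dropping the write

    def replay_hints(target, send_write):
        # Called when the failure detector notices the target is back.
        while hints[target]:
            key, value = hints[target].pop(0)
            send_write(target, key, value)

    # Example: N3 misses a write while down, then receives it on recovery.
    delivered = []
    write_or_hint("N3", "foo", "v1", node_is_up=lambda n: False,
                  send_write=lambda n, k, v: delivered.append((n, k, v)))
    replay_hints("N3", send_write=lambda n, k, v: delivered.append((n, k, v)))
    print(hints["N3"], delivered)      # [] [('N3', 'foo', 'v1')]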
37. The Write Path
• Now that N5 is holding uncommitted writes, what happens if N5 dies before it can replay the failed write?
• How will the logged write ever make it to N3?
• Another repair mechanism helps in this case: Read Repair (stay tuned)
39. The Read Path
• Again, N5 acts as a proxy, this time for a get request
• Assume that the consistency_level on the request is “Quorum”; N5 will send a full read request to N2 and digest requests to N3 & N4
• This is a network optimization
40. The Read Path : Read Repair
• N5 will compare the responses as soon as any 2 requests return
• At least one response will be a digest (e.g. say N2 and N3 respond first)
• If the digest is more recent than the full read, N5 doesn’t have the latest data
• N5 then asks the N3 replica for a full read
41. The Read Path : Read Repair
• Once N5 receives the response of the full read from N3, N5 returns this data to the client
• But wait! What do we do about the stale data on N2?
• N5 then schedules an async update for the stale node(s)
• This is called read repair
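Putting slides 39–41 together, a quorum read takes one full response, compares digests from the other replicas, returns the newest version to the client, and pushes that version back to any stale replica. A simplified, synchronous sketch, using an explicit timestamp as a stand-in for Cassandra's per-column write timestamps:

    def digest(row):
        return hash((row["value"], row["ts"]))        # cheap stand-in for a real hash digest

    def quorum_read(key, replicas, read_full, read_digest, repair):
        # Full read from the first replica, digest reads from the rest (a network optimization).
        full_node, *digest_nodes = replicas
        responses = {full_node: read_full(full_node, key)}
        newest = responses[full_node]

        for node in digest_nodes:
            if read_digest(node, key) != digest(newest):
                row = read_full(node, key)            # mismatch: ask this replica for the full row
                responses[node] = row
                if row["ts"] > newest["ts"]:
                    newest = row

        for node in replicas:                         # read repair: push the newest version
            if digest(responses.get(node, newest)) != digest(newest):
                repair(node, key, newest)             # async in the real system

        return newest["value"]

    # Tiny in-memory replicas; N2 holds a stale copy of "foo".
    store = {"N2": {"foo": {"value": "old", "ts": 1}},
             "N3": {"foo": {"value": "new", "ts": 2}},
             "N4": {"foo": {"value": "new", "ts": 2}}}
    read_full   = lambda node, key: store[node][key]
    read_digest = lambda node, key: digest(store[node][key])
    def repair(node, key, row): store[node][key] = row

    print(quorum_read("foo", ["N2", "N3", "N4"], read_full, read_digest, repair))  # new
    print(store["N2"]["foo"]["value"])                                             # new (repaired)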
43. Consistent Hashing : Growing the Ring
• To double the capacity of the ring and keep the data distributed evenly, we add nodes (N9, N10, etc.) in the interstitial positions (10, 30, etc.)
• Cassandra bootstraps new nodes via data streaming
• For large data sets, Netflix recovers from backup and runs AER (anti-entropy repair)
44. Consistent Hashing : Growing the Ring
• After the new nodes have been added, they take over half of the primary token ranges of each old node
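The hand-over of ranges when the ring doubles can be seen in a small sketch: adding interstitial tokens halves each old node's primary range. (The token and node names below are illustrative, reusing the toy ring from earlier.)

    def ranges(tokens_to_nodes):
        # Map each node to the (exclusive, inclusive] token range it primarily owns.
        tokens = sorted(tokens_to_nodes)
        return {tokens_to_nodes[t]: (tokens[i - 1], t) for i, t in enumerate(tokens)}

    old = {0: "N1", 20: "N2", 40: "N3", 60: "N4", 80: "N5", 100: "N6", 120: "N7", 140: "N8"}
    print(ranges(old)["N2"])          # (0, 20)  -- N2 owns (0, 20]

    # Add interstitial nodes at 10, 30, 50, ... : each old node keeps only half its range.
    new = dict(old)
    new.update({t + 10: f"N{9 + i}" for i, t in enumerate(sorted(old))})
    print(ranges(new)["N9"])          # (0, 10)  -- new node N9 takes the first half
    print(ranges(new)["N2"])          # (10, 20) -- N2 keeps the second half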
45. A Global Ring
• One of the benefits of Cassandra is that it can support deployment of a global ring
46. A Global Ring
• In January 2012, Netflix rolled out its streaming service to the UK and Ireland
• It did this by running a few Cassandra clusters spanning Virginia (US-East) and the UK
47. A Global Ring
• Growing the ring into the UK from its origin in Virginia was easy
• Double the ring as mentioned before
• The new interstitial nodes are in the UK
• Increase the global replication factor from 3 → 6
• Now each key is covered by 6 nodes: 3 in Virginia and 3 in the UK
49. Single Node Design
A single node is implemented as per the LSM tree (i.e. log-structured merge tree) pattern:
• Writes first append to the commit log and then write to the memtable in memory
• This model gives fast writes
When the memtable is full, it is flushed to disk to give an SSTable
The SSTables are immutable
50. Single Node Design
In a pure KV LSM store (e.g. LevelDB), when a read occurs, the memtable is first consulted
If the key is found, the data is returned from the memtable
This is a fast read
If the data is not in the memtable, then it is read from the latest SSTable
This is a slower read
51. Single Node Design
Since Cassandra supports multiple value columns, only a subset of which may be written at any time, a read must consult the memtable and potentially multiple SSTables
Hence, reads can be slower than they would be for pure LSM KV stores
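A toy version of the single-node write and read path described above: writes append to a commit log and a memtable; a full memtable is flushed as an immutable SSTable; reads check the memtable first and then SSTables from newest to oldest. (Per-column merging, bloom filters, and real on-disk formats are omitted.)

    class TinyLSM:
        def __init__(self, memtable_limit=3):
            self.commit_log = []        # append-only durability log
            self.memtable = {}          # in-memory, mutable
            self.sstables = []          # immutable dicts, newest last
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.commit_log.append((key, value))   # 1. append to the commit log
            self.memtable[key] = value             # 2. write to the memtable (fast)
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # A full memtable is written out as an immutable SSTable.
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:                  # fast path
                return self.memtable[key]
            for sstable in reversed(self.sstables):   # newest SSTable first (slower path)
                if key in sstable:
                    return sstable[key]
            return None

    db = TinyLSM()
    for i in range(7):
        db.put(f"k{i}", f"v{i}")
    print(db.get("k0"), db.get("k6"), len(db.sstables))   # v0 v6 2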
52. Single Node Design
To reduce the number of SSTables that need to be read, periodic compaction needs to be run
This puts an additional load on GC and I/O
The end result of compaction is fewer files with, hopefully, fewer total rows
To mitigate the problems inherent in compaction, Cassandra 1.x supports Leveled Compaction
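Compaction merges several SSTables into one, keeping only the newest version of each row, so later reads touch fewer files. A minimal sketch, assuming each SSTable maps key -> (timestamp, value):

    def compact(sstables):
        # Merge many SSTables into one: for each key, keep only the newest (timestamp, value).
        merged = {}
        for sstable in sstables:                 # oldest to newest
            for key, (ts, value) in sstable.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        # Tombstones (deletions) can be dropped here once old enough; omitted for brevity.
        return merged

    old_sstables = [
        {"foo": (1, "a"), "bar": (1, "x")},
        {"foo": (2, "b")},
        {"baz": (3, "z"), "bar": (2, "y")},
    ]
    print(compact(old_sstables))   # {'foo': (2, 'b'), 'bar': (2, 'y'), 'baz': (3, 'z')}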
54. Current Status
Timeline of Migration
• Jan 2009 – Start investigating the AWS cloud & SimpleDB
• Feb 2010 – Deliver the app to production
• Dec 2010 – ~95% of Netflix traffic has been moved into both the AWS cloud and SimpleDB (+10 months)
• Apr 2011 – Start migration to Cassandra
• Jan 2012 – Netflix EU launched on Cassandra (+9 months)
Editor's Notes
Existing functionality needs to move in phases. This limits the risk and exposure to bugs, and limits conflicts with new product launches.