Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - Justin Cunningham

Netflix processes trillions of events and petabytes of data a day in the Keystone data pipeline, which is built on top of Apache Flink. As Netflix has scaled up its original productions, enjoyed annually by more than 150 million members worldwide, data integration across the streaming service and the studio has become a priority. Scalably integrating data across hundreds of different data stores, in a way that lets us holistically optimize cost, performance and operational concerns, presented a significant challenge. Learn how we expanded the scope of the Keystone pipeline into the Netflix Data Mesh, our real-time, general-purpose data transportation platform for moving data between Netflix systems. The talk covers in depth the Keystone Platform's approach to declarative configuration and schema evolution, as well as our approach to unifying batch and streaming data and processing.

Netflix Data Mesh
Composable Data Processing
jcunningham@netflix.com

Cross-platform Eventing
Netflix Streaming & Keystone

More than 150M Global Members
Trillions of Messages / Petabytes a Day

A high-level view of Netflix's Studio Structure

Airport: Netflix
Air Traffic Control: Studio
Airplanes: Productions
Credit: Christopher Goss, Netflix

Parent Studio (many airports, many airplanes per airport)
Studio A, Studio B, Studio C
Production Company D, Production Company E, Production Company F

Netflix
(one large airport, huge # of airplanes)

Data Mesh: Composable
Data Processing
Data Transport Problems
Significant duplication of effort across pipelines and teams.
Delay in bringing new pipelines online and increasing maintenance overhead from existing pipelines.
Uneven implementation of best practices.
Need for lower latency data transportation and warehousing for operational reporting.
Correctness issues related to distributed systems error recovery.

[Diagram: Extract, Transform, Load]
Extract (sources): RDS, Cassandra, Airtable, Logging Data, …
Transform: Flink Processing
Load (sinks): RDS, Cassandra, S3 Data Warehouse, Elasticsearch, …

[Diagram: sources feed SourceConnectors, which write to streams (Stream 1 to Stream 4); a Stream Processor reads streams in and writes streams out (Avro); SinkConnectors deliver streams to sinks. Systems shown: Service, RDS, Cassandra, Catalog, EV Cache, ES, S3.]

[Diagram: Stream Processors reading from and writing to streams (Stream 1, Stream 2), with streams delivered to sinks.]

[Diagram: pipeline stages, in order]
Source Database -> DB CDC Source Connector -> DB Change Stream -> GraphQL Flink Processor -> Enriched Stream -> Iceberg Sink Flink Processor -> Iceberg S3 Data
Auditors: CDC Flink Auditor, GraphQL Flink Auditor, Batch Iceberg Auditor
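The chain of connector, processor, sink and auditors shown on this slide can be sketched as small composable stages. This is only an illustration in plain Python, not Netflix's implementation and not Flink code: the function names (`cdc_source`, `enrich`, `audit`, `compose`), the record shape and the in-memory lookup standing in for the GraphQL enrichment are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def cdc_source(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical CDC source: wraps each database row as a change event."""
    for row in rows:
        yield {"op": "upsert", "after": dict(row)}


def enrich(lookup: Dict[Any, Record]) -> Stage:
    """Hypothetical enrichment processor; the dict stands in for a GraphQL call."""
    def stage(events: Iterable[Record]) -> Iterator[Record]:
        for event in events:
            after = dict(event["after"])
            after.update(lookup.get(after["id"], {}))
            yield {**event, "after": after}
    return stage


def audit(name: str, counts: Dict[str, int]) -> Stage:
    """Pass-through auditor that counts records seen at one point in the pipeline."""
    def stage(events: Iterable[Record]) -> Iterator[Record]:
        for event in events:
            counts[name] = counts.get(name, 0) + 1
            yield event
    return stage


def compose(*stages: Stage) -> Stage:
    """Chain stages so the output stream of one is the input stream of the next."""
    def pipeline(records: Iterable[Record]) -> Iterator[Record]:
        for stage in stages:
            records = stage(records)
        return iter(records)
    return pipeline


counts: Dict[str, int] = {}
pipeline = compose(
    cdc_source,
    audit("db_change_stream", counts),
    enrich({1: {"city": "Los Gatos"}}),
    audit("enriched_stream", counts),
)
# A plain list plays the role of the sink here.
sink = list(pipeline([{"id": 1, "first": "Ada"}, {"id": 2, "first": "Alan"}]))
```

Because every stage has the same stream-in, stream-out shape, stages can be reordered, reused across pipelines, or wrapped with auditors without changing their code, which is the sense of "composable" the sketch tries to convey.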

Overall Schema Evolution Approach
Apache Avro schema format
Stream processors are deployed with fixed input and output schemas
Schema changes are managed by redeploying with new fixed input and output schemas
Processors can opt in to automatic schema upgrades
Most schema changes don't require a topic change
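One reason most schema changes need not force a topic change is Avro's schema resolution rule: a reader schema can decode data written with an older schema as long as every field the writer did not produce has a default. The check below is a simplified sketch of that rule in plain Python. The function `can_read` and the `User` schemas are hypothetical, and the sketch deliberately ignores Avro's type promotions, aliases and nested records.

```python
from typing import Any, Dict, List

Schema = Dict[str, Any]


def can_read(reader: Schema, writer: Schema) -> List[str]:
    """Return a list of problems; an empty list means the reader schema can
    decode records written with the writer schema (record fields only)."""
    writer_fields = {f["name"]: f for f in writer["fields"]}
    problems = []
    for field in reader["fields"]:
        written = writer_fields.get(field["name"])
        if written is None:
            # Field is absent from the written data: Avro needs a default.
            if "default" not in field:
                problems.append(
                    f"field '{field['name']}' is missing from writer and has no default"
                )
        elif written["type"] != field["type"]:
            # Simplification: real Avro allows some promotions (e.g. int -> long).
            problems.append(f"field '{field['name']}' changed type")
    return problems


v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
]}
# Adding an optional field with a default keeps old data readable.
v2 = {"type": "record", "name": "User", "fields": v1["fields"] + [
    {"name": "city", "type": ["null", "string"], "default": None},
]}
# Adding a required field without a default does not.
v2_bad = {"type": "record", "name": "User", "fields": v1["fields"] + [
    {"name": "city", "type": "string"},
]}

ok_problems = can_read(v2, v1)
bad_problems = can_read(v2_bad, v1)
```

A processor redeployed with `v2` as its input schema could keep consuming records already written with `v1`, whereas `v2_bad` would be rejected before deployment.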

Data Mesh Controller
DB CDC Source Connector -> GraphQL Flink Processor -> Iceberg Sink Flink Processor -> Iceberg S3 Data

Physical Data Mesh Storage
(id: name) 1: id, 2: first, 3: last
Physical S3 Storage
(id) 1, 2, 3
Iceberg Data
(id: name) 1: id, 2: first, 3: last
Logical Iceberg
Avro Data Mesh Topic Avro Iceberg Sink

Physical Data Mesh Storage
(id: name) 1: id, 2: first, 3: last, 4: city
Physical S3 Storage
(id) 1, 2, 3, 4
Iceberg Data
(id: name) 1: id, 2: first, 3: last
Logical Iceberg
Avro Data Mesh Topic Avro Iceberg Sink

(id: name) 1: id, 2: first, 3: last
Physical Data Mesh Storage
(id: name) 1: id, 2: first, 3: last, 4: city
Physical S3 Storage
(id) 1, 2, 3, 4
Iceberg Data
(id: name) 1: id, 2: first, 3: last, 4: city
Logical Iceberg
Avro Data Mesh Topic Avro Iceberg Sink

Physical Data Mesh Storage
(id: name) 1: id, 2: first_name, 3: last_name, 4: city
Physical S3 Storage
(id) 1, 2, 3, 4
Iceberg Data
(id: name) 1: id, 2: first, 3: last, 4: city
Logical Iceberg
Avro Data Mesh Topic Avro Iceberg Sink
(id: name) 1: id, 2: first_name, 3: last_name, 4: city

Physical Data Mesh Storage
(id: name) 1: id, 2: first_name, 4: city, 5: last
Physical S3 Storage
(id) 1, 2, 3, 4, 5
Iceberg Data
(id: name) 1: id, 2: first_name, 4: city, 5: last
(id: name) 1: id, 2: first_name, 3: last_name, 4: city
(id: name) 1: id, 2: first, 3: last, 4: city
Logical Iceberg
Avro Data Mesh Topic Avro Iceberg Sink
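The schema slides above appear to walk through adding `city` (ID 4), renaming `first` and `last` to `first_name` and `last_name` (IDs unchanged), and then dropping `last_name` and adding a new `last` column under ID 5. That sequence works because Iceberg tracks columns by field ID rather than by name. The toy table below is a sketch of that idea only. The `Table` class and its methods are hypothetical and are not the Iceberg API, and the sample values are made up.

```python
from typing import Any, Dict, List


class Table:
    """Toy table that stores values by field ID and resolves names at read time."""

    def __init__(self) -> None:
        self._names: Dict[int, str] = {}       # field ID -> current column name
        self._next_id = 1                       # IDs are never reused
        self.rows: List[Dict[int, Any]] = []    # physical rows keyed by field ID

    def _id_of(self, name: str) -> int:
        for field_id, current in self._names.items():
            if current == name:
                return field_id
        raise KeyError(name)

    def add_column(self, name: str) -> int:
        if name in self._names.values():
            raise ValueError(f"column '{name}' already exists")
        field_id = self._next_id
        self._next_id += 1
        self._names[field_id] = name
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        # Only the logical name changes; stored data is untouched.
        self._names[self._id_of(old)] = new

    def drop_column(self, name: str) -> None:
        # The ID disappears from the logical schema but is not recycled.
        del self._names[self._id_of(name)]

    def append(self, **values: Any) -> None:
        self.rows.append({self._id_of(k): v for k, v in values.items()})

    def scan(self) -> List[Dict[str, Any]]:
        return [
            {name: row.get(field_id) for field_id, name in self._names.items()}
            for row in self.rows
        ]


table = Table()
for column in ("id", "first", "last"):
    table.add_column(column)
table.append(id=1, first="Ada", last="Lovelace")

city_id = table.add_column("city")            # new column gets ID 4
table.rename_column("first", "first_name")    # ID 2 keeps its data
table.rename_column("last", "last_name")      # ID 3 keeps its data
table.drop_column("last_name")                # ID 3 leaves the logical schema
new_last_id = table.add_column("last")        # new column gets ID 5, not 3
result = table.scan()
```

Because the re-added `last` column receives a fresh ID, the old values written under ID 3 are not resurrected: the existing row reads back with `last` set to `None`.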
