A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk, we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
Presentation for Papers We Love at QCON NYC 17. I didn't write the paper, good people at Facebook did. But I sure enjoyed reading it and presenting it.
A Practical Guide to Selecting a Stream Processing Technology (confluent)
Presented by Michael Noll, Product Manager, Confluent.
Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all.
Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. In particular, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
Presentation by Gwen Shapira, Product Manager, Confluent.
With the rapid increase of Apache Kafka use within organizations, issues of data governance and data quality take center stage. When more and more disparate departments and teams depend on the data in Apache Kafka, it’s important to provide a way to make sure "bad data" does not make its way into critical topics. Every organization that uses Kafka at large scale realizes it needs a way to deliver these guarantees.
In this talk, Kafka committer Gwen Shapira will review the benefits of a schema registry for large-scale Kafka deployments and give a high-level overview of how the Confluent Schema Registry is used in enterprise architectures across industries.
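For readers who have not used a schema registry before, here is a minimal sketch of the Confluent Schema Registry REST API, which is how schemas are registered and discovered; the subject name and the toy Avro schema are hypothetical examples, and the registry is assumed to run on its default port.

    # Register a (hypothetical) Avro schema under the subject "orders-value"
    curl -X POST \
      -H "Content-Type: application/vnd.schemaregistry.v1+json" \
      --data '{"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}"}' \
      http://localhost:8081/subjects/orders-value/versions

    # List all registered subjects
    curl http://localhost:8081/subjects

Once a subject has a registered schema, the registry rejects incompatible changes under its compatibility settings, which is the mechanism that keeps "bad data" out of critical topics.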
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...) (confluent)
Apache Kafka has become the central point of the modern fast, scalable streaming platform. Thanks to the open source explosion over the last decade, there are now numerous data stores available as sinks for Kafka-brokered data, from search engines to document stores, columnar DBs, time series DBs and more. While many claim to be the Swiss Army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover a taxonomy of data sinks and delve into each category's pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
Monitoring Apache Kafka with Confluent Control Center (confluent)
Presentation by Nick Dearden, Director, Product and Engineering, Confluent
It’s 3 am. Do you know how your Kafka cluster is doing?
With over 150 metrics to think about, operating a Kafka cluster can be daunting, particularly as a deployment grows. Confluent Control Center is the only complete monitoring and administration product for Apache Kafka and is designed specifically to make the Kafka operator's life easier.
Join Confluent as we cover how Control Center is used to simplify deployment, improve operability, and ensure message delivery.
Watch the recording: https://www.confluent.io/online-talk/monitoring-and-alerting-apache-kafka-with-confluent-control-center/
0-60: Tesla's Streaming Data Platform (Jesse Yates, Tesla), Kafka Summit SF 2019 (confluent)
Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming (Guozhang Wang)
Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. Spark Streaming solves the real-time data processing problem, but to build large-scale data pipelines we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
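To give a concrete sense of what Kafka Connect involves, here is a minimal standalone source-connector sketch using the FileStreamSource connector that ships with Apache Kafka; the file path and topic name are arbitrary examples.

    # file-source.properties - a minimal Kafka Connect standalone source sketch
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    # File to tail, and the Kafka topic to write its lines to (example values)
    file=/tmp/app.log
    topic=connect-test

Started with bin/connect-standalone.sh config/connect-standalone.properties file-source.properties, this streams each new line of the file into the connect-test topic, where Spark Streaming (or any other consumer) can pick it up.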
Kafka makes so many things easier to do, from managing metrics to processing streams of data. Yet it seems that so many of the things we have done to this point in configuring and managing it have been object lessons in how to make our lives, as the plumbers who keep the data flowing, more difficult than they have to be. What are some of our favorites?
* Kafka without access controls
* Multitenant clusters with no capacity controls
* Worrying about message schemas
* MirrorMaker inefficiencies
* Hope and pray log compaction
* Configurations as shared secrets
* One-way upgrades
We’ve made a lot of progress over the last few years improving the situation, in part by focusing some of this incredibly talented community's attention on operational concerns. We’ll talk about the big mistakes you can avoid when setting up multi-tenant Kafka, and some that you still can’t. And we will talk about how to continue down the path of marrying hot new features with operational stability so we can all continue to come back here every year to talk about it.
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen first hand the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
The Many Faces of Apache Kafka: Leveraging real-time data at scale (Neha Narkhede)
Since it was open sourced, Apache Kafka has been adopted very widely, from web companies like Uber, Netflix and LinkedIn to more traditional enterprises like Cerner, Goldman Sachs and Cisco. At these companies, Kafka is used in a variety of ways: as a pipeline for collecting high-volume log data for load into Hadoop, a means of collecting operational metrics to feed monitoring and alerting applications, for low-latency messaging use cases, and to power near-real-time stream processing.
Data Pipelines Made Simple with Apache Kafka (confluent)
Presentation by Ewen Cheslack-Postava, Engineer, Apache Kafka Committer, Confluent
In streaming workloads, oftentimes the data produced at the source is not useful down the pipeline, or it requires some transformation to get it into usable shape. Similarly, where sensitive data is concerned, filtering of topics is helpful to ensure that the wrong data doesn't get to the wrong place.
The newest release of Apache Kafka now offers the ability to do transformations on individual messages, making it possible to implement finer-grained transformations customized to your unique needs. In this session we’ll talk about the new single message transform capabilities, how to use them to implement things like data masking and advanced partitioning, and when you’ll need to use more complex tools like the Kafka Streams API instead.
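As a small illustration of the single message transforms mentioned above, the following connector-configuration fragment applies Kafka's built-in MaskField transform to blank out sensitive fields; the field names are hypothetical.

    # Fragment of a connector config: mask sensitive fields in each record's value
    transforms=maskPii
    transforms.maskPii.type=org.apache.kafka.connect.transforms.MaskField$Value
    # Hypothetical field names to replace with empty/null-equivalent values
    transforms.maskPii.fields=ssn,credit_card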
The Future of ETL Isn't What It Used to Be (confluent)
Speaker: Gwen Shapira, Principal Data Architect, Confluent
Join Gwen Shapira, Apache Kafka® committer and co-author of "Kafka: The Definitive Guide," as she presents core patterns of modern data engineering and explains how you can use microservices, event streams and a streaming platform like Apache Kafka to build scalable and reliable data pipelines designed to evolve over time.
This is part 1 of 3 in Streaming ETL - The New Data Integration series.
Watch the recording: https://videos.confluent.io/watch/q7roRtNZBnjiT9C3ii88fo?.
PostgreSQL + Kafka: The Delight of Change Data Capture (Jeff Klukas)
PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
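Bottled Water uses its own logical decoding plugin, but the underlying PostgreSQL mechanism can be sketched with the built-in test_decoding plugin (PostgreSQL 9.4+ with wal_level=logical); the accounts table here is a hypothetical example.

    -- Create a logical replication slot using the built-in test_decoding plugin
    SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

    -- Make some changes to a (hypothetical) table...
    INSERT INTO accounts (id, balance) VALUES (1, 100);
    UPDATE accounts SET balance = 150 WHERE id = 1;
    DELETE FROM accounts WHERE id = 1;

    -- ...and read them back as an ordered stream of changes
    SELECT * FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);

    -- Clean up the slot when done
    SELECT pg_drop_replication_slot('cdc_demo');

A CDC pipeline such as Bottled Water consumes this change stream continuously and publishes each change to Kafka, keyed by the row's primary key.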
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platform (confluent)
Many enterprises have a large technical debt in legacy applications hosted in on-premises data centers. There is a strong desire to modernize and move to a cloud-based infrastructure, but the world won’t stop for you to transition. Existing applications need to be supported and enhanced; data from legacy platforms is required to make decisions that drive the business. On the other hand, data from cloud-based applications does not exist in a vacuum. Legacy applications need access to these cloud data sources and vice versa.
Can an enterprise have it both ways? Can new applications be built in the cloud while existing applications are maintained in a private data center?
Monsanto has adopted a cloud-first mentality—today most new development is focused on the cloud. However, this transition did not happen overnight.
Chrix Finne and Bob Lehmann share their experience building and implementing a Kafka-based cross-data-center streaming platform to facilitate the move to the cloud—in the process, kick-starting Monsanto’s transition from batch to stream processing. Details include an overview of the challenges involved in transitioning to the cloud and a deep dive into the cross-data-center stream platform architecture, including best practices for running this architecture in production and a summary of the benefits seen after deploying this architecture.
Espresso Database Replication with Kafka (Tom Quiggle, confluent)
The initial deployment of Espresso relies on MySQL’s built-in mechanism for Master-Slave replication. Storage hosts running MySQL masters service HTTP requests to store and retrieve documents, while hosts running slave replicas remain mostly idle. Since replication is at the MySQL instance level, masters and slaves must contain the exact same partitions – precluding flexible and dynamic partition placement and migration within the cluster.
Espresso is migrating to a new deployment topology where each Storage Node may host a combination of master and slave partitions; thus distributing the application requests equally across all available hardware resources. This topology requires per-partition replication between master and slave nodes. Kafka will be used as the transport for replication between partitions.
For use as the replication stream for the source-of-truth data store for LinkedIn’s most valuable data, Kafka must be as reliable as MySQL replication. The session will cover Kafka configuration options to ensure highly reliable, in-order message delivery. Additionally, the application logic maintains state both within the Kafka event stream and externally to detect message re-delivery, out of order delivery, and messages inserted out-of-band. These application protocols to guarantee high fidelity will be discussed.
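As a rough sketch of what "as reliable as MySQL replication" translates to in Kafka terms, these are the settings commonly cited for highly reliable, in-order delivery; they are illustrative rather than Espresso's actual configuration, and exact option names depend on the Kafka version.

    # Producer side: wait for all in-sync replicas, avoid reordering on retry
    acks=all
    enable.idempotence=true
    max.in.flight.requests.per.connection=1

    # Topic/broker side: tolerate a broker failure without losing acknowledged writes
    # (topics created with replication factor 3)
    min.insync.replicas=2
    unclean.leader.election.enable=false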
URP? Excuse You! The Three Metrics You Have to Know (confluent)
(Todd Palino, LinkedIn) Kafka Summit SF 2018
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert in identifying problems with the least amount of pain:
-Under-replicated Partitions: The mother of all metrics
-Request Latencies: Why your users complain
-Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
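For reference, the three concepts above map onto real broker JMX metrics; a sketch of the MBean names a monitoring agent would watch (request latencies shown for produce and consumer fetch):

    # Under-replicated partitions (should be 0 in steady state)
    kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

    # Request latencies, broken down per request type
    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer

    # Thread pool utilization (average idle percentage; worry as it approaches 0)
    kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
    kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent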
Common Patterns of Multi Data-Center Architectures with Apache Kafka (confluent)
Whether you know you want to run Apache Kafka in multiple data centers and need practical advice or you are wondering why some organizations even need more than one cluster, this online talk is for you.
In this short session, we’ll discuss the basic patterns of multi-datacenter Kafka architectures, explore some of the use-cases enabled by each architecture and show how Confluent Enterprise products make these patterns easy to implement.
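As one concrete example, the simplest cross-datacenter pattern (an aggregate or disaster-recovery cluster fed from a source cluster) has classically been run with Kafka's MirrorMaker tool; a minimal sketch, with the two .properties files and the topic pattern as placeholders (newer Kafka releases supersede this tool with MirrorMaker 2):

    # Mirror all topics from a source cluster to a target cluster
    bin/kafka-mirror-maker.sh \
      --consumer.config source-cluster.properties \
      --producer.config target-cluster.properties \
      --whitelist ".*"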
Visit www.confluent.io for more information.
Speaker: Jun Rao, Co-founder, Confluent
In 2010, LinkedIn began developing Apache Kafka®. In 2011, Kafka was released as an Apache open source project. Since then, the use of Kafka has grown rapidly in a variety of businesses. Now more than 30% of Fortune 500 companies are already using Kafka.
In this 60-minute online talk, Confluent Co-founder Jun Rao will:
-Explain how Kafka became the predominant publish/subscribe messaging system that it is today
-Introduce Kafka's most recent additions to its set of enterprise-level features
-Demonstrate how to evolve your Kafka implementation into a complete real-time streaming data platform that functions as the central nervous system for your organization
Watch the recording: https://cnfl.io/kafka-past-present-future-on-demand
Building a newsfeed from the Universe: Data streams in astronomy (Maria Patte...) (confluent)
The field of astronomy is rapidly changing away from the traditional notion of a lone astronomer pointing a telescope at a single object in a static sky. Initiatives such as the Sloan Digital Sky Survey have ushered in a collaborative big data era of wide-field sky surveys, in which telescopes collect observations continuously while sweeping across the visible night sky. This method of data collection enables not only very deep imaging of far and faint objects but is also optimal for searching for objects that might be changing or moving. By analyzing the differences in astronomical image data from one night to the next, astronomers can detect "transient" objects, such as variable stars, supernovae, and near-Earth asteroids.
New sky surveys provide a wealth of scientific value for astronomers but not without technical challenges. Survey data need to be automatically processed and the results immediately distributed to the scientific community in order to enable rapid follow-up observations, as transient astronomy can be highly time sensitive. Detection alert data distribution mechanisms need to be robust and reliable to maintain scientific integrity without data loss. Additionally, alerting systems need to be scalable to support a data volume unprecedented in astronomy, as transient detection rates have increased to exceed all historical data in a single night.
A streaming architecture is ideal for automated distribution and processing of transient data in real time as it is being collected. In this talk, we will discuss how Kafka and Avro are being used in wide-field astronomical sky survey pipelines to serialize and distribute transient data, the design choices behind this system, and how this alert stream system has been successfully deployed in production to distribute transient detection alerts to the scientific research community in excess of 1 million events per night.
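To make the Kafka-plus-Avro pairing concrete, here is a hypothetical Avro schema sketch for a transient detection alert; real survey alert schemas are considerably richer, and every field name here is illustrative only.

    {
      "type": "record",
      "name": "TransientAlert",
      "namespace": "example.astronomy",
      "fields": [
        {"name": "alertId",   "type": "long"},
        {"name": "objectId",  "type": "string"},
        {"name": "ra",        "type": "double", "doc": "right ascension, degrees"},
        {"name": "dec",       "type": "double", "doc": "declination, degrees"},
        {"name": "magnitude", "type": "double"},
        {"name": "timestamp", "type": "long",   "doc": "detection time, ms since epoch"}
      ]
    }

Because Avro schemas travel with (or are registered alongside) the data, downstream scientific consumers can evolve independently of the survey pipeline.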
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications (confluent)
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines.
In this presentation we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually be misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps (a minimal producer sketch follows this list)
5. Playing with Kafka using Wirbelsturm
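As a taste of item 4, here is a minimal Java producer sketch; note that it uses the modern Java client API rather than the 0.8-era API the deck covers, and the broker address and topic name are placeholders.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MinimalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // send() is asynchronous; closing the producer flushes pending records
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "hello, kafka"));
            }
        }
    }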
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Event Driven Architectures with Apache Kafka on Heroku (Heroku)
Apache Kafka is the backbone for building architectures that deal with billions of events a day. Chris Castle, Developer Advocate, will show you where it might fit in your roadmap.
- What Apache Kafka is and how to use it on Heroku
- How Kafka enables you to model your data as immutable streams of events, introducing greater parallelism into your applications
- How you can use it to solve scale problems across your stack such as managing high throughput inbound events and building data pipelines
Learn more at https://www.heroku.com/kafka
Reveal.js version of slides: http://slides.com/christophercastle/deck#/
Confluent building a real-time streaming platform using kafka streams and k... (Thomas Alex)
Jeremy Custenborder from Confluent talked about how Kafka brings an event-centric approach to building streaming applications, and how to use Kafka Connect and Kafka Streams to build them.
Apache Kafka® Delivers a Single Source of Truth for The New York Times (confluent)
With 3.6 million paid print and digital subscriptions, how did The New York Times remain a leader in an evolving industry that once relied on print? It fundamentally changed its infrastructure at the core to keep up with the new expectations of the digital age and its consumers. Now every piece of content ever published by The New York Times throughout the past 166 years and counting is stored in Apache Kafka.
Join The New York Times' Director of Engineering Boerge Svingen to learn how the innovative news giant of America transformed the way it sources content while still maintaining searchability, accuracy and accessibility through a variety of applications and services—all through the power of a real-time streaming platform.
In this talk, Boerge will:
-Provide an overview of what the publishing infrastructure used to look like
-Deep dive into the log-based architecture of The New York Times’ Publishing Pipeline
-Explain the schema, monolog and skinny log used for storing articles
-Share challenges and lessons learned
-Answer live questions submitted by the audience
Watch the recording: https://videos.confluent.io/watch/SURnGMNNzsvDHYCmnCkJEY?
Modern data systems don’t just process massive amounts of data; they need to do it very fast. Using fraud detection as a convenient example, this session will include best practices on how to build real-time data processing applications using Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework and even talk about deployment options.
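To illustrate how little code basic real-time processing can take, here is a minimal Kafka Streams sketch that flags large transactions; the topic names, the value encoding and the threshold are hypothetical, and this stands in for whichever framework the talk helps you choose.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class FraudFilter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-filter");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Long().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Key: account id; value: transaction amount in cents (hypothetical encoding)
            KStream<String, Long> transactions = builder.stream("transactions");
            transactions
                .filter((account, amountCents) -> amountCents != null && amountCents > 1_000_000L)
                .to("suspicious-transactions");

            new KafkaStreams(builder.build(), props).start();
        }
    }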
Streaming Data Ingest and Processing with Apache Kafka (Attunity)
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system offering high throughput, reliability and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
This session will go into best practices and detail on how to architect a near real-time application on Hadoop, using an end-to-end fraud detection case study as an example. It will discuss the various options available for ingest, schema design, processing frameworks and storage handlers when architecting this fraud detection application, and walk through each of the architectural decisions among those choices.
Transaction processing systems are generally considered easier to scale than data warehouses. Relational databases were designed for this type of workload, and there are no esoteric hardware requirements. Mostly, it is just a matter of normalizing to the right degree and getting the indexes right. The major challenge in these systems is their extreme concurrency, which means that small temporary slowdowns can escalate to major issues very quickly.
In this presentation, Gwen Shapira will explain how application developers and DBAs can work together to build a scalable and stable OLTP system - using application queues, connection pools and strategic use of caches in different layers of the system.
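As one concrete example of the connection-pool layer, here is a short sketch using HikariCP (an example library chosen for illustration; the JDBC URL, credentials and pool size are placeholders). The point of the design is that a small, fixed-size pool bounds concurrency on the database, which is exactly what protects an extreme-concurrency OLTP system from escalating slowdowns.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.sql.Connection;

    public class PooledAccess {
        public static void main(String[] args) throws Exception {
            // Many application threads share a few database connections,
            // keeping concurrency on the database bounded and predictable.
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://db.example.com:5432/app"); // placeholder
            config.setUsername("app");
            config.setPassword("secret");
            config.setMaximumPoolSize(16); // small relative to the app's thread count

            try (HikariDataSource dataSource = new HikariDataSource(config);
                 Connection conn = dataSource.getConnection()) {
                // Run a short transaction here and return the connection promptly.
            }
        }
    }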
In this presentation, I will explain event driven architecture, describe the different types of events, demonstrate how events can be related and orchestrated, and provide a basic understanding of how this method can drive the architecture of enterprise systems. In addition to understanding the concepts of event driven architecture, we will explore a working sample built using an open-source .NET messaging framework called MassTransit.
This is the slide deck which was used for a talk 'Change Data Capture using Kafka' at Kafka Meetup at Linkedin (Bangalore) held on 11th June 2016.
The talk describes the need for CDC and why it's a good use case for Kafka.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
There are a lot of tasks in the Oracle world which would not be possible without a programming language. Shell scripting can be applied to a wide variety of system and database tasks. In my presentation I will share advanced shell scripting techniques from a real-life customer success story: migrating users from an on-premises Oracle Internet Directory (OID) instance to an AWS OID instance. Migration with the standard OID-provided tools was not possible due to specific customer requirements, so shell scripting was used to achieve the desired goals. I'll give a deep overview of the issues faced during scripting, the troubleshooting techniques used, scripting performance aspects, and the solutions applied to make efficient user migration possible.
Operational Buddhism: Building Reliable Services From Unreliable Components -... (Ernie Souhrada)
Operational Buddhism is a philosophy for building cloud-based services by embracing the inherent ephemerality of the servers themselves and designing failure-resilient services. Attachment to servers leads to suffering. Presented at Percona Live 2016.
A presentation discussing how to deploy big data solutions, and the difference between structured reporting systems, which feed business processes, and the data science systems which do cool stuff.
[AIIM17] It’s Harvest Time in the Information Garden - Dan Antion (AIIM International)
We’ve been collecting information for many years, driven by the usual suspects: compliance and fear. Now it’s time to take advantage of the information we’ve gathered by shifting our focus from the people who felt they had to keep it to the people who can actually use it. In short, it’s time to reap the benefits of the hard work we have already done. Learn how American Nuclear Insurers is using their information today, the process that got them there, and the technology it took to make it happen.
Learn about the current state of Information Management in AIIM’s latest report: http://info.aiim.org/2017-state-of-information-management
A presentation on the Cold Chain, Cloud Computing, and Validation from the DicksonOne Product Manager, Matt McNamara, presented at the Cold Chain Conference in Dubai. Matt walks you through the different advantages to using cloud computing in the cold chain, and how new technology currently complies with regulatory agencies such as the FDA. Questions? Send us an email at content@dicksondata.com, and for more content on temperature, visit our blog at www.blog.dicksondata.com. Enjoy!
Soft-Shake 2013: Enabling Realtime Queries to End Users (Benoit Perroud)
Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major drawbacks: query latency and data freshness.
At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style – from simple key/value lookups to complex queries.
Enhancing the BigData stack with real time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind.
There is a lot of traction today in this area, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that enables real-time queries on Internet-scale data sets. After discussing the evolution of common Hadoop platform deployments, a hybrid approach called the lambda architecture will be proposed and demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact together.
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Virtual Desktops on AWS by Mike Burke, Farm Credit Canada (TriNimbus)
Mike Burke, Velocity Enablement Manager at Farm Credit Canada presented at the Canadian Executive Cloud & DevOps Summit on June 9, 2017 in Toronto, ON hosted by TriNimbus Technologies.
Presentation from the Atlassian User Group Hamburg, 6.6.2012.
The topic was migration from MediaWiki and the rollout of Confluence in a complex environment with a lot of content.
IBM Aspera for High Speed Data Migration to Your AWS Cloud - DEM06-S - Anahei... (Amazon Web Services)
While the cloud offers many benefits, moving TBs and PBs of data to the cloud can be challenging. Traditional software transfer technologies are slow and unreliable, and shipping physical storage disks is time consuming and exposes data to unnecessary security risks. IBM Aspera offers high-speed data transfer that uses the public internet to securely and reliably migrate large amounts of data from your existing environment to AWS. Learn how IBM Aspera on Cloud can dramatically reduce migration windows, help lower the costs of migration, and eliminate the risks associated with physical disk shipment. This presentation is brought to you by AWS partner, IBM.
Connecting Akka with Oracle Event Hub Cloud Service (Dalibor Blazevic)
Presentation explains a Reactive architecture based on the Akka and Kafka technologies. It includes a GitHub demo that implements the corresponding architecture.
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of business... (Looker)
Enterprise companies are struggling to manage increasing demands for data with legacy BI tools. By centralizing their data in Vertica, SnagAJob, an online marketplace for hourly jobs with over 60 million users, can now use Looker to create a single source of truth and put data in the hands of decision-makers across the company.
Delivering a Campus Research Data Service with Globus (Ian Foster)
Keynote talk at the 2014 GlobusWorld conference (www.globusworld.org). Reviews science success stories, new features introduced over the past year, status of adoption, and our sustainability plans. Previewed our new publication service.
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y... (confluent)
(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases. Data is ingested from a wide variety of sources and the data can move effortlessly between an on premise datacenter, AWS and Google Cloud. The DataHub has evolved continuously over time to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding of new data sources is relatively painless and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
Exploring the problem of microservices communication and how both Kafka and service mesh solutions address it. We then look at some approaches for combining the two.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Transform Your Communication with Cloud-Based IVR Solutions (TheSMSPoint)
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
Artificial Intelligence and XPath Extension Functions (Octavian Nadolu)
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Do you want Software for your Business? Visit Deuglo
Deuglo has top software developers in India. They are experts in software development and help design and create custom software solutions.
Deuglo follows a seven-step method for delivering its services to its customers, called the Software Development Life Cycle (SDLC) process.
Requirement — Collecting the requirements is the first phase in the SDLC process.
Feasibility Study — after the requirements are collected, they assess whether the project is feasible before moving to the design phase.
Design — in this phase, they start designing the software.
Coding — when designing is completed, the developers start coding for the software.
Testing — in this phase when the coding of the software is done the testing team will start testing.
Installation — after completion of testing, the application opens to the live server and launches!
Maintenance — after completing the software development, customers start using the software.
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
AI Genie Review: World’s First Open AI WordPress Website CreatorGoogle
AI Genie Review: World’s First Open AI WordPress Website Creator
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-genie-review
AI Genie Review: Key Features
✅Creates Limitless Real-Time Unique Content, auto-publishing Posts, Pages & Images directly from Chat GPT & Open AI on WordPress in any Niche
✅First & Only Google Bard Approved Software That Publishes 100% Original, SEO Friendly Content using Open AI
✅Publish Automated Posts and Pages using AI Genie directly on Your website
✅50 DFY Websites Included Without Adding Any Images, Content Or Doing Anything Yourself
✅Integrated Chat GPT Bot gives Instant Answers on Your Website to Visitors
✅Just Enter the title, and your Content for Pages and Posts will be ready on your website
✅Automatically insert visually appealing images into posts based on keywords and titles.
✅Choose the temperature of the content and control its randomness.
✅Control the length of the content to be generated.
✅Never Worry About Paying Huge Money Monthly To Top Content Creation Platforms
✅100% Easy-to-Use, Newbie-Friendly Technology
✅30-Days Money-Back Guarantee
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
#AIGenieApp #AIGenieBonus #AIGenieBonuses #AIGenieDemo #AIGenieDownload #AIGenieLegit #AIGenieLiveDemo #AIGenieOTO #AIGeniePreview #AIGenieReview #AIGenieReviewandBonus #AIGenieScamorLegit #AIGenieSoftware #AIGenieUpgrades #AIGenieUpsells #HowDoesAlGenie #HowtoBuyAIGenie #HowtoMakeMoneywithAIGenie #MakeMoneyOnline #MakeMoneywithAIGenie
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Launch Your Streaming Platforms in MinutesRoshan Dwivedi
The claim of launching a streaming platform in minutes might be a bit of an exaggeration, but there are services that can significantly streamline the process. Here's a breakdown:
Pros of Speedy Streaming Platform Launch Services:
No coding required: These services often use drag-and-drop interfaces or pre-built templates, eliminating the need for programming knowledge.
Faster setup: Compared to building from scratch, these platforms can get you up and running much quicker.
All-in-one solutions: Many services offer features like content management systems (CMS), video players, and monetization tools, reducing the need for multiple integrations.
Things to Consider:
Limited customization: These platforms may offer less flexibility in design and functionality compared to custom-built solutions.
Scalability: As your audience grows, you might need to upgrade to a more robust platform or encounter limitations with the "quick launch" option.
Features: Carefully evaluate which features are included and if they meet your specific needs (e.g., live streaming, subscription options).
Examples of Services for Launching Streaming Platforms:
Muvi [muvi com]
Uscreen [usencreen tv]
Alternatives to Consider:
Existing Streaming platforms: Platforms like YouTube or Twitch might be suitable for basic streaming needs, though monetization options might be limited.
Custom Development: While more time-consuming, custom development offers the most control and flexibility for your platform.
Overall, launching a streaming platform in minutes might not be entirely realistic, but these services can significantly speed up the process compared to building from scratch. Carefully consider your needs and budget when choosing the best option for you.
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed about the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. It’s here, custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
2. About Gwen
Gwen Shapira – System Architect @Confluent
PMC @ Apache Kafka
Moving data around since 2000
Previously:
• Software Engineer @ Cloudera
• Oracle Database Consultant
Find me:
• gwen@confluent.io
• @gwenshap
3. The Plan
1. What is Data Integration About?
2. How have things changed?
3. What is difficult and important?
4. How do we solve these problems in Kafka?
10. These Things Matter
• Reliability – Losing data is (usually) not OK.
• Exactly Once vs At Least Once
• Timeliness
• Push vs Pull
• High throughput, Varying throughput
• Compression, Parallelism, Back Pressure
• Data Formats
• Flexibility, Structure
• Security
• Error Handling
20. Overview of Connect
1. Install a cluster of Workers
2. Download / Build and install Connector Plugins
3. Use REST API to Start and Configure Connectors
4. Connectors start Tasks. Tasks run inside Workers and copy data.
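To make step 3 concrete, here is a minimal sketch of submitting a connector over the Connect REST API from Java, assuming a worker listening on localhost:8083 and the Confluent JDBC source connector; the connector name, connection URL, and config values are illustrative, not prescribed by the framework.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitConnector {
    public static void main(String[] args) throws Exception {
        // Illustrative connector config: stream rows from a database by timestamp.
        String config = """
            {
              "name": "orders-jdbc-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://db:5432/shop",
                "mode": "timestamp",
                "timestamp.column.name": "updated_at",
                "topic.prefix": "db-",
                "tasks.max": "4"
              }
            }""";
        HttpRequest request = HttpRequest
            .newBuilder(URI.create("http://localhost:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(config))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body()); // 201 Created on success
    }
}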
More types of data stores with specialized functionality – e.g. the rise of NoSQL systems such as document-oriented and columnar stores. A lot more sources of data.
Rise of secondary data stores and indexes – e.g. Elasticsearch for efficient text-based queries, graph DBs for graph-oriented queries, time series databases. A lot more destinations for data, and a lot of transformations along the way to those destinations.
Real-time: data needs to be moved between these systems continuously and at low latency.
Unfortunately, when you build up large, complex data pipelines in an ad hoc fashion, connecting data systems that need copies of the same data with one-off connectors, or building custom connectors for stream processing frameworks to handle different sources and sinks of streaming data, you end up with a giant, unmaintainable mess.
This mess has a huge impact on productivity and agility once you get past just a few systems. Adding any new data storage system or stream processing job requires carefully tracking down all the downstream systems that might be affected, which may require coordinating with dozens of teams and code spread across many repositories. Trying to change one data source’s data format can impact many downstream systems, yet there’s no simple way to discover how these jobs are related.
This is a real problem that we’re seeing across a variety of companies today. We need to do something to simplify this picture. While Confluent is working to build out a number of tools to help with these challenges, today I want to focus on how we can standardize and simplify constructing these data pipelines so that, at a minimum, we reduce operational complexity and make it easier to discover and understand the full data pipeline and dependencies.
One step towards getting to a separation of concerns is being able to decouple the E, T, and L steps. Kafka, when used as shown here, can help us do that.
The vision of Kafka when originally built at LinkedIn was for it to act as a common hub for real-time data.
When streaming data from data stores like an RDBMS or K/V store, we produce data into Kafka, making it available to as many downstream consumers as want it.
Saving data to other systems, like secondary indexes and batch storage systems, is implemented with consumers.
Stream processing frameworks and custom consumer apps fit in by being both consumers and producers: reading data from Kafka, transforming it, and then possibly publishing derived data back into Kafka.
Using this model can simplify the problem as we’re now always interacting with Kafka.
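As a minimal sketch of the producer side of this hub model (the topic name, key, and bootstrap server are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition, so all events for one user stay ordered;
            // any number of downstream consumers can independently read this topic.
            producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\": \"/home\"}"));
        }
    }
}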
To set some context, I want to just quickly list a few of the features that make it possible for Kafka to handle data at this scale. We’ll come back to many of these properties when looking at Kafka Connect.
At its core, Kafka is a pub/sub messaging system rethought as a distributed commit log.
It is based on an append-only, sequentially accessed log, which results in very high performance when reading and writing data.
It extends this to a *partitioned stream* model for a single logical topic of data, which allows for distribution of data across the brokers and parallelism in both writes and reads. To still provide organization and ordering, it guarantees ordering within each partition and uses keys to determine which partition to put data in.
As part of its append-only approach, it decouples data consumption from data retention policy, e.g. retaining data for 7 days or until we have 1TB in a topic. This both gets rid of individual message acking and allows multiple consumption of the same data, i.e. pub/sub, by simply tracking offsets in the stream.
Because data is split across partitions, we can also parallelize consumption and make it elastically scalable with Kafka’s unique automatically balanced consumer groups.
But what exactly is Kafka?
At a high level, "just" another pub/sub message queue
A few key features make it scale to handle the requirements of a stream data platform
Multiple consumers can read the same data, and can be at different offsets in the log. Consuming data doesn't delete it from the log. Instead, Kafka uses time- or size-based retention: your data will stick around for, e.g., 7 days or until the topic holds 100GB. This retention policy is simple and avoids having to keep accounting info for individual messages.
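As a sketch of what configuring such a retention policy might look like with the AdminClient (topic name, partition count, and sizes are illustrative; note that retention.bytes applies per partition):

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("page-views", 12, (short) 3)
                .configs(Map.of(
                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),       // keep ~7 days...
                    "retention.bytes", String.valueOf(100L * 1024 * 1024 * 1024))); // ...or ~100GB per partition
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}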
Topics are partitioned so they can scale across multiple servers
Partitions are also replicated for fault tolerance
As I mentioned before, Kafka is multi-subscriber: the same topic can be consumed by multiple consumer groups, and each group reads a full copy of the data. Furthermore, every consumer group can have multiple consumer processes distributed over several machines, and Kafka takes care of assigning the partitions of the subscribed topics evenly among the consumer processes in a group, so that at all times every partition of a subscribed topic is being consumed by some process within the group.
In addition to being easy to scale, consumption is also fault tolerant. If one consumer fails, the others automatically rebalance to pick up the load of the failed instance. So it is operationally cheap to consume large amounts of data.
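A minimal sketch of such a consumer group member (group and topic names are illustrative); starting several copies of this process spreads the partitions across them automatically:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics"); // members of the same group share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                // Kafka assigns partitions to group members and rebalances if one fails.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}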
Today, I want to introduce you to Kafka Connect, Kafka’s new large-scale, streaming data import/export tool that drastically simplifies the construction, maintenance, and monitoring of these data pipelines.
Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale.
Goals:
Focus – copying only
Batteries included – framework does all the common stuff so connector developers can focus specifically on details that need to be customized for their system. This covers a lot more than many connector developers realize: beyond managing the producer or consumer, it includes challenges like scalability, recovery from faults and reasoning about delivery guarantees, serialization, connector control, monitoring for ops, and more.
Standardize – configuration, status and connector control, monitoring, etc.
Parallelism, scalability, fault tolerance built-in, without a lot of effort from connector developers or users.
Scale – in two ways. First, scale individual connectors to copy as much data as possible – ingest an entire database rather than one table at a time. Second, scale up to organization-wide data pipelines or down to development, testing, or just copying a single log file into Kafka.
With these goals in mind, let’s explore the design of Kafka Connect to see how it fulfills these.
At its core, Kafka Connect is pretty simple. It has source connectors, which copy data from another system into Kafka, and sink connectors, which copy data from Kafka into a destination system.
Here I’ve shown a couple of examples. The source and sink systems don’t necessarily have to naturally match Kafka’s data model exactly. However, we do need to be able to translate data between the two. For example, we might load data from a database in a source connector. By using a timestamp column associated with each row, we can effectively generate an ordered stream of events that are then produced into Kafka. To store data into HDFS, we might load data from one or more topics in Kafka and then write it in sequence to files in an HDFS directory, rotating files periodically. Although Kafka Connect is designed around streaming data, because Kafka acts as a good buffer between streaming and batch systems, we can use it here to load data into HDFS. Neither of these systems map directly to Kafka’s model, but both can be adapted to the concepts of streams with offsets. More about this in a minute.
The most important design point for Kafka Connect is that one half of a connection is always Kafka – the destination for sources, or the source of data for sink connectors. This allows the framework to handle the common functionality of connectors while maintaining the ability to automatically provide scalability, fault tolerance, and delivery guarantees without requiring a lot of effort from connector developers. This key assumption is what makes it possible for Kafka Connect to get a better set of tradeoffs than the systems I mentioned earlier.
So now, coming back to the model that connectors need to map to. Just as Kafka’s data model enables certain features around scalability, Kafka Connect’s data model can as well.
Kafka Connect requires every connector to map to a “partitioned stream” model. The basic idea is a generalization of Kafka’s data model of topics and partitions. This mapping is defined by the input system for the connector – the source system for source connectors, and Kafka topics for sink connectors -- and has the following:
A set of partitions which divide the whole set of data logically. Unlike Kafka, the number of partitions can potentially be very large and may be more dynamic than we would expect with Kafka.
Each partition contains an ordered sequence of events/messages. Under the hood these are key/value pairs of byte[], but Kafka Connect requires that they can be converted to a generic data API.
Each event/message has a unique offset representing its position in the partition. Since the mapping is determined by the input system, these offsets must be meaningful to that system – these may be quite different from the Kafka offsets you’re used to.
To give a more concrete example, we can revisit the database example from earlier. Previously I only showed a single table, but if we consider the database as a whole, we can apply this model to copy the entire database. We partition by table, delivering each into its own Kafka topic. Each event represents a row that we’ve inserted into the database. The offsets are IDs or timestamps, or even more complex representations like a combination of ID and timestamp. Although there isn’t *actually* a stream for each table, we can effectively construct one by querying the database and ordering results according to specific rules.
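A rough sketch of what a source task for this model might look like, assuming the partition-per-table mapping above; the Row type and queryNewRows() helper are hypothetical stand-ins for real JDBC access:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class TimestampTableSourceTask extends SourceTask {
    private long lastSeenTimestamp = 0L;

    @Override public String version() { return "0.1"; }
    @Override public void start(Map<String, String> config) { }
    @Override public void stop() { }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        // Hypothetical query: SELECT ... WHERE ts > lastSeenTimestamp ORDER BY ts
        for (Row row : queryNewRows(lastSeenTimestamp)) {
            lastSeenTimestamp = row.timestamp();
            records.add(new SourceRecord(
                Map.of("table", "orders"),              // stream partition: the table
                Map.of("timestamp", lastSeenTimestamp), // offset within that partition
                "db-orders",                            // destination Kafka topic
                Schema.STRING_SCHEMA, row.json()));
        }
        return records;
    }

    // Hypothetical row type and query, standing in for real JDBC access.
    record Row(long timestamp, String json) { }
    private List<Row> queryNewRows(long since) { return List.of(); }
}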
As a result of this model, we can see a few properties emerging:
First, we have a built-in concept of parallelism, a requirement for automatically providing scalable data copying. We’re going to be able to distribute processing of partitions across multiple hosts.
Second, this model encourages making copying broad by default – partitioned streams should cover the largest logical collection of data.
Finally, offsets provide an easy way to track which data has been processed and which still needs to be copied. In some cases, mapping from the native data model to streams may not be simple; however, a bit of effort in creating this mapping pays off by providing a common framework and implementation for tracking which data has been copied. Again, we’ll revisit this a bit later, but this allows the framework to handle a lot of the heavy lifting with regards to delivery semantics.
Partitioned streams are the logical data model, but they don’t directly map to physical parallelism, or threads, in Kafka Connect. In the case of the database connector, a direct mapping might seem reasonable. However, some connectors will have a much larger number of partitions that are much finer-grained. For example, consider a connector for collecting metrics data – each metric might be considered its own partition, resulting in tens of thousands of partitions for even a small set of application servers.
However, we do want to exploit the parallelism provided by partitions. Connectors do this by assigning partitions to tasks. Tasks are, simply, threads of control given to the connector code which perform the actual copying of data.
Each connector is given a thread it can use to monitor the input system for the active set of partitions. Remember that this set can be dynamic, so continuous monitoring is sometimes needed to detect changes to the set of partitions. When there are changes, the connector notifies the framework so it can reconfigure the current set of tasks.
Then, each task is given a dedicated thread for processing. The connector assigns a subset of partitions to each task, and the task is the one that actually copies the data for those partitions. Given the assignment, the connector implementer handles reading or writing data for that set of partitions.
And how do we decide how many tasks to generate? That’s up to the user, and it’s the primary way to control the total resources used by the connector. Since each task corresponds to a thread, the user can choose to dynamically increase or decrease the maximum number of tasks the connector may create in order to scale resource usage up or down.
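A sketch of how a connector might divide its partitions among tasks via taskConfigs(); the hard-coded table list and round-robin assignment are illustrative, and the task class refers to the earlier sketch:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

public class TimestampTablesConnector extends SourceConnector {
    // Illustrative: in practice this list would be discovered from the database.
    private final List<String> tables = List.of("orders", "customers", "shipments");

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        int numTasks = Math.min(maxTasks, tables.size());
        // One config map per task; each task gets a round-robin share of the tables.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            StringBuilder mine = new StringBuilder();
            for (int t = i; t < tables.size(); t += numTasks) {
                if (mine.length() > 0) mine.append(',');
                mine.append(tables.get(t));
            }
            configs.add(Map.of("tables", mine.toString()));
        }
        return configs;
    }

    @Override public Class<? extends Task> taskClass() { return TimestampTableSourceTask.class; }
    @Override public void start(Map<String, String> config) { }
    @Override public void stop() { }
    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public String version() { return "0.1"; }
}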
So now we have some set of threads, but where do they actually execute? Kafka Connect has two modes of execution. Standalone mode runs all connectors and tasks in a single worker process: simple to set up and useful for development, testing, or agent-style jobs like tailing a log file on one host, but with no fault tolerance.
In contrast, distributed mode can scale up while providing distribution and fault tolerance.
Recall that each connector or task is a thread, and we’re considering each to be approximately equal in terms of resource usage.
Connectors and tasks are auto-balanced across workers. Failures automatically handled by redistributing work, and you can easily scale the cluster up or down by adding more workers.
Cool implementation note: reuses group membership functionality of consumer groups. Note how if you replace “worker” with “consumer” and “task” with “topic partition”, the things it is doing look largely the same: assigning tasks to workers, detecting when a worker is added or fails, and rebalancing the work. Kafka already provides support for doing a lot of this, so by leveraging the existing implementation and coordinating through Kafka’s group functionality (with internal data stored in Kafka topics), Kafka Connect can provide this functionality in a relatively small code footprint.
Finally, note that Kafka Connect does not own the process management at all. We don’t want to make assumptions about using Mesos, YARN, or any other tool because that would unnecessarily limit Kafka Connect’s usage. Kafka Connect will work out of the box in any of these cluster management systems, or with orchestration tools, or if you just manage your processes with your own tooling.
All of this functionality can be accessed via REST API – submit connectors, see their status, update configs, and so on.
I want to mention two important features that also simplify both connector developers' and users' lives.
The first feature is offset management, which provides standardized data delivery guarantees. Delivery guarantees are rarely provided by other data copying systems; they generally offer some sort of best-effort, but unreliable, delivery. Ironically, stream processing frameworks often do a better job than tools specifically designed for data copying.
Kafka Connect handles offset checkpointing for connectors, and this fits in as a natural extension of Kafka's offset commit functionality. For sources, this works with offsets that have complex structure (e.g. timestamps + autoincrementing IDs in a database) and requires no implementation support from the connector beyond defining the offsets and being able to start reading from a saved offset. For sinks, we can leverage Kafka's existing offset functionality, but in order to ensure data is completely written, sinks must also support a flush operation. Commits are processed automatically and periodically. By default, this mode of managing offsets provides at-least-once delivery; internally, both sources and sinks simply flush all data to the output and then commit offsets.
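For example, the start() method stubbed out in the earlier source-task sketch could restore its resume point like this (assuming the same hypothetical Map.of("table", "orders") partition key; context is the SourceTaskContext field the framework provides):

// Inside TimestampTableSourceTask: restore the saved offset at startup.
@Override
public void start(Map<String, String> config) {
    Map<String, Object> saved =
        context.offsetStorageReader().offset(Map.of("table", "orders"));
    if (saved != null) {
        lastSeenTimestamp = (Long) saved.get("timestamp");
    }
    // The framework periodically commits the offsets attached to emitted
    // SourceRecords, so on restart we simply resume from the last committed one.
}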
Note that some connectors will opt out of this functionality in order to provide even stronger guarantees. For example, the HDFS connector manages its own offsets because (carefully) tracking them in HDFS along with the data allows for exactly-once delivery.
The second feature I want to mention is converters. Serialization formats may seem like a minor detail, but not separating the details of data serialization in Kafka from the details of source or sink systems results in a lot of inefficiency:
A lot of code for doing simple data conversions is duplicated across a large number of ad hoc connector implementations.
Each connector ultimately contains its own set of serialization options as it is used in more environments – JSON, Avro, Thrift, protobufs, and more.
Much like the serializers in Kafka’s producer and consumer, the Converters abstract away the details of serialization. Converters are different because they guarantee data is transformed to a common data API defined by Kafka Connect. This API supports both schema and schemaless data, common primitive data types, complex types like structs, and logical type extensions. By sharing this API, connectors write one set of translation code and Converters handle format-specific details. For example, the JDBC connector can easily be used to produce either JSON or Avro to Kafka, without any format-specific code in the connector.
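For instance, a worker configuration along these lines switches the output format without touching connector code (illustrative; the Avro converter and schema registry URL assume the Confluent distribution):

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true

# Or swap in Avro with no connector changes:
# value.converter=io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url=http://schema-registry:8081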
Kafka Connect provides the framework, but I want to spend a few minutes describing the current state of the connector ecosystem. While the framework ships with Apache Kafka, connectors use a federated approach to development. Confluent helped kick off connector development with a few key open source connectors: JDBC, for importing data from any relational database, and HDFS, for exactly-once delivery of data into HDFS and Hive. Confluent will continue to add more open source connectors.
We’ve also started tracking connectors that the community has been developing on a page we’re calling the Connector Hub. We’ve already got a dozen or so connectors, and more are popping up every week. We’ll be working to make this index as useful to users as possible, offering information about the current state of the connector implementations and feature sets.
With all these pieces you can see how we can tie together Kafka and Kafka Connect with stream processing frameworks and applications to not only simplify building these data pipelines and solve data integration challenges, but also transform how your company manages its data pipelines.
Kafka provides the central hub for real-time data and Kafka Connect simplifies operationalization: one service to maintain, common metrics, common monitoring, and agnostic to your choice of process and cluster management.
You can run a centrally managed Kafka Connect cluster in distributed mode, accessed via the REST API, allowing your ops team to provide data integration as a service to your entire organization.
For developers who want to build a complex data pipeline, they can submit jobs to copy data into and out of Kafka – it's zero coding (assuming a connector is available).
Then, they can easily leverage either the traditional clients or stream processing frameworks to transform that data. The output is stored back into another Kafka topic or served up directly.
As a side benefit, standardizing on Kafka encourages reuse of existing data (both raw and transformed). Providing this service not only makes it easy to build your *own* complex data pipeline, it encourages other people in the org to build on top of your existing work.
Confluent Platform also provides additional tools that make this setup even more powerful. For example, the schema registry controls the format of data in each topic, and besides ensuring data quality and compatibility, it also encourages decoupling of teams by allowing anyone to discover what data is in a topic, grab its schema, and immediately start utilizing that data without ever adding coordination overhead with another team.
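As a sketch of that discovery step, assuming the Confluent Schema Registry's REST API and its default "<topic>-value" subject naming (host and topic names are illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchLatestSchema {
    public static void main(String[] args) throws Exception {
        // Look up the latest value schema registered for the "page-views" topic.
        HttpRequest request = HttpRequest.newBuilder(
            URI.create("http://schema-registry:8081/subjects/page-views-value/versions/latest"))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with the schema, its version, and id
    }
}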
A stream data platform built around Kafka and Kafka Connect allows you to scale to handle your entire organization’s real-time data, while maintaining simple management and easy operationalization of your data pipeline.