Eric Frenkiel, MemSQL CEO and co-founder and Gartner Catalyst. August 11, 2015, San Diego, CA. Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
Next CERN Accelerator Logging Service with Jakub Wozniak - Spark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
(Celia Kung, LinkedIn) Kafka Summit SF 2018
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address such issues, we have developed a new mirroring solution, built on top of our stream ingestion service, Brooklin. Brooklin MirrorMaker aims to provide improved performance and stability, while facilitating better management through finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin MirrorMaker and our plans for iterating further on this new mirroring solution.
Introduction to Apache Kafka and Confluent... and why they matter - confluent
Milano Apache Kafka Meetup by Confluent (First Italian Kafka Meetup) on Wednesday, November 29th 2017.
The talk introduces Apache Kafka (including the Kafka Connect and Kafka Streams APIs) and Confluent (the company founded by the creators of Kafka), and explains why Kafka is an excellent and simple solution for managing data streams in the context of two of the main driving forces and industry trends: the Internet of Things (IoT) and microservices.
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021 - StreamNative
Apache InLong is a one-stop data streaming platform that chooses Apache Pulsar to cache data for forwarding sort. Apache Pulsar's reliability and stability give InLong users more confidence in the platform.
This session will share the Tencent Big Data Team's journey of adopting Pulsar in their core data engine to process tens of billions of data integration. The problems they encountered along the way, and the improvements they made to Pulsar, will also be shared as an example for future Pulsar users.
Kappa Architecture on Apache Kafka and Querona: datamass.io - Piotr Czarnas
Kappa architecture for event processing using Apache Kafka and Querona for managing data, joining external data sources and empowering data science teams.
Spark Operator: Deploy, Manage and Monitor Spark clusters on Kubernetes - Databricks
Have you ever wondered how to implement your own operator pattern for your service X in Kubernetes? You can learn this in this session and see an example of an open-source project that spawns Apache Spark clusters on Kubernetes and OpenShift following the pattern. You will leave this talk with a better understanding of how the spark-on-k8s native scheduling mechanism can be leveraged and how you can wrap your own service into the operator pattern, not only in Go but also in Java. The pod with the Spark operator and optionally the Spark clusters expose the metrics for Prometheus so it makes it eas
Presentation on the struggles with traditional architectures and an overview of the Lambda Architecture utilizing Spark to drive massive amounts of both batch and streaming data for processing and analytics
Strata Singapore 2017 business use case section
"Big Telco Real-Time Network Analytics"
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62797
Storage Requirements and Options for Running Spark on Kubernetes - DataWorks Summit
In a world of serverless computing, users tend to be frugal when it comes to spending on compute, storage and other resources, and paying for resources that aren’t in use becomes a significant factor. Offering Spark as a service in the cloud presents unique challenges. Running Spark on Kubernetes is challenging, especially around storage and persistence. Spark workloads have distinctive storage requirements for intermediate data, long-term persistence and shared file systems, and these requirements become very tight when the same workloads must be offered as a service to enterprises that need to manage GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially in scenarios related to persistence.
This talk will help people running Kubernetes or a Docker runtime in production understand the various storage options available, which of them are more suitable for running Spark workloads on Kubernetes, and what more can be done.
This talk will address new architectures emerging for large scale streaming analytics. Some are based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK), and others on newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Streaming Analytics with Spark, Kafka, Cassandra and Akka - Helena Edelson
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Modern ETL Pipelines with Change Data Capture - Databricks
In this talk we’ll present how, at GetYourGuide, we built a completely new ETL pipeline from scratch using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
We’ll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data, and protecting our nights of sleep.
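As a rough illustration of the change-data-capture idea described in this abstract, the sketch below replays a stream of change events into an in-memory copy of a table and tolerates a column that appears mid-stream. It is not the GetYourGuide pipeline: the event shape is a simplified, hypothetical stand-in (only `op`, `before` and `after` are modelled, with `c`/`u`/`d` operation codes), and `apply_change`, `columns` and the sample rows are invented for the example.

```python
from typing import Any, Dict, List

# Hypothetical, simplified change event. A real CDC envelope carries more
# fields; only "op", "before" and "after" are modelled here.
ChangeEvent = Dict[str, Any]
Table = Dict[int, Dict[str, Any]]


def apply_change(table: Table, event: ChangeEvent, key: str = "id") -> None:
    """Apply one create/update/delete event to an in-memory copy of a table."""
    op = event["op"]
    if op in ("c", "u"):
        # Create or update: upsert the new row image, whatever columns it has.
        row = event["after"]
        table[row[key]] = dict(row)
    elif op == "d":
        # Delete: drop the row identified by the old row image.
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def columns(table: Table) -> List[str]:
    """Union of all column names seen so far, so added columns are picked up."""
    return sorted({name for row in table.values() for name in row})


events: List[ChangeEvent] = [
    {"op": "c", "before": None, "after": {"id": 1, "city": "Berlin"}},
    {"op": "c", "before": None, "after": {"id": 2, "city": "Zurich"}},
    # Upstream schema change: a new "rating" column appears mid-stream.
    {"op": "u", "before": {"id": 1, "city": "Berlin"},
     "after": {"id": 1, "city": "Berlin", "rating": 5}},
    {"op": "d", "before": {"id": 2, "city": "Zurich"}, "after": None},
]

replica: Table = {}
for event in events:
    apply_change(replica, event)
```

Because rows are stored as plain dictionaries, the added column needs no migration step in this toy version; a real pipeline would also have to handle type changes, dropped columns and ordering guarantees.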
Since 2014, Typesafe has been actively contributing to the Apache Spark project, and has become a certified development support partner of Databricks, the company started by the creators of Spark. Typesafe and Mesosphere have forged a partnership in which Typesafe is the official commercial support provider of Spark on Apache Mesos, along with Mesosphere’s Datacenter Operating System (DCOS).
In this webinar with Iulian Dragos, Spark team lead at Typesafe Inc., we reveal how Typesafe supports running Spark in various deployment modes, along with the improvements we made to Spark to help integrate backpressure signals into the underlying technologies, making it a better fit for Reactive Streams. He also shows you the functionalities at work, and how to make it simple to deploy to Spark on Mesos with Typesafe.
We will introduce:
* Various deployment modes for Spark: Standalone, Spark on Mesos, and Spark with Mesosphere DCOS
* Overview of Mesos and how it relates to Mesosphere DCOS
* Deeper look at how Spark runs on Mesos
* How to manage coarse-grained and fine-grained scheduling modes on Mesos
* What to know about a client vs. cluster deployment
* A demo running Spark on Mesos
SMACK Stack 1.0 has been Spark, Mesos, Akka, Cassandra and Kafka working into different cohesive systems delivering different solutions for different use cases. Haven't heard about it before? Oh man! Where have you been? https://www.google.com/search?q=smack+stack+1.0
With SMACK Stack 1.1 we go a step further: Streaming, Mesos, Analytics, Cassandra and Kafka. Joe Stein will walk through in detail some of the different viable options for Streaming and Analytics with Mesos, Kafka and Cassandra.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... - Helena Edelson
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real-time streaming data for predictive modeling in a lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as Akka actors...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str... - Lightbend
In this webinar, Dustin Lyons, Engineering Manager at Credit Karma, discusses how, not long ago, his team was facing a challenge shared by many financial services architects and engineering leaders: not only how to move from offline, batch-mode processing of Big Data to streaming Fast Data, but also how to enable real-time decision making based on the information flowing in from over 60 million members.
Dustin reviews how his team migrated away from PHP and successfully implemented Akka Streams with Apache Kafka to ingest, process and route real-time events throughout their data ecosystem. At the end of this presentation, you’ll better understand:
* The design considerations for new Fast Data architectures, from streaming to microservices to real-time analysis.
* Some lessons learned when it comes to progressing from batch to streaming using Akka, Spark and Kafka
* Why Akka’s self-healing actor model and the resilience that it provides is actually what matters most when delivering real-time customer experiences
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines - confluent
ETL can be painful with dirty data and outdated batch processes slowing you down; there has to be a better way. In this talk we’ll discuss the benefits of introducing a streaming platform to your architecture including how it can greatly simplify complexity, speed up performance, and help your team deliver the features they need with real-time data integration.
Pandora’s Lawrence Weikum will discuss what they’ve done to bring real-time data integration to the team. We’ll review their Kafka-powered data pipelines and how they make the most of Kafka’s Connect API to make it surprisingly simple to keep systems in sync.
Presented by:
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing Manager, Confluent
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha... - Data Con LA
While the last few years have seen great advancements in computing paradigms for big data stores, there remains one critical bottleneck in this architecture - the ingestion process. Instead of immediate insights into the data, a poor ingestion process can cause headaches and problems to no end. On the other hand, a well-designed ingestion infrastructure should give you real-time visibility into how your systems are functioning at any given time. This can significantly increase the overall effectiveness of your ad-campaigns, fraud-detection systems, preventive-maintenance systems, or other critical applications underpinning your business.
In this session we will explore various modes of ingest, including pipelining, pub-sub, and micro-batching, and identify the use cases where these can be applied. We will present this in the context of open source frameworks, such as Apache Flume and Kafka, among others, that can be used to build related solutions. We will also present when and how to use multiple modes and frameworks together to form hybrid solutions that can address non-trivial ingest requirements with little or no extra overhead. Through this discussion we will drill down into details of configuration and sizing for these frameworks to ensure optimal operations and utilization for long-running deployments.
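To make one of the ingest modes named above concrete, here is a minimal sketch of micro-batching: records from a stream are grouped into fixed-size batches before being handed downstream. It is an illustration only, not taken from the talk or from Flume or Kafka; `micro_batches` and `max_size` are hypothetical names, and a production ingester would typically also flush on a time limit so that a slow stream does not hold records back indefinitely.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], max_size: int) -> Iterator[List[T]]:
    """Group a stream of records into lists of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == max_size:
            # The batch is full: hand it downstream and start a new one.
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch when the stream ends.
        yield batch


# Seven records with a batch size of three give two full batches and one
# partial batch.
batches = list(micro_batches(range(7), max_size=3))
```

The batch size trades latency for per-write overhead: larger batches mean fewer downstream writes, but each record waits longer before it becomes visible.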
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Reactive app using actor model & apache spark - Rahul Kumar
Developing applications with Big Data is really challenging work; scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that provides a good backend for real-time applications. In this talk I will show how to use the reactive platform, the Actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault-tolerant and message-driven.
How to develop Big Data Pipelines for Hadoop, by Costin Leau - Codemotion
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. In this session we will demonstrate how the open source Spring Batch, Spring Integration and Spring Hadoop projects can be used to build manageable and robust pipeline solutions to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and analysis.
Breakout: Hadoop and the Operational Data Store - Cloudera, Inc.
As disparate data volumes continue to be operationalized across the enterprise, data will need to be processed, cleansed, transformed, and made available to end users at greater speeds. Traditional ODS systems run into issues when trying to process large data volumes causing operations to be backed up, data to be archived, and ETL/ ELT processes to fail. Join this breakout to learn how to battle these issues.
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
Kappa architecture for event processing using Apache Kafka and Querona for managing data, joining external data sources and empowering data science teams.
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
Have you ever wondered how to implement your own operator pattern for you service X in Kubernetes? You can learn this in this session and see an example of open-source project that does spawn Apache Spark clusters on Kubernetes and OpenShift following the pattern. You will leave this talk with a better understanding of how spark-on-k8s native scheduling mechanism can be leveraged and how you can wrap your own service into operator pattern not only in Go lang but also in Java. The pod with spark operator and optionally the spark clusters expose the metrics for Prometheus so it makes it eas
Presentation on the struggles with traditional architectures and an overview of the Lambda Architecture utilizing Spark to drive massive amounts of both batch and streaming data for processing and analytics
Strata Singapore 2017 business use case section
"Big Telco Real-Time Network Analytics"
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62797
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
In a world of serverless computing users tend to be frugal when it comes to expenditure on compute, storage and other resources. Paying for the same when they aren’t in use becomes a significant factor. Offering Spark as service on cloud presents very unique challenges. Running Spark on Kubernetes presents a lot of challenges especially around storage and persistence. Spark workloads have very unique requirements of Storage for intermediate data, long time persistence, Share file system and requirements become very tight when it same need to be offered as a service for enterprise to mange GDPR and other compliance like ISO 27001 and HIPAA certifications.
This talk covers challenges involved in providing Serverless Spark Clusters share the specific issues one can encounter when running large Kubernetes clusters in production especially covering the scenarios related to persistence.
This talk will help people using Kubernetes or docker runtime in production and help them understand various storage options available and which is more suitable for running Spark workloads on Kubernetes and what more can be done
This talk will address new architectures emerging for large scale streaming analytics. Some based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK) and other newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architecture like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architecture like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (i.e. ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Modern ETL Pipelines with Change Data CaptureDatabricks
In this talk we’ll present how at GetYourGuide we’ve built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error prone legacy system that ran daily, and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. As most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is done usually through either full or partial copies of the data with tools such as sqoop. However another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer which reads databases binlogs, and stream these changes directly to Kafka. As having data once a day is not enough anymore for our bussiness, and we wanted our pipelines to be resilent to upstream schema changes, we’ve decided to rebuild our ETL using Debezium.
We’ll walk the audience through the steps we followed to architect and develop such solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data, and protecting our nights of sleep.
Since 2014, Typesafe has been actively contributing to the Apache Spark project, and has become a certified development support partner of Databricks, the company started by the creators of Spark. Typesafe and Mesosphere have forged a partnership in which Typesafe is the official commercial support provider of Spark on Apache Mesos, along with Mesosphere’s Datacenter Operating Systems (DCOS).
In this webinar with Iulian Dragos, Spark team lead at Typesafe Inc., we reveal how Typesafe supports running Spark in various deployment modes, along with the improvements we made to Spark to help integrate backpressure signals into the underlying technologies, making it a better fit for Reactive Streams. He also show you the functionalities at work, and how to make it simple to deploy to Spark on Mesos with Typesafe.
We will introduce:
Various deployment modes for Spark: Standalone, Spark on Mesos, and Spark with Mesosphere DCOS
Overview of Mesos and how it relates to Mesosphere DCOS
Deeper look at how Spark runs on Mesos
How to manage coarse-grained and fine-grained scheduling modes on Mesos
What to know about a client vs. cluster deployment
A demo running Spark on Mesos
SMACK Stack 1.0 has been Spark, Mesos, Akka, Cassandra and Kafka working into different cohesive systems delivering different solutions for different use cases. Haven't heard about it before? Oh man! Where have you been? https://www.google.com/search?q=smack+stack+1.0
SMACK Stack 1.1 we go a step further Streaming, Mesos, Analytics, Cassandra and Kafka and Joe Stein will walk through in detail some of the different viable options for Streaming and Analytics with Mesos, Kafka and Cassandra.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...Lightbend
In this webinar, Engineering Manager at Credit Karma, Dustin Lyons, discusses how not long ago his team was facing a common challenge shared by many financial services architects and engineering leaders: not only how to move from the offline, batch-mode processing of Big Data to streaming, Fast Data, and how to enable real-time decision making based on the information flowing in from over 60 million members.
Dustin reviews how his team migrated away from PHP and successfully implemented Akka Streams with Apache Kafka to ingest, process and route real-time events throughout their data ecosystem. At the end of this presentation, you’ll better understand:
* The design considerations for new Fast Data architectures, from streaming to microservices to real-time analysis.
* Some lessons learned when it comes to progressing from batch to streaming using Akka, Spark and Kafka
* Why Akka’s self-healing actor model and the resilience that it provides is actually what matters most when delivering real-time customer experiences
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelinesconfluent
ETL can be painful with dirty data and outdated batch processes slowing you down; there has to be a better way. In this talk we’ll discuss the benefits of introducing a streaming platform to your architecture including how it can greatly simplify complexity, speed up performance, and help your team deliver the features they need with real-time data integration.
Pandora’s Lawrence Weikum will discuss what they’ve done to bring real-time data integration to the team. We’ll review their Kafka-powered data pipelines and how they make the most of Kafka’s Connect API to make it surprisingly system to keep systems in sync.
Presented by:
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing Manager, Confluent
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Data Con LA
While the last few years have seen great advancements in computing paradigms for big data stores, there remains one critical bottleneck in this architecture - the ingestion process. Instead of immediate insights into the data, a poor ingestion process can cause headaches and problems to no end. On the other hand, a well-designed ingestion infrastructure should give you real-time visibility into how your systems are functioning at any given time. This can significantly increase the overall effectiveness of your ad-campaigns, fraud-detection systems, preventive-maintenance systems, or other critical applications underpinning your business.
In this session we will explore various modes of ingest including pipelining, pub-sub, and micro-batching, and identify the use-cases where these can be applied. We will present this in the context of open source frameworks such as Apache Flume, Kafka, among others that can be used to build related solutions. We will also present when and how to use multiple modes and frameworks together to form hybrid solutions that can address non-trivial ingest requirements with little or no extra overhead. Through this discussion we will drill-down into details of configuration and sizing for these frameworks to ensure optimal operations and utilization for long-running deployments.
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Reactive app using actor model & apache sparkRahul Kumar
Developing applications with Big Data is challenging work; scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that provides a good backend for real-time applications. In this talk I will show how to use a reactive platform, the Actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault tolerant and message driven.
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. In this session we will demonstrate how the open source Spring Batch, Spring Integration and Spring Hadoop projects can be used to build manageable and robust pipeline solutions to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and analysis.
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
As disparate data volumes continue to be operationalized across the enterprise, data will need to be processed, cleansed, transformed, and made available to end users at greater speeds. Traditional ODS systems run into issues when trying to process large data volumes causing operations to be backed up, data to be archived, and ETL/ ELT processes to fail. Join this breakout to learn how to battle these issues.
Key attributes for modern real time streaming processing and interactive analytics
What is so exciting to me about Spark?
What are some of the myths?
What is missing in Spark for real time?
SnappyData’s mission – fuse Spark with in-memory data management in one unified cluster to offer – OLTP + OLAP + Stream processing + Probabilistic data
O'Reilly Media Webcast: Building Real-Time Data PipelinesSingleStore
As our customers tap into new sources of data or modify existing data pipelines, we are often asked questions like: What technologies should we consider? Where can we reduce data latency? How can we simplify our data architecture?
To eliminate the guesswork, we teamed up with Ben Lorica, Chief Data Scientist at O’Reilly Media to host a webcast centered around building real-time data pipelines.
Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets - Pe...Ernie Souhrada
As data volume grows, finding ways to slow the growth velocity becomes more and more important. We want to do everything possible to maximize the efficiency of our hardware before we spend the money on more storage, so one way to do that is with compression. These slides discuss compression theory and compression options in MySQL, ending with some benchmark data that compares column-level compression in InnoDB with other available compression technologies. Presented at Percona Live 2016.
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss and can even do exactly-once processing!
Keynote: The Journey to Pervasive AnalyticsCloudera, Inc.
We are in the middle of a data rush. When you are right in the center of a storm, it can seem overwhelming. Where should I start? What do I need to think about? What is the best long-term bet? But don’t forget that more data should mean great news. More data should mean more insight, more guidance, and more strategic direction. However, more data doesn’t automatically rally your entire business around common goals and insights. You need a platform and architecture that can support a thriving, analytic-driven business culture that embraces a pervasive analytics strategy.
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Big Data Spain
Agora owns dozens of themed, classified, entertainment and social services. There are news and sports portals, forums, advertising services, blogs and many other thematic websites. All sites generate over 400 page views per second (under normal conditions) and considerably more events (such as focus, click and scroll events). This raises one question: how to build user profiles in real time in such a dynamic and changing environment?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-16.html
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, “how can we make the data available to our analytical systems faster?” Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Streaming Data Ingest and Processing with Apache KafkaAttunity
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. It offers higher throughput, reliability and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
Leveraging Mainframe Data for Modern Analyticsconfluent
“The mainframe is going away” is as true now as it was 10, 20 and 30 years ago. Mainframes are still crucial in handling critical business transactions. However, they were built for an era where batch data movement was the norm, and they can be difficult to integrate into today’s data-driven, real-time, analytics-focused business processes as well as the environments that support them. Until now.
Join experts from Confluent, Attunity, and Capgemini for a one-hour online talk session where you’ll learn how to:
Unlock your mainframe data with unique change data capture (CDC) functionality without incurring the complexity and expense that come with sending ongoing queries into the mainframe database
How using CDC benefits advanced analytics approaches such as deep machine learning and predictive analytics
Deliver ongoing streams of data in real-time to the most demanding analytics environments
Ensure that your analytics environment includes the broadest possible range of data sources and destinations while ensuring true enterprise-grade functionality
Identify use cases that can help you get started delivering value to the business moving from POC to Pilot to Production
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsSingleStore
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics: Novus, DigitalOcean, Akamai.
Building Predictive Applications with Real-Time Data Pipelines and Streamliner. Eric Frenkiel, CEO and Co-Founder, MemSQL
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.
Au delà des brokers, un tour de l’environnement Kafka | Florent Ramièreconfluent
During the Confluent Streaming event in Paris, Florent Ramière, Technical Account Manager at Confluent, goes beyond brokers, introducing a whole new ecosystem with Kafka Streams, KSQL, Kafka Connect, Rest proxy, Schema Registry, MirrorMaker, etc.
Putting the Micro into Microservices with Stateful Stream Processingconfluent
How small can a microservice be? This talk will look at how Stateful Stream Processing is used to build truly autonomous, often minuscule services. With the distributed guarantees of Exactly Once Processing, Event Driven Services supported by Apache Kafka become reliable, fast and nimble, blurring the line between business system and big data pipeline.
Introducing Events and Stream Processing into Nationwide Building Society (Ro...confluent
Facing Open Banking regulation, rapidly increasing transaction volumes and increasing customer expectations, Nationwide took the decision to take load off their back-end systems through real-time streaming of data changes into Kafka. Hear about how Nationwide started their journey with Kafka, from their initial use case of creating a real-time data cache using Change Data Capture, Kafka and Microservices to how Kafka allowed them to build a stream processing backbone used to reengineer the entire banking experience including online banking, payment processing and mortgage applications. See a working demo of the system and what happens to the system when the underlying infrastructure breaks. Technologies covered include: Change Data Capture, Kafka (Avro, partitioning and replication) and using KSQL and Kafka Streams Framework to join topics and process data.
Technical Deep Dive: Using Apache Kafka to Optimize Real-Time Analytics in Fi...confluent
Watch this talk here: https://www.confluent.io/online-talks/using-apache-kafka-to-optimize-real-time-analytics-financial-services-iot-applications
When it comes to the fast-paced nature of capital markets and IoT, the ability to analyze data in real time is critical to gaining an edge. It’s not just about the quantity of data you can analyze at once, it’s about the speed, scale, and quality of the data you have at your fingertips.
Modern streaming data technologies like Apache Kafka and the broader Confluent platform can help detect opportunities and threats in real time. They can improve profitability, yield, and performance. Combining Kafka with Panopticon visual analytics provides a powerful foundation for optimizing your operations.
Use cases in capital markets include transaction cost analysis (TCA), risk monitoring, surveillance of trading and trader activity, compliance, and optimizing profitability of electronic trading operations. Use cases in IoT include monitoring manufacturing processes, logistics, and connected vehicle telemetry and geospatial data.
This online talk will include in depth practical demonstrations of how Confluent and Panopticon together support several key applications. You will learn:
-Why Apache Kafka is widely used to improve performance of complex operational systems
-How Confluent and Panopticon open new opportunities to analyze operational data in real time
-How to quickly identify and react immediately to fast-emerging trends, clusters, and anomalies
-How to scale data ingestion and data processing
-Build new analytics dashboards in minutes
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...HostedbyConfluent
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights on their business but also to power smart apps that need to react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern, yet simple data architecture to analyze both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to process data streams coming from Kafka.
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...confluent
PayPal currently processes tens of billions of signals per day from different sources in batch and streaming mode. The data processing platform is the one powering these different analytical needs and use cases, not just at PayPal but our adjacencies like Venmo, Hyperwallet and iZettle. End users of this platform demand access to data insights with as much flexibility as possible to explore it with low processing latency.
One such use case is our Switchboard (data de-multiplexer) platform, where we process approximately 20 billion events daily and provide data to different teams and platforms within PayPal, and also to platforms outside PayPal, for more insights. When we started building this platform, Kafka was just another asynchronous message processing platform for us, but we have seen it evolve to a place where it adds value not just in terms of event processing but also for platform resiliency and scalability.
Takeaway for the audience: Most people work with and have knowledge about data. With this talk I want to present information which is relevant and meaningful to the audience. Information and examples which will make it easier for attendees to understand our complex system and hopefully have some practical takeaways to use Kafka for similar problems on their hand.
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...Michael Noll
Talk URL: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/77360
Abstract: Would you cross the street with traffic information that’s a minute old? Certainly not. Modern businesses have the same needs nowadays, whether it’s due to competitive pressure or because their customers have much higher expectations of how they want to interact with a product or service. At the heart of this movement are events: in today’s digital age, events are everywhere. Every digital action—across online purchases to ride-sharing requests to bank deposits—creates a set of events around transaction amount, transaction time, user location, account balance, and much more. The technology that allows businesses to read, write, store, and compute and process these events in real-time are event-streaming platforms, and tens of thousands of companies like Netflix, Audi, PayPal, Airbnb, Uber, and Pinterest have picked Apache Kafka as the de facto choice to implement event-driven architectures and reshape their industries.
Michael Noll explores why and how you can use Apache Kafka and its growing ecosystem to build event-driven architectures that are elastic, scalable, robust, and fault tolerant, whether it’s on-premises, in the cloud, on bare metal machines, or in Kubernetes with Docker containers. Specifically, you’ll look at Kafka as the storage and publish and subscribe layer; Kafka’s Connect framework for integrating external data systems such as MySQL, Elastic, or S3 with Kafka; and Kafka’s Streams API and KSQL as the compute layer to implement event-driven applications and microservices in Java and Scala and streaming SQL, respectively, that process the events flowing through Kafka in real time. Michael provides an overview of the most relevant functionality, both current and upcoming, and shares best practices and typical use cases so you can tie it all together for your own needs.
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...Flink Forward
Witnessing the rise of stream processing from the driving seat, we see Apache Flink® and associated technologies used for a wide variety of business applications, from routing data through systems, serving as a backbone for real-time analytics on live data using SQL, detecting credit card fraud, to implementing complete end-to-end social networks. Such applications enable modern data-driven businesses where decisions and actions happen in real-time, and transform traditional businesses to become more data-driven. Observing the variety of these applications implemented using Flink, it becomes apparent that the traditional dividing line between analytics and operational applications is becoming more and more blurry. Historically, operational applications were built using transactional databases, and analytics were done offline. In contrast, Flink’s, state, checkpoints, and time management are the core building blocks for both operational applications with strong data consistency needs, and for real-time analytics with correctness guarantees. With these shared building blocks, developers start building what is arguably a new class of data-driven applications: applications that are operational in that they serve live systems and at the same time analytical in that they perform complex data analysis. Following application architectures like CQRS and using new features like Flink’s queryable state, streaming analytics and online applications move even closer to each other. In this talk, guided by real-world use cases, we present how the unique core concepts behind Flink simplify the development, deployment, and management of data-driven applications, and we conclude with a vision for the future for Flink and stream processing.
How Kafka and MemSQL Became the Dynamic Duo (Sarung Tripathi, MemSQL) Kafka S...HostedbyConfluent
Kafka and MemSQL are the perfect combination of speed, scale, and power to take on the world’s most complex operational analytics challenges. In this session, you will learn how Kafka and MemSQL have become the dynamic duo, and how you can use them together to achieve ingest of tens of millions of records per second and enable highly concurrent, real-time analytics. In the last few months, Kafka and MemSQL have been hard at work, devising a plan to take on the world’s next set of streaming data challenges. So stay tuned: there may just be an announcement!
Similar to Real-Time Data Pipelines with Kafka, Spark, and Operational Databases (20)
The database market is large and filled with many solutions. In this talk, Seth Luersen from MemSQL will take a look at what is happening within AWS, the overall data landscape, and how customers can benefit from using MemSQL within the AWS ecosystem.
Converging Database Transactions and Analytics SingleStore
Delivered at the Gartner Data and Analytics 2018 show in Texas. This presentation discusses real-time applications and their impact on existing data infrastructures.
MemSQL 201: Advanced Tips and Tricks WebcastSingleStore
Topics discussed include differences between columnstore and rowstore engines, data ingestion, data sharding and query tuning, and, lastly, memory and workload management.
Watch the replay at https://memsql.wistia.com/medias/4siccvlorm
An Engineering Approach to Database EvaluationsSingleStore
This talk will go over a methodical approach for making a decision, dig into interesting tradeoffs, and give tips about what things to look for under the hood and how to evaluate the tech behind the database.
Building a Fault Tolerant Distributed ArchitectureSingleStore
This talk will highlight some of the challenges to building a fault tolerant distributed architecture, and how MemSQL's architecture tackles these challenges.
Stream Processing with Pipelines and Stored ProceduresSingleStore
This talk will discuss an upcoming feature in MemSQL 6.5 showing how advanced stream processing use cases can be tackled with a combination of stored procedures (new in 6.0) and MemSQL's pipelines feature.
Learn how to leverage MPP technology and distributed data to deliver high volume transactional and analytical work loads which result in real time dashboards on rapidly changing data using standard SQL tools. Demonstrations will include the streaming of structured and JSON data from Kafka messages through a micro-batch ETL process into the MemSQL database where the data is then queried using standard SQL tools and visualized leveraging Tableau.
This session will focus on image recognition, the techniques available, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how that can assist in large-scale, high-throughput, highly-parallel image recognition.
LIVE DEMO: Constructing and executing a real-time image recognition pipeline using Kafka and Spark.
Speaker: Neil Dahlke, MemSQL Senior Solutions Engineer
How Database Convergence Impacts the Coming Decades of Data ManagementSingleStore
How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.
Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://www.databasemonth.com.
James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo. James covers the architectural decisions and lessons learned building an exactly-once ingest pipeline storing raw events across in-memory row storage and on-disk columnar storage and a custom metalanguage and query layer leveraging partial OLAP result set caching and query canonicalization. Putting all the pieces together provides thousands of Uber employees with subsecond p95 latency analytical queries spanning hundreds of millions of recent events.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
1. Eric Frenkiel, MemSQL CEO and co-founder
August 11, 2015 • San Diego, CA
Real-Time Data Pipelines with
Kafka, Spark, and Operational Databases
2. What’s In Store
MemSQL and a fresh look at Lambda architectures
Building real-time data pipelines for immediate impact
One architecture for many applications
3. MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
4. The Real-Time Database for Transactions and Analytics
In-Memory, Distributed, Relational
Data Center, Software, Cloud
13. By 2020, HP predicts that over a trillion sensors will be online
“The Internet of Things Will Drastically Change Our Future” – Datafloq
14. Going Real-Time is the Next Phase for Big Data
More Devices
More Interconnectivity
More User Demand
…and companies are at risk of being left behind
17. Designing the Ideal Real-Time Pipeline
Message Queue → Transformation → Speed/Serving Layer
End-to-End Data Pipeline Under One Second
18. A high-throughput distributed messaging system
Publish and subscribe to Kafka “topics”
Centralized data transport for the organization
Kafka
19. In-memory execution engine
High level operators for procedural and programmatic analytics
Faster than MapReduce
Spark
21. Use Spark and Operational Databases Together
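Comparison: Spark vs. Operational Databases
Interface: Programmatic (Spark); Declarative (Operational Databases)
Execution Environment: Job Scheduler (Spark); SQL Engine and Query Optimizer (Operational Databases)
Persistent Storage: Use another system (Spark); Built-in (Operational Databases)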
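To make the programmatic versus declarative distinction concrete, here is a minimal sketch that computes the same per-device-type total both ways. It uses plain Python as a stand-in for a programmatic job and the standard-library sqlite3 module as a stand-in for an operational database; neither is Spark or MemSQL, and the sample readings are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample readings: (device_type, watts).
readings = [
    ("kitchen_appliance", 60),
    ("kitchen_appliance", 40),
    ("heater", 1500),
]

# Programmatic style: the caller spells out HOW to aggregate, step by step.
totals = defaultdict(int)
for device_type, watts in readings:
    totals[device_type] += watts
programmatic = dict(totals)

# Declarative style: the caller states WHAT is wanted and the SQL engine
# decides how to execute it. sqlite3 is only a stand-in here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_type TEXT, watts INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
declarative = dict(
    conn.execute(
        "SELECT device_type, SUM(watts) FROM readings GROUP BY device_type"
    )
)

print(programmatic == declarative)
```

Both paths arrive at the same totals; the difference is who owns the execution plan, which is the reason the slide suggests using the two systems together.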
22. Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
Publish to Kafka Topic: 0111001010101111101111100000001010 111100001110101100000010010010111…
Event added to message queue
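The publish step on this slide can be sketched roughly as follows. This is a toy in-memory model, not the Kafka client API: the `topics` dictionary, the `serialize` and `publish` helpers, the topic name `sensor_readings`, and the choice of JSON as the wire format are all assumptions made for illustration.

```python
import json
from collections import defaultdict, deque

# Hypothetical in-memory stand-in for a message queue, keyed by topic name.
topics = defaultdict(deque)

def serialize(event):
    """Encode an event tuple as UTF-8 JSON bytes (assumed wire format)."""
    return json.dumps(event).encode("utf-8")

def publish(topic, payload):
    """Append raw bytes to the end of the named topic."""
    topics[topic].append(payload)

# The event shown on the slide: (time, house_id, device_id, watts).
event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)
publish("sensor_readings", serialize(event))

print(len(topics["sensor_readings"]))
```

The point of the sketch is only that producers hand opaque bytes to a named topic and consumers read them later in arrival order.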
23. Enrich and Transform the Data
Spark polling Kafka for new messages
Deserialization: 0111001010101111101111100000001010 111100001110101100000010010010111… becomes (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment: (2015-07-06T16:43:40.33Z, 329280, 23, 60) becomes (2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
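A minimal sketch of the deserialization and enrichment step, written in plain Python rather than Spark. The JSON wire format, the two lookup tables, and the assumption that zip is keyed by house and device type by device id are all hypothetical; the slide shows only the input and output tuples.

```python
import json

# Hypothetical reference data; only the resulting values appear on the slide.
ZIP_BY_HOUSE = {329280: 94110}
TYPE_BY_DEVICE = {23: "kitchen_appliance"}

def deserialize(payload):
    """Decode UTF-8 JSON bytes into (time, house_id, device_id, watts)."""
    time, house_id, device_id, watts = json.loads(payload.decode("utf-8"))
    return time, house_id, device_id, watts

def enrich(record):
    """Add zip and device_type by looking up house_id and device_id."""
    time, house_id, device_id, watts = record
    return (
        time,
        house_id,
        ZIP_BY_HOUSE[house_id],
        device_id,
        TYPE_BY_DEVICE[device_id],
        watts,
    )

# Bytes as they might arrive from the message queue (assumed encoding).
payload = json.dumps(["2015-07-06T16:43:40.33Z", 329280, 23, 60]).encode("utf-8")
enriched = enrich(deserialize(payload))
print(enriched)
```

In a real pipeline the lookup tables would typically come from a reference dataset joined against the stream; a dictionary keeps the sketch self-contained.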
24. Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time | house_id | zip | device_id | device_type | watts
2015-07-06T16:43:40.33Z | 329280 | 94110 | 23 | ‘kitchen_appliance’ | 60
… | … | … | … | … | …
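The persist step can be approximated with the sketch below. It uses the standard-library sqlite3 module purely as a stand-in for the operational database, so it does not show MemSQL or its Spark connector; the column types are assumptions, while the table name, column names and row values are taken from the slide.

```python
import sqlite3

# sqlite3 is a stand-in for the operational database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

# One enriched record, matching the row shown on the slide.
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Once persisted, the data is immediately queryable with ordinary SQL.
watts_by_zip = conn.execute(
    "SELECT zip, SUM(watts) FROM memcity_table GROUP BY zip"
).fetchall()
print(watts_by_zip)
```

The final query illustrates why the serving layer is a SQL database: enriched events become available to ad hoc queries as soon as they are inserted.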
30. Collect streaming data at scale (hundreds of MemSQL machines)
Proactively diagnose issues
Query ad-hoc and in real-time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics
35. Using Real-Time for Personalization
Ad Servers
EC2
Real-time
analytics
PostgreSQL
Legacy reports
Monitoring S3 (replay)
HDFS
Data Science
Vertica
Operational Data Store (ODS)
Star Schema MicroStrategy
Reach overlap and ad optimization
Over 60,000 queries per second
Millisecond response times
Sensors are being integrated into our cars, our phones, our medical devices – trillions of sensors impact many facets of our lives
“HP expects that by 2020 a trillion sensors are needed in the world, the equivalent of 150 sensors per human. Sensors will end-up in anything imaginable” (https://datafloq.com/read/internet-of-things-with-trillions-of-sensors-will-/218)
- In 2020, 25 billion connected things will be in use (Gartner); 4.9 billion (2015) http://www.gartner.com/newsroom/id/2905717
HP’s Peter Hartwell: “one trillion nanoscale sensors and actuators will need the equivalent of 1000 internets: the next huge demand for computing!”