The document discusses Knorex's approach to processing large volumes of streaming user data in real time using Google Cloud technologies. It describes a serverless streaming pipeline that ingests data into Pub/Sub, uses Dataflow for stream processing, and stores processed data in BigQuery for analytics and in Cloud Bigtable for real-time user targeting. The pipeline handles 1,500 events per second, processes 1 TB of data daily, and reprocesses 30 TB of historical data each day using both streaming and batch Dataflow jobs.
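As a rough sanity check on the figures in the summary, the arithmetic can be sketched as below. It assumes that the 1,500 events per second is a sustained average, that 1 TB means 10^12 bytes, and that the daily volume consists of those same events. None of these assumptions is stated in the source, so the per-event size is only an order-of-magnitude estimate.

```python
# Back-of-the-envelope check of the quoted pipeline figures.
# Assumptions (not from the source): the event rate is a sustained average,
# 1 TB = 10**12 bytes, and the daily volume consists of those same events.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = 1500
daily_bytes = 1 * 10**12

events_per_day = events_per_second * SECONDS_PER_DAY
avg_bytes_per_event = daily_bytes / events_per_day

print(f"events per day: {events_per_day:,}")
print(f"implied average size per event: {avg_bytes_per_event / 1000:.1f} kB")
```

Under these assumptions the pipeline sees roughly 130 million events per day, each a few kilobytes on average.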
Building Pinterest Real-Time Ads Platform Using Kafka Streams confluent
Building Pinterest Real-Time Ads Platform Using Kafka Streams (Liquan Pei + Boyang Chen, Pinterest) Kafka Summit SF 2018
In this talk, we share our experience of building Pinterest’s real-time Ads Platform with Kafka Streams. The real-time budgeting system is the most mission-critical component of the Ads Platform, as it controls how each ad is delivered to maximize user, advertiser and Pinterest value. The system needs to handle over 50,000 queries per second (QPS) of impressions, keep end-to-end latency under five seconds, and recover within five minutes during outages. It also needs to scale with the fast growth of Pinterest’s ads business.
The real-time budgeting system is composed of a real-time stream-stream joiner, a real-time spend aggregator and a spend predictor. At Pinterest’s scale, we needed to overcome quite a few challenges to make each component work. For example, the stream-stream joiner needs to maintain terabyte-size state while supporting fast recovery, and the real-time spend aggregator needs to publish to thousands of ads servers while supporting over one million read QPS. We chose Kafka Streams because it provides millisecond latency guarantees, scalable event-based processing and easy-to-use APIs. In the process of building the system, we did extensive tuning of RocksDB and the Kafka Producer and Consumer, and pushed several open source contributions to Apache Kafka. We are also working on adding a remote checkpoint for Kafka Streams state to reduce cold-start time when adding more machines to the application. We believe that our experience can be beneficial to people who want to build real-time streaming solutions at large scale and to understand Kafka Streams deeply.
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015 Rajit Saha
At the VMware Corporate IT Data Solution and Delivery Team, we have built the Enterprise Advance Data Analytic Platform on top of vSphere 6.0 with VMware BigData Extension, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2 and Alpine Data Lab.
Change Data Streaming Patterns for Microservices With Debezium confluent
(Gunnar Morling, RedHat) Kafka Summit SF 2018
Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/): secret sauce for change data capture (CDC) streaming changes from your datastore that enables you to solve multiple challenges: synchronizing data between microservices, gradually extracting microservices from existing monoliths, maintaining different read models in CQRS-style architectures, updating caches and full-text indexes and feeding operational data to your analytics tools
Join this session to learn what CDC is about, how it can be implemented using Debezium, an open source CDC solution based on Apache Kafka, and how it can be utilized for your microservices. Find out how Debezium captures all the changes from datastores such as MySQL, PostgreSQL and MongoDB, how to react to the change events in near real time, and how Debezium is designed not to compromise on data correctness and completeness even if things go wrong. In a live demo we’ll show how to set up a change data stream out of your application’s database without any code changes needed. You’ll see how to sink the change events into other databases and how to push data changes to your clients using WebSockets.
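The cache-updating use case mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the demo from the talk: it assumes change events arrive as JSON strings carrying the Debezium envelope fields `op`, `before` and `after`, and the `customers` rows, the `id` key field and the `apply_change_event` helper are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def apply_change_event(cache: Dict[Any, dict], raw_value: Optional[str],
                       key_field: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory cache."""
    if raw_value is None:
        # Tombstone record that can follow a delete: nothing to apply.
        return
    event = json.loads(raw_value)
    # With the JSON converter and schemas enabled, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "u", "r"):  # create, update, snapshot read
        row = payload["after"]
        cache[row[key_field]] = row
    elif op == "d":  # delete
        row = payload["before"]
        cache.pop(row[key_field], None)


# Hypothetical change events for a "customers" table.
events = [
    json.dumps({"op": "c", "before": None,
                "after": {"id": 1, "email": "a@example.com"}}),
    json.dumps({"op": "u", "before": {"id": 1, "email": "a@example.com"},
                "after": {"id": 1, "email": "b@example.com"}}),
    json.dumps({"op": "c", "before": None,
                "after": {"id": 2, "email": "c@example.com"}}),
    json.dumps({"op": "d", "before": {"id": 2, "email": "c@example.com"},
                "after": None}),
    None,  # tombstone
]

cache: Dict[Any, dict] = {}
for value in events:
    apply_change_event(cache, value)

print(cache)
```

In a real consumer the row identity would more safely come from the Kafka message key, since the contents of `before` on a delete depend on how the source database is configured.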
MapR on Azure: Getting Value from Big Data in the Cloud -MapR Technologies
Public cloud adoption is exploding and big data technologies are rapidly becoming an important driver of this growth. According to Wikibon, big data public cloud revenue will grow from 4.4% in 2016 to 24% of all big data spend by 2026. Digital transformation initiatives are now a priority for most organizations, with data and advanced analytics at the heart of enabling this change. This is key to driving competitive advantage in every industry.
There is nothing better than a real-world customer use case to help you understand how to get value from big data in the cloud and apply the learnings to your business. Join Microsoft, MapR, and Sullexis on November 10th to:
- Hear from Sullexis on the business use case and technical implementation details of one of their oil & gas customers
- Understand the integration points of the MapR Platform with other Azure services and why they matter
- Know how to deploy the MapR Platform on the Azure cloud and get started easily
You will also get to hear about customer use cases of the MapR Converged Data Platform on Azure in other verticals such as real estate and retail.
Speakers:
Rafael Godinho - Technical Evangelist - Microsoft Azure
Tim Morgan - Managing Director - Sullexis
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro... HostedbyConfluent
Hermes, Germany's largest post-independent logistics service provider for deliveries, had one main goal—make faster and smarter data-driven business decisions. But with high volumes of diverse and disparate data, how can you effectively leverage it as an asset for real-time insights and business intelligence? During this session, Hermes will share their data challenges and how HVR's high volume data replication capabilities enabled Hermes to securely and seamlessly integrate data into Kafka for real-time decision-making and greater visibility into the entire logistics process.
Building the Next-gen Digital Meter Platform for Fluvius Databricks
Fluvius is the network operator for electricity and gas in Flanders, Belgium. Their goal is to modernize the way people look at energy consumption using a digital meter that captures consumption and injection data from any electrical installation in Flanders, ranging from households to large companies. After full roll-out there will be roughly 7 million digital meters active in Flanders, collecting up to terabytes of data per day. Combined with the regulation that Fluvius must keep a record of these readings for at least 3 years, this amounts to petabyte scale. delaware BeLux was assigned by Fluvius to set up a modern data platform and did so on Azure, using Databricks as the core component to collect, store, process and serve these volumes of data to every single consumer and beyond in Flanders. This enables the Belgian energy market to innovate and move forward. Maarten took up the role of project manager and solution architect.
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL confluent
(Sönke Liebau, OpenCore GmbH & Co.KG) Kafka Summit SF 2018
Airports are complex networks consisting of an immense number of systems that are necessary to keep the daily stream of passengers in constant motion. Connecting these systems in order to make the big picture transparent to the people running the show, authorities and last but not least the passengers is no simple endeavor.
In this talk I will describe a fictional airport and its effort to restructure the IT infrastructure around Kafka Streams to serve the real-time data needs of a busy airport. I will start by giving a brief overview of Kafka Streams, KSQL and the opportunities they offer for real-time stream processing. Following that we will explore the target architecture, which relies heavily on manifested views to serve up-to-date data, while also persisting to a traditional data lake for larger analytics workflows. Additionally we will take a look at the generic data transformation framework that was created to minimize the integration effort of the data-receiving systems. To illustrate these ideas I will describe some examples of possible integrations: joining flight data with radar and weather data to predict arrival time at the gate down to the second, constantly updated processing data from the luggage conveyor belts, results from prediction models for passenger flow, and many more.
How to leverage Kafka data streams with Neo4j GraphRM
Description:
Integrating Apache Kafka with other systems in a reliable and scalable way is often a key part of an event streaming platform. In this talk we'll introduce how to use Apache Kafka (the most used message broker) in combination with Neo4j through the Neo4j-Streams project, demonstrating via simple use cases how you can leverage the information driven by the Change Data Capture module and how to add Neo4j to your Kafka flow by using the Sink module in combination with the Neo4j Streams Procedures.
Speaker:
Andrea Santurbano - Neo4J Architect - LARUS Business Automation
Video link: https://youtu.be/oNXWOyDd5HI
IoT Analytics at Google Scale with James Chittenden: Using Pub/Sub, Dataflow, and BigQuery to Capture Millions of Connected Devices
There is the potential for 50 billion connected devices by 2020. Google Cloud Platform gives you the tools to scale connections, gather and make sense of data, and provide the reliable customer experiences that hardware devices require. Google’s Cloud Platform provides the infrastructure to handle streams of data fed from millions of intelligent devices.
In this meetup, we'll explore the IoT architecture of one of the world's largest appliance manufacturers along with Google's partner Archipelago, and will drill into how they are leveraging Google's massive infrastructure in their solution. We'll explore what Google provides for IoT, including Pub/Sub for messaging, Dataflow for data processing and BigQuery for large-scale analytics, as well as best practices for real-time stream processing, accounting for ingest, processing, storage and analysis of hundreds of millions of events per hour.
High Performance and Scalable Geospatial Analytics on Cloud with Open Source DataWorks Summit
During the rise and innovation of “big data,” the geospatial analytics landscape has grown and evolved. We are beyond just analyzing static maps. Geospatial data is streaming from devices, sensors, infrastructure systems, or social media, and our applications and use cases must dynamically scale to meet the increased demands.
Cloud can provide cost-effective storage and that ephemeral resource-burst needed for fast processing and low latency, all to monetize the immediate value of fresh geospatial data. Geospatial analytics require optimized spatial data types and algorithms to distill data to knowledge. Such processing, especially with strict latency requirements, has always been a challenge.
We propose an open source big data stack for geospatial analytics on Cloud based on Apache NiFi, Apache Spark and LocationTech GeoMesa. GeoMesa is a geospatial framework deployed in a modern big data platform that provides a scalable and low latency solution for indexing volumes of historical data and generating live views and streaming geospatial analytics.
Speakers: Constantin Stanca, Solutions Engineer, Hortonworks and James Hughes, Mathematician, CCRi
Kafka as an Eventing System to Replatform a Monolith into Microservices confluent
(Madhulika Tripathi, Intuit) Kafka Summit SF 2018
Breaking down monolithic applications into smaller manageable microservices can be a tough challenge. But the benefits are many. Faster changes, developer productivity, maintainability, scalability and high performance are a few of the motivators that make companies undertake this difficult journey.
At Intuit, we have our fair share of monolithic applications. One such application is QuickBooks Online, our accounting product for small businesses. In order to decompose the application, we needed to create new services and reduce the footprint of data in the monolith by moving it to new services in a phased manner. As more and more data and services keep moving out of the monolith, this data, now distributed across multiple microservices, needs to be synchronized in near real time to provide a seamless and fast experience to the customers of our product.
To achieve this, we are using Kafka as our eventing backbone to help us keep distributed data in sync without compromising performance and user experience. Guaranteed publishing of financial events with no loss, high accuracy and high performance is of utmost importance, as the majority of Intuit products deal with highly sensitive financial data. Strong ordering guarantees are another important criterion that Kafka can meet with low latency and high throughput. Use cases for data and streaming analytics, insights, personalization and machine-learning-based predictions can all be unlocked by adopting Kafka as our distributed streaming platform.
This talk will take you through Intuit’s journey of building a distributed, asynchronous system using Kafka. Specifically, it covers the choices made, the challenges faced, the adaptations clients had to make and how we see Kafka powering our future!
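The ordering guarantee mentioned in this abstract applies within a single Kafka partition, and records that share a key are routed to the same partition by default. The sketch below simulates that behaviour in plain Python rather than using a Kafka client. The partition count, the account keys and the `partition_for` helper are hypothetical, and `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner applies to key bytes.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 4  # hypothetical topic size


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash; Kafka's default partitioner uses murmur2 on the key bytes.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical financial events: (account key, sequence number within that account).
events: List[Tuple[str, int]] = [
    ("acct-1", 1), ("acct-2", 1), ("acct-1", 2),
    ("acct-3", 1), ("acct-2", 2), ("acct-1", 3),
]

# Each partition is modelled as an append-only log; appends preserve arrival order.
partitions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
for key, seq in events:
    partitions[partition_for(key)].append((key, seq))

# Reading any single partition front to back yields each key's events in order.
for log in partitions.values():
    per_key: Dict[str, List[int]] = defaultdict(list)
    for key, seq in log:
        per_key[key].append(seq)
    for key, seqs in per_key.items():
        assert seqs == sorted(seqs), f"{key} out of order"

print(dict(sorted(partitions.items())))
```

Ordering across different keys is not preserved by this scheme, which is why the choice of key (here, an account identifier) matters for event streams that must be replayed in sequence.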
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring InfluxData
On average, a business supporting digital transactions now crosses 35 backend systems—and legacy tools haven’t been able to keep up. This session will cover how MuleSoft uses InfluxCloud to help power their monitoring and diagnostic solutions as well as provide end-to-end actionable visibility to APIs and integrations to help customers identify and resolve issues quickly.
We describe an application of complex event processing (CEP) using a microservice-based streaming architecture. We use the Drools business rule engine to apply rules in real time to an event stream of IoT traffic sensor data.
Connected objects: many use cases Jedha Bootcamp
Today, connected objects are everywhere and surround us without our even noticing: phones, transport, music, watches. "The Internet of Things" (IoT) has taken an important place in our lives. By showing us use cases from companies such as NASA, Airbus, Red Bull and others, Sean will explain how these objects work and how all the collected data is managed.
An Introduction to the MapR Converged Data Platform MapR Technologies
Listen to the webinar on-demand: http://info.mapr.com/WB_Partner_CDP_Intro_EMEA_DG_17.05.31_RegistrationPage.html
In this 90-minute webinar, we discuss:
- The MapR Converged Data Platform and its components
- Use cases for the Converged Data Platform
- MapR Converged Partner Program
- How to get started with MapR
- Becoming a partner
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse Rittman Analytics
“Tech startups can't afford DBAs, and they don't have time to provision servers and scale them up and down or deal with patches or downtime. They've never heard of indexes and they need data loaded and ready for analysis in days, not months. In this session learn how Oracle Database developers can build data warehouses as a hip startup data engineer would—but using a proper database built on Oracle technology. Oracle Data Visualization Desktop provides analytics and data exploration with techniques explained in this session. Hear real-world development experiences from working on data and analytics projects at a tech startup in the UK.”
Digital Business Transformation in the Streaming Era Attunity
Enterprises are rapidly adopting stream computing backbones, in-memory data stores, change data capture, and other low-latency approaches for end-to-end applications. As businesses modernize their data architectures over the next several years, they will begin to evolve toward all-streaming architectures. In this webcast, Wikibon, Attunity, and MemSQL will discuss how enterprise data professionals should migrate their legacy architectures in this direction. They will provide guidance for migrating data lakes, data warehouses, data governance, and transactional databases to support all-streaming architectures for complex cloud and edge applications. They will discuss how this new architecture will drive enterprise strategies for operationalizing artificial intelligence, mobile computing, the Internet of Things, and cloud-native microservices.
Link to the Wikibon report - wikibon.com/wikibons-2018-big-data-analytics-trends-forecast
Link to Attunity Streaming CDC Book Download - http://www.bit.ly/cdcbook
Link to MemSQL's Free Data Pipeline Book - http://go.memsql.com/oreilly-data-pipelines
One bridge to connect them all. Oracle GoldenGate for Big Data. UKOUG Tech 2018 Gleb Otochkin
The presentation explains different use cases and topologies for Oracle GoldenGate Big Data adapters and shows how we can offload our data to be analyzed in real time using modern Big Data technologies.
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN... Amazon Web Services
Companies have valuable data that they might not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. With the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with an Amazon Redshift lead engineer to ask questions and learn more about how you can extend your analytics beyond your data warehouse.
How a distributed graph analytics platform uses Apache Kafka for data ingesti... HostedbyConfluent
Using Kafka to stream data into TigerGraph, a distributed graph database, is a common pattern in our customers’ data architecture. In the TigerGraph database, the Kafka Connect framework was used to build the native S3 data loader. In TigerGraph Cloud, we will be building native integration with many data sources such as Azure Blob Storage and Google Cloud Storage using Kafka as an integrated component for the Cloud Portal.
In this session, we will discuss both architectures: 1. the built-in Kafka Connect framework within the TigerGraph database; 2. using a Kafka cluster for cloud-native integration with other popular data sources. A demo will be provided for both data streaming processes.
Comparing three data ingestion approaches where Apache Kafka integrates with ... HostedbyConfluent
Using Kafka to stream data into TigerGraph, a distributed graph database, is a common pattern in our customers’ data architecture. We have seen the integration in three different layers around TigerGraph’s data flow architecture, and in many key use case areas such as customer 360, entity resolution, fraud detection, machine learning, and recommendation engines. Firstly, TigerGraph’s internal data ingestion architecture relies on Kafka as an internal component. Secondly, TigerGraph has a built-in Kafka Loader, which can connect directly with an external Kafka cluster for data streaming. Thirdly, users can use an external Kafka cluster to connect other cloud data sources to TigerGraph cloud database solutions through the built-in Kafka Loader feature. In this session, we will present the high-level architecture of the three approaches and demo the data streaming process.
Master the Multi-Clustered Data Warehouse - Snowflake Matillion
Snowflake is one of the most powerful, efficient data warehouses on the market today—and we joined forces with the Snowflake team to show you how it works!
In this webinar:
- Learn how to optimize Snowflake
- Hear insider tips and tricks on how to improve performance
- Get expert insights from Craig Collier, Technical Architect from Snowflake, and Kalyan Arangam, Solution Architect from Matillion
- Find out how leading brands like Converse, Duo Security, and Pets at Home use Snowflake and Matillion ETL to make data-driven decisions
- Discover how Matillion ETL and Snowflake work together to modernize your data world
- Learn how to utilize the impressive scalability of Snowflake and Matillion
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage Databricks
"At Pure Storage, our strong belief in aggressive automated testing has caused our continuous integration (CI) systems to generate massive amounts of messy log data. Spark's flexible computing platform allows us to write a single application to understand the state of our CI pipeline for both streaming (over a million events per second) and batch jobs (at 40TB/hour).
Decoupling our data storage enabled us to orchestrate and independently scale stateless pipeline components (spark, kafka, rsyslog, and custom code) using nomad. In this talk, we will discuss how we architected our data pipeline to leverage simple orchestration and enable resiliency with ephemeral compute components."
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret... Cloudera, Inc.
PRGX is the world's leading provider of accounts payable audit services and works with leading global retailers. As new forms of data started to flow into their organizations, standard RDBMS systems were not allowing them to scale. Now, by using Talend with Cloudera Enterprise, they are able to achieve a 9-10x performance benefit in processing data, reduce errors, and provide more innovative products and services to end customers.
Watch this webinar to learn how PRGX worked with Cloudera and Talend to create a high-performance computing platform for data analytics and discovery that rapidly allows them to process, model, and serve massive amounts of structured and unstructured data.
Delivering Data Democratization in the Cloud with Snowflake Kent Graziano
This is a brief introduction to Snowflake Cloud Data Platform and our revolutionary architecture. It contains a discussion of some of our unique features along with some real world metrics from our global customer base.
Data Virtualization: How to Make Your Data Architecture More Agile (German) Denodo
Watch full webinar here: https://bit.ly/3idAnbf
Today, high-quality data is needed quickly and in integrated form, increasingly across different clouds.
Here, data virtualization can work wonders as a logical data layer and drastically accelerate the modernization of your data architecture.
In our free webinar, we interview expert Otto Neuer of Denodo, who expands on the ideas only touched on here. He will give us insights into how data architectures are changing and what, from his perspective, the next phase of business intelligence looks like.
What you will take away:
- What the challenges and limitations of traditional data architectures are
- How modern architectures can remove these limitations
- What role data virtualization plays in modern data architectures
- What the next phase of business intelligence is
On September 23, 2020, expert Otto Neuer of Denodo, together with our partner QuinScape GmbH, will give us insights into how data architectures are changing and what, from his perspective, the next phase of business intelligence looks like.
Interested? Then register right away; places at the event are limited.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Real life use cases from across Europe (Walid Aoudi - Cognizant)
This presentation shares lessons learned from Cognizant's Big Data client projects in continental Europe and the UK. It focuses on use cases, presented through the business drivers behind these projects. Key highlights of the big data architectures and solution approaches will be presented. Finally, the business outcomes, in terms of ROI, delivered by the implemented solutions will be discussed.
Similar to Big data processing with PubSub, Dataflow, and BigQuery (20)
Opendatabay - Open Data Marketplace.pptx Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTES Subhajit Sahu
Graph algorithms, like PageRank ... Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is ...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
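To make the "float vs bfloat16 as the storage type" comparison in the list above more concrete, here is a minimal sketch in plain Python. It is not the report's OpenMP or CUDA code: it only emulates the two storage types on the CPU, and it assumes bfloat16 values are produced by truncating a float32 to its upper 16 bits (real hardware may round instead). All function names are hypothetical.

```python
import struct


def to_bfloat16_bits(x: float) -> int:
    """Convert x to float32, then keep only the upper 16 bits (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def from_bfloat16_bits(b: int) -> float:
    """Expand 16 stored bits back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def sum_float32_storage(values):
    """Store each element as float32, then accumulate in double precision."""
    stored = [struct.unpack("<f", struct.pack("<f", v))[0] for v in values]
    return sum(stored)


def sum_bfloat16_storage(values):
    """Store each element as (emulated) bfloat16, then accumulate in double precision."""
    stored = [from_bfloat16_bits(to_bfloat16_bits(v)) for v in values]
    return sum(stored)


if __name__ == "__main__":
    # Hypothetical input: the first 1000 terms of the harmonic series.
    values = [1.0 / (i + 1) for i in range(1000)]
    exact = sum(values)
    print("float32 storage error: ", abs(sum_float32_storage(values) - exact))
    print("bfloat16 storage error:", abs(sum_bfloat16_storage(values) - exact))
```

The sketch isolates the effect of the storage type alone: both sums accumulate in double precision, so any difference in the result comes from how many bits each stored element keeps. bfloat16 halves the memory traffic of float32 at the cost of a larger per-element error.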