This document summarizes a presentation about real-time data analytics with Apache Flink and Apache Beam. It discusses a possible real-time and batch processing architecture using AWS services, the challenges of streaming systems, including state management, and demos of analyzing user clickstreams and taxi trips with Apache Flink, Kafka, and Elasticsearch. It also covers the advantages of Apache Beam, including a unified batch and streaming API that can run on different frameworks such as Flink, native support for Java, Python, and Go, and the ability to mix languages in a single pipeline.
What is this serverless development thing? What languages can I use? How do I handle things like authentication, saving to a database, or sending notifications? Does it scale? I will try to answer all these questions, and a few more, in this session, where I will do a small demo of building a very simple app and deploying it to the cloud without worrying about managing infrastructure. Talk first delivered at AlcarriaConf 2021.
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki - javier ramirez
Do you think you can write a system to get data from sensors across the world, do real-time analytics, and display the data on a dashboard in under 100 lines of code? Would you like to add some monitoring and autoscaling too? And what about serverless? In this talk I'll show you all the technologies GCP offers to build such a system reliably and at scale.
In this session, we will introduce Amazon Redshift, a new petabyte-scale data warehouse service. We'll walk through the basics of the Redshift architecture, launch a new cluster, and run SQL queries across a large-scale, public dataset. After demonstrating how easy it is to get started with Redshift, we will show how to visualize and query large-scale datasets, running queries, reports, and analytics against millions of rows of records in just a few seconds.
Data-driven companies have a need to make their data easily accessible to those who analyze it. Many organizations have adopted the Looker application with LookML on AWS: a centralized analytical database with a user-friendly interface that allows employees to ask and answer their own questions and make informed business decisions.
Join our webinar to learn how our customer, Casper, an online mattress retailer, made the switch from a transactional database to Looker's data analytics program on Amazon Redshift. Looker on Amazon Redshift can help you greatly shorten your analytics lifecycle with a simplified infrastructure and rapid cloud scaling.
Join us to learn:
• How to utilize LookML to build reusable definitions and logic for your data
• Best practices for architecting a centralized analytical database
• How Casper leveraged Looker and Amazon Redshift to provide all their employees access to their data and metrics
Who should attend: Heads of Analytics, Heads of BI, Analytics Managers, BI Teams, Senior Analysts
SEC303: Automating Security in Cloud Workloads with DevSecOps - Amazon Web Services
This session is designed to teach security engineers, developers, solutions architects, and other technical security practitioners how to use a DevSecOps approach to design and build robust security controls at cloud-scale. This session walks through the design considerations of operating high-assurance workloads on top of the AWS platform and provides examples of how to automate configuration management and generate audit evidence for your own workloads. We’ll discuss practical examples using real code for automating security tasks, then dive deeper to map the configurations against various industry frameworks. This advanced session showcases how continuous integration and deployment pipelines can accelerate the speed of security teams and improve collaboration with software development teams.
Serverless Big Data Analytics with Amazon Athena and QuickSight - Amazon Web Services
Check out how you can easily query raw data in various formats in Amazon S3, transform it into a canonical form, analyze it, and build dashboards to get more insights from your data.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL. - Amazon Web Services
Amazon Athena is a new interactive query service that makes it easy to analyze data in Amazon S3, using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don't even need to load your data into Athena; it works directly with data stored in S3.
In this session, we will show you how easy it is to start querying your data stored in Amazon S3 with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
AWS Lambda Supports Parallelization Factor for Kinesis and DynamoDB Event Sou... - Swapnil Pawar
AWS Lambda now supports Parallelization Factor, a feature that allows you to process one shard of a Kinesis or DynamoDB data stream with more than one Lambda invocation simultaneously. This new feature allows you to build more agile stream processing applications on volatile data traffic.
By default, Lambda invokes a function with one batch of data records from one shard at a time. For a single event source mapping, the maximum number of concurrent Lambda invocations is equal to the number of Kinesis or DynamoDB shards.
Now you can specify the number of concurrent batches that Lambda polls from a shard via a Parallelization Factor from 1 (default) to 10. For example, when Parallelization Factor is set to 2, you can have 200 concurrent Lambda invocations at maximum to process 100 Kinesis data shards. This helps scale up the processing throughput when the data volume is volatile and the IteratorAge is high.
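For illustration, a sketch of how that factor could be raised with the AWS SDK for Java v2 (the mapping UUID is a placeholder; error handling is omitted):

import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.UpdateEventSourceMappingRequest;

public class RaiseParallelizationFactor {
    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            // Update an existing Kinesis/DynamoDB event source mapping so Lambda
            // polls up to 2 concurrent batches per shard (valid range: 1 to 10)
            lambda.updateEventSourceMapping(UpdateEventSourceMappingRequest.builder()
                    .uuid("00000000-0000-0000-0000-000000000000") // placeholder mapping UUID
                    .parallelizationFactor(2)
                    .build());
        }
    }
}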
Amazon EC2 provides you with the flexibility to cost-optimize your computing portfolio through purchasing models that fit your business needs. With the flexibility of mix-and-match purchasing models, you can grow your compute capacity and throughput and enable new types of cloud computing applications with the lowest TCO. In this session, we will explore combining pay-as-you-go (On-Demand), reserve ahead of time for discounts (Reserved), and high-discount spare capacity (Spot) purchasing models to optimize costs while maintaining high performance and availability for your applications. Common application examples will be used to demonstrate how to best combine EC2's purchasing models. You will leave the session with best practices you can immediately apply to your application portfolio.
by Lin Chunyong and Ryan Deivert, Airbnb
AWS Data & Analytics Week is an opportunity to learn about Amazon's family of managed analytics services. These services provide easy, scalable, reliable, and cost-effective ways to manage your data in the cloud. We explain the fundamentals and take a technical deep dive into the Amazon Redshift data warehouse; Data Lake services including Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum; log analytics with Amazon Elasticsearch Service; and data preparation and placement services with AWS Glue and Amazon Kinesis. You'll learn how to get started, how to support applications, and how to scale.
Managing Data with Amazon ElastiCache for Redis - August 2016 Monthly Webinar... - Amazon Web Services
Many data sets, such as time-series collections or Internet of Things (IoT) deployments, can include huge numbers of sensor reports and other data points, which can be a challenge to manage and aggregate. Amazon ElastiCache for Redis provides an on-demand managed service with the performance and scalability to turn big data into useful information. Join us to learn how to use Amazon ElastiCache to create serverless solutions that let you rapidly make use of large and multisource data sets.
Learning Objectives:
• Learn how to ingest and analyze sensor data using Amazon ElastiCache for Redis and the AWS IoT Service
• Learn how to use ElastiCache for Redis with time-series data
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a... - Amazon Web Services
Building big data applications often requires integrating a broad set of technologies to store, process, and analyze the increasing variety, velocity, and volume of data being collected by many organizations.
Using a combination of Amazon EMR, a managed Hadoop framework, and Amazon Redshift, a managed petabyte-scale data warehouse, organizations can effectively address many of these requirements.
In this webinar, we will show how organizations are using Amazon EMR and Amazon Redshift to build more agile and scalable architectures for big data. We will look into how you can leverage Spark and Presto running on EMR to address multiple data processing requirements. We will also share best practices and common use cases to integrate EMR and Redshift.
Learning Objectives:
• Best practices for building a big data architecture that includes Amazon EMR and Amazon Redshift
• Understand how to use technologies such as Amazon EMR, Presto and Spark to complement your data warehousing environment
• Learn key use cases for Amazon EMR and Amazon Redshift
Who Should Attend:
• Data architects, Data management professionals, Data warehousing professionals, BI professionals
AWS' serverless architecture components such as S3, SQS, SNS, CloudWatch Logs, DynamoDB, Kinesis, and Lambda can be tightly constrained in their operation; however, it is still possible to use some of them to propagate payloads which may be used to exploit vulnerabilities in some consuming endpoints or user-generated code. This session explores techniques for enhancing the security of these services, from assessing and tightening permissions in IAM to integrating tools and mechanisms for inline and out-of-band payload analysis which are more typically applied to traditional server-based architectures.
Presented by: Dave Walker, Security Solutions Architecture, Amazon Web Services
Customer Guest: Wouter Neyndorff, CTO, inSided
AWS Step Functions is a new, fully-managed service that makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Step Functions is a reliable way to connect and step through a series of AWS Lambda functions so that you can build and run multi-step applications in a matter of minutes. This session shows how to use AWS Step Functions to create, run, and debug cloud state machines to execute parallel, sequential, and branching steps of your application, with automatic catch and retry conditions.
Amazon RDS allows you to launch an optimally configured, secure and highly available database with just a few clicks. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you to focus on your applications and business.
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing - Amazon Web Services
AWS Batch is a fully-managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexities, saving time and reducing costs. In this session, Principal Product Managers Jamie Kinney and Dougal Ballantyne describe the core concepts behind AWS Batch and details of how the service functions. The presentation concludes with relevant use cases and sample code.
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205) - Amazon Web Services
Join us for this general session where AWS big data experts present an in-depth look at the current state of big data. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data announcements, as we kick off the Big Data re:Source Mini Con.
ABD324: Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ... - Amazon Web Services
Customers that have Oracle data warehouses find them complex and expensive to manage. Most are struggling with data load and performance issues. They are looking to migrate to something that is easy to manage and cost-effective, and that improves their query performance. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. Migrating your Oracle data warehouse to Amazon Redshift can substantially improve query and data load performance, increase scalability, and save costs. This workshop leverages AWS Database Migration Service and AWS Schema Conversion Tool to migrate an existing Oracle data warehouse to Amazon Redshift. When migrating your database from one engine to another, you have two major things to consider: the conversion of the schema and code objects, and the migration and conversion of the data itself. You can convert schema and code with AWS SCT and migrate data with AWS DMS. AWS DMS helps you migrate your data easily and securely with minimal downtime. Prerequisites: Have an AWS account with IAM admin permissions and sufficient limits for the AWS resources above, with a comfortable working knowledge of the AWS console, relational databases (Oracle), and Amazon Redshift.
The need for gleaning answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company and needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
Apache Flink™ - A Next-Generation Stream Processor - Aljoscha Krettek
In this talk we will first give a short overview of the current state of streaming data analysis. We will then continue with a small introduction to the Apache Flink system for real-time data analysis, before diving deeper into some of the interesting properties that distinguish Flink from the other players in this space. For that, we will look at example use cases that either come directly from users or are based on our experience with users. Specific features we will look at include support for splitting events into individual sessions based on the time an event occurred (event time), determining points at which to save the state of a streaming program for later restarts, efficient handling of very large stateful streaming computations, and making that state accessible from outside.
Presenter: Robert Metzger
Video Link: https://www.youtube.com/watch?v=GWxyiTY-1uQ
Flink.tw Meetup Event (2016/07/19):
"Stream Processing with Apache Flink w/ Flink PMC Robert Metzger"
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022 - Hosted by Confluent
Azure Event Hubs is a hyperscale PaaS event stream broker with protocol support for HTTP, AMQP, and Apache Kafka RPC that accepts and forwards several trillion (!) events per day and is available in all global Azure regions. This session is a look behind the curtain where we dive deep into the architecture of Event Hubs and look at the Event Hubs cluster model, resource isolation, and storage strategies and also review some performance figures.
Real-time Streaming Pipelines with FLaNK - Data Con LA
Introducing the FLaNK stack, which combines Apache Flink, Apache NiFi, and Apache Kafka to build fast applications for IoT, AI, and rapid ingest, and to deploy them anywhere. I will walk through live demos and show how to do this yourself.
FLaNK provides a quick set of tools to build applications at any scale for any streaming and IoT use cases.
We will discuss a use case - Smart Stocks with FLaNK (NiFi, Kafka, Flink SQL)
Bio -
Tim Spann is an avid blogger and the Big Data Zone Leader for DZone (https://dzone.com/users/297029/bunkertor.html). He runs the successful Future of Data Princeton meetup with over 1200 members at http://www.meetup.com/futureofdata-princeton/. He is currently a Senior Solutions Engineer at Cloudera in the Princeton, New Jersey area. You can find all the source and material behind his talks at his GitHub and community blog:
https://github.com/tspannhw/ApacheDeepLearning201
https://community.hortonworks.com/users/9304/tspann.html
Apache Flink Overview at SF Spark and Friends - Stephan Ewen
Introductory presentation for Apache Flink, with a bias toward the streaming data analysis features in Flink. Shown at the San Francisco Spark and Friends meetup.
Artsem Semianenko (Adform) - "Flink in action, or how to tame the squirrel"
Slides for presentation: https://www.youtube.com/watch?v=YSI5_RFlcPE
Source: https://github.com/art4ul/flink-demo
There was a time when almost any software component required paying for a license. Fortunately, thanks to free and open source software, today you can build practically any application using free components.
But if the software is free, who develops it? Does the free software community work altruistically? Can free software be developed professionally? In fact, some say that open source as we knew it no longer exists, and that what we have today is something else.
In this talk I will discuss how open source code can be monetized, and some of the conflicts you may run into along the way.
I will also tell you how we at QuestDB develop an open source database while keeping a stable team making a living from it. I will also comment on some problematic situations that very prominent projects have faced, or face today.
QuestDB: The building blocks of a fast open-source time-series database - javier ramirez
(talk delivered at OSA CON 23)
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed.
We will learn how it deals with data ingestion, and which SQL extensions it implements for working with time-series efficiently.
We will also review some of the changes we have gone through over the past two years to deal with late and out-of-order data, non-blocking writes, read replicas, and data deduplication.
How we built QuestDB Cloud, a Kubernetes-based SaaS around QuestDB... - javier ramirez
QuestDB is a high-performance open source database. Many people told us they would like to use it as a service, without having to manage the machines. So we got to work on a solution that would let us launch QuestDB instances with fully managed provisioning, monitoring, security, and upgrades.
Quite a few Kubernetes clusters later, we managed to launch our QuestDB Cloud offering. This talk is the story of how we got there. I will talk about tools such as Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki, and Grafana, but also about challenges such as authentication, billing, and multi-cloud, and about what you have to say no to in order to survive in the cloud.
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ... - javier ramirez
How would you build a database to support sustained ingestion of several hundred thousand rows per second while running near real-time queries on top?
In this session I will go over some of the technical decisions and trade-offs we applied when building QuestDB, an open source time-series database developed mainly in Java, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course.
We will also review some of the changes we have gone through over the past two years to deal with late and out-of-order data, non-blocking writes, read replicas, and faster batch ingestion.
Deduplicating and analysing time-series data with Apache Beam and QuestDB - javier ramirez
Time series data pipelines tend to prioritise speed and freshness over completeness and integrity. In such scenarios, it is very common to ingest duplicate data, which may be fine for many analytical use cases, but is very inconvenient for others.
There are many open source databases built specifically for the speed and query semantics of time series, and most of them lack automatic deduplication of events in near real-time. One such database is QuestDB, which requires a manual batch process to deduplicate ingested data.
In this talk, we will see how we can successfully use Apache Beam to deduplicate streaming time series, which can then be analysed by a time series database.
Relational databases were created a long time ago for a simpler world. Even if they are still awesome tools for generic workloads, there are some things they cannot do well.
In this session I will speak about purpose-built databases that you can use for specific business scenarios. We will see the type of queries you can run on a Graph database, a Document Database, and a Time-Series database. We will then see how a relational database could also be used for the same use cases, just in a much more complex way.
Your Timestamps Deserve Better than a Generic Database - javier ramirez
If you are storing records with a timestamp in your database, it is very likely a time series database can make your life easier.
However, time series databases are still the great unknown for a large part of the tech community.
In this talk, I will show you what use cases they are good for, what they give you that you cannot get from a traditional database, and when it is a good idea (and when it is not) to use them.
For the demos, we will be using QuestDB, the fastest open-source time series database.
How to design a database that can ingest more than four million ... - javier ramirez
In this session I will talk about the technical decisions we made when developing QuestDB, a Postgres-compatible open source database for time series, and how we manage to write more than four million rows per second without blocking or slowing down queries.
I will talk about things like (zero) garbage collection, instruction vectorization using SIMD, rewriting instead of reusing to shave off microseconds, taking advantage of advances in processors, hard drives, and operating systems (such as io_uring support), and the balance between user experience and performance when considering new features.
Processing and analysing streaming data with Python. Pycon Italy 2022 - javier ramirez
Data used to be a batch thing, but more and more we get unbounded streams of data, fast or slow, that we need to process and analyse in near real time.
In this talk I’ll show you how you can use Apache Flink and QuestDB to build reliable streaming data pipelines that can grow as much as you need.
QuestDB: ingesting a million time series per second on a single instance. Big... - javier ramirez
In this session I will show you the technical decisions we made when building QuestDB, the open source, Postgres compatible, time-series database, and how we can achieve a million row writes per second without blocking or slowing down the reads.
AWS services and infrastructure, and the upcoming region in Aragón - javier ramirez
AWS is building an infrastructure region in Aragón. Fine, but what does that mean? Is it so different from a conventional data center or from other cloud providers? (Spoiler: yes.) In this session I explain why. There is a video at https://catedrasamcadt.unizar.es/noticias/el-momento-tecnologico-actual-contado-por-trabajadores-de-amazon-web-services/
AWS launched publicly in March 2006 with just one service, starting the age of the public cloud. You might think that after 15 years everything in cloud has already been invented, but that's simply not the case.
In this session I want to show you how AWS is reinventing the cloud in areas like computing, machine learning, databases and analytics, or cloud infrastructure.
In this webinar we explain some of the problems of streaming analytics and why they differ from batch/big data analytics. We then introduce some basic streaming concepts, such as event queues, event processors, event time vs. processing time, and delivery guarantees. We end this first part of the series by presenting a few of the most common open source components for streaming (Kafka, Spark, Flink, Cassandra, or ElasticSearch), and we mention the different options you have to run them on AWS.
Getting started with streaming analytics: Setting up a pipeline - javier ramirez
In this session I will show you how to create a simple streaming analytics pipeline, first using open source tools and developing locally, then moving to a VM, then moving to fully managed AWS services. The session will serve as an introduction to some details of Apache Kafka, Apache Flink, ElasticSearch, Amazon Managed Streaming for Kafka, Kinesis Data Analytics, and Amazon Elasticsearch. It will be an almost slideless presentation, as I will spend most of the time at the command line and the IDE.
Getting started with streaming analytics: Deep Dive - javier ramirez
Now that we know how to create simple streaming analytics pipelines, it is time to learn something more interesting. In this session I will show you how to add Complex Event Processing to your Apache Flink (or Kinesis Data Analytics) application using Java. For those of you who prefer SQL, I will show you how to run streaming analytics using only SQL.
Getting started with streaming analytics: streaming basics (1 of 3) - javier ramirez
In this webinar we explain some of the problems of streaming analytics and why they differ from batch/big data analytics. We then introduce some basic streaming concepts, such as event queues, event processors, event time vs. processing time, and delivery guarantees. We end this first part of the series by presenting a few of the most common open source components for streaming (Kafka, Spark, Flink, Cassandra, or ElasticSearch), and we mention the different options you have to run them on AWS.
Security monitoring and threat detection with AWS - javier ramirez
Security is our top priority. When you deploy your infrastructure and applications in the cloud, keep in mind that many security practices are the same as the ones you traditionally follow on-premises, but there are other mechanisms that are specific to AWS and that help you operate securely.
In this webinar we explain the basics of security monitoring and threat detection, and we will see how services such as Amazon GuardDuty and AWS Security Hub help you get a complete picture, let you meet your compliance requirements, and let you detect threats in your workloads.
Query any data source using SQL with Amazon Athena and its federated queries - javier ramirez
Developers today use different databases depending on the needs of our applications. For example, if we are building a social network, we might use a graph database like Amazon Neptune; if our requirement is to support very flexible schemas, perhaps Amazon DocumentDB; or if we need super-low latencies, perhaps Amazon DynamoDB. Or even Amazon ElastiCache with Redis.
It is increasingly common to find complex applications composed of different services and different data stores. This is great from the point of view of choosing the ideal tool for each use case, but it makes data analytics harder, since not all the information lives in a single relational database.
In this webinar we introduce Athena's federated query functionality, which lets you run SQL queries against any of your databases, both on AWS and on premises. Moreover, in a single SELECT you can query different data sources and join across them. To make everything clearer, we will show it with a demo querying, via SQL, a database that does not natively support SQL.
Recommendations, predictions, and fraud detection using artificial intelligence services - javier ramirez
Implementing machine learning models to solve complex business challenges, such as fraud detection, recommendations, or forecasting data series, is hard if you start from scratch. However, using AWS tools, implementing those models is within reach of any company capable of uploading a file to the cloud and calling an API when it wants results. Based on the machine learning technology perfected through years of use at Amazon.com, Amazon Forecast, Amazon Personalize, and Amazon Fraud Detector allow anyone without prior machine learning experience to integrate these technologies into their applications. In this video you will learn the difficulties of creating prediction models for the cases above, you will see how AWS speeds up the hard work needed to design, train, and deploy a model customized to your data, and we will tell you everything you need to start integrating these models into your application. Of course, we will see demos of how they work.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Real-time data analytics with Apache Flink and Apache Beam
1. Real-time data analytics with Apache Flink and Apache Beam
Javier Ramírez - @supercoco9
Developer Advocate - Amazon Web Services
November 3-5, 2020
2. A possible (near) real-time system
[Architecture diagram: data flowing through the AWS Cloud into transformations / validations / filtering / aggregations / analytics]
3. Analytics for a possible clickstream
[Architecture diagram: AWS Cloud, parsing clicks]
Every minute, compute the number of active users (see the sketch below)
Every 5 minutes, products purchased by category
Every minute, a ranking of the most visited products
Every hour, the total number of orders
In real time, select ads
In real time, detect anomalous behavior
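A minimal Flink sketch, in Java, of the first computation above; the in-line source and the distinct-count aggregate are illustrative assumptions, not code from the original deck:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.HashSet;
import java.util.Set;

public class ActiveUsersPerMinute {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Placeholder source: in the deck this would be the parsed clickstream
        DataStream<String> userIds = env.fromElements("user-1", "user-2", "user-1");
        userIds
            .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .aggregate(new AggregateFunction<String, Set<String>, Long>() {
                @Override public Set<String> createAccumulator() { return new HashSet<>(); }
                @Override public Set<String> add(String userId, Set<String> acc) { acc.add(userId); return acc; }
                @Override public Long getResult(Set<String> acc) { return (long) acc.size(); }
                @Override public Set<String> merge(Set<String> a, Set<String> b) { a.addAll(b); return a; }
            })
            .print(); // distinct active users per 1-minute window
        env.execute("Active users per minute");
    }
}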
4. Challenges of working with streaming systems
You don't know the volume of the data before you start
Data is never complete
Low latency is expected
Events might be related, but data can come out of order
The system should remain available during upgrades
5. Stateless processing
• Working on per-element streams is relatively easy (i.e. changing the format of each item, or filtering out records based on their own properties; see the sketch below)
• The real fun starts when you need to do transforms/aggregations over groups of elements: group by, count, max, average, joins, filtering based on properties from related records, or complex pattern detection
[Timeline graphic: individual elements laid out along processing time, 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
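For illustration, a minimal sketch of these stateless, per-element operations with the Flink DataStream API (the in-line source is a placeholder):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StatelessOps {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("click", "", "purchase")
            .map(String::toUpperCase)         // change the format of each item independently
            .filter(line -> !line.isEmpty())  // filter records based on their own properties
            .print();
        env.execute("Stateless per-element processing");
    }
}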
6. Stateful processing: processing-time based windows
[Timeline graphic: fixed windows laid out over processing time, 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
7. Stateful processing: event-time based windows
[Timeline graphic: input elements plotted by event time vs. processing time (10:00 to 15:00), and the resulting windowed output]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
8. Stateful processing: session windows
[Timeline graphic: input elements grouped into per-key sessions by event time vs. processing time (10:00 to 15:00), and the resulting windowed output]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
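A hedged sketch of per-key session windows with the Flink DataStream API; the (user, clicks) source is a placeholder, and event-time windows additionally need timestamps and watermarks assigned upstream:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionsPerUser {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> clicks = env.fromElements(
                Tuple2.of("user-1", 1), Tuple2.of("user-2", 1)); // placeholder (user, clicks) events
        clicks
            .keyBy(event -> event.f0)                                  // one session stream per user
            .window(EventTimeSessionWindows.withGap(Time.minutes(10))) // a session closes after 10 idle minutes
            .sum(1)                                                    // clicks per session
            .print();
        env.execute("Event-time session windows");
    }
}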
9. Challenge: keeping state between events
• The system has to know what stage each element is at, and whether it is in an intermediate state or has already been fully processed
• For operations that need "memory", the system has to keep the state of elements and intermediate computations (see the sketch below)
• In a sufficiently large system, that state will be distributed
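As a sketch of how Flink exposes this managed, distributed state, here is a hypothetical running count per key kept in ValueState (not from the original deck):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class RunningCount extends RichFlatMapFunction<String, Long> {
    private transient ValueState<Long> count; // Flink snapshots this with each checkpoint

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = count.value();                    // null on the first event for this key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(updated);
    }
}
// Usage on a keyed stream: stream.keyBy(v -> v).flatMap(new RunningCount())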
10. Apache Flink
• Stateful computations over data streams
https://flink.apache.org
11. package com.javier_cloud.demos.streaming;

import com.javier_cloud.demos.streaming.util.AppProperties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;

import java.util.Properties; // missing on the slide, needed for the Kafka configuration

public class KafkaStreaming {
    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment and load the app configuration
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        AppProperties.loadProperties(env);

        Properties kafkaProperties = new Properties();
        String kafka_servers = AppProperties.getBootstrapServers();
        kafkaProperties.setProperty("bootstrap.servers", kafka_servers);
        kafkaProperties.setProperty("group.id", AppProperties.getGroupId());

        // Read raw strings from the input Kafka topic
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer011<>(AppProperties.getInputStream(), new SimpleStringSchema(), kafkaProperties));

        // Echo every message to the output Kafka topic
        FlinkKafkaProducer011<String> streamSink = new FlinkKafkaProducer011<>(
                kafka_servers, AppProperties.getOutputStream(), new SimpleStringSchema());
        streamSink.setWriteTimestampToKafka(true);
        stream.addSink(streamSink);

        env.execute("Basic Flink Kafka Streaming");
    }
}
12. package com.javier_cloud.demos.streaming;

import com.javier_cloud.demos.streaming.util.AppProperties;
import com.javier_cloud.demos.streaming.util.ESSinkBuilder;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.util.Collector;

import java.util.Properties;

public class KafkaStreamingToES {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        AppProperties.loadProperties(env);

        Properties kafkaProperties = new Properties();
        String kafka_servers = AppProperties.getBootstrapServers();
        kafkaProperties.setProperty("bootstrap.servers", kafka_servers);
        kafkaProperties.setProperty("group.id", AppProperties.getGroupId());

        // Read raw strings from the input Kafka topic
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer011<>(AppProperties.getInputStream(), new SimpleStringSchema(), kafkaProperties));

        // Echo every message to the output Kafka topic
        FlinkKafkaProducer011<String> streamSink = new FlinkKafkaProducer011<String>(
                kafka_servers, AppProperties.getOutputStream(), new SimpleStringSchema());
        streamSink.setWriteTimestampToKafka(true);
        stream.addSink(streamSink);

        // Split up the lines in pairs (2-tuples) containing (word, 1), then sum per word
        DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new Tokenizer()).keyBy(0).sum(1);
        counts.addSink(ESSinkBuilder.buildElasticSearchSink(AppProperties.getESWordCountIndex()));

        env.execute("Streaming from a Kafka topic, echoing the message to Kafka, and outputting aggregations to ElasticSearch");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { ... } // body elided on the slide: emits a (word, 1) tuple per word
    }
}
14. Some operators in Apache Flink
Element-level: Map, FlatMap, Filter, Select, Project
Aggregates: KeyBy, Reduce, Fold, sum, min, max
Working with global, processing-time, or event-time windows: Window (TumblingEventTime, TumblingProcessingTime, SlidingEventTime, SlidingProcessingTime, EventTimeSession, ProcessingTimeSession, GlobalWindows), WindowAll, Window Apply, trigger, evictor, allowedLateness, sideOutputLateData, getSideOutput
Combining several streams (see the example below): Union, Join, OuterJoin, Cross, Distinct, IntervalJoin, CoGroup, Connect, CoMap, CoFlatMap, Split, PartitionCustom, Rebalance, Rescale, Shuffle, First-n, SortPartition
Optimizations: Iterate, StartNewChain, DisableChaining
Loops and asynchrony: Iterate, AsyncFunctions
SQL: SQL functions for Comparison, Logical, Arithmetic, String, Temporal, Conditional, Type, Aggregate, Collection, Columnar
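A small, hypothetical example combining a few of these operator families (Union, KeyBy, a processing-time window, and sum); the in-line sources are placeholders:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OperatorSampler {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> web = env.fromElements(Tuple2.of("shoes", 1));
        DataStream<Tuple2<String, Integer>> mobile = env.fromElements(Tuple2.of("shoes", 2));
        web.union(mobile)                                              // combine several streams
           .keyBy(t -> t.f0)                                           // aggregate per product
           .window(TumblingProcessingTimeWindows.of(Time.seconds(30))) // 30-second tumbling windows
           .sum(1)                                                     // one total per product and window
           .print();
        env.execute("A few DataStream operators combined");
    }
}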
15.
16. Why does Flink rock?
• Its own memory management
• Serialization to its own binary format
• Optimized handling of the communication between nodes and tasks
• Several options for storing state
• Checkpoints and savepoints (see the sketch below)
• Several levels of abstraction in its APIs
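For the checkpoints bullet, a minimal sketch of enabling periodic checkpoints (the 60-second interval is an arbitrary choice; savepoints are the manually triggered counterpart, taken with the flink savepoint CLI command):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // snapshot all state every 60 s
        // ... sources, transformations and sinks would be defined here ...
    }
}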
19. Advantages of Apache Beam
• A unified API for batch and streaming
• Portable across different runners (no vendor lock-in): Flink, Spark, Samza, Dataflow, Nemo, Twister2, Hazelcast Jet...
• Native support for Java, Python, and Go (with all their libraries)
• The ability to mix languages within a single pipeline
20. from __future__ import absolute_import

import argparse  # missing on the slide, needed to define known_args/pipeline_args
import re

from past.builtins import unicode

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None, save_main_session=True):
    """Main entry point; defines and runs the wordcount pipeline."""
    # Argument wiring, elided on the slide, restored so known_args/pipeline_args are defined
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input', required=True)
    parser.add_argument('--output', dest='output', required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file[pattern] into a PCollection.
        lines = p | ReadFromText(known_args.input)

        # Count the occurrences of each word.
        counts = (
            lines
            | 'Split' >> (
                beam.FlatMap(lambda x: re.findall(r"[A-Za-z']+", x))
                .with_output_types(unicode))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum))

        # Format the counts into a PCollection of strings.
        def format_result(word_count):
            (word, count) = word_count
            return '%s: %s' % (word, count)

        output = counts | 'Format' >> beam.Map(format_result)
        output | WriteToText(known_args.output)


if __name__ == '__main__':
    run()
21. package com.amazonaws.samples.beam.taxi.count;

import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.transforms.*;
// (..) further imports elided on the slide
import software.amazon.awssdk.services.cloudwatch.model.Dimension;

public class BeamTaxiCount {
    public static void main(String[] args) {
        // Merge command-line args with the Kinesis Data Analytics runtime properties
        String[] kinesisArgs = TaxiCountOptions.argsFromKinesisApplicationProperties(args, "BeamApplicationProperties");
        TaxiCountOptions options = PipelineOptionsFactory.fromArgs(ArrayUtils.addAll(args, kinesisArgs)).as(TaxiCountOptions.class);
        options.setRunner(FlinkRunner.class);
        options.setAwsRegion(Regions.getCurrentRegion().getName());
        PipelineOptionsValidator.validate(TaxiCountOptions.class, options);

        Pipeline p = Pipeline.create(options);

        // Read trip events from a Kinesis stream
        PCollection<TripEvent> input = p
            .apply("Kinesis source", KinesisIO
                .read()
                .withStreamName(options.getInputStreamName())
                .withAWSClientsProvider(new DefaultCredentialsProviderClientsProvider(Regions.fromName(options.getAwsRegion())))
                .withInitialPositionInStream(InitialPositionInStream.LATEST))
            .apply("Parse Kinesis events", ParDo.of(new EventParser.KinesisParser()));

        // Count trips in 5-second event-time windows, with early firings every 15 seconds
        PCollection<Metric> metrics = input
            .apply("Group into 5 second windows", Window
                .<TripEvent>into(FixedWindows.of(Duration.standardSeconds(5)))
                .triggering(AfterWatermark
                    .pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(15))))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
            .apply("Count globally", Combine
                .globally(Count.<TripEvent>combineFn())
                .withoutDefaults())
            .apply("Map to Metric", ParDo.of(
                new DoFn<Long, Metric>() {
                    @ProcessElement
                    public void process(ProcessContext c) {
                        c.output(new Metric(c.element().longValue(), c.timestamp()));
                    }
                }));

        // prepareMetricForCloudwatch is defined elsewhere in the sample
        prepareMetricForCloudwatch(metrics)
            .apply("CloudWatch sink", ParDo.of(new CloudWatchSink(options.getInputStreamName())));

        p.run().waitUntilFinish();
    }
}
23. Apache Flink on AWS: a shared responsibility model
From more managed to less managed:

Amazon Kinesis Data Analytics for Apache Flink (most managed)
AWS manages:
• Storage and state
• Metrics, monitoring, and a dedicated interface
• Hardware, software, network
• Provisioning and autoscaling
The customer manages:
• Application code
• Basic configuration

Amazon EMR (managed Hadoop/YARN)
AWS manages:
• Cluster scaling (YARN-based)
• Hardware, software, network
The customer manages:
• Application code
• State and storage configuration
• Security configuration of the interface
• Managing/running the applications

ECS/EKS (container management)
AWS manages:
• The container orchestration control plane
• Hardware, orchestrator software, (physical) network
The customer manages:
• Full application code and configuration
• Software installation and upgrades
• Cluster management, security, and network configuration
• Managing/running the applications
• Scaling

EC2 (infrastructure as a service, least managed)
AWS manages:
• Hardware, software, (physical) network
The customer manages:
• Full application code and configuration
• Software installation and upgrades
• Security and network configuration
• Managing/running the applications
• Scaling
• Provisioning, image installation and management, security patches