Real-time data analytics with
Apache Flink and Apache Beam
Javier Ramírez - @supercoco9
Developer Advocate - Amazon Web Services
November 3-5, 2020
A possible (near) real-time system
AWS Cloud
Transformations / Validations / Filtering / Aggregations / Analytics
Analytics on a sample clickstream
AWS Cloud
Parse clicks
Every minute, compute the number of active users
Every 5 minutes, products purchased per category
Every minute, a ranking of the most-visited products
Every hour, the total number of orders
In real time, select ads
In real time, detect anomalous behaviour
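Most of these per-minute metrics boil down to tumbling-window aggregations. As a minimal plain-Python sketch (no Flink or Kafka involved; the timestamps and user ids below are made up), counting distinct active users per minute could look like this:

```python
from collections import defaultdict

def active_users_per_minute(events):
    """Group click events into 1-minute tumbling windows and count
    the distinct users seen in each window.

    events: iterable of (timestamp_seconds, user_id) pairs.
    Returns {window_start_seconds: number_of_distinct_users}.
    """
    windows = defaultdict(set)
    for ts, user in events:
        window_start = (ts // 60) * 60  # align to the minute boundary
        windows[window_start].add(user)
    return {start: len(users) for start, users in sorted(windows.items())}

clicks = [(0, "u1"), (10, "u2"), (35, "u1"),    # first minute: u1, u2
          (65, "u3"), (90, "u3"), (118, "u1")]  # second minute: u3, u1
print(active_users_per_minute(clicks))  # {0: 2, 60: 2}
```

A streaming engine does the same grouping continuously and emits each window's result as soon as it closes, instead of waiting for all the data.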
You don’t know the volume of the data before you start
Data is never complete
Low latency is expected
Events might be related, but data can come out of order
System should remain available during upgrades
Challenges of working with streaming systems
Stateless processing
• Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out
records based on their own properties)
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
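These per-element operations need no framework at all. A plain-Python sketch (with made-up records) of a stateless map followed by a filter, where each record is handled on its own with no memory of previous records:

```python
records = [
    {"user": "u1", "amount": "19.90"},
    {"user": "u2", "amount": "5.00"},
    {"user": "u3", "amount": "120.50"},
]

# Map: change the format of each item (string amount -> integer cents).
parsed = [{"user": r["user"], "cents": round(float(r["amount"]) * 100)}
          for r in records]

# Filter: keep records based only on their own properties.
large = [r for r in parsed if r["cents"] >= 1000]

print(large)  # [{'user': 'u1', 'cents': 1990}, {'user': 'u3', 'cents': 12050}]
```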
The real fun starts when you need to do transforms/aggregations over groups of elements:
group by, count, max, average, joins, filtering based on properties from related records, or
complex pattern detection
Stateful processing: Processing-Time based windows
[Diagram: fixed windows over processing time]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
Stateful processing: Event-Time Based Windows
[Diagram: input events plotted by event time vs. processing time, and the corresponding event-time windowed output]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
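The core idea of event-time windows can be sketched in plain Python (no Beam or Flink here; the events are made up): each element is assigned to a window based on the timestamp it carries, even when it arrives out of order:

```python
def tumbling_event_time_windows(events, size):
    """Assign events to fixed windows by their *event* timestamp,
    regardless of the (processing-time) order in which they arrive.

    events: iterable of (event_time, value) pairs.
    Returns {window_start: [values]}.
    """
    windows = {}
    for event_time, value in events:
        start = (event_time // size) * size
        windows.setdefault(start, []).append(value)
    return windows

# Events arrive out of order: processing order != event order.
arrived = [(12, "a"), (3, "b"), (27, "c"), (8, "d")]
print(tumbling_event_time_windows(arrived, size=10))
# {10: ['a'], 0: ['b', 'd'], 20: ['c']}
```

The hard part a real engine adds on top is deciding *when* a window can be closed (watermarks, allowed lateness), since late events for a window may still be in flight.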
Stateful processing: Session Windows
[Diagram: input events plotted by event time vs. processing time, and the session-windowed output]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
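Session windows have no fixed size: a session ends when no event arrives within a given gap. A plain-Python sketch of the grouping logic (made-up timestamps, gap in the same time unit):

```python
def session_windows(event_times, gap):
    """Group event times into sessions: a new session starts whenever
    the gap since the previous event exceeds `gap`."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # still within the current session
        else:
            sessions.append([t])    # gap exceeded: start a new session
    return sessions

times = [1, 2, 3, 30, 31, 90]
print(session_windows(times, gap=10))  # [[1, 2, 3], [30, 31], [90]]
```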
Challenge: keeping state between events
• The system has to know which stage each element is at, and whether it is in an intermediate
state or has already been fully processed
• For operations that need "memory", the system has to keep the state of
elements and intermediate computations
• In a sufficiently large system, the state will be distributed
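A toy illustration of why this matters: even a simple per-key running count needs state that survives between events, state a real engine must additionally partition across nodes and checkpoint. A plain-Python sketch:

```python
class RunningCount:
    """Toy keyed state: per-key counters kept between events -- the kind
    of state a streaming engine manages (and, at scale, distributes and
    checkpoints) on our behalf."""

    def __init__(self):
        self.state = {}  # key -> intermediate result

    def on_event(self, key):
        # Update the stored intermediate result for this key.
        self.state[key] = self.state.get(key, 0) + 1
        return self.state[key]

counter = RunningCount()
for user in ["u1", "u2", "u1", "u1"]:
    counter.on_event(user)
print(counter.state)  # {'u1': 3, 'u2': 1}
```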
Apache Flink
• Stateful computations over data streams
https://flink.apache.org
package com.javier_cloud.demos.streaming;
import com.javier_cloud.demos.streaming.util.AppProperties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import java.util.Properties;
public class KafkaStreaming {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
AppProperties.loadProperties(env);
Properties kafkaProperties = new Properties();
String kafka_servers = AppProperties.getBootstrapServers();
kafkaProperties.setProperty("bootstrap.servers", kafka_servers);
kafkaProperties.setProperty("group.id", AppProperties.getGroupId());
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer011<>(AppProperties.getInputStream(), new SimpleStringSchema(),
kafkaProperties));
FlinkKafkaProducer011<String> streamSink = new FlinkKafkaProducer011<>(kafka_servers, AppProperties.getOutputStream(),
new SimpleStringSchema());
streamSink.setWriteTimestampToKafka(true);
stream.addSink(streamSink);
env.execute("Basic Flink Kafka Streaming");
}
}
package com.javier_cloud.demos.streaming;
import com.javier_cloud.demos.streaming.util.AppProperties;
import com.javier_cloud.demos.streaming.util.ESSinkBuilder;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.util.Collector;
import java.util.Properties;
public class KafkaStreamingToES {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
AppProperties.loadProperties(env);
Properties kafkaProperties = new Properties();
String kafka_servers = AppProperties.getBootstrapServers();
kafkaProperties.setProperty("bootstrap.servers", kafka_servers);
kafkaProperties.setProperty("group.id", AppProperties.getGroupId());
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer011<>(AppProperties.getInputStream(), new SimpleStringSchema(),
kafkaProperties));
FlinkKafkaProducer011<String> streamSink = new FlinkKafkaProducer011<String>(kafka_servers,
AppProperties.getOutputStream(), new SimpleStringSchema());
streamSink.setWriteTimestampToKafka(true);
stream.addSink(streamSink);
// split up the lines in pairs (2-tuples) containing: (word,1), then sum
DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new Tokenizer()).keyBy(0).sum(1);
counts.addSink(ESSinkBuilder.buildElasticSearchSink(AppProperties.getESWordCountIndex()));
env.execute("Streaming from a Kafka topic, echoing the message to Kafka, and outputting aggregations to ElasticSearch");
}
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { ... }
}
}
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .create_temporary_table('mySource')

t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

t_env.from_path('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')

t_env.execute("tutorial_job")
Some operators in Apache Flink

Type — Operators/Functions
Element-level — Map, FlatMap, Filter, Select, Project
Aggregations — KeyBy, Reduce, Fold, sum, min, max
Global, processing-time, and event-time windows — Window (TumblingEventTime, TumblingProcessingTime, SlidingEventTime, SlidingProcessingTime, EventTimeSession, ProcessingTimeSession, GlobalWindows), WindowAll, Window Apply, trigger, evictor, allowedLateness, sideOutputLateData, getSideOutput
Combining multiple streams — Union, Join, OuterJoin, Cross, Distinct, IntervalJoin, CoGroup, Connect, CoMap, CoFlatMap, Split, PartitionCustom, Rebalance, Rescale, Shuffle, First-n, SortPartition
Optimizations — Iterate, StartNewChain, DisableChaining
Loops and asynchrony — Iterate, AsyncFunctions
SQL — SQL functions for: Comparison, Logical, Arithmetic, String, Temporal, Conditional, Type, Aggregate, Collection, Columnar
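As a rough plain-Python analogy (not Flink code), the KeyBy/sum combination used in the earlier word-count example amounts to partitioning pairs by key and keeping a running total per key:

```python
def key_by_sum(pairs):
    """Plain-Python sketch of keyBy(...).sum(...): group (key, value)
    pairs by key and keep a running sum per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

print(key_by_sum([("a", 1), ("b", 2), ("a", 3)]))  # {'a': 4, 'b': 2}
```

In Flink the same operation runs incrementally and in parallel: each key is routed to one task, which keeps that key's running total as managed state.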
Why Flink rocks
• Its own memory management
• Serialization to its own binary format
• Optimized communication between nodes and tasks
• Several options for storing state
• Checkpoints and savepoints
• Several levels of abstraction in its APIs
Demo: analyzing user clickstreams
Using Apache Kafka, Apache Flink, and Elasticsearch
Where does Apache Beam fit into all this?
Advantages of Apache Beam
• A unified API for batch and streaming
• Portable across different runners (no vendor lock-in): Flink, Spark,
Samza, Dataflow, Nemo, Twister2, Hazelcast Jet...
• Native support for Java, Python, and Go (with all of their libraries)
• The ability to mix languages within a single pipeline
from __future__ import absolute_import

import argparse
import re

from past.builtins import unicode

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

def run(argv=None, save_main_session=True):
  """Main entry point; defines and runs the wordcount pipeline."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--input', required=True, help='Input file to process.')
  parser.add_argument('--output', required=True, help='Output file to write results to.')
  known_args, pipeline_args = parser.parse_known_args(argv)
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  with beam.Pipeline(options=pipeline_options) as p:
    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)
    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(lambda x: re.findall(r"[A-Za-z']+", x)).
            with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))
    # Format the counts into a PCollection of strings.
    def format_result(word_count):
      (word, count) = word_count
      return '%s: %s' % (word, count)
    output = counts | 'Format' >> beam.Map(format_result)
    output | WriteToText(known_args.output)

if __name__ == '__main__':
  run()
package com.amazonaws.samples.beam.taxi.count;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.transforms.*;
(..)
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
public class BeamTaxiCount {
public static void main(String[] args) {
String[] kinesisArgs = TaxiCountOptions.argsFromKinesisApplicationProperties(args,"BeamApplicationProperties");
TaxiCountOptions options = PipelineOptionsFactory.fromArgs(ArrayUtils.addAll(args, kinesisArgs)).as(TaxiCountOptions.class);
options.setRunner(FlinkRunner.class);
options.setAwsRegion(Regions.getCurrentRegion().getName());
PipelineOptionsValidator.validate(TaxiCountOptions.class, options);
Pipeline p = Pipeline.create(options);
PCollection<TripEvent> input = p
.apply("Kinesis source", KinesisIO
.read()
.withStreamName(options.getInputStreamName())
.withAWSClientsProvider(new DefaultCredentialsProviderClientsProvider(Regions.fromName(options.getAwsRegion())))
.withInitialPositionInStream(InitialPositionInStream.LATEST)
)
.apply("Parse Kinesis events", ParDo.of(new EventParser.KinesisParser()));
PCollection<Metric> metrics = input
.apply("Group into 5 second windows", Window
.<TripEvent>into(FixedWindows.of(Duration.standardSeconds(5)))
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(15)))
)
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes() )
.apply("Count globally", Combine
.globally(Count.<TripEvent>combineFn())
.withoutDefaults()
)
.apply("Map to Metric", ParDo.of(
new DoFn<Long, Metric>() {
@ProcessElement
public void process(ProcessContext c) {
c.output(new Metric(c.element().longValue(), c.timestamp()));
}
}
));
prepareMetricForCloudwatch(metrics)
.apply("CloudWatch sink", ParDo.of(new CloudWatchSink(options.getInputStreamName())));
p.run().waitUntilFinish();
}
}
Demo: analyzing taxi trips with Apache Beam
Using real-time processing, and batch for backfilling
Apache Flink on AWS
Shared responsibility model, from more managed to less managed:

Amazon Kinesis Data Analytics for Apache Flink (most managed)
• AWS manages: storage and state; metrics, monitoring, and a dedicated interface; hardware, software, and network; provisioning and autoscaling
• The customer manages: application code; basic configuration

Amazon EMR (managed Hadoop/Yarn)
• AWS manages: cluster scaling (Yarn-based); hardware, software, and network
• The customer manages: application code; state and storage configuration; interface security configuration; application management and execution

ECS/EKS (container management)
• AWS manages: the container orchestration control plane; hardware, orchestrator software, and (physical) network
• The customer manages: full application code and configuration; software installation and upgrades; cluster management, security, and network configuration; application management and execution; scaling

EC2 (infrastructure as a service, least managed)
• AWS manages: hardware, software, and (physical) network
• The customer manages: full application code and configuration; software installation and upgrades; security and network configuration; application management and execution; scaling; provisioning, image installation and management, security patches
Thank you!
Javier Ramírez - @supercoco9
Developer Advocate - Amazon Web Services
November 3-5, 2020
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)
javier ramirez
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWS
javier ramirez
 
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
javier ramirez
 
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
javier ramirez
 

More from javier ramirez (20)

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest
 
QuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databaseQuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series database
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDB
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Database
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragón
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloud
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analytics
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipeline
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Dive
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWS
 
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
 
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
 

Recently uploaded

SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 

Recently uploaded (20)

SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 

Real-time data analytics with Apache Flink and Apache Beam

  • 1. Real-time data analytics with Apache Flink and Apache Beam
Javier Ramírez - @supercoco9
Developer Advocate - Amazon Web Services
November 3-5, 2020
  • 2. A possible (near) real-time system
AWS Cloud: transformations / validations / filtering / aggregations / analytics
  • 3. Analytics on a possible clickstream (AWS Cloud)
Parse the clicks, then:
• Every minute, compute the number of active users
• Every 5 minutes, products purchased by category
• Every minute, ranking of the most-visited products
• Every hour, total number of orders
• In real time, select ads
• In real time, detect anomalous behaviour
  • 4. Challenges of working with streaming systems
• You don't know the volume of the data before you start
• Data is never complete
• Low latency is expected
• Events might be related, but data can come out of order
• The system should remain available during upgrades
  • 5. Stateless processing
• Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out records based on their own properties)
• The real fun starts when you need to do transforms/aggregations over groups of elements: group by, count, max, average, joins, filtering based on properties from related records, or complex pattern detection
(Diagram: individual events plotted along the processing-time axis, 8:00 to 14:00.)
Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
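The stateless/stateful split above can be sketched in a few lines of plain Python. This is a toy illustration, not Flink or Beam API; the `Click` record and both helper functions are invented for this example:

```python
from dataclasses import dataclass

@dataclass
class Click:
    user: str
    url: str

# Stateless: each element is handled on its own, with no memory between events.
def parse_and_filter(raw_events):
    for raw in raw_events:
        user, url = raw.split(",")
        if url.startswith("/product"):       # filter on the record's own properties
            yield Click(user=user, url=url)  # reformat each item independently

# Stateful: counting per key needs memory that outlives any single element.
def count_by_user(clicks):
    counts = {}                              # state kept between events
    for click in clicks:
        counts[click.user] = counts.get(click.user, 0) + 1
    return counts

raw = ["ana,/product/1", "bob,/home", "ana,/product/2"]
print(count_by_user(parse_and_filter(raw)))  # {'ana': 2}
```

The stateless stage could run anywhere with no coordination; the stateful stage is where a streaming engine has to manage, partition, and checkpoint the `counts` dictionary for us.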
  • 6. Stateful processing: processing-time-based windows
(Diagram: events bucketed into fixed windows along the processing-time axis, 8:00 to 14:00.)
Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
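A processing-time tumbling window simply buckets events by the wall-clock time at which they are processed. A minimal plain-Python sketch (not Flink API; arrival timestamps are passed in explicitly to keep it deterministic):

```python
WINDOW_SIZE = 60  # seconds: one-minute tumbling windows

def tumbling_window(arrival_ts, size=WINDOW_SIZE):
    # Every timestamp maps to the start of exactly one fixed-size window.
    return arrival_ts - (arrival_ts % size)

def count_per_window(arrivals):
    counts = {}
    for ts in arrivals:
        w = tumbling_window(ts)
        counts[w] = counts.get(w, 0) + 1
    return counts

# Events processed at t=10s, 50s and 70s fall into windows [0,60) and [60,120).
print(count_per_window([10, 50, 70]))  # {0: 2, 60: 1}
```

Note that with processing time, a late-arriving event is counted in whatever window is open when it shows up, which is what event-time windows (next slide) fix.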
  • 7. Stateful processing: event-time-based windows
(Diagram: input events plotted by event time vs processing time, 10:00 to 15:00; output windows grouped by event time.)
Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
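With event time, each event carries its own timestamp, so an element that arrives late still lands in the window where it happened. A toy sketch (not Flink API; it ignores watermarks and simply assigns events to one-hour windows):

```python
def event_time_windows(events, size=3600):
    """events: iterable of (event_time_seconds, value), in arrival order."""
    windows = {}
    for ts, value in events:
        w = ts - (ts % size)                     # window chosen by when the event
        windows.setdefault(w, []).append(value)  # HAPPENED, not when it arrived
    return windows

# "b" happened before "a" but arrives later; it still lands in the 11:00 window.
events = [(12 * 3600 + 5, "a"), (11 * 3600 + 50, "b"), (12 * 3600 + 30, "c")]
out = event_time_windows(events)
print(sorted(out.keys()))  # [39600, 43200], i.e. the 11:00 and 12:00 windows
```

What the sketch leaves out is the hard part a real engine handles: deciding when a window is complete (watermarks) and what to do with elements that arrive after it has fired.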
  • 8. Stateful processing: session windows
(Diagram: input events plotted by event time vs processing time, 10:00 to 15:00; output grouped into gap-separated session windows in event time.)
Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
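Session windows have no fixed size: a window stays open while events keep arriving and closes after a gap of inactivity. A toy sketch of the grouping rule (not Flink API; it assumes a single key and sorts by event time up front):

```python
SESSION_GAP = 15 * 60  # close a session after 15 minutes of inactivity

def sessionize(event_times, gap=SESSION_GAP):
    sessions = []
    for ts in sorted(event_times):
        # A gap larger than SESSION_GAP since the last event starts a new session.
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Three clicks within minutes of each other, then one more an hour later:
print(sessionize([0, 60, 300, 4000]))  # [[0, 60, 300], [4000]]
```

In a real engine this runs per key (e.g. per user) and incrementally, merging windows as late events bridge two sessions.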
  • 9. Challenge: keeping state between events
• The system has to know which stage each element is in, and whether it is in an intermediate state or has already been fully processed
• For operations that need "memory", the system has to keep the state of the elements and the intermediate computations
• In a sufficiently large system, the state will be distributed
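The bullets above are what checkpointing solves: the engine periodically snapshots the keyed state so it can roll back after a failure. A toy sketch of that idea in plain Python (the `KeyedCounter` class is invented for this example and only loosely mimics what a checkpoint gives a streaming engine):

```python
import copy

class KeyedCounter:
    """Per-key running counts plus snapshot/restore."""
    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        return self.state[key]

    def snapshot(self):
        return copy.deepcopy(self.state)   # a consistent point-in-time copy

    def restore(self, snap):
        self.state = copy.deepcopy(snap)   # recover after a failure

op = KeyedCounter()
for k in ["ana", "bob", "ana"]:
    op.process(k)
snap = op.snapshot()        # {'ana': 2, 'bob': 1}
op.process("ana")           # state moves on...
op.restore(snap)            # ...and can be rolled back to the snapshot
print(op.state)             # {'ana': 2, 'bob': 1}
```

Flink's actual mechanism (distributed snapshots coordinated with barriers across many nodes, plus replayable sources) is far more involved, but the contract is the same: state can be restored to a consistent point and processing resumed from there.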
  • 10. Apache Flink
• Stateful computations over data streams
https://flink.apache.org
  • 11.
package com.javier_cloud.demos.streaming;

import com.javier_cloud.demos.streaming.util.AppProperties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;

import java.util.Properties;

public class KafkaStreaming {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        AppProperties.loadProperties(env);

        Properties kafkaProperties = new Properties();
        String kafka_servers = AppProperties.getBootstrapServers();
        kafkaProperties.setProperty("bootstrap.servers", kafka_servers);
        kafkaProperties.setProperty("group.id", AppProperties.getGroupId());

        // Read from the input Kafka topic...
        DataStream<String> stream = env.addSource(
            new FlinkKafkaConsumer011<>(AppProperties.getInputStream(), new SimpleStringSchema(), kafkaProperties));

        // ...and echo every message to the output topic
        FlinkKafkaProducer011<String> streamSink = new FlinkKafkaProducer011<>(
            kafka_servers, AppProperties.getOutputStream(), new SimpleStringSchema());
        streamSink.setWriteTimestampToKafka(true);
        stream.addSink(streamSink);

        env.execute("Basic Flink Kafka Streaming");
    }
}
  • 12.
package com.javier_cloud.demos.streaming;

import com.javier_cloud.demos.streaming.util.AppProperties;
import com.javier_cloud.demos.streaming.util.ESSinkBuilder;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.util.Collector;

import java.util.Properties;

public class KafkaStreamingToES {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        AppProperties.loadProperties(env);

        Properties kafkaProperties = new Properties();
        String kafka_servers = AppProperties.getBootstrapServers();
        kafkaProperties.setProperty("bootstrap.servers", kafka_servers);
        kafkaProperties.setProperty("group.id", AppProperties.getGroupId());

        DataStream<String> stream = env.addSource(
            new FlinkKafkaConsumer011<>(AppProperties.getInputStream(), new SimpleStringSchema(), kafkaProperties));

        // Echo every message to the output Kafka topic
        FlinkKafkaProducer011<String> streamSink = new FlinkKafkaProducer011<String>(
            kafka_servers, AppProperties.getOutputStream(), new SimpleStringSchema());
        streamSink.setWriteTimestampToKafka(true);
        stream.addSink(streamSink);

        // Split up the lines in pairs (2-tuples) containing: (word, 1), then sum
        DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new Tokenizer()).keyBy(0).sum(1);
        counts.addSink(ESSinkBuilder.buildElasticSearchSink(AppProperties.getESWordCountIndex()));

        env.execute("Streaming from a Kafka topic, echoing the message to Kafka, and outputting aggregations to ElasticSearch");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            ...
        }
    }
}
  • 13.
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .create_temporary_table('mySource')

t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

t_env.from_path('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')

t_env.execute("tutorial_job")
  • 14. Some operators in Apache Flink
• Element-level: Map, FlatMap, Filter, Select, Project
• Aggregates: KeyBy, Reduce, Fold, sum, min, max
• Working with global, processing-time, or event-time windows: Window (TumblingEventTime, TumblingProcessingTime, SlidingEventTime, SlidingProcessingTime, EventTimeSession, ProcessingTimeSession, GlobalWindows), WindowAll, Window Apply, trigger, evictor, allowedLateness, sideOutputLateData, getSideOutput
• Combining several streams: Union, Join, OuterJoin, Cross, Distinct, IntervalJoin, CoGroup, Connect, CoMap, CoFlatMap, Split, PartitionCustom, Rebalance, Rescale, Shuffle, First-n, SortPartition
• Optimisations: Iterate, StartNewChain, DisableChaining
• Loops and asynchrony: Iterate, AsyncFunctions
• SQL: SQL functions for Comparison, Logical, Arithmetic, String, Temporal, Conditional, Type, Aggregate, Collection, Columnar
  • 15.
  • 16. Why does Flink rock?
• Its own memory management
• Serialisation to its own binary format
• Optimised communication between nodes and tasks
• Options for storing state
• Checkpoints and savepoints
• Several levels of abstraction in its APIs
  • 17. Demo: analysing a user clickstream
Using Apache Kafka, Apache Flink, and Elasticsearch
  • 18. Where does Apache Beam fit into all this?
  • 19. Advantages of Apache Beam
• A unified API for batch and streaming
• Portable to different runners (no vendor lock-in): Flink, Spark, Samza, Dataflow, Nemo, Twister2, Hazelcast Jet...
• Native support for Java, Python, and Go (with all their libraries)
• The ability to mix languages in a single pipeline
  • 20.
from __future__ import absolute_import

import argparse
import re

from past.builtins import unicode

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None, save_main_session=True):
    """Main entry point; defines and runs the wordcount pipeline."""
    # Argument parsing was elided on the slide; this is the standard wordcount
    # setup that defines known_args and pipeline_args.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input', help='Input file to process.')
    parser.add_argument('--output', dest='output', required=True,
                        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file[pattern] into a PCollection.
        lines = p | ReadFromText(known_args.input)

        # Count the occurrences of each word.
        counts = (
            lines
            | 'Split' >> (beam.FlatMap(lambda x: re.findall(r"[A-Za-z']+", x))
                          .with_output_types(unicode))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum))

        # Format the counts into a PCollection of strings.
        def format_result(word_count):
            (word, count) = word_count
            return '%s: %s' % (word, count)

        output = counts | 'Format' >> beam.Map(format_result)
        output | WriteToText(known_args.output)


if __name__ == '__main__':
    run()
  • 21.
package com.amazonaws.samples.beam.taxi.count;

import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.transforms.*;
(..)
import software.amazon.awssdk.services.cloudwatch.model.Dimension;

public class BeamTaxiCount {
  public static void main(String[] args) {
    String[] kinesisArgs = TaxiCountOptions.argsFromKinesisApplicationProperties(args, "BeamApplicationProperties");
    TaxiCountOptions options = PipelineOptionsFactory.fromArgs(ArrayUtils.addAll(args, kinesisArgs)).as(TaxiCountOptions.class);
    options.setRunner(FlinkRunner.class);
    options.setAwsRegion(Regions.getCurrentRegion().getName());
    PipelineOptionsValidator.validate(TaxiCountOptions.class, options);

    Pipeline p = Pipeline.create(options);

    PCollection<TripEvent> input = p
        .apply("Kinesis source", KinesisIO
            .read()
            .withStreamName(options.getInputStreamName())
            .withAWSClientsProvider(new DefaultCredentialsProviderClientsProvider(Regions.fromName(options.getAwsRegion())))
            .withInitialPositionInStream(InitialPositionInStream.LATEST))
        .apply("Parse Kinesis events", ParDo.of(new EventParser.KinesisParser()));

    PCollection<Metric> metrics = input
        .apply("Group into 5 second windows", Window
            .<TripEvent>into(FixedWindows.of(Duration.standardSeconds(5)))
            .triggering(AfterWatermark
                .pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(15))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply("Count globally", Combine
            .globally(Count.<TripEvent>combineFn())
            .withoutDefaults())
        .apply("Map to Metric", ParDo.of(
            new DoFn<Long, Metric>() {
              @ProcessElement
              public void process(ProcessContext c) {
                c.output(new Metric(c.element().longValue(), c.timestamp()));
              }
            }));

    prepareMetricForCloudwatch(metrics)
        .apply("CloudWatch sink", ParDo.of(new CloudWatchSink(options.getInputStreamName())));

    p.run().waitUntilFinish();
  }
}
  • 22. Demo: analysing taxi trips with Apache Beam
Using real-time data, and batch for the backfilling
  • 23. Apache Flink on AWS: the shared-responsibility model (from more managed to less managed)

Amazon Kinesis Data Analytics for Apache Flink
  AWS manages: storage and state; metrics, monitoring, and a dedicated interface; hardware, software, network; provisioning and auto-scaling
  The customer manages: application code; basic configuration

Amazon EMR (managed Hadoop/Yarn)
  AWS manages: cluster scaling (Yarn-based); hardware, software, network
  The customer manages: application code; state and storage configuration; interface security configuration; managing/running the applications

ECS/EKS (container management)
  AWS manages: the container-orchestration control plane; hardware, orchestrator software, (physical) network
  The customer manages: full application code and configuration; software installation and upgrades; cluster management, security, and network configuration; managing/running the applications; scaling

EC2 (infrastructure as a service)
  AWS manages: hardware, software, (physical) network
  The customer manages: full application code and configuration; software installation and upgrades; security and network configuration; managing/running the applications; scaling; provisioning, image installation and management, security patches
  • 24. Thank you!
Javier Ramírez - @supercoco9
Developer Advocate - Amazon Web Services
November 3-5, 2020