
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

The increasing number of data sources in today's application stacks has created a demand to continuously capture and process data from various sources, quickly turning high-volume streams of raw data into actionable insights. Apache Flink addresses many of the challenges faced in this domain, as it is specifically tailored to distributed computations over streams. While Flink provides all the capabilities necessary to process streaming data, provisioning and maintaining a Flink cluster still requires considerable effort and expertise. We will discuss how cloud services can remove most of the burden of running the clusters underlying your Flink jobs and explain how to build a real-time processing pipeline on top of AWS by integrating Flink with Amazon Kinesis and Amazon EMR. We will furthermore illustrate how to leverage the reliable, scalable, and elastic nature of the AWS cloud to create and operate your real-time processing pipeline with little operational overhead.


1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dr. Steffen Hausmann, Solutions Architect, AWS. September 13, 2017. Build a Real-time Stream Processing Pipeline with Apache Flink on AWS
2. Stream Processing Challenges
   • Consistency and high availability
   • Low latency and high throughput
   • Rich forms of queries
   • Event time and out-of-order events
3. Apache Flink. "Apache Flink® is an open source platform for distributed stream and batch data processing." https://flink.apache.org/ http://data-artisans.com/why-apache-flink/
4. Analyzing NYC Taxi Rides in Real-time
5. Simple Pattern for Streaming Data
   • Data Producer (e.g., a mobile client): continuously creates data and writes it to a stream; can be almost anything
   • Streaming Storage (e.g., Amazon Kinesis): durably stores data, provides a temporary buffer, supports very high throughput
   • Data Consumer (e.g., Apache Flink): continuously processes data; cleans, prepares, and aggregates; transforms data into information
6. Amazon Kinesis Streams
   • Create streams to capture and store streaming data
   • Replicates your streaming data across three facilities
   • Elastically add and remove shards to scale throughput
   • Secured via AWS IAM and server-side encryption
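As a sketch of how such a stream could be provisioned and later resharded from the AWS CLI (the stream name and shard counts are illustrative, not taken from the talk):

```shell
# Create a Kinesis stream with 8 shards (name and count are illustrative)
aws kinesis create-stream --stream-name taxi-rides --shard-count 8

# Later, scale throughput elastically by resharding
aws kinesis update-shard-count --stream-name taxi-rides \
    --target-shard-count 16 --scaling-type UNIFORM_SCALING
```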
7. Amazon Elastic MapReduce (EMR)
   • Easily provision and manage clusters for your big data needs
   • Hadoop, Flink, Spark, Presto, HBase, Tez, Hive, Pig, …
   • Dynamically scalable, persistent, or transient clusters
   • Tightly integrated with other AWS services, e.g., for storage, encryption, and monitoring
8. Amazon Elasticsearch Service
   • Set up an Elasticsearch cluster in minutes
   • Integrated with Logstash and Kibana
   • Scale Elasticsearch clusters seamlessly
   • Highly available and reliable
   • Tightly integrated with other AWS services
9. Architecture for Analyzing Taxi Rides: Amazon Kinesis Streams → Apache Flink on Amazon EMR → Amazon ES
10. Let's dive right in!
11. Lessons Learned and Best Practices
12. Building the Flink Kinesis Connector
   • The Flink Kinesis Connector binary is not available from Maven Central
   • Build the connector with Maven 3.0.x, 3.1.x, or 3.2.x: mvn clean install -Pinclude-kinesis -DskipTests -Dhadoop-two.version=2.7.3
   • … or use CodeBuild to let it be built for you!
13. Important Parameters of the Kinesis Connector
   • AWS_CREDENTIALS_PROVIDER: determines how Flink obtains IAM credentials; set to AUTO and use appropriate roles with the EMR cluster
   • SHARD_GETRECORDS_INTERVAL_MILLIS: determines how often Flink polls events from Kinesis; set to at least 1000 to facilitate multiple consumers
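A minimal sketch of how these settings can be passed to the connector as a Properties object; the string keys mirror the constants in flink-connector-kinesis (ConsumerConfigConstants), while the region and stream name are illustrative:

```java
import java.util.Properties;

public class KinesisConsumerConfig {

    // Builds the configuration handed to FlinkKinesisConsumer; the string
    // keys correspond to ConsumerConfigConstants in flink-connector-kinesis.
    public static Properties consumerConfig() {
        Properties props = new Properties();
        props.setProperty("aws.region", "eu-central-1");  // illustrative region
        // AUTO picks up the IAM role attached to the EMR instances
        props.setProperty("aws.credentials.provider", "AUTO");
        // Poll each shard at most once per second, leaving headroom
        // for other consumers of the same stream
        props.setProperty("flink.shard.getrecords.intervalmillis", "1000");
        return props;
    }

    public static void main(String[] args) {
        // In a Flink job this configuration would be used as:
        // new FlinkKinesisConsumer<>("taxi-rides", new SimpleStringSchema(), consumerConfig())
        System.out.println(consumerConfig());
    }
}
```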
14. Connecting to the Flink Dashboard
   • Use dynamic port forwarding to the master node: ssh -D 8157 hadoop@...
   • Use FoxyProxy to redirect URLs to localhost: *ec2*.amazonaws.com*, *.compute.internal*
   • Connect through the EMR console: navigate to the YARN Resource Manager and select the Flink ApplicationMaster
15. Starting Flink and Submitting Jobs
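A sketch of how a job is typically started on the EMR master node with the Flink 1.3-era YARN client; the jar name, container counts, and parallelism are illustrative placeholders:

```shell
# Submit a job directly to a per-job YARN cluster:
# 2 task managers (-yn), 4 slots each (-ys), job parallelism 8 (-p)
flink run -m yarn-cluster -yn 2 -ys 4 -p 8 taxi-ride-analysis.jar

# Alternatively, start a long-running session first and submit jobs into it
yarn-session.sh -n 2 -s 4 -d
flink run -p 8 taxi-ride-analysis.jar
```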
16. Important Kinesis Streams Metrics
17. Checkpointing and High Availability
   • ZooKeeper can be bootstrapped on EMR
   • Overprovision the EMR cluster for fast failovers
   • Use externalized checkpoints and store them on Amazon S3
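A sketch of flink-conf.yaml settings that would enable ZooKeeper-based high availability and S3-backed checkpoints; the bucket paths and ZooKeeper quorum are placeholders, and the key names follow the Flink 1.3 configuration:

```yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.storageDir: s3://my-bucket/flink/ha
state.backend: filesystem
state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
state.checkpoints.dir: s3://my-bucket/flink/ext-checkpoints
```

In Flink 1.3, externalized checkpoints additionally have to be enabled in the job itself via the CheckpointConfig of the execution environment.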
18. Build a Stream Processing Pipeline Yourself
   • Many examples with sample code are on the AWS Big Data Blog. Follow the blog!
   • Build a Real-time Stream Processing Pipeline with Apache Flink on AWS: https://github.com/awslabs/flink-stream-processing-refarch/
19. Thank you! shausma@amazon.de