Improvements to Flink & Its Applications in Alibaba Search

Published in: Technology

  1. Blink: Improvements to Flink & Its Applications in Alibaba Search
      - Xiaowei Jiang, Feng Wang ({xiaowei.jxw, jason.wang}@alibaba-inc.com)
  2. Who Are We?
      - Xiaowei Jiang
        - 2014–now: Alibaba
        - 2010–2014: Facebook
        - 2002–2010: Microsoft
        - 2000–2002: Stratify
      - Feng Wang
        - 2006–now: Alibaba
  3. About Alibaba
      - Alibaba Group
        - Operating the world's largest online marketplace
        - Annual GMV of $394 billion in 2015
      - Alibaba Search
        - Personalized search and recommendation platform
        - Major driver of online traffic
  4. Agenda
      - Background
      - What is Blink?
      - Improvements in Blink
      - Challenges & Future
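  5. Scenario – Realtime A/B Test
      - Diagram: transaction, click and impression logs each go through Parser and Filter stages (one branch also applies a UDF); the streams are then joined (Join), aggregated (Agg) and written to Druid.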
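The A/B-test slide names the stages (Parser, Filter, Join, Agg) but not the logic inside them. Below is a minimal Python sketch, assuming a hypothetical comma-separated log format in which a click and its impression share a `request_id` and each impression carries its A/B bucket. It joins clicks to impressions and keeps a running click-through rate per bucket. The transaction log, the UDF and the Druid sink are left out, and the in-memory dicts stand in for Flink keyed state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str        # "impression" or "click" (hypothetical log schema)
    request_id: str  # join key shared by an impression and the click on it
    bucket: str      # A/B test bucket, e.g. "A" or "B"


def parse(line: str) -> Optional[Event]:
    """Parser stage: 'kind,request_id,bucket' -> Event, or None if malformed."""
    parts = line.strip().split(",")
    return Event(*parts) if len(parts) == 3 else None


class AbTestCtr:
    """Parser -> Filter -> Join -> Agg, updated one log line at a time."""

    def __init__(self) -> None:
        self.bucket_of = {}              # join state: request_id -> bucket of its impression
        self.early_clicks = set()        # clicks that arrived before their impression
        self.shown = defaultdict(int)    # impressions per bucket
        self.clicked = defaultdict(int)  # joined clicks per bucket

    def process(self, line: str) -> None:
        event = parse(line)
        if event is None or event.kind not in ("impression", "click"):
            return                       # Filter: drop malformed or irrelevant records
        if event.kind == "impression":
            self.bucket_of[event.request_id] = event.bucket
            self.shown[event.bucket] += 1
            if event.request_id in self.early_clicks:  # complete a join that was waiting
                self.early_clicks.discard(event.request_id)
                self.clicked[event.bucket] += 1
        elif event.request_id in self.bucket_of:       # Join: click matches a known impression
            self.clicked[self.bucket_of[event.request_id]] += 1
        else:
            self.early_clicks.add(event.request_id)

    def ctr(self) -> dict:
        """Agg stage: current click-through rate per bucket."""
        return {b: self.clicked[b] / self.shown[b] for b in self.shown}


job = AbTestCtr()
for line in ["impression,r1,A", "click,r1,A", "impression,r2,A",
             "impression,r3,B", "click,r3,B", "click,r4,B", "impression,r4,B", "junk"]:
    job.process(line)
print(job.ctr())  # {'A': 0.5, 'B': 1.0}
```

Clicks that arrive before their impression are buffered rather than dropped; a long-running job would also need to expire that buffer, for example with a time window.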
  6. Scenario – Search Index Build & Update
      - Diagram: data from the IC and UIC sources (tables IC1, IC2, UIC1, UIC2) is filtered and synced into HBase, the tables are joined, and the joined result (also kept in HBase) is exported to the search engine.
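The index-build slide joins IC and UIC data before exporting it to the search engine. The slide does not expand IC or UIC or give their schemas, so this sketch assumes a hypothetical one: IC rows describe items and reference a seller, UIC rows describe sellers. What it illustrates is that a change on either side of the join must re-emit every affected search document. The class name and fields are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


class IndexBuilder:
    """Keeps the latest IC and UIC rows and emits joined search documents on every change.

    Hypothetical schema: an IC row is (item_id, seller_id, title) and a UIC row is
    (seller_id, seller_name). The dicts stand in for the tables synced into HBase.
    """

    def __init__(self) -> None:
        self.items: Dict[str, Tuple[str, str]] = {}   # item_id -> (seller_id, title)
        self.sellers: Dict[str, str] = {}              # seller_id -> seller_name
        self.items_by_seller: Dict[str, Set[str]] = defaultdict(set)

    def _document(self, item_id: str) -> Optional[dict]:
        seller_id, title = self.items[item_id]
        if seller_id not in self.sellers:
            return None                   # inner join: wait until the UIC row arrives
        return {"item_id": item_id, "title": title, "seller": self.sellers[seller_id]}

    def upsert_item(self, item_id: str, seller_id: str, title: str) -> List[dict]:
        old = self.items.get(item_id)
        if old is not None:               # item moved to another seller: fix the index
            self.items_by_seller[old[0]].discard(item_id)
        self.items[item_id] = (seller_id, title)
        self.items_by_seller[seller_id].add(item_id)
        doc = self._document(item_id)
        return [doc] if doc is not None else []

    def upsert_seller(self, seller_id: str, seller_name: str) -> List[dict]:
        self.sellers[seller_id] = seller_name
        # One UIC change fans out: every item of this seller is re-exported.
        return [self._document(i) for i in sorted(self.items_by_seller[seller_id])]


builder = IndexBuilder()
print(builder.upsert_item("i1", "s1", "phone"))  # []
print(builder.upsert_seller("s1", "Shop A"))     # [{'item_id': 'i1', 'title': 'phone', 'seller': 'Shop A'}]
```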
  7. Flink: Unified Compute Engine
      - Streaming topologies → low latency
      - Long batch pipelines → resource utilization
      - Machine learning at scale → iterative algorithms
      - Graph analysis → mutable state
  8. Flink Stack
  9. What is Blink?
      - Blink – Improvements to Flink from Alibaba
        - Comprehensive improvements to the Flink Table API
        - Improved runtime, compatible with the Flink API and ecosystem
      - Status
        - Runs on thousands of nodes in Alibaba production
        - Supports mission-critical products
  10. Table API Improvements
      - Principle: a unified SQL layer for batch and streaming
      - Functionality
        - UDF/UDTF/UDAGG
        - Stream-stream join
        - Aggregation (min, max, avg, sum, count, distinct_count)
        - Windowing (time_window, count_window)
        - Retraction
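Retraction appears in the Table API list without explanation. It matters when one aggregation feeds another: once an upstream result changes, the downstream operator must remove the old value before adding the new one. Here is a minimal sketch with a hypothetical two-level query (clicks per user, then users per click count), where `"+"` accumulates a row and `"-"` retracts a previously emitted one. The function names are illustrative, not Blink's API.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Change = Tuple[str, str, int]  # (op, user, click_count); op is "+" (accumulate) or "-" (retract)


def clicks_per_user(clicks: Iterable[str]) -> Iterator[Change]:
    """First aggregation (COUNT per user), emitted as a changelog with retractions."""
    counts = Counter()
    for user in clicks:
        if counts[user] > 0:
            yield ("-", user, counts[user])  # retract the result emitted earlier
        counts[user] += 1
        yield ("+", user, counts[user])


def users_per_click_count(changes: Iterable[Change]) -> dict:
    """Second aggregation (COUNT of users per click_count) that honors retractions."""
    histogram = Counter()
    for op, _user, count in changes:
        histogram[count] += 1 if op == "+" else -1
    return {count: n for count, n in histogram.items() if n != 0}


changes = list(clicks_per_user(["alice", "bob", "alice"]))
print(changes)
# [('+', 'alice', 1), ('+', 'bob', 1), ('-', 'alice', 1), ('+', 'alice', 2)]
print(users_per_click_count(changes))  # {1: 1, 2: 1}
```

Without the `"-"` message, the second aggregation would report `{1: 2, 2: 1}`, counting alice under both click counts.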
  11. Runtime Improvements
      - New runtime architecture on YARN
      - Optimized state, checkpoint & failover
      - Reliable & production quality
      - Much more
  12. Flink on YARN
      - Diagram: the Flink YARN client on the client node (1) stores the user jar and configuration in HDFS and (2) registers resources and requests an application master from the YARN ResourceManager; (3) the app master container is allocated, running the Flink JobManager together with the YARN AppMaster, and (4) worker containers running Flink TaskManagers are allocated on other YARN nodes. Containers are always bootstrapped with the user jar and configuration.
  13. Blink on YARN
      - Diagram: the same steps, driven by the Blink client: (1) store the user jar and configuration in HDFS, (2) register resources and request an app master, (3) allocate the app master container, which runs the JobMaster, and (4) allocate workers, i.e. containers running TaskExecutors. Step 4 appears more than once, alongside several TaskExecutor containers; containers are always bootstrapped with the user jar and configuration.
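Compared with the Flink on YARN slide, the Blink diagram replaces the JobManager and TaskManagers with a JobMaster and TaskExecutors and repeats step 4 (allocate worker). This suggests that workers are requested as the job needs them rather than as a fixed set up front. The sketch below models that reading under an assumption: a hypothetical `SlotPool` asks YARN for a new container only when no existing container has a free slot. The YARN call itself is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    cid: int
    free_slots: int


@dataclass
class SlotPool:
    """Hypothetical resource manager that requests containers only when slots run out."""

    slots_per_container: int
    containers: List[Container] = field(default_factory=list)

    def request_slot(self) -> int:
        """Return the id of the container that serves this slot request."""
        for c in self.containers:
            if c.free_slots > 0:
                c.free_slots -= 1
                return c.cid
        # No free slot: this is where a new worker container would be requested from YARN.
        c = Container(cid=len(self.containers), free_slots=self.slots_per_container - 1)
        self.containers.append(c)
        return c.cid

    def release_slot(self, cid: int) -> None:
        self.containers[cid].free_slots += 1


pool = SlotPool(slots_per_container=2)
print([pool.request_slot() for _ in range(3)])  # [0, 0, 1]
print(len(pool.containers))                     # 2
```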
  14. Blink Job Architecture
      - Diagram: a Job Master container (task scheduler, checkpoint coordinator) communicates with Task Executor containers over control channels (schedule events, completed checkpoints). Each Task Executor runs tasks with input and output queues and keeps RocksDB state and spilled files locally; the YARN NodeManagers host a Shuffle Service. Data moves over local and network data channels, state is backed up to and recovered from HDFS, and ZooKeeper also appears in the setup.
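  15. Blink Checkpoint & State
      - Diagram: (1) the Job Master triggers checkpoint n, whose barrier Bn moves through the task's input queue (i1, i2, i3) and output queue (o1, o2, o3); (2) the TaskExecutor takes a local snapshot CPn of the operator state by hard-linking its state files (1.sst, 2.sst, …, n.sst); (3) the task acks to the Job Master, and the diff between CPn and CPn-1 is backed up incrementally and asynchronously to HDFS, with unchanged files kept by reference; (4) once the checkpoint completes, old checkpoints are cleaned up.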
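The checkpoint slide backs up only the diff between CPn and CPn-1. RocksDB never modifies an SST file after writing it, so the diff can be computed on file names alone. Files already in HDFS are referenced, new ones are uploaded, and a file is deleted only when no retained checkpoint references it. Below is a minimal sketch of that bookkeeping. The class and method names are hypothetical, and the actual upload, the hard-link snapshot and the asynchronous execution are omitted.

```python
from typing import Dict, Iterable, List, Set


class IncrementalBackup:
    """Tracks which SST files remote storage holds and how many checkpoints reference each."""

    def __init__(self) -> None:
        self.refcount: Dict[str, int] = {}        # remote file -> retained checkpoints using it
        self.checkpoints: Dict[int, Set[str]] = {}

    def backup(self, cp_id: int, local_files: Iterable[str]) -> List[str]:
        """Register checkpoint cp_id and return the files that actually need uploading."""
        files = set(local_files)
        to_upload = sorted(f for f in files if f not in self.refcount)  # the "diff"
        for f in files:
            self.refcount[f] = self.refcount.get(f, 0) + 1
        self.checkpoints[cp_id] = files
        return to_upload

    def discard(self, cp_id: int) -> List[str]:
        """Clean up a subsumed checkpoint; return remote files no checkpoint uses any more."""
        deleted = []
        for f in self.checkpoints.pop(cp_id):
            self.refcount[f] -= 1
            if self.refcount[f] == 0:
                del self.refcount[f]
                deleted.append(f)
        return sorted(deleted)


backup = IncrementalBackup()
print(backup.backup(1, ["1.sst", "2.sst"]))  # ['1.sst', '2.sst']
print(backup.backup(2, ["2.sst", "3.sst"]))  # ['3.sst']
print(backup.discard(1))                      # ['1.sst']
```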
  16. Blink Rescale
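The rescale slide is a diagram only and does not say how Blink redistributes state. For reference, upstream Flink makes keyed state rescalable by hashing every key into a fixed number of key groups (the maximum parallelism). Each parallel instance then owns a contiguous range of key groups, so rescaling moves whole key groups instead of rehashing individual keys. This sketch assumes Blink follows the same scheme. `crc32` stands in for Flink's murmur hash, and the function names are illustrative.

```python
import zlib
from itertools import groupby
from typing import List, Tuple


def key_group(key: str, max_parallelism: int) -> int:
    """Map a key to a key group; independent of the current parallelism."""
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def operator_index(group: int, max_parallelism: int, parallelism: int) -> int:
    """Parallel instance that owns a key group (contiguous ranges, as in upstream Flink)."""
    return group * parallelism // max_parallelism


def key_group_ranges(max_parallelism: int, parallelism: int) -> List[Tuple[int, int]]:
    """Inclusive (first, last) key-group range owned by each instance."""
    ranges = []
    for _owner, groups in groupby(range(max_parallelism),
                                  key=lambda g: operator_index(g, max_parallelism, parallelism)):
        groups = list(groups)
        ranges.append((groups[0], groups[-1]))
    return ranges


print(key_group_ranges(8, 2))  # [(0, 3), (4, 7)]
print(key_group_ranges(8, 3))  # [(0, 2), (3, 5), (6, 7)]
```

Scaling from 2 to 3 instances only reassigns ranges: key group 3, for example, moves from instance 0 to instance 1 with all of its state, while every key keeps its key group.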
  17. Blink Failover
      - Diagram: failover under at-least-once versus exactly-once semantics. Each panel shows four sources, one of which fails and is restarted as part of failover; the exactly-once panel also includes four sinks.
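The failover slide contrasts at-least-once and exactly-once recovery but does not say how Blink implements either. The following minimal simulation makes two assumptions: a replayable source that rewinds to its last checkpointed offset, and, for exactly-once, a sink that holds writes back until the checkpoint completes. That second part is a transactional-sink technique, not necessarily Blink's.

```python
from typing import List


def run(records: List[str], checkpoint_every: int, fail_at: int, exactly_once: bool) -> List[str]:
    """Replay a single source -> sink pipeline that fails once, at source offset fail_at."""
    visible: List[str] = []   # output that downstream consumers can already read
    pending: List[str] = []   # exactly-once only: writes waiting for the next checkpoint
    checkpointed_offset = 0   # source position stored in the last completed checkpoint
    offset = 0
    failed = False
    while offset < len(records):
        if not failed and offset == fail_at:
            failed = True
            offset = checkpointed_offset  # restart: rewind the source to the checkpoint
            pending.clear()               # abort writes that were never committed
            continue
        if exactly_once:
            pending.append(records[offset])
        else:
            visible.append(records[offset])  # at-least-once: written immediately
        offset += 1
        if offset % checkpoint_every == 0:   # checkpoint completes: store offset, commit writes
            checkpointed_offset = offset
            visible.extend(pending)
            pending.clear()
    visible.extend(pending)                  # final checkpoint at end of input
    return visible


print("".join(run(list("abcdef"), 2, 3, exactly_once=False)))  # abccdef
print("".join(run(list("abcdef"), 2, 3, exactly_once=True)))   # abcdef
```

In the at-least-once run, record `c` was written before the failure and written again after the source rewound to offset 2, so consumers see it twice.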
  18. Blink Metrics
      - Job vertex: number, [CPU, memory] × parallelism
      - Task metrics: in queue, out queue, TPS, latency, delay, CPU, memory
      - Running tasks
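The metrics slide sizes a job vertex as [CPU, Memory] * Parallelism. Here is a worked sketch of that arithmetic, summed over a job. The vertex names and per-task sizes are hypothetical, with CPU in cores and memory in MB.

```python
from typing import Dict, Tuple


def vertex_resources(cpu_per_task: float, memory_mb_per_task: int, parallelism: int) -> Tuple[float, int]:
    """[CPU, memory] x parallelism for one job vertex."""
    return cpu_per_task * parallelism, memory_mb_per_task * parallelism


def job_resources(vertices: Dict[str, Tuple[float, int, int]]) -> Tuple[float, int]:
    """Total of vertex_resources over all vertices of a job."""
    total_cpu, total_mem = 0.0, 0
    for cpu, mem, parallelism in vertices.values():
        v_cpu, v_mem = vertex_resources(cpu, mem, parallelism)
        total_cpu += v_cpu
        total_mem += v_mem
    return total_cpu, total_mem


job = {"source": (0.5, 1024, 4), "join": (1.0, 4096, 2)}  # (cpu, memory_mb, parallelism)
print(job_resources(job))  # (4.0, 12288)
```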
  19. Challenges & Future
      - Continued optimization in streaming
      - Batch in production
      - Machine learning in production
      - Larger cluster scale
      - Contribute back to the Flink community
  20. Q & A
      - Thank you!
      - Xiaowei Jiang: xiaowei.jxw@alibaba-inc.com, Twitter: @xiaoweij
      - Feng Wang: jason.wang@alibaba-inc.com, Twitter: @ifengwang
