DON'T FLINK. YOU'LL MISS IT! - INTRODUCING CLOUDERA STREAMING ANALYTICS
Andrew Psaltis – Field CTO – Asia Pacific and Japan
WHY STREAMING? WHY NOW?
© 2019 Cloudera, Inc. All rights reserved. 3
STREAM PROCESSING IS THE NEW
DATA PROCESSING PARADIGM
INDUSTRIES SHIFTING FROM REACTIVE TO PROACTIVE
From mass branding …to 1x1 targeting
From educated investing …to automated algorithms
From static branding …to real-time personalization
From break then fix …to repair before break
MOVING FROM BATCH TO STREAMING
Batch
• High-Latency Apps
• Static Files
• Process-After-Store
Streaming
• Low-Latency Apps
• Event Streams
• Sense and Respond
CLOUDERA DATA FLOW AND
STREAMING ANALYTICS
CLOUDERA DATA PLATFORM
CLOUDERA DATA FLOW
Enterprise Services
Provisioning, Management
and Monitoring
Unified Security
Edge-to-Enterprise Governance
Single Sign-On
Edge Management
Edge data collection,
Routing and monitoring
MiNiFi
Edge Flow Manager
NiFi Registry
Flow Management
Enterprise data ingestion,
transformation and
enrichment
Apache NiFi
NiFi Registry
Stream Processing
Real-time stream
processing at IoT scale
Apache Kafka
Schema Registry
Streaming Analytics
Predictive analytics
and real-time insights
Kafka Streams
Apache Flink
Spark Streaming
Streams Management
Streams Messaging Manager
Streams Replication Manager
THE COMPLETE AND CONNECTED DATA LIFECYCLE
Collect: Edge & Flow Management
Act
Data-in-Motion
Curate: Data Engineering
Report: Data Warehouse
Serve: Operational Database
Predict: Machine Learning and AI
Data-at-Rest
Distribute, Buffer, Analyze
A Connected Data Lifecycle is Critical to Meet the Needs of Real-time Use Cases
WHY FLINK?
WHAT IS COMMON ACROSS THESE ENTERPRISES?
They all process billions of events every day with Flink!
Details about their use cases and more users are listed on Flink’s website at https://flink.apache.org/poweredby.html
and at the Flink Forward website at https://www.ververica.com/flink-forward-san-francisco-2019.
KEY FEATURES
● Guaranteed correctness
○ Exactly-once state consistency
○ Event-time semantics
● Flexible deployments and large ecosystem
○ Kubernetes, YARN, Mesos, Docker, S3, HDFS, Kafka, Kinesis
● In-memory processing at massive scale
○ Runs on 100000s of cores
○ Manages 100s of TBs of state
● Flexible and expressive APIs
● Serves most streaming & batch use cases
○ Data Pipelines, Analytics, CEP, Event-driven Applications
MAINTAINING AND CHECKPOINTING STATE
● Flink maintains state locally per task (in-mem / on-disk)
○ Fast access!
● State is periodically checkpointed to durable storage
○ A checkpoint is a consistent snapshot of the state of all tasks
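The two bullets above can be sketched in a few lines of plain Python. This is a toy illustration and not Flink code: `CountingTask`, `run_with_checkpoints`, the `interval` parameter, and the in-memory `durable_storage` list (standing in for something like HDFS or S3) are all hypothetical names chosen for the example.

```python
import copy


class CountingTask:
    """Toy stand-in for one stateful task: counts events per key."""

    def __init__(self):
        self.state = {}  # local state, cheap to read and update

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1


def run_with_checkpoints(events, interval):
    """Process events and snapshot the state every `interval` events."""
    task = CountingTask()
    durable_storage = []  # assumption: a list stands in for durable storage
    for offset, key in enumerate(events, start=1):
        task.process(key)
        if offset % interval == 0:
            # A checkpoint pairs the input position with a copy of the state.
            durable_storage.append(
                {"offset": offset, "state": copy.deepcopy(task.state)}
            )
    return task, durable_storage


task, checkpoints = run_with_checkpoints(["a", "b", "a", "c", "a"], interval=2)
```

The snapshot is a deep copy, so later updates to the local state do not change a checkpoint that has already been written.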
RECOVERY AND GUARANTEED CONSISTENCY
● Recovery is like loading a saved computer game
● Flink recovers state with exactly-once consistency
○ After a failure, the application is restarted
○ All tasks load their state from the latest checkpoint
○ The application continues as if the failure never happened...
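A rough sketch of the recovery steps above, in plain Python rather than Flink. It assumes the input can be replayed from a recorded position (as with Kafka offsets); `count_events`, `interval`, and `fail_at` are hypothetical names for this illustration.

```python
import copy


def count_events(events, interval, fail_at=None):
    """Count events per key; optionally crash once at index `fail_at`."""
    state = {}
    checkpoint = {"offset": 0, "state": {}}
    pos, failed = 0, False
    while pos < len(events):
        if fail_at is not None and pos == fail_at and not failed:
            failed = True
            # Recovery: reload the state and rewind the input to the checkpoint.
            state = copy.deepcopy(checkpoint["state"])
            pos = checkpoint["offset"]
            continue
        key = events[pos]
        state[key] = state.get(key, 0) + 1
        pos += 1
        if pos % interval == 0:
            checkpoint = {"offset": pos, "state": copy.deepcopy(state)}
    return state


events = ["a", "b", "a", "c", "a", "b"]
clean = count_events(events, interval=2)
recovered = count_events(events, interval=2, fail_at=3)
```

Both runs end with the same counts: the event processed after the last checkpoint is replayed, but its first effect on the state was discarded with the failed run, so it is counted exactly once.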
MUCH MORE THAN JUST EXACTLY-ONCE RECOVERY!
● Suspend and resume applications
● Query the state outside of the streaming application
● Fix and upgrade applications
● Migrate applications to a different/upgraded cluster
● Scale applications in and out
● A/B test applications
● ...
FLINK APIS
LAYERED APIS
SUPPORTED LAYERED APIS
• Basic DataStream API operations
• ProcessFunction API
• Windowing functionality: processing/event time and count based
keyed windows
• Basic ConnectedStream and KeyedStream API
• Stateful operators, basic checkpointing configuration
• State backends: In-memory and RocksDB + HDFS
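As an illustration of the count-based keyed windows mentioned in the list above, here is a minimal plain-Python sketch. It is not the Flink windowing API; `keyed_count_windows` and the sample `readings` are made up for the example.

```python
from collections import defaultdict


def keyed_count_windows(records, size):
    """Emit (key, sum) each time a key has collected `size` values."""
    buffers = defaultdict(list)  # per-key window contents
    results = []
    for key, value in records:
        buffers[key].append(value)
        if len(buffers[key]) == size:
            results.append((key, sum(buffers[key])))
            buffers[key].clear()  # tumbling window: start over for this key
    return results


readings = [("s1", 1), ("s2", 10), ("s1", 2), ("s1", 3), ("s2", 20), ("s1", 4)]
window_sums = keyed_count_windows(readings, size=2)
```

Each key fills its own window independently, so a slow key never delays results for a busy one.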
DATASTREAM API
● Programs are composed as data flows
● Logic is implemented as custom user functions
○ map, flatMap, reduce, window aggregation, window join, asynchronous
request function, ...
● Data is processed as arbitrary Java/Scala objects
○ (Avro) POJOs, Tuple, Row
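The idea of composing a program as a data flow of user functions can be mimicked with Python generators. This is only an analogy to the DataStream API, not its actual interface; `flat_map`, `map_records`, and `keyed_reduce` are hypothetical helpers written for this sketch.

```python
def flat_map(stream, fn):
    """Apply fn to each record and emit every element it returns."""
    for record in stream:
        yield from fn(record)


def map_records(stream, fn):
    """Apply fn to each record and emit the single result."""
    for record in stream:
        yield fn(record)


def keyed_reduce(stream, key_fn, reduce_fn):
    """Emit a rolling reduction per key, one output per input record."""
    current = {}
    for record in stream:
        key = key_fn(record)
        if key in current:
            current[key] = reduce_fn(current[key], record)
        else:
            current[key] = record
        yield current[key]


lines = ["to be", "or not to be"]
words = flat_map(lines, str.split)
pairs = map_records(words, lambda word: (word, 1))
counts = list(
    keyed_reduce(
        pairs,
        key_fn=lambda pair: pair[0],
        reduce_fn=lambda a, b: (a[0], a[1] + b[1]),
    )
)
```

Each stage consumes the previous one lazily, so records flow through the whole pipeline one at a time instead of being materialized between steps.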
PROCESSFUNCTIONS
● Flink’s most expressive function interfaces
○ Expose access to State and Time
○ Are embedded in DataStream programs
● Enable powerful applications
○ Put events or intermediate results into state for future computations
○ Register timers to be called back once “time is up”
● A collection of multiple function interfaces
○ 1 input, 1 windowed input,
2 key-partitioned inputs, 2 broadcasted/forwarded inputs, ...
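To make the combination of state and timers concrete, here is a toy inactivity detector in plain Python. It only imitates the shape of a keyed process function; the class `InactivityDetector` and its methods `process_element` and `advance_time` are hypothetical and do not reproduce the Flink interfaces.

```python
class InactivityDetector:
    """Reports a key once it has been silent for `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # keyed state: latest timestamp per key
        self.timers = {}     # key -> time at which its timer fires

    def process_element(self, key, timestamp):
        self.last_seen[key] = timestamp
        # Registering again replaces the earlier timer for this key.
        self.timers[key] = timestamp + self.timeout

    def advance_time(self, now):
        """Fire every timer whose time is up; return the keys that timed out."""
        fired = sorted(key for key, at in self.timers.items() if at <= now)
        for key in fired:
            del self.timers[key]
            del self.last_seen[key]  # clean up state for the silent key
        return fired


detector = InactivityDetector(timeout=10)
detector.process_element("sensor-1", 0)
detector.process_element("sensor-2", 5)
detector.process_element("sensor-1", 8)
alerts_at_12 = detector.advance_time(12)
alerts_at_18 = detector.advance_time(18)
```

The second reading from `sensor-1` pushes its timer from 10 to 18, which is why nothing fires at time 12.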
EVENT-TIME AND PROCESSING-TIME
WHAT IS PROCESSING-TIME?
● A record is processed based on the wall-clock time when it arrives.
● Results are inherently non-deterministic and depend on
○ Clocks, load, and processing speed of machines
○ Arrival / ingestion rate of data and possibly backpressure
○ ...
● Applications of processing-time
○ Does not work for recorded data
○ Does not work for data that arrives out-of-order
○ Might be sufficient for approximate, low-latency results
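A small plain-Python sketch of why processing-time results are not reproducible. The arrival times are invented for the example, and `count_per_window` is a hypothetical helper, not a Flink function.

```python
from collections import Counter


def count_per_window(times, size):
    """Count records per tumbling window, given one time per record."""
    return dict(Counter((t // size) * size for t in times))


# The same four records, delivered twice with different arrival times.
arrival_run_1 = [1, 4, 9, 11]
arrival_run_2 = [1, 4, 12, 13]  # assumption: backpressure delayed record three

run_1 = count_per_window(arrival_run_1, size=10)
run_2 = count_per_window(arrival_run_2, size=10)
```

The input data is identical in both runs, yet the window counts differ because the windows are keyed on when each record happened to arrive.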
WHAT IS EVENT-TIME?
● A record is processed based on an embedded timestamp.
○ Timestamp typically denotes time when record was created.
● The “current” time is determined by watermarks
○ A watermark is a special record with a timestamp w
○ Denotes that no more records with a time t <= w will arrive
● Properties of event-time processing
○ Results are deterministic
○ Same semantics when processing recorded and live data
○ Can trade result latency for result completeness
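The watermark rule above can be sketched in plain Python with integer timestamps. This is a simplified model, not Flink's implementation: the watermark here simply trails the highest timestamp seen by `max_delay`, and `event_time_windows` is a hypothetical name.

```python
def event_time_windows(records, size, max_delay):
    """Sum values per tumbling event-time window; fire windows on watermark."""
    open_windows = {}  # window start -> running sum
    fired = []
    watermark = float("-inf")
    for timestamp, value in records:
        start = (timestamp // size) * size
        if start + size - 1 <= watermark:
            continue  # late record: its window has already fired
        open_windows[start] = open_windows.get(start, 0) + value
        # Assumption: the watermark trails the highest timestamp by max_delay.
        watermark = max(watermark, timestamp - max_delay)
        for window_start in sorted(open_windows):
            # A window [start, start + size) is complete once w >= start + size - 1.
            if window_start + size - 1 <= watermark:
                fired.append((window_start, open_windows.pop(window_start)))
    return fired, open_windows


out_of_order = [(1, 1), (12, 1), (7, 1), (15, 1), (23, 1)]
in_order = sorted(out_of_order)

fired_a, pending_a = event_time_windows(out_of_order, size=10, max_delay=5)
fired_b, pending_b = event_time_windows(in_order, size=10, max_delay=5)
```

Both arrival orders produce the same windows because assignment depends only on the embedded timestamps. A larger `max_delay` tolerates more disorder but delays firing, which is the latency versus completeness trade-off named in the last bullet.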
USE CASES
GENERAL THEMES
HEALTH CARE
• Smart hospitals - collect data
and readings from hospital
devices (vitals, IVs, MRI, etc.)
and analyze and alert in real
time.
• Biometrics - collect and analyze
data from patient devices that
collect vitals while outside of
care facilities.
OIL AND GAS
• Downstream use cases: real-time drill monitoring, live tool
performance monitoring and failure detection, sub-surface
temperature and pressure monitoring and alerting, etc.
• Upstream use cases: monitoring production platforms and
monitoring temperature and pressure at various joints, etc.
MANUFACTURING
• Detect anomalies
in manufacturing
machines based
on a stream of
measurements
• Detect anomalies
in part from video
inspection
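One simple way to flag anomalies in a stream of measurements is to compare each new value with a rolling baseline. The sketch below is an illustrative assumption, not the method used in the deck: it flags values more than `threshold` standard deviations from the mean of the last `window` normal readings, and the sample data is invented.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(measurements, window=5, threshold=3.0):
    """Return (index, value) pairs that deviate strongly from recent values."""
    recent = deque(maxlen=window)  # rolling baseline of normal readings
    anomalies = []
    for index, value in enumerate(measurements):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append((index, value))
                continue  # keep the outlier out of the baseline
        recent.append(value)
    return anomalies


vibration = [10, 10, 11, 10, 9, 10, 50, 10]
anomalies = detect_anomalies(vibration)
```

Only a bounded number of recent values is kept per machine, which is the kind of small keyed state a streaming job would hold.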
TELCO
• Automatic
classification of
outages
• Antenna
Optimization
• Real-time charging
on customer usage
• Optimized
advertising for
video/audio
SUMMARY
CLOUDERA STREAMING ANALYTICS ON CDP-DC
INSTALLATION CONSIDERATIONS AND ARTEFACTS
Parcel and CSD for CDP-DataCenter
HistoryServer and Gateway roles
CLOUDERA STREAMING ANALYTICS 1.1.0
Core Flink Support
● Flink 1.9.1+ support
● Flink on YARN
● Support for installing Flink on CM-managed clusters
● Support for fully secure (Kerberized and TLS-enabled) Flink clusters
Flink API Support
● All stable Flink APIs supported: DataStream API, windowing functionality, ConnectedStream API, KeyedStream API, and basic checkpointing configuration
● Supports a subset of evolving APIs, including: Stream Join, ProcessFunctions, Interval Join, stateful operators, FsStateBackend with HDFS, RocksDBStateBackend with HDFS
Flink Source and Sink Connectors
● Kafka: Exactly-once consumer, at-least-once producer
● HDFS: Exactly-once sink
● HBase: Idempotent key-value sink
Platform Integration
● Cloudera Manager for installation, management, and monitoring
● Schema Registry integration
● Centralized log search for Flink application logs
● Monitoring services with CM Metrics Store/Service integration and Flink Kafka Metrics Reporter
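The connector list distinguishes exactly-once, at-least-once, and idempotent sinks. The plain-Python sketch below shows why an idempotent key-value sink tolerates replays after recovery while an append-only sink does not. The dict and list are stand-ins for real stores, and `write_idempotent` is a hypothetical helper.

```python
def write_idempotent(store, records):
    """Upsert each record under its key; a replay overwrites the same value."""
    for key, value in records:
        store[key] = value
    return store


batch = [("user-1", 3), ("user-2", 5)]

once = write_idempotent({}, batch)
# Assumption: a recovery replays the batch, so the sink sees it twice.
replayed = write_idempotent({}, batch + batch)

# For contrast, an append-only sink keeps every duplicate it receives.
append_log = []
append_log.extend(batch + batch)
```

With deterministic keys and values, delivering a record more than once leaves the key-value store unchanged, so at-least-once delivery still yields the correct final contents.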
CLOUDERA FOR FLINK
Our commitment to the community over the next year
Platform Integration
● Config Management
● Diagnostics Framework
● Unified Security Model
Enterprise Security
● Apache Atlas
● Apache Knox
● Apache Ranger
Community
● Regular blog posts
● Conference Presence
● Reference Architecture
New Connectors
● Apache HBase
● Apache Kudu
● Cloudera Schema Registry
THANK YOU
