1. DON'T FLINK. YOU'LL MISS IT! - INTRODUCING CLOUDERA STREAMING ANALYTICS
Andrew Psaltis – Field CTO – Asia Pacific and Japan
3. © 2019 Cloudera, Inc. All rights reserved. 3
STREAM PROCESSING IS THE NEW DATA PROCESSING PARADIGM
4. INDUSTRIES SHIFTING FROM REACTIVE TO PROACTIVE
From mass branding …to 1x1 targeting
From educated investing …to automated algorithms
From static branding …to real-time personalization
From break then fix …to repair before break
5. MOVING FROM BATCH TO STREAMING
Batch
• High-Latency Apps
• Static Files
• Process-After-Store
Streaming
• Low-Latency Apps
• Event Streams
• Sense and Respond
8. CLOUDERA DATA FLOW
• Enterprise Services: Provisioning, Management and Monitoring; Unified Security; Edge-to-Enterprise Governance; Single Sign-On
• Edge Management: edge data collection, routing and monitoring (MiNiFi, Edge Flow Manager, NiFi Registry)
• Flow Management: enterprise data ingestion, transformation and enrichment (Apache NiFi, NiFi Registry)
• Stream Processing: real-time stream processing at IoT scale (Apache Kafka, Schema Registry)
• Streaming Analytics: predictive analytics and real-time insights (Kafka Streams, Apache Flink, Spark Streaming)
• Streams Management: Streams Messaging Manager, Streams Replication Manager
9. THE COMPLETE AND CONNECTED DATA LIFECYCLE
• Collect: Edge & Flow Management
• Act
• Curate: Data Engineering
• Report: Data Warehouse
• Serve: Operational Database
• Predict: Machine Learning and AI
Data-in-Motion
Data-at-Rest
Distribute, Buffer, Analyze
A Connected Data Lifecycle is Critical to Meet the Needs of Real-time Use Cases
11. WHAT IS COMMON ACROSS THESE ENTERPRISES?
They all process billions of events every day with Flink!
Details about their use cases and more users are listed on Flink’s website at https://flink.apache.org/poweredby.html
and at the Flink Forward website at https://www.ververica.com/flink-forward-san-francisco-2019.
12. KEY FEATURES
● Guaranteed correctness
○ Exactly-once state consistency
○ Event-time semantics
● Flexible deployments and large ecosystem
○ Kubernetes, YARN, Mesos, Docker, S3, HDFS, Kafka, Kinesis
● In-memory processing at massive scale
○ Runs on 100000s of cores
○ Manages 100s of TBs of state
● Flexible and expressive APIs
● Serves most streaming & batch use cases
○ Data Pipelines, Analytics, CEP, Event-driven Applications
13. MAINTAINING AND CHECKPOINTING STATE
● Flink maintains state locally per task (in-mem / on-disk)
○ Fast access!
● State is periodically checkpointed to durable storage
○ A checkpoint is a consistent snapshot of the state of all tasks
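The checkpointing idea can be illustrated with a minimal sketch in plain Python. This is not Flink code: the `Task` class, the `take_checkpoint` helper, the routing rule, and the checkpoint interval are all hypothetical, and an in-memory list stands in for durable storage such as HDFS or S3. The snapshot is taken between two events, so it is trivially consistent across tasks; a real engine has to coordinate this without pausing the stream.

```python
import copy


class Task:
    """A toy stateful task: keeps a running count per key in local memory."""

    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1


def take_checkpoint(tasks, offset, durable_store):
    """Copy the state of all tasks, plus the input offset, into durable storage."""
    durable_store.append({
        "offset": offset,
        "states": [copy.deepcopy(t.state) for t in tasks],
    })


events = ["a", "b", "a", "c", "b", "a"]
tasks = [Task(), Task()]
durable_store = []        # stand-in for HDFS/S3 (assumption)
CHECKPOINT_INTERVAL = 3   # hypothetical: checkpoint after every 3 events

for offset, key in enumerate(events, start=1):
    # Route each key to exactly one task so its state stays local to that task.
    tasks[ord(key) % len(tasks)].process(key)
    if offset % CHECKPOINT_INTERVAL == 0:
        take_checkpoint(tasks, offset, durable_store)

print(durable_store[0])
```

Each stored checkpoint pairs the state of every task with the input position it corresponds to, which is what makes it usable as a restart point.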
14. RECOVERY AND GUARANTEED CONSISTENCY
● Recovery is like loading a saved computer game
● Flink recovers state with exactly-once consistency
○ After a failure, the application is restarted
○ All tasks load their state from the latest checkpoint
○ The application continues as if the failure never happened...
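The recovery steps above can be sketched as follows, again in plain Python rather than Flink. The `run` function, its `fail_at` parameter, and the checkpoint interval are hypothetical names chosen for illustration. The sketch assumes a replayable source (the list can be re-read from any offset), which is what allows the state to end up as if each event had been counted exactly once.

```python
import copy


def run(events, fail_at=None, checkpoint_every=2):
    """Count events per key; optionally crash once when reaching index `fail_at`."""
    state, offset = {}, 0
    checkpoint = {"state": {}, "offset": 0}   # latest completed checkpoint
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Crash: local state is lost. Restart from the latest checkpoint
            # and rewind the source to the checkpointed offset.
            state = copy.deepcopy(checkpoint["state"])
            offset = checkpoint["offset"]
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = {"state": copy.deepcopy(state), "offset": offset}
    return state


events = ["a", "b", "a", "c", "a"]
print(run(events))              # failure-free run
print(run(events, fail_at=3))   # run with one crash and recovery
```

Both calls produce the same counts: the events processed between the last checkpoint and the crash are replayed, but their first effect on the state was discarded along with the lost local state.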
15. MUCH MORE THAN JUST EXACTLY-ONCE RECOVERY!
● Suspend and resume applications
● Query the state outside of the streaming application
● Fix and upgrade applications
● Migrate applications to a different/upgraded cluster
● Scale applications in and out
● A/B test applications
● ...
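As one illustration of why a consistent state snapshot enables scaling in and out, the sketch below redistributes keyed state across a different number of tasks. It is plain Python, loosely modelled on the idea of grouping keys into a fixed number of key groups; `MAX_PARALLELISM`, `key_group`, `owner`, and `redistribute` are hypothetical names, and the hashing scheme is an assumption rather than Flink's actual one.

```python
import zlib

MAX_PARALLELISM = 8   # hypothetical fixed number of key groups


def key_group(key):
    # Use a stable hash so the assignment survives restarts
    # (Python's built-in hash() of strings is salted per process).
    return zlib.crc32(key.encode()) % MAX_PARALLELISM


def owner(group, parallelism):
    # Assign contiguous ranges of key groups to task indexes.
    return group * parallelism // MAX_PARALLELISM


def redistribute(snapshot, parallelism):
    """Split a snapshot {key: value} across `parallelism` tasks."""
    tasks = [{} for _ in range(parallelism)]
    for key, value in snapshot.items():
        tasks[owner(key_group(key), parallelism)][key] = value
    return tasks


snapshot = {"user-1": 3, "user-2": 7, "user-3": 1, "user-4": 9}
scaled_out = redistribute(snapshot, parallelism=4)
scaled_in = redistribute(snapshot, parallelism=2)
print(scaled_out)
print(scaled_in)
```

Because every key maps to exactly one key group and every key group to exactly one task, the same snapshot can be restored under any parallelism up to `MAX_PARALLELISM` without losing or duplicating state.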
17. LAYERED APIS
18. SUPPORTED LAYERED APIS
• Basic DataStream API operations
• ProcessFunction API
• Windowing functionality: processing-time, event-time, and count-based keyed windows
• Basic ConnectedStream and KeyedStream API
• Stateful operators, basic checkpointing configuration
• State backends: In-memory and RocksDB + HDFS
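To make "count-based keyed windows" concrete, here is a minimal sketch in plain Python (not the Flink API). The `count_windows` function and the sensor names are hypothetical. It implements a tumbling count window: each key collects values independently and emits a window as soon as it holds `size` values.

```python
from collections import defaultdict


def count_windows(events, size):
    """Yield (key, values) each time a key has collected `size` values."""
    buffers = defaultdict(list)
    for key, value in events:
        buffers[key].append(value)
        if len(buffers[key]) == size:
            # The window is full: emit it and start a fresh one for this key.
            yield key, buffers.pop(key)


events = [("s1", 1), ("s2", 10), ("s1", 2), ("s1", 3), ("s2", 20)]
windows = list(count_windows(events, size=2))
print(windows)   # [('s1', [1, 2]), ('s2', [10, 20])]
```

The third value for `s1` stays buffered because its second window is not yet full.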
19. DATASTREAM API
● Programs are composed as data flows
● Logic is implemented as custom user functions
○ map, flatMap, reduce, window aggregation, window join, asynchronous request function, ...
● Data is processed as arbitrary Java/Scala objects
○ (Avro) POJOs, Tuple, Row
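The "programs as data flows of user functions" idea can be sketched without Flink, using Python generators. The helpers `flat_map`, `map_`, and `keyed_reduce` are hypothetical stand-ins that only mimic the shape of the corresponding operators; in a real DataStream program these would be methods on a stream object and would run distributed.

```python
def flat_map(fn, stream):
    """Apply `fn` to each record and emit every element it returns."""
    for record in stream:
        yield from fn(record)


def map_(fn, stream):
    """Apply `fn` to each record and emit exactly one result per record."""
    for record in stream:
        yield fn(record)


def keyed_reduce(key_fn, reduce_fn, stream):
    """Emit the running reduced value per key, one output per input."""
    acc = {}
    for record in stream:
        k = key_fn(record)
        acc[k] = reduce_fn(acc[k], record) if k in acc else record
        yield acc[k]


lines = ["to be", "or not to be"]
words = flat_map(str.split, lines)                    # lines -> words
pairs = map_(lambda w: (w, 1), words)                 # words -> (word, 1)
counts = keyed_reduce(lambda p: p[0],                 # key by word
                      lambda a, b: (a[0], a[1] + b[1]),
                      pairs)
result = list(counts)
print(result)
```

The records are ordinary Python tuples here, in the same spirit as the slide's point that data flows through the program as arbitrary objects.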
20. PROCESSFUNCTIONS
● Flink’s most expressive function interfaces
○ Expose access to State and Time
○ Are embedded in DataStream programs
● Enable powerful applications
○ Put events or intermediate results into state for future computations
○ Register timers to be called back once “time is up”
● A collection of multiple function interfaces
○ 1 input, 1 windowed input, 2 key-partitioned inputs, 2 broadcasted/forwarded inputs, ...
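A rough sketch of the state-plus-timers pattern, in plain Python rather than the ProcessFunction API. The `InactivityAlert` class and its method names are hypothetical; `advance_time` stands in for whatever mechanism tells the function that time has progressed. The function keeps the last-seen timestamp per key in state and registers a timer that fires if the key stays silent for `timeout` time units.

```python
class InactivityAlert:
    """Toy keyed function: emits an alert if a key is silent for `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # keyed state: key -> timestamp of latest event
        self.timers = {}      # key -> time at which the timer should fire

    def process_element(self, key, timestamp):
        self.last_seen[key] = timestamp
        # (Re)register the timer; a newer event pushes the deadline back.
        self.timers[key] = timestamp + self.timeout

    def advance_time(self, now):
        """Fire every timer whose time is up; return the alerts produced."""
        due = sorted(k for k, t in self.timers.items() if t <= now)
        alerts = []
        for key in due:
            del self.timers[key]
            alerts.append((key, self.last_seen.pop(key)))
        return alerts


fn = InactivityAlert(timeout=10)
fn.process_element("sensor-1", 0)
fn.process_element("sensor-2", 5)
fn.process_element("sensor-1", 8)   # moves sensor-1's timer from 10 to 18
print(fn.advance_time(12))          # []
print(fn.advance_time(16))          # [('sensor-2', 5)]
print(fn.advance_time(20))          # [('sensor-1', 8)]
```

State and timers are cleaned up when an alert fires, so a key that goes quiet does not hold state forever.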
23. EVENT-TIME AND PROCESSING-TIME
24. WHAT IS PROCESSING-TIME?
A record is processed based on the wall-clock time when it arrives.
Results are inherently non-deterministic and depend on
○ Clocks, load, and processing speed of machines
○ Arrival / ingestion rate of data and possibly backpressure
○ ...
Applications of processing-time
○ Does not work for recorded data
○ Does not work for data that arrives out of order
○ Might be sufficient for approximate, low-latency results
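The non-determinism of processing-time can be shown with a small plain-Python sketch. The `tumbling_counts` helper and the arrival times are made up for illustration: the same three records are assigned to 10-unit tumbling windows by their wall-clock arrival time in two runs, one of which is delayed (for example by backpressure).

```python
from collections import Counter


def tumbling_counts(times, size):
    """Count records per tumbling window [k*size, (k+1)*size)."""
    return dict(Counter((t // size) * size for t in times))


# The same three records, observed at different wall-clock arrival times.
arrival_run_1 = [1, 4, 9]      # lightly loaded run
arrival_run_2 = [1, 11, 12]    # same records, two of them delayed

print(tumbling_counts(arrival_run_1, 10))   # {0: 3}
print(tumbling_counts(arrival_run_2, 10))   # {0: 1, 10: 2}
```

The input data is identical in both runs, yet the window results differ because they depend on when the records happened to arrive.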
25. WHAT IS EVENT-TIME?
A record is processed based on an embedded timestamp.
○ Timestamp typically denotes time when record was created.
The “current” time is determined by watermarks
○ A watermark is a special record with a timestamp w
○ Denotes that no more records with a time t <= w will arrive
Properties of event-time processing
○ Results are deterministic
○ Same semantics when processing recorded and live data
○ Can trade result latency for result completeness
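A minimal sketch of event-time windows driven by watermarks, in plain Python rather than Flink. The `event_time_windows` function and the tuple encoding of events and watermarks are assumptions made for this example. Timestamps are integers, a window covers `[start, start + size)`, and it is emitted once a watermark `w` guarantees that no record with `t <= w` can still belong to it. Records whose window has already been emitted are dropped in this sketch.

```python
from collections import defaultdict


def event_time_windows(stream, size):
    """Process ('event', ts) and ('watermark', w) items in arrival order.

    Returns a list of (window_start, count), emitted when the watermark
    reaches the last timestamp that can fall into the window.
    """
    windows = defaultdict(int)
    watermark = float("-inf")
    results = []
    for kind, ts in stream:
        if kind == "event":
            start = (ts // size) * size
            if start + size - 1 <= watermark:
                continue   # late record: its window was already emitted
            windows[start] += 1
        else:
            watermark = ts
            ready = sorted(s for s in windows if s + size - 1 <= watermark)
            for start in ready:
                results.append((start, windows.pop(start)))
    return results


in_order = [("event", 1), ("event", 4), ("event", 9), ("watermark", 9),
            ("event", 12), ("watermark", 19)]
out_of_order = [("event", 9), ("event", 1), ("event", 12), ("event", 4),
                ("watermark", 9), ("watermark", 19)]

print(event_time_windows(in_order, 10))       # [(0, 3), (10, 1)]
print(event_time_windows(out_of_order, 10))   # [(0, 3), (10, 1)]
```

Both arrival orders give the same result because windows are assigned by the embedded timestamp. Emitting watermarks later would admit more late records at the cost of later results, which is the latency versus completeness trade-off.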
28. HEALTH CARE
• Smart hospitals - collect data and readings from hospital devices (vitals, IVs, MRI, etc.) and analyze and alert in real time.
• Biometrics - collect and analyze data from patient devices that collect vitals while outside of care facilities.
29. OIL AND GAS
• Downstream use cases: real-time drill monitoring, live tool performance monitoring and failure detection, sub-surface temperature and pressure monitoring and alerting, etc.
• Upstream use cases: monitoring production platforms, monitoring temperature and pressure at various joints, etc.
30. MANUFACTURING
• Detect anomalies in manufacturing machines based on a stream of measurements
• Detect anomalies in parts from video inspection
31. TELCO
• Automatic classification of outages
• Antenna optimization
• Real-time charging on customer usage
• Optimized advertising for video/audio
33. CLOUDERA STREAMING ANALYTICS ON CDP-DC
34. INSTALLATION CONSIDERATIONS AND ARTEFACTS
Parcel and CSD for CDP-DataCenter
HistoryServer and Gateway roles
35. CLOUDERA STREAMING ANALYTICS 1.1.0
Core Flink Support
● Flink 1.9.1+ support
● Flink on YARN
● Support for installing Flink on CM-managed clusters
● Support for fully secure (Kerberized and TLS-enabled) Flink clusters
Flink API Support
● All stable Flink APIs supported: DataStream API, windowing functionality, ConnectedStream API, KeyedStream API, and basic checkpointing configuration
● Subset of evolving APIs supported, including: Stream Join, ProcessFunctions, Interval Join, stateful operators, FsStateBackend with HDFS, RocksDBStateBackend with HDFS
Flink Source and Sink Connectors
● Kafka: exactly-once consumer, at-least-once producer
● HDFS: exactly-once sink
● HBase: idempotent key-value sink
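Why an idempotent key-value sink is useful next to at-least-once delivery can be sketched in plain Python. Nothing here uses a real connector: `replay_with_duplicates` is a hypothetical helper that simulates records being delivered twice after a failure, a list stands in for an append-only sink, and a dict stands in for a key-value store that overwrites by row key.

```python
def replay_with_duplicates(records, duplicate_from):
    """Simulate at-least-once delivery: after a failure, records from
    index `duplicate_from` onward are delivered a second time."""
    return records + records[duplicate_from:]


records = [("user-1", 10), ("user-2", 7), ("user-1", 12)]
delivered = replay_with_duplicates(records, duplicate_from=1)

append_sink = []   # non-idempotent: every delivery adds a row
kv_sink = {}       # idempotent upsert: same key and value overwrite in place
for key, value in delivered:
    append_sink.append((key, value))
    kv_sink[key] = value

print(len(append_sink))   # 5 rows for 3 logical records
print(kv_sink)            # {'user-1': 12, 'user-2': 7}
```

The key-value result matches what a single delivery of each record would have produced. This only holds if replayed records carry the same keys and values and arrive in the same relative order per key, which is an assumption of the sketch.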
Platform Integration
● Cloudera Manager for installation, management, and monitoring
● Schema Registry integration
● Centralized Log Search for Flink application logs
● Monitoring services with CM Metrics Store/Service integration and Flink Kafka Metrics Reporter
36. CLOUDERA FOR FLINK
Our commitment to the community over the next year
Platform Integration
● Config Management
● Diagnostics Framework
● Unified Security Model
Enterprise Security
● Apache Atlas
● Apache Knox
● Apache Ranger
Community
● Regular blog posts
● Conference Presence
● Reference Architecture
New Connectors
● Apache HBase
● Apache Kudu
● Cloudera Schema Registry