Streaming and real-time data carries high business value, but that value decays rapidly if the data is not processed quickly. If the data is not acted on within its window of relevance, its value is lost and the decision or action it should have triggered never occurs. Streaming data – whether from sensors, devices, applications, or events – needs special attention because a sudden price change, a critical threshold being met, a rapidly changing sensor reading, or a blip in a log file can all be of immense value, but only if the alert arrives in time.
2. The ETL Legacy
• An ad hoc manner of connecting sources and destinations
• ETL surfaced in the 1990s
– Far fewer data platforms and types existed
– Built for the data warehouse (DW)
– Became the bottleneck in DW population
– Time- and resource-intensive
– Batch-oriented
• Can become chaotic and unmanageable
3. EAI
• Then came Enterprise Application Integration (EAI)
– Facilitated the exchange of business transaction messages between applications
– Used Enterprise Service classes under the covers
– Works for small-scale data
– Not designed to handle the span of data required for modern-day sources such as sensors
4. Modern Realities of Data Integration
• Desire for consolidated methods for data integration
• New types of data sources
– Logs, sensors, etc.
• We have more than OLTP and OLAP
– Distributed data platforms
• Desire for real-time data
• High-velocity data increasingly needs integration
• Traditional approaches, without stream processing, turn into ETL + custom scripts + middleware + MQ
5. Streaming: Real-Time and Scalable
• Streaming is forward-thinking
• Real-time and scale are becoming the rule, not the exception
[Chart: ETL and EAI positioned as batch-oriented and limited in scalability; a streaming platform delivers both real-time processing and scalability]
6. Point-to-Point
• Old way
• Add another database? Repeat process
[Diagram: point-to-point integration – ERP, CRM, Financials, and HR sources feed staging tables and physical OLAP cubes, which serve BI tools and OLAP clients]
7. ETL Is Insufficient for This Combination
• Data platforms operating at an enterprise-wide scale
• A high variety of data sources
• Real-time/streaming data
• ETL forces a choice: real-time loading that does not scale, or scalable loading that runs in batch
– Data produced from numerous sources is a torrent of flowing information that needs to be timestamped, dispatched, and even duplicated (to protect against data loss)
– A postman is needed to distribute data from message senders to receivers, at the right place at the right time
8. Real-Time Data
• A.k.a. messaging, live feeds, real-time, event-driven
• Comes in continuously and often quickly, so we also call it
streaming data
• Needs special attention and can be of immense value, but
only if we are alerted in time
• Foundation for artificial intelligence excellence
– Streaming data forms the core of the data that feeds artificial intelligence
9. Message Brokers
• Message Brokers are a way of decoupling the sending and
receiving services through the concept of Publish &
Subscribe
• Another thing Message Brokers do is queue or retain the
message till the consumer picks it up
• Streaming allows us to have both pub-sub and queuing features (historically, such brokers supported one or the other)
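The two semantics a streaming broker unifies can be sketched with a toy in-memory class (hypothetical, not any real broker's API): publish-subscribe fans each message out to every subscriber immediately, while queuing retains the message until a consumer picks it up.

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker: pub-sub fan-out plus queue retention."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks
        self.queues = defaultdict(deque)       # topic -> retained messages

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Pub-sub: every current subscriber sees the message immediately.
        for callback in self.subscribers[topic]:
            callback(message)
        # Queuing: the message is also retained until a consumer pulls it.
        self.queues[topic].append(message)

    def pull(self, topic):
        # Queue semantics: each retained message goes to one consumer.
        return self.queues[topic].popleft() if self.queues[topic] else None

broker = Broker()
seen = []
broker.subscribe("prices", seen.append)
broker.publish("prices", {"symbol": "ACME", "price": 101.5})

print(seen)                    # the subscriber was pushed the message
print(broker.pull("prices"))   # a consumer can also pull it later
print(broker.pull("prices"))   # queue now empty -> None
```

The point of the sketch is the decoupling: the publisher never learns who, or how many, consumed the message.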
11. All Data Can Be Represented as Streams
[Diagram: a streaming platform at the center connects DW, Hadoop, RDBMS, NoSQL, and apps to real-time analytics, search, monitoring, web services, and APIs; big data analysis is backed by parallel tools, multi-threaded math libraries, and cluster support]
12. Streaming Data
• Unbounded, continuous flow of real-time records
• Stream APIs transform and enrich data
• Millisecond latency
• Stateless or stateful
• Incorporate data into your applications; deploy anywhere,
including containers
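The stateless/stateful distinction above can be sketched with plain Python generators (illustrative only, not any specific stream API): a stateless transform handles each record independently, while a stateful one must remember something across records, such as a running count per key.

```python
def enrich(records):
    # Stateless: each record is transformed on its own; no memory needed.
    for r in records:
        yield {**r, "alert": r["temp"] > 100}

def count_per_sensor(records):
    # Stateful: the operator keeps a running count per key across records.
    counts = {}
    for r in records:
        counts[r["sensor"]] = counts.get(r["sensor"], 0) + 1
        yield {"sensor": r["sensor"], "count": counts[r["sensor"]]}

stream = [{"sensor": "a", "temp": 99}, {"sensor": "a", "temp": 120}]
print(list(enrich(stream)))
print(list(count_per_sensor(stream)))
```

Stateful operators are what make recovery hard: their in-memory state must be rebuilt after a failure, which is why streaming frameworks back it with change logs.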
13. Enter Message-Oriented Middleware, a.k.a. Streaming and Message Queuing Technology
• Messages can be any kind of data wrapped in a neat package
with a very simple header as a bow on top.
• Messages are sent by “producers”—systems, sensors, or
devices that generate the messages—toward a “broker.”
• A broker does not process the messages, but instead routes
them into queues according to the information enclosed in the
message header or its own routing process.
• Then “consumers” retrieve the messages from the queues to
which they subscribe (although sometimes messages are
pushed to consumers rather than pulled).
• The consumers open the messages and perform some kind of
action on them.
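The producer-to-broker-to-consumer flow above can be sketched as follows; the routing rule and queue names are hypothetical, and the point is only that the broker inspects the header, never the payload.

```python
from collections import defaultdict

queues = defaultdict(list)

def route(message):
    # The broker reads only the header to pick a queue; it does not
    # process or even look inside the payload.
    header = message["header"]
    queue_name = f'{header["type"]}.{header["region"]}'
    queues[queue_name].append(message)

route({"header": {"type": "order", "region": "eu"}, "payload": b"..."})
route({"header": {"type": "order", "region": "us"}, "payload": b"..."})
route({"header": {"type": "sensor", "region": "eu"}, "payload": b"..."})

# Consumers subscribe to specific queues and retrieve messages from them.
print(sorted(queues))            # ['order.eu', 'order.us', 'sensor.eu']
print(len(queues["order.eu"]))   # 1
```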
14. Streaming solutions
Intelligent data platform for fast data:
Connect, process, and store data in real-time
…in a unified, flexible solution
…able to meet demanding SLAs even at scale
…without operational burdens and complexity
15. Performance and Scalability in Streaming
• Storage: the ability to retain varying volumes of messages for varying lengths of time
• Throughput: a high, sustainable rate of message processing
• Latency: fast, consistent responsiveness for publishing and consumption
• Operations: minimizing the operational burden of scaling, tuning, and monitoring
16. Comprehensive Capabilities
• Stream-native functions: apply processing functions on data
• Multi-tenancy: a single cluster can support many tenants and use cases
• Durability: data replicated and synced to disk
• Geo-replication: out-of-the-box support for geographically distributed applications
• Unified messaging model: supports both topic and queue semantics in a single model
• Delivery guarantees: at-least-once, at-most-once, and effectively-once
• Scalability: supports millions of topics in a single cluster
17. Apache Kafka
• Open source streaming platform originally developed at LinkedIn
• A distributed publish-subscribe messaging system that maintains feeds of messages called topics
– Publishers write data to topics and subscribers read from topics
– Kafka topics are partitioned and replicated across multiple nodes in the cluster
• Enables “source to sink” data pipelines
• Kafka messages are simple byte arrays that can store objects in virtually any format, with a key attached to each message; often JSON
• The E and L of ETL through the Kafka Connect API
• The T of ETL through the Kafka Streams API
• Fault-tolerant
• DIY operations
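Because Kafka treats message keys and values as opaque byte arrays, producer and consumer must agree on a serialization format; JSON is a common choice, sketched here with the standard library only (no broker involved, names are illustrative).

```python
import json

# Producer side: objects become byte arrays before they reach Kafka.
record = {"sensor": "pump-7", "reading": 42.7}
key = "pump-7".encode("utf-8")
value = json.dumps(record).encode("utf-8")

# Kafka stores and forwards (key, value) as raw bytes; it never
# interprets them. The consumer side reverses the encoding:
decoded = json.loads(value.decode("utf-8"))
print(type(value))   # <class 'bytes'>
print(decoded)       # {'sensor': 'pump-7', 'reading': 42.7}
```

The key matters for partitioning: Kafka assigns messages with the same key to the same partition, which preserves per-key ordering.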
19. Application programming interfaces
• A ubiquitous method and de facto standard of communication among modern
information technologies.
• APIs have begun to replace older, more cumbersome methods of information
sharing with lightweight endpoints.
• Due to the popularity and proliferation of APIs and microservices, the need has
arisen to manage the multitude of services a company relies on—both internal
and external.
• Organizations depend on these services to be properly managed, with high
performance and availability.
20. API & Microservices Ecosystem
• Public, private-external, and private-internal APIs
• Over 20,000 public APIs*
• External partners, connected apps & data
*according to https://www.programmableweb.com/apis/directory
21. The Need for Management
• Authentication: HTTP Basic Auth, OAuth 2.0, OpenID, API keys
• Environments: test and production
• Traffic policies: rate limiting, quotas, caching, CORS
• Analytics and transformations
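Rate limiting, one of the management policies above, is often implemented as a token bucket; this is a generic sketch (not any particular gateway's implementation), with the clock passed in as a parameter so behavior is deterministic.

```python
class TokenBucket:
    """Allow at most `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill tokens for the time elapsed since the last check,
        # capped at the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)      # 2 req/s, burst of 2
print([bucket.allow(0.0) for _ in range(3)])  # [True, True, False]
print(bucket.allow(0.5))                      # True (one token refilled)
```

Quotas are the same idea over a longer window; an API gateway typically keeps one bucket per API key.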
23. API Requirements
• Performance: handles high-performance workloads (>1,000 TPS)
• Reliability: all workloads completed with 100% message completion
• Complexity: multiple plugins enabled
24. RabbitMQ
• Open source message broker platform
• Created in 2007; managed by Pivotal Software
• Uses an exchange to receive messages from producers and push them to the registered consumers
• The broker pushes queued messages toward the consumers
• Brokers are persistently connected to consumers and know which ones are subscribed to which queues
• Consumers cannot fetch specific messages and may receive them unordered
– They are unaware of the queue state
• Messages, queues, and exchanges do not persist unless otherwise instructed
– If a broker is restarted or fails, the messages are lost
– Settings exist to make both queues and messages durable; moreover, non-critical messages can be tagged by the producer to not be sent to a durable queue
• Allows producers’ and consumers’ code to declare new queues and exchanges
• Several replication and load-balancing alternatives
25. Amazon Kinesis
• Similar to Kafka
• Comes in an enterprise-ready package
• Amazon users pay by the shard-hour and payload
26. Apache Pulsar
• Originally developed at Yahoo
• Began its incubation at Apache in late 2016
• Has been in production at Yahoo since 2013
• Utilized in popular services and applications like Yahoo! Mail, Finance, Sports,
Flickr, Gemini Ads, and Sherpa
• Follows the publisher-subscriber model (pub-sub), and has the same producers,
topics, and consumers as some of the aforementioned technologies
• Uses built-in multi-datacenter replication
• Architected for multi-tenancy and uses concepts of properties and namespaces
27. Streamlio
• Enterprise-ready deployment of Pulsar
• Unified solution for connecting, processing, and storing fast-moving data
• The unified messaging model has three components:
– Consumption
– Acknowledgement
– Retention
• Three modes of subscription: exclusive, failover, and shared
• Supports both persistent and non-persistent states
• Has a configurable time-to-live (TTL) feature that can be set to handle messages that have not been consumed
• A unified platform gives enterprises the best of both the streaming and message queuing worlds
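The TTL behavior described above can be sketched generically (this is not Pulsar's actual API): messages that sit unconsumed past their time-to-live are simply discarded before delivery.

```python
import collections

class TTLQueue:
    """Drop messages that have waited longer than `ttl` seconds unconsumed.
    The clock is passed in explicitly so the behavior is deterministic."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.items = collections.deque()  # (enqueue_time, message)

    def put(self, now, message):
        self.items.append((now, message))

    def get(self, now):
        # Discard expired messages before handing one to the consumer.
        while self.items and now - self.items[0][0] > self.ttl:
            self.items.popleft()
        return self.items.popleft()[1] if self.items else None

q = TTLQueue(ttl=10)
q.put(0, "stale")
q.put(5, "recent")
print(q.get(12))  # 'recent' -- the message enqueued at t=0 has expired
print(q.get(12))  # None -- the queue is now empty
```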
28. Workloads are Distinguished by
• The number of topics
• The size of the messages being produced and consumed
• The number of subscriptions per topic
• The number of producers per topic
• The rate at which producers produce messages (per
second)
• The size of the consumer’s backlog (in gigabytes)
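These workload dimensions combine directly into capacity estimates; for example, the backlog a broker must retain is roughly the produce rate times the message size times the retention window. All figures below are illustrative assumptions, not benchmarks.

```python
# Rough backlog sizing from the workload dimensions above
# (illustrative numbers only).
producers_per_topic = 10
messages_per_producer_per_sec = 1_000
message_size_bytes = 1_024          # 1 KiB per message
retention_seconds = 3_600           # retain one hour of backlog

rate = producers_per_topic * messages_per_producer_per_sec   # msgs/sec
backlog_bytes = rate * message_size_bytes * retention_seconds
print(f"{backlog_bytes / 2**30:.1f} GiB per topic")  # 34.3 GiB per topic
```

Multiply by the number of topics and the replication factor and the storage dimension of the previous slides stops being abstract.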
29. Creating a Streaming Application
• Configure the Application
• Serialize data
• Set up tables for change logs
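The change-log tables in the last step exist so application state can be rebuilt after a crash; a generic sketch of the idea (not any specific framework's API):

```python
# Every state update is appended to a changelog; replaying the log
# rebuilds the state exactly. (Generic sketch, not a real API.)
changelog = []

def update(state, key, value):
    changelog.append((key, value))   # record the change durably first
    state[key] = value               # then apply it to in-memory state

state = {}
update(state, "sensor-1", 42)
update(state, "sensor-1", 43)
update(state, "sensor-2", 7)

# Simulated crash: in-memory state is lost, but the changelog survives.
recovered = {}
for key, value in changelog:
    recovered[key] = value

print(recovered == state)  # True
```

In practice the changelog is itself a topic on the streaming platform, so state recovery reuses the same durability machinery as the data.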
30. Migrating ETL to Stream Processing
• Sessionization of event data
• Tools to acquire:
– Message bus
– Data storage (e.g., HDFS or S3)
– Operations support
31. Biggest Challenges in Streaming
• Getting data live at scale
• Augmenting data with metadata
• Misordered events
• Job recovery
• High operational workload
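Misordered events, one of the challenges above, are commonly handled by buffering events briefly and releasing them in timestamp order once a watermark has passed; a simplified sketch (real stream engines implement far more elaborate versions):

```python
import heapq

def reorder(events, max_delay):
    """Emit (timestamp, payload) events in timestamp order, assuming no
    event arrives more than max_delay late (the watermark assumption)."""
    buffer = []
    for ts, payload in events:
        heapq.heappush(buffer, (ts, payload))
        watermark = ts - max_delay
        # Anything at or before the watermark can no longer be preceded
        # by an unseen earlier event, so it is safe to emit.
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    while buffer:                      # flush the rest at end of stream
        yield heapq.heappop(buffer)

arrived = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
print(list(reorder(arrived, max_delay=2)))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

The trade-off is visible in `max_delay`: a larger value tolerates later arrivals but adds that much latency to every emitted event.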
32. Future of Data Integration
[Diagram: sources and destinations attach to the streaming solution through the Connect API; applications apply transformations through the Streams API; the streaming platform connects DW, Hadoop, RDBMS, NoSQL, and apps to real-time analytics, search, monitoring, web services, and APIs, with big data analysis backed by parallel tools, multi-threaded math libraries, and cluster support]
33. In Conclusion
• Streaming and message queuing have lasting value to
organizations.
• They will be as prevalent as ETL was and is in the world of data
warehousing and integration.
• APIs have begun to replace older, more cumbersome methods
of information sharing with lightweight endpoints.
• Streaming and messaging will be able to meet the data
volume, variety, and timing requirements of the coming years.
• Data-driven organizations will benefit from these technologies because they allow ingesting data and operating at a scale that would have been practically impossible just a few years ago.
34. Second Thursday of Every Month, at 2:00 ET
Presented by: William McKnight
President, McKnight Consulting Group
www.mcknightcg.com (214) 514-1444
#AdvAnalytics