Data Stream Processing
Agenda
 Overview
 What is Streaming Data?
 Streaming Data Pipeline
 Streaming Platform components
 What is Stishovite?
Overview
• Monitoring: monitoring events in real time; sending alerts based on detection of event patterns in data streams.
• Dashboards: real-time operational dashboards.
• Search: full-text querying, aggregations, and geo data in near real time.
• Analytics: analyze big volumes of data quickly and in near real time.
Streaming Data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously and in small sizes (on the order of kilobytes).
This data needs to be processed sequentially and incrementally on a record-by-record basis or
over sliding time windows, and used for a wide variety of analytics including correlations,
aggregations, filtering, and sampling.
Stream processing has become the de facto standard for building real-time ETL and stream
analytics applications. We see batch workloads move into stream processing to act on the
data and derive insights faster. With the explosion of data such as IoT and machine-generated
data, stream processing combined with predictive analytics is driving tremendous business value.
Streaming Data
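The record-by-record, sliding-window processing described above can be sketched in a few lines of Python. The window length and the numeric event stream are illustrative assumptions, not part of any particular framework:

```python
from collections import deque

def sliding_window_averages(events, window_size=3):
    """Process events one by one, emitting the average of the last
    `window_size` values after each record (a sliding time/count window)."""
    window = deque(maxlen=window_size)  # oldest records fall out automatically
    averages = []
    for value in events:
        window.append(value)            # incremental, record-by-record update
        averages.append(sum(window) / len(window))
    return averages

# Each incoming record updates the aggregate without reprocessing history:
print(sliding_window_averages([10, 20, 30, 40], window_size=3))
# -> [10.0, 15.0, 20.0, 30.0]
```

The same shape generalizes to the correlations, filtering, and sampling mentioned above: the operator keeps only a bounded window of state and updates it per record.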
Streaming Data examples include:
• Website, Network and Applications monitoring
• Fraud detection
• Advertising
• Internet of Things: sensors (trucks, transportation vehicles, industrial equipment)
• Machine-generated data
• Social analytics
• Private Searching
• Others
Streaming Data Examples
o Persistence
o Performance
o Scale
o Parallel & Partitioned
o Messaging
o Processing
o Storage
Key Requirements for Streaming Data
State of Stream Processing
Stateless
• Filter
• Map
Stateful
• Aggregate
• Join
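The stateless/stateful distinction can be made concrete with a small Python sketch (the event shape is an illustrative assumption): filter and map look only at the current record, while an aggregate must carry state across records.

```python
def stateless_filter_map(stream):
    """Stateless: each record is handled independently (filter + map)."""
    for event in stream:
        if event["value"] >= 0:                        # filter
            yield {**event, "value": event["value"] * 2}  # map

def stateful_running_sum(stream):
    """Stateful: the operator remembers a running total across records."""
    total = 0
    for event in stream:
        total += event["value"]   # aggregation requires retained state
        yield total

events = [{"value": 1}, {"value": -2}, {"value": 3}]
print(list(stateless_filter_map(events)))   # -> [{'value': 2}, {'value': 6}]
print(list(stateful_running_sum(events)))   # -> [1, -1, 2]
```

Joins are stateful for the same reason: matching records from two streams means buffering one side until the other arrives.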
Typical Streaming Workflow
Producers → Streaming Platform → Stream Processing → Persistence → Consumers
We need to collect the data, process the data, store the data, and finally serve the data for
analysis, searching, machine learning and dashboards.
Streaming Data Pipeline
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
We need to collect the data from a wide array of inputs and write them into a wide array of
outputs in real time.
Collect Data
• Pull-based
• Push-based
Change Data Capture (CDC)
Database Changefeeds
Custom Collectors:
• Java
• Python
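The pull- and push-based styles above can be contrasted in a minimal Python sketch. The sources and records here are hypothetical; a real pull collector would poll a database changelog (as CDC does) and a real push source would call a registered webhook or callback:

```python
# Push-based: the source invokes a callback for every new record.
def push_source(records, on_record):
    for record in records:
        on_record(record)

# Pull-based: the collector polls the source at its own pace,
# stopping when a poll returns nothing new.
def pull_source(fetch_batch):
    collected = []
    while True:
        batch = fetch_batch()
        if not batch:
            break
        collected.extend(batch)
    return collected

received = []
push_source(["a", "b"], received.append)
print(received)                              # -> ['a', 'b']

batches = iter([["c"], ["d", "e"], []])
print(pull_source(lambda: next(batches)))    # -> ['c', 'd', 'e']
```

The trade-off: push delivers with low latency but the collector must keep up, while pull lets the collector control its own rate at the cost of polling.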
When data is ingested in real time, each data item is imported as it is emitted by the source. An
effective data ingestion process begins by prioritizing data sources, validating individual files
and routing data items to the correct destination.
Streaming Data Ingestion
Kafka Topics
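An ingestion step of the kind described — validate each item as it arrives, then route it to the correct destination — might look roughly like this in Python. The routing key, topic names, and dead-letter convention are illustrative assumptions:

```python
def ingest(records, destinations):
    """Validate each incoming record and route it to a destination topic."""
    for record in records:
        if "type" not in record:   # validation: divert malformed items
            destinations.setdefault("dead-letter", []).append(record)
            continue
        topic = record["type"]     # routing decision, made per record
        destinations.setdefault(topic, []).append(record)
    return destinations

dests = ingest(
    [{"type": "clicks", "id": 1}, {"id": 2}, {"type": "orders", "id": 3}],
    {},
)
print(sorted(dests))   # -> ['clicks', 'dead-letter', 'orders']
```

Diverting bad records to a dead-letter destination instead of dropping them keeps the pipeline flowing while preserving the failures for later inspection.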
Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-
throughput, horizontally scalable, and allows geographically distributing data streams and
stream processing applications.
Apache Kafka
Kafka’s system design can be thought of as that of a distributed commit log, where incoming
data is written sequentially to disk. There are four main components involved in moving data in
and out of Kafka:
• Topics
• Producers
• Consumers
• Brokers
How Kafka Works
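The "distributed commit log" idea can be illustrated with a toy in-memory model — this is a sketch of the concept only, not the real Kafka client API, and it omits brokers, partitions, and replication:

```python
class Topic:
    """Toy model of a Kafka topic: an append-only, sequential log."""
    def __init__(self):
        self.log = []

    def append(self, record):        # what a producer does (via a broker)
        self.log.append(record)
        return len(self.log) - 1     # the record's offset in the log

class Consumer:
    """Each consumer tracks its own offset, so consumers read independently."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        records = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return records

clicks = Topic()
clicks.append("page-1")
fast, slow = Consumer(clicks), Consumer(clicks)
print(fast.poll())    # -> ['page-1']
clicks.append("page-2")
print(fast.poll())    # -> ['page-2']
print(slow.poll())    # -> ['page-1', 'page-2']  (independent offset)
```

Because consumption only advances a per-consumer offset, many consumers can read the same topic at different speeds without interfering — a key reason the commit-log design scales.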
Kafka Streaming Platform
We need to collect the data, process the data, store the data, and finally serve the data for
analysis, machine learning, and dashboards.
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
Streaming Data Pipeline
Data Stream Processing
There are a wide variety of technologies, frameworks, and libraries for building applications
that process streams of data. Frameworks such as Flink, Storm, Samza, and Spark can all
process streams of data in real time, with code written in Java, Python, or Scala, and they do
an excellent job. But if you are looking for something simpler for building data pipelines with
minimal data processing, you should try:
Apache NiFi is an integrated data platform that enables the automation of data flow between
systems. It provides real-time control that makes it easy to manage the movement of data
between any source and any destination. Apache NiFi helps move and track data.
Apache NiFi
Apache NiFi is used for:
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
  o Conversion between formats
  o Extraction/Parsing/Splitting/Aggregation
  o Schema translation
  o Routing decisions
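In NiFi these steps are configured visually as a flow of processors rather than written as code, but the kind of convert-and-route work it performs can be sketched in plain Python. The CSV input and routing rule below are illustrative assumptions:

```python
import csv
import io
import json

def csv_to_json_rows(csv_text):
    """Format conversion: each CSV record becomes a JSON object."""
    return [json.dumps(row) for row in csv.DictReader(io.StringIO(csv_text))]

def route(rows, rule):
    """Routing decision: send each record down the matching path."""
    paths = {"match": [], "unmatch": []}
    for row in rows:
        paths["match" if rule(row) else "unmatch"].append(row)
    return paths

rows = csv_to_json_rows("city,temp\nLima,18\nOslo,-3\n")
print(rows[0])   # -> {"city": "Lima", "temp": "18"}
routed = route(rows, lambda r: json.loads(r)["city"] == "Lima")
print(len(routed["match"]), len(routed["unmatch"]))   # -> 1 1
```

The value NiFi adds over hand-rolled code like this is the surrounding machinery: back pressure, provenance tracking, and a UI for rewiring flows without redeploying.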
Data Stream Processing
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
Streaming Data Pipeline
For storing lots of streaming data, we need a data store that supports fast writes and scales.
Storing Streaming Data
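One reason log-structured stores sustain fast writes is that they append sequentially instead of updating records in place. A minimal sketch of the idea, with an in-memory index pointing at the latest version of each key (the file layout here is an illustrative assumption, not any particular database's format):

```python
import json
import os
import tempfile

class AppendOnlyStore:
    """Fast writes by always appending; reads use an in-memory offset index."""
    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> byte offset of latest value

    def put(self, key, value):
        with open(self.path, "a") as f:  # sequential append, never rewrite
            offset = f.tell()
            f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.index[key] = offset

    def get(self, key):
        with open(self.path) as f:
            f.seek(self.index[key])      # jump straight to the latest version
            return json.loads(f.readline())["v"]

path = os.path.join(tempfile.mkdtemp(), "store.log")
store = AppendOnlyStore(path)
store.put("sensor-1", 21.5)
store.put("sensor-1", 22.0)              # new version appended, not rewritten
print(store.get("sensor-1"))             # -> 22.0
```

Production stores built on this principle add compaction to reclaim space from superseded versions, which this sketch omits.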
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
Streaming Data Pipeline
The processed event data is served to end applications such as dashboards, business
intelligence tools, and other downstream consumers.
Serving the Data
Complete workflow of streaming data:
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
Streaming Data Pipeline
Stishovite is a centralized console to manage the entire pipeline of the xGem Streaming
Platform.
xGem Stream Platform is the integration of different Open Source products.
https://gitlab.com/xgem/stishovite
What is Stishovite?
Thanks!
Jorge Hirtz
@jahtux
