
xGem Data Stream Processing


A data stream management system is a computer software system to manage continuous data streams.

Published in: Technology
  1. Data Stream Processing
  2. Agenda • Overview • What is Streaming Data? • Streaming Data Pipeline • Streaming Platform components • What is Stishovite?
  3. Overview • Monitoring: monitoring events in real time • Alerts: sending alerts based on detection of event patterns in data streams • Dashboards: real-time operational dashboards • Search: full-text querying, aggregations, and geo data in near real time • Analytics: analyzing big volumes of data quickly and in near real time
  4. Streaming Data. Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes). This data needs to be processed sequentially and incrementally, on a record-by-record basis or over sliding time windows, and is used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Stream processing has become the de facto standard for building real-time ETL and stream analytics applications. We see batch workloads move into stream processing to act on the data and derive insights faster. With the explosion of data such as IoT and machine-generated data, stream processing combined with predictive analytics is driving tremendous business value.
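The incremental, sliding-window processing described above can be sketched in plain Python. This is a toy illustration, not tied to any particular streaming framework: each record updates the window state as it arrives, so the aggregate is always current without reprocessing the whole stream.

```python
from collections import deque

class SlidingWindowAverage:
    """Incrementally computes the average over the last `size` records,
    processing the stream one record at a time."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, value):
        # Add the new record, evict the oldest one if the window is full.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

w = SlidingWindowAverage(size=3)
results = [w.add(v) for v in [1, 2, 3, 4]]
# windows seen: [1], [1,2], [1,2,3], [2,3,4]  →  [1.0, 1.5, 2.0, 3.0]
```

Note that each `add` call does constant work regardless of how many records have already passed, which is what makes record-by-record processing feasible on unbounded streams.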
  5. Streaming Data Examples. Examples of streaming data include: • Website, network, and application monitoring • Fraud detection • Advertising • Internet of Things: sensors (trucks, transportation vehicles, industrial equipment) • Machine-generated data • Social analytics • Private searching • Others
  6. Key Requirements for Streaming Data • Persistence • Performance • Scale • Parallel & partitioned • Messaging • Processing • Storage
  7. State in Stream Processing. Stateless operators: • Filter • Map. Stateful operators: • Aggregate • Join
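The stateless/stateful distinction on this slide can be made concrete with a minimal Python sketch: a stateless operator's output depends only on the current record, while a stateful operator must carry an accumulator across records.

```python
# Stateless operators: each output depends only on the current record.
def map_op(record):
    return record * 2

def filter_op(record):
    return record % 2 == 0

# Stateful operator: the output depends on state accumulated
# from all previously seen records (a running aggregate).
class RunningSum:
    def __init__(self):
        self.state = 0

    def process(self, record):
        self.state += record
        return self.state

stream = [1, 2, 3, 4]
mapped = [map_op(r) for r in stream]            # [2, 4, 6, 8]
filtered = [r for r in stream if filter_op(r)]  # [2, 4]
agg = RunningSum()
sums = [agg.process(r) for r in stream]         # [1, 3, 6, 10]
```

This is why stateful operators (aggregates, joins) are the hard part of stream processing: their state must be partitioned, checkpointed, and recovered on failure, whereas stateless operators can simply be restarted.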
  8. Typical Streaming Workflow: Producers → Streaming Platform → Stream Processing → Persistence → Consumers
  9. Streaming Data Pipeline. We need to collect the data, process the data, store the data, and finally serve the data for analysis, searching, machine learning, and dashboards. Data Sources → Collect & Ingest Data (?) → Process Data (?) → Store Data (?) → Serve Data (?)
  10. Collect Data. We need to collect the data from a wide array of inputs and write it to a wide array of outputs in real time. Approaches: • Pull-based collectors • Push-based collectors • Change Data Capture (CDC) / database changefeeds • Custom collectors (Java, Python)
  11. Streaming Data Ingestion. When data is ingested in real time, each data item is imported as it is emitted by the source. An effective data ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination (e.g., Kafka topics).
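The validate-then-route step described above can be sketched as a few lines of Python. The field names, event types, and topic-naming scheme here are hypothetical, chosen only to illustrate the idea:

```python
def validate(record):
    # Minimal validation (illustrative): require a non-empty "id"
    # and an event type we know how to handle.
    return bool(record.get("id")) and record.get("type") in {"click", "view"}

def route(record):
    # Route each valid record to a topic named after its event type
    # (hypothetical naming convention).
    return f"events.{record['type']}"

topics = {}
for record in [{"id": "a1", "type": "click"},
               {"id": "", "type": "view"},   # fails validation, dropped
               {"id": "b2", "type": "view"}]:
    if validate(record):
        topics.setdefault(route(record), []).append(record)
# topics now maps "events.click" and "events.view" to their valid records
```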
  12. Apache Kafka. Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-throughput, and horizontally scalable, and it supports geographically distributed data streams and stream processing applications.
  13. How Kafka Works. Kafka's system design can be thought of as that of a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Kafka: • Topics • Producers • Consumers • Brokers
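The commit-log abstraction above can be modeled in a few lines of Python. This is a single-process toy, assuming away partitions, replication, and brokers, but it shows the core contract: producers append to a topic's log and get back an offset, and consumers read from any offset they choose, independently of one another.

```python
class CommitLog:
    """Toy model of Kafka's core abstraction: each topic is an
    append-only log; producers append, consumers read from an offset."""

    def __init__(self):
        self.topics = {}

    def produce(self, topic, message):
        # Append sequentially and return the record's offset.
        log = self.topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1

    def consume(self, topic, offset):
        # Read everything from `offset` onward; the log itself is
        # never modified by reads, so many consumers can share it.
        return self.topics.get(topic, [])[offset:]

log = CommitLog()
off0 = log.produce("metrics", "cpu=0.4")  # offset 0
off1 = log.produce("metrics", "cpu=0.7")  # offset 1
everything = log.consume("metrics", 0)    # ['cpu=0.4', 'cpu=0.7']
latest = log.consume("metrics", 1)        # ['cpu=0.7']
```

Because consumption is just a read at an offset, replaying a stream from the beginning is the same operation as tailing it, which is what makes the commit-log design so useful for both real-time and batch-style consumers.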
  14. Kafka Streaming Platform
  15. Streaming Data Pipeline. We need to collect the data, process the data, store the data, and finally serve the data for analysis, machine learning, and dashboards. Data Sources → Collect & Ingest Data → Process Data (?) → Store Data (?) → Serve Data (?)
  16. Data Stream Processing. There is a wide variety of technologies, frameworks, and libraries for building applications that process streams of data. Frameworks such as Flink, Storm, Samza, and Spark can all process streams of data in real time with code written in Java, Python, or Scala, and they do an excellent job. But if you are looking for something simpler to build data pipelines with minimal data processing, you should try:
  17. Apache NiFi. Apache NiFi is an integrated data platform that enables the automation of data flow between systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination, and it helps move and track data. Apache NiFi is used for: • Reliable and secure transfer of data between systems • Delivery of data from sources to analytic platforms • Enrichment and preparation of data: conversion between formats, extraction/parsing/splitting/aggregation, schema translation, routing decisions
  18. Streaming Data Pipeline. Data Sources → Collect & Ingest Data → Process Data → Store Data (?) → Serve Data (?)
  19. Storing Streaming Data. For storing lots of streaming data, we need a data store that supports fast writes and scales.
  20. Streaming Data Pipeline. Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data (?)
  21. Serving the Data. End applications such as dashboards, business intelligence tools, and other applications consume the processed event data.
  22. Streaming Data Pipeline: the complete workflow of streaming data. Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
  23. What is Stishovite? Stishovite is a centralized console to manage the entire pipeline of the xGem Streaming Platform. The xGem Stream Platform is an integration of different open source products.
  24. Thanks! Jorge Hirtz @jahtux