ITPC Building Modern Data Streaming Apps

https://princetonacm.acm.org/tcfpro/

17th Annual IEEE IT Professional Conference (ITPC)
Armstrong Hall at The College of New Jersey
Friday, March 17th, 2023 at 8:30 AM to 5:00 PM
In continuous operation since 1976, the Trenton Computer Festival (TCF) is the nation's longest running personal computer show. For the seventeenth year, the TCF is extending its program to provide Information Technology and computer professionals with an additional day of conference. It is intended, in an economical way, to provide attendees with insight and information pertinent to their jobs, and to keep them informed of emerging technologies that could impact their work.

The IT Professional Conference is co-sponsored by the Institute of Electrical and Electronics Engineers (IEEE) Computer Society Chapter of Princeton / Central Jersey.

11:00am Building Modern Data Streaming Apps
presented by
Timothy Spann

Building Modern Data Streaming Apps
In this session, I will show you some best practices I have discovered over the last seven years in building data streaming applications including IoT, CDC, Logs, and more.

In my modern approach, we utilize several Apache frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there we build streaming ETL with Spark, enhance events with Pulsar Functions for ML and enrichment. We build continuous queries against our topics with Flink SQL.

Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.

Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark.

Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.

ITPC Building Modern Data Streaming Apps

  1. © 2023 Cloudera, Inc. All rights reserved. Building Modern Data Streaming Apps. Tim Spann, Principal Developer Advocate. 17-March-2023
  3. Building Modern Data Streaming Apps. In this session, I will show you some best practices I have discovered over the last seven years in building data streaming applications including IoT, CDC, Logs, and more. In my modern approach, we utilize several Apache frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Spark. We build continuous queries against our topics with Flink SQL.
  4. FLiPN-FLaNK Stack. Tim Spann, Principal Developer Advocate. @PaasDev // Blog: www.datainmotion.dev. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC. https://github.com/tspannhw/EverythingApacheNiFi https://medium.com/@tspann Apache NiFi x Apache Kafka x Apache Flink x Java
  5. FLiP Stack Weekly. This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark, Java and Open Source friends. https://bit.ly/32dAJft
  6. Welcome to Future of Data - Princeton + Virtual. @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  7. Largest Java Conference in the US! 12 tracks on Java, Cloud, Frameworks, Streaming, etc. Devnexus.com. Join me! Save with SEEMESPEAK. https://devnexus.com/presentations/apache-pulsar-development-101-with-java
  8. STREAMING
  9. WHAT IS REAL-TIME?
  10. ENABLING ANALYTICS AND INSIGHTS ANYWHERE. Driving enterprise business value. [Diagram labels] Real-time streaming engine; analytics & data warehouse; data science / machine learning; centralized data platform (storage & processing); analytics & insights; stream ingest; ingest of data at rest; deploy models; BI solutions; SQL; predictive analytics (model building, model training, model scoring); actions & alerts; [SQL]; real-time apps. Streaming data sources: clickstream, market data, machine logs, social. Enterprise data sources: CRM, customer history, research, compliance data, risk data, lending.
  11. STREAMING FROM ... TO ... WHILE ... Data distribution as a first class citizen. [Diagram labels] IoT devices; log data sources; on-prem data sources; big data cloud services; cloud business process services; cloud data analytics/service (Cloudera DW); app logs; laptops/servers; mobile apps; security agents; cloud warehouse. Universal Data Distribution (Ingest, Transform, Deliver): ingest processors; ingest gateway; router, filter & transform processors; destination processors.
  12. EVENT-DRIVEN ORGANIZATION. Modernize your data and applications. [Diagram labels] CDF Event Streaming Platform (Integration - Processing - Management - Cloud); Stream ETL; Cloud Storage; Application; Data Lake; Data Stores; Make Payment; µServices; Streams; Edge - IoT; Dashboard.
  13. Building Real-Time Requires a Team
  14. End to End Streaming Pipeline Example. [Diagram labels] Enterprise sources; Weather; Errors; Aggregates; Alerts; Stocks; ETL; Analytics; Clickstream; Market data; Machine logs; Social; SQL.
  15. Collect, Conduct, Curate
      Collect: Bring Together. Aggregate all data from sensors, drones, logs, geo-location devices, images from cameras, results from running predictions on pre-trained models.
      Conduct: Mediate the Data Flow. Mediate point-to-point and bi-directional data flows, distribute, delivering data reliably to Apache Iceberg, S3, SnowFlake, Slack and Email.
      Curate: Gain Insights. Orchestrate, parse, merge, aggregate, filter, join, transform, fork, query, sort, dissect, store, enrich with weather, location, sentiment analysis, image analysis, object detection, image recognition and more with Apache Tika, Apache OpenNLP and Machine Learning.
  16. APACHE PULSAR
  17. Apache Pulsar
      ● Serverless computing framework.
      ● Unbounded storage, multi-tiered architecture, and tiered-storage.
      ● Streaming & Pub/Sub messaging semantics.
      ● Multi-protocol support.
      ● Open Source
      ● Cloud-Native
  18. [Diagram labels] Streaming; Messaging; Pulsar Topic/Partition; several consumers attached to each subscription. Subscription types shown: Exclusive, Failover (annotated "In case of failure in Consumer B-0"), Shared, Key-Shared.
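The four subscription types named on slide 18 differ mainly in how messages on a topic are handed to the attached consumers. The following is a minimal, in-memory sketch of that idea only. It does not use the Pulsar client; the function name `dispatch` and the mode strings are hypothetical, and a real broker uses hash ranges and flow control rather than the simple round-robin and `crc32` hashing assumed here.

```python
from collections import defaultdict
from itertools import cycle
import zlib


def dispatch(messages, consumers, mode):
    """Assign (key, value) messages to consumer names under one subscription mode."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    received = defaultdict(list)
    if mode == "exclusive":
        # Only a single consumer may attach to an exclusive subscription.
        if len(consumers) > 1:
            raise ValueError("exclusive subscription allows one consumer")
        for msg in messages:
            received[consumers[0]].append(msg)
    elif mode == "failover":
        # The first live consumer is active; the others are standbys.
        for msg in messages:
            received[consumers[0]].append(msg)
    elif mode == "shared":
        # Messages are spread round-robin across all consumers.
        for msg, name in zip(messages, cycle(consumers)):
            received[name].append(msg)
    elif mode == "key_shared":
        # A stable hash of the key picks the consumer, so one key maps to one consumer.
        for key, value in messages:
            idx = zlib.crc32(key.encode("utf-8")) % len(consumers)
            received[consumers[idx]].append((key, value))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(received)
```

To model a failover, call `dispatch` again with the failed consumer removed from the list: the next consumer in the list becomes the active one.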
  19. Messages - the Basic Unit of Apache Pulsar
      Component: Description
      Value / data payload: The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
      Key: Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
      Properties: An optional key/value map of user-defined properties.
      Producer name: The name of the producer who produces the message. If you do not specify a producer name, the default name is used.
      Sequence ID: Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence.
  20. Integrated Schema Registry. [Diagram] The Schema Registry holds schema-1, schema-2 and schema-3 (value = Avro/Protobuf/JSON). Producers and consumers each keep a schema data ID and a local cache for schemas. Producers: send (register) the schema if it is not in the local cache, then send schema-1 data serialized per schema ID. Consumers: get the schema by ID if it is not in the local cache, then read schema-1 data deserialized per schema ID.
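The register-once, look-up-once flow on slide 20 can be sketched with plain Python objects. This is an illustration under stated assumptions, not Pulsar's schema API: the classes `SchemaRegistry`, `SchemaProducer` and `SchemaConsumer` are hypothetical, the "schema" is just a dict with a list of field names, and JSON stands in for Avro or Protobuf serialization.

```python
import json


class SchemaRegistry:
    """In-memory stand-in for a registry that maps schema IDs to definitions."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.lookups = 0  # counts registry round trips, to show the caches working

    def register(self, schema):
        self.lookups += 1
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[canonical]

    def get(self, schema_id):
        self.lookups += 1
        return self._by_id[schema_id]


class SchemaProducer:
    def __init__(self, registry):
        self.registry = registry
        self._cache = {}  # canonical schema text -> schema ID

    def send(self, schema, record):
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._cache:  # register only on a cache miss
            self._cache[canonical] = self.registry.register(schema)
        missing = [f for f in schema["fields"] if f not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        return self._cache[canonical], json.dumps(record).encode("utf-8")


class SchemaConsumer:
    def __init__(self, registry):
        self.registry = registry
        self._cache = {}  # schema ID -> schema

    def receive(self, schema_id, payload):
        if schema_id not in self._cache:  # fetch by ID only on a cache miss
            self._cache[schema_id] = self.registry.get(schema_id)
        record = json.loads(payload.decode("utf-8"))
        return {f: record[f] for f in self._cache[schema_id]["fields"]}
```

Sending many records with the same schema costs one registry call on the producer side and one on the consumer side; every later message is resolved from the local caches.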
  21. The FLiPN Kitten crosses the stream 4 ways with Apache Pulsar
  22. Kafka on Pulsar (KoP)
  23. Apache Pulsar Ecosystem (hub.streamnative.io). [Diagram labels] Data Offloaders (Tiered Storage); Client Libraries; Connectors (Sources & Sinks); Protocol Handlers; Pulsar Functions (Lightweight Stream Processing); Processing Engines; ... and more!
  24. Pulsar Functions
      ● Consume messages from one or more Pulsar topics.
      ● Apply user-supplied processing logic to each message.
      ● Publish the results of the computation to another topic.
      ● Support multiple programming languages (Java, Python, Go)
      ● Can leverage 3rd-party libraries to support the execution of ML models on the edge.
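The consume, apply, publish loop described on slide 24 can be sketched without a broker. In this toy version topics are plain lists in a dict; `run_function` and `enrich` are hypothetical names, and the Celsius-to-Fahrenheit enrichment is an invented example rather than anything from the talk. A real Pulsar Function is deployed to the cluster and receives messages one at a time from the runtime.

```python
from collections import defaultdict


def run_function(topics, input_topics, output_topic, process):
    """Apply process() to every message on the input topics and publish the results."""
    for name in input_topics:
        for message in topics[name]:
            result = process(message)
            if result is not None:  # in this sketch, None means "emit nothing"
                topics[output_topic].append(result)
    return topics[output_topic]


def enrich(message):
    """Hypothetical user logic: add a Fahrenheit reading to a sensor event."""
    if "temp_c" not in message:
        return None
    return {**message, "temp_f": message["temp_c"] * 9 / 5 + 32}


topics = defaultdict(list)
topics["sensors"] = [{"id": 1, "temp_c": 100}, {"id": 2}]
enriched = run_function(topics, ["sensors"], "sensors-enriched", enrich)
```

Keeping the user logic as a plain function of one message is what makes it easy to unit test before it is wired to real topics.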
  25. APACHE KAFKA
  26. What is Apache Kafka?
      – Distributed: horizontally scalable
      – Partitioned: the data is split-up and distributed across the brokers
      – Replicated: allows for automatic failover
      – Unique: Kafka does not track the consumption of messages (the consumers do)
      – Fast: designed from the ground up with a focus on performance and throughput
      – Kafka was built at Linkedin in 2011
      – Open sourced as an Apache project
  27. Yes, Franz, It’s Kafka. Let’s do a metamorphosis on your data. Don’t fear changing data. You don’t need to be a brilliant writer to stream data. Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. (Wikipedia)
  28. What Can You Do With Apache Kafka?
      • Web site activity: track page views, searches, etc. in real time
      • Events & log aggregation: particularly in distributed systems where messages come from multiple sources
      • Monitoring and metrics: aggregate statistics from distributed applications and build a dashboard application
      • Stream processing: process raw data, clean it up, and forward it on to another topic or messaging system
      • Real-time data ingestion: fast processing of a very large volume of messages
  29. KAFKA TERMINOLOGY
      • Kafka is a publish/subscribe messaging system comprised of the following components:
        – Topic: a message feed
        – Producer: a process that publishes messages to a topic
        – Consumer: a process that subscribes to a topic and processes its messages
        – Broker: a server in a Kafka cluster
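The four terms on slide 29, together with the earlier point that consumers rather than Kafka track consumption, can be illustrated with a toy append-only log. This is a simplified sketch and not the Kafka client API: `Broker`, `LogProducer` and `LogConsumer` are hypothetical classes, there is a single broker with no partitions or replication, and offsets live only in the consumer object.

```python
class Broker:
    """A single in-memory broker holding one append-only log per topic."""

    def __init__(self):
        self._topics = {}

    def append(self, topic, message):
        log = self._topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1  # offset of the new message

    def read(self, topic, offset, max_messages=10):
        return self._topics.get(topic, [])[offset:offset + max_messages]


class LogProducer:
    def __init__(self, broker):
        self.broker = broker

    def publish(self, topic, message):
        return self.broker.append(topic, message)


class LogConsumer:
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0  # the consumer, not the broker, remembers its position

    def poll(self):
        batch = self.broker.read(self.topic, self.offset)
        self.offset += len(batch)
        return batch
```

Because the log is never mutated by reads, two consumers can each read every message independently, and resetting `offset` to 0 replays the topic.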
  30. Apache Kafka
      • Highly reliable distributed messaging system
      • Decouple applications, enables many-to-many patterns
      • Publish-Subscribe semantics
      • Horizontal scalability
      • Efficient implementation to operate at speed with big data volumes
      • Organized by topic to support several use cases
      [Diagram labels] Source System (x3); Kafka; EVENTS; Fraud Detection; Security Systems; Real-Time Monitoring; Many-To-Many Publish-Subscribe.
  31. APACHE FLINK
  32. Flink SQL. https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
      ● Streaming Analytics
      ● Continuous SQL
      ● Continuous ETL
      ● Complex Event Processing
      ● Standard SQL Powered by Apache Calcite
  33. Flink SQL
      -- specify Kafka partition key on output
      SELECT foo AS _eventKey FROM sensors

      -- use event time timestamp from kafka
      -- exactly once compatible
      SELECT eventTimestamp FROM sensors

      -- nested structures access
      SELECT foo.`bar` FROM table; -- must quote nested column

      -- timestamps
      SELECT * FROM payments
      WHERE eventTimestamp > CURRENT_TIMESTAMP-interval '10' second;

      -- unnest
      SELECT b.*, u.* FROM bgp_avro b, UNNEST(b.path) AS u(pathitem)

      -- aggregations and windows
      SELECT card,
        MAX(amount) as theamount,
        TUMBLE_END(eventTimestamp, interval '5' minute) as ts
      FROM payments
      WHERE lat IS NOT NULL
        AND lon IS NOT NULL
      GROUP BY card, TUMBLE(eventTimestamp, interval '5' minute)
      HAVING COUNT(*) > 4 -- >4==fraud

      -- try to do this ksql!
      SELECT us_west.user_score+ap_south.user_score
      FROM kafka_in_zone_us_west us_west
      FULL OUTER JOIN kafka_in_zone_ap_south ap_south
      ON us_west.user_id = ap_south.user_id;

      Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
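To make the "aggregations and windows" query on slide 33 concrete, here is a small batch emulation of its logic in plain Python: filter out rows without coordinates, group by card and 5-minute tumbling window, and keep windows with more than four payments. It is a sketch only. The function name `suspicious_windows` and the dict-shaped payment records with an integer epoch-seconds `ts` field are assumptions, and unlike Flink it works on a finished list rather than an unbounded stream with watermarks.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # interval '5' minute


def suspicious_windows(payments, min_count=4):
    """Return (card, max_amount, window_end) for windows with more than min_count rows."""
    groups = defaultdict(list)
    for p in payments:
        if p.get("lat") is None or p.get("lon") is None:
            continue  # WHERE lat IS NOT NULL AND lon IS NOT NULL
        # Tumbling windows are fixed-size and non-overlapping, aligned to the epoch.
        window_start = p["ts"] - p["ts"] % WINDOW_SECONDS
        groups[(p["card"], window_start)].append(p["amount"])
    return sorted(
        (card, max(amounts), start + WINDOW_SECONDS)  # window end is exclusive
        for (card, start), amounts in groups.items()
        if len(amounts) > min_count  # HAVING COUNT(*) > 4
    )
```

Each payment lands in exactly one window, which is what distinguishes a tumbling window from sliding or session windows.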
  34. SQL STREAM BUILDER (SSB). SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more. Enrich streaming data with batch data in a single tool. Democratize access to real-time data with just SQL.
  35. DATAFLOW APACHE NIFI
  36. CLOUDERA FLOW AND EDGE MANAGEMENT. Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data center) to any downstream system with built in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle).
      Stages: ACQUIRE, PROCESS, DELIVER
      • Over 300 Prebuilt Processors
      • Easy to build your own
      • Parse, Enrich & Apply Schema
      • Filter, Split, Merge & Route
      • Throttle & Backpressure
      • Guaranteed Delivery
      • Full data provenance from acquisition to delivery
      • Diverse, Non-Traditional Sources
      • Eco-system integration
      Acquire and deliver labels: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
      Process labels: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TALL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD
  37. Cloudera DataFlow: Universal Data Distribution Service. UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF): Connect to Any Data Source Anywhere then Process and Deliver to Any Destination. [Diagram labels] Process; Route; Filter; Enrich; Transform; Distribute; Connectors; Any destination; Deliver; Ingest; Active; Passive; Connectors; Gateway Endpoint; Connect & Pull; Send; Data born in the cloud; Data born outside the cloud.
  38. WHAT IS APACHE NIFI? Apache NiFi is a scalable, real-time streaming data platform that collects, curates, and analyzes data so customers gain key insights for immediate actionable intelligence.
  39. APACHE NIFI. Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data center) to any downstream system with built in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle).
      Stages: ACQUIRE, PROCESS, DELIVER
      • Over 300 Prebuilt Processors
      • Easy to build your own
      • Parse, Enrich & Apply Schema
      • Filter, Split, Merge & Route
      • Throttle & Backpressure
      • Guaranteed Delivery
      • Full data provenance from acquisition to delivery
      • Diverse, Non-Traditional Sources
      • Eco-system integration
      Acquire and deliver labels: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
      Process labels: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TALL, EVALUATE, EXECUTE
  40. FDLC: FLOW DEVELOPMENT LIFECYCLE. [Diagram] Dev, Tests and Production NiFi clusters (Node 1 .. Node x), each with a Registry, connected by a Dev Ops chain (NiFi CLI, NiPyAPI, Jenkins).
      1 - push new flow version
      2 - get new version of the flow
      3 & 6 - automated testing / promotion (naming convention, scheduling, variable replacement)
      4 - push new flow version
      5 - get new version of the flow
      7 - push new flow version
  41. PROVENANCE
  42. EXTENSIBILITY
      • Built from the ground up with extensions in mind
      • Service-loader pattern for…
        – Processors
        – Controller Services
        – Reporting Tasks
        – Prioritizers
      • Extensions packaged as NiFi Archives (NARs)
        – Deploy to the NiFi lib directory and restart
        – Same model as standard components
  43. STATELESS ENGINE
      • Granular containers per flow
      • Flows From NiFi Registry
      https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
      bin/nifi.sh stateless RunFromRegistry Continuous --file kafka.json
      https://github.com/apache/nifi/blob/ea1becac4fc519c54b8b4d21773e68f8da364755/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-stateless/README.md
  44. STATELESS ENGINE
      • See also Parameters
      • Docker
      • YARN
      • Kubernetes (K8)
      • Stateful NiFi clusters
      • Apache OpenWhisk (FaaS)
      https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
      {
        "registryUrl": "http://tspann-mbp15-hw14277:18080",
        "bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
        "flowId": "0540e1fd-c7ca-46fb-9296-e37632021945",
        "ssl": {
          "keystoreFile": "",
          "keystorePass": "",
          "keyPass": "",
          "keystoreType": "",
          "truststoreFile": "/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/security/cacerts",
          "truststorePass": "changeit",
          "truststoreType": "JKS"
        },
        "parameters": {
          "broker": "4.317.852.100:9092",
          "topic": "iot",
          "group_id": "nifi-stateless-kafka-consumer",
          "DestinationDirectory": "/tmp/nifistateless/output2/",
          "output_dir": "/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/output"
        }
      }
      https://github.com/tspannhw/stateless-examples
  45. PARQUETREADER / PARQUETWRITER RECORDS
      • Native Record Processors for Apache Parquet Files!
      • CSV <-> Parquet
      • XML <-> Parquet
      • AVRO <-> Parquet
      • JSON <-> Parquet
      • More...
      https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
      https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_7.html
  46. POSTSLACK
      • Post Images to Slack
      https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
      https://www.datainmotion.dev/2019/11/nifi-110-postslack-easy-image-upload.html
  47. REMOTE INPUT PORT IN A PROCESS GROUP
      • Put Remote Connections for Site-To-Site (S2S) Anywhere!
      • Not only top level
      • Drop down simplicity
      https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
  48. READYFLOW GALLERY
      • Cloudera provided flow definitions
      • Cover most common data flow use cases
      • Optimized to work with CDP sources/destinations
      • Can be deployed and adjusted as needed
  49. FLOW CATALOG
      • Central repository for flow definitions
      • Import existing NiFi flows
      • Manage flow definitions
      • Initiate flow deployments
  50. Apache NiFi with Python Custom Processors. Python as a 1st class citizen
  51. Processing one billion events per second with NiFi. https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
  52. FREE LEARNING ENVIRONMENT
  53. CSP Community Edition
      • Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
      • Runs in Docker
      • Try new features quickly
      • Develop applications locally
      ● Docker compose file of CSP to run from command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
        ○ $>docker compose up
      ● Licensed under the Cloudera Community License
      ● Unsupported
      ● Community Group Hub for CSP
      ● Find it on docs.cloudera.com under Applications
  54. DEMO
  55. End to End Streaming Demo Pipeline. [Diagram labels] Enterprise sources; Weather; Errors; Aggregates; Alerts; Stocks; ETL; Analytics; Streaming SQL; Clickstream; Market data; Machine logs; Social. https://github.com/tspannhw/CloudDemo2021
  56. [Diagram labels] EDGE; CLOUD; SOURCES; COLLECT; ENRICH, REPORT; SERVE; AI BASED ENHANCEMENTS; Sensor id; Sensor conditions; Temperature; Humidity; Timestamp; Edge Management; Message Broker; Data Flow; Distribute; Data Warehouse; Machine Learning; Predict, Automate; Report, Automate; Real time alerting; SQL Stream Builder; Data Visualization; Visualize; Control System.
      COLLECT and DISTRIBUTE DATA
      1. Data is collected from sensors that use the MQTT protocol via CEM and sent to CDP Public Cloud.
      2. Two CDF flows run in the cloud to accomplish our two goals: streaming analytics and batch analytics.
      USE DATA
      3. Streaming use case: some conditions of our greenhouse must be avoided and have to be controlled in real time. Warnings have been defined to alert us when alerting conditions are met, and the control system is automatically activated to adjust environmental variables.
      4. Batch analytics: to ensure the optimal growth of our plants, the ideal conditions have to be met for each plant. Every 6 hours the plant conditions are monitored, and if some control adjustment is required, an ML model suggests how to reach the optimal point while minimizing the cost.
  57. RESOURCES AND WRAP-UP
  58. Streaming Solutions: When to use what?
      • Routing vs Analytics
        – Listeners
        – Joins
        – In-Memory
      • Operational Load
      • Current Skills
      • Use NiFi
        – Doing more than just Syndication
        – Not just small Kafka sized events
        – Edge Management is needed
        – Listener Type use cases that bind to ports
        – Single Product for light ETL, Lineage, Provenance, Message Replay
      • Use Flink
        – Joining Streams
        – Windowing
        – Late Data Handling
        – Streaming Analytics
      • Use KConnect
        – Kafka Centric
      • Multiple-Tool Objections
        – In-Memory Stateless
  59. STREAMING RESOURCES
      • https://dzone.com/articles/real-time-stream-processing-with-hazelcast-and-streamnative
      • https://flipstackweekly.com/
      • https://www.datainmotion.dev/
      • https://www.flankstack.dev/
      • https://github.com/tspannhw
      • https://medium.com/@tspann
      • https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d7395d714
      • https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Streaming_Engineer.pdf
  60. https://github.com/tspannhw https://www.datainmotion.dev/
  61. Upcoming Events: April 4 - Atlanta; April 24 - San Francisco; May 10 - Virtual
  62. Resources
  64. THANK YOU
