This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
Introduction to streaming and messaging: Flume, Kafka, SQS, Kinesis
Omid Vahdaty
Big data makes you a bit confused? Messaging? Batch processing? Data streaming? In-flight analytics? Cloud? Open source? Flume? Kafka? Flafka (both)? SQS? Kinesis? Firehose?
Apache Flume is a simple yet robust data collection and aggregation framework which allows easy declarative configuration of components to pipeline data from upstream source to backend services such as Hadoop HDFS, HBase and others.
Apache Flume is a highly scalable, distributed, fault tolerant data collection framework for Apache Hadoop and Apache HBase. Flume is designed to transfer massive volumes of event data in a highly scalable way into HDFS or HBase. Flume is declarative and easy to configure and can easily be deployed to a large number of machines using configuration management systems like Puppet or Cloudera Manager. In this talk, we will cover the basic components of Flume, configuring and deploying Flume. We will also briefly talk about the metrics Flume exposes, and the various ways in which these can be collected. Apache Flume is a Top Level Project (TLP) at the Apache Software Foundation, and has made several releases since entering incubation in June 2011. Flume graduated to become a TLP in July 2012. The current release of Flume is Flume 1.3.1.
Presenter: Hari Shreedharan, PMC Member and Committer, Apache Flume, Software Engineer, Cloudera
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, "how can we make the data available to our analytical systems faster?" Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
In this talk we will show how Hadoop Ecosystem tools like Apache Kafka, Spark, and MLLib can be used in various real-time architectures and how they can be used to perform real-time detection of a DDOS attack. We will explain some of the challenges in building real-time architectures, followed by walking through the DDOS detection example and a live demo. This talk is appropriate for anyone interested in Security, IoT, Apache Kafka, Spark, or Hadoop.
Presenter Ryan Bosshart is a Systems Engineer at Cloudera and is the first three-time presenter at BigDataMadison!
Design a data pipeline to gather log events and transform them into queryable data with Hive DDL.
This covers Java applications with log4j and non-Java Unix applications using rsyslog.
Near-realtime analytics with Kafka and HBase
dave_revell
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
This presentation describes Flume, a distributed log collection system for shipping data to frameworks such as Hadoop and HBase. It provides an overview and describes updates and emerging stories from the community since its open source release. These are the slides from the 2/18/11 Austin, TX HUG.
Apache Phoenix: Use Cases and New Features
HBaseCon
James Taylor (Salesforce) and Maryann Xue (Intel)
This talk will be broken into two parts: Phoenix use cases and new Phoenix features. Three use cases will be presented as lightning talks by individuals from 1) Sony about its social media NewsSuite app, 2) eHarmony on its matching service, and 3) Salesforce.com on its time-series metrics engine. Two new features will be discussed in detail by the engineers who developed them: ACID transactions in Phoenix through Apache Tephra, and cost-based query optimization through Apache Calcite. The focus will be on helping end users more easily develop scalable applications on top of Phoenix.
Moving to a data-centric architecture: Toronto Data Unconference 2015
Adam Muise
Why use a datalake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Creating a Data Science Team from an Architect's perspective. This is about team building on how to support a data science team with the right staff, including data engineers and devops.
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
OSSNA Building Modern Data Streaming Apps
Timothy Spann
https://ossna2023.sched.com/event/1Jt05/virtual-building-modern-data-streaming-apps-with-open-source-timothy-spann-streamnative
Timothy Spann
Cloudera
Principal Developer Advocate
Data in Motion
In my session, I will show you some best practices I have discovered over the last seven years in building data streaming applications, including IoT, CDC, Logs, and more. In my modern approach, we utilize several open-source frameworks to maximize all the best features. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there, we build streaming ETL with Apache Spark and enhance events with Pulsar Functions for ML and enrichment. We make continuous queries against our topics with Flink SQL. We will stream data into various open-source data stores, including Apache Iceberg, Apache Pinot, and others. We use the best streaming tools for the current applications with the open source stack - FLiPN. https://www.flipn.app/ Updates: This will be in-person with live coding based on feedback from the crowd. This will also include new data stores, new sources, and data relevant to and from the Vancouver area. This will also include updates to the platforms and inclusion of Apache Iceberg, Apache Pinot and some other new tech.
https://github.com/tspannhw/SpeakerProfile Tim Spann is a Principal Developer Advocate for Cloudera. He works with Apache Kafka, Apache Flink, Flink SQL, Apache NiFi, MiniFi, Apache MXNet, TensorFlow, Apache Spark, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Timothy J Spann
Cloudera
Principal Developer Advocate
Hightstown, NJ
Website: https://datainmotion.dev/
Registry is a central metadata repository that allows users to collaboratively use Schema definitions for stream processing.
Stream Analytics Manager, provides a framework to build Streaming applications faster, easier.
Stream processing has become the de facto standard for building real-time ETL and Stream Analytics applications. We see batch workloads move into Stream processing to act on the data and derive insights faster. With the explosion of data with "Perishable Insights" such as IoT and machine-generated data, Stream Processing + Predictive Analytics is driving tremendous business value. This is evidenced by the explosion of Stream Processing frameworks like the proven and evolving Apache Storm and newer frameworks such as Apache Flink, Apache Apex, and Spark Streaming.
Today, users have to choose among these frameworks and try to understand the benefits of each; on top of that, they have to learn the new APIs and operationalize their applications. To create value faster, we are introducing a new open source tool: Streamline. It is a self-service framework that eases building streaming applications and deploys them in a snap across the frameworks/engines that users prefer. It simplifies integration with Machine Learning models for scoring and classification of data for Predictive Analytics. It provides an elegant way to build Analytics dashboards to derive business insights out of the streaming data and to allow business users to consume it easily.
In this talk, we will outline the fundamentals of real-time stream processing and demonstrate Streamline capabilities to show how it simplifies building real-time streaming analytics applications.
Speaker:
Priyank Shah, Staff Software Engineer, Hortonworks
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Red Hat Developers
Apache Kafka is taking the world by storm and is rapidly becoming the de-facto event bus for event-driven and streaming applications that respond to events and data in real time. OpenShift Streams for Apache Kafka is Red Hat's fully hosted and managed Apache Kafka service targeting development teams that want to incorporate streaming data and scalable messaging in their applications, without the burden of setting up and maintaining a Kafka cluster infrastructure.
In this session you will discover how Apache Kafka can be used in an IoT scenario to ingest data from devices and make them available in real-time to other applications.
More specifically you will learn how to:
Simulate devices that send MQTT messages to a MQTT broker
Use Apache Camel and Camel-K to bridge MQTT with Apache Kafka
Use Kafka Streams in a Quarkus application to process the device messages
Query the state of the devices using GraphQ
Develop and deploy Streaming Analytics applications visually with bindings for streaming engines and multiple sources/sinks, a rich set of streaming operators, and operational lifecycle management. Streaming Analytics Manager makes it easy to develop and monitor streaming applications and also provides analytics of the data that's being processed by a streaming application.
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Timothy Spann
Apache NiFi, Apache Flink, Apache Kafka
Timothy Spann
Principal Developer Advocate
Cloudera
Data in Motion
https://budapestdata.hu/2023/en/speakers/timothy-spann/
Timothy Spann
Principal Developer Advocate
Cloudera (US)
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
Timothy Spann
https://portotechhub.com/conference-2021/
Timothy Spann
Developer Advocate
StreamNative
A cloud data lake that is empty is not useful to anyone.
How can you quickly, scalably and reliably fill your cloud data lake with diverse sources of data you already have and new ones you never imagined you needed? Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, Logs, REST, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before.
I will teach you how to fish in the deep end of the lake and return a data engineering hero. Let's hope everyone is ready to go from 0 to Petabyte hero.
TRACK RIBEIRA Fri 07:00 — 50 min
19-Nov-2021
This session will go into best practices and detail on how to architect a near real-time application on Hadoop using an end-to-end fraud detection case study as an example. It will discuss various options available for ingest, schema design, processing frameworks, storage handlers and others, available for architecting this fraud detection application and walk through each of the architectural decisions among those choices.
ITPC Building Modern Data Streaming Apps
Timothy Spann
https://princetonacm.acm.org/tcfpro/
17th Annual IEEE IT Professional Conference (ITPC)
Armstrong Hall at The College of New Jersey
Friday, March 17th, 2023 at 8:30 AM to 5:00 PM
TCF Photo
In continuous operation since 1976, the Trenton Computer Festival (TCF) is the nation's longest running personal computer. For the seventeenth year, the TCF is extending its program to provide Information Technology and computer professionals with an additional day of conference. It is intended, in an economical way, to provide attendees with insight and information pertinent to their jobs, and to keep them informed of emerging technologies that could impact their work.
The IT Professional Conference is co-sponsored by the Institute of Electrical and Electronics Engineers (IEEE) Computer Society Chapter of Princeton / Central Jersey.
11:00am Building Modern Data Streaming Apps
presented by
Timothy Spann
Building Modern Data Streaming Apps
In this session, I will show you some best practices I have discovered over the last seven years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several Apache frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there we build streaming ETL with Spark, enhance events with Pulsar Functions for ML and enrichment. We build continuous queries against our topics with Flink SQL.
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark.
Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
Timothy Spann
In this session, we will explore the powerful combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines. We will present a case study using the FLaNK-MTA project, which leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). By integrating Flink, NiFi, and Kafka, FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
Takeaways:
Understanding the integration of Apache Flink, Apache NiFi, and Apache Kafka for real-time data processing
Insights into building scalable and fault-tolerant data processing pipelines
Best practices for data collection, transformation, and analytics with FLaNK-MTA as a reference
Knowledge of use cases and potential business impact of real-time data processing pipelines
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
apache nifi
apache kafka
apache flink
apache iceberg
apache parquet
real-time streaming
tim spann
principal developer advocate
cloudera
datainmotion.dev
Building Real-time Travel Alerts
In this session, we will walk through how to build a complete streaming application to send alerts based on travel advisories from public data. We will also join in other data sources of relevance and push out alerts.
We will show you how to build this streaming application with Apache NiFi, Apache Kafka, and Apache Flink and show you when/why/how, and what to build to maximize performance, productivity, and ease of development.
Let's get streaming.
Apache Flink
Apache Kafka
Apache NiFi
FLaNK Stack
Tim Spann
Big Data Conference Europe 2023
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...
VMware Tanzu
The primary goals of this session are to:
Do a deep dive into the CF architecture via animated slides illustrating push, stage, deploy, scale, and health management.
Also do a brief dive into BOSH, including why BOSH, what it is, and animations of how it works. It’s not an operations focused workshop, so we keep the treatment light.
Discuss the value adds to CF BOSH OSS that Pivotal brings through the Pivotal Ops Manager product and our associated ecosystem of data and mobile services.
Quickly prove that I can push an app to a Pivotal CF environment running on vCHS in the same exact way I can push an app to PWS.
Pivotal Cloud Platform Roadshow is coming to a city near you!
Join Pivotal technologists and learn how to build and deploy great software on a modern cloud platform. Find your city and register now http://bit.ly/1poA6PG
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
@PaasDev www.datainmotion.dev github.com/tspannhw medium.com/@tspann
Principal Developer Advocate
Princeton Future of Data Meetup
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC, ex-EY, ex-HPE.
Apache NiFi x Apache Kafka x Apache Flink
There are a lot of factors involved in determining how you can find your way around and avoid delays, bad weather, dangers and expenses. In this talk I will focus on public transport in the largest transit system in the United States, the MTA, which is focused around New York City. Utilizing public and semi-public data feeds, this can be extended to most city and metropolitan areas around the world. As a personal example, I live in New Jersey and this is an extremely useful use of open source and public data.
Once I am notified that I need to travel to Manhattan, I need to start my data streams flowing. Most of the data sources are REST feeds that are ingested by Apache NiFi to transform, convert, enrich and finalize the data for usage in streaming tables with Flink SQL, while keeping that same contract with Kafka consumers, Iceberg tables and other users of this data. I do not need too many user interfaces to interact with the system, as I want my final decision sent to me in a Slack message, and then I’ll get moving. Along the way data will be visible in NiFi lineage, Kafka topic views, Flink SQL output, REST output and Iceberg tables.
Apache NiFi, Apache Kafka, Apache OpenNLP, Apache Tika, Apache Flink, Apache Avro, Apache Parquet, Apache Iceberg.
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
https://medium.com/@tspann/open-source-streaming-talks-in-progress-3e75af8848b0
https://medium.com/@tspann/watching-airport-traffic-in-real-time-32c522a6e386
Data Integration with Apache Kafka: What, Why, How
Pat Patterson
Presented at Orange County Advanced Analytics and Big Data Meetup, June 21 2019.
Apache Kafka has fast become the dominant messaging technology for the enterprise; if you're a data scientist or data engineer and you have not yet worked with Kafka, that situation will likely change soon! In this session, Pat Patterson, director of evangelism at StreamSets, explains what Kafka is, why it has disrupted the previous generation of messaging products, and how you can use open source products to build dataflow pipelines with Kafka, without writing code.
Are you interested in harnessing and analyzing the data that drives the Spark Web UI? Are you keen to use that data to tune your applications or understand fluctuations in runtime of your production applications? Do you want to understand the efficiency of your Spark executors and system resources?
This presentation will help you do that and more, by walking through the wealth of data in Spark application events. The event data can be used as a foundation for a Spark profiler and advisor that analyzes application events in batch or real-time.
At the very least, you will be able to use the data to generate a summary page of your application execution, similar to the Hadoop job summary page, allowing you to compare executions.
A common theme in the IoT space is the need for large volume data streaming, ingestion and storage, and post-ingestion processing and analytics, all of which depend on an efficient, scalable and well-performing data model. With intelligent transportation picking up traction as an IoT showcase, this presentation will take the use case of vehicle-to-infrastructure (V2I) data exchange for intelligent vehicle systems, and walk through a high level data schema and datastore design approach to support billions of vehicles and hundreds of billions of daily data events. It should be noted that at these volumes, effective and efficient schema-level indexing is not practical. The proposed design borrows a page from the venerable Unix Filesystem inode structure and can be implemented on datastores like Apache Cassandra and Apache HBase.
Spark's capabilities as a better and faster Hadoop, as a distributed Scala platform, and as an interactive, batch and streaming environment are quite well known. But its prowess at being all that as a multilingual platform has not received sufficient spotlight.
Traditionally, RDBMS environments needed to glue together set-oriented SQL with row-level specialized procedural languages (e.g. PL/SQL), or use APIs in non-SQL languages, e.g. JDBC. In Spark, however, the confluence of Scala and SQL is that of two equals, as both are set or collection oriented but have their own unique strengths.
This presentation will illustrate, with background and examples, how to exploit this fusion of Scala and SQL in a way that takes advantage of both their strengths and boosts productivity.