Storm and Cassandra

Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how it integrates with Apache Cassandra. There is also a segue into an example project that uses Storm and Cassandra to implement a scalable, reactive web crawler: http://github.com/tjake/stormscraper

Storm and Cassandra
Cassandra NYC Meetup 11/5/2013
Jake Luciani (@tjake)

What is Storm?
• Distributed event processor
• Provides constructs to reliably process all events
• Simple conceptual model
• New to Apache Incubator: http://wiki.apache.org/incubator/StormProposal

Storm Concepts
Spout - Collects work and submits it to be processed.
Tracks success or failure of each tuple.

…

Tuple - A collection of data that is passed within storm.

Bolt - Processes tuples and optionally emits more tuples.
Stream - Identifies outputs from a Spout/Bolt.
Forces tuples to have some declared structure.
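
To make the vocabulary above concrete, here is a minimal sketch in plain Python. It is a toy model and not Storm's API: the class names `ListSpout` and `HostBolt`, the `run` driver, and the `url`/`host` field names are all hypothetical, chosen only to show a spout tracking each tuple's success or failure while a bolt emits tuples with a declared structure.

```python
from collections import deque


class ListSpout:
    """Toy spout: hands out work and tracks the fate of each tuple."""

    def __init__(self, items):
        self.pending = deque(enumerate(items))  # (message id, payload)
        self.in_flight = {}
        self.acked, self.failed = [], []

    def next_tuple(self):
        if not self.pending:
            return None
        msg_id, payload = self.pending.popleft()
        self.in_flight[msg_id] = payload
        return msg_id, {"url": payload}  # a tuple: named fields -> values

    def ack(self, msg_id):
        self.acked.append(self.in_flight.pop(msg_id))

    def fail(self, msg_id):
        self.failed.append(self.in_flight.pop(msg_id))


class HostBolt:
    """Toy bolt: consumes one tuple and emits a new tuple on its stream."""

    declared_fields = ("url", "host")  # the stream's declared structure

    def execute(self, tup):
        url = tup["url"]
        if "://" not in url:
            raise ValueError(f"not a URL: {url}")
        host = url.split("://", 1)[1].split("/", 1)[0]
        return {"url": url, "host": host}


def run(spout, bolt):
    """Drain the spout, acking tuples that succeed and failing the rest."""
    emitted = []
    while True:
        item = spout.next_tuple()
        if item is None:
            return emitted
        msg_id, tup = item
        try:
            out = bolt.execute(tup)
            # Emitted tuples must match the structure the stream declared.
            assert tuple(out) == bolt.declared_fields
            emitted.append(out)
            spout.ack(msg_id)
        except ValueError:
            spout.fail(msg_id)


spout = ListSpout(["http://example.com/a", "not-a-url"])
results = run(spout, HostBolt())
```

After the run, `spout.acked` holds the URL that was processed and `spout.failed` holds the one that was rejected; what to do with failed work (for example, retrying it) is left to the spout.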

Storm Topologies
A directed graph of spouts and bolts connected via streams

[Diagram labels: Firehose; A-F, G-P, Q-Z; Host A, Host B, Host C; Zookeeper; Cassandra (optional)]

Example Topologies

• Track the top 10 most popular links being shared in the last N minutes.
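
The counting logic behind such a topology can be sketched in a few lines of plain Python. This is only an illustration of a sliding-window top-N count, not code from the talk: the `TopLinks` class, its method names, and the use of integer timestamps in seconds are assumptions, and it expects events to arrive in non-decreasing time order.

```python
from collections import Counter, deque


class TopLinks:
    """Counts link shares inside a sliding time window."""

    def __init__(self, window_seconds, top_n=10):
        self.window_seconds = window_seconds
        self.top_n = top_n
        self.events = deque()    # (timestamp, url), oldest first
        self.counts = Counter()  # url -> shares still inside the window

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, url = self.events.popleft()
            self.counts[url] -= 1
            if self.counts[url] == 0:
                del self.counts[url]

    def record(self, url, timestamp):
        self.events.append((timestamp, url))
        self.counts[url] += 1
        self._expire(timestamp)

    def top(self, now):
        self._expire(now)
        return self.counts.most_common(self.top_n)


tracker = TopLinks(window_seconds=60, top_n=2)
tracker.record("http://a.example", timestamp=0)
tracker.record("http://b.example", timestamp=10)
tracker.record("http://b.example", timestamp=20)
tracker.record("http://c.example", timestamp=70)
tracker.record("http://c.example", timestamp=72)
ranking = tracker.top(now=75)
```

By the time `top` is called, the share of `a.example` and the first share of `b.example` have aged out, so `c.example` ranks first with two shares. In a distributed setting each counting task would hold one such tracker for its slice of the URL space, with a final step merging the partial rankings.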

Where does data end up?
• Storm supports built-in RPC, so client requests can effectively become a spout.
• Put the data into a database…
• Why Cassandra though?

Why Cassandra?

• Cassandra’s data model allows incremental modifications to rows.
• Different bolts can update different parts of a Cassandra row asynchronously.
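
A toy model shows why this matters for a topology whose bolts finish in no particular order. The sketch below uses a plain Python dict in place of a table and is not a Cassandra client; the `upsert` helper and the sample column values are hypothetical. Because each write touches different columns of the same row, every arrival order produces the same final row.

```python
import itertools


def upsert(table, key, columns):
    """Merge `columns` into the row at `key`, creating the row if needed."""
    table.setdefault(key, {}).update(columns)


# Each hypothetical bolt contributes a different part of the same row.
writes = [
    {"html": "<html>...</html>"},
    {"outbound_links": {"http://example.com/about"}},
    {"text": "Example page"},
]

# Apply the writes in every possible arrival order.
final_rows = []
for ordering in itertools.permutations(writes):
    table = {}
    for columns in ordering:
        upsert(table, "http://example.com/", columns)
    final_rows.append(table["http://example.com/"])
```

All six orderings end with an identical row, so the bolts need no coordination with one another as long as they write disjoint columns.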

StormScraper!
A web crawling system built on
Storm + Cassandra
http://github.com/tjake/stormscraper

StormScraper C* DataModel

CREATE TABLE scrape_list (
    url text PRIMARY KEY,
    last_update timestamp,
    depth int
);

CREATE TABLE pages (
    url text,
    scrape_date timestamp,
    title text,
    html text,
    text text,
    inbound_links set<text>,
    outbound_links set<text>,
    PRIMARY KEY (url, scrape_date)
);
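
Given the `pages` table above, each writer can issue an update that touches only its own columns. The sketch below is an assumption about how such statements might look, not code from the stormscraper repository: the function names are hypothetical, the statements use `?` bind markers as in prepared CQL statements, and each function only builds the statement and its parameters without contacting a cluster.

```python
from datetime import datetime, timezone


def html_write(url, scrape_date, html):
    """Statement a hypothetical HTML writer might issue."""
    cql = "UPDATE pages SET html = ? WHERE url = ? AND scrape_date = ?"
    return cql, (html, url, scrape_date)


def link_write(url, scrape_date, links):
    """Adds to the set column instead of overwriting it."""
    cql = ("UPDATE pages SET outbound_links = outbound_links + ? "
           "WHERE url = ? AND scrape_date = ?")
    return cql, (set(links), url, scrape_date)


def text_write(url, scrape_date, title, text):
    """Statement a hypothetical text writer might issue."""
    cql = ("UPDATE pages SET title = ?, text = ? "
           "WHERE url = ? AND scrape_date = ?")
    return cql, (title, text, url, scrape_date)


when = datetime(2013, 11, 5, tzinfo=timezone.utc)
url = "http://example.com/"
statements = [
    html_write(url, when, "<html>...</html>"),
    link_write(url, when, ["http://example.com/about"]),
    text_write(url, when, "Example", "Example page"),
]
```

All three statements address the same primary key `(url, scrape_date)`, so they fill in one row between them regardless of which arrives first.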

StormScraper Topology

[Diagram, built up over several slides: Url Spout, Scraper Bolt, Html Writer, Link Writer, Text Extraction Bolt, Text Writer, Cassandra; the final slide adds a "Fail" label]
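
The components named on the topology slides can be written down as a small directed graph. The slides' arrows did not survive extraction, so the edges below are an assumption inferred from the order in which components appear; the sketch is plain Python rather than Storm's topology builder, and the `downstream` helper is hypothetical. The writers are treated as sinks that store their results in Cassandra.

```python
from collections import deque

# Assumed wiring: component -> components that consume its output.
TOPOLOGY = {
    "Url Spout": ["Scraper Bolt"],
    "Scraper Bolt": ["Html Writer", "Link Writer", "Text Extraction Bolt"],
    "Text Extraction Bolt": ["Text Writer"],
    "Html Writer": [],
    "Link Writer": [],
    "Text Writer": [],
}


def downstream(topology, source):
    """Return every component reachable from `source`, breadth first."""
    seen, queue = [source], deque([source])
    while queue:
        for consumer in topology[queue.popleft()]:
            if consumer not in seen:
                seen.append(consumer)
                queue.append(consumer)
    return seen


order = downstream(TOPOLOGY, "Url Spout")
unreachable = set(TOPOLOGY) - set(order)
```

An empty `unreachable` set means every component receives data that originates at the spout, which is a cheap sanity check to run before deploying a topology.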

Code Walkthrough
http://github.com/tjake/stormscraper

Storm Summary

• Powerful
• But easy to make mistakes
    • Wrong tuple expectation, names, types
    • Bad topology wiring
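
One way to catch these mistakes early is to compare what each component declares with what its consumers expect before the topology runs. The following is a minimal sketch of that idea in plain Python, not a Storm feature: the `check_wiring` function, the component names, and the field names are all hypothetical.

```python
def check_wiring(declared, expected):
    """Return a list of problems found before the topology is submitted.

    declared: component -> tuple of field names it emits
    expected: (producer, consumer) -> field names the consumer reads
    """
    problems = []
    for (producer, consumer), fields in expected.items():
        if producer not in declared:
            problems.append(
                f"{consumer} subscribes to unknown component {producer}")
            continue
        missing = [f for f in fields if f not in declared[producer]]
        if missing:
            problems.append(f"{consumer} expects {missing} from {producer}")
    return problems


declared = {"scraper": ("url", "html")}
expected = {
    ("scraper", "html_writer"): ("url", "html"),
    ("scraper", "link_writer"): ("url", "links"),      # wrong field name
    ("scrapper", "text_extractor"): ("url", "html"),   # misspelled producer
}
problems = check_wiring(declared, expected)
```

Here the check reports the mismatched field name and the misspelled producer, while the correctly wired `html_writer` passes silently.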


FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf

FIDO Alliance

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...

Product School

UiPath Test Automation using UiPath Test Suite series, part 4

DianaGray10

Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap. The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies. Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques What will you get from this session? 1. Insights into SAP testing best practices 2. Heatmap utilization for testing 3. Optimization of testing processes 4. Demo Topics covered: Execution from the test manager Orchestrator execution result Defect reporting SAP heatmap example with demo Speaker: Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP

Generating a custom Ruby SDK for your web service or Rails API using Smithy

g2nightmarescribd

Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.

Securing your Kubernetes cluster_ a step-by-step guide to success !

KatiaHIMEUR1

Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster. However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks. In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.

Assuring Contact Center Experiences for Your Customers With ThousandEyes

ThousandEyes

Essentials of Automations: Optimizing FME Workflows with Parameters

Safe Software

Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place. Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects. Here’s what you’ll gain: - Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows. - Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy. - Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency. - Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity. We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic. Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.

Epistemic Interaction - tuning interfaces to provide information for AI support

Alan Dix

Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024 https://alandix.com/academic/papers/synergy2024-epistemic/ As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.

Neuro-symbolic is not enough, we need neuro-*semantic*

Frank van Harmelen

Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”. All of this illustrated with link prediction over knowledge graphs, but the argument is general.

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...

Product School

To Graph or Not to Graph Knowledge Graph Architectures and LLMs

Paul Groth

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024

Albert Hoitingh

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...

James Anderson

Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management. The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM). Speakers: Bob Boule Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle. Gopinath Rebala Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. 
Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.

Recently uploaded (20)