Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
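As a taste of the deck's "Developing Storm apps" section: the canonical Hello World is word count, where a spout emits sentences and downstream bolts split and count them. Below is a rough plain-Java sketch of that division of labor (no Storm dependency; the class and method names are illustrative, not Storm APIs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogue of the classic word-count topology:
// a "spout" emits sentences, a split "bolt" tokenizes them,
// and a count "bolt" keeps running totals per word.
public class WordCountPipeline {

    // "Spout": the source of the stream.
    static List<String> sentenceSpout() {
        return Arrays.asList("the cow jumped over the moon",
                             "the man went to the store");
    }

    // "Split bolt": one input tuple (a sentence) -> many output tuples (words).
    static List<String> splitBolt(String sentence) {
        return Arrays.asList(sentence.split("\\s+"));
    }

    // "Count bolt": stateful, keyed by word.
    static Map<String, Integer> countBolt(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static Map<String, Integer> run() {
        List<String> words = new ArrayList<>();
        for (String s : sentenceSpout()) words.addAll(splitBolt(s));
        return countBolt(words);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In a real topology the count "bolt" would receive its words via a fields grouping on "word", so that the same word always lands on the same bolt instance regardless of parallelism.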
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat... (Databricks)
Apache Spark is an excellent tool to accelerate your analytics, whether you're doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I've applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case.
Building a real time, solr-powered recommendation engine (Trey Grainger)
Searching text is what Solr is known for, but did you know that many companies receive an equal or greater business impact through implementing a recommendation engine in addition to their text search capabilities? With a few tweaks, Solr (or Lucene) can also serve as a full featured recommendation engine. Machine learning libraries like Apache Mahout provide excellent behavior-based, off-line recommendation algorithms, but what if you want more control? This talk will demonstrate how to effectively utilize Solr to perform collaborative filtering (users who liked this also liked…), categorical classification and subsequent hierarchical-based recommendations, as well as related-concept extraction and concept based recommendations. Sound difficult? It's not. Come learn step-by-step how to create a powerful real-time recommendation engine using Apache Solr and see some real-world examples of some of these strategies in action.
Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Common Patterns of Multi Data-Center Architectures with Apache Kafka (Confluent)
Whether you know you want to run Apache Kafka in multiple data centers and need practical advice or you are wondering why some organizations even need more than one cluster, this online talk is for you.
In this short session, we'll discuss the basic patterns of multi-datacenter Kafka architectures, explore some of the use-cases enabled by each architecture and show how Confluent Enterprise products make these patterns easy to implement.
Visit www.confluent.io for more information.
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
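A minimal sketch of that DSL: the classic word count, written against the current StreamsBuilder API (the talk predates it), assuming kafka-streams is on the classpath and using illustrative topic names ("text-input", "counts-output"):

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {

    // Build the word-count topology with the Streams DSL.
    static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("text-input")
               .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
               .groupBy((key, word) -> word)
               .count()
               .toStream()
               .to("counts-output", Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // An ordinary Java application: the library runs its processing
        // threads inside this process; no separate cluster is required.
        new KafkaStreams(buildTopology(), props).start();
    }
}
```

The library-vs-framework contrast is visible in main(): this is a plain Java application that embeds its processing threads, rather than a job submitted to separately operated infrastructure.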
Exactly-once Stream Processing with Kafka Streams (Guozhang Wang)
I will present the recent additions to Kafka to achieve exactly-once semantics (0.11.0) within its Streams API for stream processing use cases. This is achieved by leveraging the underlying idempotent and transactional client features. The main focus will be the specific semantics that Kafka distributed transactions enable in Streams and the underlying mechanics to let Streams scale efficiently.
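In the Streams API, turning this machinery on is a single configuration switch; a sketch (the transactional.id value below is illustrative, not from the talk):

```properties
# Kafka Streams (0.11+): request exactly-once processing.
processing.guarantee=exactly_once

# For a plain producer, the equivalent building blocks are the
# idempotent and transactional client features:
# enable.idempotence=true
# transactional.id=my-app-txn-1
```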
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff, and can be used as an introduction to Apache Spark.
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle (Databricks)
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating shuffle in join or group-by-aggregate scenarios. This is ideal for a variety of write-once and read-many datasets at Bytedance.
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Unified Batch & Stream Processing with Apache Samza (DataWorks Summit)
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Improving Python and Spark (PySpark) Performance and Interoperability (Wes McKinney)
Slides from Spark Summit East 2017, February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools.
Developers building healthcare applications for mobile devices, wearables and the desktop need to understand HIPAA requirements in order to build apps that are in compliance. This deck gives application developers an overview of the HIPAA rules and what they mean for their software development.
This tutorial covers advanced consumer topics like custom deserializers, using a ConsumerRebalanceListener to rewind to a certain offset, manual assignment of partitions to implement a "priority queue", an "at least once" message delivery semantics consumer Java example, an "at most once" message delivery semantics consumer Java example, an "exactly once" message delivery semantics consumer Java example, and a lot more.
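The difference between the "at most once" and "at least once" examples comes down to whether the consumer commits offsets before or after doing the work. Below is a self-contained simulation of that ordering (plain Java, no Kafka client; the offsets and crash point are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Simulates consuming offsets 0..4 with a crash while handling offset 2,
// followed by a restart from the last committed offset.
public class DeliverySemantics {

    /**
     * @param commitBeforeProcessing true  = at-most-once  (commit, then process)
     *                               false = at-least-once (process, then commit)
     * @return every offset that was actually processed, across crash + restart
     */
    static List<Integer> run(boolean commitBeforeProcessing) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;                                // next offset to read
        for (int attempt = 0; attempt < 2; attempt++) {   // run, crash, restart
            for (int offset = committed; offset < 5; offset++) {
                if (commitBeforeProcessing) committed = offset + 1;
                if (attempt == 0 && offset == 2) {
                    // Crash mid-record: with commit-after, the work was done
                    // but the commit is lost; with commit-first, the offset
                    // is committed although the record was never processed.
                    if (!commitBeforeProcessing) processed.add(offset);
                    break;
                }
                processed.add(offset);
                if (!commitBeforeProcessing) committed = offset + 1;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("at-most-once:  " + run(true));   // offset 2 lost
        System.out.println("at-least-once: " + run(false));  // offset 2 duplicated
    }
}
```

Running it shows at-most-once losing offset 2 and at-least-once processing it twice; "exactly once" then amounts to making the processing and the offset commit atomic.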
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St... (Michael Noll)
My talk at Google DevFest Switzerland, Fribourg, Oct 2017.
https://devfest.ch/schedule/day1?sessionId=118
Abstract:
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka.
Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today's common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data architectures. We cover how you can now build normal applications to serve your real-time processing needs, rather than building clusters or similar special-purpose infrastructure, and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. We discuss common use cases to realize that stream processing in practice often requires database-like functionality, and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (inventory management for large retailers, patient monitoring in healthcare, fleet tracking in logistics, etc), for example in the form of event-driven, containerized microservices. We will also give a brief shout-out to the recently launched KSQL, a streaming SQL engine for Apache Kafka.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We'll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; and peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high-performance.
Select configuration parameters and deployment topologies essential to achieving higher throughput and lower latency across the pipeline are discussed, along with lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100 GB of data in under 25 minutes.
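Tuning conversations like this typically start from a handful of producer settings; an illustrative starting point (the values here are assumptions for discussion, not the talk's recommendations):

```properties
# Batch more records per request: bigger batches, fewer round trips.
batch.size=131072

# Wait briefly for batches to fill instead of sending immediately.
linger.ms=20

# Trade some CPU for less network and disk traffic.
compression.type=snappy

# acks=1 favors latency; acks=all favors durability.
acks=1
```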
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Apache Kafka is becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs. We will also talk about best practices for running producers and consumers.
In the Kafka 0.9 release, we've added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control over who can read and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
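Client-side, those 0.9 security features are enabled through configuration; a sketch (the paths and password below are placeholders):

```properties
# Kerberos authentication over TLS (Kafka 0.9+).
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka

# Trust store used to verify the brokers' TLS certificates.
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=changeit
```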
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Learning Stream Processing with Apache Storm (Eugene Dvorkin)
Over the last couple of years, Apache Storm became a de-facto standard for developing real-time analytics and complex event processing applications. Storm enables you to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be used are fraud detection, operational intelligence, machine learning, ETL, analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building a simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
A presentation at Twitter's official developer conference, Chirp, about why we use the Scala programming language and how we build services in it. Provides a tour of a number of libraries and tools, both developed at Twitter and otherwise.
Integrate Solr with real-time stream processing applications (thelabdude)
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr's real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He'll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
Keynote at COMMitMDE'18 showing the basic concepts behind Hawk, our past case studies, and some of our experience in designing the Hawk Thrift APIs for remote model querying.
OWASP WTE, or OWASP Web Testing Environment, is a collection of application security tools and documentation available in multiple formats such as VMs, Linux distribution packages, Cloud-based installations and ISO images.
This presentation provides an overview and history of OWASP WTE. Additionally, it shows new OWASP WTE developments, including the ability to use WTE remotely by installing it on a cloud-based server.
Building and deploying LLM applications with Apache Airflow (Kaxil Naik)
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We'll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT-4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
Nelson: Rigorous Deployment for a Functional World (Timothy Perrett)
Functional programming finds its roots in mathematics - the pursuit of purity and completeness. We functional programmers look to formalize system behaviors in an algebraic and total manner. Despite this, when it comes time to deploy one's beautiful monadic ivory towers to production, most organizations cast caution to the wind and use a myriad of bash scripts and sticky tape to get the job done. In this talk, the speaker will introduce you to Nelson, an open-source project from Verizon that looks to provide rigor to your large distributed system, whilst offering best-in-class security, runtime traffic shifting and a fully immutable approach to application lifecycle. Nelson itself is entirely composed of free algebras and coproducts, and the speaker will show not only how this has enabled development, but also how it provided a frame with which to reason about solutions to fundamental operational problems.
Practical Chaos Engineering will show how to start running chaos experiments in your infrastructure and will guide you through the principles of chaos.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer... (MLconf)
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Solving k8s persistent workloads using k8s DevOps style (MayaData)
Solving k8s persistent workloads using k8s DevOps style. Presented at Container_stack-Zurich-2019.
- How hardware trends enforce a change in the way we do things
- Storage limitations bubble up
- Infrastructure as code
David Kale and Ruben Fizsel from Skymind talk about deep learning for the JVM and enterprise using deeplearning4j (DL4J). Deep learning (nouveau neural nets) has sparked a renaissance in empirical machine learning with breakthroughs in computer vision, speech recognition, and natural language processing. However, many popular deep learning frameworks are targeted to researchers and poorly suited to enterprise settings that use Java-centric big data ecosystems. DL4J bridges the gap, bringing high performance numerical linear algebra libraries and state-of-the-art deep learning functionality to the JVM.
Similar to Apache Storm 0.9 basic training - Verisign
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Â
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Â
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
1. Apache Storm 0.9 basic training
Michael G. Noll, Verisign
mnoll@verisign.com / @miguno
July 2014
2. Storm?
• Part 1: Introducing Storm
• "Why should I stay awake for the full duration of this workshop?"
• Part 2: Storm core concepts
• Topologies, tuples, spouts, bolts, groupings, parallelism
• Part 3: Operating Storm
• Architecture, hardware specs, deploying, monitoring
• Part 4: Developing Storm apps
• Bolts and topologies, Kafka integration, testing, serialization, example apps, P&S tuning
• Part 5: Playing with Storm using Wirbelsturm
• Wrapping up
Verisign Public
3. NOT covered in this workshop (too little time)
• Storm Trident
• High-level abstraction on top of Storm, which intermixes high-throughput, stateful stream processing with low-latency distributed querying.
• Joins, aggregations, grouping, functions, filters.
• Adds primitives for doing stateful, incremental processing on top of any database or persistence store.
• Has consistent, exactly-once semantics.
• Processes a stream as small batches of messages (cf. Spark Streaming)
• Storm DRPC
• Parallelizes the computation of really intense functions on the fly.
• Input is a stream of function arguments, and output is a stream of the results for each of those function calls.
5. Overview of Part 1: Introducing Storm
• Storm?
• Storm adoption and use cases in the wild
• Storm in a nutshell
• Motivation behind Storm
6. Storm?
• "Distributed and fault-tolerant real-time computation"
• http://storm.incubator.apache.org/
• Originated at BackType/Twitter, open sourced in late 2011
• Implemented in Clojure, some Java
• 12 core committers, plus ~70 contributors
https://github.com/apache/incubator-storm/#committers
https://github.com/apache/incubator-storm/graphs/contributors
7. Storm adoption and use cases
• Twitter: personalization, search, revenue optimization, …
• 200 nodes, 30 topologies, 50B msg/day, avg latency <50 ms, Jun 2013
• Yahoo: user events, content feeds, and application logs
• 320 nodes (YARN), 130k msg/s, Jun 2013
• Spotify: recommendation, ads, monitoring, …
• v0.8.0, 22 nodes, 15+ topologies, 200k msg/s, Mar 2014
• Alibaba, Cisco, Flickr, PARC, WeatherChannel, …
• Netflix is looking at Storm and Samza, too.
https://github.com/nathanmarz/storm/wiki/Powered-By
14. "Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won't usually need your code; it'll be obvious."
-- Eric S. Raymond, The Cathedral and the Bazaar
21. Clojure
• Is a dialect of Lisp that targets the JVM (and JavaScript)
• clojure-1.5.1.jar
22. Wait a minute – LISP??
(me? (kidding (you (are))))
Yeah, those parentheses are annoying. At first.
Think: Like Python's significant whitespace.
23. Clojure
• Is a dialect of Lisp that targets the JVM (and JavaScript)
• clojure-1.5.1.jar
• "Dynamic, compiled programming language"
• Predominantly functional programming
• Many interesting characteristics and value propositions for software development, notably for concurrent applications
• Storm's core is implemented in Clojure
• And you will see why they match so well.
24. Previous WordCount example in Clojure
h g f
(sort-by val > (frequencies (map second queries)))
Alternative, left-to-right syntax with ->>:
(->> queries (map second) frequencies (sort-by val >))
$ cat input.txt | awk | sort # kinda
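For readers who don't speak Lisp, here is a rough Python equivalent of the same pipeline (a sketch added for illustration, not part of the original deck):

```python
from collections import Counter

# (->> queries (map second) frequencies (sort-by val >)) in Python:
queries = [("1.1.1.1", "foo.com"), ("2.2.2.2", "bar.net"),
           ("3.3.3.3", "foo.com"), ("4.4.4.4", "foo.com"),
           ("5.5.5.5", "bar.net")]

domains = [second for _first, second in queries]  # (map second queries)
freqs = Counter(domains)                          # (frequencies ...)
ranked = freqs.most_common()                      # (sort-by val > ...)
print(ranked)  # [('foo.com', 3), ('bar.net', 2)]
```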
25. Clojure REPL
user> queries
(("1.1.1.1" "foo.com") ("2.2.2.2" "bar.net")
("3.3.3.3" "foo.com") ("4.4.4.4" "foo.com")
("5.5.5.5" "bar.net"))
user> (map second queries)
("foo.com" "bar.net" "foo.com" "foo.com" "bar.net")
user> (frequencies (map second queries))
{"bar.net" 2, "foo.com" 3}
user> (sort-by val > (frequencies (map second queries)))
(["foo.com" 3] ["bar.net" 2])
26. Scaling up
Clojure, Java, <your-pick> can turn the
previous code into a multi-threaded app
that utilizes all cores on your server.
27. But what if even a very big machine is not enough for your Internet-scale app?
31. Overview of Part 2: Storm core concepts
• A first look
• Topology
• Data model
• Spouts and bolts
• Groupings
• Parallelism
32. A first look
Storm is distributed FP-like
processing of data streams.
Same idea, many machines.
(but there's more of course)
33. Overview of Part 2: Storm core concepts
• A first look
• Topology
• Data model
• Spouts and bolts
• Groupings
• Parallelism
34. A topology in Storm wires data and functions via a DAG.
It executes on many machines, like a MapReduce job in Hadoop.
37. Topology
[Diagram: two spouts (data) wired to four bolts (functions), forming a topology.]
38. Topology
[Diagram: the same spouts and bolts, highlighting that the wiring forms a DAG.]
39. Relation of topologies to FP
[Diagram: the topology's bolts annotated as functions f, g, h applied to the spouts' data.]
40. Relation of topologies to FP
[Diagram: the same topology, read as function composition.]
DAG: h(f(data), g(data))
41. Previous WordCount example in Storm (high-level)
(->> queries (map second) frequencies (sort-by val >))
[Diagram: queries flow from a Spout; f, g, h correspond to Bolt 1, Bolt 2, Bolt 3.]
Remember?
42. Overview of Part 2: Storm core concepts
• A first look
• Topology
• Data model
• Spouts and bolts
• Groupings
• Parallelism
43. Data model
Tuple = datum containing 1+ fields
(1.1.1.1, "foo.com")
Values can be of any type, such as Java primitive types, String, byte[].
Custom objects should provide their own Kryo serializer, though.
Stream = unbounded sequence of tuples
...
(1.1.1.1, "foo.com")
(2.2.2.2, "bar.net")
(3.3.3.3, "foo.com")
...
http://storm.incubator.apache.org/documentation/Concepts.html
44. Overview of Part 2: Storm core concepts
• A first look
• Topology
• Data model
• Spouts and bolts
• Groupings
• Parallelism
45. Spouts and bolts
Spout = source of data streams
Bolt = consumes 1+ streams and potentially produces new streams
Spout 1 → Bolt 1
Spouts can be "unreliable" (fire-and-forget) or "reliable" (can replay failed tuples).
Example: Connect to the Twitter API and emit a stream of decoded URLs.
Spout 1 → Bolt 1 → Bolt 2
Bolts can do anything from running functions, filtering tuples, and joins to talking to a DB, etc.
Complex stream transformations often require multiple steps and thus multiple bolts.
http://storm.incubator.apache.org/documentation/Concepts.html
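Conceptually, a spout is a stream source and bolts are stream transformers chained behind it. A sketch of that idea in Python generator terms (illustrative only, not the Storm API):

```python
def spout():
    # Spout = source of a stream of tuples (finite here for demo purposes)
    yield ("1.1.1.1", "foo.com")
    yield ("2.2.2.2", "bar.net")
    yield ("3.3.3.3", "foo.com")

def extract_domain_bolt(stream):
    # Bolt 1: consumes one stream, emits a new stream
    for _ip, domain in stream:
        yield domain

def filter_bolt(stream):
    # Bolt 2: a bolt may also just filter tuples
    for domain in stream:
        if domain.endswith(".com"):
            yield domain

# Wiring the "topology": Spout -> Bolt 1 -> Bolt 2
result = list(filter_bolt(extract_domain_bolt(spout())))
print(result)  # ['foo.com', 'foo.com']
```

In Storm, of course, the stream is unbounded and each stage runs as many parallel tasks across machines.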
46. Overview of Part 2: Storm core concepts
• A first look
• Topology
• Data model
• Spouts and bolts
• Groupings
• Parallelism
47. Stream groupings control the data flow in the DAG
[Diagram: a Spout feeding bolts A, B, and C.]
• Shuffle grouping = random; typically used to distribute load evenly to downstream bolts
• Fields grouping = GROUP BY field(s)
• All grouping = replicates the stream across all the bolt's tasks; use with care
• Global grouping = the stream goes to a single one of the bolt's tasks; don't overwhelm the target bolt!
• Direct grouping = the producer of the tuple decides which task of the consumer will receive the tuple
• LocalOrShuffle = if the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks; otherwise, same as normal shuffle
• Custom groupings are possible, too.
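The routing behavior of the two most common groupings can be sketched as follows (conceptual Python, assuming simple hash-mod-tasks routing for the fields grouping; Storm's actual implementation differs in detail):

```python
import random

NUM_TASKS = 4
tasks = list(range(NUM_TASKS))

def shuffle_grouping(_tuple):
    # Random task; distributes load evenly on average.
    return random.choice(tasks)

def fields_grouping(tuple_, key_index=1):
    # The same field value always routes to the same task (like GROUP BY).
    return hash(tuple_[key_index]) % NUM_TASKS

# Two tuples with the same domain field land on the same task:
t1 = ("1.1.1.1", "foo.com")
t2 = ("9.9.9.9", "foo.com")
assert fields_grouping(t1) == fields_grouping(t2)
```

This is why a fields grouping on, say, the domain field lets a downstream bolt keep per-domain counters without coordination between its tasks.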
48. Overview of Part 2: Storm core concepts
• A first look
• Topology
• Data model
• Spouts and bolts
• Groupings
• Parallelism – workers, executors, tasks
49. Worker processes vs. Executors vs. Tasks
Invariant: #threads ≤ #tasks
A worker process is either idle or being used by a single topology, and it is never shared across topologies. The same applies to its child executors and tasks.
http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
54. Storm architecture
[Diagram: Nimbus, a ZooKeeper ensemble, and many Supervisors.]
Hadoop v1 → Storm
JobTracker → Nimbus (only 1)
• distributes code around the cluster
• assigns tasks to machines/supervisors
• failure monitoring
• is fail-fast and stateless (you can "kill -9" it)
TaskTracker → Supervisor (many)
• listens for work assigned to its machine
• starts and stops worker processes as necessary based on Nimbus
• is fail-fast and stateless (you can "kill -9" it)
• shuts down worker processes with "kill -9", too
MR job → Topology
• processes messages forever (or until you kill it)
• a running topology consists of many worker processes spread across many machines
56. Storm architecture: ZooKeeper
• Storm requires ZooKeeper
• 0.9.2+ uses ZK 3.4.5
• Storm typically puts less load on ZK than Kafka does (though ZK is still a bottleneck); caution: you often have many more Storm nodes than Kafka nodes
• ZooKeeper
• NOT used for message passing, which is done via Netty in 0.9
• Used for coordination purposes, and to store state and statistics
• Register + discover Supervisors, detect failed nodes, …
• Example: To add a new Supervisor node, just start it.
• This allows Storm's components to be stateless. "kill -9" away!
• Example: Supervisors/Nimbus can be restarted without affecting running topologies.
• Used for heartbeats
• Workers heartbeat the status of child executor threads to Nimbus via ZK.
• Supervisor processes heartbeat their own status to Nimbus via ZK.
• Stores recent task errors (deleted on topology shutdown)
57. Storm architecture: fault tolerance
• What happens when Nimbus dies (master node)?
• If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart like nothing happened.
• While Nimbus is down:
• Existing topologies will continue to run, but you cannot submit new topologies.
• Running worker processes will not be affected. Also, Supervisors will restart their (local) workers if needed. However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
• What happens when a Supervisor dies (slave node)?
• If the Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart like nothing happened.
• Running worker processes will not be affected.
• What happens when a worker process dies?
• Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
58. Storm hardware specs
• ZooKeeper
• Preferably use dedicated machines, because ZK is a bottleneck for Storm
• 1 ZK instance per machine
• Using VMs may work in some situations. Keep in mind that other VMs or processes running on the shared host machine may impact ZK performance, particularly if they cause I/O load. (source)
• I/O is a bottleneck for ZooKeeper
• Put ZK storage on its own disk device
• SSDs dramatically improve performance
• Normally, ZK will sync to disk on every write, which causes two seeks (1x for the data, 1x for the data log). This may add up significantly when all the workers are heartbeating to ZK. (source)
• Monitor I/O load on the ZK nodes
• Preferably run ZK ensembles with >= 3 nodes in production environments, so that you can tolerate the failure of 1 ZK server (incl. e.g. maintenance)
59. Storm hardware specs
• Nimbus aka master node
• Comparatively little load on Nimbus, so a medium-sized machine suffices
• EC2 example: m1.xlarge @ $0.27/hour
• Check monitoring stats to see if the machine can keep up
60. Storm hardware specs
• Storm Supervisor aka slave nodes
• Exact specs depend on anticipated usage – e.g. CPU heavy, I/O heavy, …
• CPU heavy: e.g. machine learning
• CPU light: e.g. rolling windows, pre-aggregation (here: get more RAM)
• CPU cores
• More is usually better – the more you have, the more threads you can support (i.e. parallelism). And Storm potentially uses a lot of threads.
• Memory
• Highly specific to the actual use case
• Considerations: how many workers (= JVMs) per node? Are you caching and/or holding in-memory state?
• Network: 1GigE
• Use bonded NICs or 10GigE if needed
• EC2 examples: c1.xlarge @ $0.36/hour, c3.2xlarge @ $0.42/hour
61. Deploying Storm
• Puppet module
• https://github.com/miguno/puppet-storm
• Hiera-compatible, rspec tests, Travis CI setup (e.g. to test against multiple versions of Puppet and Ruby, Puppet style checker/lint, etc.)
• RPM packaging script for RHEL 6
• https://github.com/miguno/wirbelsturm-rpm-storm
• Digitally signed by yum@michael-noll.com
• RPM is built on a Wirbelsturm-managed build server
• See later slides on Wirbelsturm for 1-click off-the-shelf cluster setups.
63. Operating Storm
• Typical operations tasks include:
• Monitoring topologies for P&S ("Don't let our pipes blow up!")
• Tackling P&S in Storm is a joint Ops-Dev effort.
• Adding or removing slave nodes, i.e. nodes that run Supervisors
• Apps management: new topologies, swapping topologies, …
• See Ops-related references at the end of this part
64. Storm security
• The original design was not created with security in mind.
• Security features are now being added, e.g. from Yahoo!'s fork.
• State of security in Storm 0.9.x:
• No authentication, no authorization.
• No encryption of data in transit, i.e. between workers.
• No access restrictions on data stored in ZooKeeper.
• Arbitrary user code can be run on nodes if Nimbus' Thrift port is not locked down.
• The list goes on.
• Further details plus recommendations on hardening Storm:
• https://github.com/apache/incubator-storm/blob/master/SECURITY.md
66. Monitoring Storm
• Storm UI
• Use standard monitoring tools such as Graphite & friends
• Graphite
• https://github.com/miguno/puppet-graphite
• Java API, also used by Kafka: http://metrics.codahale.com/
• Consider Storm's built-in metrics feature
• Collect logging files into a central place
• Logstash/Kibana and friends
• Helps with troubleshooting, debugging, etc. – notably if you can correlate logging data with numeric metrics
67. Monitoring Storm
• Built-in Storm UI, listens on 8080/tcp by default
• Storm REST API (new in 0.9.2)
• https://github.com/apache/incubator-storm/blob/master/STORM-UI-REST-API.md
• Third-party tools
• https://github.com/otoolep/stormkafkamon
68. Monitoring Storm topologies
• Wait – why does the Storm UI report seemingly incorrect numbers?
• Storm samples incoming tuples when computing statistics in order to increase performance.
• The sample rate is configured via topology.stats.sample.rate.
• 0.05 is the default value
• Here, Storm picks one random event out of each window of 20 events and credits it with the full count of 20. So if you have 20 tasks for that bolt, your stats could be off by +/- 380.
• 1.00 forces Storm to count everything exactly
• This gives you accurate numbers at the cost of a big performance hit. For testing purposes, however, this is acceptable and often quite helpful.
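The sampling scheme above can be sketched as follows (assuming the "count one event per window of 20, credit it with 20" behavior described on this slide; a simplification of Storm's actual implementation):

```python
import random
random.seed(42)

SAMPLE_RATE = 0.05              # topology.stats.sample.rate (default)
WINDOW = int(1 / SAMPLE_RATE)   # count 1 event out of every 20 ...

def sampled_count(num_tuples):
    # ... and credit it with the full window size of 20.
    metric, chosen, offset = 0, random.randrange(WINDOW), 0
    for _ in range(num_tuples):
        if offset == chosen:
            metric += WINDOW
        offset += 1
        if offset == WINDOW:
            chosen, offset = random.randrange(WINDOW), 0
    return metric

true_count = 10_007
estimate = sampled_count(true_count)
# Per task, the estimate is off by less than one window (20):
assert abs(estimate - true_count) < WINDOW
```

With 20 tasks for a bolt, these per-task errors can add up, which is where the +/- 380 figure above comes from.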
69. Monitoring ZooKeeper
• Ensemble (= cluster) availability
• LinkedIn runs 5-node ensembles = tolerates 2 dead
• Twitter runs 13-node ensembles = tolerates 6 dead
• Latency of requests
• Metric target is 0 ms when using SSDs in the ZooKeeper machines.
• Why? Because SSDs are so fast that they typically bring latency below ZK's metric granularity (which is per-ms).
• Outstanding requests
• Metric target is 0.
• Why? Because ZK processes all incoming requests serially. Non-zero values mean that requests are backing up.
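The ensemble sizes above follow directly from ZooKeeper's majority-quorum rule:

```python
def tolerated_failures(ensemble_size):
    # ZooKeeper needs a strict majority (quorum) of nodes alive, so an
    # ensemble of n nodes tolerates floor((n - 1) / 2) failed nodes.
    return (ensemble_size - 1) // 2

assert tolerated_failures(3) == 1    # minimum sensible production size
assert tolerated_failures(5) == 2    # LinkedIn
assert tolerated_failures(13) == 6   # Twitter
```

Note that adding an even node buys nothing: a 4-node ensemble tolerates no more failures than a 3-node one.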
76. Code
Topology config – for running on a production Storm cluster
77. Creating a spout
• We won't cover implementing a spout in this workshop.
• This is because you typically use an existing spout (Kafka spout, Redis spout, etc.). But you will definitely implement your own bolts.
78. Creating a bolt
• Storm is polyglot – but in this workshop we focus on JVM languages.
• Two main options for JVM users:
• Implement the IRichBolt or IBasicBolt interfaces
• Extend the BaseRichBolt or BaseBasicBolt abstract classes
• BaseRichBolt
• You must – and are able to – manually ack() an incoming tuple.
• Can be used to delay acking a tuple, e.g. for algorithms that need to work across multiple incoming tuples.
• BaseBasicBolt
• Auto-acks the incoming tuple at the end of its execute() method.
• These bolts are typically simple functions or filters.
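The difference between the two base classes boils down to who calls ack(). A conceptual Python sketch of that contract (illustrative names, not the Storm API):

```python
class FakeCollector:
    """Stand-in for Storm's output collector, recording acks."""
    def __init__(self):
        self.acked = []
    def ack(self, tup):
        self.acked.append(tup)

def rich_style_execute(tup, collector):
    # BaseRichBolt style: YOU must remember to ack (possibly later).
    collector.ack(tup)

def basic_style(user_execute):
    # BaseBasicBolt style: the framework acks after execute() returns.
    def execute(tup, collector):
        user_execute(tup, collector)
        collector.ack(tup)
    return execute

collector = FakeCollector()
basic_style(lambda tup, c: None)(("foo",), collector)
assert collector.acked == [("foo",)]
```

If your bolt can always ack immediately after processing, the BaseBasicBolt-style wrapper removes a whole class of forgot-to-ack bugs.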
80. Extending BaseRichBolt
• execute() is the heart of the bolt.
• This is where you will focus most of your attention when implementing your bolt or when trying to understand somebody else's bolt.
81. Extending BaseRichBolt
• prepare() acts as a "second constructor" for the bolt's class.
• Because of Storm's distributed execution model and serialization, prepare() is often needed to fully initialize the bolt on the target JVM.
82. Extending BaseRichBolt
• declareOutputFields() tells downstream bolts about this bolt's output. What you declare must match what you actually emit().
• You will use this information in downstream bolts to "extract" the data from the emitted tuples.
• If your bolt only performs side effects (e.g. talks to a DB) but does not emit an actual tuple, override this method with an empty {} method.
83. Common spout/bolt gotchas
• NotSerializableException at run-time of your topology
• Typically you will run into this because your bolt has fields (instance or class members) that are not serializable. This applies recursively to each field.
• The root cause is Storm's distributed execution model and serialization: Storm code will be shipped – first serialized and then deserialized – to a different machine/JVM, and then executed. (see docs for details)
• How to fix?
• Solution 1: Make the culprit class serializable, if possible.
• Solution 2: Register a custom Kryo serializer for the class.
• Solution 3a (Java): Make the field transient. If needed, initialize it in prepare().
• Solution 3b (Scala): Make the field a @transient lazy val. If needed, turn it into a var and initialize it in prepare().
• For example, the var/prepare() approach may be needed if you use the factory pattern to create a specific type of collaborator within a bolt. Factories come in handy to make the code testable. See AvroKafkaSinkBolt in kafka-storm-starter for such a case.
84. Common spout/bolt gotchas
• Tick tuples are configured per-component, i.e. per bolt
• Idiomatic approach to trigger periodic activities in your bolts: "Every 10s, do XYZ."
• Don't configure them per-topology, as this will throw a RuntimeException.
• Tick tuples are not 100% guaranteed to arrive in time
• They are sent to a bolt just like any other tuples, and will enter the same queues and buffers. Congestion, for example, may cause tick tuples to arrive too late.
• Across different bolts, tick tuples are not guaranteed to arrive at the same time, even if the bolts are configured to use the same tick tuple frequency.
• Currently, tick tuples for the same bolt will arrive at the same time at the bolt's various task instances. However, this property is not guaranteed for the future.
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
85. Common spout/bolt gotchas
• When using tick tuples, forgetting to handle them "in a special way"
• Trying to run your normal business logic on tick tuples – e.g. extracting a certain data field – will usually only work for normal tuples but fail for a tick tuple.
• When using tick tuples, forgetting to ack() them
• Tick tuples must be acked like any other tuple.
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
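In Storm's Java API, a tick tuple is recognized by its source component and stream id (Constants.SYSTEM_COMPONENT_ID and Constants.SYSTEM_TICK_STREAM_ID). A minimal sketch of the dispatch-and-ack pattern in Python (conceptual only; the dict-based tuples are a stand-in for Storm's Tuple objects):

```python
def is_tick(tup):
    # Java API equivalent: tuple.getSourceComponent() == "__system"
    # and tuple.getSourceStreamId() == "__tick"
    return tup["component"] == "__system" and tup["stream"] == "__tick"

acked, flushed = [], []

def execute(tup):
    if is_tick(tup):
        flushed.append("flush")  # periodic work only, NOT business logic
    else:
        pass                     # normal business logic goes here
    acked.append(tup)            # tick tuples must be acked, too!

execute({"component": "word-spout", "stream": "default", "word": "foo"})
execute({"component": "__system", "stream": "__tick"})
assert len(acked) == 2 and flushed == ["flush"]
```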
86. Common spout/bolt gotchas
• OutputCollector#emit() can only be called from the "original" thread that runs a bolt
• You can start additional threads in your bolt, but only the bolt's own thread may call emit() on the collector to write output tuples. If you try to emit tuples from any of the other threads, Storm will throw a NullPointerException.
• If you need the additional-threads pattern, use e.g. a thread-safe queue to communicate between the threads and to collect [pun intended] the output tuples across threads.
• This limitation is only relevant for output tuples, i.e. output that you want to send within the Storm framework to downstream consumer bolts.
• If you want to write data to (say) Kafka instead – think of this as a side effect of your bolt – then you don't need emit() anyway, and can thus write the side-effect output in any way you want, and from any thread.
87. Creating a topology
• When creating a topology you're essentially defining the DAG – that is, which spouts and bolts to use, and how they interconnect.
• TopologyBuilder#setSpout() and TopologyBuilder#setBolt()
• Groupings between spouts and bolts, e.g. shuffleGrouping()
88. Creating a topology
• You must specify the initial parallelism of the topology.
• Crucial for P&S, but there is no rule of thumb. We talk about tuning later.
• You must understand concepts such as workers/executors/tasks.
• Only some aspects of parallelism can be changed later, i.e. at run-time.
• You can change the #executors (threads).
• You cannot change the #tasks, which remains static during the topology's lifetime.
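Why can #executors change while #tasks cannot? Because a component's tasks are a fixed set that gets redistributed over however many executor threads exist. A sketch (the round-robin assignment is an illustration, not Storm's exact scheduler):

```python
def assign_tasks(num_tasks, num_executors):
    # Distribute a component's FIXED set of tasks round-robin
    # over its RESIZABLE set of executor threads.
    return [list(range(num_tasks))[i::num_executors]
            for i in range(num_executors)]

# 8 tasks, initially 2 executors:
before = assign_tasks(8, 2)
# After `storm rebalance` to 4 executors: still the same 8 tasks.
after = assign_tasks(8, 4)
assert sum(len(e) for e in before) == sum(len(e) for e in after) == 8
assert len(after) == 4   # more threads, same tasks
```

This is also why setting #tasks higher than the initial #executors leaves headroom to scale a running topology up later.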
89. Creating a topology
• You submit a topology either to a "local" cluster or to a real cluster.
• LocalCluster#submitTopology()
• StormSubmitter#submitTopology() and #submitTopologyWithProgressBar()
• In your code you may want to use both approaches, e.g. to facilitate local testing.
• Notes
• A StormTopology is a static, serializable Thrift data structure. It contains instructions that tell Storm how to deploy and run the topology in a cluster.
• The StormTopology object will be serialized, including all the components in the topology's DAG. See later slides on serialization.
• Only when the topology is deployed (and serialized in the process) and initialized (i.e. prepare() and other life cycle methods are called on components such as bolts) does it perform any actual message processing.
90. Running a topology
• To run a topology you must first package your code into a "fat jar".
• You must include all of your code's dependencies, but:
• Exclude the Storm dependency itself, as the Storm cluster will provide it.
• sbt: "org.apache.storm" % "storm-core" % "0.9.2-incubating" % "provided"
• Maven: <scope>provided</scope>
• Gradle with gradle-fatjar-plugin: compile '...', { ext { fatJarExclude = true } }
• Note: You may need to tweak your build script so that your local tests do include the Storm dependency. See e.g. assembly.sbt in kafka-storm-starter for an example.
• A topology is run via the storm jar command.
• This connects to Nimbus, uploads your jar, and runs the topology.
• Use any machine that can run "storm jar" and talk to Nimbus' Thrift port.
• You can pass additional JVM options via $STORM_JAR_JVM_OPTS.
$ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
91. Alright, my topology runs – now what?
• The topology will run forever, or until you kill it.
• Check the status of your topology
• Storm UI (default: 8080/tcp)
• Storm CLI, e.g. storm [list | kill | rebalance | deactivate | ...]
• Storm REST API
• FYI:
• Storm will guarantee that no data is lost, even if machines go down and messages are dropped (as long as you don't disable this feature).
• Storm will automatically restart failed tasks, and even reassign tasks to different machines if e.g. a machine dies.
• See the Storm docs for further details.
93. Reading from Kafka
• Use the official Kafka spout that ships in Storm 0.9.2
• https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
• Compatible with Kafka 0.8, available on Maven Central
• Based on wurstmeister's spout, now part of Storm: https://github.com/wurstmeister/storm-kafka-0.8-plus
• Alternatives to the official Kafka spout
• NFI: https://github.com/HolmesNL/kafka-spout
• A detailed comparison is beyond the scope of this workshop, but:
• The official Kafka spout uses Kafka's Simple Consumer API; NFI uses the High-level API.
• The official spout can read from multiple topics; NFI's can't.
• The official spout's replay-failed-tuples functionality is better than NFI's.
"org.apache.storm" % "storm-kafka" % "0.9.2-incubating"
94. Reading from Kafka
• Spout configuration via KafkaConfig
• In the following example:
• Connect to the target Kafka cluster via the ZK ensemble at zookeeper1:2181.
• We want to read from the Kafka topic "my-kafka-input-topic", which has 10 partitions.
• By default, the spout stores its own state, incl. Kafka offsets, in the Storm cluster's ZK.
• This can be changed by setting the field SpoutConfig.zkServers. See the source; no docs yet.
• Full example at KafkaStormSpec in kafka-storm-starter
95. Writing to Kafka
• Use a normal Kafka producer in your bolt; no special magic needed
• Base setup:
• Serialize the desired output data in the way you need, e.g. via Avro.
• Write to Kafka, typically in your bolt's execute() method.
• If you are not emitting any Storm tuples, i.e. if you write to Kafka only, make sure you override declareOutputFields() with an empty {} method
• Full example at AvroKafkaSinkBolt in kafka-storm-starter
97. Testing Storm topologies
• We won't have the time to cover testing in this workshop.
• Some hints:
• Unit-test your individual classes as usual, e.g. bolts
• When integration testing, use in-memory instances of Storm and ZK
• Try Storm's built-in testing API (cf. kafka-storm-starter below)
• Test-drive topologies in virtual Storm clusters via Wirbelsturm
• Starting points:
• storm-core test suite: https://github.com/apache/incubator-storm/tree/master/storm-core/test/
• storm-kafka test suite: https://github.com/apache/incubator-storm/tree/master/external/storm-kafka/src/test
• kafka-storm-starter tests related to Storm: https://github.com/miguno/kafka-storm-starter/
99. Serialization in Storm
• Required because Storm processes data across JVMs and machines.
• When/where/how serialization happens is often critical for P&S tuning.
• Storm uses Kryo for serialization and falls back on Java serialization.
Verisign Public
  • By default, Storm can serialize primitive types, strings, byte arrays, ArrayList, HashMap, HashSet, and the Clojure collection types.
  • Anything else needs a custom Kryo serializer, which must be "registered" with Storm.
  • Storm falls back on Java serialization if needed, but Java serialization is slow.
  • Tip: Disable topology.fall.back.on.java.serialization to spot missing serializers.
• Examples in kafka-storm-starter, all of which use Twitter Bijection/Chill:
  • AvroScheme[T] - enables automatic Avro decoding in the Kafka spout
  • AvroDecoderBolt[T] - decodes Avro data in a bolt
  • AvroKafkaSinkBolt[T] - encodes Avro data in a bolt
  • TweetAvroKryoDecorator - a custom Kryo serializer
  • KafkaStormSpec - shows how to register a custom Kryo serializer
• More details: Storm serialization docs
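The fallback and registration knobs above can be set cluster-wide in storm.yaml (or per topology via Config). A hedged sketch; the registered class and serializer names are placeholders, not classes from kafka-storm-starter:

```shell
# storm.yaml fragment: fail fast on missing Kryo serializers instead of
# silently falling back to slow Java serialization.
# com.example.Tweet / com.example.TweetKryoSerializer are hypothetical.
cat > /tmp/storm-serialization.yaml <<'EOF'
topology.fall.back.on.java.serialization: false
topology.kryo.register:
  - com.example.Tweet: com.example.TweetKryoSerializer
EOF
```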
101. storm-starter
• storm-starter has been part of the core Storm project since 0.9.2.
  • https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
• Since 0.9.2 it is also published to Maven Central, so you can re-use its spouts/bolts.

$ git clone https://github.com/apache/incubator-storm.git
$ cd incubator-storm/
$ mvn clean install -DskipTests=true  # build Storm locally
$ cd examples/storm-starter           # go to storm-starter

(Must have Maven 3.x and a JDK installed.)
102. storm-starter: RollingTopWords

$ mvn compile exec:java -Dstorm.topology=storm.starter.RollingTopWords

• Will run a topology that implements trending topics.
• http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
103. Behind the scenes of RollingTopWords
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
104. kafka-storm-starter
⢠Written by yours truly
⢠https://github.com/miguno/kafka-storm-starter
$ git clone https://github.com/miguno/kafka-storm-starter
$ cd kafka-storm-starter
# Now ready for mayhem!
(Must have JDK 7 installed.)
105. kafka-storm-starter: run the test suite
$ ./sbt test
• Runs unit tests plus end-to-end tests of Kafka, Storm, and Kafka-Storm integration.
106. kafka-storm-starter: run the KafkaStormDemo app
$ ./sbt run
• Starts in-memory instances of ZooKeeper, Kafka, and Storm, then runs a Storm topology that reads from Kafka.
109. Storm performance tuning
• Unfortunately, there is no silver bullet and no free lunch. Witchcraft?
• And what is "the best" performance in the first place?
  • Some users require low latency and are willing to let most of the cluster sit idle, as long as they can process a new event quickly once it happens.
  • Other users are willing to sacrifice latency to minimize the hardware footprint and save $$$. And so on.
• P&S tuning depends very much on the actual use case:
  • Hardware specs, data volume/velocity/..., etc.
• Which means in practice:
  • What works with sampled data may not work with production-scale data.
  • What works for topology A may not work for topology B.
  • What works for team A may not work for team B.
• Tip: Be careful when adopting other people's recommendations if you don't fully understand what's being tuned, why, and in which context.
110. General considerations
• Test and measure: use the Storm UI, Graphite & friends.
• Understand your topology's DAG on a macro level:
  • Where and how data flows, its volume, joins/splits, etc.
  • Trivial example: Shoveling 1 Gbps into a "singleton" bolt = WHOOPS
• Understand ... on a micro level:
  • How your data flows between machines, workers, executors, tasks.
  • Where and when serialization happens.
  • Which queues and buffers your data will hit.
  • We talk about this in detail in the next slides!
• The best performance optimization is often to stop doing something.
  • Example: If you can cut out (de-)serialization and sending tuples to another process, even over the loopback device, that is potentially a big win.
http://www.slideshare.net/ptgoetz/scaling-storm-hadoop-summit-2014
http://www.slideshare.net/JamesSirota/cisco-opensoc
111. How to approach P&S tuning
• Optimize locally before trying to optimize globally.
  • Tune individual spouts/bolts before tuning the entire topology.
  • Write simple data-generator spouts and no-op bolts to facilitate this.
• Even small things count at scale.
  • A simple string operation can slow down throughput when processing 1M tuples/s.
• Turn knobs slowly, one at a time.
  • Common advice when fiddling with a complex system.
• Add your own knobs.
  • It helps to make as many things configurable as possible.
• Error handling is critical.
  • Poorly handled errors can lead to topology failure, data loss, or data duplication.
  • Particularly important when interfacing Storm with other systems such as Kafka.
http://www.slideshare.net/ptgoetz/scaling-storm-hadoop-summit-2014
http://www.slideshare.net/JamesSirota/cisco-opensoc
112. Some rules of thumb, for guidance
• CPU-bound topology?
  • Try to spread and parallelize the load across cores (think: workers).
  • Local cores: may incur serialization/deserialization costs; see later slides.
  • Remote cores: will incur serialization/deserialization costs, plus network I/O and additional Storm coordination work.
• Network-I/O-bound topology?
  • Collocate your cores, e.g. try to perform more logical operations per bolt.
  • Breaks the single responsibility principle (SRP) in favor of performance.
• But what if the topology is both CPU-bound and I/O-bound, and ...?
  • It becomes very tricky when parts of your topology are CPU-bound, other parts are I/O-bound, and other parts are constrained by memory (which has its own limitations).
  • Grab a lot of coffee, and good luck!
113. Internal message buffers of Storm (as of 0.9.1)
Update August 2014: This setup may have changed due to recent P&S work in STORM-297.
http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
114. Communication within a Storm cluster
• Intra-worker communication: LMAX Disruptor <<< awesome library!
  • Between local threads within the same worker process (JVM), e.g. between local tasks/executors of the same topology.
  • Flow: emit() -> executor A's send buffer -> executor B's receive buffer.
  • Does not hit the parent worker's transfer buffer. Does not incur serialization because it stays within the same JVM.
• Inter-worker communication: Netty in Storm 0.9+, ZeroMQ in 0.8
  • Different JVMs/workers on the same machine:
    emit() -> executor send buffer -> worker A's transfer queue -> local socket -> worker B's receive queue -> executor receive buffer
  • Different machines:
    Same as above, but uses a network socket and thus also hits the NIC. Incurs additional latency because of the network.
  • Inter-worker communication incurs serialization overhead (tuples pass JVM boundaries); cf. Storm serialization with Kryo.
• Inter-topology communication:
  • Nothing built into Storm - up to you! Common choices are a messaging system such as Kafka or Redis, an RDBMS or NoSQL database, etc.
  • Inter-topology communication incurs serialization overhead; details depend on your setup.
115. Tuning internal message buffers
• Start with the following settings if you think the defaults aren't adequate.
• Helpful references:
  • Storm default configuration (defaults.yaml)
  • Tuning and Productionization of Storm, by Nathan Marz
  • Notes on Storm+Trident Tuning, by Philip Kromer
  • Understanding the Internal Message Buffers of Storm, by /me

  Config                                  Default   Tuning guess   Notes
  topology.receiver.buffer.size                 8              8
  topology.transfer.buffer.size             1,024             32   Batches of messages
  topology.executor.receive.buffer.size     1,024         16,384   Batches of messages
  topology.executor.send.buffer.size        1,024         16,384   Individual messages
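As a storm.yaml fragment, the "tuning guess" column above would look like this (a starting point to measure against, not a recommendation):

```shell
# storm.yaml fragment mirroring the slide's tuning-guess column.
cat > /tmp/storm-buffers.yaml <<'EOF'
topology.receiver.buffer.size: 8
topology.transfer.buffer.size: 32
topology.executor.receive.buffer.size: 16384
topology.executor.send.buffer.size: 16384
EOF
```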
116. JVM garbage collection and RAM
• Garbage collection woes:
  • If you are GC'ing too much and failing a lot of tuples (which may be partly due to the GCs), you may be out of memory.
  • Try increasing the JVM heap size (-Xmx) allocated to each worker.
  • Try the G1 garbage collector, available in JDK 7u4 and later.
• But: a larger JVM heap size is not always better.
  • When the JVM eventually garbage-collects, the GC pause may take much longer for larger heaps.
  • Example: a GC pause also temporarily stops the threads in Storm that perform heartbeating. So GC pauses can make Storm think that workers have died, which triggers "recovery" actions etc. This can cause cascading effects.
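Both knobs are typically set via the worker JVM options in storm.yaml; a sketch, where the 2 GB heap is an arbitrary example, not a recommendation:

```shell
# storm.yaml fragment: larger heap per worker JVM, plus the G1 collector.
cat > /tmp/storm-gc.yaml <<'EOF'
worker.childopts: "-Xmx2g -XX:+UseG1GC"
EOF
```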
117. Rate-limiting topologies
• topology.max.spout.pending
  • Max number of tuples that can be pending on a single spout task at once. "Pending" means the tuple has not yet been acked or failed.
  • Typically, increasing max spout pending will increase the throughput of your topology, but in some cases decreasing the value may be required to increase throughput.
• Caveats:
  • This setting has no effect for unreliable spouts, which don't tag their tuples with a message id.
  • For Trident, maxSpoutPending refers to the number of pipelined batches of tuples.
  • It is recommended not to set this parameter very high for Trident topologies (start testing with ~10).
• Primarily used a) to throttle your spouts and b) to make sure your spouts don't emit more than your topology can handle.
  • If the complete latency of your topology is increasing, then your tuples are getting backed up (bottlenecked) somewhere downstream in the topology.
  • If some tasks run into "OOM: GC overhead limit exceeded" exceptions, then typically your upstream spouts/bolts are outpacing your downstream bolts.
  • Apart from throttling your spouts with this setting, you can of course also try to increase the topology's parallelism (maybe you actually need to combine the two).
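For a plain (non-Trident) topology, the setting can go into storm.yaml or be passed at submit time; a sketch, with 1000 as an arbitrary starting value:

```shell
# storm.yaml fragment (could equally be passed per topology, e.g. via
# `storm jar ... -c topology.max.spout.pending=1000`); 1000 is arbitrary.
cat > /tmp/storm-spout-pending.yaml <<'EOF'
topology.max.spout.pending: 1000
EOF
```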
118. Acking strategies
• topology.acker.executors
  • Determines the number of executor threads (or tasks?) that will track tuple trees and detect when a tuple has been fully processed.
  • Disabling acking trades reliability for performance.
• If you want to enable acking and thus guaranteed message processing:
  • Rule of thumb: 1 acker per worker (which is also the default in Storm 0.9).
• If you want to disable acking and thus guaranteed message processing:
  • Set the value to 0. Storm will then immediately ack tuples as soon as they come off the spout, effectively disabling acking and thus reliability.
• Note that there are two additional ways to fine-tune acking behavior, and notably to disable it:
  1. Turn off acking for an individual spout by omitting the message id in the SpoutOutputCollector.emit() method.
  2. If you don't care whether a particular subset of tuples fails to be processed downstream in the topology, you can emit them as unanchored tuples. Since they're not anchored to any spout tuples, they won't cause any spout tuples to fail if they aren't acked.
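The disable-acking case above can be written as a one-line storm.yaml fragment:

```shell
# storm.yaml fragment: disable acking entirely (trades reliability for
# throughput). For the 1-acker-per-worker rule of thumb, set this to the
# topology's worker count instead.
cat > /tmp/storm-ackers.yaml <<'EOF'
topology.acker.executors: 0
EOF
```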
119. Miscellaneous
• A worker process is never shared across topologies.
  • If Storm is configured to run only a single worker on a machine, you can't run multiple topologies on that machine.
  • Spare worker capacity can't be used by other topologies. All of a worker's child executors and tasks only ever run code for a single topology.
• All executors/tasks of a worker run in the same JVM.
  • In some cases - e.g. a localOrShuffleGrouping() - this improves performance.
  • In other cases it can cause issues:
    • If a task crashes the JVM/worker or causes the JVM to run out of memory, all other tasks/executors of that worker die, too.
    • Some applications may malfunction if multiple instances co-exist in the same JVM, e.g. when relying on static variables.
120. Miscellaneous
• Consider using Trident to increase throughput.
  • Trident inherently operates on batches of tuples.
  • The drawback is typically higher latency.
  • Trident is not covered in this workshop.
• Experiment with batching messages/tuples manually.
  • Keep in mind that here a failed tuple actually corresponds to multiple data records.
  • For instance, if a batch "tuple" fails and gets replayed, all the batched data records will be replayed, which may lead to data duplication.
  • If you don't like the idea of manual batching, try Trident!
121. When using Storm with Kafka
• Storm's parallelism is controlled by Kafka's "parallelism":
  • Set the Kafka spout's parallelism to the number of partitions of the source topic.
• Other key parameters that determine performance:
  • KafkaConfig.fetchSizeBytes (default: 1 MB)
  • KafkaConfig.bufferSizeBytes (default: 1 MB)
122. TL;DR: Start with this, then measure/improve/repeat
• 1 worker / machine / topology
  • Minimize unnecessary network transfer.
• 1 acker / worker
  • This is also the default in Storm 0.9.
• CPU-bound use cases:
  • 1 executor thread per CPU core, to optimize thread and CPU usage.
• I/O-bound use cases:
  • 10-100 executor threads per CPU core.
http://www.slideshare.net/ptgoetz/scaling-storm-hadoop-summit-2014
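The baseline above can be written down as a storm.yaml sketch. Assuming a hypothetical 4-machine cluster (the numbers are placeholders to measure against, not recommendations):

```shell
# Baseline sketch: 1 worker per machine per topology, 1 acker per worker.
cat > /tmp/storm-baseline.yaml <<'EOF'
topology.workers: 4
topology.acker.executors: 4
EOF
# Per-component executor counts are then set in topology code, e.g.
#   builder.setBolt("my-bolt", new MyBolt(), 4)  // ~1 thread/core if CPU-bound
```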
123. Part 5: Playing with Storm using Wirbelsturm
1-click Storm deployments
124. Deploying Storm via Wirbelsturm
⢠Written by yours truly
⢠https://github.com/miguno/wirbelsturm
$ git clone https://github.com/miguno/wirbelsturm.git
$ cd wirbelsturm
$ ./bootstrap
$ vagrant up zookeeper1 nimbus1 supervisor1 supervisor2
(Must have Vagrant 1.6.1+ and VirtualBox 4.3+ installed.)
125. Deploying Storm via Wirbelsturm
• By default, the Storm UI runs on nimbus1 at:
  • http://localhost:28080/
• You can also build and run a topology, but that's beyond the scope of this workshop.
  • Use e.g. an Ansible playbook to submit topologies to make this task simple, easy, and fun.
126. What can I do with Wirbelsturm?
• Get a first impression of Storm
• Test-drive your topologies
• Test failure handling
  • Stop/kill Nimbus, check what happens to the Supervisors.
  • Stop/kill ZooKeeper instances, check what happens to the topology.
• Use as a sandbox environment to test/validate deployments
  • "What will actually happen when I deactivate this topology?"
  • "Will my Hiera changes actually work?"
• Reproduce production issues, share results with Dev
  • Also helpful when reporting back to the Storm project and mailing lists.
• Any further cool ideas?
128. Where to go from here
• A few Storm books are already available.
• Storm documentation
  • http://storm.incubator.apache.org/documentation/Home.html
• storm-kafka
  • https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
• Mailing lists
  • http://storm.incubator.apache.org/community.html
• Code examples
  • https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
  • https://github.com/miguno/kafka-storm-starter/
• Related work, i.e. tools similar to Storm - try them, too!
  • Spark Streaming
  • See the comparison "Apache Storm vs. Apache Spark Streaming" by P. Taylor Goetz (Storm committer).
129. © 2014 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.