Storm - As deep into real-time data processing as you can get in 30 minutes. | Dan Lynn
This document provides an overview of Storm, an open source distributed real-time computation system. It describes key Storm concepts like tuples, streams, spouts, bolts and topologies. It provides an example of a streaming word count topology and discusses tips for using Storm like using log aggregation, rolling deployments, tuning parallelism and anchoring tuples. It also briefly introduces the Trident abstraction for Storm.
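The streaming word count mentioned above can be pictured without a cluster. The following is a minimal sketch in plain Python, not Storm's API: generators stand in for the spout and the bolts, and the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout stand-in: the source of the stream, emitting one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt stand-in: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt stand-in: keeps a running count and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split -> count and return the last count seen per word."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        final[word] = count
    return final


if __name__ == "__main__":
    print(run_topology(["the cow jumped over the moon", "the moon is full"]))
```

In a real topology each stage would run as many parallel tasks, and the stream between the split and count stages would be grouped by word so that every occurrence of a word reaches the same counter.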
Storm is a fast, scalable, fault-tolerant, and easy to operate distributed realtime computation system. It guarantees that messages will be processed and allows processing big data streams reliably in real time. Storm was originally developed by Nathan Marz at BackType (acquired by Twitter) and is written in Java and Clojure. It uses a simple programming model and can scale to large clusters, making it suitable for processing millions of events per second.
The document compares the performance of different data serialization formats (JSON, Apache Avro, Protocol Buffers) for real-time applications. It describes building a pipeline to ingest, process, and cache serialized data. Benchmark results show JSON has the highest throughput but also the highest latency, while Protocol Buffers has the lowest throughput but lowest latency. The document recommends JSON for latency-critical, small data and Protocol Buffers for data-heavy, real-time applications relying on Google services. It also provides information about monitoring throughput patterns and the presenter's background and skills.
Invited Netflix talk: JVM issues in the age of scale! We take an under-the-hood look at Java locking, the memory model, overheads, serialization, UUID, GC tuning, CMS, ParallelGC, and Java.
streamparse and pystorm: simple reliable parallel processing with storm | Daniel Blanchard
Storm is a distributed real-time computation system that dramatically simplifies processing streaming data. streamparse allows Python code to integrate with Storm by providing a Pythonic API. It handles running, debugging, and deploying Storm topologies to clusters through commands like "sparse run" and "sparse submit".
Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of “big data” platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis to commodity machines, but these analyses can still take hours to run and do not respond well to rapidly-changing data sets.
A new generation of data processing platforms -- which we call “stream architectures” -- have converted data sources into streams of data that can be processed and analyzed in real-time. This has led to the development of various distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams.
In this talk, we will give an overview of these technologies and how they fit into the Python ecosystem. As part of this presentation, we also released streamparse, a new Python library that makes it easy to debug and run large Storm clusters.
Links:
* http://parse.ly/code
* https://github.com/Parsely/streamparse
* https://github.com/getsamsa/samsa
TokyoCabinet and TokyoTyrant are open source databases written by Mikio Hirabayashi and presented in this deck by Akira Koyasu. TokyoCabinet provides embedded databases using hash, B+ tree, fixed-length, and table formats. TokyoTyrant builds on TokyoCabinet and provides a client-server database with a server and client command line interface. Both have Java APIs and support features like putting, getting, listing, and removing key-value pairs.
Concurrency and parallelism in Python are always hot topics. This talk will look at the variety of forms of concurrency and parallelism. In particular, it will give an overview of various forms of message-passing concurrency which have become popular in languages like Scala and Go. A Python library called python-csp, which implements similar ideas in a Pythonic way, will be introduced, and we will look at how this style of programming can be used to avoid deadlocks, race hazards and "callback hell".
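To make the message-passing style concrete, here is a minimal sketch that uses only the standard library. It does not use python-csp's API: `queue.Queue` objects stand in for channels and threads stand in for processes, and the names `producer`, `squarer`, and `run_pipeline` are hypothetical. The point is that the stages share no mutable state and communicate only by sending values.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def producer(out_ch, items):
    """Send every item down the channel, then signal completion."""
    for item in items:
        out_ch.put(item)
    out_ch.put(_DONE)


def squarer(in_ch, out_ch):
    """Receive values, square them, and forward the results."""
    while True:
        item = in_ch.get()
        if item is _DONE:
            out_ch.put(_DONE)
            return
        out_ch.put(item * item)


def run_pipeline(items):
    """Connect producer -> squarer with bounded channels and collect output."""
    first = queue.Queue(maxsize=1)   # small buffers keep the stages in step
    second = queue.Queue(maxsize=1)
    threads = [
        threading.Thread(target=producer, args=(first, items)),
        threading.Thread(target=squarer, args=(first, second)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while True:
        item = second.get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Because each stage only reads from one channel and writes to another, there are no locks to order and no callbacks to nest, which is the property the talk attributes to this style.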
Storm is an open source distributed real-time computation system. It provides guarantees of processing data reliably in real-time. Storm allows for building real-time streaming data pipelines that process unbounded streams of data reliably. Key features include being distributed, fault-tolerant, guaranteeing message processing, and providing a high level abstraction over message passing.
The document discusses the Collected Works of Mahatma Gandhi (CWMG) project. It outlines issues with scanning and digitizing original print books, including problems like fuzzy text, dirt, light and dark patches, dust, and missing or out of order pages. It then describes the standardization process used to clean, format and check the digital files. This includes assigning styles, filtering for errors, problem solving using software, and final checking. Estimates are provided for costs to develop an e-book and web portal, including one-time costs for hardware, software, and furniture, as well as recurring hosting, technical support and salary costs. The target dates for completion of the e-book and web portal are
Garbage collection, which Java relies on, has evolved significantly since its inception in 1959. The modern Hotspot JVM uses generational garbage collection with a young and old generation. It employs concurrent and parallel techniques like CMS to minimize pauses. OutOfMemoryErrors require increasing heap sizes or fixing leaks. Finalizers are generally avoided due to performance impacts. GC tuning must be tested under realistic loads rather than one-size-fits-all settings. Analysis tools help correlate GC logs with application behavior.
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
The "n" in the PrintCompilation output indicates that the method is a native method, for which the JVM generates a wrapper rather than compiling bytecode. So in this case, java.lang.Object::hashCode is marked "n" because it is implemented natively.
Kyoto Products includes Kyoto Cabinet and Kyoto Tycoon. Kyoto Cabinet is a lightweight database library that provides a straightforward implementation of DBM with high performance and scalability. Kyoto Tycoon is a lightweight database server that provides a persistent cache based on Kyoto Cabinet with features like expiration, high concurrency, and replication. Both support various database types and languages.
DEFCON 23 - Mike Sconzo - i am packer and so can you | Felipe Prado
The document discusses packers and signature-based detection of packers in Portable Executable (PE) files. It proposes moving beyond the current standard PEiD signatures by taking a clustering approach to detect packers based on assembly mnemonics, linker versions, and number of sections in files. Samples from malware families like APT1 and ZeuS are analyzed to identify clusters corresponding to different packers and compile environments. Random files are also clustered to show most do not match. The approach aims to generate new signatures that are easier for non-experts to understand and extend compared to existing solutions.
Learning Stream Processing with Apache Storm | Eugene Dvorkin
Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets teams tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be used are Fraud Detection, Operational Intelligence, Machine Learning, ETL, Analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer, will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Apache Storm 0.9 basic training - Verisign | Michael Noll
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Storm is a distributed real-time computation system originally created at BackType (later acquired by Twitter) and used at Twitter to analyze tweets. It provides distributed, reliable, and fault-tolerant processing of streaming data. Storm topologies are composed of spouts that emit streams of tuples and bolts that process those tuples, with streams connecting the components. Storm can be used for applications like continuous computation, log analysis, and distributed data processing.
This document discusses Reactive Programming and Reactive Streams. It introduces Reactor, a reactive programming framework, and how it addresses issues like latency in microservices architectures. Reactive Streams provide an interoperable way to work with asynchronous data streams in a non-blocking manner. Streams represent sequences of data that can be processed reactively through operators like map and filter.
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale | Sriram Krishnan
The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters that are several 1000s of nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of Scalding, which is a Scala DSL for batch jobs using MapReduce, Summingbird, which is a framework for combined real-time and batch processing, and Tsar, which is a framework for real-time time-series aggregations.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi | Databricks
The document discusses using CNTK (Microsoft Cognitive Toolkit) for natural language processing and deep learning within Spark pipelines. It provides information on mmlspark, which allows embedding CNTK models into Spark. It also discusses using CNTK to analyze data from GitHub commits and relate code changes to natural language comments through sequence-to-sequence models.
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Integrate Solr with real-time stream processing applications | thelabdude
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, a pattern commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
Hadoop Summit Europe 2014: Apache Storm Architecture | P. Taylor Goetz
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies which represent the flow of data. Storm provides fault tolerance through message acknowledgments, which give at-least-once processing guarantees. Trident is a high-level abstraction built on Storm that adds exactly-once semantics and supports operations like aggregations, joins, and state management through its micro-batch oriented and stream-based API.
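The message acknowledgments mentioned above can be tracked in constant memory per spout tuple. Storm's documentation describes an acker that XORs tuple ids together and treats a result of zero as "tuple tree fully processed". The sketch below illustrates that idea only; the class `AckTracker` and its methods are hypothetical names, not Storm's API, and real ids would be random 64-bit values (here produced by the assumed helper `new_id`).

```python
import random


def new_id():
    """Hypothetical helper: a random 64-bit tuple id."""
    return random.getrandbits(64)


class AckTracker:
    """Tracks one XOR value per spout tuple instead of the whole tuple tree."""

    def __init__(self):
        self._ack_val = {}  # root id -> XOR of all ids still outstanding

    def track(self, root_id):
        """Called when the spout emits a tuple: the root itself is pending."""
        self._ack_val[root_id] = root_id

    def ack(self, root_id, tuple_id, child_ids=()):
        """Ack `tuple_id` and register any children anchored to it.

        The acked id and the new child ids are folded into one update, so the
        value cannot reach zero while children are still outstanding.
        Returns True when the whole tree rooted at `root_id` is complete.
        """
        update = tuple_id
        for child in child_ids:
            update ^= child
        self._ack_val[root_id] ^= update
        if self._ack_val[root_id] == 0:
            del self._ack_val[root_id]
            return True
        return False

    def is_pending(self, root_id):
        return root_id in self._ack_val


if __name__ == "__main__":
    tracker = AckTracker()
    root, left, right = new_id(), new_id(), new_id()
    tracker.track(root)
    print(tracker.ack(root, root, child_ids=[left, right]))
    print(tracker.ack(root, left))
    print(tracker.ack(root, right))
```

Each id is XORed in once when it is created and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked. A tree that does not complete within a timeout would be replayed from the spout, which is why this mechanism alone yields at-least-once rather than exactly-once processing.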
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics | npinto
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
Scaling Big Data Mining Infrastructure Twitter Experience | DataWorks Summit
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. We’ll share our experiences as a case study, but make recommendations for best practices and point out opportunities for future work.
PHP Backends for Real-Time User Interaction using Apache Storm. | DECK36
Engaging users in real-time is the topic of our times. Whether it’s a game, a shop, or a content-network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation on how to use it with PHP. That way, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (Tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
This document describes designing a real-time heat map service using Apache Storm. It involves collecting check-in data from various locations, geocoding the addresses, building heat maps for time intervals, and persisting the results. The key components are a check-ins spout to generate sample data, geocode lookup bolt to geocode addresses, heat map builder bolt to accumulate locations into intervals and emit maps, and persistor bolt to store results. Stream groupings and parallelism across workers allow the topology to horizontally scale for high throughput processing of location data.
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...Xavier Llorà
Data-intensive computing has positioned itself as a valuable programming paradigm to efficiently approach problems requiring processing very large volumes of data. This paper presents a pilot study about how to apply the data-intensive computing paradigm to evolutionary computation algorithms. Two representative cases (selectorecombinative genetic algorithms and estimation of distribution algorithms) are presented, analyzed, and discussed. This study shows that equivalent data-intensive computing evolutionary computation algorithms can be easily developed, providing robust and scalable algorithms for the multicore-computing era. Experimental results show how such algorithms scale with the number of available cores without further modification.
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
This document discusses building a real-time analytics system like Google Analytics using Storm, Kafka, and GigaSpaces. It describes the key components needed: a spout to read page view data from Kafka, Trident bolts to calculate metrics like top URLs, active users, and geographic information, and a time series bolt to track page views over time. The architecture allows for highly scalable, low-latency analysis of streaming page view data in real-time.
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
Twitter's operations team manages software performance, availability, capacity planning, and configuration management for Twitter. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, and optimizing databases to reduce replication delay and locks. The team also created several open source projects like CacheMoney for caching and Kestrel for asynchronous messaging.
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
Fixing Twitter and Finding your own Fail Whale document discusses Twitter operations. The operations team manages software performance, availability, capacity planning, and configuration management using metrics, logs, and data-driven analysis to find weak points and take corrective action. They use managed services for infrastructure to focus on computer science problems. The document outlines Twitter's rapid growth and challenges in maintaining performance as traffic increases. It provides recommendations around caching, databases, asynchronous processing, and other techniques Twitter uses to optimize performance under heavy load.
3. “Next click” problem
Raymie Stata (CTO, Yahoo):
“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”
4. “Next click” problem
[Diagram: timeline of two consecutive HTTP request/response pairs (a click and the next click) through the web server, each with a max latency of 80 ms. Labels: web server; realtime response; near realtime response; real time layer (collect data, process data); time axis.]
5. Example problems
• Realtime statistics - counting, trends, moving average
• Read Twitter stream and output images that are trending in the last 10 minutes
• CTR calculation - read ad clicks/ad impressions and calculate new click through rate
• ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist
• Search advertising
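The CTR example above can be sketched in a few lines of plain Python. This is only an illustration of the windowed calculation, not Storm code: the class name `RollingCtr`, the event kinds and the window length are assumptions made for the example, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class RollingCtr:
    """Click-through rate over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, kind) in arrival order
        self.counts = {"click": 0, "impression": 0}

    def add(self, timestamp, kind):
        """Record one event; kind is 'click' or 'impression'."""
        self.events.append((timestamp, kind))
        self.counts[kind] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_kind = self.events.popleft()
            self.counts[old_kind] -= 1

    def ctr(self):
        """Clicks divided by impressions inside the current window."""
        impressions = self.counts["impression"]
        return self.counts["click"] / impressions if impressions else 0.0
```

Each call to `add` updates the rate incrementally, so the current value is available right after every event instead of after a batch run.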
6. Pick your framework...
• S4 - Yahoo, “real time map reduce”, actor model
• Storm - Twitter
• MapReduce Online - Yahoo
• Cloud Map Reduce - Accenture
• HStreaming - Startup, based on Hadoop
• Brisk - DataStax, Cassandra
7. System requirements
• Fault tolerance - system keeps running when a node fails
• Horizontal scalability - should be easy, just add a node
• Low latency
• Reliable - does not lose data
• High availability - well, if it’s down for an hour it’s not realtime
8. Storm in a nutshell
• Written by BackType (acquired by Twitter)
• Open Source, Github
• Runs on JVM
• Clojure, Python, ZooKeeper, ZeroMQ
• Currently used by Twitter for real time statistics
9. Programming model
• Tuple - name/value list
• Stream - unbounded sequence of Tuples
• Spout - source of Streams
• Bolt - consumer / producer of Streams
• Topology - network of Streams, Spouts and Bolts
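The five terms above can be modelled with Python generators. This is a conceptual sketch only and does not use Storm's API: the function names are made up for illustration, a tuple is represented as a dict of name/value pairs, and everything runs in one process.

```python
from itertools import islice


def number_spout():
    """Spout: the source of an unbounded stream of tuples."""
    n = 0
    while True:
        yield {"value": n}  # a tuple as a name/value mapping
        n += 1


def square_bolt(stream):
    """Bolt: consumes one stream and produces another."""
    for tup in stream:
        yield {"value": tup["value"], "square": tup["value"] ** 2}


def even_filter_bolt(stream):
    """Bolt: may emit nothing for some input tuples."""
    for tup in stream:
        if tup["square"] % 2 == 0:
            yield tup


def build_topology():
    """Topology: spouts and bolts wired together by streams."""
    return even_filter_bolt(square_bolt(number_spout()))


# The stream never ends, so only a slice of it is taken here.
first_three = list(islice(build_topology(), 3))
```

In a real cluster each spout and bolt would run as several parallel tasks on different machines; the generator chain only shows how the pieces connect.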
14. Task
Parallel processor inside Spouts and Bolts.
Each Spout / Bolt has a fixed number of Tasks.
[Diagram: a Spout and a Bolt, each drawn as a box containing several Tasks.]
15. Stream grouping
Which Task does a Tuple go to?
• shuffle grouping - distribute randomly
• fields grouping - partition by field value
• all grouping - send to all Tasks
• custom grouping - implement your own logic
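One way to picture the groupings is as functions that map a tuple to the indices of the receiving Tasks. The sketch below is an assumption-laden illustration, not Storm's implementation: the function names, the dict-based tuple and the CRC32 hash are choices made for the example.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick one task at random, spreading load evenly on average."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field):
    """The same field value always maps to the same task."""
    key = str(tup[field]).encode("utf-8")
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def custom_grouping(tup, num_tasks):
    """Custom logic; routing by word length is an arbitrary example."""
    return [len(tup["word"]) % num_tasks]
```

Fields grouping is the one that matters for stateful bolts: a counter keyed by word only works if every occurrence of that word reaches the same Task.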
16. Word count example
[Diagram: Spout → Sentence Splitter Bolt → Word Count Bolt]
Spout emits: (“a b c a b d”)
Sentence Splitter Bolt emits: (“a”), (“b”), (“c”), (“a”), (“b”), (“d”)
Word Count Bolt emits: (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1)
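The word count flow can be reproduced locally with three small generators. This is a single-process sketch with hypothetical function names, not a Storm topology; it only mirrors the data flow of the slide.

```python
from collections import Counter


def sentence_spout(sentences):
    """Emit one single-field tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def splitter_bolt(stream):
    """Split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def word_count_bolt(stream):
    """Keep a running count and emit (word, count) after every word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run(sentences):
    """Wire the three stages together and keep the latest count per word."""
    latest = {}
    stream = word_count_bolt(splitter_bolt(sentence_spout(sentences)))
    for word, count in stream:
        latest[word] = count
    return latest


result = run(["a b c a b d"])
```

With several counting Tasks, the stream between the splitter and the counter would need a fields grouping on the word so that each word is always counted by the same Task.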
17. Guaranteed processing
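[Diagram: tuple tree rooted at the Spout tuple (“a b c a b d”), branching into the word tuples (“a”), (“b”), (“c”), (“a”), (“b”), (“d”) and on to the counts (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1).]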
Topology has a timeout for processing of the tuple tree
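A simplified model of tracking a tuple tree with a timeout is sketched below. It borrows the XOR idea used by Storm's acker (every tuple id is XOR-ed in once when the tuple is created and once when it is acked, so the value returns to zero when the tree is complete), but the class, its method names and the small integer ids are assumptions for illustration and not Storm's code. Real ids would be random 64-bit numbers.

```python
class TupleTreeTracker:
    """Tracks whether every tuple descending from a spout tuple was acked."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}  # root_id -> [xor_of_ids, start_time]

    def emit_root(self, root_id, tuple_id, now):
        """Register a new spout tuple as the root of a tree."""
        self.pending[root_id] = [tuple_id, now]

    def ack(self, root_id, acked_id, child_ids=()):
        """Ack one tuple and register the children anchored to it."""
        entry = self.pending.get(root_id)
        if entry is None:
            return False  # already completed or timed out
        entry[0] ^= acked_id
        for child in child_ids:
            entry[0] ^= child
        if entry[0] == 0:
            del self.pending[root_id]
            return True  # the whole tree has been processed
        return False

    def expire(self, now):
        """Return root ids whose tree did not complete in time."""
        failed = [root for root, (_, start) in self.pending.items()
                  if now - start > self.timeout_seconds]
        for root in failed:
            del self.pending[root]
        return failed
```

Roots returned by `expire` are the ones a spout would replay, which is what makes processing guaranteed rather than best effort.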
19. Reliability
• Nimbus / Supervisor are SPOF
• both are stateless, easy to restart without data loss
• Failure of master node (?)
• Running Topologies should not be affected!
• Failed Workers are restarted
• Guaranteed message processing
20. Administration
• Nimbus / Supervisor / ZooKeeper need monitoring and supervision (e.g. Monit)
• Cluster nodes can be added at runtime
• But: existing Topologies are not rebalanced (there is a ticket)
• Administration web GUI
21. Community
• Source is on GitHub - https://github.com/nathanmarz/storm.git
• Wiki - https://github.com/nathanmarz/storm/wiki
• Nice documentation
• Google Group
• People are starting to build add-ons: JRuby integration, adapters for JMS, AMQP
22. Storm summary
• Nice programming model
• Easy to deploy new topologies
• Horizontal scalability
• Low latency
• Fault tolerance
• Easy to set up on EC2