Workshop on "Big Data, Real-time Processing and Storm" at The Fifth Elephant 2013, Bangalore, India, on 11th July 2013.
http://fifthelephant.in/2013/workshops
Session proposal: https://funnel.hasgeek.com/fifthel2013/652-big-data-real-time-processing-and-storm
Code for the workshop can be found at: https://github.com/P7h
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan (Hazelcast)
In this webinar you’ll learn the importance of advanced data locality and data IPC transports with respect to Java distributed cache data grids. This information is crucial to the HPC Linux supercomputing community. The presenter will show how, by using native /dev/shm as an IPC transport, we can achieve latencies 1,000 times lower than TCP/IP.
We’ll cover the following topics:
- Why going off-heap is fundamental to meeting real-time SLAs
- Why traditional grid transports (TCP/IP) are not good enough for the HPC Linux supercomputing community
- Why we need something more than TCP/IP
- Live Q&A Session
Presenter:
Ben Cotton, Consultant at JPMorgan
Ben has been an IT consultant to the financial services industry for nearly 20 years. He specializes in open source, transactions, caching, data grids, and fixed-income & derivatives trading systems.
Ben is active in the following communities:
Java Community Process Member
JSR-156 expert group: Java XML Transactions API
JSR-107 expert group: Java Caching API
JSR-347 expert group: Java Data Grids API
Red Hat Community Member (Fedora, Infinispan, JBoss XTS), Code Contributor
Invited Netflix talk, "JVM Issues in the Age of Scale!": an under-the-hood look at Java locking, the memory model, overheads, serialization, UUIDs, GC tuning, CMS, and ParallelGC.
Meteor Deep Dive – Reactive Programming With Tracker
As a Meteor developer, I believe everyone who has worked with Meteor has had some experience with Tracker, or has at least used it through interfaces like getMeteorData or createContainer, which come with the react-meteor-data package.
However, not everyone knows how Tracker works its magic. In this post, I’m going to dig into every facet of how it works, and bring you a little background knowledge along the way.
This is the final report for my project as a Technical Student at CERN.
The Intel Xeon Phi platform is a powerful x86 multi-core engine with a very high-speed memory interface. In its next version it will be able to operate as a stand-alone system with a very high-speed interconnect. This makes it a very interesting candidate for (near) real-time applications such as event-building, event-sorting and event preparation for subsequent processing by high-level trigger software algorithms.
Presentation given on Monday 10 September at the ROOT Users' Workshop 2018 in Sarajevo. Progress update on the Automated Parallel Computation of Collaborative Statistical Models project, a collaboration between the Netherlands eScience Center and Nikhef.
We present an update on our recent efforts to further parallelize RooFit. We have performed extensive benchmarks and identified at least three bottlenecks that will benefit from parallelization. To tackle these and possible future bottlenecks, we designed a parallelization layer that allows us to parallelize existing classes with minimal effort, but with high performance and retaining as much of the existing class's interface as possible. The high-level parallelization model is a task-stealing approach. The implementation is currently based on the bi-directional memory mapped pipe (BidirMMapPipe), but could in the future be replaced by other modes of communication between processes.
Data analytics in the cloud with Jupyter notebooks (Graham Dumpleton)
Jupyter Notebooks provide an interactive computational environment in which you can combine Python code, rich text, mathematics, plots and rich media. They provide a convenient way for data analysts to explore, capture and share their research.
Numerous options exist for working with Jupyter Notebooks, including running a Jupyter Notebook instance locally or by using a Jupyter Notebook hosting service.
This talk will provide a quick tour of some of the more well known options available for running Jupyter Notebooks. It will then look at custom options for hosting Jupyter Notebooks yourself using public or private cloud infrastructure.
An in-depth look at how you can run Jupyter Notebooks in OpenShift will be presented. This will cover how you can directly deploy a Jupyter Notebook server image, as well as how you can use Source-to-Image (S2I) to create a custom application for your requirements by combining an existing Jupyter Notebook server image with your own notebooks, additional code and research data.
Specific use cases around Jupyter Notebooks which will be explored include individual use, team use within an organisation, and classroom environments for teaching. Other issues which will be covered include importing notebooks and data into an environment, and storing data using persistent volumes and other forms of centralised storage.
As an example of the possibilities of using Jupyter Notebooks with a cloud, it will be shown how you can easily use OpenShift to set up a distributed parallel computing cluster using ‘ipyparallel’ and use it in conjunction with a Jupyter Notebook.
As hackers we love to understand how stuff works and how to optimize it. A very good tool to do both is software tracing. During the talk we'll see how tracing tools work and we'll zoom on one particular project called pyflame.
Talk on sneaky computation to be given at Reykjavik University. Sneaky computation is the use of spare CPU cycles with little or no intervention from the user.
Introduction to Streaming Distributed Processing with Storm (Brandon O'Brien)
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introduces streaming data concepts, Storm cluster architecture, and Storm topology architecture, and demonstrates a working example of a WordCount topology, for the SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
C*ollege Credit: CEP Distributed Processing on Cassandra with Storm (DataStax)
Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate the use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistence mechanisms (e.g. Elasticsearch) or to calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.
DTrace and SystemTap are dynamic tracing frameworks available for Solaris and Linux respectively. This session will give an overview of the static DTrace probes available in both Drizzle and MySQL and show numerous examples of scripts that utilize these probes. Mixing dynamic and static probes will also be discussed.
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Cluster for Containers" (DynamicInfraDays)
Slides from Barak Michener's talk "CoreOS: Building the Layers of the Scalable Cluster for Containers" at ContainerDays Boston 2015: http://dynamicinfradays.org/events/2015-boston/programme.html#layers
In the microservices world, the Lambda architecture has become well established. Many companies build both streaming and batch processing. On the market (insofar as one can speak of a market in an open-source context) there are many frameworks, but each has certain traits that, especially in large projects, make work harder. Some are meant for real-time processing; others perform better in batch workloads. Some of them can be considered "rock-solid" only when run on Hadoop. However, it is not the absence of these problems that is Beam's main advantage. So what is? You'll find out in the presentation! We'll cover topics such as the processing model, the use cases where Beam shines, and the runtime environments. You'll also see how to run Apache Beam jobs on Google Cloud Platform.
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez (Codemotion)
More than a technology, Sinfonier is the logical evolution of real-time processing systems. It's the combination of a visual programming language (Yahoo Pipes), a collaborative philosophy (community) and a framework for real-time processing (Apache Storm), made available to users. For years, solutions that exploit information using batch-processing technologies have been growing. Now it is time to bring this philosophy to the world of real-time processing. Sinfonier was born with a clear focus on solving problems related to cybersecurity, but the technology supports other use cases, as we'll show.
Kubernetes is exploding in popularity right now and has all the buzz and cargo-culting that Docker enjoyed just a few years ago. But what even is Kubernetes? How do I run my PHP apps in it? Should I run my PHP apps in it?
Porting a Streaming Pipeline from Scala to Rust (Evan Chan)
How we at Conviva ported a streaming data pipeline from Scala to Rust in a matter of months. What were the important human and technical factors in our port, and what did we learn?
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Jennie Wang, Software Engineer (Intel)
Tsai Louie, Software Engineer (Intel)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Key Trends Shaping the Future of Infrastructure (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which participants tried out different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Solutions Apricot) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
4. Laptop with any OS
JDK v7.x installed
Maven v3.0.5+ installed
IDE [either Eclipse with m2eclipse plugin or IntelliJ IDEA]
Created Twitter app for retrieving tweets
Cloned or downloaded Storm projects from my GitHub account:
https://github.com/P7h/StormWordCount
https://github.com/P7h/StormTweetsWordCount
Prerequisites for Workshop
5. Big Data
Batch vs. Real-time processing
Intro to Storm
Companies using Storm
Storm Dependencies
Storm Concepts
Anatomy of Storm Cluster
Live coding a use case using Storm Topology
Storm vs. Hadoop
Agenda
10. Batch vs. Real-time Processing
Batch processing
Gathering of data and processing as a group at one time.
Real-time processing
Processing of data that takes place as the information is being entered.
11. Event Processing
Simple Event Processing
Acting on a single event; a filter in the ESP
Event Stream Processing
Looking across multiple events
Complex Event Processing
Looking across multiple events from multiple event streams
13. Storm
Created by Nathan Marz @ BackType
Analyze tweets, links, users on Twitter
Open sourced on 19th September, 2011
Eclipse Public License 1.0
Storm v0.5.2
16k Java and 7k Clojure LOC
Latest Updates
Current stable release v0.8.2 released on 11th January, 2013
Major core improvements planned for v0.9.0
Storm will be an Apache Project [soon..]
14. Storm
Open source distributed real-time computation system
Hadoop of real-time
Fast
Scalable
Fault-tolerant
Guarantees data will be processed
Programming language agnostic
Easy to set up and operate
Excellent documentation
23. Storm enables the convergence of Big Data and low-latency processing. Empowers stream / micro-batch processing of user events, content feeds and application logs.
27. Clojure
A dialect of the Lisp programming language; runs on the JVM, CLR, and JavaScript engines
Apache Thrift
Cross-language bridge, RPC; framework to build services
ØMQ
Asynchronous message transport layer
Jetty
Embedded web server
Storm under the hood
28. Storm under the hood
Apache ZooKeeper
Distributed system, used to store metadata
LMAX Disruptor
High-performance queue shared by threads
Kryo
Serialization framework
Misc.
SLF4J, Python, Java 5+, JZMQ, Joda, Guava
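The Disruptor's role inside a Storm worker, a fast queue shared by an emitting thread and a processing thread, can be pictured with a plain JDK blocking queue. This is only a conceptual sketch using java.util.concurrent; Storm itself uses the LMAX Disruptor for much lower latency:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    // A bounded queue shared between a producing thread and a consuming thread,
    // conceptually similar to the per-executor queues Storm builds on the Disruptor.
    public static int drain(String[] tuples) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            for (String t : tuples) {
                try {
                    queue.put(t);          // blocks if the queue is full (backpressure)
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        producer.start();
        int processed = 0;
        try {
            for (int i = 0; i < tuples.length; i++) {
                queue.take();              // consumer blocks until a tuple is available
                processed++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(drain(new String[]{"a", "b", "c"}));  // 3
    }
}
```

The bounded capacity is the key design point: a full queue stalls the producer instead of letting it run away from the consumer.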
30. Tuples
Main data structure in Storm.
An ordered list of objects.
(“user”, “Prashanth”, “Babu”, “Engineer”, “Bangalore“)
Key-value pairs: keys are strings, values can be of any type.
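The idea of a tuple, an ordered list of values addressed by declared field names, can be sketched in plain Java. This is a hypothetical minimal class for illustration, not Storm's actual Tuple API:

```java
import java.util.Arrays;
import java.util.List;

public class TupleSketch {
    // Field names are declared once (by the emitting spout/bolt in Storm);
    // values are accessed by position or by field name.
    private final List<String> fields;
    private final List<Object> values;

    public TupleSketch(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        TupleSketch t = new TupleSketch(
            Arrays.asList("type", "firstName", "lastName", "role", "city"),
            Arrays.<Object>asList("user", "Prashanth", "Babu", "Engineer", "Bangalore"));
        System.out.println(t.getValueByField("city"));  // Bangalore
    }
}
```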
34. Bolts
Process input streams and [might] produce new streams.
Can do anything, i.e. filtering, streaming joins, aggregations, read from / write to databases, APIs, run arbitrary functions, etc.
All sinks in the topology are bolts but not all bolts are sinks.
36. Topology
Network of spouts and bolts.
Can be visualized like a graph.
Container for application logic.
Analogous to a MapReduce job. But runs forever.
38. Stream Groupings
Each Spout or Bolt might be running n instances in parallel [tasks].
Groupings are used to decide which task in the subscribing bolt the tuple is sent to.
Grouping: Feature
Shuffle: Random grouping
Fields: Grouped by value such that equal value results in same task
All: Replicates to all tasks
Global: Makes all tuples go to one task
None: Makes Bolt run in the same thread as the Bolt / Spout it subscribes to
Direct: Producer (task that emits) controls which Consumer will receive
Local or Shuffle: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks
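The fields grouping in the table can be pictured as a hash-modulo routing decision: the grouped field's value is hashed, so equal values always land on the same task. A conceptual sketch of that idea (not Storm's internal implementation):

```java
public class GroupingSketch {
    // Fields grouping: pick a task index by hashing the grouped field's value.
    // Equal values always hash to the same index, which is what makes
    // per-key state (e.g. per-word counters) correct under parallelism.
    public static int fieldsGroupingTask(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word routes to the same bolt task every time.
        System.out.println(fieldsGroupingTask("storm", tasks)
                == fieldsGroupingTask("storm", tasks));  // true
    }
}
```

Shuffle grouping, by contrast, would simply pick a random task, which balances load but keeps no per-key affinity.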
41. Storm Cluster
Nimbus daemon is the master of this cluster.
Manages topologies.
Comparable to Hadoop JobTracker.
Supervisor daemon spawns workers.
Comparable to Hadoop TaskTracker.
Workers are spawned by supervisors.
One per port defined in storm.yaml configuration.
42. Storm Cluster [contd..]
Task is run as a thread in workers.
ZooKeeper is a distributed system, used to store metadata.
UI is a webapp which gives diagnostics on the cluster and topologies.
Nimbus and Supervisor daemons are fail-fast and stateless.
State is stored in ZooKeeper.
43. Storm – Modes of operation
Local mode
Develop, test and debug topologies on your local machine.
Maven is used to include Storm as a dev dependency for the project.
mvn clean compile package && java -jar target/storm-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar
44. Remote [or Production] mode
Topologies are submitted for execution on a cluster of machines.
Cluster information is added in the storm.yaml file.
More details on the storm.yaml file can be found here:
https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster#fill-in-mandatory-configurations-into-stormyaml
storm jar target/storm-wordcount-1.0-SNAPSHOT.jar org.p7h.storm.offline.wordcount.topology.WordCountTopology WordCount
Storm – Modes of operation [contd..]
54. Problem #1 – WordCount [if there are internet issues]
Create a Spout which feeds random sentences [you can define your own set of random sentences].
Create a Bolt which receives sentences from the Spout and then splits them into words and forwards them to the next bolt.
Create another Bolt to count the words.
https://github.com/P7h/StormWordCount
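The logic those two bolts run, split each sentence into words, then tally per word, can be sketched without the Storm plumbing. This is an illustrative stand-alone version; the actual topology lives in the repository linked above:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Split bolt + count bolt in one place: tokenize each sentence on
    // whitespace, then increment a per-word counter. In the real topology
    // a fields grouping on the word routes equal words to the same task.
    public static Map<String, Integer> count(String[] sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            count(new String[]{"the cow jumped over the moon", "the moon"});
        System.out.println(counts.get("the"));   // 3
        System.out.println(counts.get("moon"));  // 2
    }
}
```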
55. Create a Spout which gets data from Twitter [please use Twitter4J and OAuth credentials to get tweets using the Streaming API].
For simplicity, consider only tweets which are in English.
Emit only the stuff we are interested in, i.e. a tweet's getRetweetedStatus().
Create another Bolt to count the retweets of a particular tweet.
Make an in-memory Map with the retweet screen name as the key and the counter of the retweet as the value.
Log the counter every few seconds / minutes [should be configurable].
https://github.com/P7h/StormTopRetweets
Problem #2 – Top 5 retweeted tweets [if internet works fine]
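The counting-bolt idea from Problem #2 can be sketched in plain Java: keep an in-memory map of screen name to retweet count and periodically report the top 5. This is illustrative only; the real solution in the linked repository feeds live tweets via Twitter4J:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopRetweetsSketch {
    // Sort the counter map by value, descending, and keep the first n keys.
    // In the bolt this would run on a timer instead of on demand.
    public static List<String> topN(Map<String, Integer> retweetCounts, int n) {
        return retweetCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("userA", 12);
        counts.put("userB", 42);
        counts.put("userC", 7);
        System.out.println(topN(counts, 2));  // [userB, userA]
    }
}
```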
57. Storm vs. Hadoop
Hadoop: Batch processing | Storm: Real-time processing
Hadoop: Jobs run to completion | Storm: Topologies run forever
Hadoop: [Pre-YARN] NameNode is SPOF | Storm: No SPOF
Hadoop: Stateful nodes | Storm: Stateless nodes
Hadoop: Scalable | Storm: Scalable
Hadoop: Guarantees no data loss | Storm: Guarantees no data loss
Hadoop: Open source | Storm: Open source
58. Timeline: Hadoop works great back here [older data]; Storm works here [now]. Hadoop AND Storm together give a blended view.
Advantages of batch processing:
- Once the data process is started, the computer can be left running without supervision.
- A large number of transactions can be combined into a batch rather than processing each individually.
- Cost-efficient for an organization: batch processing is only carried out when needed and uses less computer processing time.
Disadvantages of batch processing:
- There is a time delay before the work is processed and returned.
- Jobs must be run through to completion, and any changes in the data require reprocessing of the batch job itself.
- The master file is not always kept up to date.
- Batch processing can be slow.
- Complete or no insight into the outcome.
Advantages of real-time processing:
- Any errors in the processing of data are caught at the time the data is entered and can be fixed immediately.
- Better control over inventory and inventory turnover.
- Reduces the delay in analyzing.
Disadvantages of real-time processing:
- More complex and expensive than batch systems.
- Harder to audit.
- Backup of real-time processing is necessary to retain the data.
Open sourced on 19th September, 2011 [post Twitter acquisition of BackType].
Current stable release [as of 11th July, 2013]: v0.8.2, released on 11th January, 2013.