A tutorial presentation based on storm.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a teaching assistant for Dr. Amir H. Payberah's Cloud Computing course in the spring semester of 2015.
Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany.
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
Real-Time Integration Between MongoDB and SQL Databases (Eugene Dvorkin)
Many companies have invested heavily in Data Warehouse and BI tools and want to leverage those investments to process data collected by applications in MongoDB. For example, a company may need to blend clickstream data collected in distributed MongoDB storage with personal data from Oracle into a Data Warehouse system or analytics platform to provide timely marketing reports. Most of the time, the job requires converting a MongoDB JSON document structure into a traditional relational model. A traditional ETL (Extract, Transform, Load) process still needs to be developed to load and convert unstructured data into traditional analytical tools or Hadoop. In this talk we discuss how to develop a real-time, scalable, fault-tolerant ETL process to integrate MongoDB with traditional RDBMS storage using the open-source Twitter Storm project. We will capture data streamed by the MongoDB oplog or capped collections, transform it into tables, rows and columns, and load it into a SQL database. We will discuss the MongoDB oplog and the Storm architecture. The principles discussed in the talk can be applied to many other applications, such as advanced analytics and continuous computations. We will be using Java as our language of choice, but you can use the same software stack with any language.
A Fast and Efficient Time Series Storage Based on Apache Solr (QAware GmbH)
OSDC 2016, Berlin: Talk by Florian Lautenschlager (@flolaut, Senior Software Engineer at QAware)
Abstract: How do you store billions of time series points and access them within a few milliseconds? Chronix! Chronix is a young but mature open source project that can, for example, store about 15 GB (CSV) of time series in 238 MB with average query times of 21 ms. Chronix is built on top of Apache Solr, a bulletproof distributed NoSQL database with impressive search capabilities. In this code-intensive session we show how Chronix achieves its efficiency in both respects by means of ideal chunking, by selecting the best compression technique, by enhancing the stored data with (pre-computed) attributes, and by specialized query functions.
These slides are for a brief seminar given as part of the Ph.D. exam "Perspective in Parallel Computing" (held by Prof. Marco Danelutto) at the University of Pisa (Italy).
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic skeleton parallel frameworks.
Probabilistic algorithms for fun and pseudorandom profit (Tyler Treat)
There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? Explore some of the fundamental primitives used in stream processing and, specifically, how we can use probabilistic methods to solve the problem.
Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
In this track we will introduce the Storm framework, explain some design concepts and considerations, and show some real-world examples of how to use it to process large amounts of data in real time in a distributed environment. We will describe how this solution can be scaled very easily as more data needs to be processed.
We will explain all you need to know to get started with Storm and some tips on how to get your Spouts, Bolts and Topologies up and running in the cloud.
PHP Backends for Real-Time User Interaction using Apache Storm (DECK36)
Engaging users in real-time is the topic of our times. Whether it’s a game, a shop, or a content-network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation on how to use it with PHP. That way, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (Tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
Developing Java Streaming Applications with Apache Storm (Lester Martin)
Apache Storm, http://storm.apache.org, is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. During this presentation, a simple Java-based streaming application will be built from scratch!
Code examples can be found at https://github.com/lestermartin/streaming-exploration.
This paper describes the use of Storm at Twitter. Storm is a real-time, fault-tolerant and distributed stream data processing system. Storm is currently being used to run various critical computations in Twitter at scale, and in real-time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This paper also describes how queries (a.k.a. topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures. Storm is under active development at Twitter and we also present some potential directions for future work.
Speakers: Chris Larsen (Limelight Networks) and Benoit Sigoure (Arista Networks)
The OpenTSDB community continues to grow, with users looking to store massive amounts of time-series data in a scalable manner. In this talk, we will discuss a number of use cases and best practices around naming schemas and HBase configuration. We will also review OpenTSDB 2.0's new features, including the HTTP API, plugins, annotations, millisecond support, and metadata, as well as what's next in the roadmap.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise (Patrick McFadin)
Wait! Back away from the Cassandra secondary index. It’s ok for some use cases, but it’s not an easy button. “But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadin’s videos. What do I do?” The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart’s content. Take our hand. We will show you how.
How We Used Cassandra/Solr to Build Real-Time Analytics Platform (DataStax Academy)
This session will discuss how Cassandra/Solr can be used to create a real-time analytics platform, jKool.
jKool provides an in-memory analysis of time-series data, automatically performing sequencing, correlation, grouping, enriching, synchronizing, computing, querying and displaying data streams. The session will discuss architecture, challenges and approaches taken to create a real-time analytics platform on top of open source big data analytics platforms: Cassandra, Solr, Kafka & Spark.
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine-generated data and log files. We'll explore system requirements and design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
Building a real time big data analytics platform with solr (Trey Grainger)
Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
Building Intelligent Search Applications with Apache Solr and PHP5 (israelekpo)
ZendCon 2010 - Building Intelligent Search Applications with Apache Solr and PHP5. This is a presentation on how to create intelligent web-based search applications using PHP 5 and the out-of-the-box features available in Solr 1.4.1. After we finish illustrating how to add, update and remove data from the Solr index, we will discuss how to add features such as auto-completion, hit highlighting, faceted navigation, spelling suggestions, etc.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
introduction to data processing using Hadoop and Pig (Ricardo Varela)
In this talk we introduce big data processing and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized the first Tuesday of each month by YDN in London
Distributed real time stream processing- why and how (Petr Zapletal)
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also explore theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. Stream-based applications, which include trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think FaceBook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi (Databricks)
Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant families of algorithms, deep learning, remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and because of the inherent difficulty in tuning and configuring.
In this session, you’ll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside of Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You’ll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing.
Apache Hadoop has emerged as the storage and processing platform of choice for Big Data. In this tutorial, I will give an overview of Apache Hadoop and its ecosystem, with specific use cases. I will explain the MapReduce programming framework in detail, and outline how it interacts with Hadoop Distributed File System (HDFS). While Hadoop is written in Java, MapReduce applications can be written using a variety of languages using a framework called Hadoop Streaming. I will give several examples of MapReduce applications using Hadoop Streaming.
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale (Sean Zhong)
Gearpump is an Akka-based real-time streaming engine that uses actors to model everything. It offers high performance and flexibility: it reaches 18,000,000 messages/second with a latency of 8 ms on a cluster of 4 machines.
Distributed Real-Time Stream Processing: Why and How 2.0 (Petr Zapletal)
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs and their intended use-cases. Apart from that, I’m going to speak about Fast Data, the theory of streaming, framework evaluation and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the best possible one for their particular use-case.
Real-Time Big Data with Storm, Kafka and GigaSpaces
1. Real-Time Big Data with Storm, Kafka and GigaSpaces.
Building your own Real-Time Google Analytics
Oleksiy Dyagilev
Lead Software Engineer
Epam Systems
2. Real-time
• must guarantee response within strict time constraints
• deadline must be met regardless of system load
Examples: ABS (anti-lock brakes), railway switching system, chess playing program
Near Real-time
• time delay introduced by data processing or network transmission
• "no significant delays"
Examples: video streaming, analytical applications
3. Big Data
• data sets so large and complex that it becomes difficult to process them using traditional data processing applications
• volume (terabytes, petabytes)
• velocity (speed of data in and out)
• variety (various data sources, structured and unstructured)
IoT (Internet of Things)
mobile devices
sensor data (meteo, genomics, geo, bio, etc)
social media
Internet search
user activity tracking
software logs
6. Kafka, not Franz
A high-throughput distributed messaging system
• fast, O(1) persistence
• scalable
• durable
• distributed by design
• originally developed by LinkedIn
• written in Scala
7. Kafka. Commit log.
• ordered, immutable sequence of messages that is continually appended to
• retention
• read offset controlled by consumer
• partitions distributed over cluster
• replication (leader, followers)
• messages load balanced by message key or in round-robin
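The last bullet can be illustrated with a small sketch. It assumes a fixed number of partitions and uses a plain byte-array hash; Kafka's own partitioner uses a different hash function, and the class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the Kafka API.
public class PartitionChooser {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger();

    public PartitionChooser(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /** Returns a partition index: hash-based for keyed messages, round-robin for null keys. */
    public int partitionFor(String key) {
        if (key == null) {
            // No key: spread messages evenly over all partitions.
            return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
        }
        // Same key -> same hash -> same partition, which preserves per-key ordering.
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }
}
```

With three partitions, every message keyed "user-42" lands in one partition, while messages without a key cycle through partitions 0, 1, 2, 0, and so on.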
8. Kafka. Internals
• disks are slow! .. oh wait ...
• random vs sequential access matters a lot
• six 7200 rpm SATA disks in a RAID-5 array: random writes about 100 KB/sec, sequential writes about 600 MB/sec [1]
• according to an ACM Queue article, in some cases sequential writes to eight 15000 rpm SAS disks in RAID-5 can be faster than random memory access (mind the article date!)
9. Kafka. Internals
• OS pagecache, read-ahead, write-behind
• flush after K seconds or N messages
• no need to delete data (compared to in-memory solutions)
• no need to keep state about what has been consumed (controlled by the consumer, one integer per partition in ZK)
• no GC penalties
• no overhead for JVM objects
• batching
• end-to-end batch compression
• Linux sendfile() system call: eliminates context switches and memory copies (nio.FileChannel.transferTo())
10. Zero copy
Each time data traverses the user-kernel boundary, it must be copied, which consumes CPU cycles and memory bandwidth. A benchmark shows that transfer time is reduced by more than 2x with zero copy.
On Linux, java.nio.channels.FileChannel.transferTo() uses this mechanism (sendfile()).
[Figure: traditional approach vs. zero copy]
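A minimal sketch of sending a file with transferTo(), assuming the caller already has an open target channel (a socket or another file). The class and method names are hypothetical, and whether the transfer really avoids user-space copies depends on the operating system and channel types.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper illustrating FileChannel.transferTo().
public class ZeroCopySend {

    /** Streams the whole file to the target channel and returns the number of bytes sent. */
    public static long send(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            // transferTo() may move fewer bytes than requested, so loop until done.
            while (position < size) {
                long sent = in.transferTo(position, size - position, target);
                if (sent <= 0) {
                    break; // target cannot accept more data right now
                }
                position += sent;
            }
            return position;
        }
    }
}
```

The traditional alternative reads the file into a byte[] in user space and writes that buffer to the channel, which is the extra copy this call is meant to avoid.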
11. Kafka. Ordering guarantees.
• Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
• Messages are delivered asynchronously to consumers, so they may arrive out of order across different consumers.
• Kafka assigns each partition to exactly one consumer in a consumer group, which guarantees ordering within the partition.
• There is no global ordering across partitions; the only way to get it is a single partition with a single consumer, which doesn't scale.
12. Kafka. Delivery semantics
Producer:
• synchronous
• asynchronous
• wait for replication or not
To guarantee exactly-once:
• include PK and deduplicate on consumer
• or single writer per partition and check last message in case of network error
Consumer:
• at least once: read, process, commit position.
• at most once: read, commit position, process.
• exactly-once: need application level logic, keep offset together with data
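The difference between the two consumer orderings shows up when the consumer crashes between the two steps. The sketch below simulates this in memory without the Kafka client; the class name, the `crashAt` parameter and the single simulated crash are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory simulation only; no Kafka API is used.
public class DeliverySemanticsDemo {

    /**
     * Consumes the log with one simulated crash while handling offset crashAt.
     * After the crash the consumer restarts from the last committed offset.
     */
    public static List<String> consume(List<String> log, boolean commitBeforeProcess, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        boolean crashed = false;
        int offset = 0;
        while (offset < log.size()) {
            boolean crashNow = !crashed && offset == crashAt;
            if (commitBeforeProcess) {
                committed = offset + 1;              // at most once: commit first
                if (!crashNow) {
                    processed.add(log.get(offset));  // a crash here loses the message
                }
            } else {
                processed.add(log.get(offset));      // at least once: process first
                if (!crashNow) {
                    committed = offset + 1;          // a crash here causes a replay
                }
            }
            if (crashNow) {
                crashed = true;
            }
            offset = committed; // next message, or restart point after a crash
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        System.out.println("at least once: " + consume(log, false, 1));
        System.out.println("at most once:  " + consume(log, true, 1));
    }
}
```

With a crash at offset 1, the at-least-once run processes [a, b, b, c] and the at-most-once run processes [a, c]. Exactly-once needs extra application logic, such as storing the offset together with the data.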
13. LinkedIn benchmarks
Environment:
• 6 machines: Kafka on 3, other 3 for Zookeeper and client
• Intel Xeon 2.5 GHz processor with six cores
• Six 7200 RPM SATA drives, JBOD - no RAID
• 32GB of RAM
• 1Gb Ethernet
• 6 partitions, record 100 bytes
Result:
• 1 producer, no replica - 821,557 records/sec (78.3 MB/sec)
• 1 producer, 3x async replica - 786,980 records/sec (75.1 MB/sec)
• 1 producer, 3x sync replica - 421,823 records/sec (40.2 MB/sec)
• 3 producers, 3x async replication - 2,024,032 records/sec (193.0 MB/sec)
• 1 consumer - 940,521 records/sec (89.7 MB/sec)
• 3 consumers - 2,615,968 records/sec (249.5 MB/sec)
• 1 producer and 1 consumer - 795,064 records/sec (75.8 MB/sec)
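The MB/sec figures follow from the record rate and the 100-byte record size, assuming 1 MB = 1024 × 1024 bytes (an assumption that matches the numbers above). A short check, with hypothetical names:

```java
// Hypothetical helper for checking the benchmark arithmetic.
public class ThroughputCheck {

    /** Converts a record rate into MB/sec, with 1 MB taken as 1024 * 1024 bytes. */
    public static double mbPerSec(long recordsPerSec, int recordBytes) {
        return recordsPerSec * (double) recordBytes / (1024.0 * 1024.0);
    }
}
```

For example, mbPerSec(821_557, 100) is roughly 78.3, and mbPerSec(2_615_968, 100) is roughly 249.5, in line with the single-producer and three-consumer results.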
15. Apache Storm
Free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
• scalable
• fault-tolerant
• guarantees data will be processed
• originally written by Nathan Marz in Java/Clojure, then adopted by Twitter
• now in Apache incubator
• used by dozens of companies
• active community
16. Storm in a nutshell
• Topology - computation graph
• Tuple - unit of data, sequence of fields
• Stream - unbounded sequence of tuples
• Spout - input data source
• Bolt - processing node
• Stream grouping (field, shuffle, all, global)
• DRPC - distributed RPC, ad-hoc queries
• Trident - framework on top of Storm for stateful, incremental processing on top of a persistence store
17. Storm cluster.
• Nimbus distributes code around the cluster. SPOF. Stateless, fail-fast
• Supervisor starts/stops workers. Stateless, fail-fast
• Worker executes a subset of the topology
• coordination between Nimbus and Supervisor is done through Zookeeper
19. Word Counter. Topology
// Bolt that splits each incoming sentence into words.
class SplitSentence extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String string = tuple.getString(0);
        // Emit one tuple per space-separated word.
        for (String word : string.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
20. Storm. Word counter.
// Bolt that keeps a running count per word in task-local memory.
class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        // Emit the updated count downstream.
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
21. Storm. Word counter.
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // The last argument of setSpout/setBolt is the parallelism hint.
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    // Group by "word" so that the same word always reaches the same WordCount task.
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setNumWorkers(3);
    StormSubmitter.submitTopologyWithProgressBar("word-counter", conf, builder.createTopology());
}
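The fields grouping on "word" matters because each WordCount task keeps its own in-memory map: the counts are only correct if a given word always reaches the same task. A simplified model of that routing, assuming hash-modulo selection (this is not Storm's actual implementation, and the names are hypothetical):

```java
import java.util.Collections;
import java.util.List;

// Simplified model of fields grouping; not Storm code.
public class FieldsGroupingSketch {

    /** Picks a target task index from the values of the grouping fields. */
    public static int chooseTask(List<Object> groupingValues, int numTasks) {
        // Equal field values give equal hash codes, hence the same task.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 12; // parallelism hint of the "count" bolt
        for (String word : new String[] {"storm", "kafka", "storm"}) {
            int task = chooseTask(Collections.<Object>singletonList(word), numTasks);
            System.out.println(word + " -> task " + task);
        }
    }
}
```

A shuffle grouping, by contrast, picks a task at random, which is fine for the stateless SplitSentence bolt but would scatter the counts for one word over several tasks.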
22. Message processing guarantees
• every tuple will be processed at least once; use Trident for exactly-once
• when a tuple is created, it is given a random 64-bit id
• every tuple knows the ids of all the spout tuples in whose tuple trees it exists (this information is copied from the anchors when a new tuple is emitted in a bolt)
• when a tuple is acked, it sends a message to the appropriate acker task with information about how the tuple tree changed
• mod hashing is used to map a spout tuple id to an acker task
• the acker task stores a map from a spout tuple id to a pair of values (spout id, ack val). The ack val is a 64-bit number, the XOR of all tuple ids, and represents the state of the tree
• ack val = 0 means that the tree is fully processed
• an ack val can reach 0 by chance before the tree is complete, but at 10K acks per second it will take 50,000,000 years until such a mistake is made, and it will cause data loss only if that tuple fails
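The XOR bookkeeping can be sketched in a few lines. This is a simplified model, not Storm's acker code: it assumes every tuple id is XORed into the ack val once when the tuple is created and once when it is acked, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of an acker task; not Storm code.
public class AckerSketch {
    // spout tuple id -> ack val (XOR of all ids seen so far for that tree)
    private final Map<Long, Long> ackVals = new HashMap<>();

    /**
     * XORs the given value into the ack val of the tree rooted at spoutTupleId.
     * Returns true when the ack val becomes 0, i.e. the tree is fully processed.
     */
    public boolean update(long spoutTupleId, long xorOfTupleIds) {
        long ackVal = ackVals.getOrDefault(spoutTupleId, 0L) ^ xorOfTupleIds;
        if (ackVal == 0L) {
            ackVals.remove(spoutTupleId); // every id was seen exactly twice
            return true;
        }
        ackVals.put(spoutTupleId, ackVal);
        return false;
    }
}
```

A bolt that acks tuple r while emitting children c1 and c2 can report r ^ c1 ^ c2 in one message; once c1 and c2 are acked too, every id has been XORed twice and the ack val returns to 0. The acker needs only a fixed amount of memory per spout tuple, however large the tree grows.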
23. Trident
Trident is a high-level abstraction for doing stateful, incremental processing on top of a persistence store
• tuples are processed as small batches
• exactly-once processing
• “transactional” datastore persistence
• functional API: joins, aggregations, grouping, functions, and filters
• compiles into as efficient of a Storm topology as possible
24. Word counter with Trident
TridentState wordCounts = topology
    .newStream("spout1", spout).parallelismHint(16)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(16);

topology.newDRPCStream("words", drpc)
    .each(new Fields("args"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new FilterNull())
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
27. How does Trident guarantee exactly-once semantics?
• Each batch of tuples is given a unique id called the “transaction id” (txid). If the batch is replayed, it is given the exact same txid.
• State updates are ordered among batches. That is, the state updates for batch 3 won’t be applied until the state updates for batch 2 have succeeded. Note: pipelining
Consider a 'transactional' spout:
1. Batches for a given txid are always the same. Replays of batches for a txid will contain the exact same set of tuples as the first time that batch was emitted for that txid.
2. There’s no overlap between batches of tuples (tuples are in one batch or another, never multiple).
3. Every tuple is in a batch (no tuples are skipped)
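Given these spout properties, a state can make updates idempotent by storing the txid of the last applied batch next to each value and skipping a batch whose txid is already recorded. A minimal in-memory sketch of that idea, assuming txids are positive and batches arrive in order (class and method names are hypothetical, not the Trident API):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory illustration of a transactional state; not the Trident API.
public class TransactionalCountState {

    private static final class Cell {
        long count;
        long lastTxid = -1; // no batch applied yet; txids are assumed positive
    }

    private final Map<String, Cell> store = new HashMap<>();

    /** Applies the per-key counts of one batch, ignoring keys already updated by this txid. */
    public void applyBatch(long txid, Map<String, Long> batchCounts) {
        for (Map.Entry<String, Long> e : batchCounts.entrySet()) {
            Cell cell = store.computeIfAbsent(e.getKey(), k -> new Cell());
            if (cell.lastTxid == txid) {
                continue; // replay of a batch that was already applied
            }
            cell.count += e.getValue();
            cell.lastTxid = txid;
        }
    }

    /** Returns the current count for the key, or 0 if the key is unknown. */
    public long get(String key) {
        Cell cell = store.get(key);
        return cell == null ? 0L : cell.count;
    }
}
```

Skipping on an equal txid is only safe because a replayed batch is guaranteed to contain the same tuples and because batches are applied strictly in order.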
30. Realtime Google Analytics
A highly scalable equivalent of Realtime Google Analytics on top of Storm and GigaSpaces.
The application can be deployed to the cloud with one click using Cloudify.
Code available on GitHub: https://github.com/fe2s/xap-storm
40. Thanks!
Presentation and detailed blog post available at http://dyagilev.org
Resources:
• Kafka benchmarking
• The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads
• Kafka documentation
• The Lambda architecture: principles for architecting realtime Big Data systems
• Efficient data transfer through zero copy
• RabbitMQ Performance Measurements
• GigaSpaces and Storm Integration