Transcript of "Apache Storm"

  1. Module 1: Introduction to Big Data and Storm. www.edureka.in/apache-storm
  2. How it Works?
     » LIVE On-line Class
     » Class Recording in LMS
     » 24/7 Post Class Support
     » Module Wise Quiz and Assignment
     » Project Work on Large Data Set
     » Verifiable Certificate
  3. Course Topics
     » Module 1: Introduction to Big Data and Storm
     » Module 2: Storm Technology Stack and Groupings
     » Module 3: Spouts and Bolts
     » Module 4: Trident Topologies
     » Module 5: Real Life Storm Project 1
     » Module 6: Real Life Storm Project 2
  4. Objectives. At the end of this module, you will be able to:
     » Recall Big Data and Hadoop
     » Understand batch and real-time analytics of Big Data
     » Investigate the shortcomings of Hadoop
     » Understand the Lambda Architecture
     » Develop a basic knowledge of Apache Storm and its components
     » Explain the use cases and key differentiators of Storm
  5. Big Data. Storm is an open source computing system for real-time Big Data analytics. Let's understand Big Data first in order to learn Storm.
  6. What is Big Data?
     » Lots of data: terabytes or petabytes.
     » Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
     » The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  7. What is Big Data? Systems and enterprises generate huge amounts of data, from terabytes up to petabytes of information. For example, a stock market generates about one terabyte of new trade data per day, which is analyzed to determine trends for optimal trades.
  8. Un-structured Data is Exploding
     » 2,500 exabytes of new information in 2012, with the Internet as the primary driver.
     » The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year.
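The figures on this slide mix exabytes, petabytes and zettabytes, which makes them hard to compare. A minimal sketch of the unit arithmetic follows, assuming decimal (SI) units; the helper name `convert` is hypothetical and only for illustration.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption;
# the slide does not say whether it means decimal or binary units).
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21


def convert(amount, from_unit, to_unit):
    """Convert an amount between two units given as byte counts."""
    return amount * from_unit / to_unit


# Figures quoted on the slide, all expressed in zettabytes.
new_info_2012_zb = convert(2_500, EXABYTE, ZETTABYTE)
digital_universe_zb = convert(800_000, PETABYTE, ZETTABYTE)

# Relative growth implied by going from 800K petabytes to 1.2 zettabytes.
implied_growth = (1.2 - digital_universe_zb) / digital_universe_zb
```

Under these assumptions, 800K petabytes is 0.8 zettabytes, so the projected 1.2 zettabytes corresponds to roughly 50% growth over that figure.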
  9. IBM’s Definition: Big Data Characteristics. VOLUME, VELOCITY, VARIETY. Example data sources: web logs, images, videos, sensor data, audio. Source: http://www-01.ibm.com/software/data/bigdata/
  10. Annie’s Introduction. Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
  11. Annie’s Question: Map the following to the corresponding data type:
      - XML files
      - Word docs, PDF files, Text files
      - E-Mail body
      - Data from Enterprise systems (ERP, CRM etc.)
  12. Annie’s Answer:
      - XML files -> Semi-structured Data
      - Word docs, PDF files, Text files -> Unstructured Data
      - E-Mail body -> Unstructured Data
      - Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
  13. Big Data Batch Analytics. Hadoop and its primary programming model, MapReduce, are great for batch-oriented processing of huge amounts of data. As data grows, Hadoop enables you to scale your cluster horizontally by adding commodity nodes and thus keep up with query workloads.
  14. What is Hadoop? Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing.
  15. Hadoop Eco-System
      » Apache Oozie (Workflow)
      » Hive (DW System)
      » Pig Latin (Data Analysis)
      » HBase
      » MapReduce Framework
      » Other YARN Frameworks (MPI, Giraph)
      » YARN (Cluster Resource Management)
      » HDFS (Hadoop Distributed File System)
  16. Big Data Batch Analytics. This evolution has forced the addition of support for:
      » Higher-level languages (Pig & Hive)
      » New real-time storage engines (HBase)
      » Extensions for streaming data (Hadoop Streaming)
  17. Hadoop for Batch Analytics. Because it is a batch processing system, Hadoop should be deployed in situations where a huge amount of data is generated, stored, and queried, such as:
      » Index building
      » Pattern recognition
      » Creating recommendation engines
      » Sentiment analysis
  18. Real-time Big Data Analytics: Social Networking
      » Pick your own Big Data database (RDBMS or NoSQL).
      » Measure the immediate impact to your site traffic from social media, whether a new blog post, a tweet, a “Like”, or even a comment.
      » Knowing this information translates to better conversion and more effective online campaigns.
  19. Real-time Big Data Analytics: SaaS
      » Measuring user behaviour and acting upon it is crucial for improving customer satisfaction and conversion rates, which represent immediate increases in revenue.
  20. Real-time Big Data Analytics: Financial Services
      » Determining in real time whether your portfolio is losing money, or if there is fraud in your system, means that you can prevent disasters as they occur, not after the damage is done.
      » Correlating multiple sources from the market in real time results in a more accurate view of the market and enables more accurate actions to maximize your profit.
  21. Real Time Big Data Analytics: Options
      » Apache Storm
      » Amazon Kinesis
  22. Need for Real-time Analytics. Problem statement: to find the total number of page views of Edureka’s blog over a range of time. Google Analytics can provide you this information. Example: For a particular day, the data can be: [figure]
  23. Need for Real-time Analytics. Challenge: querying a huge amount of historical data is slow. [Diagram: All Data, petabyte-scale]
  24. Need for Real-time Analytics. Solution: precompute historical data. [Diagram: All Data, Precomputed View, Query]
  25. Need for Real-time Analytics. Google Analytics might have to keep the historical data for each hour as a precompiled view:
      URL                                | Hr of the day | No. of pageviews
      edureka.in/blog/aboutapachestorm   | 1             | 250
      edureka.in/blog/aboutapachestorm   | 2             | 300
      edureka.in/blog/aboutapachestorm   | 3             | 455
      edureka.in/blog/aboutapachestorm   | 4             | 460
      edureka.in/blog/aboutapachestorm   | 5             | 320
      edureka.in/blog/aboutapachestorm   | 6             | 111
      edureka.in/blog/aboutapachestorm   | 7             | 129
      [Diagram: page views, All Data, Precomputed View, Query]
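The precomputed-view idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how Google Analytics or Hadoop implement it: the function names `build_hourly_view` and `page_views` are hypothetical, and the counts are the ones listed in the slide's table.

```python
from collections import Counter


def build_hourly_view(page_view_events):
    """Batch step: count page views per (url, hour) over all raw events."""
    view = Counter()
    for url, hour in page_view_events:
        view[(url, hour)] += 1
    return view


def page_views(view, url, first_hour, last_hour):
    """Query step: sum precomputed counts for first_hour..last_hour inclusive."""
    return sum(view[(url, hour)] for hour in range(first_hour, last_hour + 1))


# Hourly counts from the slide, loaded as if the batch step had already run.
URL = "edureka.in/blog/aboutapachestorm"
precomputed_view = Counter({
    (URL, 1): 250, (URL, 2): 300, (URL, 3): 455, (URL, 4): 460,
    (URL, 5): 320, (URL, 6): 111, (URL, 7): 129,
})

# A range query touches a handful of precomputed rows, not every raw event.
views_hours_2_to_4 = page_views(precomputed_view, URL, 2, 4)
```

The query for hours 2 to 4 adds three precomputed rows (300 + 455 + 460) instead of scanning the raw page-view log, which is the point of precomputing historical data.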
  26. Need for Real-time Analytics. [Diagram: a precomputed view built from all data using Hadoop, answering the query]
  27. Need for Real-time Analytics. But what about the data generated after the last precompiled view?
  28. Need for Real-time Analytics. Compensating for the last few hours of data. [Diagram: real-time data flows through Storm (spout, bolt, bolt, bolt) into a real-time view]
  29. Need for Real-time Analytics. [Diagram: All Data is processed by Hadoop into a precomputed batch view; the New Data Stream is processed by Storm into a precomputed real-time view; the query reads both views]
  30. Lambda Architecture. (1) All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  31. Lambda Architecture. (2) The batch layer has two functions: managing the master dataset (an immutable, append-only set of raw data), and pre-computing the batch views. (3) The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  32. Lambda Architecture. (4) The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  33. Lambda Architecture. (5) Any incoming query can be answered by merging results from batch views and real-time views. [Diagram: New Data feeds the Batch Layer (Master Dataset) and the Speed Layer (Real-time Views); the Serving Layer holds the Batch Views; a query reads the batch views and the real-time views]
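The merge step of the Lambda Architecture can be sketched as follows. This is a simplified illustration under stated assumptions: both views are plain dictionaries keyed by (url, hour), the function names are hypothetical, and the hour-8 count in the real-time view is an invented value used only to show the merge.

```python
def merge_views(batch_view, realtime_view):
    """Combine batch and real-time counts; overlapping keys are added together."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


def query(batch_view, realtime_view, url, first_hour, last_hour):
    """Answer a range query from the merged batch and real-time views."""
    merged = merge_views(batch_view, realtime_view)
    return sum(merged.get((url, hour), 0) for hour in range(first_hour, last_hour + 1))


URL = "edureka.in/blog/aboutapachestorm"

# Batch view: hours already covered by the last batch run (values from slide 25).
batch_view = {(URL, 6): 111, (URL, 7): 129}

# Real-time view: data that arrived after the last batch run.
# The count 42 is a made-up value for illustration only.
realtime_view = {(URL, 8): 42}

recent_views = query(batch_view, realtime_view, URL, 6, 8)
```

In a real system the real-time view would be discarded for any hour once the batch layer has recomputed it, so the two views would not double count; this sketch assumes they cover disjoint hours.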
  34. What is Storm? Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
  35. What is Storm? The work is delegated to different types of components that are each responsible for a simple, specific processing task. The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a component called a bolt, which transforms it in some way. A bolt either persists the data in some sort of storage, or passes it to some other bolt. [Diagram: an input data source feeds spouts; spouts pass data to bolts; bolts transform data and pass it to other bolts or to data storage]
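The spout-to-bolt flow described on this slide can be imitated with plain Python generators. This is a single-process sketch of the idea, not Storm's API: the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, and a real Storm stream is unbounded and distributed rather than a short list.

```python
def sentence_spout(lines):
    """Spout: reads from an input source and emits one tuple per line."""
    for line in lines:
        yield (line,)


def split_bolt(stream):
    """Bolt: transforms each sentence tuple into one tuple per lowercase word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream, storage):
    """Bolt: persists a running count per word into the given storage dict."""
    for (word,) in stream:
        storage[word] = storage.get(word, 0) + 1
    return storage


# A small, finite stand-in for the input data source.
source = ["Storm processes streams", "streams of data"]
storage = {}

# spout -> bolt (transform) -> bolt (persist)
count_bolt(split_bolt(sentence_spout(source)), storage)
```

After the pipeline runs, `storage` holds the word counts, with "streams" counted twice. The first bolt passes its output to another bolt and the second persists it, which are the two behaviours the slide describes.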
  36. Annie’s Question: Storm can be used in:
      - Real-time Processing
      - Batch Processing
      - Both
  37. Annie’s Answer: Real-time Processing.
  38. Annie’s Question: Which of them can be a source of a stream?
      - Spout
      - Bolt
      - Both
  39. Annie’s Answer: Both.
  40. Annie’s Question: It is not possible to run Storm processes along with MapReduce jobs inside a Hadoop cluster.
      - True
      - False
  41. Annie’s Answer: False. With Hadoop 2.0, it is possible.
  42. Storm Components. A Storm cluster has 3 sets of nodes:
      1. Nimbus node (master node, similar to the Hadoop JobTracker):
         » Uploads computations for execution
         » Distributes code across the cluster
         » Launches workers across the cluster
         » Monitors computation and reallocates workers as needed
      2. ZooKeeper nodes:
         » Coordinate the Storm cluster
      3. Supervisor nodes:
         » Communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus
      [Diagram: one Nimbus node, three ZooKeeper nodes, five Supervisor nodes]
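To make "monitors computation and reallocates workers as needed" concrete, here is a toy sketch of a master reassigning workers when a supervisor node fails. It is not Storm's scheduler: the round-robin placement, the least-loaded reassignment rule, and all names (`assign_workers`, `reassign_after_failure`, `sup-a`, `w1`, and so on) are assumptions made for illustration.

```python
def assign_workers(workers, supervisors):
    """Spread workers round-robin across the supervisor nodes."""
    assignment = {supervisor: [] for supervisor in supervisors}
    for index, worker in enumerate(workers):
        assignment[supervisors[index % len(supervisors)]].append(worker)
    return assignment


def reassign_after_failure(assignment, failed_supervisor):
    """Move the failed supervisor's workers to the least-loaded survivors."""
    orphaned = assignment[failed_supervisor]
    survivors = {
        supervisor: list(workers)
        for supervisor, workers in assignment.items()
        if supervisor != failed_supervisor
    }
    if not survivors:
        raise RuntimeError("no supervisor nodes left to run the workers")
    for worker in orphaned:
        # Pick the survivor currently running the fewest workers.
        target = min(survivors, key=lambda name: len(survivors[name]))
        survivors[target].append(worker)
    return survivors


assignment = assign_workers(
    ["w1", "w2", "w3", "w4", "w5", "w6"],
    ["sup-a", "sup-b", "sup-c"],
)
after_failure = reassign_after_failure(assignment, "sup-b")
```

After `sup-b` is removed, its two workers are spread over the two remaining supervisors, so every worker is still running somewhere. In Storm the state needed for this coordination goes through ZooKeeper, which this sketch leaves out.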
  43. Annie’s Question: A Nimbus node is similar to a TaskTracker node in a Hadoop cluster.
      - True
      - False
  44. Annie’s Answer: False. A Nimbus node is more like a JobTracker node in Hadoop.
  45. Storm Components. Five key abstractions help to understand how Storm processes data:
      » Tuples: an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7).
      » Streams: an unbounded sequence of tuples.
      » Spouts: sources of streams in a computation (e.g. a Twitter API).
      » Bolts: process input streams and produce output streams. They can run functions; filter, aggregate, or join data; or talk to databases.
      » Topologies: the overall calculation, represented visually as a network of spouts and bolts.
      Storm users define topologies for how to process the data when it comes streaming in from the spout.
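The five abstractions fit together as a small graph: spouts and bolts are nodes, and streams of tuples are the edges between them. The class below is a hypothetical, single-process model of that idea and is not Storm's own API; the `Topology` class, its method names and the example components are all assumptions for illustration, and the spout here is finite whereas a Storm stream is unbounded.

```python
from collections import deque


class Topology:
    """A toy network of spouts and bolts connected by streams of tuples."""

    def __init__(self):
        self.spouts = {}       # name -> callable returning an iterable of tuples
        self.bolts = {}        # name -> callable(tuple) returning an iterable of tuples
        self.subscribers = {}  # component name -> names of bolts reading its stream

    def set_spout(self, name, source):
        self.spouts[name] = source

    def set_bolt(self, name, func, inputs):
        self.bolts[name] = func
        for upstream in inputs:
            self.subscribers.setdefault(upstream, []).append(name)

    def run(self):
        """Push every spout tuple through the graph; return tuples emitted per bolt."""
        emitted = {name: [] for name in self.bolts}
        queue = deque()
        for name, source in self.spouts.items():
            for tup in source():
                queue.append((name, tup))
        while queue:
            component, tup = queue.popleft()
            for bolt_name in self.subscribers.get(component, []):
                for out in self.bolts[bolt_name](tup):
                    emitted[bolt_name].append(out)
                    queue.append((bolt_name, out))
        return emitted


topology = Topology()
# Spout: emits four 1-tuples, using the values from the slide's 4-tuple example.
topology.set_spout("numbers", lambda: [(7,), (1,), (3,), (7,)])
# Bolt that runs a function on each tuple.
topology.set_bolt("double", lambda tup: [(tup[0] * 2,)], inputs=["numbers"])
# Bolt that filters its input stream.
topology.set_bolt("only_large", lambda tup: [tup] if tup[0] > 5 else [], inputs=["double"])

result = topology.run()
```

Here `double` emits (14,), (2,), (6,), (14,) and `only_large` keeps only the values above 5. The topology is defined purely in terms of spouts and bolts, with no mention of Nimbus, ZooKeeper or Supervisor nodes.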
  46. Annie’s Question: A Storm topology is defined in terms of:
      - Nimbus, ZooKeeper, Supervisor nodes
      - Spout, Bolt
      - Spout, Bolt, Nimbus, ZooKeeper, Supervisor nodes
      - Spout, Bolt, ZooKeeper node
  47. Annie’s Answer: Spout and Bolt.
  48. Use Cases of Storm
      » Processing Streams: Unlike other stream processing systems, with Storm there’s no need for intermediate queues.
      » Continuous Computation: Send data to clients continuously so they can update and show results in real time, such as site metrics.
      » Distributed Remote Procedure Call: Easily parallelize CPU-intensive operations.
  49. Use Cases of Storm
      » Financial Services: Securities Fraud, Compliance Violations, Order Routing, Pricing
      » Telecom: Security Breaches, Network Outages, Bandwidth Allocation, Customer Service
      » Retail: Shrinkage, Stock outs, Offers, Pricing
      » Web: Application Failure, Operational Issues, Personalized Content
      These industries use Storm to prevent certain outcomes or to optimize their objectives.
  50. Key Differentiators
      » Simple to Program: It’s painful to do real-time processing from scratch. With Storm, complexity is reduced drastically.
      » Support for Multiple Programming Languages: It’s easier to develop in a JVM-based language, but Storm supports any language as long as you use or implement a small intermediary library.
      » Fault-tolerant: The Storm cluster takes care of workers going down, reassigning tasks when necessary.
  51. Assignment. Try setting up a single-node Storm cluster on your system as shown in the LMS document on Apache Storm single-node cluster installation.
  52. Pre-work. Install Ubuntu in VMware Player on your system. Install a single-node Storm cluster on your system.
  53. What’s within the LMS. This section will give you an insight into the Apache Storm course.
  54. What’s within the LMS. Click here to expand and view all the elements of this Module.
  55. What’s within the LMS. Assignment, Pre-work, Quiz.
  56. edureka!
