SlideShare a Scribd company logo
1 of 34
BIG DATA CONCEPTS

Serkan ÖZAL
Middle East Technical University
Ankara/TURKEY
October 2013
Contents
 Big Data and Scalability
 NoSQL





Column Stores
Key-Value Stores
Document Stores
Graph Database Systems

 Batch Data Processing
 MapReduce
 Hadoop
 Running Analytical Queries over Offline Big Data
 Hive
 Pig

 RealTime Data Processing
 Storm

2
Big Data
 Collect.
 Store.
 Organize.
 Analyze.
 Share.

Data growth outruns the ability to manage it so we
need scalable solutions.

3
Scalability - Vertical
 Increasing server capacity.

 Adding more CPU, RAM.
 Managing is hard.
 Possible down times

4
Scalability - Horizontal
 Adding servers to existing system with little effort, aka

Elastically scalable.
 Bugs, hardware errors, things fail all the time.
 It should become cheaper. Cost efficiency.







Shared nothing.
Use of commodity/cheap hardware.
Heterogeneous systems.
Controlled Concurrency (avoid locks).
Service Oriented Architecture. Local states.
 Decentralized to reduce bottlenecks.
 Avoid Single point of failures.

 Asynchrony.
 Symmetry, you don’t have to know what is happening. All

nodes should be symmetric.
5
NoSQL
 Stands for Not Only SQL
 Class of non-relational data storage systems

 Usually do not require a fixed table schema nor

do they use the concept of joins
 Offers BASE properties instead of ACID
(Atomicity, Consistency, Isolation, Durability)
properties
 BAsically available: Nodes in the a distributed

6

environment can go down, but the whole system
shouldn’t be affected.
 Soft State (scalable): The state of the system and data
changes over time.
 Eventual Consistency: Given enough time, data will be
consistent across the distributed system.
What is Wrong With RDBMS?
 One size fits all? Not really.
 Rigid schema design.
 Harder to scale.
 Replication.

 Joins across multiple nodes? Hard.
 How does RDMS handle data growth?

Hard.

7
Advantages & Disadvantages
 Advantages
 Flexible schema
 Quicker/cheaper to set up
 Massive scalability
 Evantual consistency

higher performance &

availability

 Disadvantages
 No declarative query language

programming
 Evantual consistency
8

more

fewer guarantees
NoSQL Systems
 Several incarnations
 Column Stores
 Key-Value Stores
 Document Stores
 Graph Databases

9
Column Stores
 Each storage block contains data from only one

column
 More efficient than row (or document) store if:
 Multiple row/record/documents are inserted at the

same time so updates of column blocks can be
aggregated
 Retrievals access only some of the columns in a
row/record/document
 Example Systems
 HBase

10
Key-Value Stores
Extremely simple interface




Data model: (key, value) pairs
Operations:




Insert(key,value), Fetch(key),Update(key), Delete(key)

Implementation: efficiency, scalability, faulttolerance




Records distributed to nodes based on key



Replication



Single-record transactions, “eventual consistency”

Example systems



11

Google BigTable, Amazon Dynamo, Cassandra,
Voldemort, …
Document Stores
 Like Key-Value Stores except value is

document
Data model: (key, document) pairs
Document: JSON, XML, other semistructured
formats
Basic operations:








Insert(key,document), Fetch(key),Update(key),
Delete(key)

Also Fetch based on document contents

 Example systems
 CouchDB, MongoDB, SimpleDB, …
12
Graph Database Systems
 A graph is a collection nodes (things) and edges

(relationships) that connect pairs of nodes.
 Attach properties (key-value pairs) on nodes and

relationships
 Relationships connect two nodes and both nodes

and relationships can hold an arbitrary amount of keyvalue pairs.
 A graph database can be thought of as a key-value

store, with full support for relationships.
 Example systems
 Neo4j, FlockDB, Pregel, …

13
Batch Data Processing
 Batch data processing is an efficient way of

processing high volumes of data is where a group of
transactions is collected over a period of time.
 Data is collected, entered, processed and then the
batch results are produced
 Data is processed by worker nodes and managed by
master node.
 Analytical queries can be executed as distributed on
offline big data. User queries are converted to MapReduce jobs on cluster to process big data.
 Hive: Supports SQL like query language
 Pig: More likely a scripting language
14
Map-Reduce
 MapReduce

is a programming model and
software framework first developed by Google
(Google’s MapReduce paper was submitted in
2004)
 A method for distributing a task across multiple
nodes
 Each node processes data assigned to that node
 Consists of two developer-created phases
 Map
 Reduce

 Nice features of MapReduce systems include:
15

 reliably processing a job even when machines die
 paralellization on thousands of machines
Map
 "Map" step:

The master node takes the input,
divides it into smaller sub-problems, and
distributes them to worker nodes. A worker node
may do this again in turn, leading to a multi-level
tree structure. The worker node processes the
smaller problem, and passes the answer back to
its master node.
function map(String name, String document):
// name: document name
// document: document contents
for each word w in document:
emit (w, 1) // emit(key, value)

16
Reduce
 "Reduce" step: After map step, the master node

collects the answers of all the sub-problems and
classifies all values with their keys. Each unique
key and its values are sent to a distinct reducer.
This means that each reducer processes a
distinct key and ita values independently from
function reduce(String word, Iterator
each other.

17

partialCounts):
// word: a word
// partialCounts: a list of aggregated partial
counts
sum = 0
for each pc in partialCounts:
sum += ParseInt(pc)
Map-Reduce Overview

18
Hadoop
 Hadoop is a scalable fault-tolerant distributed

system for data storage and processing.
 Core Hadoop has two main components:
(1) HDFS: Hadoop Distributed File System is a
reliable, self-healing, redundant, high-bandwidth,
distributed file system optimized for large files.
(2) MapReduce: Fault-tolerant distributed processing
 Operates on unsuctured and structured data
 Open source under the friendly Apache License

19
Hadoop Architecture

20
HDFS
 Inspired by Google File System
 Scalable, distributed, portable file system written in

Java for Hadoop framework
 Primary distributed storage used by Hadoop applications

 HDFS can be part of a Hadoop cluster or can be a

stand-alone general purpose distributed file system
 An HDFS cluster primarily consists of
 NameNode that manages file system metadata
 DataNode that stores actual data

 Stores very large files in blocks across machines in a

large cluster
 Reliability and fault tolerance ensured by replicating data

across multiple hosts
21
HDFS Architecture

22
Running Analytical Queries
over Offline Big Data
 Hadoop is great for large-data processing!
 But writing Java programs for everything is verbose

and slow
 Not everyone wants to (or can) write Java code
 Solution: develop higher-level data processing

languages
 Hive: HQL is like SQL
 Pig: Pig Latin is a bit like Perl

23
Hive
A system for querying and managing structured data
built on top of Hadoop
 Uses Map-Reduce for execution
 HDFS for storage – but any system that
implements Hadoop FS API
Key Building Principles:
 Structured data with rich data types (structs, lists
and maps)
 Directly query data from different formats
(text/binary) and file formats (Flat/Sequence)
 SQL as a familiar programming tool and for
standard analytics
24
 Allow embedded scripts for extensibility and for
Hive - Example
SELECT foo FROM sample WHERE ds='2012-02-24';

Create HDFS dir for the output

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample
WHERE ds='2012-02-24';
Create local dir for the output
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT *
FROM sample;
SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
Hive allows the From clause to come first !!!

Store the results into a table

FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*)
WHERE s.foo > 0 GROUP BY s.bar;
25

25
Pig









A platform for analyzing large data sets that
consists of a high-level language for expressing
data analysis programs.
Compiles down to MapReduce jobs
Developed by Yahoo!
Open-source language
Common design patterns as key words (joins,
distinct, counts)
Data flow analysis




Can be interactive mode


26

A script can map to multiple map-reduce jobs
Issue commands and get results
Pig - Example
Read file from HDFS

The input format (text, tab delimited)

Define run-time schema

raw = LOAD 'excite.log' USING PigStorage('t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE
Filter the rows on predicates
user, time,
org.apache.pig.tutorial.sanitze(query)each row, do some transformation
For as query;
user_groups = GROUP clean2 BY (user, query);
Grouping of records
user_query_counts = FOREACH user_groups
GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);

Compute aggregation for each group

STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');

Text, Comma delimited

Store the output in a file

27

27
RealTime Data Processing
 In contrast, real time data processing involves a





28

continual input, process and output of data.
Size of data is not certain at the beginning of the
processing.
Data must be processed in a small time period (or
near real time).
Storm is a most popular open source
implementation of realtime data processing.
Most importantly used by Twitter.
Storm
 Is a highly distributed realtime

computation system.
 Provides general primitives to do
realtime computation.
 Can be used with any programming
language.
 It's scalable and fault-tolerant.
29
Architecture – General
Overview

30
Architecture - Components
 Nimbus
 Master node (similar to Hadoop JobTracker)
 Responsible for distributing code around the cluster,

assigning tasks to machines, and monitoring for failures.
 ZooKeeper
 Used for cluster coordination between Nimbus and

Supervisors.
 Nimbus and Supervisors are fail-fast and stateless; all
state is kept in Zookeeper or on local disk.
 Supervisor
 Run worker processes.
 The supervisor listens for work assigned to its machine

and starts and stops worker processes as necessary
based on what Nimbus has assigned to it.
31
Main Concepts – General
Overview

32
Main Concepts - Components
 Streams
 Unbounded sequence of tuples/datas (input or

output)

 Spouts
 Source of streams
 Such as reading data from Twitter streaming

API

 Bolts
 Processes input streams, does some

processing and possibly emits new streams

 Topologies
33

 Network of spouts and bolts
Thanks for Listening

34

More Related Content

What's hot

Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
IOT DATA MANAGEMENT AND COMPUTE STACK.pptx
IOT DATA MANAGEMENT AND COMPUTE STACK.pptxIOT DATA MANAGEMENT AND COMPUTE STACK.pptx
IOT DATA MANAGEMENT AND COMPUTE STACK.pptxMeghaShree665225
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsRohithND
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadhMithlesh Sadh
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Ddb 1.6-design issues
Ddb 1.6-design issuesDdb 1.6-design issues
Ddb 1.6-design issuesEsar Qasmi
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceData Science Thailand
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Architecture of Mobile Computing
Architecture of Mobile ComputingArchitecture of Mobile Computing
Architecture of Mobile ComputingJAINIK PATEL
 

What's hot (20)

Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
IOT DATA MANAGEMENT AND COMPUTE STACK.pptx
IOT DATA MANAGEMENT AND COMPUTE STACK.pptxIOT DATA MANAGEMENT AND COMPUTE STACK.pptx
IOT DATA MANAGEMENT AND COMPUTE STACK.pptx
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Chapter 1 big data
Chapter 1 big dataChapter 1 big data
Chapter 1 big data
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Big data Ppt
Big data PptBig data Ppt
Big data Ppt
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Ddb 1.6-design issues
Ddb 1.6-design issuesDdb 1.6-design issues
Ddb 1.6-design issues
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Architecture of Mobile Computing
Architecture of Mobile ComputingArchitecture of Mobile Computing
Architecture of Mobile Computing
 

Viewers also liked

Big data
Big dataBig data
Big datahsn99
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsBig Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsSherinMariamReji05
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation17aroumougamh
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 
Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...
Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...
Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...Sergey Lukjanov
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big dataRaul Chong
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)Jose Luis Lopez Pino
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKhuguk
 
Docker & Apcera Better Together
Docker & Apcera Better TogetherDocker & Apcera Better Together
Docker & Apcera Better TogetherSimone Morellato
 
Case study: Hadoop as ELT for Leading US Retailer - Happiest Minds
Case study: Hadoop as ELT for Leading US Retailer - Happiest MindsCase study: Hadoop as ELT for Leading US Retailer - Happiest Minds
Case study: Hadoop as ELT for Leading US Retailer - Happiest MindsHappiest Minds Technologies
 
How to tackle big data from a security
How to tackle big data from a securityHow to tackle big data from a security
How to tackle big data from a securityTyrone Systems
 
7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data Platform7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data PlatformHarshal Deo (HD)
 

Viewers also liked (20)

Big data
Big dataBig data
Big data
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big Data
Big DataBig Data
Big Data
 
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsBig Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...
Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...
Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using El...
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)
 
Sgml and xml
Sgml and xmlSgml and xml
Sgml and xml
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
 
Docker & Apcera Better Together
Docker & Apcera Better TogetherDocker & Apcera Better Together
Docker & Apcera Better Together
 
Case study: Hadoop as ELT for Leading US Retailer - Happiest Minds
Case study: Hadoop as ELT for Leading US Retailer - Happiest MindsCase study: Hadoop as ELT for Leading US Retailer - Happiest Minds
Case study: Hadoop as ELT for Leading US Retailer - Happiest Minds
 
Big Data
Big DataBig Data
Big Data
 
How to tackle big data from a security
How to tackle big data from a securityHow to tackle big data from a security
How to tackle big data from a security
 
7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data Platform7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data Platform
 

Similar to Big data concepts

How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedDouglas Bernardini
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologieszahid-mian
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones WebSantiago Coffey
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introductionrajsandhu1989
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopSvetlin Nakov
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 

Similar to Big data concepts (20)

Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
HADOOP
HADOOPHADOOP
HADOOP
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

More from Serkan Özal

Flying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaFlying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaSerkan Özal
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Serkan Özal
 
JVM Under the Hood
JVM Under the HoodJVM Under the Hood
JVM Under the HoodSerkan Özal
 
Ankara JUG Big Data Presentation
Ankara JUG Big Data PresentationAnkara JUG Big Data Presentation
Ankara JUG Big Data PresentationSerkan Özal
 
AWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceAWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceSerkan Özal
 

More from Serkan Özal (7)

Flying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaFlying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS Lambda
 
MySafe
MySafeMySafe
MySafe
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...
 
JVM Under the Hood
JVM Under the HoodJVM Under the Hood
JVM Under the Hood
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Ankara JUG Big Data Presentation
Ankara JUG Big Data PresentationAnkara JUG Big Data Presentation
Ankara JUG Big Data Presentation
 
AWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceAWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map Reduce
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Big data concepts

  • 1. BIG DATA CONCEPTS Serkan ÖZAL Middle East Technical University Ankara/TURKEY October 2013
  • 2. Contents  Big Data and Scalability  NoSQL     Column Stores Key-Value Stores Document Stores Graph Database Systems  Batch Data Processing  MapReduce  Hadoop  Running Analytical Queries over Offline Big Data  Hive  Pig  RealTime Data Processing  Storm 2
  • 3. Big Data  Collect.  Store.  Organize.  Analyze.  Share. Data growth outruns the ability to manage it so we need scalable solutions. 3
  • 4. Scalability - Vertical  Increasing server capacity.  Adding more CPU, RAM.  Managing is hard.  Possible down times 4
  • 5. Scalability - Horizontal  Adding servers to existing system with little effort, aka Elastically scalable.  Bugs, hardware errors, things fail all the time.  It should become cheaper. Cost efficiency.      Shared nothing. Use of commodity/cheap hardware. Heterogeneous systems. Controlled Concurrency (avoid locks). Service Oriented Architecture. Local states.  Decentralized to reduce bottlenecks.  Avoid Single point of failures.  Asynchrony.  Symmetry, you don’t have to know what is happening. All nodes should be symmetric. 5
  • 6. NoSQL  Stands for Not Only SQL  Class of non-relational data storage systems  Usually do not require a fixed table schema nor do they use the concept of joins  Offers BASE properties instead of ACID (Atomicity, Consistency, Isolation, Durability) properties  BAsically available: Nodes in the a distributed 6 environment can go down, but the whole system shouldn’t be affected.  Soft State (scalable): The state of the system and data changes over time.  Eventual Consistency: Given enough time, data will be consistent across the distributed system.
  • 7. What is Wrong With RDBMS?  One size fits all? Not really.  Rigid schema design.  Harder to scale.  Replication.  Joins across multiple nodes? Hard.  How does RDMS handle data growth? Hard. 7
  • 8. Advantages & Disadvantages  Advantages  Flexible schema  Quicker/cheaper to set up  Massive scalability  Evantual consistency higher performance & availability  Disadvantages  No declarative query language programming  Evantual consistency 8 more fewer guarantees
  • 9. NoSQL Systems  Several incarnations  Column Stores  Key-Value Stores  Document Stores  Graph Databases 9
  • 10. Column Stores  Each storage block contains data from only one column  More efficient than row (or document) store if:  Multiple row/record/documents are inserted at the same time so updates of column blocks can be aggregated  Retrievals access only some of the columns in a row/record/document  Example Systems  HBase 10
  • 11. Key-Value Stores Extremely simple interface   Data model: (key, value) pairs Operations:   Insert(key,value), Fetch(key),Update(key), Delete(key) Implementation: efficiency, scalability, faulttolerance   Records distributed to nodes based on key  Replication  Single-record transactions, “eventual consistency” Example systems   11 Google BigTable, Amazon Dynamo, Cassandra, Voldemort, …
  • 12. Document Stores  Like Key-Value Stores except value is document Data model: (key, document) pairs Document: JSON, XML, other semistructured formats Basic operations:      Insert(key,document), Fetch(key),Update(key), Delete(key) Also Fetch based on document contents  Example systems  CouchDB, MongoDB, SimpleDB, … 12
  • 13. Graph Database Systems  A graph is a collection nodes (things) and edges (relationships) that connect pairs of nodes.  Attach properties (key-value pairs) on nodes and relationships  Relationships connect two nodes and both nodes and relationships can hold an arbitrary amount of keyvalue pairs.  A graph database can be thought of as a key-value store, with full support for relationships.  Example systems  Neo4j, FlockDB, Pregel, … 13
  • 14. Batch Data Processing  Batch data processing is an efficient way of processing high volumes of data is where a group of transactions is collected over a period of time.  Data is collected, entered, processed and then the batch results are produced  Data is processed by worker nodes and managed by master node.  Analytical queries can be executed as distributed on offline big data. User queries are converted to MapReduce jobs on cluster to process big data.  Hive: Supports SQL like query language  Pig: More likely a scripting language 14
  • 15. Map-Reduce  MapReduce is a programming model and software framework first developed by Google (Google’s MapReduce paper was submitted in 2004)  A method for distributing a task across multiple nodes  Each node processes data assigned to that node  Consists of two developer-created phases  Map  Reduce  Nice features of MapReduce systems include: 15  reliably processing a job even when machines die  paralellization on thousands of machines
  • 16. Map  "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. function map(String name, String document): // name: document name // document: document contents for each word w in document: emit (w, 1) // emit(key, value) 16
  • 17. Reduce  "Reduce" step: After map step, the master node collects the answers of all the sub-problems and classifies all values with their keys. Each unique key and its values are sent to a distinct reducer. This means that each reducer processes a distinct key and ita values independently from function reduce(String word, Iterator each other. 17 partialCounts): // word: a word // partialCounts: a list of aggregated partial counts sum = 0 for each pc in partialCounts: sum += ParseInt(pc)
  • 19. Hadoop  Hadoop is a scalable fault-tolerant distributed system for data storage and processing.  Core Hadoop has two main components: (1) HDFS: Hadoop Distributed File System is a reliable, self-healing, redundant, high-bandwidth, distributed file system optimized for large files. (2) MapReduce: Fault-tolerant distributed processing  Operates on unsuctured and structured data  Open source under the friendly Apache License 19
  • 21. HDFS  Inspired by Google File System  Scalable, distributed, portable file system written in Java for Hadoop framework  Primary distributed storage used by Hadoop applications  HDFS can be part of a Hadoop cluster or can be a stand-alone general purpose distributed file system  An HDFS cluster primarily consists of  NameNode that manages file system metadata  DataNode that stores actual data  Stores very large files in blocks across machines in a large cluster  Reliability and fault tolerance ensured by replicating data across multiple hosts 21
  • 23. Running Analytical Queries over Offline Big Data  Hadoop is great for large-data processing!  But writing Java programs for everything is verbose and slow  Not everyone wants to (or can) write Java code  Solution: develop higher-level data processing languages  Hive: HQL is like SQL  Pig: Pig Latin is a bit like Perl 23
  • 24. Hive A system for querying and managing structured data built on top of Hadoop  Uses Map-Reduce for execution  HDFS for storage – but any system that implements Hadoop FS API Key Building Principles:  Structured data with rich data types (structs, lists and maps)  Directly query data from different formats (text/binary) and file formats (Flat/Sequence)  SQL as a familiar programming tool and for standard analytics 24  Allow embedded scripts for extensibility and for
  • 25. Hive - Example SELECT foo FROM sample WHERE ds='2012-02-24'; Create HDFS dir for the output INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24'; Create local dir for the output INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample; SELECT MAX(foo) FROM sample; SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds; Hive allows the From clause to come first !!! Store the results into a table FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar; 25 25
  • 26. Pig       A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Compiles down to MapReduce jobs Developed by Yahoo! Open-source language Common design patterns as key words (joins, distinct, counts) Data flow analysis   Can be interactive mode  26 A script can map to multiple map-reduce jobs Issue commands and get results
  • 27. Pig - Example Read file from HDFS The input format (text, tab delimited) Define run-time schema raw = LOAD 'excite.log' USING PigStorage('t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE Filter the rows on predicates user, time, org.apache.pig.tutorial.sanitze(query)each row, do some transformation For as query; user_groups = GROUP clean2 BY (user, query); Grouping of records user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); Compute aggregation for each group STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(','); Text, Comma delimited Store the output in a file 27 27
  • 28. RealTime Data Processing  In contrast, real time data processing involves a     28 continual input, process and output of data. Size of data is not certain at the beginning of the processing. Data must be processed in a small time period (or near real time). Storm is a most popular open source implementation of realtime data processing. Most importantly used by Twitter.
  • 29. Storm  Is a highly distributed realtime computation system.  Provides general primitives to do realtime computation.  Can be used with any programming language.  It's scalable and fault-tolerant. 29
  • 31. Architecture - Components  Nimbus  Master node (similar to Hadoop JobTracker)  Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.  ZooKeeper  Used for cluster coordination between Nimbus and Supervisors.  Nimbus and Supervisors are fail-fast and stateless; all state is kept in Zookeeper or on local disk.  Supervisor  Run worker processes.  The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. 31
  • 32. Main Concepts – General Overview 32
  • 33. Main Concepts - Components  Streams  Unbounded sequence of tuples/datas (input or output)  Spouts  Source of streams  Such as reading data from Twitter streaming API  Bolts  Processes input streams, does some processing and possibly emits new streams  Topologies 33  Network of spouts and bolts