Introduction to Big Data
Vipin Batra
 Webopedia[1]
Big data is used to describe a massive volume of both structured and
unstructured data that is so large that it's difficult to process using
traditional database and software techniques.
Gartner [2]
Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization.
What is BIG Data?
 National Institute of Standards and Technology (USA) [3]
 Big Data consists of extensive datasets, primarily in the
characteristics of volume, velocity, and/or variety, that require a
scalable architecture for efficient storage, manipulation, and analysis.
Dataset characteristics that force a new architecture are:
1. the data-at-rest characteristics of Volume and Variety (i.e., data
from multiple repositories, domains, or types), and
2. the data-in-motion characteristics of Velocity (i.e., rate of
flow) and Variability (i.e., the change in velocity).
These characteristics are known as the ‘Vs’ of Big Data.
What is Big Data
Big Data – 3 Vs
Big Data - 4Vs
 IBM’s 4Vs of Big Data:
Volume: Data at Scale. Terabytes to petabytes of data.
Variety: Data in Many Forms. Structured, unstructured, text, multimedia.
Velocity: Data in Motion. Analysis of streaming data to enable decisions within fractions of a second.
Veracity: Data Uncertainty. Managing the reliability and predictability of inherently imprecise data types.
 Validity: Refers to the quality of the data and its accuracy for the
purpose for which it was collected.
 Volatility: The tendency for data structures to change over time. With
real-time data, you need to determine at what point data is no longer
relevant to the current analysis.
 Value: The value of the insight provided by Big Data analytics.
More Vs of Big Data [10,3]
1 KB = 1024 Bytes = 2^10 Bytes
1 MB = 1024 KB = 2^20 Bytes
1 GB = 1024 MB = 2^30 Bytes
1 TB = 1024 GB = 2^40 Bytes
1 PB = 1024 TB = 2^50 Bytes
1 EB = 1024 PB = 2^60 Bytes
1 ZB = 1024 EB = 2^70 Bytes
1 YB = 1024 ZB = 2^80 Bytes = 1,208,925,819,614,629,174,706,176 Bytes
• 2^90 Bytes = Brontobytes, Hellabytes, or Ninabytes?
• 2^100 Bytes = Geopbytes, Gegobytes, or Tenabytes?
Big Volume [4] – 1/
1 GeopByte = 1024 BB = 2^100 Bytes = 1,267,650,600,228,229,401,496,703,205,376 Bytes
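The unit ladder above is simply successive powers of 1024. A minimal Python sketch can reproduce the two long figures quoted; the `UNITS` list and the `to_bytes` helper are illustrative names, not something from the slides.

```python
# Binary storage units: each step up multiplies by 1024 = 2**10.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(unit: str) -> int:
    """Return the number of bytes in one `unit` (binary convention)."""
    exponent = 10 * (UNITS.index(unit) + 1)
    return 2 ** exponent

for unit in UNITS:
    exponent = 10 * (UNITS.index(unit) + 1)
    print(f"1 {unit} = 2^{exponent} bytes = {to_bytes(unit):,}")

# Beyond the yottabyte the names are informal; 2**100 is the last value quoted.
print(f"2^100 bytes = {2 ** 100:,}")
```

Python integers have arbitrary precision, so the 31-digit value of 2^100 is computed exactly.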
Big Volume – 2/
Big Volume – 3/
Big Velocity [8]– 1/
Big Velocity
 Variety..
Big Variety
Big Variety[13,14]
 ~90% of all data is unstructured and it is growing faster than
structured data.
Big Variety
BIG Challenges
 Existing storage and processing systems have the following limitations with respect to
handling Big Data:
 Way too much data to process within acceptable time limits: network bottlenecks,
compute bottlenecks
 Data needs to be structured before storing: months are needed to design and implement
new schemas every time a new business need arises
 Hard to retrieve archived data: Not trivial to find archive tapes and find relevant
Limitations of Existing Systems[15]
 Distributed Computing: Horizontal Scaling instead of Vertical Scaling
 Computations are done closer to where data is stored
 Instead of a centrally located parallel computing architecture with super-computing
capabilities (Giga/Teraflops), a low-capacity distributed storage/computing solution is used
 Use of Low Cost Commodity hardware
 Big Data solutions use a large number of low-cost commodity machines, organized in
clusters, to carry out storage/computing tasks
 Reliability, Fault Tolerance and Recovery
 Individual nodes can fail anytime, so to ensure reliability, data is replicated across
multiple nodes
 Scaling with Demand
 The solutions are scalable and allow cluster sizes to grow as per requirement
 Storage of unstructured Data
 Traditional RDBMS systems require a well-defined schema to be created before data
can be stored (schema on write)
 A new data storage paradigm, ‘NoSQL’, has evolved to meet the need to store any type
of data. It provides schema on read, i.e. the schema is applied when the data is read.
 No Archiving
 Data is always online, so there is no archiving. Big data solutions do not assume which
queries will be run against the data, so the rule is to store all data in raw form.
Characteristics of Big Data Systems
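The contrast between schema on write and schema on read can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular RDBMS or NoSQL store works internally; `SCHEMA`, `write_with_schema`, `read_with_schema`, and the sample records are all hypothetical.

```python
import json

# Schema on write: the store rejects records that do not match a fixed schema.
SCHEMA = {"user_id": int, "amount": float}   # hypothetical schema

def write_with_schema(table: list, record: dict) -> None:
    """Validate against SCHEMA before storing (schema on write)."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)

# Schema on read: raw lines are stored untouched; structure is applied per query.
def read_with_schema(raw_lines: list, fields: list) -> list:
    """Parse each raw JSON line at read time and project the requested fields."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({f: doc.get(f) for f in fields})
    return rows

raw_store = [
    '{"user_id": 1, "amount": 9.5, "device": "mobile"}',
    '{"user_id": 2, "note": "free text, no amount"}',
]
amounts = read_with_schema(raw_store, ["user_id", "amount"])
devices = read_with_schema(raw_store, ["user_id", "device"])
```

The same raw lines serve two different queries without any reload, while the schema-on-write path would have rejected the second record outright.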
Big Data Storage
 Key Points to Note:
Comparison Traditional vs Big
Data Storage [15] – 1/2
# | Parameter | Traditional Systems | Big Data Storage
1 | Schema | Schema on Write: schema must be created before data can be loaded | Schema on Read: data is simply stored, no transformation
2 | Transformation | Explicit load operation has to take place, which transforms data to DB internal structure | A SerDes (Serializer/De-serializer) is applied during read time to extract the required columns
3 | Storage | Single seamless store of data, mostly single | Distributed storage across multiple nodes/locations
4 | Distillation (... data for read) | Already distilled data, as in structured format | Done on demand based on business needs, allowing for identifying new patterns and relationships in existing data
 Key Points to Note Contd..:
Comparison Traditional vs Big
Data Storage – 2/2
# | Parameter | Traditional Systems | Big Data Storage
5 | Store | Data is stored after preparation (for example after the extract-transform-load and cleansing processes) | 1. In a high velocity use case, the data is prepared and analyzed for alerting, and only then is it stored. 2. In a volume use case, the data is often stored in the raw state in which it was produced.
6 | Insights | Analysis needs to be defined upfront and hence is rigid to the business need | Ability to analyze data as required. Allows for data exploration and so enables the discovery of new insights that were not directly visible
7 | Action | Technically feasible, but not effective due to data latency | Ability to integrate with Business Decisioning systems for the next best
 A NoSQL database is one of a class of databases that do not use the relational
model for data storage (the relational model uses tables and rows)
 There are many NoSQL solutions; they are broadly classified as:
1. Key-Value
2. Column-Family
3. Document
4. Graph
NoSQL (Not Only SQL) Databases
• The first three models are aggregate-oriented
• An aggregate is a collection of related objects, treated as a single unit
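To make the notion of an aggregate concrete, here is a rough Python sketch that stores one and the same aggregate (an order with its line items) in the three aggregate-oriented styles. Plain dictionaries stand in for the stores; the `order` record and the store names are made up for illustration.

```python
import json

# One aggregate: an order together with its line items, handled as a single unit.
order = {
    "order_id": "o-1001",
    "customer": "alice",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Key-value: the store sees only an opaque value behind a key.
kv_store = {order["order_id"]: json.dumps(order)}

# Document: the store understands the fields, so it can query inside the aggregate.
doc_store = [order]
orders_with_a1 = [d["order_id"] for d in doc_store
                  if any(item["sku"] == "A1" for item in d["items"])]

# Column-family: a row key maps to named families, each holding its own columns.
cf_store = {
    order["order_id"]: {
        "info": {"customer": order["customer"]},
        "items": {item["sku"]: item["qty"] for item in order["items"]},
    }
}
```

In all three cases the order is read and written as one unit under one key, which is what makes these models easy to distribute across nodes. A graph database instead splits data into small nodes and relationships.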
 Google BigTable is a compressed, high-performance, proprietary data
storage system used in Google projects. It is a column-family database.
 BigTable maps two arbitrary string values (row key and column key) and
timestamp (hence three-dimensional mapping) into an associated arbitrary
byte array. It is not a relational database and can be better defined as a
sparse, distributed multi-dimensional sorted map
Google BigTable[16]
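The "sparse, distributed multi-dimensional sorted map" can be pictured with a small in-memory sketch. This is not BigTable's API; `put`, `get_latest`, and `scan` are hypothetical helpers, the row and column names are invented, and distribution across servers is left out entirely.

```python
# (row key, column key, timestamp) -> bytes, kept sparse: only written cells exist.
table = {}

def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, column, timestamp)] = value

def get_latest(row: str, column: str):
    """Return the value with the highest timestamp for a cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

def scan(row_prefix: str):
    """Yield cells in sorted key order, as a sorted map would."""
    for key in sorted(table):
        if key[0].startswith(row_prefix):
            yield key, table[key]

put("com.example/index", "contents:html", 1, b"<html>v1</html>")
put("com.example/index", "contents:html", 2, b"<html>v2</html>")
put("com.example/about", "anchor:home", 1, b"About us")
```

The timestamp dimension keeps both versions of the `contents:html` cell, and sorting by row key keeps related rows adjacent for range scans.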
 Apache HBase is an open-source, distributed, versioned, non-relational
database modeled after Google's BigTable. It is a column oriented DBMS that
runs on top of HDFS
Apache Hbase[18]
 Apache Cassandra is a column-family database system. It is designed as a
distributed storage system for managing very large amounts of structured
data spread out across many commodity servers, while providing highly
available service with no single point of failure.
Column-Family Database[18]
 MongoDB stores data in the form of documents, which are JSON[21]-like field
and value pairs. Documents are analogous to structures in programming
languages that associate keys with values (e.g. dictionaries, hashes, maps,
and associative arrays). Formally, MongoDB documents
are BSON[20] documents. BSON is a binary representation of JSON with
additional type information.
Document Database [19]
 MongoDB supports search by field, range queries, and regular
expression searches. Queries can return specific fields of
documents and can also include user-defined JavaScript functions.
 Any field in a MongoDB document can be indexed.
Secondary indices are also available.
 Neo4j is an open-source NoSQL graph database implemented in Java and
Graph Database [30]
 The property graph contains connected entities (the nodes) which can hold any number of
attributes (key-value-pairs).
 Nodes can be tagged with labels, which in addition to contextualizing node and relationship
properties may also serve to attach metadata—index or constraint information—to certain nodes.
 Relationships provide directed, named semantically relevant connections between two node-
Big Data Analytics
 Big data analytics refers to the process of collecting, organizing
and analysing large sets of data to discover patterns and other
useful information[23].
 Conceptual Framework for Big Data analytics[24]:
Big Data Analytics – 1/
 The data analytics project life cycle stages[27]:
Big Data Analytics – 2/
 Following are types of Big Data Analytics[27]:
Big Data Analytics – 3/
Big Data Solutions and
Apache Hadoop [6]
 Apache Hadoop is widely used, open-source software for reliable, scalable, distributed
computing. Hadoop is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage
 Microsoft Dryad is an R&D project that provides an infrastructure allowing a
programmer to use the resources of a computer cluster or a data center for
running data-parallel programs
 A Dryad programmer can use thousands of machines, each of them with multiple
processors or cores
 A Dryad job is a graph generator which can synthesize any directed acyclic graph
 These graphs can even change during execution, in response to important events in the
Big Data Solutions: Dryad[12]
LexisNexis – HPCC[22]
 HPCC (High-Performance Computing Cluster), also known as DAS (Data
Analytics Supercomputer), is an open source, data-intensive computing
system platform developed by LexisNexis Risk Solutions. The HPCC platform
incorporates a software architecture implemented on commodity computing
clusters to provide high-performance, data-parallel processing for applications
utilizing big data.
Thor: Batch Processing Engine; Roxie: High Performance Query Engine
 Apache Spark is an open-source cluster computing framework originally
developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage
disk-based MapReduce paradigm, Spark's in-memory primitives provide
performance up to 100 times faster for certain applications
 Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in the main program (called the driver program).
Apache Spark [28]
Lightning-fast cluster computing
 The SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager or Mesos or YARN), which allocate
resources across applications
 Once connected, first it acquires executors (processes that run computations
and store data) on nodes. Next, it sends application code (defined by JAR or
Python files passed to SparkContext) to the executors.
 Finally, SparkContext sends tasks for the executors to run.
Apache Spark [28]
Lightning-fast cluster computing
Quiz (Match The Following)
 Big Data is generally defined by:
 HPCC is example of:
 Cassandra is a:
 MongoDB is a:
 One of key Characteristic of Big Data
 Characteristic of Traditional storage
 NoSQL DB characteristic:
 Schema design before Storage
(Schema on write)
 Column-family Database
 Key-Value Database
 Graph Database
 Document Database
 3Vs - Volume, Veracity, Variety
 4Vs – Volume, Velocity, Variety, Veracity
 Big Data Solution
 Processing is closer to data
 Schema on Read
Brief History..
 A Hadoop cluster consists of a set of cheap commodity machines
networked together as servers in racks
Hadoop – 1/
 The Hadoop framework allows for the distributed processing of large data sets across
clusters of computers. It can scale up from single servers to thousands of machines, each
offering local computation and storage.
 Designed to detect and handle failures so as to deliver a highly-available service on top
of a cluster of computers, each of which may be prone to failures.
 The project includes these modules:
 Hadoop Common: The common utilities that support the other Hadoop modules.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
 Hadoop YARN: A framework for job scheduling and cluster resource management.
 Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop – 2/2 [6, 31]
 HDFS is a Fault Tolerant Distributed File System
 HDFS provides many POSIX-like file system features (it is not fully POSIX-compliant):
 File, Directory and Sub-Directory Structure
 Permission (rwx)
 Access (owner, Group, Others) and Super User
 Optimized for storing large files, with streaming data access (not
random access)
 File System keeps a checksum (CRC32 per 512 bytes) of data for
corruption detection and recovery
 Files are stored across multiple commodity hardware machines in a cluster
 Files are divided into uniform sized blocks (64MB, 128MB, 256 MB)
 Blocks are replicated across multiple machines to handle failures
 Provides access to block locations (servers/racks), so computations
can be done at the same locations (the same servers/racks on which the data resides)
Hadoop Distributed File System -1/9
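The two storage ideas above, fixed-size blocks and per-chunk checksums, can be illustrated with a scaled-down Python sketch. The sizes are shrunk so the example runs on a few hundred bytes, Python's `zlib.crc32` stands in for the file system's own checksum code, and every function name here is hypothetical.

```python
import zlib

BLOCK_SIZE = 64          # bytes here; HDFS blocks are tens to hundreds of MB
CHUNK_SIZE = 16          # stands in for the 512-byte checksum unit

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut a file into fixed-size blocks; only the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def chunk_checksums(block: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """CRC32 of every chunk in a block, kept alongside the data."""
    return [zlib.crc32(block[i:i + chunk_size]) for i in range(0, len(block), chunk_size)]

def corrupted_chunks(block: bytes, expected: list) -> list:
    """Indices of chunks whose CRC32 no longer matches the stored value."""
    actual = chunk_checksums(block)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

file_data = bytes(range(256))            # 256 bytes -> 4 blocks of 64
blocks = split_into_blocks(file_data)
checksums = [chunk_checksums(b) for b in blocks]

# Flip one byte in block 2 and detect which chunk was damaged.
damaged = bytearray(blocks[2])
damaged[20] ^= 0xFF
bad = corrupted_chunks(bytes(damaged), checksums[2])
```

Because checksums are kept per small chunk rather than per block, a reader can tell which part of a replica is damaged and fetch that block from another replica.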
HDFS – 2/9
 HDFS is implemented as a master-slave architecture
 NameNode is the master; a Secondary NameNode periodically checkpoints its metadata (it is a helper, not a hot standby)
 DataNodes are slaves
• Read/Write
• Data
HDFS – 3/
 NameNode manages the file system :
 File System Names (e.g. /home/foo/data/ .. ) and meta data
 Maps a file name to set of blocks
 Maps a block to DataNodes where it resides
 Blocks within the file system and their replicas
 Manages Cluster Configuration
 Managing data nodes
 In case of NameNode failure, the Secondary NameNode's checkpoint can be used to recover the metadata; it does not take over automatically
HDFS – 4/9
 Meta Data
 HDFS namespace is hierarchy of files and directories
 Entire Meta-data is in Memory
 No Demand Paging
 Consists of list of blocks for each file, file attributes e.g. access time,
replication factor etc.,
 Changes to HDFS are recorded in a log called the ‘Transaction Log’
 Block Placement: default of 3 replicas, configurable
 One replica on the local node, the second on a remote rack, the third on the same remote
rack
 Additional copies randomly placed
 Clients Read from nearest replica
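The placement rule listed above can be sketched as a small Python function. This is a simplified model under stated assumptions: the `CLUSTER` layout and node names are invented, and a real NameNode also weighs factors such as node load and free space (the heartbeat statistics), which are ignored here.

```python
import random

# Hypothetical cluster layout: rack name -> DataNodes in that rack.
CLUSTER = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8"],
}

def rack_of(node: str) -> str:
    return next(rack for rack, nodes in CLUSTER.items() if node in nodes)

def place_replicas(writer_node: str, replication: int = 3, rng=random) -> list:
    """Pick DataNodes for one block following the placement rule above."""
    chosen = [writer_node]                                  # 1st: local node
    if replication >= 2:
        remote_racks = [r for r in CLUSTER if r != rack_of(writer_node)]
        remote_rack = rng.choice(remote_racks)
        second = rng.choice(CLUSTER[remote_rack])           # 2nd: a remote rack
        chosen.append(second)
    if replication >= 3:
        peers = [n for n in CLUSTER[remote_rack] if n != second]
        chosen.append(rng.choice(peers))                    # 3rd: same remote rack
    while len(chosen) < replication:                        # extras: random nodes
        free = [n for nodes in CLUSTER.values() for n in nodes if n not in chosen]
        chosen.append(rng.choice(free))
    return chosen[:replication]

replicas = place_replicas("dn2", rng=random.Random(0))
```

Putting two of the three replicas on one remote rack means a whole-rack failure never loses the block, while only one copy has to cross between racks during the write.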
HDFS: Data Nodes – 5/9
 DataNode
 Slave daemon process that reads/writes HDFS blocks from/to files in its local
file system
 During startup performs handshake with NameNode to verify namespace,
software version of data node (if version mismatch, datanode shuts down)
 Periodically sends heartbeat, block reports to NameNode
 Heartbeat carries total storage capacity, fraction used, ongoing data transfers etc.
 These stats are used by NameNode for block placement and load balancing
 Block Report has Block ID, Timestamp, block length for each replica
 Has no awareness of HDFS file system
 Does block creation, deletion, replication, shutdown etc. when instructed by the NameNode
 NameNode commands are sent as replies to the heartbeat messages received
 Stores each HDFS block in a separate file in the underlying OS’s file system
 Maintains an optimal number of files per directory, creating new directories as needed
 Interacts directly with client to read/write blocks
 Java/C++ APIs are available to access Files on HDFS
 Sample code illustrates, writing to HDFS as a 3 step process:
HDFS Write – Sample Code: 6/9
 Figure below illustrates how write takes place (how blocks and
their replicas are updated):
HDFS Write Operations: 7/9
 Sample code illustrates, reading HDFS as a 4 step process:
HDFS Read– Sample Code: 8/9
 Figure below illustrates how read takes place
HDFS Read Operations: 9/9
YARN - Yet Another Resource Negotiator
 Manages Compute resources across the clusters
 Consists of Following Nodes:
 Resource Manager(RM)
 Manages and Allocates Cluster Compute Resources
 Node Manager on each Node (NM)
 Manages and enforces node resources
 Application Master
 Per application
 Manages app lifecycle and
tasks scheduling
 Container
 Basic Unit of allocation
 Allows fine grained resource
YARN: Resource Manager
 Resource Manager
 Manages Nodes – Tracks heartbeats from NodeManagers
 Manages Containers
 Handles AM request for resources
 De-allocates containers when they expire or application
 Manages AM (ApplicationMasters)
 Creates a container for AMs and tracks heartbeats
 Manages Security
 Supports Kerberos
YARN: NodeManager
 Node Manager resides on each Node
 Registers with ResourceManager (RM) and provides info on
node resources
 Sends periodic heartbeats and container status
 Manages processes in containers
 Launches AMs on request from RM
 Launches application processes on request from AM
 Monitors resource usage by containers; kills rogue processes
 Provides logging services to applications
 Aggregates logs for an application and saves to HDFS
 Maintains node level security via ACLs
 Container
 Created by Resource Manager upon request
 Allocate a certain amount of resources (CPU, Memory)
 Applications run in one or more containers
 Application Master (AM)
 One per application
 Framework/application specific
 Runs in a container
 Requests more containers to run application tasks
YARN: Containers and AMs
 Client requests the RM to launch an application:
 RM launches the Application Master on one NodeManager
YARN: Starting an App : 1/
 Application Master (AM) requests resources from RM; RM allocates
resources on Node Managers
 RM confirms resource allocations to the AM with details; the AM launches the App
YARN: Starting an App: 2/
Resource Request (AM to RM):
• Resource Name (Hostname, Rack#)
• Priority (within this app)
• Resource Required: Memory (MB), CPU (# of cores) etc.
• Number of Containers
Allocation (RM to AM): Container ID, Node
• C1@NM1
• C2@NM2
Container Launch (AM to NM):
• Container ID
• Commands (to start MyApp)
• Environment
• Local Resources (e.g. MyApp binary, HDFS ...)
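The request and allocation exchange can be modelled as a toy Python sketch. The class and field names below mirror the labels in the diagram but are hypothetical; this is not the YARN API, and the first-fit allocation loop is a simplification of whatever scheduler a real cluster runs.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceRequest:
    resource_name: str      # hostname, rack, or "*" for anywhere
    priority: int           # priority within this application
    memory_mb: int
    vcores: int
    num_containers: int

@dataclass
class NodeManager:
    name: str
    free_memory_mb: int
    free_vcores: int

@dataclass
class ResourceManager:
    nodes: list
    next_id: int = 1
    allocations: list = field(default_factory=list)

    def allocate(self, request: ResourceRequest) -> list:
        """Grant containers on nodes with enough free memory and cores."""
        granted = []
        for node in self.nodes:
            if request.resource_name not in ("*", node.name):
                continue
            while (len(granted) < request.num_containers
                   and node.free_memory_mb >= request.memory_mb
                   and node.free_vcores >= request.vcores):
                node.free_memory_mb -= request.memory_mb
                node.free_vcores -= request.vcores
                granted.append(f"C{self.next_id}@{node.name}")
                self.next_id += 1
        self.allocations.extend(granted)
        return granted

rm = ResourceManager([NodeManager("NM1", 2048, 2), NodeManager("NM2", 4096, 4)])
containers = rm.allocate(
    ResourceRequest("*", priority=1, memory_mb=1024, vcores=1, num_containers=3)
)
```

Each granted entry pairs a container ID with a node, the same "container@node" form used in the allocation reply, and the Application Master would then send a launch request to each of those NodeManagers.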
 MapReduce, originally proprietary Google technology, is a programming
model for processing large amounts of data in a parallel and distributed
fashion. It is useful for large, long-running jobs that cannot be handled within
the scope of a single request, tasks like:
 Analyzing application logs
 Aggregating related data from external sources
 Transforming data from one format to another
 Exporting data for external analysis
MapReduce Operation: Input, then Map, Shuffle, Reduce, then Output
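The Map, Shuffle, and Reduce stages can be shown with the classic word-count example. This is a single-process Python sketch of the programming model only; in a real cluster each phase runs in parallel on many machines, and the function names here are illustrative rather than taken from Hadoop.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data in motion"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both can be spread across nodes without coordination; only the shuffle moves data between them.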
 Zookeeper exposes primitives that distributed applications can build upon to implement
higher level services for synchronization, configuration maintenance, and groups and
 Clients connect to servers to access a name space, much like that of a standard file
system, to store and retrieve coordination data (status information, configuration, location
information, etc.). The data is usually small, in the byte to kilobyte range.
 Guarantees:
 Sequential Consistency - Updates will be applied in the order that they were sent.
 Atomicity - Updates either succeed or fail. No partial results.
 Single System Image – Same view of the service regardless of the server used
 Reliability - Once an update has been applied, it persists until updated.
 Timeliness - View of the system is guaranteed to be up-to-date within a time bound.
Zookeeper: A Distributed Coordination
Service for Distributed Applications[29]
Big Data – Business
BD – Landscape
Impact of Big Data on Economy
 Top 10 Big Data Challenges
Big Data Challenges
Big Data Trends
Big Data Market Forecast
Big Data: Revenues
[Figure: 2013 Big Data Revenue ($ millions)]
 Government Operation: National Archives and Records Administration, Census
 Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web
Search, Digital Materials, Cargo shipping (as in UPS)
 Defense: Sensors, Image surveillance, Situation Assessment
 Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models,
 Deep Learning and Social Media: Driving Car, Geolocate images/cameras,
Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
 The Ecosystem for Research: Metadata, Collaboration, Language Translation,
Light source experiments
 Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
 Earth, Environmental and Polar Science: Radar Scattering in Atmosphere,
Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar
mapping, Climate simulation datasets, Atmospheric turbulence identification,
Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET
gas sensors
 Energy: Smart grid
Big Data Applications
 Real Time Analytics: Banking and finance, disaster detection and recovery,
event monitoring and similar applications need vast amounts of data, arriving at a very
fast pace, to be processed within strict time limits
 Artificial Intelligence/Business Intelligence:
 Intelligent Maintenance Systems: systems that use data collected from
machinery to predict and prevent potential failures
 IoT/M2M: These applications generate data at a very fast rate (high
velocity) from a huge number of sources (high volume) and require big data
solutions to process it and derive meaningful information.
 Transreality gaming, sometimes written as trans-reality gaming, describes a
type or a mode of gameplay that combines playing a game in a virtual
environment with game-related, physical experiences in the real world and
vice versa.
Emerging Trends in Big Data
 Cloud computing advances have helped Big Data emerge as a
mass scale solution
 Leased/Rented data storage, computing clusters, enable even
startups to have global scale Big Data capability, without major
capital investment
Emerging Trends in Cloud Computing
– Complementary Technologies
 Massively parallel processing refers to a multitude of individual processors
working in parallel to execute a particular program
 The Big Data paradigm consists of the distribution of data systems across
horizontally coupled, independent resources to achieve the scalability needed for
the efficient processing of extensive datasets.
 Big Data Engineering: Advanced techniques that harness independent resources
for building scalable data systems when the characteristics of the datasets
require new architectures for efficient storage, manipulation, and analysis.
 NoSQL: Non-relational models, also known as NoSQL, refer to logical data models
that do not follow relational algebra for the storage and manipulation of data.
 A federated database system is a type of meta-database management system
(DBMS), which transparently maps multiple autonomous database systems into a
single federated database.
Terms[3] – 1/
 The data science paradigm is extraction of actionable knowledge directly from
data through a process of discovery, hypothesis, and hypothesis testing.
 The data lifecycle is the set of processes that transform raw data into actionable
knowledge.
 Analytics is the extraction of knowledge from information.
 Data science is the construction of actionable knowledge from raw data through
the complete data lifecycle process.
 A data scientist is a practitioner who has sufficient knowledge in the overlapping
regimes of business needs, domain knowledge, analytical skills, and software and
systems engineering to manage the end-to-end data processes through each
stage in the data lifecycle.
 Schema-on-read is the application of a data schema through preparation steps
such as transformations, cleansing, and integration at the time the data is read
from the database.
 Computational portability is the movement of the computation to the location of
the data.
Terms[3] – 2/
 Transaction processing is a style of computing that divides work into individual,
indivisible operations, called transactions.
 Relational databases have traditionally supported the ACID transaction model.
ACID transactions are:
 Atomic: Either all of the actions in a transaction are completed (i.e., transaction is
committed) or none of them are completed (i.e., transaction is rolled back).
 Consistent: The transaction must begin and end with the database in a consistent state
and must comply with all protocols (i.e., rules) of the database.
 Isolated: The transaction will behave as if it is the only operation being performed upon
the database.
 Durable: The results of a committed transaction can survive system malfunctions.
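Atomicity and consistency are easy to observe with Python's built-in `sqlite3` module. The sketch below is only an illustration: the `accounts` table, the `transfer` helper, and the balances are invented. A transfer that would break a CHECK rule is rolled back as a whole, including the update that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src: str, dst: str, amount: int) -> bool:
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates CHECK, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to `bob` is applied first and then undone when the debit violates the rule, so the balances stay exactly as the first transfer left them.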
 The BASE acronym is often used to describe the types of transactions typically
supported by nonrelational databases. A BASE system is described in contrast to
ACID-compliant systems as:
 Basically Available, Soft state, and Eventually Consistent
 BASE transactions allow a database to be in a temporarily inconsistent state that will
eventually be resolved.
Terms[3] – 3/
 CAP Theorem states that a distributed system can support only two of the
following three characteristics:
 Consistency: The client perceives that a set of operations has occurred all at once.
 Availability: Every operation must terminate in an intended response.
 Partition tolerance: Operations will complete, even if individual components are
unavailable.
Terms[3] – 4/
1. Webopedia:
2. Gartner Big Data Article: Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner.
Retrieved 21 June 2012.
3. NIST definitions:
4. Extreme Big Data:
5. Presto Project:
6. Hadoop Project:
7. Xoriant Big Data Report:
8. Big Data Article:
9. Big Data Article at Data science central:
10. Big Data Article by IBM:
11. Big Data Article:
12. Dryad Project:
13. Data Variety:
14. Data Growth Article:
15. Cloudera Modern Data Operating System:
16. Google BigTable:
17. Google Spanner:
18. Apache Hbase:
19. Mongo DB:
20. BSON Specs:
21. JSON Specs:
22. LexisNexis HPCC:
23. Definition of Big Data Analytics:
24. Big Data, Mining, and Analytics: Components of Strategic Decision Making, Mar 2014, Stephan
Kudyba, CRC Press.
25. Big Data Use Cases:
26. Big Data Analytics with R and Hadoop, Vignesh Prajapati, PACKT publishing.
27. IBM Article: Transforming Energy and Utilities through Big Data & Analytics:
28. Apache Spark:
29. Zookeeper:
30. Neo4j Database:
31. Apache Hadoop:
32. HDFS:
33. YARN:
34. MapReduce:
35. MapReduce@Wiki:
36. Investments in Big Data:
37. Big Data Challenges:

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
The architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdfThe architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdf
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...

Introduction to Big Data

  • 1. Introduction to Big Data Vipin Batra
  • 2. What is Big Data? Definitions: Webopedia [1]: Big data is used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques. Gartner [2]: Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
  • 3. What is Big Data? Definitions: National Institute of Standards and Technology (USA) [3]: Big Data consists of extensive datasets, primarily in the characteristics of volume, velocity, and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis. Data set characteristics that force a new architecture are: 1. the data-at-rest characteristics of Volume and Variety (i.e., data from multiple repositories, domains, or types), and 2. the data-in-motion characteristics of Velocity (i.e., rate of flow) and Variability (i.e., the change in velocity). These characteristics are known as the 'V's' of Big Data.
  • 5. Big Data - 4Vs. IBM's 4Vs of Big Data: Volume (Data at Scale): terabytes to petabytes of data. Variety (Data in Many Forms): structured, unstructured, text, multimedia. Velocity (Data in Motion): analysis of streaming data to enable decisions within fractions of a second. Veracity (Data Uncertainty): managing the reliability and predictability of inherently imprecise data types.
  • 6. More Vs of Big Data [10,3]. Validity: refers to the quality of data and its accuracy for the purpose for which it is collected. Volatility: the tendency for data structures to change over time; with real-time data you need to determine at what point data is no longer relevant to the current analysis. Value: the value delivered through the insight provided by Big Data analytics.
  • 7. Big Volume [4] – 1/ 1 KB = 1024 Bytes = 2^10 Bytes; 1 MB = 1024 KB = 2^20 Bytes; 1 GB = 1024 MB = 2^30 Bytes; 1 TB = 1024 GB = 2^40 Bytes; 1 PB = 1024 TB = 2^50 Bytes; 1 EB = 1024 PB = 2^60 Bytes; 1 ZB = 1024 EB = 2^70 Bytes; 1 YB = 1024 ZB = 2^80 Bytes = 1,208,925,819,614,629,174,706,176 Bytes. • 2^90 Bytes = Brontobytes, Hellabytes, or Ninabytes? • 2^100 Bytes = Geopbytes, Gegobytes, or Tenabytes? 1 GeopByte = 1024 BB = 2^100 Bytes = 1,267,650,600,228,229,401,496,703,205,376 Bytes
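The unit ladder on this slide is a sequence of powers of two, each unit being 1024 times the previous one. A minimal Python sketch of that arithmetic follows; the names `UNITS` and `unit_size_bytes` are illustrative and not part of the original deck.

```python
# Binary storage units: each unit is 1024 (2**10) times the previous one.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def unit_size_bytes(unit: str) -> int:
    """Return the number of bytes in one `unit`, using 1024-based multiples."""
    exponent = 10 * (UNITS.index(unit) + 1)  # KB -> 2**10, MB -> 2**20, ...
    return 2 ** exponent

for unit in UNITS:
    exponent = 10 * (UNITS.index(unit) + 1)
    print(f"1 {unit} = 2**{exponent} bytes = {unit_size_bytes(unit):,} bytes")
```

Python integers have arbitrary precision, so the 2^80 and 2^100 values quoted on the slide can be checked exactly.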
  • 13. Big Variety[13,14]  ~90% of all data is unstructured and it is growing faster than structured data.
  • 16. Limitations of Existing Systems [15]. Existing storage and processing systems have the following limitations with respect to handling Big Data: Way too much data to process within acceptable time limits (network bottlenecks, compute bottlenecks). Data needs to be structured before storing: months are needed to design and implement new schemas every time a new business need arises. Archived data is hard to retrieve: it is not trivial to locate archive tapes and find the relevant data on them.
  • 17. Characteristics of Big Data Systems. Distributed Computing: horizontal scaling instead of vertical scaling. Computations are done close to where the data is stored. Instead of a centrally located parallel computing architecture with super-computing capabilities (giga/teraflops), many low-capacity distributed storage/computing nodes are used. Use of Low-Cost Commodity Hardware: Big Data solutions use large numbers of low-cost commodity machines, organized in clusters, to carry out storage and computing tasks. Reliability, Fault Tolerance and Recovery: individual nodes can fail at any time, so data is replicated across multiple nodes to ensure reliability. Scaling with Demand: the solutions are scalable and allow cluster sizes to grow as required. Storage of Unstructured Data: traditional RDBMS systems require a well-defined schema to be created before data can be stored (schema on write). A new data storage paradigm, 'NoSQL', has evolved to cater to the need to store any type of data; it provides schema on read, i.e. the schema is applied when the data is read. No Archiving: data is always online, so there is no archiving. Big Data solutions do not assume which data the queries will use, so the rule is to store all data in raw form.
  • 19. Comparison: Traditional vs Big Data Storage [15] – 1/2. Key points to note:
      # | Parameter | Traditional Systems | Big Data Storage
      1 | Schema | Schema on Write: schema must be created before data can be loaded | Schema on Read: data is simply stored, no transformation
      2 | Transformation | An explicit load operation has to take place, which transforms data to the DB's internal structure | A SerDes (Serializer/De-serializer) is applied at read time to extract the required columns
      3 | Storage Mechanism | Single seamless store of data, mostly on a single machine/location | Distributed storage across multiple nodes/locations
      4 | Distillation (organizing data for read) | Data is already distilled, as it is in a structured format | Done on demand based on business needs, allowing new patterns and relationships to be identified in existing data
  • 20. Comparison: Traditional vs Big Data Storage – 2/2. Key points to note (contd.):
      # | Parameter | Traditional Systems | Big Data Storage
      5 | Store Process | Data is stored after preparation (for example, after the extract-transform-load and cleansing processes) | 1. In a high-velocity use case, the data is prepared and analyzed for alerting, and only then is it stored. 2. In a volume use case, the data is often stored in the raw state in which it was produced.
      6 | Insights | Analysis needs to be defined upfront and hence is rigid with respect to the business need | Ability to analyze data as required; allows for data exploration and so enables the discovery of new insights that were not directly visible
      7 | Action | Technically feasible, but not effective due to data latency | Ability to integrate with business decisioning systems for the next best action
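The schema-on-read row of the comparison can be illustrated with a small sketch: records are stored raw, and a deserializer plus a column schema are applied only when the data is read. This is a toy illustration in plain Python, not the API of any particular Big Data store; `raw_store`, `read_with_schema` and the sample fields are hypothetical.

```python
import json

# Raw records are stored exactly as produced; no schema is enforced on write.
raw_store = [
    '{"user": "alice", "amount": "12.50", "country": "IN"}',
    '{"user": "bob", "amount": "7", "device": "mobile"}',
]

def read_with_schema(raw_records, schema):
    """Apply `schema` (column name -> converter) while reading.

    Columns missing from a record come back as None; extra fields are ignored.
    """
    for line in raw_records:
        record = json.loads(line)  # deserialization happens at read time
        yield {
            column: (convert(record[column]) if column in record else None)
            for column, convert in schema.items()
        }

# The schema is chosen by the reader, per query, not by the writer.
rows = list(read_with_schema(raw_store, {"user": str, "amount": float}))
print(rows)
```

A different query could read the same raw records with another schema (for example `{"user": str, "device": str}`) without reloading or restructuring the stored data.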
  • 21. NoSQL (Not Only SQL) Databases. A NoSQL database belongs to a class of databases that do not use the relational model (tables and rows) for data storage. There are many NoSQL solutions; they are broadly classified as: 1. Key-Value 2. Column-Family 3. Document 4. Graph. • The first three models are aggregate oriented. • An aggregate is a collection of related objects, treated as a single unit.
  • 22. Google BigTable [16]. Google BigTable is a compressed, high-performance, proprietary data storage system used in Google projects. It is a column-family database. BigTable maps two arbitrary string values (row key and column key) and a timestamp (hence a three-dimensional mapping) to an associated arbitrary byte array. It is not a relational database and is better defined as a sparse, distributed, multi-dimensional sorted map.
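The three-dimensional mapping described on this slide, (row key, column key, timestamp) to a byte array, can be sketched as an in-memory map. This is only a conceptual model in plain Python, not BigTable's API or storage format; the class and method names are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, int]  # (row key, column key, timestamp)

class SparseSortedMap:
    """Toy model of a sparse, multi-dimensional sorted map."""

    def __init__(self) -> None:
        # Only cells that were actually written are stored (sparse).
        self._cells: Dict[Cell, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent version of a cell, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def scan(self) -> List[Tuple[Cell, bytes]]:
        """Return all cells ordered by row key, then column key, then timestamp."""
        return sorted(self._cells.items())

table = SparseSortedMap()
table.put("com.example/index", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/index", "contents:html", 2, b"<html>v2</html>")
table.put("com.example/about", "anchor:home", 1, b"About us")
latest = table.get("com.example/index", "contents:html")
```

Keeping several timestamps per cell is what makes the store versioned; the sorted scan is what allows range reads over adjacent row keys.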
  • 23. Apache HBase [18]. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's BigTable. It is a column-oriented DBMS that runs on top of HDFS.
  • 24. Column-Family Database [18]. Apache Cassandra is a column-family database system. It is designed as a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.
  • 25.  MongoDB stores data in the form of documents, which are JSON[21]-like field and value pairs. Documents are analogous to structures in programming languages that associate keys with values (e.g. dictionaries, hashes, maps, and associative arrays). Formally, MongoDB documents are BSON[20] documents. BSON is a binary representation of JSON with additional type information. Document Database [19]  MongoDB supports search by field, range queries, regular expression searches. Queries can return specific fields of documents and also include user-defined JavaScript functions.  Any field in a MongoDB document can be indexed. Secondary indices are also available.
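The document model and the three query styles mentioned on this slide (search by field, range query, regular-expression search, with optional projection of specific fields) can be mimicked with plain Python dictionaries. The sketch below is not MongoDB's query API; the `find` helper and the sample documents are hypothetical.

```python
import re

# Documents are field/value pairs; they need not share the same fields.
documents = [
    {"_id": 1, "name": "Asha", "age": 31, "tags": ["admin"]},
    {"_id": 2, "name": "Ben", "age": 24},
    {"_id": 3, "name": "Anil", "age": 45, "address": {"city": "Pune"}},
]

def find(docs, predicate, fields=None):
    """Return documents matching `predicate`, optionally projected to `fields`."""
    matches = [d for d in docs if predicate(d)]
    if fields is None:
        return matches
    return [{f: d[f] for f in fields if f in d} for d in matches]

# Search by field value.
by_field = find(documents, lambda d: d.get("name") == "Ben")
# Range query, returning only the "name" field.
by_range = find(documents, lambda d: 30 <= d.get("age", 0) <= 50, fields=["name"])
# Regular-expression search on a field.
by_regex = find(documents, lambda d: re.match(r"^A", d.get("name", "")) is not None,
                fields=["_id"])
```

A real document database would use indexes rather than scanning every document, which is why indexing any field (as the slide notes) matters for query performance.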
  • 26. Graph Database [30]. Neo4j is an open-source NoSQL graph database implemented in Java and Scala. The property graph contains connected entities (the nodes), which can hold any number of attributes (key-value pairs). Nodes can be tagged with labels, which in addition to contextualizing node and relationship properties may also serve to attach metadata (index or constraint information) to certain nodes. Relationships provide directed, named, semantically relevant connections between two node-entities.
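The property graph described here (labelled nodes with key-value properties, connected by directed, named relationships) can be modelled in a few lines. This is a conceptual sketch in plain Python, not Neo4j's API or query language; all class, function and sample names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                       # e.g. {"Person"}
    properties: dict = field(default_factory=dict)    # key-value attributes

@dataclass
class Relationship:
    start: Node                                       # relationships are directed:
    type: str                                         # start -[type]-> end
    end: Node
    properties: dict = field(default_factory=dict)

alice = Node({"Person"}, {"name": "Alice"})
bob = Node({"Person"}, {"name": "Bob"})
acme = Node({"Company"}, {"name": "Acme"})

relationships = [
    Relationship(alice, "KNOWS", bob, {"since": 2015}),
    Relationship(alice, "WORKS_AT", acme),
]

def neighbours(node, rel_type):
    """Follow outgoing relationships of the given type from `node`."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]

known_by_alice = [n.properties["name"] for n in neighbours(alice, "KNOWS")]
```

Because relationships are directed, `neighbours(bob, "KNOWS")` is empty here even though Alice knows Bob.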
  • 28.  Big data analytics refers to the process of collecting, organizing and analysing large sets of data to discover patterns and other useful information[23].  Conceptual Framework for Big Data analytics[24]: Big Data Analytics – 1/
  • 29.  The data analytics project life cycle stages[27]: Big Data Analytics – 2/
  • 30.  Following are types of Big Data Analytics[27]: Big Data Analytics – 3/ Diagnostic Analytics Descriptive Analytics Predictive Analytics Prescriptive Analytics
  • 31. Big Data Solutions and Frameworks
  • 32. Apache Hadoop [6]  Apache Hadoop is widely used, open-source software for reliable, scalable, distributed computing. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage
  • 33. Big Data Solutions: Dryad [12]. Microsoft Dryad is an R&D project that provides an infrastructure allowing a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.
  • 34. LexisNexis – HPCC[22]  HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. Thor: Batch Processing Engine Roxie: High Perf. Query Engine
  • 35. Apache Spark [28]: Lightning-fast cluster computing. Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (called the driver program).
  • 36.  The SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos or YARN), which allocate resources across applications  Once connected, first it acquires executors (processes that run computations and store data) on nodes. Next, it sends application code (defined by JAR or Python files passed to SparkContext) to the executors.  Finally, SparkContext sends tasks for the executors to run. Apache Spark [28] Lightning-fast cluster computing
  • 37. Quiz (Match The Following)  Big Data is generally defined by:  HPCC is example of:  Cassandra is a:  MongoDB is a:  One of key Characteristic of Big Data Solution:  Characteristic of Traditional storage system:  NoSQL DB characteristic:  Schema design before Storage (Schema on write)  RDBMS  Column-family Database  Key-Value Database  Graph Database  Document Database  3Vs - Volume, Veracity, Variety  4Vs – Volume, Velocity, Variety, Value  Big Data Solution  Processing is closer to data location  Schema on Read
  • 40. Hadoop – 1/ A Hadoop cluster consists of a set of cheap commodity machines networked together as servers in racks.
  • 41.  Hadoop framework allows for the distributed processing of large data sets across clusters of computers. Can scale up from single servers to thousands of machines each offering local computation and storage.  Designed to detect and handle failures so as to deliver a highly-available service on top of a cluster of computers, each of which may be prone to failures.  The project includes these modules:  Hadoop Common: The common utilities that support the other Hadoop modules.  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.  Hadoop YARN: A framework for job scheduling and cluster resource management.  Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. Hadoop – 2/2 [6, 31]
  • 42. Hadoop Distributed File System – 1/9. HDFS is a fault-tolerant distributed file system. HDFS provides POSIX-like file system features (it relaxes some POSIX requirements in favour of streaming access): file, directory and sub-directory structure; permissions (rwx); access classes (owner, group, others) and a super user. Optimized for storing large files with streaming data access (not random access). The file system keeps a checksum (CRC32 per 512 bytes) of data for corruption detection and recovery. Files are stored across multiple commodity machines in a cluster. Files are divided into uniform-sized blocks (64 MB, 128 MB, 256 MB). Blocks are replicated across multiple machines to handle failures. Provides access to block locations (servers/racks), so computations can be done where the data resides (on the same servers/racks).
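Two of the mechanisms listed on this slide, splitting a file into uniform-sized blocks and keeping a CRC32 checksum per 512-byte chunk, can be sketched with the Python standard library. This is a toy model and not HDFS code; the block size is shrunk to 1024 bytes so the example runs instantly, and all function names are illustrative.

```python
import zlib

BLOCK_SIZE = 1024   # toy value; HDFS blocks are typically 64 MB or larger
CHUNK_SIZE = 512    # one checksum per 512 bytes, as on the slide

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the file contents into uniform blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def chunk_checksums(block: bytes, chunk_size: int = CHUNK_SIZE):
    """Compute a CRC32 for every chunk of a block."""
    return [zlib.crc32(block[i:i + chunk_size]) for i in range(0, len(block), chunk_size)]

def is_corrupt(block: bytes, expected_checksums) -> bool:
    """A block is corrupt if any chunk checksum no longer matches."""
    return chunk_checksums(block) != expected_checksums

file_data = bytes(range(256)) * 10          # 2560 bytes of sample content
blocks = split_into_blocks(file_data)
stored_checksums = [chunk_checksums(b) for b in blocks]

# Simulate a flipped byte in the first block and detect it on read.
damaged = bytes([blocks[0][0] ^ 0xFF]) + blocks[0][1:]
print(is_corrupt(blocks[0], stored_checksums[0]), is_corrupt(damaged, stored_checksums[0]))
```

When a reader detects a mismatch like this, a replicated file system can fall back to another replica of the same block, which is why checksums and replication appear together in the list above.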
  • 43. HDFS – 2/9. HDFS is implemented as a master-slave architecture. The NameNode is the master; a Secondary NameNode periodically checkpoints the NameNode's metadata (it is a checkpointing helper, not a hot standby). DataNodes are the slaves. [Diagram labels: Name Node, Secondary Name Node (checkpoints), Data Nodes, Metadata, Read/Write Commands, Data]
  • 44. HDFS – 3/ The NameNode manages the file system: file system names (e.g. /home/foo/data/ ..) and metadata; maps a file name to a set of blocks; maps a block to the DataNodes where it resides; tracks blocks within the file system and their replicas. It manages the cluster configuration and the DataNodes. In case of NameNode failure, the latest checkpoint kept by the Secondary NameNode can be used to restore the metadata (the Secondary NameNode does not take over automatically). [Diagram labels: DN1 DN2 DN3 DN4 DN5 .. DNN, Files, Meta-Data]
  • 45. HDFS – 4/9  Meta Data  HDFS namespace is hierarchy of files and directories  Entire Meta-data is in Memory  No Demand Paging  Consists of list of blocks for each file, file attributes e.g. access time, replication factor etc.,  Changes to HDFS are recorded in log called ‘Transaction Log’  Block Placement, default 3 replicas, configurable  One replica on local node, Second on remote rack, Third on same remote rack  Additional copies randomly placed  Clients Read from nearest replica
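The NameNode mappings (file name to blocks, block to DataNodes) and the default three-replica placement described above can be sketched as follows. This is a simplified model, not Hadoop's placement code: the rack topology, node names and the `place_replicas` helper are assumptions, and every rack is assumed to hold at least two DataNodes.

```python
import random

# Hypothetical cluster topology: rack -> DataNodes in that rack.
topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}

def rack_of(node):
    """Look up the rack that contains `node`."""
    return next(rack for rack, nodes in topology.items() if node in nodes)

def place_replicas(writer_node, rng=random):
    """Pick 3 DataNodes: the local node, then two different nodes on one remote rack."""
    local_rack = rack_of(writer_node)
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]

# NameNode-style metadata, held entirely in memory.
file_to_blocks = {"/home/foo/data/part-0": ["blk_1", "blk_2"]}
block_to_datanodes = {
    block: place_replicas("dn1")
    for block in file_to_blocks["/home/foo/data/part-0"]
}
```

Placing two replicas on the same remote rack keeps one copy safe from a whole-rack failure while limiting cross-rack write traffic to a single transfer.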
  • 46. HDFS: Data Nodes – 5/9. DataNode: a slave daemon process that reads/writes HDFS blocks from/to files in its local file system. During startup it performs a handshake with the NameNode to verify the namespace and the software version of the DataNode (on a version mismatch, the DataNode shuts down). It periodically sends heartbeats and block reports to the NameNode. A heartbeat carries total storage capacity, fraction used, ongoing data transfers, etc.; these stats are used by the NameNode for block placement and load balancing. A block report has the block ID, timestamp and block length for each replica. The DataNode has no awareness of the HDFS file system. It performs block creation, deletion, replication, shutdown, etc. when the NameNode commands; NameNode commands are sent as replies to the heartbeat messages received. It stores each HDFS block in a separate file of the underlying OS, maintains an optimal number of files per directory, and creates new directories as needed. It interacts directly with clients to read/write blocks.
  • 47. HDFS Write – Sample Code: 6/9. Java/C++ APIs are available to access files on HDFS. Sample code illustrates writing to HDFS as a 3-step process:
  • 48.  Figure below illustrates how write takes place (how blocks and their replicas are updated): HDFS Write Operations: 7/9
  • 49. HDFS Read – Sample Code: 8/9. Sample code illustrates reading from HDFS as a 4-step process:
  • 50.  Figure below illustrates how read takes place HDFS Read Operations: 9/9
  • 51. YARN - Yet Another Resource Negotiator [33]. Manages compute resources across the cluster. Consists of the following components: Resource Manager (RM): manages and allocates cluster compute resources. Node Manager (NM), one on each node: manages and enforces node resource allocations. Application Master (AM), one per application: manages the application lifecycle and task scheduling. Container: the basic unit of allocation; allows fine-grained resource allocations.
  • 52. YARN: Resource Manager. Manages nodes: tracks heartbeats from NodeManagers. Manages containers: handles AM requests for resources; de-allocates containers when they expire or the application completes. Manages AMs (ApplicationMasters): creates a container for each AM and tracks its heartbeats. Manages security: supports Kerberos.
  • 53. YARN: NodeManager. A Node Manager resides on each node. It registers with the ResourceManager (RM) and provides info on node resources; sends periodic heartbeats and container status. Manages processes in containers: launches AMs on request from the RM; launches application processes on request from an AM; monitors resource usage by containers and kills rogue processes. Provides logging services to applications: aggregates logs for an application and saves them to HDFS. Maintains node-level security via ACLs.
  • 54. YARN: Containers and AMs. Container: created by the Resource Manager upon request; allocates a certain amount of resources (CPU, memory); applications run in one or more containers. Application Master (AM): one per application; framework/application specific; runs in a container; requests more containers to run application tasks.
  • 55. YARN: Starting an App: 1/ The client requests the RM to launch an application; the RM launches an Application Master on one NodeManager.
  • 56. YARN: Starting an App: 2/ The Application Master (AM) requests resources from the RM; the RM allocates resources on Node Managers. The RM confirms the resource allocations to the AM with details, and the AM launches the app. Resource Request: Resource Name (hostname, rack#); Priority (within this app); Resource Required: memory (MB), CPU (# of cores), etc.; Number of Containers. Allocation: Container ID and node (e.g. C1@NM1, C2@NM2). Container Launch Context: Container ID; Commands (to start MyApp); Environment (configuration); Local Resources (e.g. MyApp binary, HDFS files). [Diagram nodes: NM1, NM2, NM3, NM4]
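The resource request and the resulting container-to-node assignment on this slide can be sketched as a small data structure plus a first-fit allocator. This is an illustration only: it is not YARN's API or its scheduler, the first-fit policy is an assumption, and the field names, the use of "*" to mean any node, and the sample capacities are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    resource_name: str   # hostname, or "*" for any node (assumed convention)
    priority: int        # priority within this application
    memory_mb: int       # memory required per container
    vcores: int          # CPU cores required per container
    num_containers: int

def allocate(request, free_resources):
    """First-fit allocation: returns a list of (container_id, node) pairs."""
    granted = []
    for node, free in free_resources.items():
        if request.resource_name not in ("*", node):
            continue
        while (len(granted) < request.num_containers
               and free["memory_mb"] >= request.memory_mb
               and free["vcores"] >= request.vcores):
            free["memory_mb"] -= request.memory_mb   # reserve the resources
            free["vcores"] -= request.vcores
            granted.append((f"C{len(granted) + 1}", node))
    return granted

# Free capacity reported by two hypothetical NodeManagers.
free_resources = {
    "NM1": {"memory_mb": 2048, "vcores": 1},
    "NM2": {"memory_mb": 4096, "vcores": 2},
}
request = ResourceRequest("*", priority=1, memory_mb=2048, vcores=1, num_containers=2)
containers = allocate(request, free_resources)
```

With these capacities NM1 can host only one container, so the second one lands on NM2, matching the C1@NM1, C2@NM2 example on the slide.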
  • 57.  MapReduce, originally proprietary Google technology, is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:  Analyzing application logs  Aggregating related data from external sources  Transforming data from one format to another  Exporting data for external analysis Map-Reduce[34]
  • 58. MapReduce Operation: Input from DB/HDFS -> Map -> Shuffle -> Reduce -> Output to DB/HDFS
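The Map, Shuffle and Reduce stages can be sketched on a single machine with a word count over a few log lines, one of the log-analysis style tasks mentioned earlier. This is a minimal illustration of the programming model in plain Python, not Hadoop's MapReduce API; the function names and sample input are hypothetical.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (key, value) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

log_lines = ["ERROR disk full", "INFO job started", "ERROR disk failed"]
counts = reduce_phase(shuffle_phase(map_phase(log_lines)))
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network; the per-key independence of the reduce step is what makes that parallelism possible.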
  • 59.  Zookeeper exposes primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.  Clients connect to servers to access name space which is much like that of a standard file system to store/retrieve co-ordination data - status information, configuration, location information, etc., data is usually small, in the byte to kilobyte range.  Guarantees:  Sequential Consistency - Updates will be applied in the order that they were sent.  Atomicity - Updates either succeed or fail. No partial results.  Single System Image – Same view of the service regardless of the server used  Reliability - Once an update has been applied, it persists until updated.  Timeliness - View of the system is guaranteed to be up-to-date within a time bound. Zookeeper: A Distributed Coordination Service for Distributed Applications[29]
  • 60. Big Data – Business Trends
  • 62. Impact of Big Data on Economy
  • 63.  Top 10 Big Data Challenges Big Data Challenges
  • 65. Big Data Market Forecast
  • 66. Big Data: Revenues [Bar chart: 2013 Big Data Revenue ($ millions)]
  • 67.  Government Operation: National Archives and Records Administration, Census Bureau  Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)  Defense: Sensors, Image surveillance, Situation Assessment  Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity  Deep Learning and Social Media: Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets  The Ecosystem for Research: Metadata, Collaboration, Language Translation, Light source experiments  Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan  Earth, Environmental and Polar Science: Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors  Energy: Smart grid Big Data Applications
  • 68. Emerging Trends in Big Data. Real-Time Analytics: banking and finance, disaster detection and recovery, event monitoring and similar applications need vast amounts of data, arriving at a very fast pace, to be processed within strict time limits. Artificial Intelligence/Business Intelligence. Intelligent Maintenance Systems: systems that use data collected from machinery to predict and prevent potential failures. IoT/M2M: these applications generate data at a very fast rate (high velocity) from a huge number of sources (high volume) and require big data solutions to process it and derive meaningful information. Transreality gaming, sometimes written as trans-reality gaming, describes a type or mode of gameplay that combines playing a game in a virtual environment with game-related physical experiences in the real world, and vice versa.
  • 69.  Cloud computing advances have helped Big Data emerge as a mass scale solution  Leased/Rented data storage, computing clusters, enable even startups to have global scale Big Data capability, without major capital investment Emerging Trends in Cloud Computing – Complementary Technologies
  • 70. Terms [3] – 1/ Massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program. The Big Data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets. Big Data Engineering: advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis. NoSQL: non-relational models, also known as NoSQL, refer to logical data models that do not follow relational algebra for the storage and manipulation of data. A federated database system is a type of meta-database management system (DBMS) which transparently maps multiple autonomous database systems into a single federated database.
  • 71.  The data science paradigm is extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.  The data lifecycle is the set of processes that transform raw data into actionable knowledge.  Analytics is the extraction of knowledge from information.  Data science is the construction of actionable knowledge from raw data through the complete data lifecycle process.  A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data lifecycle.  Schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.  Computational portability is the movement of the computation to the location of the data. Terms[3] – 2/
  • 72. Terms [3] – 3/ Transaction processing is a style of computing that divides work into individual, indivisible operations, called transactions. Relational databases have traditionally supported the ACID transaction model. ACID transactions are: Atomic: either all of the actions in a transaction are completed (i.e., the transaction is committed) or none of them are completed (i.e., the transaction is rolled back). Consistent: the transaction must begin and end with the database in a consistent state and must comply with all protocols (i.e., rules) of the database. Isolated: the transaction will behave as if it is the only operation being performed upon the database. Durable: the results of a committed transaction can survive system malfunctions. The BASE acronym is often used to describe the types of transactions typically supported by non-relational databases. A BASE system is described, in contrast to an ACID-compliant system, as: Basically Available, Soft state, and Eventually consistent. BASE transactions allow a database to be in a temporarily inconsistent state that will eventually be resolved.
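The atomicity property can be demonstrated with Python's built-in `sqlite3` module: a transfer made of two updates either commits as a whole or is rolled back as a whole. The table, the account names and the `transfer` helper are made up for this sketch; it illustrates ACID behaviour in one relational engine and says nothing about how BASE systems are implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"   # rule the database must enforce
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

overdraft = transfer(conn, "alice", "bob", 500)  # violates the CHECK constraint
valid = transfer(conn, "alice", "bob", 30)
final_balances = balances(conn)
```

In the failed transfer the credit to the destination account has already been applied when the debit violates the constraint; the rollback undoes it, so no partial result is ever visible.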
  • 73.  CAP Theorem states that a distributed system can support only two of the following three characteristics:  Consistency The client perceives that a set of operations has occurred all at once.  Availability Every operation must terminate in an intended response.  Partition tolerance Operations will complete, even if individual components are unavailable. Terms[3] – 4/
  • 74. References: 1. Webopedia: 2. Gartner Big Data Article: Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner. Retrieved 21 June 2012. 3. NIST definitions: release.pdf 4. Extreme Big Data: zettabytes-and-yottabytes/ 5. Presto Project: 6. Hadoop Project: 7. Xoriant Big Data Report: 8. Big Data Article: 9. Big Data Article at Data Science Central: 10. Big Data Article by IBM: 11. Big Data Article: data-veracity/ 12. Dryad Project: 13. Data Variety:
  • 75. References: 14. Data Growth Article: Going-After-The-Massive-Amount-Of-Unstructured-Data-Theyre-Collecting/articleshow/31055495.cms 15. Cloudera Modern Data Operating System: apache-hadoop-the-modern-data-operating-system-stanford-ee380 16. Google BigTable: 17. Google Spanner: 18. Apache HBase: 19. MongoDB: 20. BSON Specs: 21. JSON Specs: 22. LexisNexis HPCC: 23. Definition of Big Data Analytics: 24. Big Data, Mining, and Analytics: Components of Strategic Decision Making, Mar 2014, Stephan Kudyba, CRC Press. 25. Big Data Use Cases: 26. Big Data Analytics with R and Hadoop, Vignesh Prajapati, Packt Publishing.
  • 76. References: 27. IBM Article: Transforming Energy and Utilities through Big Data & Analytics: 28. Apache Spark: 29. ZooKeeper: 30. Neo4j Database: 31. Apache Hadoop: 32. HDFS: 33. YARN: 34. MapReduce: 35. MapReduce@Wiki: 36. Investments in Big Data: 37. Big Data Challenges: