Big Data-Introduction
N.Jagadish Kumar
Assistant Professor
Velammal Institute of Technology
What is Big data?
• Big data refers to extremely large volumes of data.
• That answer immediately raises a question: what counts as huge? 100 GB or 100 TB?
Ans: There is no hard number that defines what is huge, or what is big data.
…..
• There are two reasons for this.
1. What is considered big in terms of volume today need not be big a year from now, because it is very much a moving target.
2. A volume we consider big may not be big at all for companies like Google and Facebook.
Hence, for these reasons, it is very hard to put a number on what volume counts as big data.
Let's define big data by volume alone
• Let's start with a few sizes.
• 100 GB: we have hard disks larger than 100 GB, so this is clearly not big data.
• 1 TB: still not, because a well-designed traditional database can handle 1 TB or even more without any issues.
• 100 TB: maybe; some will claim it is a big data problem, others won't agree.
…
• 1,000 TB: the scale is now petabytes, so it is definitely big data.
Note: volume is not the only factor that decides whether your data is big data or not; there are other factors too.
What are the characteristics of
Big data along with a use case?
• Let's say we start an e-mail service company, and our e-mail service is so good it is even better than Gmail.
• In 3 months we have 100,000 users logging in and using it actively.
• But we use a traditional database to store e-mail messages and their attachments.
• The size of the database is 1 TB.
…..
• Is there a big data problem in the above scenario?
• The answer is no, because 1 TB is not large enough to classify as big data.
• Another question: at this growth rate, will the company have a big data problem in the near future?
• To answer that, we have to consider three factors.
…..
1. Volume: the obvious factor. In 3 months we have 100,000 active users and 1 TB of data. If this continues at the same growth rate, by the end of year 1 the company will have 400,000 active users and 4 TB of data, and by the end of year 2, at the same rate, 8 TB.
2. And what will the volume be if the user base doubles or triples every 3 months? A rough projection is sketched below.
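As a hypothetical illustration (the growth models and numbers are assumptions, not from the slides), the sketch below contrasts steady growth of 1 TB per quarter with a service whose quarterly data volume doubles; only the growth model changes, but the outcome after two years is dramatically different.

  public class GrowthProjection {
    public static void main(String[] args) {
      for (int quarter = 1; quarter <= 8; quarter++) {          // two years = 8 quarters
        double linearTb = quarter * 1.0;                        // +1 TB per quarter
        double doublingTb = Math.pow(2, quarter - 1);           // 1 TB in Q1, doubling every quarter
        System.out.printf("Quarter %d: linear = %.0f TB, doubling = %.0f TB%n",
                          quarter, linearTb, doublingTb);
      }
      // Linear growth ends year 2 at 8 TB (as on the slide); doubling ends at 128 TB,
      // which is where volume and velocity start to look like a big data problem.
    }
  }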
…
• So the bottom line is that volume alone is not enough; we also have to think about the rate at which the data grows. This is called velocity, or speed.
• Velocity tells you how fast your data is growing. If the data volume stays at 1 TB per year, all you need is a good database to process it. If the data volume grows by 1 TB every week, then you have to think about a scalable big data solution.
• Most of the time, volume and velocity together are used to decide whether you have a big data problem or not.
Variety is another important factor
• Variety: data in a traditional database system is highly structured, i.e. rows and columns.
• But in our e-mail service company we will receive data in different formats.
• Text data in the form of messages, and images and videos as attachments.
…..
• When you want to analyze or process data arriving in these different formats, traditional database systems fall short.
• And when variety is combined with high volume and velocity, we will certainly have a big data problem.
Volume, Velocity & Variety - the 3 V's
These 3 V's will guide you in deciding whether the problem you are facing is a big data problem or not.
Big data problems
• Big data problems exist in virtually every domain:
1. Science
2. Government
3. Private industry
Science
• Large Hadron Collider (LHC): the world's largest and highest-energy particle collider and the largest machine in the world. It was built by the European Organization for Nuclear Research (CERN) between 1998 and 2008 in collaboration with over 10,000 scientists and hundreds of universities and laboratories from more than 100 countries.
…..
• Large Hadron Collider: it produces about 1 petabyte of data every second. The volume is so huge that they don't even retain or store all the data they produce.
• NASA: it produces about 1.73 GB of data every hour, covering weather, geolocation data from satellites, etc.
Government
• NSA (National Security Agency): known for its controversial data collection.
• It stores this data in the Utah Data Center, a data storage facility for the United States Intelligence Community, designed to hold data on the order of a yottabyte (1 trillion terabytes); a quick unit check follows.
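As a quick check of that conversion: 1 yottabyte = 10^24 bytes and 1 terabyte = 10^12 bytes, so 1 yottabyte = 10^24 / 10^12 = 10^12 terabytes, i.e. one trillion terabytes, as stated above.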
…
• In March 2012, the Obama administration announced about 200 million dollars in big data initiatives.
• Obama's second-term campaign used big data analytics, which gave it a competitive edge in winning the election.
Private
• With the advent of social media such as Facebook, Twitter, and LinkedIn, there is no scarcity of data.
• eBay is known to have a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
• Facebook is known to have 30 PB Hadoop clusters, with 50 billion photos and 130 TB of logs every day.
….
• Big data is not only produced and analyzed in social media companies, but also in the retail space.
• The most common practice on retail websites is to capture click-stream data.
• Example: when you shop on Amazon.com.
• Amazon does not capture data only when you click checkout; every click on the website is tracked to deliver a personalized shopping experience.
• When Amazon shows you recommendations, the work behind them is big data analytics.
Big Data Challenges
Big data comes with big problems.
• Storage: storage must be efficient and suitable for computation.
• Data loss: it may happen due to data corruption or hardware failure, so you need proper recovery strategies.
• Computational efficiency: computing over big data sets within a reasonable execution time is a challenge.
• Cost: the solution we use to analyze and process big data in a reasonable time should be cost effective.
What is the need for a new solution like Hadoop?
• Why is a traditional database solution like MySQL or Oracle not a good fit?
…..
• In a traditional RDBMS you will see scalability issues when the data volume moves into the terabytes.
• As the data gets bigger you will be forced to make changes such as re-indexing, optimizing the queries, and denormalizing and aggregating the data for faster execution.
• Even when you have enough hardware for your database, if you still face performance issues you have no choice other than optimizing the queries.
…
• Traditional database systems are not horizontally scalable. You cannot add more resources or more computation nodes and distribute the problem across them to improve execution time or performance.
• The next big problem is that traditional database systems are designed to process only structured data. They are not a good choice when your data comes in different formats such as text, video, and images.
…
• Distributed computing: distributed computing solutions like grid computing run many nodes that operate on data in parallel, and hence compute faster.
But there are two challenges:
1. Grid computing is high-performance computing: it is good for compute-intensive tasks with relatively low volumes of data, but it does not perform well when the data volume is huge.
2. Grid computing requires solid experience with low-level programming to implement.
So Hadoop is a good solution.
…
1. It can handle huge volumes of data and store them efficiently, for both storage and computation.
2. It has a good recovery solution for data loss.
3. It scales horizontally, and that is very important: as your data gets bigger you can add more nodes and everything keeps working seamlessly.
4. It is cost effective: we don't need any specialized hardware to run Hadoop, and hence it is great even for startups.
…
5. Finally, it is easy to learn and implement.
This does not mean RDBMS is no good: there are things Hadoop is good at, and there are things a database is good at.
….
UNDERSTANDING A BIG DATA PROBLEM
• We will take a sample big data problem, analyze it, and see how to arrive at the solution.
1. Imagine you are working for a major stock exchange company, and the data here represents the details of each stock traded on a given date. Its size is 1 TB.
2. Now you have been given this problem and asked for a solution.
Now, how will you figure it out?
• The first thing is storage and the second thing is computation.
• Let's start with storage.
• Your workstation has only 20 GB of free space, but the size of the stock exchange data is 1 TB.
• So you will call the storage team of your organization to copy the data to a NAS (Network Attached Storage) server or a SAN (Storage Area Network) server.
….
• NAS and SAN are connected to the network, so any computer with access to the network can access the data stored in them, provided it has permission.
• Now the data is stored and we have access; next is computation.
• You are a Java programmer, so you write and optimize a Java program to parse the dataset and perform the computation (a minimal sketch of such a program follows).
• Now everything is ready for execution.
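The sketch below shows the kind of single-machine program described above. The column layout ("symbol,date,closingPrice") and the file path are assumptions made for illustration, not the deck's actual dataset format; it simply finds the highest closing price seen for each stock symbol.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.HashMap;
  import java.util.Map;

  public class StockParser {
    public static void main(String[] args) throws IOException {
      Map<String, Double> maxPrice = new HashMap<>();
      // The whole 1 TB file has to stream through this single reader; this is the
      // bottleneck the next slides quantify.
      try (BufferedReader reader = Files.newBufferedReader(Paths.get("/mnt/nas/stocks.csv"))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split(",");
          if (fields.length < 3) continue;           // skip malformed records
          String symbol = fields[0];
          double price = Double.parseDouble(fields[2]);
          maxPrice.merge(symbol, price, Math::max);  // keep the highest price per symbol
        }
      }
      maxPrice.forEach((symbol, price) -> System.out.println(symbol + " " + price));
    }
  }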
….
• For the program to work on the dataset, the data must first be read from storage into working memory.
• So how long does it take to read a one-terabyte dataset from storage?
• Let's take our traditional hard disk drive (HDD).
• When you request a read, the head in the hard disk drive positions itself on the platter and starts transferring data from the platter to the head.
• The speed at which data transfers from the platter to the head is called the data access rate.
….
• The average data access rate of an HDD is usually around 122 MB/s.
• At that rate, reading a 1 TB file from an HDD takes about 2 hours and 22 minutes (the rough arithmetic is shown below). That is for an HDD connected to our laptops or workstations.
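As a rough check, assuming binary units (1 TB ≈ 1,048,576 MB): 1,048,576 MB ÷ 122 MB/s ≈ 8,595 s ≈ 143 minutes, i.e. roughly 2 hours 23 minutes, which is where the figure of about 2 hours 22 minutes comes from.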
So what is the data access rate of the HDDs in NAS or SAN servers?
• Let us assume the same data access rate.
• So it takes about 2 hours 22 minutes to read 1 TB of data.
• What will the network bandwidth between the server and the workstation be?
• And what about the computation time? Since we have not executed the program even once, we cannot predict the execution time.
• So the expected time to arrive at (ETA) the solution is approximately 3 hours.
So the business user who gave you the problem is shocked to hear that the ETA is more than 3 hours.
• He will ask the next question: can you deliver it in 30 minutes?
• You know there is no way you can produce the results in 30 minutes.
• Of course the business cannot wait for three hours, especially in finance.
• Time is money, right?
I have an idea
Why not try an SSD instead of an HDD?
SSD (Solid State Drive)
• An SSD has no magnetic platters or heads; it has no moving components and is based on flash memory.
• So it is extremely fast, and it significantly reduces the time it takes to read data from storage.
• Sounds great.
But there is a problem with SSDs
• An SSD usually costs 5 to 6 times as much as a regular HDD.
• Though we can expect prices to go down, at the data volumes we are talking about for big data it is not a viable option right now.
• So for now we are stuck with HDDs.
So let’s think about how to reduce the
computation time of the program
…
• We have already optimized the program for execution.
• It will take approximately 60 minutes to execute.
• So what can be done next?
Another idea
• How about dividing the 1 TB dataset into 100 equal-sized blocks and having 100 computers do the computation in parallel?
• This means you cut the data read time by a factor of 100, and the computation time also by a factor of 100.
• With this idea you can bring the data access time to under 2 minutes and the computation time to under 1 minute (see the rough arithmetic below).
• So it is a promising idea.
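Using the earlier numbers, and assuming the work parallelizes perfectly with no coordination overhead: each node reads 1/100 of the data, about 8,595 s ÷ 100 ≈ 86 s, which is under 2 minutes, and the 60-minute computation split 100 ways takes 60 min ÷ 100 = 0.6 min, which is under 1 minute.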
But the problem here is
• Hundreds of nodes try to read data from the storage over the network.
• Example:
• What happens if every member of your family at home starts streaming their favorite TV show or movie on Netflix over a single internet connection?
• It results in a very poor streaming experience with a lot of buffering. No one can enjoy the show, because we exceed the network bandwidth allocated by the ISP.
• This is exactly what will happen in our case when hundreds of nodes try to access the storage over the network.
So bring the Storage closer to
Computation
It means storing data locally on each node's hard disk: block 1 of the data on node 1, block 2 on node 2, and so on. Now we achieve truly parallel reads on all 100 nodes, and we have also eliminated the network bandwidth issue.
So another problem is data loss due to hard disk failure. To overcome this, replicate the blocks on multiple nodes.
Here nodes 1, 2 and 3 hold copies of block 7; similarly, nodes 1 and 3 hold copies of block 10. Even if node 1 is corrupted or its hard disk crashes, copies of its blocks are still available on node 3 (a toy sketch of the bookkeeping this requires follows).
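The sketch below is a toy illustration, not Hadoop's actual implementation: some coordinator has to remember which nodes hold a replica of each block, so that a failed node's blocks can still be read from the survivors.

  import java.util.List;
  import java.util.Map;

  public class ReplicationMapDemo {
    public static void main(String[] args) {
      // block id -> nodes holding a replica (the example blocks from the slide)
      Map<Integer, List<Integer>> blockToNodes = Map.of(
          7,  List.of(1, 2, 3),
          10, List.of(1, 3));

      int failedNode = 1;
      // If node 1 fails, every block it held is still readable from another replica.
      blockToNodes.forEach((block, nodes) ->
          nodes.stream()
               .filter(n -> n != failedNode)
               .findFirst()
               .ifPresent(survivor ->
                   System.out.println("Block " + block + " still available on node " + survivor)));
    }
  }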
But there are some challenges in implementing this architecture.
• How does node 1 know that node 3 has a copy of block 7?
• And who decides that block 7 should be stored on nodes 1, 2 and 3?
• Who will break the 1 TB into 100 blocks? These are the challenges on the storage side.
• When it comes to the computation side, node 1 can compute only a part of the result, and similarly nodes 2 and 3. The individual computations from all the nodes must be consolidated to obtain the final output.
• So who is going to coordinate all of that?
Consolidated computation from
multiple nodes.
We have seen several complexities involved in both the storage layer and the computation layer when implementing this architecture.
To overcome all these complexities we have a solution called Hadoop.
Hadoop has two core components
HDFS & MapReduce
HDFS - Hadoop Distributed File System
• It takes care of all the storage-related complexities, like splitting datasets into blocks.
• It replicates each block to more than one node, and also keeps track of which block is stored on which node (a minimal usage sketch is shown below).
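The sketch below is a minimal example of loading our dataset into HDFS through its Java API. The namenode address and the file paths are hypothetical placeholders; the API calls themselves (Configuration, FileSystem, copyFromLocalFile) are standard Hadoop classes.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical namenode address
      conf.set("dfs.replication", "3");                 // keep 3 copies of every block

      try (FileSystem fs = FileSystem.get(conf)) {
        // HDFS splits the file into blocks, spreads them across the datanodes, and
        // records which node holds which block; none of that is our code's job.
        fs.copyFromLocalFile(new Path("/data/stocks.csv"),
                             new Path("/user/analyst/stocks.csv"));
      }
    }
  }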
MapReduce
• MapReduce is a programming model.
• It takes care of all the computational complexities.
• The Hadoop framework brings the intermediate results from every single node together to produce a consolidated computational result (a minimal sketch for our stock example follows).
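The sketch below is a minimal MapReduce job for the stock scenario, not the deck's actual program: it assumes each input line is a hypothetical CSV record "symbol,date,closingPrice" and finds the highest closing price per symbol. The mapper runs on every node against its local blocks, and Hadoop shuffles the emitted pairs so that each reducer sees all the prices for one symbol.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MaxStockPrice {

    // Map phase: runs on every node against its local blocks; emits (symbol, price).
    public static class PriceMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length >= 3) {
          context.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[2])));
        }
      }
    }

    // Reduce phase: Hadoop groups all prices for a symbol and consolidates them here.
    public static class MaxReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      @Override
      protected void reduce(Text symbol, Iterable<DoubleWritable> prices, Context context)
          throws IOException, InterruptedException {
        double max = Double.NEGATIVE_INFINITY;
        for (DoubleWritable p : prices) {
          max = Math.max(max, p.get());
        }
        context.write(symbol, new DoubleWritable(max));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "max stock price");
      job.setJarByClass(MaxStockPrice.class);
      job.setMapperClass(PriceMapper.class);
      job.setReducerClass(MaxReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. HDFS output directory
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }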
Hadoop
A framework for distributed processing of large datasets across clusters of commodity computers.
What is meant by commodity
computers here?
• Commodity computers means inexpensive, off-the-shelf hardware.
• It does not mean cheap, unreliable hardware.
Now, back to the business problem
• You can propose the Hadoop framework to the business user to reduce the execution time and solve the problem.
• You don't even need a 100-node cluster to solve the problem; a 10-node cluster is enough.
• If you want to reduce the execution time further, all you have to do is add more nodes to the cluster.
• Hadoop, as noted, is horizontally scalable.