Big Data-Introduction
N.Jagadish Kumar
Assistant Professor
Velammal Institute of Technology
What is Big data?
• Big data refers to extremely large volumes of data.
• That answer immediately raises a question: what counts as huge? 100 GB or 100 TB?
Ans: There is no hard number that defines what is huge, or what is big data.
…..
• There are two reasons for this.
1. What is considered big in terms of volume today need not be big a year from now, because it is very much a moving target.
2. A volume we consider big may not be big at all for companies like Google and Facebook.
Hence, for these reasons, it is very hard to put a number on what volume counts as big data.
Let's define big data by volume alone
• Let's start with a few sizes.
• 100 GB: we have hard disks larger than 100 GB, so this is clearly not big data.
• 1 TB: still not, because a well-designed traditional database can handle 1 TB or even more without any issues.
• 100 TB: maybe; some will claim it is a big data problem, others won't agree.
…
• 1,000 TB: the scale is now petabytes, so it is definitely big data.
Note: volume is not the only factor that decides whether your data is big data or not; there are other factors too.
What are the characteristics of
Big data along with a use case?
• Let's say we start an e-mail service company, and our e-mail service is so good it is even better than Gmail.
• In 3 months we have 100,000 users logging in and using it actively.
• But we use a traditional database to store e-mail messages and their attachments.
• The size of the database is 1 TB.
…..
• Is there a big data problem in the above scenario?
• The answer is no, because 1 TB is not large enough to classify as big data.
• Another question: at this growth rate, will the company have a big data problem in the near future?
• To answer that, we have to consider three factors.
…..
1. Volume: the obvious factor. In 3 months we have 100,000 active users and 1 TB of data. If this continues at the same growth rate, by the end of year 1 the company will have 400,000 active users and 4 TB of data, and by the end of year 2, at the same rate, 8 TB.
2. And what will the volume be if the user base doubles or triples every 3 months? A rough projection is sketched below.
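As a hypothetical illustration (the growth models and numbers are assumptions, not from the slides), the sketch below contrasts steady growth of 1 TB per quarter with a service whose quarterly data volume doubles; only the growth model changes, but the outcome after two years is dramatically different.

  public class GrowthProjection {
    public static void main(String[] args) {
      for (int quarter = 1; quarter <= 8; quarter++) {          // two years = 8 quarters
        double linearTb = quarter * 1.0;                        // +1 TB per quarter
        double doublingTb = Math.pow(2, quarter - 1);           // 1 TB in Q1, doubling every quarter
        System.out.printf("Quarter %d: linear = %.0f TB, doubling = %.0f TB%n",
                          quarter, linearTb, doublingTb);
      }
      // Linear growth ends year 2 at 8 TB (as on the slide); doubling ends at 128 TB,
      // which is where volume and velocity start to look like a big data problem.
    }
  }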
…
• So the bottom line is that volume alone is not enough; we also have to think about the rate at which the data grows. This is called velocity, or speed.
• Velocity tells you how fast your data is growing. If the data volume stays at 1 TB per year, all you need is a good database to process it. If the data volume grows by 1 TB every week, then you have to think about a scalable big data solution.
• Most of the time, volume and velocity together are used to decide whether you have a big data problem or not.
Variety is another important factor
• Variety: data in a traditional database system is highly structured, i.e. rows and columns.
• But in our e-mail service company we will receive data in different formats.
• Text data in the form of messages, and images and videos as attachments.
…..
• When you want to analyze or process data arriving in these different formats, traditional database systems fall short.
• And when variety is combined with high volume and velocity, we will certainly have a big data problem.
Volume, Velocity & Variety - the 3 V's
These 3 V's will guide you in deciding whether the problem you are facing is a big data problem or not.
Big data problems
• Big data problems exist in virtually every domain:
1. Science
2. Government
3. Private industry
Science
• Large Hadron Collider (LHC): the world's largest and highest-energy particle collider and the largest machine in the world. It was built by the European Organization for Nuclear Research (CERN) between 1998 and 2008 in collaboration with over 10,000 scientists and hundreds of universities and laboratories from more than 100 countries.
…..
• Large Hadron Collider: it produces about 1 petabyte of data every second. The volume is so huge that they don't even retain or store all the data they produce.
• NASA: it produces about 1.73 GB of data every hour, covering weather, geolocation data from satellites, etc.
Government
• NSA (National Security Agency): known for its controversial data collection.
• It stores this data in the Utah Data Center, a data storage facility for the United States Intelligence Community, designed to hold data on the order of a yottabyte (1 trillion terabytes); a quick unit check follows.
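As a quick check of that conversion: 1 yottabyte = 10^24 bytes and 1 terabyte = 10^12 bytes, so 1 yottabyte = 10^24 / 10^12 = 10^12 terabytes, i.e. one trillion terabytes, as stated above.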
…
• In March 2012, the Obama administration announced about 200 million dollars in big data initiatives.
• Obama's second-term campaign used big data analytics, which gave it a competitive edge in winning the election.
Private
• With the advent of social media such as Facebook, Twitter, and LinkedIn, there is no scarcity of data.
• eBay is known to have a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
• Facebook is known to have 30 PB Hadoop clusters, with 50 billion photos and 130 TB of logs every day.
….
• Big data is not only produced and analyzed in social media companies, but also in the retail space.
• The most common practice on retail websites is to capture click-stream data.
• Example: when you shop on Amazon.com.
• Amazon does not capture data only when you click checkout; every click on the website is tracked to deliver a personalized shopping experience.
• When Amazon shows you recommendations, the work behind them is big data analytics.
Big Data Challenges
Big data comes with big problems.
• Storage: storage must be efficient and suitable for computation.
• Data loss: it may happen due to data corruption or hardware failure, so you need proper recovery strategies.
• Computational efficiency: computing over big data sets within a reasonable execution time is a challenge.
• Cost: the solution we use to analyze and process big data in a reasonable time should be cost effective.
What is the need for a new solution like Hadoop?
• Why is a traditional database solution like MySQL or Oracle not a good fit?
…..
• In a traditional RDBMS you will see scalability issues when the data volume moves into the terabytes.
• As the data gets bigger you will be forced to make changes such as re-indexing, optimizing the queries, and denormalizing and aggregating the data for faster execution.
• Even when you have enough hardware for your database, if you still face performance issues you have no choice other than optimizing the queries.
…
• Traditional database systems are not horizontally scalable. You cannot add more resources or more computation nodes and distribute the problem across them to improve execution time or performance.
• The next big problem is that traditional database systems are designed to process only structured data. They are not a good choice when your data comes in different formats such as text, video, and images.
…
• Distributed computing: distributed computing solutions like grid computing run many nodes that operate on data in parallel, and hence compute faster.
But there are two challenges:
1. Grid computing is high-performance computing: it is good for compute-intensive tasks with relatively low volumes of data, but it does not perform well when the data volume is huge.
2. Grid computing requires solid experience with low-level programming to implement.
So Hadoop is a good solution.
…
1. It can handle huge volumes of data and store them efficiently, for both storage and computation.
2. It has a good recovery solution for data loss.
3. It scales horizontally, and that is very important: as your data gets bigger you can add more nodes and everything keeps working seamlessly.
4. It is cost effective: we don't need any specialized hardware to run Hadoop, and hence it is great even for startups.
…
5. Finally, it is easy to learn and implement.
This does not mean RDBMS is no good: there are things Hadoop is good at, and there are things a database is good at.
….
UNDERSTANDING A BIG DATA PROBLEM
• We will take a sample big data problem, analyze it, and see how to arrive at the solution.
1. Imagine you are working for a major stock exchange company, and the data here represents the details of each stock traded on a given date. Its size is 1 TB.
2. Now you have been given this problem and asked for a solution.
Now, how will you figure it out?
• The first thing is storage and the second thing is computation.
• Let's start with storage.
• Your workstation has only 20 GB of free space, but the size of the stock exchange data is 1 TB.
• So you will call the storage team of your organization to copy the data to a NAS (Network Attached Storage) server or a SAN (Storage Area Network) server.
….
• NAS and SAN are connected to the network, so any computer with access to the network can access the data stored in them, provided it has permission.
• Now the data is stored and we have access; next is computation.
• You are a Java programmer, so you write and optimize a Java program to parse the dataset and perform the computation (a minimal sketch of such a program follows).
• Now everything is ready for execution.
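The sketch below shows the kind of single-machine program described above. The column layout ("symbol,date,closingPrice") and the file path are assumptions made for illustration, not the deck's actual dataset format; it simply finds the highest closing price seen for each stock symbol.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.HashMap;
  import java.util.Map;

  public class StockParser {
    public static void main(String[] args) throws IOException {
      Map<String, Double> maxPrice = new HashMap<>();
      // The whole 1 TB file has to stream through this single reader; this is the
      // bottleneck the next slides quantify.
      try (BufferedReader reader = Files.newBufferedReader(Paths.get("/mnt/nas/stocks.csv"))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split(",");
          if (fields.length < 3) continue;           // skip malformed records
          String symbol = fields[0];
          double price = Double.parseDouble(fields[2]);
          maxPrice.merge(symbol, price, Math::max);  // keep the highest price per symbol
        }
      }
      maxPrice.forEach((symbol, price) -> System.out.println(symbol + " " + price));
    }
  }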
….
• For the program to work on the dataset, the data must first be read from storage into working memory.
• So how long does it take to read a one-terabyte dataset from storage?
• Let's take our traditional hard disk drive (HDD).
• When you request a read, the head in the hard disk drive positions itself on the platter and starts transferring data from the platter to the head.
• The speed at which data transfers from the platter to the head is called the data access rate.
….
• The average data access rate of an HDD is usually around 122 MB/s.
• At that rate, reading a 1 TB file from an HDD takes about 2 hours and 22 minutes (the rough arithmetic is shown below). That is for an HDD connected to our laptops or workstations.
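As a rough check, assuming binary units (1 TB ≈ 1,048,576 MB): 1,048,576 MB ÷ 122 MB/s ≈ 8,595 s ≈ 143 minutes, i.e. roughly 2 hours 23 minutes, which is where the figure of about 2 hours 22 minutes comes from.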
So what is the data access rate of the HDDs in NAS or SAN servers?
• Let us assume the same data access rate.
• So it takes about 2 hours 22 minutes to read 1 TB of data.
• What will the network bandwidth between the server and the workstation be?
• And what about the computation time? Since we have not executed the program even once, we cannot predict the execution time.
• So the expected time to arrive at (ETA) the solution is approximately 3 hours.
So the business user who gave you the problem is shocked to hear that the ETA is more than 3 hours.
• He will ask the next question: can you deliver it in 30 minutes?
• You know there is no way you can produce the results in 30 minutes.
• Of course the business cannot wait for three hours, especially in finance.
• Time is money, right?
I have an idea
Why not try an SSD instead of an HDD?
SSD (Solid State Drive)
• An SSD has no magnetic platters or heads; it has no moving components and is based on flash memory.
• So it is extremely fast, and it significantly reduces the time it takes to read data from storage.
• Sounds great.
But there is a problem with SSDs
• An SSD usually costs 5 to 6 times as much as a regular HDD.
• Though we can expect prices to go down, at the data volumes we are talking about for big data it is not a viable option right now.
• So for now we are stuck with HDDs.
So let’s think about how to reduce the
computation time of the program
…
• We have already optimized the program for execution.
• It will take approximately 60 minutes to execute.
• So what can be done next?
Another idea
• How about dividing the 1 TB dataset into 100 equal-sized blocks and having 100 computers do the computation in parallel?
• This means you cut the data read time by a factor of 100, and the computation time also by a factor of 100.
• With this idea you can bring the data access time to under 2 minutes and the computation time to under 1 minute (see the rough arithmetic below).
• So it is a promising idea.
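Using the earlier numbers, and assuming the work parallelizes perfectly with no coordination overhead: each node reads 1/100 of the data, about 8,595 s ÷ 100 ≈ 86 s, which is under 2 minutes, and the 60-minute computation split 100 ways takes 60 min ÷ 100 = 0.6 min, which is under 1 minute.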
But the problem here is
• Hundreds of nodes try to read data from the storage over the network.
• Example:
• What happens if every member of your family at home starts streaming their favorite TV show or movie on Netflix over a single internet connection?
• It results in a very poor streaming experience with a lot of buffering. No one can enjoy the show, because we exceed the network bandwidth allocated by the ISP.
• This is exactly what will happen in our case when hundreds of nodes try to access the storage over the network.
So bring the Storage closer to
Computation
It means storing data locally on each node's hard disk: block 1 of the data on node 1, block 2 on node 2, and so on. Now we achieve truly parallel reads on all 100 nodes, and we have also eliminated the network bandwidth issue.
So another problem is data loss due to hard disk failure. To overcome this, replicate the blocks on multiple nodes.
Here nodes 1, 2 and 3 hold copies of block 7; similarly, nodes 1 and 3 hold copies of block 10. Even if node 1 is corrupted or its hard disk crashes, copies of its blocks are still available on node 3 (a toy sketch of the bookkeeping this requires follows).
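The sketch below is a toy illustration, not Hadoop's actual implementation: some coordinator has to remember which nodes hold a replica of each block, so that a failed node's blocks can still be read from the survivors.

  import java.util.List;
  import java.util.Map;

  public class ReplicationMapDemo {
    public static void main(String[] args) {
      // block id -> nodes holding a replica (the example blocks from the slide)
      Map<Integer, List<Integer>> blockToNodes = Map.of(
          7,  List.of(1, 2, 3),
          10, List.of(1, 3));

      int failedNode = 1;
      // If node 1 fails, every block it held is still readable from another replica.
      blockToNodes.forEach((block, nodes) ->
          nodes.stream()
               .filter(n -> n != failedNode)
               .findFirst()
               .ifPresent(survivor ->
                   System.out.println("Block " + block + " still available on node " + survivor)));
    }
  }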
But there are some challenges in implementing this architecture.
• How does node 1 know that node 3 has a copy of block 7?
• And who decides that block 7 should be stored on nodes 1, 2 and 3?
• Who will break the 1 TB into 100 blocks? These are the challenges on the storage side.
• When it comes to the computation side, node 1 can compute only a part of the result, and similarly nodes 2 and 3. The individual computations from all the nodes must be consolidated to obtain the final output.
• So who is going to coordinate all of that?
Consolidated computation from
multiple nodes.
We have seen several complexities involved in both the storage layer and the computation layer when implementing this architecture.
To overcome all these complexities we have a solution called Hadoop.
Hadoop has two core components
HDFS & MapReduce
HDFS - Hadoop Distributed File System
• It takes care of all the storage-related complexities, like splitting datasets into blocks.
• It replicates each block to more than one node, and also keeps track of which block is stored on which node (a minimal usage sketch is shown below).
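The sketch below is a minimal example of loading our dataset into HDFS through its Java API. The namenode address and the file paths are hypothetical placeholders; the API calls themselves (Configuration, FileSystem, copyFromLocalFile) are standard Hadoop classes.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical namenode address
      conf.set("dfs.replication", "3");                 // keep 3 copies of every block

      try (FileSystem fs = FileSystem.get(conf)) {
        // HDFS splits the file into blocks, spreads them across the datanodes, and
        // records which node holds which block; none of that is our code's job.
        fs.copyFromLocalFile(new Path("/data/stocks.csv"),
                             new Path("/user/analyst/stocks.csv"));
      }
    }
  }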
MapReduce
• MapReduce is a programming model.
• It takes care of all the computational complexities.
• The Hadoop framework brings the intermediate results from every single node together to produce a consolidated computational result (a minimal sketch for our stock example follows).
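The sketch below is a minimal MapReduce job for the stock scenario, not the deck's actual program: it assumes each input line is a hypothetical CSV record "symbol,date,closingPrice" and finds the highest closing price per symbol. The mapper runs on every node against its local blocks, and Hadoop shuffles the emitted pairs so that each reducer sees all the prices for one symbol.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MaxStockPrice {

    // Map phase: runs on every node against its local blocks; emits (symbol, price).
    public static class PriceMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length >= 3) {
          context.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[2])));
        }
      }
    }

    // Reduce phase: Hadoop groups all prices for a symbol and consolidates them here.
    public static class MaxReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      @Override
      protected void reduce(Text symbol, Iterable<DoubleWritable> prices, Context context)
          throws IOException, InterruptedException {
        double max = Double.NEGATIVE_INFINITY;
        for (DoubleWritable p : prices) {
          max = Math.max(max, p.get());
        }
        context.write(symbol, new DoubleWritable(max));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "max stock price");
      job.setJarByClass(MaxStockPrice.class);
      job.setMapperClass(PriceMapper.class);
      job.setReducerClass(MaxReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. HDFS output directory
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }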
Hadoop
A framework for distributed processing of large datasets across clusters of commodity computers.
What is meant by commodity
computers here?
• Commodity computers means inexpensive, off-the-shelf hardware.
• It does not mean cheap, unreliable hardware.
Now, back to the business problem
• You can propose the Hadoop framework to the business user to reduce the execution time and solve the problem.
• You don't even need a 100-node cluster to solve the problem; a 10-node cluster is enough.
• If you want to reduce the execution time further, all you have to do is add more nodes to the cluster.
• Hadoop, as noted, is horizontally scalable.