BIG DATA Case Study
What is BIG DATA?
Problems associated with Big Data
Solutions for Big Data problems
Comparison of traditional vs non-traditional solutions
Introduction to various Big Data solutions
Case study of Big Data for Bioinformatics
What is my approach?
BIG DATA
Big Data is a collection of data sets so large and
complex that they become difficult to process using
on-hand database management tools or traditional
data processing applications.
Why BIG DATA?
Every day, we create 2.5 quintillion bytes of data,
so much that 90% of the data in the world today
has been created in the last two years alone.
Scientists regularly encounter limitations due to
large data sets in many areas, including
meteorology, genomics, complex physics
simulations, and biological and environmental research.
BIG DATA Survey
The BIG DATA @ WORK survey was conducted by IBM in
mid-2012 with 1,144 professionals from 95
countries across 26 industries. Respondents
represent a mix of disciplines, including both
business professionals and IT professionals.
MapReduce vs MPI
MapReduce:
Good for large data.
Every burden except the business logic is taken
care of by the MapReduce library and higher-level
languages such as Pig and Hive.
MPI:
Good for large computation.
Lacks scalability.
Lacks fault tolerance.
MPI code is tough to write; debugging and
checkpointing must be handled by the programmer.
Hadoop Cluster vs HPC
(PetaBytes vs PetaFLOPS)
Hadoop cluster:
For large data.
Runs on commodity clusters.
Private storage on each node.
Moves computation to the data.
Good for data scalability.
Total execution time = execution time + disk seek time.
Resource scheduler is built into the JobTracker.
HPC:
For large computation.
For both SIMD and MIMD.
Runs on dedicated high-end servers.
Uses a common SAN for storage.
Moves data to the compute nodes.
Good for load balancing.
Total execution time = execution time + disk seek time
+ network transfer time.
Resource scheduler needs to be installed separately.
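The two execution-time formulas above differ only in the network transfer term. A minimal sketch of that arithmetic, with made-up illustrative numbers (not benchmarks):

```python
# Illustrative comparison of the two execution-time formulas above.
# All numbers are hypothetical assumptions, not measured benchmarks.

def hadoop_total(exe_time, disk_seek):
    """Hadoop: computation moves to the data, so no network transfer term."""
    return exe_time + disk_seek

def hpc_total(exe_time, disk_seek, net_transfer):
    """HPC: data moves to the compute nodes, adding a network transfer term."""
    return exe_time + disk_seek + net_transfer

# Hypothetical job: 600 s of computation, 30 s of disk seeks,
# and 240 s to ship the input over the network in the HPC case.
print(hadoop_total(600, 30))    # 630
print(hpc_total(600, 30, 240))  # 870
```

The gap grows with input size, which is why moving computation to the data pays off for data-intensive (rather than compute-intensive) jobs.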
MapReduce Frameworks for BIG DATA
BIG DATA Solutions
(MapReduce Framework)
Solves the challenge of getting your hands on
the right data in an ocean of structured and
unstructured data.
Apache Hadoop is an open source software
framework that supports data-intensive
distributed applications.
Works with Big Data using the concept of
MapReduce.
Hadoop requires Java Runtime Environment
(JRE) 1.6 or higher.
How is it better?
Performing computation on large volumes of data
has been done before, usually in a distributed
setting. What makes Hadoop unique is
its simplified programming model, which allows the
user to quickly write and test distributed systems,
and its efficient, automatic distribution of data and
work across machines, in turn utilizing the
underlying parallelism of the CPU cores.
HADOOP Features
Hadoop is 100% open source, scalable, and fault-tolerant.
For both structured and unstructured data.
Runs on commodity clusters.
Good for data-intensive applications.
Moves computation, not data.
Good for batch processing and streaming data access.
Pig and Hive make writing MapReduce programs easy.
Developed in Java, so it is platform-independent.
Hadoop Distributed File System (HDFS) - HDFS stores
large data across the machines in a cluster, splitting each
large file into fixed-size blocks (64 MB or 128 MB) and
storing them in HDFS with a replication factor of 3. It
follows a master/slave architecture.
NameNode - Works as the master for HDFS and stores the
metadata for HDFS.
DataNode - Works as a slave node for HDFS and stores
the actual data in the form of blocks.
SecondaryNameNode - Does housekeeping work
for the NameNode.
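The block-splitting described above is simple arithmetic. A small sketch, assuming the 64 MB block size and replication factor 3 quoted on this slide and a hypothetical 200 MB input file:

```python
import math

# Sketch of how HDFS splits a file into fixed-size replicated blocks.
# Block size and replication factor follow the slide; the file size
# is a made-up example.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB per block
REPLICATION = 3                 # default replication factor

def hdfs_blocks(file_size_bytes):
    """Number of logical blocks the NameNode records for this file."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

file_size = 200 * 1024 * 1024          # a 200 MB file
blocks = hdfs_blocks(file_size)        # 4 blocks: 64 + 64 + 64 + 8 MB
replicas = blocks * REPLICATION        # 12 physical block copies in the cluster
print(blocks, replicas)                # 4 12
```

Note that the last block is only 8 MB: HDFS blocks are an upper bound, so a short final block occupies only the space it needs.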
Hadoop Architecture Contd.
MapReduce - MapReduce is a software framework for
easily writing applications which process vast amounts of
data in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
JobTracker - The master daemon service for submitting
and tracking MapReduce jobs in Hadoop.
TaskTracker - A slave node daemon in the cluster that
accepts tasks (Map, Reduce, and Shuffle operations) from
the JobTracker. Only one TaskTracker process runs on
any Hadoop slave node.
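The map, shuffle, and reduce phases that the JobTracker distributes across TaskTrackers can be sketched in a single process. A minimal word-count illustration in plain Python (no Hadoop required; on a real cluster each phase would run in parallel on many nodes):

```python
from collections import defaultdict

# Single-process sketch of the map -> shuffle -> reduce flow.
# Word count is the canonical MapReduce example.

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big data tools"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 3, 'data': 2, 'clusters': 1, 'tools': 1}
```

The user writes only `map_phase` and `reduce_phase` (the business logic); distribution, grouping, and fault tolerance are the framework's job, which is the "simplified programming model" advantage described earlier.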
Hadoop Architecture Contd.
Largest Hadoop Cluster at YAHOO
Running 42,000 Nodes
MapReduce is hard to program.
To make it simpler, we use the Hadoop
extension Pig.
Well-known users of Pig include Yahoo.
It has two main components:
1. A high-level processing language: Pig Latin.
2. A compiler that runs Pig Latin, usually by
translating it into MapReduce jobs.
It makes writing MapReduce programs easy.
An SQL-style language for SQL lovers.
Developed by Facebook.
Processes large data stored in HDFS.
It is a data warehouse infrastructure built on top
of Hadoop.
Extensions make Hadoop easy
Challenges of Hadoop
Not all problems can be converted into the
MapReduce model.
Hadoop cannot be directly mounted on an existing OS,
but with FUSE it is possible.
Security is provided by third-party tools.
Proper recovery from partial failure must be
considered.
BIG DATA Case Study for a BIG DATA Application Area
HADOOP for BIOINFORMATICS
The initial delay in the adoption of
Hadoop for Big Data was mostly due
to a lack of information and inertia
within the community.
Hadoop began to be used in
Bioinformatics in May 2009.
Hadoop is used mostly in Next-
Generation Sequencing, because
that is where most of the Big Data is.
CloudBurst was the first
Bioinformatics tool that runs on
Hadoop.
Human Sciences (BIOINFORMATICS)
NextBio is using Hadoop MapReduce and HBase to
process massive amounts of human genome data.
Problem: Processing multi-terabyte data sets wasn't
feasible using traditional databases like MySQL.
Solution: NextBio uses Hadoop MapReduce to process
genome data in batches and uses HBase as a
scalable data store.
Hadoop vendor: Intel
It is a scalable software pipeline for whole-genome
resequencing analysis. It is a cloud version of Bowtie: it
aligns reads to a reference genome with Bowtie, using the
large pool of Hadoop/MapReduce compute nodes, and then
uses SOAPsnp for genotyping the sample.
The main aim of this tool is to analyse sequence
data on a Hadoop cluster with existing tools, making the
computation fast while keeping all the features of Hadoop
(scalability, fault tolerance, reliability, parallelization, and
distributed computing) along with the accuracy of the
original tools.
Hadoop and the MapReduce programming paradigm
already have a substantial base in the bioinformatics
community, especially in the field of next-generation
sequencing analysis, and such use is increasing. This
is due to the cost-effectiveness of Hadoop-based
analysis on commodity Linux clusters, and the ease
of use of the MapReduce method in parallelizing
many data analysis algorithms.