Transcript

  • 1. BIG DATA Case Study. Anju Singh
  • 2. AGENDA: What is Big Data; problems associated with Big Data; solutions for the Big Data problem; comparison of traditional vs non-traditional solutions; introduction to various Big Data solutions; case study of Big Data for bioinformatics; what is my approach; conclusions; questions.
  • 3. BIG DATA: Big Data is a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
  • 4. Why BIG DATA? Every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, complex physics simulations, and biological and environmental research.
  • 5. BIG DATA Confusion
  • 6. BIG DATA Survey: the BIGDATA@WORK survey was conducted by IBM in mid-2012 with 1,144 professionals from 95 countries across 26 industries. Respondents represent a mix of disciplines, including both business and IT professionals.
  • 7. The Four Vs
  • 8. BIG DATA Activity
  • 9. BIG DATA Analytics
  • 10. BIG DATA Adoption Stages
  • 11. Problems Associated with BIG DATA
  • 12. BIG DATA Challenges: Capture, Curation, Storage, Search, Sharing, Transfer, Analysis, Visualization
  • 13. BIG DATA Solutions
  • 14. Solutions. We have various Big Data solutions: MPI running on HPC, and MapReduce running on a Hadoop cluster.
  • 15. Traditional vs Non-Traditional Solutions For BIG DATA
  • 16. Hadoop vs RDBMS
  • 17. MapReduce vs MPI. MapReduce: good for large data; highly scalable; fault-tolerant; moves computation to the data; everything except the business logic is taken care of by the MapReduce library and higher-level languages like Pig and Hive. MPI: good for large computation; lacks scalability; lacks fault tolerance; moves data; MPI code is tough to write because communication, synchronization, I/O, debugging, and checkpointing all need to be considered.
  • 18. Hadoop Cluster vs HPC (PetaBytes vs PetaFLOPS). Hadoop: for large data; for SIMD; runs on a commodity cluster; private storage on each node; moves computation to the data; good for data scalability; total execution time = execution time + disk seek time; the resource scheduler is built into the JobTracker. HPC: for large computation; for both SIMD and MIMD; runs on highly dedicated servers; uses common SAN storage; moves data to the compute nodes; good for load balancing; total execution time = execution time + disk seek time + network transfer time; a resource scheduler must be installed separately.
  • 19. MapReduce Frameworks for BIG DATA
  • 20. BIG DATA Solutions (MapReduce Framework)
  • 21. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It solves the challenge of getting your hands on the right data in an ocean of structured and unstructured data, and works with Big Data using the concept of parallel processing. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.
  • 22. How is it better? Performing computation on large volumes of data has been done before, usually in a distributed setting. What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, in turn utilizing the underlying parallelism of the CPU cores. (The classic word-count example below makes this concrete.)
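To make the "simplified programming model" concrete, here is a minimal sketch of the canonical word-count job in Java (assuming the Hadoop MapReduce libraries are on the classpath). The user writes only the two small functions below; splitting the input, shuffling intermediate pairs, and distributing work across machines are handled by the framework.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: runs once per input split; emits (word, 1) for each word.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // framework shuffles and sorts by key
                }
            }
        }

        // Reduce phase: receives all counts for one word; emits the total.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }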
  • 23. HADOOP Features: Hadoop is 100% open source, scalable, and fault-tolerant. For both structured and unstructured data. Runs on commodity clusters. Master-slave architecture. Good for data-intensive applications. Moves computation, not data. Good for batch processing and streaming data access. Pig and Hive make MapReduce programming easy. Developed in Java, so platform-independent.
  • 24. Hadoop Architecture. HDFS: NameNode (master), DataNode (slave), SecondaryNameNode. MapReduce: JobTracker (master), TaskTracker (slave).
  • 25. Hadoop Architecture Contd. Hadoop Distributed File System (HDFS): HDFS stores large data sets across the machines in a cluster, splitting each large file into fixed-size blocks (64 MB or 128 MB) that are stored with a default replication factor of 3. It works as a master-slave architecture. NameNode: works as the master for HDFS and stores the metadata for HDFS. DataNode: works as a slave node for HDFS and stores the actual data in the form of blocks. SecondaryNameNode: does housekeeping tasks for the NameNode. (A client-side sketch follows below.)
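As a sketch of how a client sees this architecture, the snippet below uses Hadoop's standard FileSystem API to write a file and read back its replication factor. The path is hypothetical, and it is assumed that the cluster's core-site.xml (with the default filesystem pointing at the NameNode) is on the classpath.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml from the classpath; the configured default
            // filesystem decides which NameNode this client talks to.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path, for illustration only.
            Path file = new Path("/user/demo/hello.txt");

            // The NameNode manages the metadata and block placement;
            // the bytes themselves stream to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Replication factor actually applied to this file (default 3).
            System.out.println("replication = " + fs.getFileStatus(file).getReplication());
        }
    }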
  • 26. HDFS (Hadoop Distributed File System)
  • 27. Hadoop Architecture Contd. MapReduce: MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. JobTracker: the master daemon service for submitting and tracking MapReduce jobs in Hadoop. TaskTracker: a slave daemon in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node. (A job-submission sketch follows below.)
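A minimal driver that submits the mapper and reducer sketched earlier as one MapReduce job. Input and output paths are placeholders passed on the command line; in MRv1 terms, waitForCompletion hands the job to the JobTracker, which schedules map and reduce tasks onto TaskTrackers.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            // The combiner pre-aggregates map output locally; safe here
            // because summing is associative and commutative.
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            // Submits the job and polls until all tasks finish.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }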
  • 28. Graphical view of CloudBurst MapReduce
  • 29. Hadoop Cluster
  • 30. Largest Hadoop Cluster at Yahoo, Running 42,000 Nodes
  • 31. PIG: MapReduce is hard to program. To make it simpler, we use the Hadoop extension Pig. Some well-known users of Pig are Yahoo and Twitter. It has two main components: 1. a high-level processing language, Pig Latin; 2. a compiler that translates Pig Latin into MapReduce jobs, usually run on Hadoop. (A minimal sketch appears below.)
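To show how much shorter Pig Latin is than raw MapReduce, here is the same word count expressed as Pig Latin statements, embedded through Pig's Java PigServer API. This is an illustrative sketch only: the input path is hypothetical, and ExecType.MAPREDUCE assumes a running Hadoop cluster (ExecType.LOCAL would run the same script locally).

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Each registered statement is compiled into MapReduce jobs.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery(
                "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
            // store() triggers execution and writes results to HDFS.
            pig.store("counts", "wordcount-out");
        }
    }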
  • 32. HIVE: It makes MapReduce programs easy. An SQL-like language (HiveQL) for SQL users. Developed by Facebook. Processes large data sets stored in HDFS. It is a data warehouse infrastructure built on top of Hadoop. (See the JDBC sketch below.)
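A sketch of querying Hive from Java over HiveServer2's standard JDBC interface. The host, port, credentials, and the reads table are placeholders for illustration; the hive-jdbc driver is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC endpoint; replace with your cluster's host/port.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // DDL in HiveQL: a table over tab-delimited files in HDFS.
                stmt.execute("CREATE TABLE IF NOT EXISTS reads (id STRING, seq STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
                // Hive compiles this query into MapReduce jobs behind the scenes.
                try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM reads")) {
                    while (rs.next()) {
                        System.out.println("rows = " + rs.getLong(1));
                    }
                }
            }
        }
    }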
  • 33. Extensions make Hadoop easy
  • 34. Challenges of Hadoop: Not all problems can be converted into MapReduce. Hadoop cannot be directly mounted on an existing OS, but with FUSE it is possible. Security is provided by third parties. Proper recovery from partial failure must be in place.
  • 35. BIG DATA Case Study For BIOINFORMATICS
  • 36. BIG DATA Application Area
  • 37. HADOOP For BIOINFORMATICS: The initial delay in the adoption of Hadoop for Big Data was mostly due to a lack of information and inertia within the community. Hadoop began to be used in bioinformatics in May 2009. Hadoop is used mostly in Next-Generation Sequencing because that is where most of the Big Data is generated. CloudBurst was the first bioinformatics tool that runs on Hadoop.
  • 38. Human Sciences (BIOINFORMATICS): NextBio. NextBio is using Hadoop MapReduce and HBase to process massive amounts of human genome data. Problem: processing multi-terabyte data sets wasn't feasible using traditional databases like MySQL. Solution: NextBio uses Hadoop MapReduce to process genome data in batches, and HBase as a scalable data store. Hadoop vendor: Intel.
  • 39. Hadoop for Saving Time
  • 40. What have others done?
  • 41. Bioinformatics Tools on Hadoop: CloudBurst, Crossbow, Myrna, Contrail, Jnomics, Hadoop-BAM, CloudBLAST, SEAL, PeakRanger, Quake
  • 42. Crossbow Performance on Amazon
  • 43. What is My Approach?
  • 44. Crossbow: It is a scalable software pipeline for whole-genome resequencing analysis. It is a cloud version of Bowtie: it aligns reads to a reference genome with Bowtie, using the many compute nodes of a Hadoop MapReduce cluster, and then runs SOAPsnp to genotype the sample. The main aim of this tool is to analyze sequence data on a Hadoop cluster with existing tools, making the computation fast: all the features of Hadoop (scalability, fault tolerance, reliability, parallelization, and distributed computing) become available with the accuracy of the existing tools.
  • 45. Bioinformatics Tools on Hadoop
  • 46. Conclusion: Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters and the ease of use of the MapReduce method in parallelizing many data analysis algorithms.
  • 47. QUESTIONS