RDBMS and Hadoop - Co-existence
                or competition
                                                               Ram Mohan




            Copyright © 2011 Flytxt B.V. All rights reserved      1/16/2012
Session Agenda!
   Introduction to RDBMS
   What is Hadoop and Map-Reduce
   Hadoop and RDBMS – A comparison
   Co-Existence – Practical Example - Master Website
   Q&A




Relational DBMS
   Based on the principles of relational mathematics
   Data is represented in terms of rows and columns of a table
   Relational Terminology
    ◦ Tuple (Row)
    ◦ Attribute (Column)
    ◦ Relation (Table)
   Integrity Constraints
    ◦ Primary Key
    ◦ Foreign Key
    ◦ Alternate Key
   ACID Test
    ◦   Atomicity
    ◦   Consistency
    ◦   Isolation
    ◦   Durability




Normalization
   Normalization - the process of removing data redundancy by decomposing
    relations in a database.
   Denormalization - carefully introducing redundancy to improve query
    performance.




Relational DBMS




Example Data
S#   SNAME      STATUS             CITY
S1   Smith      20                 London
S2   Jones      10                 Paris
S3   Blake      30                 Paris


P#   PNAME      COLOR             WEIGHT                 CITY
P1   Nut        Red               12                     London
P2   Bolt       Green             17                     Paris
P3   Screw      Blue              17                     Rome
P4   Screw      Red               14                     London


S#   P#   QTY
S1   P1   300
S1   P2   200
S1   P3   400
S2   P1   300
S2   P2   400
S3   P2   200
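
The example data above can be loaded into a relational database and queried with a join. The following is a minimal sketch using Python's built-in sqlite3 module. The slide does not name the tables, so the names supplier, part and shipment, and the column names s_no and p_no (standing in for S# and P#), are assumptions made for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (s_no TEXT PRIMARY KEY, sname TEXT, status INTEGER, city TEXT);
CREATE TABLE part (p_no TEXT PRIMARY KEY, pname TEXT, color TEXT, weight INTEGER, city TEXT);
CREATE TABLE shipment (
    s_no TEXT REFERENCES supplier(s_no),   -- foreign key to supplier
    p_no TEXT REFERENCES part(p_no),       -- foreign key to part
    qty  INTEGER,
    PRIMARY KEY (s_no, p_no)
);
""")

conn.executemany("INSERT INTO supplier VALUES (?, ?, ?, ?)", [
    ("S1", "Smith", 20, "London"),
    ("S2", "Jones", 10, "Paris"),
    ("S3", "Blake", 30, "Paris"),
])
conn.executemany("INSERT INTO part VALUES (?, ?, ?, ?, ?)", [
    ("P1", "Nut", "Red", 12, "London"),
    ("P2", "Bolt", "Green", 17, "Paris"),
    ("P3", "Screw", "Blue", 17, "Rome"),
    ("P4", "Screw", "Red", 14, "London"),
])
conn.executemany("INSERT INTO shipment VALUES (?, ?, ?)", [
    ("S1", "P1", 300), ("S1", "P2", 200), ("S1", "P3", 400),
    ("S2", "P1", 300), ("S2", "P2", 400), ("S3", "P2", 200),
])


def parts_supplied_by(conn, sname):
    """Return (part name, quantity) pairs for one supplier, ordered by part number."""
    return conn.execute(
        """
        SELECT p.pname, sp.qty
        FROM shipment sp
        JOIN supplier s ON s.s_no = sp.s_no
        JOIN part p ON p.p_no = sp.p_no
        WHERE s.sname = ?
        ORDER BY p.p_no
        """,
        (sname,),
    ).fetchall()


print(parts_supplied_by(conn, "Smith"))  # [('Nut', 300), ('Bolt', 200), ('Screw', 400)]
```

The third table carries only keys and a quantity; supplier and part details are stored once and recovered through the join, which is the point of normalization.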


Five computers & a 640k ;-)

   Moore’s Law

   "I think there is a world market for about five computers"
       Thomas Watson 1943, Chairman of the board of IBM

   "640k ought to be enough for anybody"
       Attributed to Bill Gates in 1981.




The Big Data Challenges
   The number of data sources and the amount of data to analyze are growing
    exponentially
   Stale data exists because DW solutions cannot ingest the vast amounts of
    data fast enough
   Lack of performance for advanced analytics and complex queries
   The number of users and their concurrency are increasing rapidly




Hadoop Architecture




Hadoop – HDFS (Hadoop Distributed File System)
   Reliably stores petabytes of replicated data across thousands of nodes
    ◦ Data divided into 64 MB blocks, each block replicated three times
   Master/Slave architecture
    ◦ Master NameNode contains block locations
    ◦ Slave DataNode manages blocks on the local FS
   Built on local commodity hardware
    ◦ No RAID required
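
The block and replication figures above translate into simple arithmetic. The sketch below is illustrative only: it assumes that "replicated three times" means three copies in total, and that the last block of a file occupies only its actual size on disk. The function name hdfs_footprint is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # block size stated on the slide
REPLICATION = 3      # assumed to mean three copies of each block in total


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be partial
    block_replicas = blocks * replication              # replicas the DataNodes hold
    raw_storage_mb = file_size_mb * replication        # partial block stores only its data
    return blocks, block_replicas, raw_storage_mb


print(hdfs_footprint(1024))  # (16, 48, 3072)
print(hdfs_footprint(100))   # (2, 6, 300)
```

A 1 GB file therefore becomes 16 blocks and 48 block replicas, and the NameNode only has to track where those replicas live.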




Map-Reduce Model
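
As a rough illustration of the Map-Reduce model, the sketch below simulates the three phases (map, shuffle, reduce) for a word count in plain Python on a single machine. It does not use the Hadoop API; the function names are hypothetical, and on a real cluster the framework runs the map and reduce steps on many nodes and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be spread across machines without the application code changing.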




Hadoop – Limitations
   Is not intended for real-time querying
   Does not support random access
   Significant learning curve
   Provides bare-bones functionality out of the box, but scaling is built in and
    inexpensive




Where SQL Makes Life Easy
   Joining
    ◦ In a single query, get all products in an order with their product information
   Secondary Indexing
    ◦ Get CustomerId by e-mail
   Referential Integrity
   Real-time analysis
   Millions are trained in SQL and relational data modelling
   RDBMS provides tremendous functionality, but is extremely difficult and
    costly to scale
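
The joining, secondary indexing and referential integrity points above can be sketched with Python's built-in sqlite3 module. The schema (customer, product, order_item), the sample rows and the helper function names are all hypothetical, chosen only to mirror the examples on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE UNIQUE INDEX idx_customer_email ON customer(email);  -- secondary index
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE order_item (
    order_id   INTEGER,
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER
);
INSERT INTO customer VALUES (1, 'a@example.com'), (2, 'b@example.com');
INSERT INTO product VALUES (10, 'Nut', 0.5), (11, 'Bolt', 0.75);
INSERT INTO order_item VALUES (100, 10, 4), (100, 11, 2), (101, 10, 1);
""")


def products_in_order(order_id):
    """Joining: all products in an order with their product information."""
    return conn.execute(
        """
        SELECT p.name, p.price, oi.quantity
        FROM order_item oi
        JOIN product p ON p.product_id = oi.product_id
        WHERE oi.order_id = ?
        ORDER BY p.product_id
        """,
        (order_id,),
    ).fetchall()


def customer_id_by_email(email):
    """Secondary indexing: look up a customer by a non-key column."""
    row = conn.execute(
        "SELECT customer_id FROM customer WHERE email = ?", (email,)
    ).fetchone()
    return row[0] if row else None


def insert_item(order_id, product_id, quantity):
    """Referential integrity: rows that point at a missing product are rejected."""
    try:
        conn.execute(
            "INSERT INTO order_item VALUES (?, ?, ?)", (order_id, product_id, quantity)
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(products_in_order(100))                 # [('Nut', 0.5, 4), ('Bolt', 0.75, 2)]
print(customer_id_by_email("b@example.com"))  # 2
print(insert_item(102, 999, 1))               # False: product 999 does not exist
```

All three behaviours come from declarations in the schema rather than from application code, which is the convenience the slide attributes to SQL.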




Master Website – A Practical Example




Master Website – RDBMS Use Cases
   Profile information provided during sign-up
   Generated intelligence, i.e. the output of the analytic jobs
   Online purchase records and account management
   Reporting tools




Master Website – Hadoop Use Cases
   Generating intelligence from the continuous stream of data
    ◦ Wall posts on Facebook
   Deriving new tags from the old logs that are still available, when new
    requirements arise
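
The division of labour between the two systems can be sketched as a small pipeline: a batch job in the Map-Reduce style derives tag counts from raw posts, and the result is loaded into a relational table for reporting. This is a single-machine illustration, not the architecture from the talk. Treating tags as hashtags, the table name tag_count and all function names are assumptions.

```python
import sqlite3
from collections import Counter


def extract_tags(post):
    """Map step: emit the hashtags found in one wall post (assumed tag format)."""
    return [w.lower() for w in post.split() if w.startswith("#") and len(w) > 1]


def aggregate(posts):
    """Reduce step: count each tag over the whole batch of posts."""
    counts = Counter()
    for post in posts:
        counts.update(extract_tags(post))
    return counts


def load_into_rdbms(conn, counts):
    """Store the batch output in a relational table for reporting tools."""
    conn.execute("CREATE TABLE IF NOT EXISTS tag_count (tag TEXT PRIMARY KEY, n INTEGER)")
    conn.executemany("INSERT OR REPLACE INTO tag_count VALUES (?, ?)", counts.items())
    conn.commit()


posts = ["Loving the new #phone", "#Phone battery is great", "Off to #Paris"]
conn = sqlite3.connect(":memory:")
load_into_rdbms(conn, aggregate(posts))

top = conn.execute("SELECT tag, n FROM tag_count ORDER BY n DESC, tag").fetchall()
print(top)  # [('#phone', 2), ('#paris', 1)]
```

If the raw posts are kept, the batch step can be rerun with a different extract_tags when requirements change, while the reporting side continues to read from the same table.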




A Practical Example – Facebook Architecture




THANK YOU





Editor's Notes

  • #4 No centralized control. Data redundancy. Data inconsistency. Data cannot be shared. Standards cannot be enforced. Security issues. Integrity cannot be maintained. Data dependence.
    Centralized control. No data redundancy. Data consistency. Data can be shared. Standards can be enforced. Security can be enforced. Integrity can be maintained. Data independence.
  • #8 Can all the data be structured? Will we be able to store all the data in the tables, i.e. can we model all the data? Should we discard the data after getting the required structured data from the log files, or should we archive it?
  • #9 Take the example of students using the facilities provided by college.
  • #10 Two core components: HDFS & Map-Reduce. Machines are unreliable. Separates distributed fault-tolerant computing code from application logic. No need to worry about the identity of a machine; lets you interact with a cluster, not a bunch of machines. Analysis workloads span multiple machines. Runs as a cloud (cluster) & possibly on a cloud (EC2).
  • #16 Consumer interested in: social networking; online purchasing/booking. Service provider interested in data: advertisements or revenue generation; reporting, for internal housekeeping. Challenges: recommendation, i.e. publishing those advertisements which the consumer looks at as information or in which he is interested.