Why every NoSQL deployment should
      be paired with Hadoop


              James Phillips             Amr Awadallah
           Co-founder and SVP Products    Co-founder and CTO
                   Couchbase                   Cloudera




                                                               1
Agenda

•   Big Audience vs. Big Data
•   NoSQL for Big Audience
•   Hadoop for Big Data
•   Big Audiences create and consume Big Data
    – NoSQL and Hadoop are highly synergistic
•   Couchbase + Cloudera




                                                2
Aren’t NoSQL, Hadoop, and “Big Data” all the same?




                   No.

                                                 3
Two challenges at the data layer




    “Big Audience.”
    Most new interactive software systems are accessed via browser,
    with 2 billion potential users and a 24x7 uptime requirement.

    “Big Data.”
    IDC estimates that more than 1.8 trillion gigabytes of information
    was created in 2011 and that it will double every two years.
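
To make the "double every two years" figure concrete, here is a minimal sketch of the arithmetic. It assumes the doubling continues at a steady rate from the 2011 baseline, which is an extrapolation for illustration and not an IDC forecast for any specific year. The function name and constants are hypothetical.

```python
# Illustrative projection of the IDC estimate quoted on this slide.
BASE_YEAR = 2011
BASE_ZETTABYTES = 1.8        # 1.8 trillion gigabytes = 1.8 zettabytes
DOUBLING_PERIOD_YEARS = 2    # assumption: doubling rate stays constant


def projected_zettabytes(year: int) -> float:
    """Data created in `year`, assuming steady doubling every two years."""
    periods = (year - BASE_YEAR) / DOUBLING_PERIOD_YEARS
    return BASE_ZETTABYTES * 2 ** periods


for year in (2011, 2013, 2015, 2021):
    print(year, round(projected_zettabytes(year), 1), "ZB")
```

Under this assumption the volume grows by a factor of 32 in ten years, which is the kind of growth the "Big Data" side of the slide refers to.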




                                                                                  4
Changes in interactive software – NoSQL driver




                                                 6
Modern interactive software architecture


                                                             Application Scales Out
                                                             Just add more commodity web servers




                                                             Database Scales Up
                                                             Get a bigger, more complex server




  Note – Relational database technology is great for what it is great for, but it is not great for this.
                                                                                                           7
Extending the scope of RDBMS technology

• Data partitioning (“sharding”)
   – Disruptive to reshard – impacts application
   – No cross-shard joins
   – Schema management at every shard
• Denormalizing
   – Increases speed
   – At the limit, provides complete flexibility
   – Eliminates relational query benefits
• Distributed caching
   – Accelerate reads
   – Scale out
   – Adds another tier, no write acceleration, coherency management
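
Why resharding is disruptive can be seen with a small sketch. It assumes the simplest application-level scheme, hashing the key modulo the number of shards; real deployments may use other schemes. The key names and helper functions are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Naive application-level sharding: stable hash of the key, modulo shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]  # hypothetical key names
print(f"{fraction_moved(keys, 4, 5):.0%} of keys move when growing from 4 to 5 shards")
```

With modulo placement, adding a single shard relocates most of the data (roughly four keys in five for 4 to 5 shards), and the application has to know the new shard count at the same moment the data moves.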

                                                                 8
Lacking market solutions, users forced to invent



   Bigtable            Dynamo          Cassandra       Voldemort
  November 2006       October 2007     August 2008     February 2009



         •    No schema required before inserting data
         •    No schema change required to change data format
         •    Auto-sharding without application participation
         •    Distributed queries
         •    Integrated main memory caching
         •    Data synchronization (mobile, multi-datacenter)
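
The first two bullets can be illustrated with a toy document store. This is a sketch of the general idea only, not the API of any particular NoSQL product; the `DocumentStore` class and the field names are hypothetical.

```python
import json


class DocumentStore:
    """Toy in-memory stand-in for a schema-free document database (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def set(self, key: str, doc: dict) -> None:
        # Documents are stored as serialized JSON; no table definition exists.
        self._docs[key] = json.dumps(doc)

    def get(self, key: str) -> dict:
        return json.loads(self._docs[key])


store = DocumentStore()
# First version of the application writes a minimal profile.
store.set("user:1", {"name": "Ann", "level": 3})
# A later version adds fields; older documents are left untouched.
store.set("user:2", {"name": "Bob", "level": 7, "friends": ["user:1"], "locale": "en_US"})

for key in ("user:1", "user:2"):
    doc = store.get(key)
    # Readers tolerate missing fields instead of relying on a migration.
    print(key, doc["name"], doc.get("friends", []))
```

The data format changed between the two writes without any schema change, and the cost moves to the reader, which must handle documents of both shapes.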


                                                                       9
NoSQL database matches application logic tier architecture
Data layer now scales with linear cost and constant performance.



                                                     Application Scales Out
                                                     Just add more commodity web servers




     NoSQL Database Servers
                                                     Database Scales Out
                                                     Just add more commodity data servers




     Scaling out flattens the cost and performance curves.
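
One common way to let a data tier scale out without application changes is to hash keys into a fixed number of partitions and keep a separate partition-to-server map. The sketch below assumes that approach for illustration; the partition count, server names, and rebalancing rule are hypothetical and do not describe any specific product.

```python
import zlib

NUM_PARTITIONS = 1024  # fixed for the life of the cluster (illustrative value)


def partition_for(key: str) -> int:
    """Keys map to a fixed partition, never directly to a server."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def initial_map(servers):
    """Spread partitions round-robin across the starting servers."""
    return [servers[p % len(servers)] for p in range(NUM_PARTITIONS)]


def add_server(partition_map, new_server):
    """Give the new server an even share, taking partitions from the fullest servers."""
    new_map = list(partition_map)
    servers = sorted(set(partition_map)) + [new_server]
    target = NUM_PARTITIONS // len(servers)
    owned = {s: [p for p, owner in enumerate(new_map) if owner == s] for s in servers}
    while len(owned[new_server]) < target:
        donor = max(servers, key=lambda s: len(owned[s]))
        p = owned[donor].pop()
        new_map[p] = new_server
        owned[new_server].append(p)
    return new_map


old_map = initial_map(["s1", "s2", "s3", "s4"])
new_map = add_server(old_map, "s5")
moved = sum(1 for a, b in zip(old_map, new_map) if a != b)
print(f"{moved} of {NUM_PARTITIONS} partitions moved")
print("user:42 lives on", new_map[partition_for("user:42")])
```

Growing from four to five servers moves about one fifth of the partitions, and the key-to-partition function never changes, so the application only needs the updated map.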
                                                                                            10
Survey: Schema inflexibility #1 adoption driver

           What is the biggest data management problem
           driving your use of NoSQL in the coming year?

           Lack of flexibility/rigid schemas    49%
           Inability to scale out data          35%
           High latency/low performance         29%
           Costs                                16%
           All of these                         12%
           Other                                11%

           Source: Couchbase NoSQL Survey, December 2011, n=1351




                                                                                                               11
Two peas. One pod.




               http://tinyurl.com/6tx42tw
                                            19
Hadoop as a Web application feeder or consumer

Pattern 1: Hadoop feeding a web application
    “big data” -> insights -> Web application -> big audience

Pattern 2: Hadoop consuming web application data
    “big audience” -> Web application -> big data -> insights
                                                                               20
Pattern 1 Case Study: AOL Ad Targeting

• One of the largest online ad targeting operations
• Ad slot filling optimization
   – Serve the most relevant ad to a given user
   – Meet contracted impression counts
• Relevancy criteria
   – Demographic
   – Psychographic
   – Current behavioral
• 40 milliseconds to fill all slots


                                                      21
AOL Advertising: Hadoop as an ad targeting feeder

   40 milliseconds to respond with the decision.

   [Diagram: numbered data flows; one component is labeled "affiliates"]
     1  events
     2  profiles, campaigns
     3  profiles, real time campaign statistics
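
The feeder pattern on this slide can be sketched in three steps: a batch job derives profiles from events, the results are loaded into a low-latency store, and the serving path does nothing but key lookups. This is a simplified illustration, not AOL's implementation; a plain function stands in for the Hadoop job, a dict stands in for the NoSQL store, and all names and sample data are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical event log; in the pattern above this would be collected in Hadoop.
events = [
    {"user": "u1", "category": "sports"},
    {"user": "u1", "category": "sports"},
    {"user": "u1", "category": "travel"},
    {"user": "u2", "category": "autos"},
]


def build_profiles(event_log):
    """Batch step (stand-in for a Hadoop job): top interest per user."""
    counts = defaultdict(Counter)
    for e in event_log:
        counts[e["user"]][e["category"]] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


# Load step: publish the batch output into a key-value store (a dict stands in here).
profile_store = build_profiles(events)

# Hypothetical campaigns keyed by the interest they target.
campaigns = {"sports": "ad-101", "autos": "ad-202"}
DEFAULT_AD = "ad-000"


def choose_ad(user_id: str) -> str:
    """Serving step: two key lookups, no analysis on the request path."""
    interest = profile_store.get(user_id)
    return campaigns.get(interest, DEFAULT_AD)


print(choose_ad("u1"), choose_ad("u2"), choose_ad("unknown"))
```

Keeping all heavy computation in the batch step is what makes a response budget of tens of milliseconds plausible at serving time.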
                                                                22
Pattern 2 Case Study: Social gaming user analysis

• Tens to hundreds of millions of users
• Game optimization requirements
   – Keep game fresh and retain audience
   – Maximize revenue through offer and experience tuning
• Very different data management tasks
   – Serving game data
      •   System of record game data
      •   Very low latency data access
      •   Non-disruptive elasticity
      •   Complex queries
   – Analyzing user behavior
      • Not game data, rather user behavior data
      • High-throughput data analysis

                                                            23
Social Game: Game optimization via Hadoop

     1  User interacting with game
     2  Validation and response
     3  Game and user data system of record
     4  User behavioral data
     5  Insights
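
The analysis side of this pattern can be sketched as a map, shuffle, and reduce over exported behavior events. The example below runs in a single process to show the shape of the computation; it is not Hadoop code, and the event fields, sample data, and the quit-rate metric are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical behavior events exported from the game's operational store.
behavior_log = [
    {"user": "u1", "level": 3, "action": "quit"},
    {"user": "u2", "level": 3, "action": "quit"},
    {"user": "u3", "level": 3, "action": "complete"},
    {"user": "u1", "level": 1, "action": "complete"},
    {"user": "u4", "level": 1, "action": "complete"},
]


def map_event(event):
    """Map phase: emit (level, (attempts, quits)) for each event."""
    yield event["level"], (1, 1 if event["action"] == "quit" else 0)


def reduce_level(values):
    """Reduce phase: quit rate for one level."""
    attempts = sum(a for a, _ in values)
    quits = sum(q for _, q in values)
    return quits / attempts


def quit_rate_by_level(log):
    grouped = defaultdict(list)  # shuffle: group mapped values by key
    for level, value in chain.from_iterable(map_event(e) for e in log):
        grouped[level].append(value)
    return {level: reduce_level(vals) for level, vals in grouped.items()}


insights = quit_rate_by_level(behavior_log)
hardest = max(insights, key=insights.get)
print(f"Level {hardest} has the highest quit rate: {insights[hardest]:.0%}")
```

An insight such as "players abandon level 3" would then flow back to the game as a tuning change, which corresponds to step 5 on the slide.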




                                                                 24
Couchbase Sqoop connector for Cloudera




                                          Cloudera-certified connector
                                          Bi-directional data movement
                                            - Hadoop -> Couchbase
                                            - Couchbase -> Hadoop




       http://www.couchbase.com/develop/connectors/hadoop
                                                                         26

Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
