Why Every NoSQL Deployment Should be Paired With Hadoop

The terms NoSQL and Big Data are frequently conflated, and many view them as synonyms. That is understandable: both technologies eschew the relational data model and spread data across clusters of servers, whereas relational database technology favors centralized computing. But the problems these technologies address are quite different. Hadoop, the Big Data poster child, focuses on data analysis: gleaning insights from large volumes of data. NoSQL databases are transactional systems that deliver high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications is a key contributor to this growth.
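
To put the IDC figure in perspective, the doubling claim can be turned into a rough projection. The following is a minimal sketch, assuming growth is a smooth exponential that doubles exactly every two years; the function and constant names are illustrative and do not come from IDC or the slides.

```python
# Illustrative projection of the IDC estimate quoted above.
# Assumption: smooth exponential growth, doubling every two years.

BASE_YEAR = 2011
BASE_GIGABYTES = 1.8e12        # "1.8 trillion gigabytes" created in 2011
DOUBLING_PERIOD_YEARS = 2


def projected_gigabytes(year: int) -> float:
    """Return the projected volume of data created in `year`, in gigabytes."""
    elapsed_years = year - BASE_YEAR
    return BASE_GIGABYTES * 2 ** (elapsed_years / DOUBLING_PERIOD_YEARS)


if __name__ == "__main__":
    for year in (2011, 2013, 2015, 2021):
        # 1 zettabyte = 1e12 gigabytes in decimal units.
        zettabytes = projected_gigabytes(year) / 1e12
        print(f"{year}: about {zettabytes:.1f} ZB")
```

Under that assumption, ten years of growth corresponds to five doublings, a factor of 32 over the 2011 volume.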

These slides will address:

- Why NoSQL and Big Data are similar, but different
- The categories of NoSQL systems, and the types of applications for which they are best suited
- How Cloudera’s Distribution Including Apache Hadoop and Couchbase can be used together to build better applications
- Real-world use cases where NoSQL and Hadoop technologies work in concert

To view Couchbase webinars on-demand visit http://www.couchbase.com/webinars

Transcript

  • 1. Why every NoSQL deployment should be paired with Hadoop
      James Phillips, Co-founder and SVP Products, Couchbase
      Amr Awadallah, Co-founder and CTO, Cloudera
  • 2. Agenda
      • Big Audience vs. Big Data
      • NoSQL for Big Audience
      • Hadoop for Big Data
      • Big Audiences create and consume Big Data
          – NoSQL and Hadoop are highly synergistic
      • Couchbase + Cloudera
  • 3. Aren’t NoSQL, Hadoop, “Big Data” all the same? No.
  • 4. Two challenges at the data layer
      • “Big Audience.” Most new interactive software systems are accessed via browser with 2 billion potential users and a 24x7 uptime requirement.
      • “Big Data.” IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that it will double every two years.
  • 5. NoSQL for “Big Audience”
  • 6. Changes in interactive software – NoSQL driver
  • 7. Modern interactive software architecture
      • Application Scales Out: Just add more commodity web servers
      • Database Scales Up: Get a bigger, more complex server
      Note: Relational database technology is great for what it is great for, but it is not great for this.
  • 8. Extending the scope of RDBMS technology
      • Data partitioning (“sharding”)
          – Disruptive to reshard; impacts application
          – No cross-shard joins
          – Schema management at every shard
      • Denormalizing
          – Increases speed
          – At the limit, provides complete flexibility
          – Eliminates relational query benefits
      • Distributed caching
          – Accelerate reads
          – Scale out
          – Another tier, no write acceleration, coherency management
  • 9. Lacking market solutions, users forced to invent
      Bigtable (November 2006), Dynamo (October 2007), Cassandra (August 2008), Voldemort (February 2009)
      • No schema required before inserting data
      • No schema change required to change data format
      • Auto-sharding without application participation
      • Distributed queries
      • Integrated main memory caching
      • Data synchronization (mobile, multi-datacenter)
  • 10. NoSQL database matches application logic tier architecture
      Data layer now scales with linear cost and constant performance.
      • Application Scales Out: Just add more commodity web servers
      • Database Scales Out (NoSQL Database Servers): Just add more commodity data servers
      Scaling out flattens the cost and performance curves.
  • 11. Survey: Schema inflexibility #1 adoption driver
      What is the biggest data management problem driving your use of NoSQL in the coming year?
      • Lack of flexibility/rigid schemas: 49%
      • Inability to scale out data: 35%
      • High latency/low performance: 29%
      • Costs: 16%
      • All of these: 12%
      • Other: 11%
      Source: Couchbase NoSQL Survey, December 2011, n=1351
  • 12. Hadoop for “Big Data”
  • 13. The Problems with Current Data Systems
      1. Can’t Explore Original High Fidelity Raw Data
      2. Moving Data To Compute Doesn’t Scale
      3. Archiving = Premature Data Death
      [Diagram layers: BI Reports + Interactive Apps; RDBMS (aggregated data); ETL Compute Grid; Storage Only Grid (original raw data); Mostly Append; Collection; Instrumentation]
      ©2012 Cloudera, Inc. All Rights Reserved.
  • 14. The Solution: A Combined Storage/Compute Layer
      1. Data Exploration & Advanced Analytics
      2. Scalable Throughput For ETL & Aggregation
      3. Keep Data Alive For Ever
      [Diagram layers: BI Reports + Interactive Apps; RDBMS (aggregated data); Hadoop: Storage + Compute Grid; Mostly Append; Collection; Instrumentation]
  • 15. The Key Benefit: Agility/Flexibility
      Schema-on-Write (RDBMS):
      • Schema must be created before any data can be loaded.
      • An explicit load operation has to take place which transforms data to DB internal structure.
      • New columns must be added explicitly before new data for such columns can be loaded into the database.
      • Pros: Read is Fast; Standards/Governance
      Schema-on-Read (Hadoop):
      • Data is simply copied to the file store, no transformation is needed.
      • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
      • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
      • Pros: Load is Fast; Flexibility/Agility
  • 16. Scalability: Scalable Software Development
      Grows without requiring developers to re-architect their algorithms/application. AUTO SCALE
  • 17. Economics: Return on Byte
      • Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte
      • If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.
      [Chart labels: High ROB, Low ROB]
  • 18. Hadoop in the Enterprise Data Stack
      [Diagram labels]
      • Users: Engineers, Data Scientists, Analysts, Business Users, Data Architects, System Operators, Customers
      • Tools: IDEs, Modeling Tools, BI/Analytics, Enterprise Reporting, Meta Data/ETL Tools, Cloudera Manager
      • Interfaces: ODBC, JDBC, NFS, Sqoop, Flume
      • Connected systems: Enterprise Data Warehouse, Web/Mobile Applications, Relational Databases, Logs, Files, Web Data
  • 19. Big Audiences create and consume Big Data.
  • 20. Two peas. One pod. http://tinyurl.com/6tx42tw
  • 21. Hadoop as a Web application feeder or consumer
      • Pattern 1: Hadoop feeding a web application
      • Pattern 2: Hadoop consuming web application data
      [Diagram labels in both patterns: “big audience”, web application, “big data”, insights]
  • 22. Pattern 1 Case Study: AOL Ad Targeting
      • One of the largest online ad targeting operations
      • Ad slot filling optimization
          – Serve the most relevant ad to a given user
          – Meet contracted impression counts
      • Relevancy criteria
          – Demographic
          – Psychographic
          – Current behavioral
      • 40 milliseconds to fill all slots
  • 23. AOL Advertising: Hadoop as an ad targeting feeder
      40 milliseconds to respond with the decision.
      [Diagram labels: (1) events; (2) profiles, campaigns; (3) profiles, real time campaign statistics; affiliates]
  • 24. Pattern 2 Case Study: Social gaming user analysis
      • Tens to hundreds of millions of users
      • Game optimization requirements
          – Keep game fresh and retain audience
          – Maximize revenue through offer and experience tuning
      • Very different data management tasks
          – Serving game data
              • System of record game data
              • Very low latency data access
              • Non-disruptive elasticity
              • Complex queries
          – Analyzing user behavior
              • Not game data, rather user behavior data
              • High-throughput data analysis
  • 25. Social Game: Game optimization via Hadoop
      [Diagram labels: (1) User interacting with game; (2) Validation and response; (3) Game and user data system of record; (4) User behavioral data; (5) Insights]
  • 26. Couchbase and Cloudera
  • 27. Couchbase Sqoop connector for Cloudera
      • Cloudera-certified connector
      • Bi-directional data movement
          – Hadoop -> Couchbase
          – Couchbase -> Hadoop
      http://www.couchbase.com/develop/connectors/hadoop
  • 28. Questions?
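
Slide 8 calls resharding disruptive without saying why. One way to see it is to count how many keys change shards when a shard is added. The sketch below assumes simple hash-modulo placement, which the slides do not specify; `shard_for` and `moved_fraction` are hypothetical helper names used only for illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Place a key on a shard using hash-modulo (an assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


# Hypothetical key space: ten thousand user records.
keys = [f"user:{i}" for i in range(10_000)]

# Growing from 4 to 5 shards relocates most keys, because a key stays put
# only when its hash gives the same remainder for both shard counts.
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With uniform hashing, about four in five keys are expected to move in this case, which is why application-managed sharding makes growth painful and why slide 9 lists auto-sharding without application participation as a requirement.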
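
Slide 15 contrasts schema-on-write with schema-on-read, where raw data is stored untouched and columns are extracted only when the data is read. The sketch below illustrates that late-binding idea in plain Python. It is not the Hive SerDe API: `read_with_schema` and the sample log lines are assumptions made up for this example.

```python
import json

# Raw events are stored exactly as they arrived (here, JSON lines).
# Nothing is transformed or validated at load time.
RAW_LOG = [
    '{"user": "u1", "action": "login"}',
    '{"user": "u2", "action": "purchase", "amount": 4.99}',
]


def read_with_schema(raw_lines, columns):
    """Parse at read time and keep only the requested columns (late binding)."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Missing fields become None instead of failing the read.
        rows.append({column: record.get(column) for column in columns})
    return rows


# The first version of the reader only knows about two columns.
v1 = read_with_schema(RAW_LOG, ["user", "action"])

# Later the reader is updated to extract "amount". Data that was already
# stored gains the column retroactively, with no reload.
v2 = read_with_schema(RAW_LOG, ["user", "action", "amount"])

print(v1)
print(v2)
```

The trade-off matches the slide: loading is trivial and the schema can evolve freely, but every read pays the parsing cost that a schema-on-write system pays once at load time.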