Why and How to integrate Hadoop and NoSQL?


Learn why and how you can integrate Hadoop and NoSQL. This presentation shows some use cases, and concrete examples using Hadoop and Couchbase.

Published in: Technology

  1. Monday, June 10, 13
  2. Goto Night CPH, June 6th 2013. How to integrate Hadoop with your NoSQL database? Tugdual “Tug” Grall, Technical Evangelist
  3. About Me
     • Tugdual “Tug” Grall
       - Couchbase: Technical Evangelist
       - eXo: CTO
       - Oracle: Developer/Product Manager, mainly Java/SOA
       - Developer in consulting firms
     • Web
       - @tgrall
       - http://
       - tgrall
     • NantesJUG co-founder
     • Pet Project: http://www.resultri.com
  4. Big Data
     [Chart: Trillions of Gigabytes (Zettabytes), 2000 to 2011; Structured Data vs. Unstructured and Semi-Structured Data]
     • High Data Variety and Velocity
     • Unstructured and semi-structured data: text, log files, click streams, blogs, tweets, audio, video, etc.
     • More Flexible Data Model Required
     Source: IDC 2011 Digital Universe Study (http://-digital-universe-2011/index.htm)
  5. $30B Database Market Being Disrupted
     [Chart labels: Relational Technology, NoSQL Technology, Other; 2013, 2027; 95%, <50%?]
     • All new database growth will be NoSQL
  6. Operational vs. Analytic Databases
     • Analytic Databases (Cloudera, Hortonworks): get insights from data
     • Real-time, Interactive Databases (NoSQL: Couchbase, Mongo): fast access to data
  7. [Bar chart of survey responses]
     • Lack of flexibility/rigid schemas: 49%
     • Inability to scale out data: 35%
     • Performance challenges: 29%
     • Cost: 16%
     • All of these: 12%
     • Other: 11%
     Source: Couchbase Survey, December 2011, n = 1351.
  8. Hadoop
  9. What is Hadoop?
     • Highly scalable
     • Unstructured data
     • Open source
     • Big Data Operating System
     • Changing the World One Petabyte at a Time
  10. What is Hadoop?
     • Simplest unit of compute and storage
     [Diagram: CPU, Disks, Application, Data]
  11. What is Hadoop?
     • And when it grows?
     [Diagram: Application, Data]
  12. What is Hadoop?
     • And when it grows more?
  13. What is Hadoop?
     • NoSQL to the rescue
     [Diagram: Application, Data]
  14. What is Hadoop?
     • Hadoop is a different paradigm
     [Diagram: Application, Data]
  15. [Slide without text]
  16. Hadoop and NoSQL
  17. Ad and offer targeting
     [Diagram, steps 1 to 3: events; profiles, campaigns; profiles, real time campaign statistics]
     • 40 milliseconds to respond with the decision.
  18. Moving Parts
     [Diagram: Ad Targeting Platform, Logs, Couchbase Server Cluster, Hadoop Cluster; links labeled flume, flow, sqoop import, sqoop export]
  19. Content & Recommendation Targeting
     [Diagram, steps 1 to 3: events; user profiles; make recommendations; Content Oriented Site; Legacy Relational Database]
  20. Moving Parts
     [Diagram: Content Driven Web Site, Logs, Original RDBMS, Couchbase Server Cluster, Hadoop Cluster; links labeled flume, flow, sqoop import, sqoop export]
     • In order to keep up with changing needs for richer, more targeted content that is delivered to larger and larger audiences very quickly, data behind content driven sites is shifting to Couchbase.
     • Hadoop excels at complex analytics which may involve multiple steps of processing which incorporate a number of different data sources.
  21. What is Sqoop?
     Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
     Source: sqoop.apache.org
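The import and export steps described above can be sketched as two small helpers that assemble Sqoop command lines. This is only an illustration: the function names, the example connection string, table names and directories are assumptions made for this sketch, and only widely documented Sqoop options (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--username`, `--num-mappers`) are used. The transform step in between would be a separate MapReduce job and is not shown.

```python
from typing import List, Optional


def sqoop_import_args(connect: str, table: str, target_dir: str,
                      username: Optional[str] = None,
                      num_mappers: int = 4) -> List[str]:
    """Build the argument list for importing one table into HDFS."""
    args = ["sqoop", "import",
            "--connect", connect,
            "--table", table,
            "--target-dir", target_dir,
            "--num-mappers", str(num_mappers)]
    if username:
        args += ["--username", username]
    return args


def sqoop_export_args(connect: str, table: str, export_dir: str,
                      username: Optional[str] = None) -> List[str]:
    """Build the argument list for exporting an HDFS directory into a table."""
    args = ["sqoop", "export",
            "--connect", connect,
            "--table", table,
            "--export-dir", export_dir]
    if username:
        args += ["--username", username]
    return args


# Hypothetical values, not taken from the slides.
import_cmd = sqoop_import_args("jdbc:mysql://db.example.com/shop",
                               "customers", "/data/customers")
export_cmd = sqoop_export_args("jdbc:mysql://db.example.com/shop",
                               "customer_scores", "/data/scores",
                               username="etl")
print(" ".join(import_cmd))
print(" ".join(export_cmd))
```

On a machine where Sqoop is installed, either list could be passed to `subprocess.run(cmd, check=True)`. Building the arguments as a list avoids shell quoting problems.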
  22. What is Sqoop?
     • Traditional ETL
     [Diagram: Application, Data, Data, T]
  23. What is Sqoop?
     • A different paradigm
     [Diagram: Data, Application, Data]
  24. What is Sqoop?
     • A very scalable different paradigm
     [Diagram: repeated Application and Data pairs]
  25. What is Sqoop?
     • Where did the Transform go?
     [Diagram: Application, Data, and many T boxes]
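One way to read the slide above is that the transform no longer sits in a single ETL box but runs many times in parallel, once per chunk of data. The sketch below illustrates that idea with a thread pool standing in for the cluster. The record fields and the `transform` function are hypothetical, chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def transform(record: Dict) -> Dict:
    """The 'T' step: clean the name and compute a line total."""
    return {
        "customer": record["customer"].strip().title(),
        "total": round(record["qty"] * record["unit_price"], 2),
    }


def transform_partition(partition: List[Dict]) -> List[Dict]:
    """Apply the transform to every record of one partition."""
    return [transform(r) for r in partition]


def run_parallel(partitions: List[List[Dict]], workers: int = 4) -> List[List[Dict]]:
    """Run one transform task per partition; results keep partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform_partition, partitions))


partitions = [
    [{"customer": " ann ", "qty": 3, "unit_price": 48.0}],
    [{"customer": "BOB", "qty": 1, "unit_price": 39.0}],
]
result = run_parallel(partitions)
print(result)
```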
  26. What is Sqoop?
     • Sqoop: “SQL-Hadoop”
       - Default connection is via JDBC
     • Lots of custom connectors
       - Couchbase, VoltDB, Vertica
       - Teradata, Netezza
       - Oracle, MySQL, Postgres
  27. Sqoop: Import
         sqoop import --connect jdbc:mysql:// customers
  28. Sqoop: Export
         sqoop export --connect jdbc:mysql:// sales --export-dir /user/hive/warehouse/zip_profits --input-fields-terminated-by '\0001'
  29. Sqoop: Import
         sqoop import --connect http://localhost:8091/pools --table DUMP
  30. Sqoop: Import
     [Diagram: the Sqoop Client reads Metadata and Launches a MapReduce Job; three Map tasks each write to HDFS]
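The diagram shows several map tasks working side by side. For a source with a numeric key column (an assumption made for this sketch, and not how the Couchbase connector necessarily divides its work), one simple way to give each map task its own slice is to cut the key range into contiguous pieces. The helper `split_ranges` is hypothetical and only illustrates the idea; it is not Sqoop's own code.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into at most
    num_mappers contiguous, non-overlapping inclusive ranges."""
    total = max_id - min_id + 1
    if total <= 0 or num_mappers <= 0:
        return []
    count = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, count)   # the first `extra` ranges get one more key
    ranges = []
    start = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each tuple could become the WHERE clause of one map task,
# for example: id >= 1 AND id <= 4.
print(split_ranges(1, 10, 3))
```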
  31. Sqoop: Export
         sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir /user/hive/profiles/recommendation --username social
  32. Sqoop: Export
     [Diagram: the Sqoop Client reads Metadata and Launches a MapReduce Job; three Map tasks each read from HDFS]
  33. Demonstration
  34. Couchbase
  35. Couchbase Server Core Principles
     • Easy Scalability: grow cluster without application changes, without downtime, with a single click
     • Consistent High Performance: consistent sub-millisecond read and write response times with consistent high throughput
     • Always On 24x365: no downtime for software upgrades, hardware maintenance, etc.
     • Flexible Data Model: JSON document model with no fixed schema.
  36. Couchbase Handles Real World Scale
  37. Couchbase Server 2.0
     [Architecture diagram]
     • Data Manager
       - Query API (port 8092), Query Engine
       - Memcached with Couchbase EP Engine (port 11210, Memcapable 2.0)
       - Moxi (port 11211, Memcapable 1.0)
       - storage interface, New Persistence Layer
     • Cluster Manager (Erlang/OTP)
       - REST management API/Web UI (HTTP, port 8091)
       - Erlang port mapper (port 4369)
       - Distributed Erlang (ports 21100 - 21199)
       - Components labeled “on each node” and “one per cluster”: Heartbeat, Process monitor, Global singleton supervisor, Configuration manager, Rebalance orchestrator, Node health monitor, vBucket state and replication manager
  38. Couchbase Server 2.0
     [Same architecture diagram as slide 37, without the Data Manager and Cluster Manager labels]
  39. The Classic Order Entry Structure
     http:// [...] databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster.
  40. Aggregate by Comparison
     o::1001
     {
       uid: "ji22jd",
       customer: "Ann",
       line_items: [
         { sku: 0321293533, quan: 3, unit_price: 48.0 },
         { sku: 0321601912, quan: 1, unit_price: 39.0 },
         { sku: 0131495054, quan: 1, unit_price: 51.0 }
       ],
       payment: { type: "Amex", expiry: "04/2001", last5: 12345 }
     }
     • Easy to distribute data
     • Makes sense to application programmers
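To show why an aggregate "makes sense to application programmers", the order from the slide can be written as a single Python dictionary, stored under one key, and processed without any join. This is a minimal sketch: the SKUs are written as strings here (an assumption, since a leading zero is not valid in a JSON number), and `order_total` is a hypothetical helper, not part of any Couchbase API.

```python
import json

# The whole order is one aggregate, stored under a single key.
key = "o::1001"
order = {
    "uid": "ji22jd",
    "customer": "Ann",
    "line_items": [
        {"sku": "0321293533", "quan": 3, "unit_price": 48.0},
        {"sku": "0321601912", "quan": 1, "unit_price": 39.0},
        {"sku": "0131495054", "quan": 1, "unit_price": 51.0},
    ],
    "payment": {"type": "Amex", "expiry": "04/2001", "last5": 12345},
}


def order_total(doc: dict) -> float:
    """Sum quantity times unit price over the embedded line items."""
    return sum(item["quan"] * item["unit_price"] for item in doc["line_items"])


# The serialized JSON string is what a document store would hold as the value.
value = json.dumps(order)
print(key, order_total(json.loads(value)))
```

Because everything the application needs for one order lives in one document, the document can be placed on any server as a unit, which is the "easy to distribute" point from the slide.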
  41. Basic Operations
     [Diagram: two app servers, each with a Couchbase client library and cluster map, performing READ/WRITE/UPDATE against a Couchbase Server cluster of three servers, each holding active and replica docs; user configured replica count = 1]
     • Docs distributed evenly across servers
     • Each server stores both active and replica docs
       - Only one server active at a time
     • Client library provides app with simple interface to database
     • Cluster map provides map to which server doc is on
       - App never needs to know
     • App reads, writes, updates docs
     • Multiple app servers can access same document at same time
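The role of the cluster map can be sketched in a few lines: hash the document key to a bucket number, then look up which server holds the active copy and which holds the replica. This is a simplified illustration, not the client library's real algorithm. The bucket count of 1024, the plain CRC32 hash, the round-robin placement and all function names are assumptions made for this sketch.

```python
import zlib
from typing import Dict, List

NUM_VBUCKETS = 1024  # assumed bucket count for this sketch


def vbucket_for(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Hash a document key to a bucket number (simplified)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_cluster_map(servers: List[str], num_vbuckets: int = NUM_VBUCKETS,
                      replicas: int = 1) -> List[Dict]:
    """Assign every bucket an active server and replica servers, round-robin."""
    cluster_map = []
    for vb in range(num_vbuckets):
        active = vb % len(servers)
        replica_servers = [servers[(active + r) % len(servers)]
                           for r in range(1, replicas + 1)]
        cluster_map.append({"active": servers[active], "replicas": replica_servers})
    return cluster_map


def locate(key: str, cluster_map: List[Dict]) -> Dict:
    """Return the map entry (active and replica servers) for a document key."""
    return cluster_map[vbucket_for(key, len(cluster_map))]


servers = ["server1", "server2", "server3"]
cluster_map = build_cluster_map(servers, replicas=1)
entry = locate("o::1001", cluster_map)
print(entry)
```

The application only calls something like `locate` indirectly through the client library, which is why the slide says the app never needs to know where a document lives.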
  42. Indexing
     [Diagram: two app servers, each with a Couchbase client library and cluster map, sending a Query to a Couchbase Server cluster of three servers, each holding active and replica docs]
     • Indexing work is distributed amongst nodes
     • Large data set possible
     • Parallelize the effort
     • Each node has index for data stored on it
     • Queries combine the results from required nodes
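The last two bullets describe a scatter-gather pattern: each node indexes only its own documents, and a query merges the partial results. A minimal sketch of that pattern follows, with in-memory dictionaries standing in for nodes. The field name, the document ids and the functions `build_index` and `query` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

IndexEntry = Tuple[int, str]  # (indexed value, document id)


def build_index(docs: Dict[str, dict], field: str) -> List[IndexEntry]:
    """Build one node's sorted index over the documents it stores."""
    return sorted((doc[field], doc_id) for doc_id, doc in docs.items() if field in doc)


def query(node_indexes: List[List[IndexEntry]], start: int, end: int) -> List[IndexEntry]:
    """Ask every node for its matching entries, then merge the sorted partial results."""
    partials = [[entry for entry in index if start <= entry[0] <= end]
                for index in node_indexes]
    return list(heapq.merge(*partials))


node1_docs = {"u1": {"age": 30}, "u2": {"age": 25}}
node2_docs = {"u3": {"age": 27}, "u4": {"age": 41}}
indexes = [build_index(node1_docs, "age"), build_index(node2_docs, "age")]
print(query(indexes, 26, 40))
```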
  43. Demonstration
  44. Map Reduce ... ≠ Map Reduce ...
     First column:
     • Deal with “Big Data”
     • “More” is better than “Faster”
     • Batch Oriented
     • Usually used to “extract/transform” data
     • Fully distributed
       - Map, Shuffle, Reduce
     Second column:
     • Distributed
     • Executed where the document is
     • Deal with “indexing” data
     • As fast as possible
     • Use to query the data in the Database
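The "Map, Shuffle, Reduce" sequence named above can be illustrated with a tiny in-memory word count. This is a single-process sketch of the three phases only; it does not use Hadoop or Couchbase, and the function names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word of one document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(docs: List[str]) -> Dict[str, int]:
    """Chain the three phases over a list of documents."""
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data big users", "big data"])
print(counts)
```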
  45. Conclusion
     • Big Data and Big Users working together
     • Use Hadoop to store “everything”
       - Batch oriented
       - Complex data processing
     • MapReduce
     • Expose a subset of the dataset to your application
       - Real time analytics
       - Low latency
       - Simple data interactions and queries
  46. Q&A. We’re Hiring!
  47. Q&A