(Tugdual grall) no sql-hadoop

Published in: Technology


Transcript

  • 1. NoSQL & BigData: Why Every NoSQL Deployment Should be Paired with Hadoop. Tugdual Grall, Couchbase, @tgrall
  • 2. About Me
      • Tugdual "Tug" Grall
          - Couchbase: Technical Evangelist
          - eXo: CTO
          - Oracle: Developer/Product Manager, mainly Java/SOA
          - Developer in consulting firms
      • Web
          - @tgrall
          - http://blog.grallandco.com
          - tgrall
      • NantesJUG co-founder
      • Pet project: http://www.resultri.com
  • 3. Big Data: High Data Variety and Velocity
    [Chart: data volume in trillions of gigabytes (zettabytes), axis from 0 to 2.00, for the years 2000, 2006 and 2011. Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)]
    More Flexible Data Model Required
    Notes:
      • When people talk about Big Data, they usually mean capturing huge amounts of data and analyzing it. That use of the term is certainly a big trend.
      • Big Data also affects operational databases in a big way, but for a different set of reasons.
      • Two aspects of Big Data are pushing people toward NoSQL technologies.
      • The first is that the vast majority of the increase in data is unstructured or semi-structured: user-generated content such as consumer recommendations, and machine-generated data such as log files and website click data. Relational databases are not well suited to storing this type of data, while NoSQL technologies such as document-oriented databases are.
      • The second is that application developers keep finding new types of data they want to store: new information in a user's account profile, new logging information, and so on. Both what developers want to store and how much of it they want to store are changing very rapidly, so developers want a flexible data model that they can evolve quickly.
      • Relational databases have fixed schemas that often take weeks or months to change. NoSQL databases, on the other hand, are schema-less, so you can add new types of data far more easily and iterate quickly on your application.
  • 5. Big Data: High Data Variety and Velocity (build of slide 3): the chart adds a "Structured Data" series.
  • 6. Big Data: High Data Variety and Velocity (build of slide 3): the chart adds an "Unstructured and Semi-Structured Data" series (text, log files, click streams, blogs, tweets, audio, video, etc.) alongside "Structured Data".
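To make the "schema-less" point from the Big Data notes concrete, here is a minimal sketch in plain Python. It does not use any database client; the `profiles` dictionary, the `user::` key format, and the `preferences` field are all hypothetical stand-ins for documents in a document store.

```python
import json

# Hypothetical user-profile documents, keyed the way a document store might key them.
profiles = {
    "user::1001": {"name": "Alice", "email": "alice@example.com"},
}

# A later application release starts recording preferences. No schema change
# is needed: the new document simply carries an extra field.
profiles["user::1002"] = {
    "name": "Bob",
    "email": "bob@example.com",
    "preferences": {"newsletter": True},
}


def newsletter_opt_in(doc):
    """Read the new field defensively, since older documents lack it."""
    return doc.get("preferences", {}).get("newsletter", False)


for key, doc in profiles.items():
    print(key, json.dumps(doc, sort_keys=True), newsletter_opt_in(doc))
```

The trade-off is that the application, rather than the database, becomes responsible for handling documents of different shapes.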
  • 7. Operational vs. Analytic Databases
      • Analytic databases: get insights from data (Cloudera, Hortonworks)
      • Real-time, interactive databases (NoSQL): fast access to data (Couchbase, Mongo)
    Notes:
      • There are two types of databases, each focused on a very different problem.
      • Analytic databases were referred to in the past as OLAP databases. They look through every record in a huge database to answer a question or gain an insight about the data it contains. These analyses are batch processes that access every piece of data in the database, are very read-heavy, and produce results in seconds, minutes, or sometimes days. For analytic databases, "real time" means an analysis takes a few seconds to run.
      • Real-time interactive databases are often referred to as operational databases. They store a lot of data, but usually much less than an analytic database.
      • They must provide access to individual records in milliseconds so that users of an application get good response times.
      • Since the requirements of the two types are very different, their architectures and capabilities are very different as well.
      • When I refer to NoSQL in this presentation, I am referring to real-time, interactive databases. This is the type of NoSQL database Couchbase provides.
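The difference between the two access patterns can be sketched in a few lines of Python. This is only an illustration: the `events` list and its field names are hypothetical, and an in-memory list stands in for a real database.

```python
from collections import Counter

# Hypothetical click events; in practice these would live in a database.
events = [
    {"id": "e1", "user": "alice", "page": "/home"},
    {"id": "e2", "user": "bob", "page": "/cart"},
    {"id": "e3", "user": "alice", "page": "/cart"},
]

# Operational access: fetch one record by key, which must be fast.
by_id = {e["id"]: e for e in events}


def get_event(event_id):
    return by_id[event_id]


# Analytic access: scan every record to answer a question about the whole set.
def views_per_page(all_events):
    return Counter(e["page"] for e in all_events)


print(get_event("e2"))
print(views_per_page(events))
```

The key lookup touches one record regardless of data size, while the aggregate has to read all of them, which is why the two workloads tend to end up on different systems.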
  • 8. [Survey chart. Source: Couchbase Survey, December 2011, n = 1351.]
      • 49%: Lack of flexibility/rigid schemas
      • 35%: Inability to scale out data
      • 29%: Performance challenges
      • 16%: Cost
      • 12%: All of these
      • 11%: Other
  • 9. NoSQL catalog
    [Grid: products arranged by data model, in two rows: Cache (memory only) and Database (memory/disk)]
      • Key-Value: Memcached, Coherence, Membase
      • Data Structure: Redis
      • Document: Couchbase, MongoDB
      • Column: Cassandra, HBase
      • Graph: Neo4j, InfiniteGraph
  • 10. Use Cases
      • Key Value
          - Session Management
          - User Profile/Preferences
          - Shopping Cart
      • Document
          - Event Logging
          - Content Management
          - Web Analytics
          - E-Commerce Application
      • Columns
          - Event Logging
          - Content Management
          - Counters
      • Graph
          - Connected Data / Social Networks
          - Routing, Dispatch
          - Recommendations based on Social Graph
  • 11. Hadoop
  • 12. What is Hadoop?
      • Highly scalable
      • Unstructured data
      • Open source
      • Big Data Operating System
      • Changing the World One Petabyte at a Time
  • 13. What is Hadoop?
      • Simplest unit of compute and storage
    [Diagram: disks, CPU, application, data]
  • 14. What is Hadoop?
      • And when it grows?
    [Diagram: application, data]
  • 15. What is Hadoop?
      • And when it grows more?
  • 16. What is Hadoop?
      • NoSQL to the rescue
    [Diagram: application, data]
  • 17. What is Hadoop?
      • Hadoop is a different paradigm
    [Diagram: application, data]
  • 18. Hadoop is not a "NoSQL database" but rather a set of tools for working with Big Data: the ultimate Swiss Army knife for dealing with very large volumes of data.
      • Oozie: Workflow, coordination
      • Sqoop: Data connector to import/export data
      • Hive: SQL-like interface
      • Pig: High-level programming language
      • Mahout: Machine learning library
      • Whirr: Hadoop management tools for cloud services
      • Flume: Aggregator
      • MapReduce: Framework to process large volumes of data
      • HBase: Key-value data store
      • ZooKeeper: Centralized configuration management
      • HDFS: Distributed file system
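Since MapReduce underpins most of the tools listed above, a small single-process sketch of the idea may help. This is plain Python, not the Hadoop API: the function names are hypothetical, and a real job would run the map and reduce phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is where the scalability comes from.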
  • 19. Hadoop and NoSQL
  • 20. Ad and offer targeting
      • 40 milliseconds to respond with the decision.
    [Diagram, numbered flows: (1) events; (2) profiles, campaigns; (3) profiles, real-time campaign statistics]
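  • 22. Moving Parts
    [Diagram: an ad targeting platform backed by a Couchbase Server cluster; application logs reach the Hadoop cluster through a Flume flow; Sqoop import and Sqoop export move data between the Couchbase cluster and the Hadoop cluster]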
  • 23. Content & Recommendation Targeting
    [Diagram: a content-oriented site and a legacy relational database, with numbered flows: (1) events; (2) user profiles; (3) make recommendations]
  • 25. Moving Parts
      • To keep up with changing needs for richer, more targeted content delivered very quickly to larger and larger audiences, the data behind content-driven sites is shifting to Couchbase.
      • Hadoop excels at complex analytics, which may involve multiple processing steps that incorporate a number of different data sources.
    [Diagram: a content-driven web site, the original RDBMS and a Couchbase Server cluster; logs reach the Hadoop cluster through a Flume flow; Sqoop import brings data into the Hadoop cluster and Sqoop export sends results back]
  • 26. Sqoop: What is this?
  • 27. What is Sqoop?
      • "Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS." (sqoop.apache.org)
  • 28. What is Sqoop?
      • Traditional ETL
    [Diagram: application, data, a single transform (T) step, data]
  • 29. What is Sqoop?
      • A different paradigm
    [Diagram: application, data, data]
  • 30. What is Sqoop?
      • A very scalable different paradigm
    [Diagram: several application/data pairs feeding one data store]
  • 31. What is Sqoop?
      • Where did the Transform go?
    [Diagram: many transform (T) steps running in parallel; application, data]
  • 32. What is Sqoop?
      • Sqoop: "SQL-Hadoop"
          - Default connection is via JDBC
      • Lots of custom connectors
          - Couchbase, VoltDB, Vertica
          - Teradata, Netezza
          - Oracle, MySQL, Postgres
  • 33. Sqoop: Import
  • 34. Sqoop: Import
    sqoop import --connect jdbc:mysql://rdbms1.demo.com/CRM --table customers
  • 35. Sqoop: Export
  • 36. Sqoop: Export
    sqoop export --connect jdbc:mysql://rdbms1.demo.com/ANALYTICS --table sales --export-dir /user/hive/warehouse/zip_profits --input-fields-terminated-by '\0001'
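The delimiter passed to `--input-fields-terminated-by` in the export above most likely denotes the Ctrl-A control character (octal 001), which Hive uses as its default field separator for text tables. The sketch below shows what one such row looks like and how it splits into fields. The column names `zip` and `profit` and the sample values are hypothetical, chosen only to match the `zip_profits` directory name.

```python
FIELD_DELIMITER = "\x01"  # Ctrl-A, octal 001


def parse_row(line, columns):
    """Split one delimited text row into a dict keyed by column name."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


# Hypothetical row as it might appear in a file under the export directory.
sample = "94105\x011250.75\n"
row = parse_row(sample, ["zip", "profit"])
print(row)
```

If the delimiter given to the export does not match the one used when the files were written, each row is read as a single field and the export fails or loads garbage, so it is worth checking the table's storage format first.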
  • 37. Sqoop: Import
  • 38. Sqoop: Import
    sqoop import --connect http://localhost:8091/pools --table DUMP
  • 39. Sqoop: Import
    [Diagram: the Sqoop client reads metadata and launches a MapReduce job; each of its map tasks writes to HDFS]
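The import diagram shows several map tasks working in parallel. One common way to parallelize such an import is to give each map task a disjoint range of a key column. The sketch below illustrates that idea only; it is a simplified assumption, not Sqoop's actual implementation, and `split_ranges` is a hypothetical helper.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks,
    at most one per mapper, with sizes differing by no more than one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than rows: leave this mapper idle
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each tuple is the key range one map task would read and write to HDFS.
print(split_ranges(1, 10, 3))
```

Even splits like this work well when keys are uniformly distributed; skewed keys leave some map tasks with far more rows than others.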
  • 41. Sqoop: Export
  • 42. Sqoop: Export
    sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir /user/hive/profiles/recommendation --username social
  • 43. Sqoop: Export
    [Diagram: the Sqoop client reads metadata and launches a MapReduce job; each of its map tasks reads from HDFS]
  • 45. Demonstration
  • 46. NoSQL & BigData: Why Every NoSQL Deployment Should be Paired with Hadoop. Tugdual Grall, Couchbase, @tgrall. Q&A
