
Roundtable 2: Big Data Analytics and NoSQL


Published on

Slides from the Live Webcast on Mar.14, 2012

Watch this Database Revolution roundtable to learn from four of the best minds in the business: Mark Madsen of Third Nature, Robin Bloor of The Bloor Group, Colin White of BI Research and Steve Dine of Datasource Consulting. Each will present their thoughts on what’s happening in the NoSQL space, followed by an extended Q&A in which you can pose your detailed questions.

For more information visit:

Watch this and the entire series at :

Published in: Technology, Business

Roundtable 2: Big Data Analytics and NoSQL

  1. 1. DB Revolution: 2nd Roundtable. Wednesday, March 14, 2012
  2. 2. Eric Kavanagh. Twitter Tag: #briefr
  3. 3. • To conduct an Open Research program that invites the participation of both IT users and technology vendors • To assist IT buyers in understanding database technology and the architecture that surrounds it • To allow audience members to pose serious questions... and get answers! • To publish all findings
  4. 4. Your Host: Eric Kavanagh • Research Leader: Mark Madsen - Third Nature • Primary Collaborator: Robin Bloor - The Bloor Group • Guest Analyst 1: Colin White - BI Research • Guest Analyst 2: Steve Dine - Datasource Consulting
  5. 5. Colin White is the president of DataBase Associates Inc. and founder of BI Research. He is well known for his in-depth knowledge of data management, information integration, and business intelligence technologies. He has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. For ten years he was the conference chair of the DCI and Shared Insights Portals, Content Management, and Collaboration conference.
  6. 6. Big Data is Bigger than NoSQL. Colin White, President, BI Research. March 2012
  7. 7. What is Big Data? A term that represents workloads and data management solutions that could not previously be supported because of cost considerations and/or technology limitations. Three important technologies: • Optimized analytic RDBMSs • Non-relational “NoSQL” systems • Stream processing systems. Copyright © BI Research, 2012
  8. 8. Big Data: The Business Case. Smarter Decisions: • Analyze new sources of data, e.g., sensor data, web content, system logs, text, XML files, graph data, map data, etc. • More sophisticated analyses - advanced analytics. Faster Decisions: • Supports workloads that were difficult to implement previously in a timely or cost-effective manner • Faster data analysis, e.g., analysis of large detailed data stores, dramatic increase in analytic model execution. Faster Time to Value: • Analyze data that is outside of the enterprise data warehouse, e.g., machine-generated data such as sensor data
  9. 9. Non-Relational Solutions. Some organizations have developed their own non-relational (NoSQL) systems to support extreme workloads: • Google: MapReduce + BigTable DBMS + Google File System. Non-relational systems are not new, but modern versions are often available to the open source community: • Often support commodity hardware in a large-scale distributed computing environment • Several types of data stores (key value, graph, document, indexed file/DB systems) • A key vendor focus area is the Hadoop distributed computing system
  10. 10. Hadoop versus an RDBMS. This debate is reminiscent of the object versus relational database debates of the 1980s, and the reasons are similar: • Programmers prefer procedural programmatic approaches for accessing and manipulating data, e.g., MapReduce • Non-programmers prefer declarative languages, e.g., RDBMSs and SQL. Adding the Hive SQL-like language to Hadoop, and MapReduce (MR) functions to RDBMSs, however, complicates the debate. Key requirements are: • The ability for organizations to easily analyze large volumes of multi-structured data with good price/performance • The need to make technologies for developing and running these analyses more usable by data scientists. Organizations will likely use both Hadoop and an RDBMS - the challenges are deciding which to use when, and interconnecting the systems
  11. 11. [No slide text]
  12. 12. The Value of Big Data: McKinsey Report. Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
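The procedural-versus-declarative contrast drawn on slide 10 can be illustrated with a minimal Python sketch. It is not taken from the slides: the sample readings, the table name `readings`, and both function names are assumptions made for illustration, and an in-memory SQLite database stands in for an RDBMS.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (sensor_id, reading) pairs.
READINGS = [("s1", 10), ("s2", 5), ("s1", 20), ("s3", 7), ("s2", 15)]


def total_procedural(rows):
    """Procedural style: the programmer spells out how to group and sum."""
    totals = defaultdict(int)
    for sensor_id, reading in rows:
        totals[sensor_id] += reading
    return dict(totals)


def total_declarative(rows):
    """Declarative style: state what is wanted in SQL; the engine decides how."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (sensor_id TEXT, reading INTEGER)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT sensor_id, SUM(reading) FROM readings GROUP BY sensor_id"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()
```

Both functions compute the same per-sensor totals; the difference is only in who decides the execution strategy, which is the point of the debate the slide describes.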
  13. 13. Robin Bloor is Chief Analyst at The Bloor Group.
  14. 14. The Hardware Landscape: • CPUs go multicore • Memory/disk cost ratio falls • Speed of random reads lags speed of serial reads • Faster networking and fast switches • Parallelism becomes more important • Commodity servers • Cloud computing cuts H/W costs
  15. 15. That MapReduce Thing. There are two fundamental approaches to parallelism: • Data partitioning • Process partitioning. MapReduce implements an approach that is oriented to data partitioning. This relates to data processing rather than to databases. Hadoop is often used for ETL.
  16. 16. The Devil Is In The Workload. NoSQL is a distraction. Big Data can be big structured data or big unstructured data. Unstructured workloads are rarely suited to traditional RDBMS-type engines. Analytical workloads span both. [Diagram: data volume plotted against structure, from more structured to less structured; labels: RDBMS Database, ODBMS Database, Table Store, Column Store, Document Store, XML]
  17. 17. If you don’t know the expected workloads, you shouldn’t be selecting a database
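The data-partitioning approach to parallelism named on slide 15 can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not Hadoop's API: the function names and the word-count task are hypothetical, and the map calls run sequentially here where a real system would run them in parallel on separate nodes.

```python
from collections import Counter
from functools import reduce


def partition(records, n_parts):
    """Split the data into n_parts chunks (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_phase(chunk):
    """Count words within one partition, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_phase(partials):
    """Merge the per-partition counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(records, n_parts=3):
    chunks = partition(records, n_parts)
    # Each map call touches only its own chunk, so the calls could be
    # distributed across workers; they run one after another in this sketch.
    partials = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(partials)
```

Because each partition is processed without reference to the others, the result does not depend on how many partitions are used, which is what makes the approach scale out on commodity servers.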
  18. 18. Steve Dine is the founder of Datasource Consulting, LLC. He has extensive experience delivering and managing successful, highly scalable and maintainable data integration and business intelligence solutions. Steve combines hands-on technical experience across the entire BI project lifecycle with strong business acumen. He currently works as a consultant for Fortune 500 companies. Steve is a faculty member at TDWI and a judge for the Annual TDWI Best Practices Awards. He teaches courses and presents on many BI topics. Contact info: Twitter: @steve_dine Email: Web: http://
  19. 19. The State of NoSQL & BI: From the trenches… “Hey Bob, seems like a no brainer. So, what’s the catch?” * Graphic from http://-reviews/c-s-lewis/page/2/ Confidential, Datasource Consulting, LLC
  20. 20. Why NoSQL? • More data • More different types of data (semi-structured, unstructured) • More frequent changes to the structure of the data we need to store and analyze • More demand for long-tail analysis • More “affordable” commodity hardware available (blade servers, “cheap” storage, cloud) • More buzz! * Graphic from http://-on-nosql/
  21. 21. Why Not NoSQL? • Relatively immature (0.x – 2.x) • Difficult to describe to decision makers • Not fit for purpose (low latency, update heavy, complex joins) • In many organizations it’s a solution looking for a problem • Lack of “BI” support • Skills gap! * Graphic based on http://-on-nosql/
  22. 22. BI-NoSQL Skills Gap. “SQL” Skills: • GUIs (mostly) • Relational data modeling • RDBMS • SQL • Stored procedures • LDAP • JavaScript • Batch/shell scripts. NoSQL Skills: • Command line • Key-value / column family modeling • Distributed data store • Programming (Java, JScript, Python, etc.) • MapReduce (Hive) • JSON • Shell scripts. * Graphic based on http://-soa-chasm/
  23. 23. Conclusions? • Best to evaluate your true data size, data growth, data formats, data structure and analytic requirements before deciding on a solution • Make sure to evaluate your available skills • Experienced NoSQL resources with BI experience are not always easy to find • Need to plan for additional technology risk in the project plan • Consider starting out with one part of your DW architecture (e.g., staging) • POC POC POC • NoSQL is maturing quickly and will likely continue to evolve into a hybrid solution
  24. 24. Mark Madsen is founder of Third Nature, a research and consulting firm focused on analytics, BI and decision-making. Mark spent the past two decades working on analysis and decision support in many industries and countries. He is an award-winning architect and former CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institution. He is an international speaker, a contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://
  25. 25. One Size Doesn’t Fit All: Choosing which big data, NoSQL or database technology to use. March 14, 2012. Mark R. Madsen. http://ThirdNature.net
  26. 26. [No slide text]
  27. 27. Big data? Unstructured data isn’t really unstructured. The problem is that this data is unmodeled. The real challenge is complexity.
  28. 28. The holy grail of databases under current market hype. A key problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, a potential mismatch for both relational and NoSQL.
  29. 29. Solving the Problem Depends on the Diagnosis
  30. 30. You must understand your workload - throughput and response time requirements aren’t enough. ▪ 100 simple queries accessing month-to-date data ▪ 90 simple queries accessing month-to-date data plus 10 complex queries using two years of history ▪ Hazard calculation for the entire customer master ▪ Performance problems are rarely due to a single factor.
  31. 31. Workload: one big query or many small queries? Retrieval: small return set or large? Selectivity: large volume of data scanned or small?
  32. 32. Important workload parameters to know: • Read-intensive vs. write-intensive
  33. 33. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data
  34. 34. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency
  35. 35. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long access latency
  36. 36. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long access latency • Predictable vs. unpredictable data access patterns
  37. 37. Types of workloads. Write-biased: ▪ OLTP ▪ OLTP, batch ▪ OLTP, lite ▪ Object persistence ▪ Data ingest, batch ▪ Data ingest, real-time. Read-biased: ▪ Query ▪ Query, simple retrieval ▪ Query, complex ▪ Query, hierarchical / object / network ▪ Analytic. Mixed? Inline analytic execution, operational BI
  38. 38. Matching to parameters, at assumption of data scale. [Chart; cell values not present in the transcript. Columns (workload parameters): Write-biased | Read-biased | Updateable data | Eventual consistency ok | Unpredictable query path | Compute intensive. Rows: Standard RDBMS | Parallel RDBMS | NoSQL (kv, dht, obj) | Hadoop* | Streaming database] You see the problem: it’s an intersection of multiple parameters, and this chart only includes the first tier of parameters. Plus, workload factors can completely invert these general rules of thumb.
  39. 39. Matching to parameters, at assumption of data scale. [Chart; cell values not present in the transcript. Columns (workload parameters): Complex queries | Selective queries | Low latency queries | High concurrency | High ingest rate. Rows: Standard RDBMS | Parallel RDBMS | NoSQL (kv, dht, obj) | Hadoop | Streaming database] You have to look at the combination of workload factors: data scale, concurrency, latency & response time, then chart the parameters.
  40. 40. Always build a proof of concept!
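The idea of charting workload parameters before picking an engine (slides 32 to 39) can be expressed as a small data structure. The sketch below is an assumption-laden illustration, not the speaker's method: the class, the function names, and the candidate names `engine_a` and `engine_b` are hypothetical, and the capability values are placeholders because the ratings in the original charts did not survive in the transcript. A `True` in a workload means the workload demands that property; a `True` in a candidate means the engine is believed to handle it.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkloadProfile:
    # One flag per parameter pair listed on slides 32-36.
    write_intensive: bool
    mutable_data: bool
    immediate_consistency: bool
    short_latency: bool
    unpredictable_access: bool


def mismatches(required, offered):
    """Names of parameters the workload demands but the candidate lacks."""
    return [
        f.name
        for f in fields(required)
        if getattr(required, f.name) and not getattr(offered, f.name)
    ]


def shortlist(required, candidates):
    """Sorted names of candidates with no mismatches against the workload."""
    return sorted(
        name
        for name, offered in candidates.items()
        if not mismatches(required, offered)
    )


# Placeholder capability data; replace with findings from your own evaluation.
CANDIDATES = {
    "engine_a": WorkloadProfile(True, True, True, True, True),
    "engine_b": WorkloadProfile(False, True, True, True, False),
}
```

A shortlist produced this way only narrows the field. As the slides stress, workload factors can invert general rules of thumb, so a proof of concept is still needed for whatever survives the filter.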
  41. 41. Dissection & Discussion
  42. 42. [No slide text]
  43. 43. March: Vendor Research; March 14th: Second Round Table focusing on NoSQL databases and their application; DB Revolution Survey conducted • April: Vendor Research; Publishing of Round Table Transcripts, with comments • May: Authoring of White Paper; Publishing of White Paper; Publishing of survey activity
  44. 44. March Briefing Room: Integration • April Briefing Room: Discovery • May Briefing Room: Analytics
  45. 45. Thank You For Your Attention