Big Data

High-level business/tech presentation on Big Data for SAI - presented April 7th 2011

Presentation by Wim Van Leuven and Steven Noels.

  1. 1. Big Data Steven Noels & Wim Van Leuven SAI, 7 April 2011
  2. 2. Hello <ul><li>Steven Noels </li></ul><ul><ul><li>Outerthought </li></ul></ul><ul><ul><li>scalable content apps </li></ul></ul><ul><ul><li>Lily repository </li></ul></ul><ul><ul><ul><li>smart data, at scale, made easy </li></ul></ul></ul><ul><ul><ul><li>HBase / SOLR </li></ul></ul></ul><ul><ul><li>NoSQLSummer </li></ul></ul><ul><li>@stevenn </li></ul><ul><li>[email_address] </li></ul><ul><li>Wim Van Leuven </li></ul><ul><ul><li>Lives software, applications  and development group dynamics </li></ul></ul><ul><ul><li>Cloud and BigData enthusiast </li></ul></ul><ul><ul><li>wannabe entrepreneur  using all of the above </li></ul></ul><ul><li>  </li></ul><ul><li>@wimvanleuven </li></ul><ul><li>[email_address] </li></ul>
  3. 3. Agenda <ul><ul><li>Big Data and its Challenges </li></ul></ul><ul><ul><li>Data Systems </li></ul></ul><ul><ul><li>Market classification </li></ul></ul><ul><ul><li>Parting thoughts </li></ul></ul><ul><ul><li>Announcement </li></ul></ul>
  4. 4. Houston, we have a problem. IDC says the Digital Universe will be 35 zettabytes by 2020. 1 zettabyte = 1,000,000,000,000,000,000,000 bytes, or 1 billion terabytes
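The unit arithmetic on this slide can be checked in a few lines. This is a minimal sketch that assumes decimal (SI) units, which is what the figures on the slide imply; the variable names are illustrative only.

```python
# Decimal (SI) units, as implied by the figures on the slide.
TERABYTE = 10**12
ZETTABYTE = 10**21

# How many terabytes fit in one zettabyte?
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# The IDC projection quoted on the slide: 35 zettabytes by 2020.
projected_2020_bytes = 35 * ZETTABYTE
projected_2020_terabytes = projected_2020_bytes // TERABYTE

print(terabytes_per_zettabyte)    # one billion
print(projected_2020_terabytes)   # 35 billion terabytes
```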
  5. 5. We're drowning in a sea of data.
  6. 6. The fire hose of social and attention data.
  7. 7. We regard content as cost .
  8. 8. ... but data is an opportunity !
  9. 9. Think about it ...
  10. 10. advertisements
  11. 11. recommendations
  12. 12. profile data
  13. 13. anything that sells
  14. 14. The future is for data nerds.
  15. 15. Houston, we have a problem <ul><ul><li>Data volume vs. Moore's law </li></ul></ul><ul><ul><li>Enterprise (tools) focus on 20% of (structured, transactional) data </li></ul></ul><ul><ul><li>Pure-play data ventures (FB, LinkedIn, Google) focus on monetizing 80% of organic, semi-structured, variable, time-sensitive, social ... data </li></ul></ul>
  16. 16. The incumbents view
  17. 17. Issues with incumbents <ul><ul><li>Batch-oriented </li></ul></ul><ul><ul><ul><li>turn-around times of hours rather than minutes </li></ul></ul></ul><ul><ul><ul><li>batches cannot be interrupted or reconfigured on-the-fly </li></ul></ul></ul><ul><ul><li>Schema management required in multiple places </li></ul></ul><ul><ul><li>Lots of data being shuffled around </li></ul></ul><ul><ul><li>Data duplication > where's the master copy ? </li></ul></ul><ul><ul><li>$$$ </li></ul></ul>
  18. 18. A different approach: (big) data systems real time !
  19. 19. What is a Data System? (Nathan Marz)
  20. 20. What is a Data System? (Nathan Marz)
  22. 22. Essential properties of a Data System <ul><ul><li>robustness (against machine AND operation malfunction) </li></ul></ul><ul><ul><li>fast reads AND updates/inserts </li></ul></ul><ul><ul><li>scalable (data and user volume) </li></ul></ul><ul><ul><li>generic (you have only ONE Data System in-house) </li></ul></ul><ul><ul><li>extensible </li></ul></ul><ul><ul><li>allows ad-hoc analysis </li></ul></ul><ul><ul><li>low-cost maintenance </li></ul></ul><ul><ul><li>debuggable, as every application should be  </li></ul></ul>(Nathan Marz)
  23. 23. Challenges in data-centric architectures <ul><ul><li>to store all data > schema-flexibility, scalability & robustness </li></ul></ul><ul><ul><li>to process all data > map-reduce + (near-)real-time </li></ul></ul><ul><ul><ul><li>half of your problem is preparing for search: indexes! </li></ul></ul></ul><ul><ul><ul><li>indexes are your ONLY challenge (besides consistency) </li></ul></ul></ul><ul><ul><ul><li>in a bigdata world, pre-processing saves the day (instead of ad-hoc with RDBMS) </li></ul></ul></ul><ul><ul><li>generate insights / feedback > BI, but less boring </li></ul></ul><ul><ul><li>accommodate feedback > focus on usefulness rather than correctness </li></ul></ul>
  24. 24. Technical Challenges <ul><ul><li>Moore's law? </li></ul></ul><ul><ul><li>Yes ... but there are other factors, like ... </li></ul></ul><ul><ul><ul><li>disk seek times </li></ul></ul></ul><ul><ul><ul><li>network limitations > the fallacies of distributed computing </li></ul></ul></ul><ul><ul><ul><li>... </li></ul></ul></ul><ul><ul><li>Forced to think outside the (one) box </li></ul></ul><ul><ul><ul><li>automatically buys you scale and SPOF-robustness </li></ul></ul></ul><ul><ul><ul><li>... at the cost of complexity </li></ul></ul></ul>
  25. 25. <ul><li>Beyond the node  </li></ul><ul><li>... programming challenge </li></ul>
  26. 26. <ul><li>One intrinsic feature ...  </li></ul>Imminent failure is.  Assistance you will be needing!
  27. 27. The fault-tolerant plumbing <ul><ul><li>Researchers and innovators (Google, Amazon, ...) bring the ideas: </li></ul></ul><ul><ul><ul><li>DFS </li></ul></ul></ul><ul><ul><ul><li>MR </li></ul></ul></ul><ul><ul><ul><li>Bigtable and Megastore </li></ul></ul></ul><ul><ul><ul><li>Dynamo </li></ul></ul></ul><ul><ul><li>FOSS brings the implementations: </li></ul></ul><ul><ul><ul><li>HDFS </li></ul></ul></ul><ul><ul><ul><li>Hadoop MR </li></ul></ul></ul><ul><ul><ul><li>HBase </li></ul></ul></ul><ul><ul><ul><li>Cassandra </li></ul></ul></ul><ul><ul><ul><li>Mahout (AI and learning) </li></ul></ul></ul><ul><ul><ul><li>... </li></ul></ul></ul>
  28. 28. <ul><li>To take-on   </li></ul><ul><li>       ...  the grid </li></ul>
  29. 29. HDFS <ul><ul><li>Large files are split into parts </li></ul></ul><ul><ul><li>File parts are moved into the cluster </li></ul></ul><ul><ul><li>Fault-tolerant through replication across nodes while being rack-aware </li></ul></ul><ul><ul><li>... plus the bookkeeping via the NameNode </li></ul></ul>
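The HDFS ideas on this slide (split, distribute, replicate across racks, keep the bookkeeping in one place) can be sketched as a toy model. This is not HDFS code and uses no Hadoop API: `split_into_blocks`, `place_replicas`, the rack and node names, and the block size are all hypothetical. The placement rule (one replica on one rack, two on another) is a simplification loosely modeled on rack-aware placement, and it assumes at least two racks with at least two nodes each.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size parts; the last part may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, racks: dict[str, list[str]]) -> dict[int, list[str]]:
    """Toy rack-aware placement: one replica on one rack, two on another.

    Assumes at least two racks and at least two nodes per rack.
    """
    rack_names = sorted(racks)
    block_map = {}
    for block_id in range(num_blocks):
        first_rack = rack_names[block_id % len(rack_names)]
        other_rack = rack_names[(block_id + 1) % len(rack_names)]
        first = racks[first_rack][block_id % len(racks[first_rack])]
        others = racks[other_rack]
        second = others[block_id % len(others)]
        third = others[(block_id + 1) % len(others)]
        block_map[block_id] = [first, second, third]
    return block_map


# Hypothetical two-rack cluster and a 250-byte "large" file.
racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"]}
blocks = split_into_blocks(b"x" * 250, block_size=100)

# The block-to-node map plays the role of the NameNode's bookkeeping.
block_map = place_replicas(len(blocks), racks)
for block_id, nodes in block_map.items():
    print(block_id, len(blocks[block_id]), nodes)
```

Losing any single node, or even a whole rack, still leaves at least one copy of every block in this model, which is the point of replicating across racks.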
  30. 30. MapReduce <ul><ul><li>Move algorithms close to the data </li></ul></ul><ul><ul><li>by structuring them for parallel execution so that each task works on a part of the data </li></ul></ul><ul><ul><li>power of simplicity  via map and reduce phase </li></ul></ul>
  31. 31. MapReduce
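To make the map and reduce phases concrete, here is a minimal single-process sketch of the classic word-count example. It only illustrates the programming model; it is not Hadoop code, and `map_phase`, `shuffle`, `reduce_phase` and the sample input are hypothetical names chosen for this sketch. In a real cluster each part would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(part: str):
    """Map: emit a (word, 1) pair for every word in one part of the data."""
    for word in part.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Each string stands in for one part of a larger file.
parts = ["big data is big", "data is an opportunity"]

mapped = [pair for part in parts for pair in map_phase(part)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` looks at only one part and each call to `reduce_phase` at only one key, both phases can be spread over many machines without the tasks needing to talk to each other.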
  32. 32. WORM you say? <ul><ul><li>Write Once Read Many due to </li></ul></ul><ul><ul><ul><li>Clustered </li></ul></ul></ul><ul><ul><ul><li>File part size </li></ul></ul></ul><ul><ul><ul><li>Data size </li></ul></ul></ul><ul><ul><li>What about  fast reads AND updates/inserts  as Essential Properties of a data system ? </li></ul></ul>
  33. 33. Enter the Realm of noSQL <ul><ul><li>Old technology ... older than RDBMS </li></ul></ul><ul><ul><ul><li>Sequential files </li></ul></ul></ul><ul><ul><ul><li>Hierarchical database </li></ul></ul></ul><ul><ul><ul><li>Network database </li></ul></ul></ul><ul><ul><ul><li>... </li></ul></ul></ul><ul><ul><li>noSQL is not only scalable stores </li></ul></ul><ul><ul><ul><li>Graph databases (Neo4j, AllegroGraph, ...) </li></ul></ul></ul><ul><ul><ul><li>Document stores (Lotus Domino!, MongoDB, Jackrabbit, ...) </li></ul></ul></ul><ul><ul><ul><li>Memory caches (memcacheDB) </li></ul></ul></ul><ul><ul><li>But also scalable datastores </li></ul></ul><ul><ul><ul><li>Some classification criteria ... </li></ul></ul></ul>
  34. 34. CAP Theorem <ul><li>The CAP theorem, aka Brewer's theorem, states that it is impossible for any distributed (!) data store to provide all three of the following guarantees: </li></ul><ul><ul><li>Consistency: a consistent system operates fully, or not at all. In a cluster, this means that all nodes see and provide the same data at all times. </li></ul></ul><ul><ul><li>Availability: node failures do not prevent survivors from continuing to operate. </li></ul></ul><ul><ul><li>Partition tolerance: the system continues to operate despite arbitrary message loss. </li></ul></ul><ul><li>According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three. </li></ul>
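The trade-off can be illustrated with a deliberately tiny model: two replicas and a switch that simulates a network partition. This is a sketch under stated assumptions, not how any real store works; `TwoNodeStore`, its `mode` flag and the `partitioned` attribute are hypothetical. In "CP" mode the store refuses writes during a partition (it stays consistent but is not available); in "AP" mode it accepts them locally (it stays available but the replicas diverge).

```python
class TwoNodeStore:
    """Toy two-replica key-value store that is either 'CP' or 'AP'."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True simulates lost messages between nodes

    def write(self, node: int, key: str, value) -> bool:
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
            return True
        if self.mode == "CP":
            # Refuse the write: consistency is kept, availability is lost.
            return False
        # AP: accept the write on the reachable node only; replicas diverge.
        self.replicas[node][key] = value
        return True

    def read(self, node: int, key: str):
        return self.replicas[node].get(key)


cp = TwoNodeStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(0, "x", 2)

ap = TwoNodeStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(0, "x", 2)

print(cp_accepted, cp.read(0, "x"), cp.read(1, "x"))
print(ap_accepted, ap.read(0, "x"), ap.read(1, "x"))
```

A real AP store would also need a way to reconcile the diverged replicas once the partition heals; that step is left out of this sketch.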
  35. 35. Types of store
  36. 36. Is your data BIG enough ?
  37. 37. Classification <ul><ul><li>CAP </li></ul></ul><ul><ul><li>Type </li></ul></ul><ul><ul><li>Programming language (server) </li></ul></ul><ul><ul><li>License </li></ul></ul><ul><ul><li>API/Protocol </li></ul></ul>Source: Lorenzo Alberton, NoSQL Databases: Why, what and when. NoSQL Databases Demystified. PHP UK Conference, 25th February 2011
  38. 38. Just storage? <ul><ul><li>Lack of common features </li></ul></ul><ul><ul><ul><li>Secondary indexes? </li></ul></ul></ul><ul><ul><ul><li>Joins? </li></ul></ul></ul><ul><ul><ul><li>Consistency? </li></ul></ul></ul><ul><ul><ul><li>Transactions? </li></ul></ul></ul><ul><ul><ul><li>Query language? </li></ul></ul></ul><ul><ul><ul><li>Search? </li></ul></ul></ul><ul><ul><li>Need for add-ons  and  extra solutions </li></ul></ul><ul><ul><li>Market opportunity ? </li></ul></ul>
  39. 39. Common tools <ul><ul><li>Infrastructure management and provisioning </li></ul></ul><ul><ul><ul><li>Nice tools exist from the DevOps movement </li></ul></ul></ul><ul><ul><ul><li>like Chef and Puppet </li></ul></ul></ul><ul><ul><li>Grid monitoring tools </li></ul></ul><ul><ul><ul><li>to maintain operational insight </li></ul></ul></ul><ul><ul><ul><li>like Nagios, Ganglia, Hyperic and Zenoss </li></ul></ul></ul><ul><ul><li>Development support tools are limited or even non-existent </li></ul></ul><ul><ul><ul><li>DB structure tools? </li></ul></ul></ul><ul><ul><ul><li>DB model browsers? </li></ul></ul></ul><ul><ul><ul><li>Schema editors? </li></ul></ul></ul><ul><ul><ul><li>MR tools? </li></ul></ul></ul><ul><ul><ul><li>ETL tools? </li></ul></ul></ul>
  40. 40. Niche players <ul><ul><li>Tools </li></ul></ul><ul><ul><ul><li>Operational tools </li></ul></ul></ul><ul><ul><ul><ul><li>Management and monitoring tools for datacenters and clusters </li></ul></ul></ul></ul><ul><ul><ul><ul><li>like Chef, Puppet and Ganglia </li></ul></ul></ul></ul><ul><ul><li>Integrated solutions </li></ul></ul><ul><ul><ul><li>Analytics platforms </li></ul></ul></ul><ul><ul><ul><ul><li>solutions for data warehousing, data analytics and business intelligence </li></ul></ul></ul></ul><ul><ul><ul><ul><li>like Netezza, Greenplum and Vertica </li></ul></ul></ul></ul><ul><ul><ul><li>Universal data platforms </li></ul></ul></ul><ul><ul><ul><ul><li>aiming to be the one-stop-shop for all data storage needs within your enterprise > bigdata made easy </li></ul></ul></ul></ul><ul><ul><ul><ul><li>like Lily and Spire </li></ul></ul></ul></ul>
  41. 41. Enterprise players <ul><li>Establishment is entering BigData mainly through acquisition </li></ul><ul><ul><li>First wave: </li></ul></ul><ul><ul><ul><li>IBM < Cognos </li></ul></ul></ul><ul><ul><ul><li>Oracle < Hyperion </li></ul></ul></ul><ul><ul><ul><li>SAP < Business Objects </li></ul></ul></ul><ul><ul><li>Second wave: </li></ul></ul><ul><ul><ul><li>HP < Vertica </li></ul></ul></ul><ul><ul><ul><li>EMC < Greenplum </li></ul></ul></ul><ul><ul><ul><li>Teradata < Aster </li></ul></ul></ul>
  42. 42. Enterprise players <ul><li>Native players focus on simplification via packaging, deployment, setup, monitoring and tooling </li></ul><ul><ul><li>Cloudera (Hadoop as mainstream Linux packages ++) </li></ul></ul><ul><li>or by touching enterprise hot-spots </li></ul><ul><ul><li>Datameer (analytics & BI) </li></ul></ul><ul><ul><li>CR-X (ETL) </li></ul></ul>
  43. 43. Parting Thoughts A couple of ideas we want you to remember
  44. 44. Platonic architecture of a Data System Speed Layer Batch Layer
  45. 45. Batch Layer <ul><ul><li>Arbitrary computations </li></ul></ul><ul><ul><li>Horizontally scalable </li></ul></ul><ul><ul><li>Higher latency </li></ul></ul><ul><ul><li>Map/Reduce </li></ul></ul><ul><ul><li>Append-only, master copy of dataset </li></ul></ul><ul><ul><li>Higher-latency updates ! </li></ul></ul>
  46. 46. Speed Layer <ul><ul><li>Compensates for high latency updates of batch layer </li></ul></ul><ul><ul><li>Incremental algorithms </li></ul></ul><ul><ul><li>Stores hours of data, rather than years </li></ul></ul><ul><ul><li>'Eventual accuracy' </li></ul></ul>
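How the batch and speed layers fit together can be sketched with a page-view counter. This is a toy illustration under simplifying assumptions, not the architecture's reference implementation: `PageViewSystem` and its method names are hypothetical, and the batch run here is instantaneous, whereas a real batch job takes long enough that events arriving during the run need separate handling.

```python
from collections import Counter


class PageViewSystem:
    """Toy batch + speed layer for counting page views."""

    def __init__(self):
        self.master = []             # append-only master copy of the dataset
        self.batch_view = Counter()  # recomputed from scratch, high latency
        self.speed_view = Counter()  # updated incrementally, covers recent data only

    def record(self, page: str) -> None:
        """A new event lands in the master dataset and in the speed layer."""
        self.master.append(page)
        self.speed_view[page] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view over ALL data, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def views(self, page: str) -> int:
        """A query merges the batch view with the speed layer's recent counts."""
        return self.batch_view[page] + self.speed_view[page]


system = PageViewSystem()
for page in ["home", "home", "about"]:
    system.record(page)
before_batch = system.views("home")   # answered by the speed layer alone

system.run_batch()
system.record("home")
after_batch = system.views("home")    # batch view plus one recent event

print(before_batch, after_batch)
```

Since the batch layer always recomputes from the append-only master copy, a bug in the incremental speed-layer logic is corrected by the next batch run, which is one way to read 'eventual accuracy'.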
  47. 47. Event Driven Architecture
  48. 48. “Top-performing organizations are twice as likely to apply analytics to activities.” (MIT Sloan Management Review, Winter 2011)
  49. 49. From analytics to recommendations <ul><ul><li>shorten distance between transactional and analytical aspects </li></ul></ul><ul><ul><li>(near) real-time analytics > Insights </li></ul></ul><ul><ul><ul><li>"people are buying this now" </li></ul></ul></ul><ul><ul><li>algorithmic feedback > data-backed feedback </li></ul></ul><ul><ul><ul><li>based on Insights > recommendations </li></ul></ul></ul><ul><ul><ul><li>shorten feedback cycle </li></ul></ul></ul><ul><ul><li>store data + metadata + attention data in one data system </li></ul></ul><ul><ul><ul><li>basis for analytical queries </li></ul></ul></ul><ul><ul><ul><li>single point of growth </li></ul></ul></ul><ul><ul><ul><li>incremental insights </li></ul></ul></ul>
  50. 50. Challenges ahead <ul><ul><li>solving real-time aspects </li></ul></ul><ul><ul><ul><li>in a distributed environment </li></ul></ul></ul><ul><ul><li>terabyte-sized snapshots & backups? </li></ul></ul><ul><ul><ul><li>build resilience into data architecture (i.e. against operator malfunction) </li></ul></ul></ul><ul><ul><li>presenting bigdata insights: new UIs </li></ul></ul><ul><ul><ul><li>less static / report-driven </li></ul></ul></ul><ul><ul><ul><li>more 'Minority Report' </li></ul></ul></ul><ul><ul><ul><li>from near-real-time data to real-time decision systems </li></ul></ul></ul>
  53. 53. Cool stuff to think about <ul><li>'Exotic' BigData use cases </li></ul><ul><ul><li>fraud or anomaly detection </li></ul></ul><ul><ul><li>genomics > large-scale needle/haystack searching </li></ul></ul><ul><ul><li>deep personalization </li></ul></ul><ul><ul><ul><li>social magazines based on  anything  you read </li></ul></ul></ul><ul><ul><ul><li>on where you are </li></ul></ul></ul><ul><ul><ul><li>on  who you know </li></ul></ul></ul>
  54. 54. Zite - interest-based e-magazine (iPad)
  55. 55. social second screen app
  56. 56. social second screen app
  57. 57. FlipBoard: everyone's excuse to buy an iPad
  58. 58. Announcement
  59. 59. <ul><li>online and real-life community for BigData practitioners </li></ul><ul><ul><li>knowledge & experience sharing </li></ul></ul><ul><ul><li>networking </li></ul></ul>
  60. 60. Conclusions <ul><ul><li>it's the new old new, all over again </li></ul></ul><ul><ul><ul><li>low-level knowledge required for making the right decision </li></ul></ul></ul><ul><ul><ul><li>... but you can be part of it (open source) </li></ul></ul></ul><ul><ul><li>4 trends ( </li></ul></ul><ul><ul><ul><li>real-time is rapidly becoming really important </li></ul></ul></ul><ul><ul><ul><li>agility with complex/dynamic information </li></ul></ul></ul><ul><ul><ul><li>predictive analysis </li></ul></ul></ul><ul><ul><ul><li>one store for operational data + analytics </li></ul></ul></ul>
  61. 61. Thanks ! Wim & Steven.