Big Data. Steven Noels & Wim Van Leuven. SAI, 7 April 2011
Hello. Steven Noels: Outerthought, scalable content apps; Lily repository, smart data, at scale, made easy; HBase / SOLR; NoSQLSummer; @stevenn [email_address]. Wim Van Leuven: lives for software, applications and development group dynamics; Cloud and BigData enthusiast; wannabe entrepreneur using all of the above; @wimvanleuven [email_address]
Agenda Big Data and its Challenges Data Systems Market classification Parting thoughts Announcement
Houston, we have a problem. IDC says the Digital Universe will be 35 Zettabytes by 2020. 1 Zettabyte = 1,000,000,000,000,000,000,000 bytes, or 1 billion terabytes.
We're drowning in a sea of data.
The fire hose of social and attention data.
We regard content as a cost ...
... but data is an opportunity!
Think about it ...
advertisements
recommendations
profile data
anything that sells
The future is for data nerds.
Houston, we have a problem. Data volume vs. Moore's Law. Enterprise (tools) focus on the 20% of (structured, transactional) data. Pure-play data ventures (FB, LinkedIn, Google) focus on monetizing the 80% of organic, semi-structured, variable, time-sensitive, social ... data.
The incumbents view
Issues with incumbents. Batch-oriented: turn-around times of hours rather than minutes; batches cannot be interrupted or reconfigured on the fly. Schema management required in multiple places. Lots of data being shuffled around. Data duplication > where's the master copy? $$$
A different approach: (big) data systems real time !
What is a Data System? (Nathan Marz)
What is a Data System? (Nathan Marz)
DATA SYSTEM IMPLEMENTATION (Nathan Marz)
Essential properties of a Data System: robustness (against machine AND operator malfunction); fast reads AND updates/inserts; scalable (data and user volume); generic (you have only ONE Data System in-house); extensible; allows ad-hoc analysis; low-cost maintenance; debuggable, as every application should be. (Nathan Marz)
Challenges in data-centric architectures: to store all data > schema flexibility, scalability & robustness; to process all data > map-reduce + (near-)real-time. Half of your problem is preparing for search: indexes! Indexes are your ONLY challenge (besides consistency) in a bigdata world; pre-processing saves the day (instead of ad-hoc querying with an RDBMS). Generate insights / feedback > BI, but less boring. Accommodate feedback > focus on usefulness rather than correctness.
Technical Challenges. Moore's law? Yes ... but there are other factors, like ... disk seek times, network limitations > the fallacies of distributed systems ... Forced to think out of the (one) box: that automatically buys you scale and SPOF-robustness ... at the cost of complexity.
Beyond the node ... a programming challenge
One intrinsic feature ...  Imminent failure is.  Assistance you will be needing!
The fault-tolerant plumbing. Researchers and innovators (Google, Amazon, ...) bring the ideas: a DFS, MR, BigTable and MegaStore, Dynamo. FOSS brings the implementations: HDFS, Hadoop MR, HBase, Cassandra, Mahout (AI and learning) ...
To take on ... the grid
HDFS. Large files are split into parts; file-parts are moved into the cluster. Fault-tolerant through replication across nodes while being rack-aware ... plus the bookkeeping via the NameNode.
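To make the splitting and replication idea concrete, here is a toy sketch in Python. The 64 MB block size matches the classic HDFS default, but the node names and the naive round-robin placement are simplified assumptions of ours, not the real NameNode policy (which is rack-aware).

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Illustrative only: real HDFS placement is rack-aware and stateful.
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size: 64 MB
REPLICATION = 3                # classic HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of `file_size` bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_replicas(block_id, nodes, replication=REPLICATION):
    """Naive round-robin placement of a block's replicas across nodes."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
n_blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file -> 4 blocks
placement = {b: place_replicas(b, nodes) for b in range(n_blocks)}
```

Losing any single node still leaves two copies of every block, which is the replication-based fault tolerance the slide refers to.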
MapReduce. Move algorithms close to the data by structuring them for parallel execution, so that each task works on a part of the data. The power of simplicity, via a map and a reduce phase.
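The map and reduce phases above can be sketched in a few lines of Python. This is a single-process toy of the classic word-count example, without the shuffle, distribution and fault tolerance a real Hadoop job provides:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (word, 1) pair for every word in one input split
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big clusters"]
result = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# result == {"big": 2, "data": 1, "clusters": 1}
```

Because each map call only sees its own split, the map phase parallelizes trivially across the cluster; that independence is exactly what the slide means by "each task works on a part of the data".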
MapReduce  
WORM, you say? Write Once, Read Many, due to clustered file-part size and data size. What about fast reads AND updates/inserts as Essential Properties of a data system?
Enter the Realm of noSQL. Old technology ... older than RDBMS: sequential files, hierarchical databases, network databases ... noSQL is not only scalable stores: graph databases (Neo4j, AllegroGraph, ...), document stores (Lotus Domino!, MongoDB, JackRabbit, ...), memory caches (memcacheDB). But also scalable datastores. Some classification criteria ...
CAP Theorem. The CAP theorem, aka Brewer's theorem, states that it is impossible for any distributed (!) data store to provide all three of the following guarantees. Consistency: a consistent system operates fully, or not at all; in a cluster, this means that all nodes see and provide the same data at all times. Availability: node failures do not prevent survivors from continuing to operate. Partition tolerance: the system continues to operate despite arbitrary message loss. According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.
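The CAP trade-off shows up concretely in Dynamo-style quorum settings, where operators tune consistency against availability and latency. A standard rule of thumb from the quorum literature (not from this slide) is that a read is guaranteed to overlap the latest write when R + W > N:

```python
def quorums_overlap(n, r, w):
    """With N replicas, a read quorum of R and a write quorum of W always
    intersect (so every read sees the latest acknowledged write) exactly
    when R + W > N. Below that threshold, reads may return stale data."""
    return r + w > n

# Classic N=3 configurations:
print(quorums_overlap(3, 2, 2))  # True: consistent reads, tolerates 1 node down
print(quorums_overlap(3, 1, 1))  # False: fast and available, eventually consistent
```

Choosing R and W is thus choosing a point on the consistency/availability spectrum the theorem describes, per operation rather than per system.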
Types of store
Is your data BIG enough ?
Classification: CAP type, programming language (server), license, API/protocol. Lorenzo Alberton, "NoSQL Databases: Why, What and When. NoSQL Databases Demystified", PHP UK Conference, 25th February 2011.
Just storage? Lack of common features: secondary indexes? joins? consistency? transactions? query language? search? Need for add-ons and extra solutions. Market opportunity?
Common tools. Infrastructure management and provisioning: nice tools exist from the DevOps movement, like Chef and Puppet. Grid monitoring tools to maintain operational insight, like Nagios, Ganglia, Hyperic and ZenOSS. Development-supporting tools are limited or even non-existent: DB structure tools? DB model browsers? Schema editors? MR tools? ETL tools?
Niche players. Tools: operational tools, management and monitoring tools for datacenters and clusters, like Chef, Puppet and Ganglia. Integrated solutions: analytics platforms, solutions for data warehousing, data analytics and business intelligence, like Netezza, Greenplum and Vertica; and universal data platforms, aiming to be the one-stop shop for all data storage needs within your enterprise > bigdata made easy, like Lily and Spire.
Enterprise players. The establishment is entering BigData mainly through acquisition. First wave: IBM < Cognos, Oracle < Hyperion, SAP < Business Objects. Second wave: HP < Vertica, EMC < Greenplum, Teradata < Aster.
Enterprise players. Native players focus on simplification, via packaging, deployment, setup, monitoring and tooling: Cloudera (Hadoop as mainstream Linux packages ++); or by touching enterprise hot-spots: Datameer (analytics & BI), CR-X (ETL).
Parting Thoughts A couple of ideas we want you to remember
Platonic architecture of a Data System Speed Layer Batch Layer
Batch Layer: arbitrary computations; horizontally scalable; higher latency; Map/Reduce; append-only, master copy of the dataset; higher-latency updates!
Speed Layer: compensates for the high-latency updates of the batch layer; incremental algorithms; stores hours of data, rather than years; 'eventual accuracy'.
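A query against such a system merges both layers: the batch layer serves a precomputed, slightly stale view, and the speed layer fills in whatever arrived since the last batch run. A minimal sketch (the view names and counts are invented for illustration; this is our reading of the architecture, not Marz's code):

```python
# Precomputed by a Map/Reduce job over the append-only master dataset:
batch_view = {"pageviews:2011-04": 120_000}

# Maintained incrementally from the live event stream since the last batch run:
speed_view = {"pageviews:2011-04": 350}

def query(key):
    """Merge the high-latency batch view with the recent speed view."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("pageviews:2011-04"))  # 120350
```

When the next batch run completes, its output absorbs the events the speed layer was covering, and the corresponding speed-view entries can be discarded; that is why the speed layer only ever needs hours of data and only needs 'eventual accuracy'.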
Event Driven Architecture
“Top-performing organizations are twice as likely to apply analytics to activities.” (MIT Sloan Management Review, Winter 2011)
From analytics to recommendations: shorten the distance between transactional and analytical aspects; (near) real-time analytics > insights ("people are buying this now"); algorithmic feedback > data-backed feedback based on insights > recommendations; shorten the feedback cycle. Store data + metadata + attention data in one data system: basis for analytical queries, single point of growth, incremental insights.
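A "people are buying this now" insight can be served by a small incremental structure in the speed layer. The class name and API below are hypothetical, a toy sliding-window counter rather than any real product's code:

```python
from collections import Counter, deque
import time

class TrendingCounter:
    """Toy near-real-time popularity counter: only events from the last
    `window` seconds count, so the answer is always 'now', not 'last month'."""

    def __init__(self, window=3600):
        self.window = window
        self.events = deque()  # (timestamp, item), appended in time order

    def record(self, item, ts=None):
        """Record one purchase/view event (ts defaults to the current time)."""
        self.events.append((ts if ts is not None else time.time(), item))

    def top(self, n=3, now=None):
        """Expire events older than the window, then return the n hottest items."""
        now = now if now is not None else time.time()
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        return Counter(item for _, item in self.events).most_common(n)
```

Because recording is an O(1) append and expiry is amortized O(1), this kind of structure sustains the event rates where a "query the RDBMS and aggregate" approach would fall behind, which is the whole point of shortening the feedback cycle.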
Challenges ahead: solving real-time aspects in a distributed environment; terabyte-sized snapshots & backups? Build resilience into the data architecture (e.g. against operator malfunction). Presenting bigdata insights: new UIs, less static / report-driven, more 'Minority Report'. From near-real-time data to real-time decision systems.
Cool stuff to think about: 'exotic' BigData use cases. Fraud or anomaly detection; genomics > large-scale needle/haystack searching; deep personalization; social magazines based on anything you read, on where you are, on who you know.
Zite - interest-based e-magazine (iPad)
social second screen app
social second screen app
FlipBoard: everyone's excuse to buy an iPad
Announcement
www.bigdata.be online and real-life community for BigData practitioners knowledge & experience sharing networking
Conclusions: it's the new old new, all over again; low-level knowledge is required for making the right decision ... but you can be part of it (open source). 4 trends (http://bit.ly/fsARlQ): real-time is rapidly becoming really important; agility with complex/dynamic information; predictive analysis; one store for operational data + analytics.
Thanks ! Wim & Steven.


Editor's Notes

  • #25 - like disk seek time: how long does it take to read a full 1TB disk compared to the 4MB HD of 20 years ago? - Amazon lets you ship hard disks to load data
  • #26 - The only solution is to divide work beyond one node, bringing us to cluster technology - but ... clusters have their own programming challenges, e.g. workload management, distributed locking and distributed transactions - but clusters especially have one certain property ... Anyone know which?
  • #27 - Failure! Nodes will certainly fail; in large setups there are continuous breakdowns - ... making it even more difficult to build software on the grid - it needs to be fault-tolerant, but also self-orchestrating and self-healing - Assistance you will be needing: standing on the shoulders of giants
  • #28 - Distributed File System for highly available data - MapReduce to bring logic to the data on the nodes and bring back the results - BigTable & Dynamo to add realtime read/write access to big data - with FOSS implementations which allow us to build applications, not the plumbing ...
  • #29 Although the basic functions of those technologies are rather basic/high-level, their implementations hardly are. - They represent the state of the art in operating and distributed systems research: distributed hash tables (DHT), consistent hashing, distributed versioning, vector clocks, quorums, gossip protocols, anti-entropy based recovery, etc. - ... with an industrial/commercial angle: Amazon, Google, Facebook, ... Let's explain some of the basic technologies.
  • #35 The most important classifier for scalable stores CA, AP, CP
  • #36 KV (Amazon Dynamo) Column family (Google BigTable) Document stores (MongoDB) Graph DBs (Neo4J) Please remember scalability, availability and resilience come at a cost
  • #37 RDBMSs scale to reasonable proportions, bringing commodity of technology, tools, knowledge and experience. BD stores are rather uncharted territory, lacking tools, standardized APIs, etc. Cost of hardware vs. cost of learning. Do your homework!
  • #38 ref http://www.slideshare.net/quipo/nosql-databases-why-what-and-when - a good overview of different OSS and commercial implementations with their classification and features, slides 96 ...
  • #39 Basic support for secondary indexes; better to use full-text search tools like Solr or Katta. Implement joins by denormalization, meaning consistency has to be maintained by the application, i.e. DIY. Transactions are mostly non-existent, meaning you have to divide your application to support data statuses and/or implement counter-transactions for failures. No true query language, but map-reduce jobs or higher-level languages like HiveQL and Pig Latin; however, these are not very interactive, rather meant for ETL and reporting. Think data warehouse. Complement with full-text search tools like Solr and Katta, giving added value and also faceted-search possibilities.