Big Data, NoSQL, NewSQL & The Future of Data Management

It is an exciting and interesting time to be involved in data. More change has occurred in database management in the last 18 months than in the previous 18 years. New technologies such as NoSQL and Hadoop, together with radical redesigns of existing technologies like NewSQL, will dramatically change how we manage data going forward.

These technologies bring with them possibilities both in the scale of data retained and in how this data can be utilized as an information asset. The ability to leverage Big Data to drive deep insights will become a key competitive advantage for many organisations in the future.

Join Tony Bain as he takes us through the high-level drivers for the changes in technology, their relevance to the enterprise, and an overview of the possibilities a Big Data strategy can start to unlock.


    1. Big Data, NoSQL, NewSQL & The Future of Data Management. Tony Bain, RockSolid SQL, www.rocksolidsql.com
    2. Introduction
    3. About this Webinar (Tech Level 100)
       • Big Data: What is it? Why does it matter? How does it impact the enterprise?
       • Data Science
       • NoSQL: Key/Value Stores, Document Stores, Graph Databases
       • NewSQL: In-Memory Transaction Processing, MPP
       • Hadoop
       • Summary
    4. Big Data
    5. What is Big Data?
       “Big data[1] are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage,[2] search, sharing, analytics,[3] and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to "spot business trends, prevent diseases, combat crime."[4] Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.[5]”
       • Variety – Big data extends beyond structured data, including unstructured data of all varieties
       • Velocity – Rate of acquisition and desired rate of consumption
       • Volume – Terabytes and even petabytes of information
    6. My Definition
       Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.
       While Big Data as a term seems to refer to volume, this isn’t the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands on that data. And those usage demands are nearly always tied to timeliness.
       Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative and constantly changing; right now this is somewhere along the path towards the end goal, which is the ability to handle an unlimited volume of data, processing all requests in real time.
       Jan 2010, http://blog.tonybain.com/tony_bain/2010/01/what-is-big-data.html
    7. For the purpose of this talk: Big Data = volume and variety of data that is difficult to manage using traditional data management technology
    8. Where does big data come from?
       Big Data is not an outcome of what we consider traditional user “transactions”. “VisaNet authorizes, clears and settles an average of 130 million transactions per day in 200 countries and territories.” (http://corporate.visa.com/about-visa/technology/transaction-processing.shtml)
       Instead, Big Data is mostly sourced from “meta data” about our existence:
       • Sensors used to gather telemetry information
       • Social media interactions & relationships
       • Pictures and videos posted online
       • Cell phone GPS signals
       • Web page viewing habits
       • Network, computer and equipment logs
       • Device outputs
       • Online gaming
    9. Where does big data come from?
       Instead, Big Data is generally meta data.
       “Since we only track the data that the games explicitly want to track, and it’s all structured data… We just write the 5TB of structured information that we need, and this is a scale that high-end SQL databases can definitely handle.”
       “eBay processes 50 petabytes of information each day while adding 40 terabytes of data each day as millions buy and sell items across 50,000 categories. Over 5,000 business users and analysts turn over a terabyte of data every eight seconds. Raising parallel efficiency, the effectiveness of distributing large amounts of workload over pools and grids of servers equals millions in operating expense savings.”
    10. What about the Enterprise?
        The median data warehouse has remained consistent for a number of years (6.6GB). Is the pending Big Data explosion hype or a reality?
        The enterprise is focused on ROI:
        • We do store large amounts of historical data, as this serves a perceived need and the overhead is low
        • We have not yet started capturing large amounts of meta data, as demonstrating return on this investment to date would have been difficult
        • More data will be captured as we gain the ability to derive value from that data
        • More data will be captured as we understand the value of auxiliary data
        The skills gap:
        • Big data experts are few and far between
        • Technologies are new and raw
        • The combination of skills is new
        However, this is expected to change in the next 2-5 years.
    11. Gartner Hype Cycle, 2011. http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
    12. Why Big Data?
        Competitive advantage is becoming highly time sensitive:
        • Consumers are fickle, change is easy, word of mouth is rapid and powerful
        • The rate at which we can understand and respond to changes in customer sentiment, product needs, satisfaction, engagement and risk profiles = competitive strength
        • Reinvention is constant
        So how do we do this? With Data Science!
    13. What is Data Science?
        “I keep saying that the sexy job in the next 10 years will be statisticians.” (Hal Varian, chief economist at Google)
        “Data is the next Intel Inside.” (Tim O’Reilly, O’Reilly)
        “We’re rapidly entering a world where everything can be monitored and measured, but the big problem is going to be the ability of humans to use, analyze and make sense of the data.” (Erik Brynjolfsson, MIT)
        “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what’s possible.” (Michael Driscoll, Metamarkets)
    14. What is Data Science?
        Historical business analytics:
        • Effectively reporting what has happened or what is happening
        • Using pre-determined models for prediction
        • Known data models, known questions, known outcomes
        The future of data science:
        • Discovering new information from historical data
        • Often involves combining public & private data sets
        • Using machine learning to create unique prediction models
        • Diverse and complex data models
        • Questions may not be known; outcomes may not be expected
        • Visualizing results
    15. Public Data Sets
        There is a phenomenal amount of relevant publicly available data sets:
        • Geographical, climatic, demographic, knowledge bases
        • Historical, political, medical, sports, social, music
        Infochimps (15,000 data sets):
        • NYSE Daily 1970-2010 Open, Close, High, Low and Volume
        • Global Daily Weather Data from the National Climate Data Center (NCDC)
        • Average Hours Worked Per Day by Employed Persons: 2005
        • Family Net Worth -- Mean and Median Net Worth in Constant Dollars
        Also: Factual, Data.gov
    16. Telecommunications Churn
        Analyzed billions of calls from millions of customers, with more than 2,000 call history variables such as time to end of contract, subscriber tenure, number of customer care calls, etc.
        Adding social networking variables improved churn prediction significantly. The “social networking” variables were identified by observing who each customer spoke with on the phone the most frequently. If someone in your caller network churned, you were then ~700% more likely to churn as well.
    17. Visualizations: http://www.tableausoftware.com/public/gallery/taleof100
    18. Visualizations: http://flowingdata.com/2011/10/12/where-people-dont-use-facebook/#more-19315
    19. Big Data Technologies
    20. Big Data & the RDBMS
        • Volume/velocity is too great for a single system to manage; Big Data often requires distribution over multiple hosts
        • The general purpose relational database is difficult to scale out
        • This has been a computer science challenge for many years: giving up nothing and scaling out has proven very difficult
    21. Relaxing Consistency
        Eventual consistency: “Given enough time without changes, everything will eventually be consistent.”
        • Many big data technologies relax aspects of CAP to achieve scale, performance & availability
        • Not necessarily suitable for all requirements
        • Not necessarily the only way to achieve scale
        • Much debate: CAP, BASE, tunable consistency
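The tunable consistency mentioned above is often expressed as quorum settings. A minimal sketch (plain Python, not any specific product's API; all class and method names here are hypothetical): with N replicas, choosing read quorum R and write quorum W such that R + W > N guarantees that every read overlaps the latest write, while smaller R or W trades consistency for latency and availability.

```python
# Sketch of quorum-based tunable consistency (hypothetical, illustrative only).
# With N replicas, R + W > N guarantees a read quorum always overlaps the
# most recent write quorum, so reads observe the latest version.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        cur = self.store.get(key, (0, None))
        if version > cur[0]:          # keep only the newest version
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

class QuorumCluster:
    def __init__(self, n=3, r=2, w=2):
        assert r + w > n, "R + W must exceed N for strongly consistent reads"
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w
        self.version = 0

    def put(self, key, value):
        self.version += 1
        # Write to only W replicas; the rest would converge later
        # via anti-entropy / read repair in a real system.
        for rep in self.replicas[:self.w]:
            rep.write(key, self.version, value)

    def get(self, key):
        # Read R replicas and keep the highest version seen.
        reads = [rep.read(key) for rep in self.replicas[-self.r:]]
        return max(reads, key=lambda t: t[0])[1]

cluster = QuorumCluster(n=3, r=2, w=2)
cluster.put("user:1", "alice")
print(cluster.get("user:1"))  # quorum overlap ensures "alice"
```

Note the read set deliberately includes a replica the write skipped; the version comparison is what resolves the stale copy.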
    22. NoSQL
    23. What is NoSQL?
        The NoSQL term was coined in 1998 but remained largely unused until 2009. Its resurgence was the result of key papers from Amazon & Google and the resulting projects at the likes of Rackspace and Facebook:
        • Google Bigtable (http://labs.google.com/papers/bigtable.html)
        • Amazon Dynamo (www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
        NoSQL started as a design philosophy of simplifying data storage technologies to the point where massive scale could be achieved. It is now broader, including a range of data management technologies which are not based on the relational model. NoSQL data stores tend to fall into broad categories:
        • Distributed key/value stores
        • Document stores
        • Graph databases
        A little ironic, as many “NoSQL” data stores have evolved support for SQL-like queries.
    24. NoSQL (www.google.com/trends)
    25. RDBMS (www.google.com/trends)
    26. NoSQL: Distributed Key/Value Store / Column Family
        • Database system designed to scale; distributed over many nodes
        • Usually focused on transactional performance; usually unsuitable for OLAP
        • Limited querying capabilities; primarily single record lookup by “key”
        • Usually flexible schema: Row 1 may have a different schema to Row 2
        • Usually no joins/normalization, no ordering, no aggregation
        Examples: Cassandra, Oracle Big Data Appliance, BerkeleyDB, Voldemort, Hypertable, Redis, Riak
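The trade-off on this slide can be sketched in a few lines of plain Python (illustrative only, not any real store's API): hashing each key to a partition makes single-record get/put trivially scalable, while joins, ordering, and aggregation have no natural home.

```python
import hashlib

# Sketch of a hash-partitioned key/value store (hypothetical, in-process).
# Each key maps to exactly one shard, so single-record lookup by key is
# cheap, but cross-key operations (joins, ordering, aggregation) are not
# supported -- the trade-off described on the slide.

class ShardedKVStore:
    def __init__(self, num_shards=4):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always lands on the same shard.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value   # value is an opaque blob

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedKVStore()
store.put("session:42", {"user": "bob", "cart": ["book"]})
print(store.get("session:42"))  # {'user': 'bob', 'cart': ['book']}
```

Flexible schema falls out for free: the stored value is opaque, so two keys can hold completely different structures.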
    27. NoSQL: Document Model
        • Records are considered documents, with a flexible or varying schema depending on the attributes of that document
        • Schemas are usually based on an encoding such as JSON, BSON or XML
        • Document orientated data stores generally have more functionality than k/v stores, so sacrifice some performance
        • Document orientated data stores tend to fit more naturally with OO paradigms than RDBMS, reducing plumbing code significantly
        • Scale out is generally possible using sharding and replication
        Examples: MongoDB, CouchDB, MarkLogic
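A toy illustration of the document model (plain Python dictionaries standing in for JSON documents; `insert` and `find` are hypothetical helpers, not a real MongoDB or CouchDB client): two documents in the same collection can carry different attributes, and queries match on whatever attributes a document happens to have.

```python
# Sketch of the document model: records are JSON-like documents with
# varying schemas, queried by attribute/value match (illustrative only).

collection = []

def insert(doc):
    collection.append(doc)

def find(**criteria):
    # Return documents whose attributes match every given criterion;
    # documents lacking an attribute simply don't match it.
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

# Same collection, different schemas per document.
insert({"_id": 1, "type": "book", "title": "Dune", "pages": 412})
insert({"_id": 2, "type": "film", "title": "Dune", "runtime_min": 155})

print(find(title="Dune", type="film"))
```

This is also why such stores fit OO code naturally: the document maps one-to-one onto an in-memory object, with no object-relational plumbing in between.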
    28. NoSQL: Graph
        • Graph databases are used to store entities and the relationships between entities: nodes, properties and edges
        • Examples of graph database strengths: shortest distance between entities, social network analysis
        • If Bob knows Bill and Mary knows Bill, maybe Bob knows Mary
        Examples: DEX, FlockDB, InfiniteGraph, Neo4j
    29. NewSQL
    30. What is NewSQL?
        • Term coined recently by Michael Stonebraker
        • Not throwing out the baby with the bathwater: SQL and the relational model are well known by almost every developer
        • Arguments that the relational model itself is not a limitation on scale; it is the physical implementation of the model which has limitations
        • Retaining key aspects of the RDBMS while removing some of the general purpose nature
        • By specializing, the challenges associated with scaling can be more easily overcome
    31. NewSQL: In-Memory Transaction Processing
        • In-memory is becoming a reality for transaction processing; systems are appearing with 1TB RAM per node
        • Often designed only to process short duration transactions
        • Distributed K-safety protects from single node failure
        • Traditional command logging/snapshotting can provide protection from complete system failure (e.g. a system bug), but comes with performance costs
        Examples: VoltDB, SAP Hana
    32. NewSQL: MPP
        • Focus on distributing data and load across multiple hosts for the purpose of analytical query performance
        • Often use brute force access methods (i.e. no user indexing)
        • Often focus on running low concurrency, high impact queries
        • Often columnar or hybrid storage: improved compression, and improved query performance across wide tables
        Examples: Microsoft Parallel Data Warehouse, HP Vertica, EMC Greenplum, Teradata / Aster Data, IBM Netezza
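Why columnar storage helps both compression and wide-table scans can be shown in miniature (plain Python, illustrative only; the sample rows and helper names are made up): pivoting rows into columns means a query touches only the columns it needs, and the repeated values within a column compress well with simple run-length encoding.

```python
# Sketch of the columnar-storage advantage claimed for MPP stores:
# column pruning for scans, run-length encoding for compression.

rows = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 20},
    {"region": "US", "product": "A", "sales": 30},
]

# Pivot the row store into a column store.
columns = {k: [r[k] for r in rows] for k in rows[0]}

def rle(values):
    # Run-length encode a column: repeated adjacent values collapse.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(rle(columns["region"]))  # [['EU', 2], ['US', 1]]

# An aggregate scan reads only the two columns the query needs,
# regardless of how wide the table is.
total_eu = sum(s for reg, s in zip(columns["region"], columns["sales"])
               if reg == "EU")
print(total_eu)  # 30
```

The same pruning is what makes brute-force scans viable without user indexing: each node streams only a few compressed columns rather than whole rows.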
    33. Hadoop
    34. What is Hadoop?
        Apache Hadoop is an open source project inspired by Google’s Map/Reduce and GFS papers:
        • http://labs.google.com/papers/mapreduce.html
        • http://labs.google.com/papers/gfs.html
        Hadoop consists of:
        • A distributed file system (HDFS)
        • The Map/Reduce engine
        Yahoo has been the largest contributor to the project, although many original contributors now work for Cloudera or Hortonworks.
        Hadoop is not a database:
        • Data lives in the file system in raw format
        • Jobs process data directly from the file system using Map/Reduce
        • Data can be directly accessed without ETL
    35. The Scale of Hadoop
        Hadoop scales to very large clusters, capable of supporting very large workloads:
        • Yahoo has ~20 Hadoop clusters totalling ~200PB (the largest is 4,000 nodes)
        • Facebook has 2,000 nodes in a single cluster with 20PB
        • Cloudera has 22 customers in production with over 1PB
        BUT we are not all Facebook or Yahoo. Most clusters are less than 30 nodes; the average cluster size of 200 nodes is boosted by the increasing number of large clusters.
        In the enterprise, Hadoop is being used increasingly for data processing and analytics.
    36. Hadoop: HDFS & Map/Reduce
        HDFS:
        • Distributed file system
        • Responsible for ensuring each block is replicated to multiple nodes
        • Protection from node failure: the larger the cluster, the more likely node failure is
        Map/Reduce:
        • Programming paradigm
        • Takes seemingly complex tasks and breaks them into a series of mapping tasks and reduction tasks
        • See also: Elastic Map/Reduce
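The Map/Reduce paradigm above can be shown in miniature with the classic word-count example (plain Python standing in for a Hadoop job; the function names are illustrative, not the Hadoop API): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase folds each group. On a real cluster, map tasks run in parallel near the data blocks HDFS stores, and the framework performs the shuffle between phases.

```python
from itertools import groupby
from operator import itemgetter

# Miniature Map/Reduce word count: map emits (word, 1) pairs,
# shuffle sorts and groups by key, reduce sums each group.

def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

docs = ["Big Data big clusters", "big data small clusters"]

# Shuffle: gather all mapper output, then sort/group by key.
pairs = sorted(p for doc in docs for p in map_phase(doc))
result = dict(reduce_phase(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=itemgetter(0)))

print(result["big"])   # 3
print(result["data"])  # 2
```

The value of the model is that `map_phase` and `reduce_phase` contain no distribution logic at all, which is what lets the same job run unchanged on 1 node or 4,000.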
    37. How is Hadoop used in the Enterprise?
        FSI:
        • Customer risk analysis
        • Surveillance and fraud detection
        • Central data repository
        • Personalization and asset management
        • Market risk modeling
        • Trade performance analytics
        Retail & Manufacturing:
        • Customer churn
        • Brand and sentiment analysis
        • Point of sales
        • Pricing models
        • Customer loyalty
        • Targeted offers
        Science & Energy:
        • Genomics
        • Biodiversity indexing
        • Seismic data
        • Utilities and power grid
        • Smart meters
        • Network failures
    38. Summary
    39. Where does this leave the RDBMS?
        Absolutely, the RDBMS will remain the single most important data management technology for the foreseeable future:
        • Its “general purpose” nature still makes it the suitable technology of choice for 95% of data management needs
        • What we have discussed in this presentation is the 5% of requirements with volume/velocity issues not suitably managed by the RDBMS
        These technologies are on a convergence path:
        • NoSQL, NewSQL, Hadoop, MPP were all created out of a need that wasn’t being met
        • But we don’t “want” many different technologies; we like one “general purpose” tool
        • Bringing the technologies together through acquisition, innovation and integration is the role of the major vendors
        This is already happening.
    40. The major vendors are following
        Oracle:
        • Exadata
        • Big Data Appliance (http://www.oracle.com/us/corporate/features/feature-obda-498724.html)
        • Oracle Tools & Loader for Hadoop (http://www.oracle.com/us/corporate/features/feature-oracle-loader-for-hadoop-505115.html)
        • Exalytics Appliance (http://gigaom.com/2011/10/02/oracle-exalytics-attacks-big-data-analytics/)
        Microsoft:
        • Parallel Data Warehouse (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx)
        • SQL Server connector for Hadoop (http://www.microsoft.com/download/en/details.aspx?id=27584)
        • Microsoft Hadoop (2012) (http://www.zdnet.com/blog/microsoft/microsoft-to-develop-hadoop-distributions-for-windows-server-and-azure/10958)
        • Microsoft Daytona (http://research.microsoft.com/en-us/projects/daytona/)
    41. How does an enterprise prepare?
        • Understand success metrics and where competitive advantage is derived: customer retention, upsell, revenue growth
        • This is business driven, not technology driven
        • IT can, however, ensure they are educated on new ways to fulfill business requirements
        • Even if you remain aligned with major vendors, diversity is imminent
    42. Q&A
