Big Data, NoSQL, NewSQL &
The Future of Data Management


                                           Tony Bain
                                       RockSolid SQL
                                www.rocksolidsql.com
Introduction
RockSolid SQL

About this Webinar

 Big Data
     • What is it?
     • Why does it matter?
     • How does it impact the Enterprise?

 Data Science
                                         Tech Level 100
 NoSQL
    • Key/Value Stores
    • Document Stores
    • Graph Databases

 NewSQL
    • In Memory Transaction Processing
     • MPP

 Hadoop

 Summary
Big Data
Big Data Webinar

What is Big Data?

  “Big data[1] are datasets that grow so large that they become awkward to
  work with using on-hand database management tools. Difficulties include
  capture, storage,[2] search, sharing, analytics,[3] and visualizing. This trend
  continues because of the benefits of working with larger and larger
  datasets allowing analysts to "spot business trends, prevent diseases,
  combat crime."[4] Though a moving target, current limits are on the order
  of terabytes, exabytes and zettabytes of data.[5]”



  “Variety– Big data extends beyond structured data, including
  unstructured data of all varieties
  Velocity – Rate of acquisition and desired rate of consumption.
  Volume –Terabytes and even petabytes of information.”
My Definition

 Big Data is the “modern scale” at which we are defining our data usage
  challenges. Big Data begins at the point where we need to seriously start
  thinking about the technologies used to drive our information needs.

  While Big Data as a term seems to refer to volume, this isn’t the case. Many
  existing technologies have little problem physically handling large volumes
  (TB or PB) of data. Instead, the Big Data challenges result from the
  combination of volume and our usage demands on that data. And those
  usage demands are nearly always tied to timeliness.

  Big Data is therefore the push to utilize “modern” volumes of data within
  “modern” timeframes. The exact definitions are of course relative &
  constantly changing; however, right now this is somewhere along the path
  towards the end goal: the ability to handle an unlimited volume of data,
  processing all requests in real time.

Jan 2010

http://blog.tonybain.com/tony_bain/2010/01/what-is-big-data.html
For the purpose of this talk




               Big Data =     Volume and Variety of Data that is difficult to
                              manage using traditional data management
                              technology
Where does big data come from?

 Big Data is not an outcome of what we consider traditional user
  “transactions”.

    “VisaNet authorizes, clears and settles an average of 130 million
    transactions per day in 200 countries and territories.”
(http://corporate.visa.com/about-visa/technology/transaction-processing.shtml)




 Instead Big Data is mostly sourced from “Meta Data” about our
  existence
     • Sensors used to gather telemetry information
     • Social media interactions & relationships
     • Pictures and videos posted online
     • Cell phone GPS signals
     • Web page viewing habits
     • Network, Computer generated, equipment logs
     • Device outputs
     • Online gaming
Where does big data come from?

 Instead, Big Data is generally meta data

  “Since we only track the data that the games explicitly want to track, and
  it’s all structured data …We just write the 5TB of structured information that
  we need, and this is a scale that high-end SQL databases can definitely
  handle.”




  “eBay processes 50 petabytes of information each day while adding 40
  terabytes of data each day as millions buy and sell items across 50,000
  categories. Over 5,000 business users and analysts turn over a terabyte of
  data every eight seconds. Raising parallel efficiency, the effectiveness of
  distributing large amounts of workload over pools and grids of servers
  equals millions in operating expense savings.”
What about the Enterprise?

 The median data warehouse has remained consistent for a number of years (6.6GB)
    • Is the pending Big Data explosion hype or a reality?

 The enterprise is focused on ROI
    • We do store large amounts of historical data as this serves a perceived need & the overhead is low
    • We have not yet started capturing large amounts of meta data as demonstrating return on this
       investment to date would have been difficult
    • More data will be captured as we gain the ability to derive value from that data
    • More data will be captured as we understand the value of auxiliary data

 The skills gap
     • Big data experts are few and far between
    • Technologies are new and raw
    • The combination of skills is new



However this is expected to change in the next 2-5 years
Gartner Hype Cycle, 2011




http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
Why Big Data?



 Competitive Advantage is becoming highly time sensitive
   • Consumers are fickle, change is easy, word of mouth is
     rapid and powerful
   • The rate at which we can understand and respond to
     changes in customer sentiment, product needs,
     satisfaction, engagement, risk profiles = competitive
     strength
   • Reinvention is constant



 So how do we do this?
    • With Data Science!
What is Data Science?

  “I keep saying that the sexy job in the next 10 years will be
  statisticians,”
  Hal Varian, chief economist at Google

  "data is the next Intel Inside.“
  Tim O'Reilly, O'Reilly

  “We’re rapidly entering a world where everything can be
  monitored and measured, but the big problem is going to be the
  ability of humans to use, analyze and make sense of the data.”
  Erik Brynjolfsson, MIT

  “Data science is the civil engineering of data. Its acolytes
  possess a practical knowledge of tools & materials, coupled with
  a theoretical understanding of what's possible.”
  Michael Driscoll , Metamarkets
What is Data Science?

 Historical Business Analytics
     • Effectively reporting what has happened or what is happening
     • Using pre-determined models for prediction
     • Known data models
     • Known questions
     • Known outcomes

 The Future of Data Science
    • Discovering new information from historical data
    • Often involves combining public & private data sets
    • Using machine learning to create unique prediction models
    • Diverse and complex data models
    • Questions may not be known
    • Outcomes may not be expected
    • Visualizing results
Public Data Sets


 There is a phenomenal amount of relevant publically available data sets
    • Geographical, Climatic, Demographic, Knowledge bases
    • Historical, political, medical, sports, social, music

 Infochimps (15,000 data sets)
     • NYSE Daily 1970-2010 Open, Close, High, Low and Volume
     • Global Daily Weather Data from the National Climate Data Center (NCDC)
     • Average Hours Worked Per Day by Employed Persons: 2005
     • Family Net Worth -- Mean and Median Net Worth in Constant

 Factual

 Data.gov
Telecommunications Churn

 Analyzed billions of calls from millions of customers
 More than 2,000 call history variables such as time to end of contract, subscriber tenure,
  number of customer care calls, etc.
 Adding social networking variables significantly improved churn prediction.
 The "social networking" variables were identified by observing who each customer spoke
  with on the phone the most frequently.

 If someone in your caller network churned, you were then ~700% more likely to churn as
  well.
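As a sketch of how such a "social networking" variable might be derived, here is a minimal Python example; the call records, names and helper functions are hypothetical illustrations, not the telco's actual model:

```python
from collections import defaultdict

def build_call_network(call_records):
    """Map each customer to the set of people they speak with on the phone."""
    network = defaultdict(set)
    for caller, callee in call_records:
        network[caller].add(callee)
        network[callee].add(caller)
    return network

def churn_network_flag(customer, network, churned):
    """1 if anyone in the customer's caller network has churned, else 0."""
    return int(any(contact in churned for contact in network[customer]))

calls = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
network = build_call_network(calls)
churned = {"carol"}

print(churn_network_flag("bob", network, churned))    # bob spoke with carol -> 1
print(churn_network_flag("alice", network, churned))  # alice's contacts did not churn -> 0
```

In a real model this flag would be one variable among the 2,000+ call-history variables fed to the churn predictor.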
Visualizations




http://www.tableausoftware.com/public/gallery/taleof100
Visualizations




http://flowingdata.com/2011/10/12/where-people-dont-use-facebook/#more-19315
Big Data Technologies
Big Data & the RDBMS


 Volume/Velocity is too great for a single system to manage

 Big Data often requires distribution over multiple hosts

 The general purpose relational database is difficult to scale out

 Has been a computer science challenge for many years
   • Giving up nothing and scaling out has proven very difficult
Relaxing Consistency



Consistent, Eventually (Eventual Consistency)

“Given enough time without changes everything will eventually be consistent”

 Many big data technologies relax aspects of CAP to achieve scale, performance &
  availability
 Not necessarily suitable for all requirements
 Not necessarily the only way to achieve scale

 Much debate around CAP, BASE
 Tunable consistency
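A minimal Python sketch of eventual consistency, assuming a simple last-write-wins rule and a gossip-style sync step (both simplifications of what real systems do):

```python
class Replica:
    """A single node holding a last-write-wins versioned key/value map."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.store.get(key)
        if current is None or ts > current[0]:  # last write wins
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Gossip every entry to every replica; given enough time, all converge."""
    for source in replicas:
        for key, (ts, value) in list(source.store.items()):
            for target in replicas:
                target.write(key, value, ts)

a, b = Replica(), Replica()
a.write("cart", ["book"], ts=1)   # write lands on replica A only
print(b.read("cart"))             # None -> stale read before sync
anti_entropy([a, b])
print(b.read("cart"))             # ['book'] -> consistent after sync
```

The stale read in the middle is exactly the behaviour that may make eventual consistency unsuitable for some requirements.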
NoSQL
Big Data Webinar

What is NoSQL?

 The NoSQL term was coined in 1998, however it remained largely unused until 2009

 Resurgence was the result of key papers from Amazon & Google and the resulting projects at the likes of
  Rackspace and Facebook.
    • Google Bigtable (http://labs.google.com/papers/bigtable.html)
    • Amazon Dynamo (www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)

 Started as a design philosophy of simplifying data storage technologies to the point where massive scale
  could be achieved

 Now more broad, includes a range of data management technologies which are not based on the
  relational model.

 NoSQL data stores tend to fall into broad categories
    • Distributed Key/Value stores
    • Document stores
    • Graph Databases

 A little ironic as many “NoSQL” data stores have evolved support for SQL-like queries.
Big Data Webinar

NoSQL




                   www.google.com/trends
Big Data Webinar

RDBMS




                   www.google.com/trends
NoSQL

Distributed Key/Value Store - Column Family

   Database system designed to scale               Cassandra
      • Distributed scale over many nodes
                                                    Oracle Big Data Appliance
   Usually focused on transactional                   • BerkeleyDB
    performance
      • Usually unsuitable for OLAP                 Voldemort

   Limited querying capabilities                   Hypertable
       • Primarily single record lookup by “key”
                                                    Redis
   Usually flexible schema
       • Row 1 may have a different schema to        Riak
         Row 2

   Usually
      • No Joins / normalization
      • No ordering
      • No aggregation
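The behaviour described above (key-hashed distribution across nodes, flexible per-row schema, lookup by primary key only) can be sketched in a few lines of Python; the class and node names are illustrative, not any particular product's API:

```python
import hashlib

class PartitionedKVStore:
    """Toy distributed key/value store: rows are placed by hashing the key,
    values are opaque dicts, and the only access path is lookup by key."""
    def __init__(self, nodes):
        self.nodes = {name: {} for name in nodes}

    def _node_for(self, key):
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        names = sorted(self.nodes)
        return names[digest % len(names)]  # hash partitioning across nodes

    def put(self, key, row):
        self.nodes[self._node_for(key)][key] = row

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

store = PartitionedKVStore(["node1", "node2", "node3"])
# Row 1 and Row 2 may have entirely different schemas
store.put("user:1", {"name": "Ann", "email": "ann@example.com"})
store.put("user:2", {"name": "Bob", "last_login": "2011-10-01"})
print(store.get("user:2")["name"])  # single-record lookup by key -> 'Bob'
```

Note there is no join, ordering or aggregation API at all; anything beyond get-by-key is the application's problem.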
NoSQL

Document Model

  Records are considered documents               MongoDB

  Records have a flexible or varying schema      CouchDB
   depending on attributes of that document
                                                  MarkLogic
  Schemas are usually based on an encoding
   such as JSON, BSON or XML

  Document-oriented data stores generally
   have more functionality than K/V stores so
   sacrifice some performance

  Document-oriented data stores tend to fit
   more naturally with OO paradigms than the
   RDBMS, reducing plumbing code
   significantly

  Scale out is generally possible using
   sharding and replication
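A minimal Python sketch of the document model: schema-free JSON records, queryable by any attribute, which is the extra functionality gained over a pure key/value store (the collection class here is hypothetical, not MongoDB's or CouchDB's API):

```python
import json

class DocumentCollection:
    """Toy document store: each record is a schema-free JSON document,
    and queries can match on any attribute, not just a primary key."""
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # Round-trip through JSON to mimic an encoding-based schema
        self.docs.append(json.loads(json.dumps(doc)))

    def find(self, **criteria):
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in criteria.items())]

posts = DocumentCollection()
posts.insert({"author": "tony", "title": "What is Big Data?", "tags": ["bigdata"]})
posts.insert({"author": "tony", "title": "NoSQL", "comments": 3})  # different schema

print(len(posts.find(author="tony")))  # query by attribute -> 2
```

Because each document maps directly onto an application object, the object/relational "plumbing" layer largely disappears.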
NoSQL

Graph

  Graph databases are used to store entities    DEX
   and relationships between entities
                                                 FlockDB
  Nodes, properties and edges
                                                  InfiniteGraph
  Examples of Graph database strengths
     • Shortest distance between entities        Neo4J
     • Social network analysis
     • If Bob knows Bill and Mary knows Bill
       maybe Bob knows Mary
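The “Bob knows Bill, Mary knows Bill” example reduces to a shortest-path query, which can be sketched with a breadth-first search over an adjacency map (illustrative Python, not any graph database's query language):

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected 'knows' graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connection between the entities

knows = [("Bob", "Bill"), ("Mary", "Bill"), ("Bill", "Sue")]
print(shortest_path(knows, "Bob", "Mary"))  # ['Bob', 'Bill', 'Mary']
```

Graph databases make this kind of relationship traversal a first-class, indexed operation rather than a chain of self-joins.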
NewSQL
NewSQL

What is NewSQL?

 Term coined recently by Michael Stonebraker

 Not throwing out the baby with the bathwater
   • SQL and the Relational Model is well known by almost
       every developer
   • Arguments that the relational model itself is not a
       limitation on scale, it is the physical implementation of
       the model which has limitations

 Retaining key aspects of the RDBMS but removing some of
  the general purpose nature

 By specializing, the challenges associated with scaling can be
  more easily overcome
NewSQL

In Memory TP

  In Memory is becoming a reality for         Examples
   transaction processing
                                                VoltDB
  Often designed only to process short         SAP Hana
   duration transactions

  Systems are appearing with 1TB RAM per
   node

  Distributed K-Safety protects from single
   node failure

  Traditional command logging/snapshotting
   can provide protection from complete
   system failure
      • System bug
      • Comes with performance costs
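Command logging can be sketched as follows: every mutation is appended to a durable log before being applied in memory, and recovery replays the log. This is a toy Python illustration of the idea, not VoltDB's or SAP HANA's actual mechanism:

```python
class InMemoryStore:
    """Toy in-memory store protected by command logging: mutations are
    logged before being applied, and recovery replays the log."""
    def __init__(self, log):
        self.log = log
        self.state = {}
        for op, key, value in log:   # recovery: replay logged commands
            self._apply(op, key, value)

    def _apply(self, op, key, value):
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

    def execute(self, op, key, value=None):
        self.log.append((op, key, value))  # durable write first
        self._apply(op, key, value)

log = []
store = InMemoryStore(log)
store.execute("set", "balance:42", 100)
store.execute("set", "balance:42", 80)

recovered = InMemoryStore(log)        # simulate a restart after total failure
print(recovered.state["balance:42"])  # 80
```

The extra log write on every transaction is the performance cost mentioned above; snapshotting bounds how much log must be replayed.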
NewSQL

MPP

  Focus on distributing data and load            Examples
   across multiple hosts for the purpose of
   analytical query performance
                                                   Microsoft Parallel Data Warehouse
  Often use brute force access methods (i.e.
                                                   HP Vertica
   no user indexing)
                                                   EMC Greenplum
  Often focus on running low concurrency
   high impact queries                             Teradata / Aster Data
  Often columnar storage or hybrid                IBM Netezza
      • Improved compression
      • Improved query performance across
         wide tables
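A toy Python sketch of why columnar storage helps: a query touches only the columns it needs, and low-cardinality columns compress well under run-length encoding (the data and the choice of RLE here are illustrative):

```python
def run_length_encode(column):
    """RLE: collapse runs of repeated values into (value, count) pairs.
    Sorted or low-cardinality columns compress very well this way."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

rows = [("US", 10), ("US", 20), ("US", 15), ("UK", 9)]   # row-oriented layout
columns = {"country": [r[0] for r in rows],              # column-oriented layout
           "amount":  [r[1] for r in rows]}

print(run_length_encode(columns["country"]))  # [('US', 3), ('UK', 1)]
print(sum(columns["amount"]))                 # aggregate without reading 'country'
```

On a wide table, scanning one column this way avoids reading the other hundred, which is where the analytical query speedup comes from.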
Hadoop
Hadoop

What is Hadoop?

 Apache Hadoop is an open source project inspired by Google’s Map/Reduce
  and GFS papers
    • http://labs.google.com/papers/mapreduce.html
    • http://labs.google.com/papers/gfs.html

 Hadoop consists of
    • A distributed files system (HDFS)
    • The Map/Reduce engine

 Yahoo has been the largest contributor to the project, although now many
  original contributors are working for Cloudera or Hortonworks

 Hadoop is not a database
    • Data lives in the file system in raw format
    • Jobs process data direct from the files system using Map/Reduce
    • Data can be directly accessed without ETL
The Scale of Hadoop

 Hadoop Scales to very large clusters, capable of supporting very large
  workloads
    • Yahoo has ~20 Hadoop clusters ~200PB (largest 4000 nodes)
    • Facebook have 2000 nodes in a single cluster with 20PB

 Cloudera has 22 customers in prod with over 1PB

BUT we are not all Facebook or Yahoo

 Most clusters are less than 30 nodes
    • The average cluster size is 200 nodes being boosted by the increasing
       number of large clusters

 In the Enterprise Hadoop is being used increasingly for
      • Data Processing
      • Analytics
Hadoop

HDFS & Map/Reduce

  HDFS                                  Map/Reduce
  Distributed file system               Programming Paradigm
  Responsible for ensuring each         Takes seemingly complex tasks
   block is replicated to multiple        and breaks them into a series of
   nodes                                    • Mapping Tasks
     • Protection from node failure         • Reduction Tasks
     • The larger the cluster size, the
        more likely node failure is
                                         Elastic Map/Reduce
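The Map/Reduce paradigm can be sketched in a few lines of Python: mapping tasks emit key/value pairs from each input split, the framework shuffles them by key, and reduction tasks combine each key's values. Word count is the classic illustration:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Mapping task: emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduction task: combine all values emitted for one key."""
    return key, sum(values)

splits = ["big data big", "data data"]                     # e.g. HDFS blocks
pairs = chain.from_iterable(map_phase(s) for s in splits)  # maps run in parallel
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3}
```

In Hadoop the map and reduce tasks run in parallel across the cluster, with each map reading its split locally from HDFS.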
How is Hadoop used in the Enterprise?

  FSI                                      Retail & Manufacturing
   Customer Risk Analysis                  Customer Churn
   Surveillance and Fraud Detection        Brand and Sentiment Analysis
   Central Data Repository                 Point of Sales
   Personalization and Asset Management    Pricing Models
   Market Risk Modeling                    Customer Loyalty
   Trade Performance Analytics             Targeted Offers


  Science & Energy
   Genomics
   Utilities and Power Grid
   Smart Meters
   Biodiversity Indexing
   Network Failures
   Seismic Data
Summary
Where does this leave the RDBMS?

 Absolutely the RDBMS will remain the single most important data management technology for the
  foreseeable future
      • Its “General Purpose” nature still makes it the technology of choice for 95% of data
        management needs
     • What we have discussed in this presentation is the 5% of requirements which have volume/velocity
        issues not suitably managed by the RDBMS

 These technologies are on a convergence path
     • NoSQL, NewSQL, Hadoop, MPP – were all created out of a need that wasn’t being met
    • But we don’t “want” many different technologies, we like one “General Purpose” tool
    • Bringing the technologies together through acquisition, innovation and integration is the role of the
       major vendors

 This is already happening
The major Vendors are following

 Oracle
    • Exadata
    • Big Data Appliance (http://www.oracle.com/us/corporate/features/feature-obda-498724.html)
    • Oracle Tools & Loader for Hadoop (http://www.oracle.com/us/corporate/features/feature-oracle-
       loader-for-hadoop-505115.html)
    • Exalytics Appliance (http://gigaom.com/2011/10/02/oracle-exalytics-attacks-big-data-analytics/)

 Microsoft
    • Parallel Data Warehouse (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-
       warehousing/pdw.aspx)
    • SQL Server connector for Hadoop (http://www.microsoft.com/download/en/details.aspx?id=27584)
    • Microsoft Hadoop (2012) (http://www.zdnet.com/blog/microsoft/microsoft-to-develop-hadoop-
       distributions-for-windows-server-and-azure/10958)
    • Microsoft Daytona (http://research.microsoft.com/en-us/projects/daytona/)
How does an enterprise prepare?

 Understand success metrics & where competitive advantage is derived
    • Customer retention, upsell, revenue growth

 This is business driven, not technology driven

 IT can however ensure they are educated on new ways to fulfill business requirements

 Even if you remain aligned with major vendors, diversity is imminent
Q&A

More Related Content

What's hot

Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
Derek Stainer
 
Less05 storage
Less05 storageLess05 storage
Less05 storage
Imran Ali
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
magda3695
 

What's hot (20)

Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 
Database Security & Encryption
Database Security & EncryptionDatabase Security & Encryption
Database Security & Encryption
 
Facebook thrift
Facebook thriftFacebook thrift
Facebook thrift
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Less05 storage
Less05 storageLess05 storage
Less05 storage
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Google File System
Google File SystemGoogle File System
Google File System
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
Php Lecture Notes
Php Lecture NotesPhp Lecture Notes
Php Lecture Notes
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Database Design and the ER Model, Indexing and Hashing
Database Design and the ER Model, Indexing and HashingDatabase Design and the ER Model, Indexing and Hashing
Database Design and the ER Model, Indexing and Hashing
 
Ozone: An Object Store in HDFS
Ozone: An Object Store in HDFSOzone: An Object Store in HDFS
Ozone: An Object Store in HDFS
 

Viewers also liked

分布式构架简介 草稿
分布式构架简介 草稿分布式构架简介 草稿
分布式构架简介 草稿
guestd7133d1
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
NewSQL Database Overview
Steve Min
 

Viewers also liked (17)

Rock Solid SQL Server Management
Rock Solid SQL Server ManagementRock Solid SQL Server Management
Rock Solid SQL Server Management
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, Opportunities
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Tachyon 2015 08 China
Tachyon 2015 08 ChinaTachyon 2015 08 China
Tachyon 2015 08 China
 
分布式构架简介 草稿
分布式构架简介 草稿分布式构架简介 草稿
分布式构架简介 草稿
 
Choosing a Next Gen Database: the New World Order of NoSQL, NewSQL, and MySQL
Choosing a Next Gen Database: the New World Order of NoSQL, NewSQL, and MySQLChoosing a Next Gen Database: the New World Order of NoSQL, NewSQL, and MySQL
Choosing a Next Gen Database: the New World Order of NoSQL, NewSQL, and MySQL
 
结构化数据存储
结构化数据存储结构化数据存储
结构化数据存储
 
华为软件定义存储架构分析
华为软件定义存储架构分析华为软件定义存储架构分析
华为软件定义存储架构分析
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
NewSQL Database Overview
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQL
 
My SQL and Ceph: Head-to-Head Performance Lab
My SQL and Ceph: Head-to-Head Performance LabMy SQL and Ceph: Head-to-Head Performance Lab
My SQL and Ceph: Head-to-Head Performance Lab
 
What you need to know about ceph
What you need to know about cephWhat you need to know about ceph
What you need to know about ceph
 
MySQL vs. NoSQL and NewSQL - survey results
MySQL vs. NoSQL and NewSQL - survey resultsMySQL vs. NoSQL and NewSQL - survey results
MySQL vs. NoSQL and NewSQL - survey results
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQL
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar to Big Data, NoSQL, NewSQL & The Future of Data Management

Content1. Introduction2. What is Big Data3. Characte.docx
Content1. Introduction2. What is Big Data3. Characte.docxContent1. Introduction2. What is Big Data3. Characte.docx
Content1. Introduction2. What is Big Data3. Characte.docx
dickonsondorris
 

Similar to Big Data, NoSQL, NewSQL & The Future of Data Management (20)

Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment Options
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Data warehouse Vs Big Data
Data warehouse Vs Big Data Data warehouse Vs Big Data
Data warehouse Vs Big Data
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentation
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar Semwal
 
Big Data Analytics Materials, Chapter: 1
Big Data Analytics Materials, Chapter: 1Big Data Analytics Materials, Chapter: 1
Big Data Analytics Materials, Chapter: 1
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Big data
Big dataBig data
Big data
 
Big data.pptx
Big data.pptxBig data.pptx
Big data.pptx
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
 
Big_Data_ppt[1] (1).pptx
Big_Data_ppt[1] (1).pptxBig_Data_ppt[1] (1).pptx
Big_Data_ppt[1] (1).pptx
 
Big data and analytics
Big data and analyticsBig data and analytics
Big data and analytics
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
big data
big data big data
big data
 
Content1. Introduction2. What is Big Data3. Characte.docx
Content1. Introduction2. What is Big Data3. Characte.docxContent1. Introduction2. What is Big Data3. Characte.docx
Content1. Introduction2. What is Big Data3. Characte.docx
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Special issues on big data
Special issues on big dataSpecial issues on big data
Special issues on big data
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 

Big Data, NoSQL, NewSQL & The Future of Data Management

  • 1. Big Data, NoSQL, NewSQL & The Future of Data Management Tony Bain RockSolid SQL www.rocksolidsql.com
  • 3. RockSolid SQL About this Webinar  Big Data • What is it? • Why does it matter? • How it impacts the Enterprise?  Data Science Tech Level 100  NoSQL • Key/Value Stores • Document Stores • Graph Databases  NewSQL • In Memory Transaction Processing • MPPP  Hadoop  Summary
  • 5. Big Data Webinar What is Big Data? “Big data[1] are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage,[2] search, sharing, analytics,[3] and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to "spot business trends, prevent diseases, combat crime."[4] Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.[5]” “Variety– Big data extends beyond structured data, including unstructured data of all varieties Velocity – Rate of acquisition and desired rate of consumption. Volume –Terabytes and even petabytes of information.”
  • 6. My Definition - Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs. While Big Data as a term seems to refer to volume, this isn’t the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands from that data. And those usage demands are nearly always tied to timeliness. Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative and constantly changing; however, right now this is somewhere along the path towards the end goal: the ability to handle an unlimited volume of data, processing all requests in real time. Jan 2010 http://blog.tonybain.com/tony_bain/2010/01/what-is-big-data.html
  • 7. For the purpose of this talk: Big Data = Volume and Variety of Data that is difficult to manage using traditional data management technology
  • 8. Where does big data come from? - Big Data is not an outcome of what we consider traditional user “transactions”. “VisaNet authorizes, clears and settles an average of 130 million transactions per day in 200 countries and territories.” (http://corporate.visa.com/about-visa/technology/transaction-processing.shtml) - Instead Big Data is mostly sourced from “Meta Data” about our existence • Sensors used to gather telemetry information • Social media interactions & relationships • Pictures and videos posted online • Cell phone GPS signals • Web page viewing habits • Network, computer-generated, and equipment logs • Device outputs • Online gaming
  • 9. Where does big data come from? - Instead Big Data is generally meta data. “Since we only track the data that the games explicitly want to track, and it’s all structured data … We just write the 5TB of structured information that we need, and this is a scale that high-end SQL databases can definitely handle.” “eBay processes 50 petabytes of information each day while adding 40 terabytes of data each day as millions buy and sell items across 50,000 categories. Over 5,000 business users and analysts turn over a terabyte of data every eight seconds. Raising parallel efficiency, the effectiveness of distributing large amounts of workload over pools and grids of servers, equals millions in operating expense savings.”
  • 10. What about the Enterprise? - The median data warehouse has remained consistent for a number of years (6.6GB) • Is the pending Big Data explosion hype or a reality? - The enterprise is focused on ROI • We do store large amounts of historical data, as this serves a perceived need & the overhead is low • We have not yet started capturing large amounts of meta data, as demonstrating return on this investment to date would have been difficult • More data will be captured as we gain the ability to derive value from that data • More data will be captured as we understand the value of auxiliary data - The skills gap • Big data experts are few and far between • Technologies are new and raw • The combination of skills is new. However, this is expected to change in the next 2-5 years
  • 11. Gartner Hype Cycle, 2011 http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
  • 12. Why Big Data? - Competitive Advantage is becoming highly time sensitive • Consumers are fickle, change is easy, word of mouth is rapid and powerful • The rate at which we can understand and respond to changes in customer sentiment, product needs, satisfaction, engagement, and risk profiles = competitive strength • Reinvention is constant - So how do we do this? • With Data Science!
  • 13. What is Data Science? “I keep saying that the sexy job in the next 10 years will be statisticians,” Hal Varian, chief economist at Google. “Data is the next Intel Inside.” Tim O'Reilly, O'Reilly. “We’re rapidly entering a world where everything can be monitored and measured, but the big problem is going to be the ability of humans to use, analyze and make sense of the data.” Erik Brynjolfsson, MIT. “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible.” Michael Driscoll, Metamarkets
  • 14. What is Data Science? - Historical Business Analytics • Effectively reporting what has happened or what is happening • Using pre-determined models for prediction • Known data models • Known questions • Known outcomes - The Future of Data Science • Discovering new information from historical data • Often involves combining public & private data sets • Using machine learning to create unique prediction models • Diverse and complex data models • Questions may not be known • Outcomes may not be expected • Visualizing results
  • 15. Public Data Sets - There is a phenomenal amount of relevant publicly available data sets • Geographical, Climatic, Demographic, Knowledge bases • Historical, political, medical, sports, social, music - Infochimps (15,000 data sets) • NYSE Daily 1970-2010 Open, Close, High, Low and Volume • Global Daily Weather Data from the National Climate Data Center (NCDC) • Average Hours Worked Per Day by Employed Persons: 2005 • Family Net Worth -- Mean and Median Net Worth in Constant - Factual - Data.gov
  • 16. Telecommunications Churn - Analyzed billions of calls from millions of customers - More than 2,000 call history variables such as time to end of contract, subscriber tenure, number of customer care calls, etc. - Adding social networking variables improved churn prediction significantly - The "social networking" variables were identified by observing who each customer spoke with on the phone most frequently - If someone in your caller network churned, you were then ~700% more likely to churn as well.
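A minimal sketch of the kind of "social" churn feature described above: from raw call records, count how many of each subscriber's contacts have already churned. The names, call records, and feature definition here are illustrative, not from the actual study.

```python
# Derive a simple social-network churn feature from call records:
# for each subscriber, count contacts who have already churned.
from collections import defaultdict

calls = [("ann", "bob"), ("ann", "cara"), ("bob", "cara"), ("dan", "ann")]
churned = {"cara"}  # subscribers known to have left

# Build an undirected contact graph from who called whom.
contacts = defaultdict(set)
for caller, callee in calls:
    contacts[caller].add(callee)
    contacts[callee].add(caller)

# The feature: number of churned contacts in each person's caller network.
churned_contacts = {person: len(contacts[person] & churned)
                    for person in contacts}
print(churned_contacts["ann"])  # ann spoke with cara, who churned -> 1
```

In a real model this count would be one of the thousands of input variables alongside tenure, contract dates, and care-call history.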
  • 20. Big Data & the RDBMS - Volume/Velocity is too great for a single system to manage - Big Data often requires distribution over multiple hosts - The general purpose relational database is difficult to scale out - Has been a computer science challenge for many years • Giving up nothing and scaling out has proven very difficult
  • 21. Relaxing Consistency Consistent, Eventually (Eventual Consistency): “Given enough time without changes everything will eventually be consistent” - Many big data technologies relax aspects of CAP to achieve scale, performance & availability - Not necessarily suitable for all requirements - Not necessarily the only way to achieve scale - Much debate: CAP, BASE - Tunable consistency
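Tunable consistency is often expressed as read/write quorum sizes: with N replicas, a write acknowledged by W of them and a read consulting R of them is consistent when R + W > N, because every read quorum then overlaps every write quorum. A toy sketch of the idea (a hypothetical replica set, not any specific product's API):

```python
# Quorum-based tunable consistency sketch: N replicas, writes go to W of
# them, reads consult R of them. With R + W > N the read quorum always
# overlaps the write quorum, so the highest-versioned value seen on a
# read is the latest acknowledged write.

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "R + W must exceed N for strongly consistent reads"
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.clock = 0  # stand-in for a version/vector clock

    def write(self, value):
        self.clock += 1
        # Only W replicas need to acknowledge; we simulate that by
        # updating the first W and leaving the rest stale.
        for rep in self.replicas[: self.w]:
            rep.version, rep.value = self.clock, value

    def read(self):
        # Consult R replicas and return the newest version seen.
        sampled = self.replicas[-self.r:]  # guaranteed to overlap the write set
        newest = max(sampled, key=lambda rep: rep.version)
        return newest.value

store = QuorumStore(n=3, w=2, r=2)
store.write("v1")
store.write("v2")
print(store.read())  # -> v2 despite one replica still holding stale data
```

Lowering R or W below this threshold trades the consistency guarantee for latency and availability, which is the knob "tunable consistency" refers to.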
  • 22. NoSQL
  • 23. Big Data Webinar What is NoSQL? - The NoSQL term was coined in 1998, however it remained largely unused until 2009 - Resurgence was the result of key papers from Amazon & Google and the resulting projects at the likes of Rackspace and Facebook • Google Bigtable (http://labs.google.com/papers/bigtable.html) • Amazon Dynamo (www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) - Started as a design philosophy of simplifying data storage technologies to the point where massive scale could be achieved - Now more broad; includes a range of data management technologies which are not based on the relational model - NoSQL data stores tend to fall into broad categories • Distributed Key/Value stores • Document stores • Graph Databases - A little ironic, as many “NoSQL” data stores have evolved support for SQL-like queries.
  • 24. Big Data Webinar NoSQL www.google.com/trends
  • 25. Big Data Webinar RDBMS www.google.com/trends
  • 26. NoSQL Distributed Key/Value Store - Column Family - Database systems designed to scale • Distributed scale over many nodes - Usually focused on transactional performance • Usually unsuitable for OLAP - Limited querying capabilities • Primarily single record lookup by “key” - Usually flexible schema • Row 1 may have a different schema to Row 2 - Usually • No joins / normalization • No ordering • No aggregation. Examples: Cassandra, Oracle Big Data Appliance, BerkeleyDB, Voldemort, Hypertable, Redis, Riak
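The access pattern these stores optimize for, put/get by primary key with the key deciding which node holds the data, can be illustrated with a toy sharded store. This is illustrative code, not the API of Redis, Riak, or Cassandra:

```python
# Toy distributed key/value store: each key hashes to one of several
# "nodes", and the only operations are put/get by key -- no joins, no
# ordering, no aggregation, mirroring the trade-off these systems make
# to scale out.
import hashlib

class ShardedKVStore:
    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key to pick a node; a real system would use
        # consistent hashing so nodes can join and leave cheaply.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        # Flexible schema: the value can be any structure.
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

store = ShardedKVStore()
store.put("user:1", {"name": "Ann", "email": "ann@example.com"})
store.put("user:2", {"name": "Bob"})  # different attributes per record
print(store.get("user:2"))            # -> {'name': 'Bob'}
```

Anything beyond key lookup (a query over all users, say) would require scanning every node, which is exactly why these stores are described as having limited querying capabilities.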
  • 27. NoSQL Document Model - Records are considered documents - Records have a flexible or varying schema depending on attributes of that document - Schemas are usually based on an encoding such as JSON, BSON or XML - Document orientated data stores generally have more functionality than k/v stores, so sacrifice some performance - Document orientated data stores tend to fit more naturally with OO paradigms than RDBMS, reducing plumbing code significantly - Scale out is generally possible using sharding and replication. Examples: MongoDB, CouchDB, MarkLogic
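The flexible-schema idea can be shown with a toy store where each record is a JSON-like dict and queries match on whatever fields a document happens to have. This is a sketch of the model, not MongoDB's or CouchDB's actual query API:

```python
# Toy document store: records are JSON-like dicts with varying
# attributes, and a query matches on any fields a document carries.

class DocumentStore:
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # No fixed schema is enforced: any dict is a valid document.
        self.docs.append(doc)

    def find(self, **criteria):
        # A document matches if every queried field equals the given
        # value; documents missing a field simply don't match.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in criteria.items())]

db = DocumentStore()
db.insert({"type": "book", "title": "Big Data", "tags": ["nosql"]})
db.insert({"type": "talk", "title": "NewSQL", "speaker": "Tony Bain"})
print(db.find(type="talk"))  # only the second document has a 'speaker' field
```

Because a document maps directly onto an application object, there is no object-relational mapping layer to write, which is the "reduced plumbing code" point above.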
  • 28. NoSQL Graph - Graph databases are used to store entities and relationships between entities - Nodes, properties and edges - Examples of Graph database strengths • Shortest distance between entities • Social network analysis • If Bob knows Bill and Mary knows Bill, maybe Bob knows Mary. Examples: DEX, FlockDB, InfiniteGraph, Neo4J
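The "shortest distance between entities" strength can be sketched with the Bob/Bill/Mary example above: people as nodes, "knows" relationships as edges, and a breadth-first search for the shortest path. This is plain Python over an adjacency list, not a graph database's query language:

```python
# Toy graph traversal: entities are nodes, "knows" relationships are
# undirected edges, and breadth-first search returns a shortest path.
from collections import deque

edges = [("Bob", "Bill"), ("Mary", "Bill"), ("Bill", "Sue")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def shortest_path(start, goal):
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connection between the entities

print(shortest_path("Bob", "Mary"))  # -> ['Bob', 'Bill', 'Mary']
```

Bob and Mary share the contact Bill, so the path has length two; in an RDBMS the same question would need one self-join per hop, which is why multi-hop relationship queries are a graph database strength.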
  • 30. NewSQL What is NewSQL - Term coined recently by Michael Stonebraker - Not throwing out the baby with the bathwater • SQL and the Relational Model is well known by almost every developer • Arguments that the relational model itself is not a limitation on scale; it is the physical implementation of the model which has limitations - Retaining key aspects of the RDBMS but removing some of the general purpose nature - By specializing, challenges associated with scaling can be more easily overcome
  • 31. NewSQL In Memory TP - In-memory is becoming a reality for transaction processing - Often designed only to process short duration transactions - Systems are appearing with 1TB RAM per node - Distributed K-Safety protects from single node failure - Traditional command logging/snapshotting can provide protection from complete system failure • System bug • Comes with performance costs. Examples: VoltDB, SAP HANA
  • 32. NewSQL MPP - Focus on distributing data and load across multiple hosts for the purpose of analytical query performance - Often use brute force access methods (i.e. no user indexing) - Often focus on running low concurrency, high impact queries - Often columnar storage or hybrid • Improved compression • Improved query performance across wide tables. Examples: Microsoft Parallel Data Warehouse, HP Vertica, EMC Greenplum, Teradata / Aster Data, IBM Netezza
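The row-versus-columnar trade-off can be shown in a few lines: an analytic aggregate over one column only needs that column's values, so a columnar layout scans far less data than a row layout, and values of one type stored together also compress well. A sketch with illustrative data:

```python
# Row vs columnar layout sketch. An aggregate like SUM(revenue) must
# touch every field of every row in a row store, but only the single
# 'revenue' array in a column store -- the core reason MPP analytic
# databases favor columnar (or hybrid) storage.
rows = [
    {"region": "EU", "product": "A", "revenue": 120},
    {"region": "US", "product": "B", "revenue": 300},
    {"region": "EU", "product": "B", "revenue": 80},
]

# Columnar layout: one array per column, same information as 'rows'.
columns = {name: [r[name] for r in rows] for r in rows[:1] for name in r}

# The aggregate reads one contiguous array instead of whole rows.
total = sum(columns["revenue"])
print(total)  # -> 500
```

On a wide table with hundreds of columns the saving is proportionally larger, which is the "improved query performance across wide tables" point above.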
  • 34. Hadoop What is Hadoop? - Apache Hadoop is an open source project inspired by Google’s Map/Reduce and GFS papers • http://labs.google.com/papers/mapreduce.html • http://labs.google.com/papers/gfs.html - Hadoop consists of • A distributed file system (HDFS) • The Map/Reduce engine - Yahoo has been the largest contributor to the project, although now many original contributors are working for Cloudera or Hortonworks - Hadoop is not a database • Data lives in the file system in raw format • Jobs process data direct from the file system using Map/Reduce • Data can be directly accessed without ETL
  • 35. The Scale of Hadoop - Hadoop scales to very large clusters, capable of supporting very large workloads • Yahoo has ~20 Hadoop clusters, ~200PB (largest 4000 nodes) • Facebook has 2000 nodes in a single cluster with 20PB - Cloudera has 22 customers in prod with over 1PB. BUT we are not all Facebook or Yahoo - Most clusters are less than 30 nodes • The average cluster size is 200 nodes, boosted by the increasing number of large clusters - In the Enterprise, Hadoop is being used increasingly for • Data Processing • Analytics
  • 36. Hadoop HDFS & Map/Reduce - HDFS • Distributed file system • Responsible for ensuring each block is replicated to multiple nodes • Protection from node failure • The larger the cluster size, the more likely node failure is - Map/Reduce • Programming paradigm • Takes seemingly complex tasks and breaks them into a series of Mapping Tasks and Reduction Tasks • Elastic Map/Reduce
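The Map/Reduce decomposition above has a standard shape, which the classic word-count example makes concrete: map emits (key, value) pairs, a shuffle groups pairs by key, and reduce folds each group into a result. A minimal single-process sketch (a Hadoop job distributes these same phases across the cluster):

```python
# Minimal Map/Reduce word count: map emits (word, 1) pairs, shuffle
# groups the pairs by word, reduce sums each group. A real Hadoop job
# runs the map tasks on the nodes holding the HDFS blocks and the
# reduce tasks on the shuffled groups.
from collections import defaultdict

def map_phase(line):
    # Emit one (word, 1) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Fold each group down to a single (word, count) result.
    return key, sum(values)

lines = ["big data big scale", "big ideas"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # -> 3
```

The framework's value is that map and reduce are pure functions over independent inputs, so the runtime can parallelize, retry, and relocate tasks without the programmer managing distribution.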
  • 37. How is Hadoop used in the Enterprise? - FSI • Customer Risk Analysis • Surveillance and Fraud Detection • Central Data Repository • Asset Management • Market Risk Modeling • Trade Performance Analytics - Retail & Manufacturing • Customer Churn • Brand and Sentiment Analysis • Point of Sales • Personalization and Pricing Models • Customer Loyalty • Targeted Offers - Science • Genomics • Biodiversity Indexing • Seismic Data - Energy • Utilities and Power Grid • Smart Meters • Network Failures
  • 39. Where does this leave the RDBMS? - Absolutely, the RDBMS will remain the single most important data management technology for the foreseeable future • Its “General Purpose” nature still makes it the suitable technology of choice for 95% of data management needs • What we have discussed in this presentation is the 5% of requirements which have volume/velocity issues not suitably managed by the RDBMS - These technologies are on a convergence path • NoSQL, NewSQL, Hadoop, MPP – all were created out of a need that wasn’t being met • But we don’t “want” many different technologies; we like one “General Purpose” tool • Bringing the technologies together through acquisition, innovation and integration is the role of the major vendors - This is already happening
  • 40. The major vendors are following - Oracle • Exadata • Big Data Appliance (http://www.oracle.com/us/corporate/features/feature-obda-498724.html) • Oracle Tools & Loader for Hadoop (http://www.oracle.com/us/corporate/features/feature-oracle-loader-for-hadoop-505115.html) • Exalytics Appliance (http://gigaom.com/2011/10/02/oracle-exalytics-attacks-big-data-analytics/) - Microsoft • Parallel Data Warehouse (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx) • SQL Server connector for Hadoop (http://www.microsoft.com/download/en/details.aspx?id=27584) • Microsoft Hadoop (2012) (http://www.zdnet.com/blog/microsoft/microsoft-to-develop-hadoop-distributions-for-windows-server-and-azure/10958) • Microsoft Daytona (http://research.microsoft.com/en-us/projects/daytona/)
  • 41. How does an enterprise prepare? - Understand success metrics & where competitive advantage is derived • Customer retention, upsell, revenue growth - This is business driven, not technology driven - IT can however ensure they are educated on new ways to fulfill business requirements - Even if you remain aligned with the major vendors, diversity is imminent
  • 42. Q&A

Editor's Notes

  1. Facebook 30PB
  2. Various surveys 6.6GB