Big Data Strategy
for the Relational World
Embracing Disruption, Avoiding Regression
Andrew J. Brust
Founder & CEO, Blue Badge Insights
Big Data correspondent, ZDNet
Big Data Analyst, GigaOM Research
Bio
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair, Visual Studio Live!; 18 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond Developer News
• Twitter: @andrewbrust
Andrew on ZDNet (bit.ly/bigondata)
Read all about it!
Big Data: Why Should You Care?
• Because analytics (i.e. BI) has always been
important, but it was expensive and obscure
• Because the economics of processing and
storage make Big Data feasible
Big Data: Why Should You Be Cautious?
• Too many vendors; too much churn
• Designed for the lab, not for mainstream
business
• Immature technology and tooling
– Results in serious recruiting and dev costs
• So, you can’t ignore Big Data, but you can’t just pursue it with abandon, either.
– That’s hard!
Agenda
• Trends
• Technologies
– NoSQL
– Hadoop
– SQL Convergence
– NewSQL
– In-Memory
• Forecasts
• Risks
• Recommendations
Database Trends
• NoSQL (Mongo and Cassandra, primarily)
• Late-bound schema (aka “unstructured data”)
• File-based table handling (especially HDFS)
• Columnar storage (and Massively Parallel Processing)
• Co-existence with RDBMS and OLAP databases (very few are throwing them away)
• Little change in tools/clients (still expect tables or cubes)
NoSQL
• Key-Value Store: Couchbase, Riak, Redis, Voldemort, DynamoDB, Azure tables
• Document Store: MongoDB, CouchDB, Cloudant, Couchbase
• Wide Column Store: HBase, Cassandra
• Graph Database: Neo4J
Consistency
• CAP Theorem
– A distributed database can fully deliver only two of the following three attributes: consistency, availability and partition tolerance
• Most NoSQL databases do not offer “ACID” guarantees
– Atomicity, consistency, isolation and durability
• Instead, they offer “eventual consistency”
– Similar to DNS propagation
CAP Theorem
[Diagram: the CAP triangle (Consistency, Availability, Partition Tolerance), with relational databases positioned toward consistency and availability, and NoSQL toward availability and partition tolerance]
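To make “eventual consistency” concrete, here is a minimal sketch using the pymongo driver against a hypothetical MongoDB replica set (host, database and collection names are all assumptions); the consistency trade-off is an explicit, tunable choice:

```python
# Tunable consistency in MongoDB (pymongo); hosts and names are hypothetical.
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://replica-host:27017/?replicaSet=rs0")
db = client.get_database("inventory")

# Stronger consistency: wait for a majority of replicas to acknowledge.
strong = db.get_collection("products",
                           write_concern=WriteConcern(w="majority"))
strong.insert_one({"sku": "A-100", "qty": 5})

# Weaker consistency: read from a secondary that may lag the primary,
# much like a DNS change that hasn't propagated everywhere yet.
relaxed = db.get_collection(
    "products", read_preference=ReadPreference.SECONDARY_PREFERRED)
print(relaxed.find_one({"sku": "A-100"}))  # may briefly be stale or None
```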
NoSQL Upside
• Distributed by default
• Open source lets you peg costs to personnel, rather than to vendor licensing
• Developer enthusiasm
Hadoop
• Open source, petabyte-scale data analysis and
processing framework
• Runs on commodity hardware
• Lots of ecosystem
• Two main components:
– Hadoop Distributed File System (HDFS)
– MapReduce engine
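As a taste of the HDFS half, here is a minimal sketch that shells out to the standard `hadoop fs` CLI from Python (cluster setup and paths are assumptions):

```python
# Copy a local file into HDFS and list it back via the `hadoop fs` CLI.
import subprocess

def hdfs(*args):
    """Run a `hadoop fs` subcommand and return its stdout."""
    result = subprocess.run(["hadoop", "fs", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

hdfs("-mkdir", "-p", "/data/logs")         # create a directory in HDFS
hdfs("-put", "weblog.txt", "/data/logs/")  # upload a local file
print(hdfs("-ls", "/data/logs"))           # list the directory contents
```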
Why MapReduce is Cool
• Extremely flexible – full power of a procedural
programming language
• Map step, essentially, allows ad hoc ETL
• With Reduce step, aggregation is a first-class
concept
• Growing ecosystem of tools that generate
MapReduce code
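To illustrate the shape of the pattern, here is a minimal, single-process word-count sketch in Python; real Hadoop distributes the same map/shuffle/reduce steps across a cluster:

```python
# Word count in the MapReduce pattern, simulated in one process.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map step: ad hoc "ETL" over each record, emitting (key, value) pairs.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce step: aggregation as a first-class concept.
    return (word, sum(counts))

lines = ["big data big ideas", "big clusters"]
pairs = sorted(kv for line in lines for kv in mapper(line))   # shuffle/sort
totals = [reducer(word, (count for _, count in group))
          for word, group in groupby(pairs, key=itemgetter(0))]
print(totals)  # [('big', 3), ('clusters', 1), ('data', 1), ('ideas', 1)]
```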
Why MapReduce Sucks
• It’s a batch mode technology
• It’s not declarative
• Most BI products don’t work with MR natively
– They connect via Hive instead (by and large)
• It’s good for a group of use cases, but it’s not a
good general framework
The Google DNA
• Hadoop and HBase came from Google’s designs
– MapReduce and GFS (Hadoop)
– BigTable (HBase)
• Hadoop was built around Google’s use cases, and Google itself no longer relies on those designs as extensively
• So why is the world going Hadoop-crazy?
Benefits of Schema-Free
• Variable schema is accommodated
– Great for product catalogs, content management
and the like
• Simple for archival storage
• For analysis:
– Avoids politics of achieving consensus on
structure
– Allows different schema for different applications
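For example, a minimal sketch of variable schema in a document store, using pymongo (database, collection and field names are hypothetical):

```python
# Two products, two shapes; no ALTER TABLE, no consensus meeting required.
from pymongo import MongoClient

catalog = MongoClient()["shop"]["products"]

catalog.insert_one({"sku": "TV-42", "type": "tv",
                    "screen_inches": 42, "hdmi_ports": 3})
catalog.insert_one({"sku": "BK-01", "type": "book",
                    "author": "J. Smith", "pages": 312})

# Each application queries only the fields it cares about.
for doc in catalog.find({"type": "tv"}, {"sku": 1, "screen_inches": 1}):
    print(doc)
```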
Cloud Effect
• Database as a service and SaaS BI/Analytics get companies excited
– Cloudant
– Amazon: DynamoDB, RDS, Redshift, Jaspersoft
• Elastic capabilities of cloud provide small customers
with access to huge clusters
– Amazon EMR, Microsoft Windows Azure HDInsight now
– Google Compute Engine, Rackspace/Hortonworks to come
• Cloud-borne reference data adds value
• But casualties emerging: e.g. Xeround
SQL Skillset and Ecosystem
• DBAs, most devs know it
– Making recruiting faster and cheaper
• ORMs expect it
• Reporting/analysis tools are premised on it
– Even if they also talk to MDX and NoSQL sources
• Companies are invested in it
• Abandoning it is naive
MPP is Big Data
(via acquisition)
• Teradata (acquired Aster Data)
• Netezza (acquired by IBM)
• Vertica (acquired by HP)
• Pivotal/Greenplum (EMC)
• ParAccel (acquired by Actian)
• SQL Server Parallel Data Warehouse (from Microsoft’s DATAllegro acquisition)
SQL – BD Convergence
• Brings the SQL language and data warehouse
products, on one side, together with Hadoop, on
the other
• Goal is to make Hadoop interactive, non-batch
• May involve Hive and its APIs
• May involve direct access to HDFS
– Bypassing MapReduce
• Think of the “database” as HDFS, and MapReduce
as merely an access method.
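A minimal sketch of that idea from the client’s side, using the PyHive package to run declarative SQL over HDFS-resident tables (host, port and table names are assumptions):

```python
# SQL over Hadoop via Hive: no hand-written MapReduce job required.
from pyhive import hive

conn = hive.connect(host="hadoop-gateway", port=10000)
cursor = conn.cursor()

cursor.execute("""
    SELECT product, SUM(quantity) AS units
    FROM sales
    GROUP BY product
    ORDER BY units DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```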
One Repository, Multiple Access
Methods
• HCatalog
SQL – BD Convergence
• Cloudera Impala (v1.0 shipped April 30)
• Hortonworks “Stinger” initiative
– Make Hive 100x faster
• EMC Pivotal
• Microsoft PolyBase, Data Explorer
• Teradata Aster SQL-H
• ParAccel (Actian) ODI
NewSQL Entrants
• NuoDB
• VoltDB
• Clustrix
• TransLattice
Dremel and Drill
• Dremel is Google’s column store analytical database
– Proprietary; available publicly as BigQuery
• Handles hierarchical/nested data, too
– Allows schema variance without anarchy
• “…scales to thousands of CPUs and petabytes of data,
and has thousands of users at Google.”
• Uses SQL, has growing BI tool support
• Petabyte scale
• Drill is to Dremel as Hadoop is to MapReduce + GFS
• And then there’s Spanner
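To show Dremel’s public face, here is a minimal sketch using the modern google-cloud-bigquery client library against one of Google’s public sample datasets (project setup and credentials are assumed):

```python
# Standard SQL against BigQuery, Dremel's public incarnation.
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are configured
query = """
    SELECT corpus, COUNT(*) AS line_count
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY line_count DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.corpus, row.line_count)
```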
In-Memory
• SAP HANA
– And Sybase IQ
• Data Warehouse Appliances
• VoltDB
• Oracle TimesTen
• IBM solidDB
– Also TM1 (in-memory OLAP)
• Coming: SQL Server’s “Hekaton” engine
The Truth About In-Memory
• Judicious use of in-memory database technology can
speed analytical queries
– Combine with columnar technology, rinse, repeat
• Can also eliminate need for deferred writes
• A RAM-only strategy like HANA’s seems impractical
• Keep in mind:
– SSD is memory too. It’s slower, but it’s memory.
– Conversely, L1, L2 and L3 caches are faster than RAM. Single Instruction, Multiple Data (SIMD) instructions make things faster still.
• Hybrid approaches are most sensible
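A minimal sketch of the hybrid approach: RAM as a cache in front of a durable store, rather than a RAM-only database, using redis-py (run_warehouse_query is a hypothetical stand-in for the disk-based warehouse):

```python
# Cache analytical query results in RAM; fall back to the warehouse on a miss.
import json
import redis

cache = redis.Redis()  # in-memory key-value store

def run_warehouse_query(sql):
    # Hypothetical stand-in for the slower, disk/SSD-based warehouse.
    return [["widget", 123], ["gadget", 98]]

def cached_query(sql, ttl_seconds=300):
    hit = cache.get(sql)
    if hit is not None:
        return json.loads(hit)          # hot path: served from RAM
    rows = run_warehouse_query(sql)     # cold path: hits the warehouse
    cache.setex(sql, ttl_seconds, json.dumps(rows))
    return rows

print(cached_query("SELECT product, units FROM sales_summary"))
```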
What’s Ahead?
• Consolidation! We can’t have this many vendors:
– Some will go out of business
– Some will get acquired
– A few will stay independent (but may merge with each
other)
• Hadoop recedes into the service layer
• NoSQL shakes out, matures, coexists
• NewSQL gets adopted or acquired
• In-memory becomes a standard option
Risks and Considerations
• Pick an esoteric database now and you may be
forced to migrate later
• SQL Server and Oracle could add features that
make the specialty products superfluous
– Or new products could emerge that do the same
• Conversely, NoSQL products may acquire
ACID-like features themselves
• More convergence
Recommendations
• NoSQL has its use cases. But it also has its
abuses.
• Look carefully at the number of customers
• Look also at how widely deployed the product
is within those customer companies
Recommendations
• If you haven’t looked seriously at Hadoop, do so.
But remember, it’s infrastructure.
• You can reach out to Big Data now, or you can
wait for it to reach out to you
– Cost/benefit of earlier adoption vs. late following
• For repeatable big problems, MapReduce works
well; for iterative query, “SQL” technologies are
much better
– akin to standard reports versus ad hoc queries
Parting Thoughts
• NoSQL and Big Data are disruptive
• You ignore them at your peril
• But if they can’t, ultimately, blend into current technology environments, then they’re destined to fail
• You can embrace the change without being
sacrificed. Just watch your back.
Thank You!
• Email
• andrew.brust@bluebadgeinsights.com
• Blog:
• http://www.zdnet.com/blog/big-data
• Twitter
• @andrewbrust
