
Big Data Strategy for the Relational World


Published in: Technology


  1. Big Data Strategy for the Relational World
     Embracing Disruption, Avoiding Regression
     Andrew J. Brust, Founder & CEO, Blue Badge Insights; Big Data correspondent, ZDNet; Big Data analyst, GigaOM Research
  2. Bio
     • CEO and Founder, Blue Badge Insights
     • Big Data blogger for ZDNet
     • Microsoft Regional Director, MVP
     • Co-chair, Visual Studio Live!; 18 years as a speaker
     • Founder, Microsoft BI User Group of NYC
     • Co-moderator, NYC .NET Developers Group
     • "Redmond Review" columnist for Visual Studio Magazine and Redmond Developer News
     • Twitter: @andrewbrust
  3. Andrew on ZDNet
  4. Read all about it!
  5. Big Data: Why Should You Care?
     • Because analytics (i.e., BI) has always been important, but it was expensive and obscure
     • Because the economics of processing and storage now make Big Data feasible
  6. Big Data: Why Should You Be Cautious?
     • Too many vendors; too much churn
     • Designed for the lab, not for mainstream business
     • Immature technology and tooling
       – Results in serious recruiting and dev costs
     • So you can't ignore Big Data, but you can't just pursue it with abandon, either
       – That's hard!
  7. Agenda
     • Trends
     • Technologies
       – NoSQL
       – Hadoop
       – SQL convergence
       – NewSQL
       – In-memory
     • Forecasts
     • Risks
     • Recommendations
  8. Database Trends
     • NoSQL – Mongo and Cassandra, primarily
     • Late-bound schema – aka "unstructured data"
     • File-based table handling – especially HDFS
     • Columnar storage – and Massively Parallel Processing
     • Co-existence with RDBMS and OLAP databases – very few are throwing them away
     • Little change in tools/clients – still expect tables or cubes
  9. NoSQL
     • Key-value stores: Couchbase, Riak, Redis, Voldemort, DynamoDB, Azure tables
     • Document stores: MongoDB, CouchDB, Cloudant, Couchbase
     • Wide column stores: HBase, Cassandra
     • Graph databases: Neo4j
  10. Consistency
      • CAP Theorem
        – Databases may only excel at two of the following three attributes: consistency, availability, and partition tolerance
      • NoSQL does not offer "ACID" guarantees
        – Atomicity, consistency, isolation, and durability
      • Instead offers "eventual consistency"
        – Similar to DNS propagation
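The "eventual consistency" idea above can be sketched in a few lines of Python. This is an illustrative toy, not any real NoSQL product's replication protocol: two replicas accept writes independently (staying available during a partition), then a gossip-style anti-entropy pass merges them with last-write-wins timestamps.

```python
class Replica:
    """One node of a toy last-write-wins key/value store."""
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(a, b):
    """Gossip round: for each key, the write with the newest timestamp wins."""
    for key in set(a.data) | set(b.data):
        newest = max(a.data.get(key, (-1, None)), b.data.get(key, (-1, None)))
        a.data[key] = b.data[key] = newest

# Two replicas accept writes independently...
r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], ts=1)
r2.write("cart", ["book", "pen"], ts=2)
assert r1.read("cart") != r2.read("cart")   # reads briefly disagree

# ...and converge after a gossip round: "eventual" consistency.
anti_entropy(r1, r2)
assert r1.read("cart") == r2.read("cart") == ["book", "pen"]
```

The window between the two assertions is where such systems trade consistency for availability, much like DNS propagation.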
  11. CAP Theorem (diagram: consistency, availability, partition tolerance; relational vs. NoSQL)
  12. NoSQL Upside
      • Distributed by default
      • Open source lets you peg costs to personnel, more than to customers
      • Developer enthusiasm
  13. Hadoop
      • Open source, petabyte-scale data analysis and processing framework
      • Runs on commodity hardware
      • Lots of ecosystem
      • Two main components:
        – Hadoop Distributed File System (HDFS)
        – MapReduce engine
  15. Why MapReduce Is Cool
      • Extremely flexible – full power of a procedural programming language
      • The Map step, essentially, allows ad hoc ETL
      • With the Reduce step, aggregation is a first-class concept
      • Growing ecosystem of tools that generate MapReduce code
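The Map and Reduce roles described above can be sketched in plain Python. This toy word count is not Hadoop code; the real engine distributes these same three phases across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    """Map step: arbitrary code turns one record into (key, value) pairs -- ad hoc ETL."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group values by key (in real Hadoop, the framework does this between the steps)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce step: aggregation as a first-class concept."""
    return (key, sum(values))

lines = ["big data strategy", "big data for the relational world"]
pairs = chain.from_iterable(map_fn(line) for line in lines)
result = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
assert result["big"] == 2 and result["strategy"] == 1
```

Because `map_fn` and `reduce_fn` are ordinary procedures, any transformation expressible in code fits the model; that flexibility is exactly what the declarative SQL world gives up, and vice versa.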
  16. Why MapReduce Sucks
      • It's a batch-mode technology
      • It's not declarative
      • Most BI products don't work with MR natively
        – They connect via Hive instead (by and large)
      • It's good for a group of use cases, but it's not a good general framework
  17. The Google DNA
      • Hadoop and HBase derive from Google's designs
        – MapReduce and GFS (Hadoop)
        – BigTable (HBase)
      • Hadoop was built for Google's use cases, and Google doesn't use the approach as extensively now
      • So why is the world going Hadoop-crazy?
  18. Benefits of Schema-Free
      • Variable schema is accommodated
        – Great for product catalogs, content management, and the like
      • Simple for archival storage
      • For analysis:
        – Avoids the politics of achieving consensus on structure
        – Allows different schemas for different applications
  19. Cloud Effect
      • Database-as-a-service and SaaS BI/analytics get companies excited
        – Cloudant
        – Amazon: DynamoDB, RDS, Redshift, Jaspersoft
      • Elastic capabilities of the cloud give small customers access to huge clusters
        – Amazon EMR and Microsoft Windows Azure HDInsight now
        – Google Compute Engine and Rackspace/Hortonworks to come
      • Cloud-borne reference data adds value
      • But casualties are emerging: e.g., Xeround
  20. SQL Skillset and Ecosystem
      • DBAs and most devs know it – making recruiting faster and cheaper
      • ORMs expect it
      • Reporting/analysis tools are premised on it – even if they also talk to MDX and NoSQL sources
      • Companies are invested in it
      • Abandoning it is naive
  21. MPP Is Big Data (via Acquisition)
      • Teradata – acquired Aster Data
      • Netezza – acquired by IBM
      • Vertica – acquired by HP
      • Pivotal/Greenplum – EMC
      • ParAccel – Actian
      • SQL Server Parallel Data Warehouse – from Microsoft's DATAllegro acquisition
  22. SQL – BD Convergence
      • Brings the SQL language and data warehouse products, on one side, together with Hadoop, on the other
      • Goal is to make Hadoop interactive, non-batch
      • May involve Hive and its APIs
      • May involve direct access to HDFS
        – Bypassing MapReduce
      • Think of the "database" as HDFS, and MapReduce as merely an access method
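The "HDFS is the database, MapReduce is just one access method" idea can be sketched with stand-ins: here a CSV string plays the role of a file sitting in HDFS, and Python's built-in sqlite3 plays the role of an interactive SQL engine such as Impala or Hive (neither of which is actually involved), reading the file-resident data directly instead of compiling a batch job.

```python
import csv
import io
import sqlite3

# A CSV payload standing in for a file in the warehouse directory.
clicks_csv = "user,page\nann,/home\nann,/pricing\nbob,/home\n"

# The "engine" projects a schema onto the file's rows at query time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
rows = list(csv.DictReader(io.StringIO(clicks_csv)))
con.executemany("INSERT INTO clicks VALUES (:user, :page)", rows)

# Interactive, declarative query -- no batch MapReduce job behind the scenes.
result = con.execute(
    "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
assert result == [("/home", 2), ("/pricing", 1)]
```

The same files remain open to other access methods (a MapReduce job, an archival copy), which is the convergence story in miniature.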
  23. One Repository, Multiple Access Methods (diagram: HCatalog)
  24. SQL – BD Convergence
      • Cloudera Impala (v1.0 shipped April 30)
      • Hortonworks "Stinger" initiative
        – Make Hive 100x faster
      • EMC Pivotal
      • Microsoft PolyBase, Data Explorer
      • Teradata Aster SQL-H
      • ParAccel (Actian) ODI
  25. NewSQL Entrants
      • NuoDB
      • VoltDB
      • Clustrix
      • TransLattice
  26. Dremel and Drill
      • Dremel is Google's column-store analytical database
        – Proprietary; available publicly as BigQuery
      • Hierarchical/nested data too
        – Allows schema variance without anarchy
      • "…scales to thousands of CPUs and petabytes of data, and has thousands of users at Google."
      • Uses SQL; has growing BI tool support
      • Petabyte scale
      • Drill is to Dremel as Hadoop is to MapReduce + GFS
      • And then there's Spanner
  27. In-Memory
      • SAP HANA
        – And Sybase IQ
      • Data warehouse appliances
      • VoltDB
      • Oracle TimesTen
      • IBM solidDB
        – Also TM1 (in-memory OLAP)
      • Coming: SQL Server's "Hekaton" engine
  28. The Truth About In-Memory
      • Judicious use of in-memory database technology can speed analytical queries
        – Combine with columnar technology, rinse, repeat
      • Can also eliminate the need for deferred writes
      • A RAM-only strategy like HANA's seems impractical
      • Keep in mind:
        – SSD is memory too. It's slower, but it's memory.
        – Conversely, L1, L2, and L3 cache is faster than RAM. Single Instruction, Multiple Data (SIMD) makes things faster still.
      • Hybrid approaches are most sensible
  29. What's Ahead?
      • Consolidation! We can't have this many vendors:
        – Some will go out of business
        – Some will get acquired
        – A few will stay independent (but may merge with each other)
      • Hadoop recedes into the service layer
      • NoSQL shakes out, matures, coexists
      • NewSQL gets adopted or acquired
      • In-memory becomes a standard option
  30. Risks and Considerations
      • Pick an esoteric database now and you may be forced to migrate later
      • SQL Server and Oracle could add features that make the specialty products superfluous
        – Or new products could
      • Conversely, NoSQL products may acquire ACID-like features themselves
      • More convergence
  31. Recommendations
      • NoSQL has its use cases. But it also has its abuses.
      • Look carefully at the number of customers
      • Look also at how widely deployed the product is within those customer companies
  32. Recommendations
      • If you haven't looked seriously at Hadoop, do so. But remember, it's infrastructure.
      • You can reach out to Big Data now, or you can wait for it to reach out to you
        – Weigh the cost/benefit of early adoption vs. late following
      • For repeatable big problems, MapReduce works well; for iterative query, "SQL" technologies are much better
        – Akin to standard reports versus ad hoc queries
  33. Parting Thoughts
      • NoSQL and Big Data are disruptive
      • You ignore them at your peril
      • But if they can't, ultimately, blend into current technology environments, then they're destined to fail
      • You can embrace the change without being sacrificed. Just watch your back.
  34. Thank You!
      • Email:
      • Blog:
      • Twitter: @andrewbrust