Microsoft Big Data @ SQLUG 2013

Published in: Technology

  1. 1. BIG DATA Wesley Backelant, Technology Advisor, Microsoft, @WesleyBackelant; Nathan Bijnens, Big Data Consultant, DataCrunchers, @nathan_gs
  2. 2. AGENDA • Big Data • Hadoop (& Ecosystem) • How does it fit in the Microsoft world? • Demo • Resources • Q&A
  3. 3. THE WORLD OF DATA IS CHANGING
  4. 4. TODAY A NEW SET OF QUESTIONS IS BEING ASKED OF THE BUSINESS: What’s the social sentiment for my brand or products? How do I better predict future outcomes? How do I optimize my fleet based on weather and traffic patterns?
  5. 5. TRANSFORMATION OF ONLINE MARKETING BLOGS.FORBES.COM/DAVEFEINLEIB
  6. 6. TRANSFORMATION OF OPERATIONS BLOGS.FORBES.COM/DAVEFEINLEIB
  7. 7. TRANSFORMATION OF CUSTOMER SERVICE BLOGS.FORBES.COM/DAVEFEINLEIB
  8. 8. TRANSFORMATION OF ENERGY
  9. 9. TRANSFORMATION OF FRAUD DETECTION Then… Now…
  10. 10. NEW HARDWARE APPROACH
        Traditional                Big Data
        Exotic HW                  Commodity HW
        • Big central servers      • Racks of pizza boxes
        • SAN                      • Ethernet
        • RAID                     • JBOD
        Hardware reliability       Unreliable HW
        Limited scalability        Scales further
                                   Cost effective
  11. 11. NEW SOFTWARE APPROACH
        Traditional                Big Data
        Monolithic                 Distributed
        • Centralized              • Storage & compute nodes
        • RDBMS
        Schema first               Raw data
        Proprietary
  12. 12. HADOOP & BIG DATA ECOSYSTEM MapReduce HDFS
  13. 13. HDFS
  14. 14. HDFS
  15. 15. MAPREDUCE
  16. 16. MAPREDUCE
  17. 17. MAPREDUCE
  18. 18. HIVE
  19. 19. HIVE A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. – Ideal for ad hoc querying – Query execution via MapReduce. Key Building Principles: – SQL – Extensibility – Types – Functions – Scripts
  20. 20. HIVE It supports many SQL features like: – Data partitioning – Aggregations – Grouping – Joins
  21. 21. HIVE And it’s extendable using UDFs.

        package com.example.hive.udf;

        import org.apache.hadoop.hive.ql.exec.UDF;
        import org.apache.hadoop.io.Text;

        public final class Lower extends UDF {
          public Text evaluate(final Text s) {
            if (s == null) { return null; }
            return new Text(s.toString().toLowerCase());
          }
        }

      There are many UDFs published by external parties, for: - Loading / Saving (SerDe) - Field Transformations
  22. 22. HADOOP PIG: INTRO Pig is a high level data flow language.
  23. 23. HADOOP PIG: 3 COMPONENTS • Pig Latin • Grunt • PigServer
  24. 24. HADOOP PIG
        data = LOAD 'employee.csv' USING PigStorage(',') AS (
            first_name:chararray,
            last_name:chararray,
            age:int,
            wage:float,
            department:chararray
        );
  25. 25. HADOOP PIG
        grouped_by_department = GROUP data BY department;
        total_wage_by_department = FOREACH grouped_by_department GENERATE
            group AS department,
            COUNT(data) AS employee_count,
            SUM(data.wage) AS total_wage;
        total_ordered = ORDER total_wage_by_department BY total_wage;
        total_limited = LIMIT total_ordered 10;
  26. 26. HADOOP PIG DUMP total_limited; STORE total_limited INTO '/test/';
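For readers who do not know Pig Latin, the group, aggregate, order and limit steps on the Pig slides can be sketched in plain Python. This is only an illustration of what the pipeline computes, not how Pig executes it; the sample rows and the function name `total_wage_by_department` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows: (first_name, last_name, age, wage, department)
employees = [
    ("Ann", "Peeters", 34, 3000.0, "sales"),
    ("Bob", "Janssens", 45, 4000.0, "sales"),
    ("Cis", "Maes", 29, 3500.0, "it"),
]


def total_wage_by_department(rows, limit=10):
    """Group rows by department, count employees, sum wages, order, limit."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for _first, _last, _age, wage, department in rows:
        counts[department] += 1
        totals[department] += wage
    result = [(dept, counts[dept], totals[dept]) for dept in counts]
    # Ascending order on total wage, which is also Pig's default for ORDER.
    result.sort(key=lambda row: row[2])
    return result[:limit]


print(total_wage_by_department(employees))
```

Because the ordering is ascending, the limit keeps the departments with the lowest total wage.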
  27. 27. UDF ● Custom Load and Store classes. ● HBase ● ProtocolBuffers ● CombinedLog ● Custom extraction, e.g. date, ... ● Take a look at the PiggyBank.
  28. 28. HBASE A distributed, versioned, column-oriented database. • Main features: • Horizontal scalability • Machine failure tolerance • Row-level atomic operations, including compare-and-swap ops like incrementing counters • Augmented key-value schemas: the user can group columns into families, which are configured independently • Multiple clients, like its native Java library, Thrift, and REST • Upcoming: security
  29. 29. STORM
  30. 30. STORM
  31. 31. STORM • Message passing. • Distributed processing. • Horizontally scalable. • Incremental algorithms. • Fast. • Data in motion.
  32. 32. STORM [Diagram: Nimbus, Zookeeper, and three Worker Nodes, each running one Supervisor and three Workers]
  33. 33. STORM • Tuple • Stream
  34. 34. STORM • Spout • Bolt
  35. 35. STORM • Grouping
  36. 36. A DATA SYSTEM
  37. 37. DATA IS MORE THAN INFORMATION Not all information is equal. Some information is derived from other pieces of information.
  38. 38. DATA IS MORE THAN INFORMATION Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists. Let’s call this ‘data’, very similar to ‘event’.
  39. 39. EVENTS Everything we do generates events: • Pay with Credit Card • Commit to Git • Click on a webpage • Tweet
  40. 40. EVENTS - BEFORE Events used to manipulate the master data.
  41. 41. EVENTS - AFTER Today, events are the master data.
  42. 42. DATA SYSTEM Let’s store everything.
  43. 43. EVENTS Data is Immutable
  44. 44. EVENTS Data is Time Based
  45. 45. CAPTURING CHANGE TRADITIONALLY
        Before                       After
        Person   Location            Person   Location
        Nathan   Antwerp             Nathan   Ghent
        Geert    Dendermonde         Geert    Dendermonde
        John     Ghent               John     Ghent
  46. 46. CAPTURING CHANGE
        Before
        Person   Location      Timestamp
        Nathan   Antwerp       2005-01-01
        Geert    Dendermonde   2011-10-08
        John     Ghent         2010-05-02

        After
        Person   Location      Time
        Nathan   Antwerp       2005-01-01
        Geert    Dendermonde   2011-10-08
        John     Ghent         2010-05-02
        Nathan   Ghent         2013-02-03
  47. 47. QUERY The data you query is often transformed, aggregated, ... Rarely used in its original form.
  48. 48. QUERY Query = function ( data )
  49. 49. NUMBER OF PEOPLE LIVING IN EACH CITY.
        Person   Location      Time
        Nathan   Antwerp       2005-01-01
        Geert    Dendermonde   2011-10-08
        John     Ghent         2010-05-02
        Nathan   Ghent         2013-02-03

        Location      Count
        Ghent         2
        Dendermonde   1
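The idea that a query is a function over all the data can be sketched with the events from this slide. This is a minimal in-memory illustration, assuming each person's most recent event gives their current location; the function name `people_per_city` is made up for the example.

```python
from collections import Counter

# The append-only events from the slide: (person, location, time)
events = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]


def people_per_city(all_events):
    """Query = function(data): derive the current head count per city."""
    latest = {}
    # ISO dates sort correctly as strings, so later events overwrite earlier ones.
    for person, location, _time in sorted(all_events, key=lambda e: e[2]):
        latest[person] = location
    return dict(Counter(latest.values()))


print(people_per_city(events))
```

Nothing is ever updated in place: Nathan's move to Ghent is a new event, and the earlier Antwerp event stays in the data set.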
  50. 50. QUERY All Data → Query
  51. 51. QUERY: PRECOMPUTE All Data → Precomputed View → Query
  52. 52. LAYERED ARCHITECTURE Batch Layer Speed Layer Serving Layer
  53. 53. LAYERED ARCHITECTURE [Diagram labels: Incoming Data, HDInsight, Column Store, SQL Query]
  54. 54. BATCH LAYER
  55. 55. BATCH LAYER [Diagram labels: Incoming Data, HDInsight, Column Store]
  56. 56. BATCH LAYER Unrestrained computation.
  57. 57. BATCH LAYER Horizontally scalable.
  58. 58. BATCH LAYER High Latency. Let’s pretend temporarily that update latency doesn’t matter.
  59. 59. BATCH LAYER Stores master copy of data set... append only.
  60. 60. BATCH LAYER
  61. 61. BATCH: VIEW GENERATION Master Dataset → MapReduce → View #1, View #2, View #3
  62. 62. MAPREDUCE 1. Take a large problem and divide it into sub-problems. 2. MAP: perform the same function on all sub-problems (DoWork(), DoWork(), DoWork(), …). 3. REDUCE: combine the output from all sub-problems into the Output.
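The three steps on this slide can be sketched as a single-process Python example. It only illustrates the divide, map and reduce pattern and is not Hadoop's API; the names `split`, `do_work` and `combine` and the sample city list are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def split(records, n_chunks):
    """Step 1: divide the large problem into sub-problems."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def do_work(chunk):
    """Step 2 (MAP): the same function runs on every sub-problem."""
    return Counter(chunk)


def combine(left, right):
    """Step 3 (REDUCE): merge the partial outputs."""
    return left + right


cities = ["Ghent", "Antwerp", "Ghent", "Dendermonde", "Ghent"]
partials = map(do_work, split(cities, 3))
output = reduce(combine, partials, Counter())
print(output)
```

Because `do_work` touches only its own chunk, the map step could run on separate machines, which is what makes the pattern scale horizontally.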
  63. 63. BATCH VIEW DATABASE Read only database. No random writes required.
  64. 64. BATCH LAYER We are not done yet… [Timeline up to Now: data absorbed into Batch Views, followed by just a few hours of data not yet absorbed]
  65. 65. SPEED LAYER
  66. 66. OVERVIEW [Diagram labels: Incoming Data, HDInsight, Column Store, SQL]
  67. 67. SPEED LAYER Stream processing.
  68. 68. SPEED LAYER Continuous computation.
  69. 69. SPEED LAYER Transactional.
  70. 70. SPEED LAYER Storing a limited window of data. Compensating for the last few hours of data.
  71. 71. SPEED LAYER All the complexity is isolated in the Speed layer. If anything goes wrong, it’s auto-corrected.
  72. 72. CAP You have a choice between: • Availability • Queries are eventually consistent. • Consistency • Queries are consistent.
  73. 73. EVENTUAL ACCURACY Some algorithms are hard to implement in real time. For those cases we could estimate the results.
  74. 74. SPEED LAYER Incoming Data → Real Time View 1, Real Time View 2
  75. 75. SPEED LAYER VIEWS • The views are stored in a read & write database. • MS SQL Server • Column Store • Cassandra • … • Much more complex than a read only view.
  76. 76. SERVING LAYER
  77. 77. OVERVIEW [Diagram labels: Incoming Data, HDInsight, Column Store, SQL Query]
  78. 78. SERVING LAYER This layer queries the Batch & Real Time views and merges them.
  79. 79. SERVING LAYER Batch Views + Real Time Views → Merge
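A minimal sketch of the merge step, assuming both views hold additive counts per key and that the real time view covers only the data not yet absorbed into the batch view. The function name `merge_views` and the sample numbers are hypothetical.

```python
def merge_views(batch_view, realtime_view):
    """Answer a query by adding the recent increments to the batch result."""
    merged = dict(batch_view)  # copy, so the batch view itself stays read only
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


batch_view = {"Ghent": 2, "Dendermonde": 1}  # computed from the master dataset
realtime_view = {"Ghent": 1, "Antwerp": 1}   # the last few hours only

print(merge_views(batch_view, realtime_view))
```

Views that are not simple sums, such as distinct counts, need a merge function of their own; plain addition works here only because counts are additive.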
  80. 80. SERVING LAYER Polybase is a great fit.
  81. 81. OVERVIEW
  82. 82. OVERVIEW [Diagram labels: Incoming Data, HDInsight, Column Store, SQL Query]
  83. 83. LAMBDA ARCHITECTURE • Can discard any view, batch and real time, and just recreate everything from the master data. • Mistakes are corrected via recomputation. • Write bad data? Remove the data & recompute. • Bug in view generation? Just recompute the view. • Data storage is highly optimized.
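Correction by recomputation can be sketched as follows, assuming a view is a pure function of the master data. The sample records, the misspelled "Gent" record and the function name `build_view` are invented for this illustration.

```python
def build_view(master_data):
    """A view is derived only from the master data, so it can always be rebuilt."""
    view = {}
    for city in master_data:
        view[city] = view.get(city, 0) + 1
    return view


# Hypothetical master dataset; "Gent" stands in for a bad record.
master_data = ["Ghent", "Antwerp", "Gent", "Ghent"]
view = build_view(master_data)

# Write bad data? Remove the data and recompute the view from scratch.
cleaned_master_data = [record for record in master_data if record != "Gent"]
view = build_view(cleaned_master_data)
print(view)
```

The old view is simply thrown away; no in-place repair of the view is needed.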
  84. 84. MICROSOFT BIG DATA
  85. 85. WHAT IS MICROSOFT DOING ON THE BI & DEVELOPMENT SIDE?
  86. 86. INSIGHTS FROM ANY DATA, ANY SIZE, ANYWHERE
  87. 87. WE DELIVER INSIGHTS TO EVERYONE BY ENABLING BIG DATA ANALYSIS WITH FAMILIAR END USER TOOLS Benefits: Interaction and analysis of unstructured data in Hadoop. Key Features: Hive add-in for Excel.
  88. 88. UNLOCKING IMMERSIVE INSIGHTS FROM ALL DATA WITH MICROSOFT BI TOOLS Benefits: Familiar self service BI tools. Key Features: Hive ODBC Driver integrates Hadoop to SQL Server Analysis Services, PowerPivot, and Power View.
  89. 89. WHILE DRAMATICALLY SIMPLIFYING PROGRAMMING ON HADOOP Benefits: MapReduce programs in JavaScript; Simplified Programming; Simplified Deployment of MapReduce jobs. Key Features: Integration with .NET and new JavaScript libraries for Hadoop; Deploy JavaScript Hadoop jobs from a simple web browser on any supported device.
  90. 90. WE MANAGE STREAMING DATA WITH STREAMINSIGHT Benefits Key Features StreamInsight SQL StreamInsight
  91. 91. WHAT IS MICROSOFT DOING ON THE HADOOP & INTEGRATION SIDE?
  92. 92. WE MANAGE RELATIONAL DATA WITH MICROSOFT ENTERPRISE DATA WAREHOUSE SOLUTIONS Reference Architectures: Fast Track for Appliances: Dell Parallel Data Warehouse, HP Enterprise Data Warehouse, Dell Quickstart Data Warehouse, HP Business Data Warehouse
  93. 93. INTRODUCING POLYBASE Fundamental Breakthrough in Data Processing. Single Query; Structured and Unstructured • Query and join Hadoop tables with Relational Tables • Use Standard SQL language • Select, From, Where. SQL Server 2012 PDW Powered by PolyBase. Existing SQL Skillset; No IT Intervention; Save Time and Costs; Analyze All Data Types
  94. 94. AND SUPPORT UNSTRUCTURED DATA WITH ENTERPRISE CLASS HADOOP ON PREMISE AND IN THE CLOUD Benefits Key Features
  95. 95. MICROSOFT BRINGS THE SIMPLICITY AND MANAGEABILITY OF WINDOWS AND SQL SERVER TO HADOOP Benefits Key Features
  96. 96. MICROSOFT DELIVERS BIG DATA THROUGH OPEN PLATFORM AND A RICH PARTNER ECOSYSTEM Benefits Key Features
  97. 97. BIG DATA DEMO: FROM DATA TO INSIGHTS! Simplicity; Analysis with familiar tools; Collaboration on insights
  98. 98. THANK YOU!!!
  99. 99. RESOURCES
        • Microsoft Big Data Solution: www.microsoft.com/bigdata
        • Windows Azure: www.windowsazure.com/en-us/home/scenarios/big-data
        • Try Now: https://www.hadooponazure.com
        • HDInsight For Windows Beta Download: http://hortonworks.com/download/
        • HDInsight Services For Windows: http://social.technet.microsoft.com/wiki/contents/articles/6204.hdinsight-services-for-windows.aspx#videos
        • Hadoop in PowerPivot: http://social.technet.microsoft.com/wiki/contents/articles/6294.how-to-connect-excel-powerpivot-to-hive-on-azure-via-hiveodbc.aspx
        • Hadoop in SSIS: http://msdn.microsoft.com/en-us/library/jj720569.aspx
        • Hurricane Sandy: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx
        • Hadoop PowerShell: http://blogs.msdn.com/b/cindygross/archive/2012/08/23/how-to-install-the-powershell-cmdlets-for-apache-hadoop-based-services-for-windows.aspx
        • SQL Server BCP to Hive: http://blogs.msdn.com/b/cindygross/archive/2012/09/28/load-sql-server-bcp-data-to-hive.aspx
        • Internal vs External Table Hive: http://blogs.msdn.com/b/cindygross/archive/2013/02/06/hdinsight-hive-internal-and-external-tables-intro.aspx
        • Microsoft.NET SDK for Hadoop: http://hadoopsdk.codeplex.com/
        • Twitter Analytics Example: http://twitterbigdata.codeplex.com/
  100. 100. DATACRUNCHERS We enable companies in envisioning, defining and implementing a data strategy. A one-stop-shop for all your Big Data needs. The first Big Data Consultancy agency in Belgium.