SQL Track: SQL Server unleashed: Meet SQL Server's extreme sides


This session is a special one, not only because of the subject matter but also because of its set-up. It is split into two mini-sessions: one about New Technologies and one introducing Parallel Data Warehouse:

New Technologies:
This part of the session is all about discovering the extremes of SQL Server.
First we will talk about SQL Server's new in-memory technologies, the updatable columnstore and Hekaton, both pushing the boundaries of SMP machines far beyond what we thought possible three years ago.

SQL Server PDW:
With its new in-memory technologies SQL Server pushes SMP machines far, but for some of us these boundaries are still too close for comfort.

So meet the scalable version of SQL Server, obliterating the limits of SMP machines. This is an introduction to SQL Server PDW, the next step in the (r)evolution of SQL Server, capable of running high-performance data warehouse queries on big data and even offering seamless integration with Hadoop using PolyBase.

Published in: Technology
  • The world of data is changing. Let's try to look into the near future, shall we: by as early as 2015, organizations will have to integrate high-value, diverse, and even completely new information types and sources, and try to turn all of this into coherent information.

    Regina Casonato et al., “Information Management in the 21st Century”

  • I am a Senior SQL Server trainer and Senior Consultant working for Kohera
    Currently working as a SQL Server architect
    I coach and train DBAs and developers.
    Rich experience in both complex development and production environments
    I’m specialised in tweaking and tuning in both virtual and physical environments
    Security is currently a hugely underestimated issue.
  • Questions on social and web analytics
    Example: What is my brand and product sentiment? How effective is my online campaign? Who am I reaching? How can I optimize or target the correct audience?

    Questions that require connecting to live data feeds
    Example: A large shipping company uses live weather feeds and traffic patterns to fine tune its ship and truck routes leading to improved delivery times and cost savings. Retailers analyze sales, pricing, economic, demographic, and live weather data to tailor product selections at particular stores and determine the timing of price markdowns.

    Questions that require advanced analytics
    Example: Financial firms use machine learning to build better fraud detection algorithms that go beyond the simple business rules involving charge frequency and location to also include an individual’s customized buying patterns, ultimately leading to a better customer experience.

    Organizations that are able to take advantage of new technologies to ask and answer these new types of questions will be able to more effectively differentiate and derive new value for the business whether it is in the form of revenue growth, cost savings, or creating entirely new business models.

  • As we all know, the recession in 2008 dramatically impacted most organizations where, in some cases, significant cost cutting measures were put into place to control spending. This impacted IT and the CIO’s budget where spending was tightly controlled and in many cases dramatically lowered.
    In 2012, Gartner did a survey with more than 2,000 CIOs and found that IT budgets will not increase dramatically from previous years. In the average case, IT showed flat budgets. However, even with this being the case, there is an expectation that technology’s role in the enterprise must provide more value than before.
    This presents a scenario with IT that they have to both meet an increasing expectation to deliver value (that is, actively contributing to the enterprise’s growth) with the expectation that IT must also help reduce or control costs. That’s why IT must address these tough challenges by amplifying their strategies and operations to do more with what they already have. CIOs need to be efficient in how they allocate their budgets so that they can amplify their value to the business.
  • Slow, inefficient, and with a steep learning curve
    Need for different systems
    Data that is difficult to correlate
    Doubtful results
  • SQL Server doesn’t scale
    SQL Server has no DWH solution
    SQL Server is like Access
    SQL Server cannot handle TB-sized DWHs, let alone PB-sized ones
  • Databases are more and more becoming a dynamic environment
    Availability: There are hardly any SLAs for databases in the cloud. To prepare and run effectively on the dynamic cloud environment, every database, regardless of its size, must run in a replicable setup, which is typically more complex and expensive.
    Scalability: While scaling an application is pretty straightforward, scaling the database tier is more difficult
    Flexibility: Allowing you to add/remove resources to match your needs, with no need for over-provisioning or over-paying to prepare for any future peaks.
    Overhead: Cloud IT operations are tedious, complex and often more cumbersome.
    Expertise: Developers flock to the cloud – and with good reason. The flip side is that once the application gains momentum, running it effectively – and the DB in particular – requires a skill set not readily available for most developers. To allow developers to focus on their code rather than on the IT, the cloud ecosystem provides a myriad of off-the-shelf development platforms and cloud services to integrate with to streamline development and time-to-production.
    Multi-tenancy: For cloud providers, PaaS, SaaS and other large customers that need to run thousands of databases simultaneously, multi-tenancy enables a cost-effective and operationally efficient framework.
  • SMP
    PDW (MPP)

    For those data warehousing clients with requirements that are multiterabyte, high-end decision support, requiring superior price/performance across significant numbers of users, massively parallel processing (MPP) is an operational necessity.

    Although symmetric multiprocessing (SMP) is raising the price/performance bar and pushing the crossover point upward, SMP and MPP will coexist in a gray transition area at the high end; and clients will benefit from the choices and competition.

    Many more data warehousing clients will be able to make use of standard SMP approaches than had previously been the case, but those with multiterabyte volumes combined with high-performance requirements (active data warehousing) will still require a special-purpose data warehouse server. Innovations have reduced contention in SMP designs and lowered coordination costs.

    However, slope-of-one linear scalability across hundreds of processors still requires an MPP database.
    The central trade-off between single image (SMP) and parallel processing (MPP) database warehousing servers is between ease of administration and scalability.

    While MPP scales linearly over its nodes, troubleshooting so many processors can be an issue for administrators trained in the SMP world.
    The performance cost of data movement through the high-speed switch (a defining characteristic of clustered hardware and MPP databases) can be significant, and data placement remains a critical success factor. This is true, but it is the required trade-off for high-performance results given complex queries against large volume points.

    MPP offers superior scalability of computing power and throughput;
    SMP offers the best price/performance, especially below the gray area

    MPP has an issue of leaving processing power unused throughout the day. (Consider a perfectly distributed MPP database. Now run a query that joins two tables on "date." Redistribution occurs, and more data get hashed to certain nodes. Real-time imbalance occurs.)

    SMP overcomes this imbalance as the software is dynamically able to allocate parallel tasks within a single node
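The redistribution scenario described above can be sketched in PDW-style T-SQL. This is an illustrative sketch only: the table and column names are hypothetical, and the DISTRIBUTION = HASH(...) clause is the PDW/APS form for hash-distributing a table across compute nodes.

```sql
-- Two hypothetical PDW tables, each hash-distributed on its own key
CREATE TABLE dbo.FactSales
(
    SaleId   bigint NOT NULL,
    SaleDate date   NOT NULL,
    Amount   money  NOT NULL
)
WITH (DISTRIBUTION = HASH(SaleId));

CREATE TABLE dbo.FactShipments
(
    ShipmentId bigint NOT NULL,
    ShipDate   date   NOT NULL,
    Cost       money  NOT NULL
)
WITH (DISTRIBUTION = HASH(ShipmentId));

-- Joining on "date" instead of the distribution keys forces a shuffle:
-- rows are re-hashed on the date columns, and because many rows share
-- the same date, some compute nodes receive far more data than others.
SELECT   s.SaleDate, SUM(s.Amount) AS Sales, SUM(sh.Cost) AS ShipCost
FROM     dbo.FactSales     AS s
JOIN     dbo.FactShipments AS sh ON s.SaleDate = sh.ShipDate
GROUP BY s.SaleDate;
```

Distributing both tables on the join key (or replicating the smaller one) avoids the shuffle, which is why data placement is called out as a critical success factor for MPP.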
  • Columnstore provides dramatic performance

    Updateable and clustered xVelocity columnstore

    Stores data in columnar format

    Memory-optimized for next-generation performance

    Updateable to support bulk and/or trickle loading
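As a minimal sketch of what this looks like in T-SQL (the table is hypothetical; the clustered columnstore syntax shown is the SQL Server 2014 / PDW V2 form):

```sql
-- Hypothetical fact table
CREATE TABLE dbo.FactOrders
(
    OrderId   bigint NOT NULL,
    OrderDate date   NOT NULL,
    Quantity  int    NOT NULL,
    Amount    money  NOT NULL
);

-- Convert the whole table to columnar storage. Unlike the read-only
-- nonclustered columnstore of SQL Server 2012, this index stays updatable.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactOrders ON dbo.FactOrders;

-- Both bulk and trickle loads still work: small inserts land in a
-- row-based delta store and are compressed into column segments later.
INSERT INTO dbo.FactOrders VALUES (1, '2014-01-01', 10, 99.50);
```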
  • The SQL Server connector for Apache Hadoop lets customers move large volumes of data between Hadoop and SQL Server while the SQL Server PDW connector for Apache Hadoop moves data between Hadoop and SQL Server Parallel Data Warehouse (PDW). These new connectors will enable customers to work effectively with both structured and unstructured data.

    External tables and full SQL query access to data stored in Hadoop Distributed File System (HDFS)
    HDFS bridge for direct and fully parallelized access to data in HDFS
    Joining “on-the-fly” PDW data with data from HDFS
    Parallel import of data from HDFS in PDW tables for persistent storage
    Parallel export of PDW data into HDFS, including “round-tripping” of data
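A hedged sketch of these capabilities in PDW T-SQL follows. The HDFS location, delimiter, and all table and column names are invented for illustration, and the exact PolyBase syntax depends on the appliance version.

```sql
-- External table over a pipe-delimited file set in HDFS (hypothetical URI)
CREATE EXTERNAL TABLE dbo.ClickStream
(
    url        varchar(200),
    event_date date,
    user_ip    varchar(50)
)
WITH (
    LOCATION       = 'hdfs://hadoop-head:8020/logs/clickstream/',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Join HDFS data "on-the-fly" with a regular PDW table
SELECT   u.UserName, COUNT(*) AS Clicks
FROM     dbo.Users       AS u
JOIN     dbo.ClickStream AS c ON u.LastIp = c.user_ip
GROUP BY u.UserName;

-- Parallel import from HDFS into a PDW table for persistent storage
CREATE TABLE dbo.ClickStream_Persisted
WITH (DISTRIBUTION = HASH(user_ip))
AS SELECT * FROM dbo.ClickStream;
```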

    More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers
    it’s parallelism taken to an extreme.

    If you need to work with big data, Hadoop is becoming the _de facto_ answer. But once your data is in Hadoop, how do you query it?
    If you need big data warehousing, look no further than Hive, a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehouse needs. It’s best to think of Hive as an enterprise data warehouse (EDW).

    Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
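For comparison, here is a tiny HiveQL sketch of that pre-processing pattern; the web_logs table and its columns are hypothetical.

```sql
-- HiveQL reads like ANSI SQL but compiles to MapReduce jobs; results
-- are materialized into a table for reporting or BI tools to consume.
CREATE TABLE daily_page_hits AS
SELECT   to_date(event_time) AS hit_date,
         url,
         COUNT(*)            AS hits
FROM     web_logs
GROUP BY to_date(event_time), url;
```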

  • Analytics Platform System (APS) isn’t simply a renaming of the Parallel Data Warehouse (PDW).  It is not a brand-new product either: the name change comes with a new feature in Appliance Update 1 (AU1) of PDW, the ability to have an HDInsight region (a Hadoop cluster) inside the appliance.
    So APS combines SQL Server and Hadoop into a single offering that Microsoft is touting as providing “big data in a box.”
    Think of APS as the “evolution” of Microsoft’s current SQL Server Parallel Data Warehouse product.  Using PolyBase, it now supports the ability to query data using SQL across the traditional data warehouse, plus data stored in a Hadoop region, whether in the appliance or a separate Hadoop Cluster.

    APS is a no-compromise modern data warehouse solution that seamlessly combines a best-in-class relational database management system, in-memory technologies, Hadoop and cloud integration in a turnkey package built for Big Data analytics.
  • General details
    All hosts run Windows Server 2012 Standard
    All virtual machines run Windows Server 2012 Standard as a guest operating system
    All fabric and workload activity happens in Hyper-V virtual machines
    Fabric virtual machines, MAD01, and CTL share one server
    Lower overhead costs especially for small topologies
    PDW Agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
    DWConfig and Admin Console continue to exist
    Minor extensions expose host-level information
    Windows Storage Spaces handles mirroring and spares and enables use of lower cost DAS (JBODs) rather than SAN

    PDW workload details
    SQL Server 2012 Enterprise Edition (PDW build) control node and compute nodes for PDW workload

    Storage details
    Similar layout to V1
    More files per filegroup
    Larger number of spindles in parallel
  • Excel is one of the primary clients to enable big data analytics on Microsoft platforms. In Excel 2013, our primary BI tools are PowerPivot, a data-modeling tool, and Power View, a data-visualization tool, and they are built right into the software, no additional downloads required. This enables users of all levels to do self-service BI using the familiar interface of Excel.

    Through a Hive Add-in for Excel, our HDInsight services easily integrate with the BI tools in Office 2013, allowing users to easily analyze massive amounts of structured or unstructured data with a very familiar tool.

    In addition to Excel, Microsoft offers other client tools for interacting with Big Data: BI Professionals can use BI Developer Studio to design OLAP cubes or scalable PowerPivot models in SQL Server Analysis Services. Developers will continue using Visual Studio to develop and test MapReduce programs written in .NET. Finally, IT operators will manage their Hadoop clusters on HDInsight with System Center that they use today.
  • Direct parallel data access between PDW Compute Nodes and Hadoop Data Nodes
    Support of all HDFS file formats
    Introducing “structure” on the “unstructured” data
  • High-level goals for V2
    Seamless Integration with Hadoop via regular T-SQL
    Enhancing PDW query engine to process data coming from the Hadoop Distributed File System (HDFS)
    Fully parallelized query processing for high-performance data import and export from HDFS
    Integration with various Hadoop implementations
    Hadoop on Windows Server, Hortonworks, and Cloudera

    Both distributed systems
    Parallel data access between PDW and Hadoop
    Different goals and internal architecture
    Combined power of Big Data integration

    1. SQL Server unleashed: Meet SQL Server's Extreme sides Karel Coenye
    2. The world of data is changing
    3. About Me
    4. New Questions, More data
    5. Do more with less
    6. Previous Limitations
    7. Urban Myths
    8. But… Do you know the challenges
    9. Data is the Key
    10. PDW vs. SMB
    11. SMB on Steroids
    12. OLTP
    13. Hekaton
    14. SMB
    15. DWH’s
    16. Next Gen DWH Performance
    17. Updateable ColumnStore
    18. Enter Big Data
    19. Hadoop
    20. Scale: Standard vs. Enterprise vs. Fasttrack vs. PDW
        | Standard | Enterprise | Fasttrack | PDW |
        | Reliable SMB | Reliable Business Critical SMB | Reference Architecture | High End MPP DWH |
        | Needs maintenance hours | Online maintenance 24/7/365 | Based upon Enterprise edition | High end data marts and EDWs |
        | Software only | Software only | Architecture (hard- and software) | Appliance |
        | Scale up | Scale up | Scale up DWH | Scale out |
        | OLTP | OLTP / small DWH | Data marts and small to midsize DWH, up to 10’s of TB | Up to PB’s |
    21. MPP - PDW
    22. MPP - APS
    23. Hardware
    24. Virtualization
    25. Yes… But does it work with Excel…
    26. Polybase PDW Appliance Hadoop Cluster
    27. Ok sounds cool, but what does it do
    28. Ok so you said it’s fast… but now show me
    29. Load performance: loading 100 million rows, in minutes (shorter is better); chart comparing PDW, DEV SQL -> SQL, and PROD TeraData -> SQL
    30. Data Pumps • Reading 132 MB/s from disk = 8 GB per minute • Reading 2 DVDs per minute
    31. Scaling • Scales linearly – Demo PDW has only 2 units • There is a PDW development edition – but it is a developer appliance! For an MSDN Ultimate subscription, there is 1 PDW developer license.
    32. Future Proof • DWLoader is the fastest load mechanism • Transformations can be done using CTAS statements • Loading from a remote server: – any remote server connected with an InfiniBand switch – multiple servers allowed • PolyBase – ready for big data • Can use existing SSIS
    33. DEMO
    34. Follow Technet Belgium @technetbelux Subscribe to the TechNet newsletter aka.ms/benews Be the first to know
    35. Belgium’s biggest IT PRO Conference