
PASS_Summit_2019_Azure_Storage_Options_for_Analytics


Storage options for Analytics are not one size fits all. To deliver the best solution, you need to understand the use case, performance requirements, and users of the system. This session will break down the options you have in Azure to build a data analytics ecosystem, and explain why everyone's talking about data lakes and where's best to build your data warehouse.


  1. Azure Storage Options for Analytics – Dustin Vannoy, Data Engineer, Cloud + Streaming
  2. Please silence cell phones
  3. Explore everything PASS has to offer: free online webinar events, free 1-day local training events, local user groups around the world, online special interest user groups, business analytics training, free online resources, newsletters. Get involved: PASS.org
  4. Dustin Vannoy – Data Engineering Consultant; co-founder, Data Engineering San Diego. /dustinvannoy, @dustinvannoy, dustin@dustinvannoy.com. Technologies: Azure & AWS, Spark, Kafka, Python. Modern data systems: data lakes, analytics in cloud, streaming
  5. PASS Summit Learning Pathway – Becoming an Azure Data Engineer: Roles and Responsibilities of the Azure Data Engineer, Jes Borland, Wednesday, November 06, 10:15 AM, Room TCC Tahoma 2; Azure Storage Options for Analytics, Dustin Vannoy, Wednesday, November 06, 3:15 PM, Room TCC Skagit 4; An Azure Data Engineer's ETL Toolkit, Simon Whiteley, Thursday, November 07, 3:15 PM, Room TCC Tahoma 4; Data Modeling Trends for 2019 and Beyond, Ike Ellis, Friday, November 08, 9:30 AM, Room 2AB
  6. Azure Storage for Analytics: 1. Data Lakes 2. Data Warehouses 3. Analytics
  7. Data Lakes in Azure
  8. Data Lake Defined – Varied data: raw, intermediate, and fully processed. Ready for analysts: a query layer and other analytic tools have access. Big data capable: store first, evaluate and model later. (Not just a file system)
  9. Why Data Lakes? Reason #1: Store everything – CSV, JSON, logs, text; no schema on write; cheaper storage
  10. Why Data Lakes? Reason #2: Massive scale (big data) – serverless Hadoop; span hot and cold storage; pay for what you use
  11. Why Data Lakes? Reason #3: Storage and compute separate – cost savings; multiple analytics tools over the same data
  12. DEMO – Example data lake querying
  13. Data Lake Best Practices – metadata portal; not just raw data; dataset certification; not too much governance
  14. Azure Blob Storage – storage for pretty much anything; choose from block blob, append blob, or page blob; low cost: $
  15. Azure Blob Storage structure: Storage Account > Containers > Blobs
  16. Azure Data Lake Storage – ADLS Gen 1: file system semantics, granular security, scale. ADLS Gen 2: benefits from Gen 1 + low cost + hierarchical namespace
  17. Data Lake Storage Gen 2 – built on Azure Blob Storage; Hadoop-compatible access; optimized for cloud analytics; low cost: $$
  18. ADLS Gen 2 structure: Storage Account > File System > Files
  19. Getting data into ADLS Gen 2 – options for import: Azure Databricks, Azure Data Factory, AzCopy, Azure Storage Explorer
  20. Accessing data from ADLS Gen 2 – options for access: Azure Databricks, HDInsight, PolyBase (SQL DW / SQL Server), Power BI
  21. DEMO – ADLS Gen 2: setup and upload
  22. Archive Storage – still part of Azure Blob Storage; seamless integration with hot/cool; keep everything; very low cost, but high read cost and early deletion charges
  23. Cost Comparison – Hot LRS: Blob Storage (Hot) $0.021/GB storage, $0.004 per 10,000 reads, $0.055 per 10,000 writes; ADLS Gen 2 (Hot) $0.021/GB storage, $0.006 per 10,000 reads, $0.072 per 10,000 writes. (For ADLS, every 4 MB is considered an operation)
  24. Cost Comparison – Cool LRS: Blob Storage (Cool) $0.015/GB storage, $0.010 per 10,000 reads, $0.100 per 10,000 writes; ADLS Gen 2 (Cool) $0.015/GB storage, $0.013 per 10,000 reads, $0.130 per 10,000 writes. (For ADLS, every 4 MB is considered an operation)
  25. Cost Comparison – Archive LRS: Blob Storage (Archive) $0.002/GB storage, $5.50 per 10,000 reads, $0.110 per 10,000 writes; ADLS Gen 2 (Archive) $0.002/GB storage, $7.15 per 10,000 reads, $0.143 per 10,000 writes. (For ADLS, every 4 MB is considered an operation)
  26. Storage Redundancy Options – review redundancy and cost implications: https://azure.microsoft.com/en-us/pricing/details/storage/
  27. Data Warehouses in Azure
  28. Data Warehouse Defined – Structured data: processed and modeled for analytics use. Interactive queries: analysts can get answers to questions quickly. BI tool support: reporting tools can query efficiently
  29. Why Data Warehouses? Reason #1: Speed of thought – fast query response; indexing or column store; SQL with analytic functions
  30. Why Data Warehouses? Reason #2: Ready-to-use data – useful column names; cleaned and standardized; focused
  31. Why Data Warehouses? Reason #3: Update/delete – support for real-time ingestion; keep the latest view or manage history
  32. Data Warehouse Best Practices – staging data off limits; star schema design; indexing strategies; read replicas
  33. Azure SQL DB – managed SQL Server: good ole relational database; less DBA work required; scalable on demand; medium cost: $$–$$$$
  34. Azure SQL DB – Elastic Pools – resources shared among DBs: DBs can auto-scale within the pool; a DB can move to a different pool; works best when DBs peak at different times; important to understand utilization of the DBs
  35. Azure SQL DB – Managed Instances – best for migrations; most on-premises features supported: SQL Agent jobs, Change Data Capture, CLR enabled, cross-database queries, DB Mail, Service Broker, transactional replication
  36. Azure SQL DB – Hyperscale – highly scalable storage and compute: storage, compute, and log scale separately; backups, restores, and scaling are not tied to the volume of data; optimized for OLTP but supports analytical workloads; one-way migration
  37. Hyperscale architecture: http://aka.ms/SQLDB_Hyperscale
  38. DEMO – Azure SQL DB: analytics querying
  39. Azure Synapse Analytics – SQL DW – high-performance analytic DB: MPP for fast reads and many users; supports PolyBase; scalable on demand; high cost: $$$$
  40. DEMO – Synapse Analytics (SQL DW): analytics querying
  41. Cosmos DB – managed NoSQL: useful for in-app analytics; best with a known search key, e.g. CustomerID; key-value, column-family, document, and graph models; SQL, Cassandra, MongoDB, Gremlin, Table, etcd, and Spark APIs; medium cost: $$–$$$
  42. Analytics in Azure
  43. Azure Analysis Services – shared semantic model: build calculations and aggregations into a model that can be used by many analytics tools. Cache data: improve query speeds by caching data
  44. Power BI – visual report tool: build interactive dashboards and reports or do exploratory data analysis. Supports most sources: connects to everything Azure and many other source types
  45. DEMO – Power BI: connect to the data lake
  46. Final Thoughts
  47. Keep learning! Databricks / ETL: 10 Cool Things You Can Do With Azure Databricks – Ike, Simon, Dustin; An Azure Data Engineer's ETL Toolkit – Simon Whiteley; Code Like a Snake Charmer – Introduction to Python! – Jamey Johnston; Code Like a Snake Charmer – Advanced Data Modeling in Python! – Jamey Johnston. Cosmos: Cosmic DBA – Cosmos DB for SQL Server Admins and Developers – Michael Donnelly; CosmosDB – Designing and Troubleshooting Lessons – Neil Hambly. Data modeling: Data Modeling Trends for 2019 and Beyond – Ike Ellis; Innovative Data Modeling for Cool Data Warehouses – Jeff Renz, Leslie Weed. Data warehouse / SQL DB: Best, Better, Hyperscale! The Last Database You Will Ever Need in the Cloud – Denzil Ribeiro; Introducing Azure Synapse Analytics: The End-to-End Analytics Platform Built for Every Data Professional – Saveen Reddy; Azure SQL Database: Maximizing Cloud Performance and Availability – Joe Sack, Denzil Ribeiro; Delivering a Data Warehouse in the Cloud – Jeff Renz; Data Warehousing: Which of the Many Cloud Products Is the Right One for You? – Ginger Grant
  48. Session Evaluations – submit by 5pm Friday, November 15th to win prizes. Three ways to access: download the GuideBook app and search "PASS Summit 2019"; follow the QR code link on session signage; go to PASSsummit.com
  49. Thank You – Dustin Vannoy, @dustinvannoy, dustin@dustinvannoy.com

Editor's Notes

  • Part of “Becoming an Azure Data Engineer” learning pathway - https://www.pass.org/summit/2019/Learn/LearningPathways.aspx#AzureDataEngineer


    Azure Storage Options for Analytics - https://www.pass.org/summit/2019/Learn/SessionDetails.aspx?sid=94120
  • I’m easy to find – just look for my full name or go to dustinvannoy.com
  • Things a data lake will have: Varied data – raw, intermediate, and fully processed data all included.
    Varied type – normally multiple file formats and includes data that isn’t fully structured/modeled
    Usable by analysts – some type of query layer or other analytic access should be available
    Large capacity – a data lake isn't a place where we question the value of every file and field; the history kept here is typically large

    Where does the analogy come from:
    James Dixon from Pentaho in 2010 – if thinking about data marts or analytic data tables, they are your bottled water – structured and refined, ready to go. The data lake is a place that data streams in and people can come to examine it, dive in, or take a sample. Reference: Stacia Varga on RunAs Radio podcast.
  • Data marts need to be cleaned. With so much data flowing in, it is impossible to clean everything, so we store it all in raw form and do some processing in the data lake layer. Instead of a backlog of data that needs to be cleaned and structured for analytics, we make the data available before much cleaning has happened.
  • 10:00
    Duration: 5 minutes
    Overview of querying a data lake in Azure without explaining the storage and tools involved.
    Quick overview of Azure Databricks as a place for data lake analytics.
    Show using Azure Databricks; use the Million Songs dataset and the NYC taxi trips dataset.
    Describe how storage is separate from querying.

    data_lake_sql_demo: show different ways of using SQL only in Databricks – discuss that data is actually stored in Azure Storage
    create_spark_tables_v2: show how by learning a little bit of PySpark code you can create tables or transform data using data frames
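
    A minimal PySpark sketch of the kind of table creation that notebook demonstrates (the paths, database, and column names here are hypothetical; `spark` is predefined in Databricks notebooks):

```python
# Hypothetical sketch: read raw JSON from the lake, reshape it with the
# DataFrame API, and register a table SQL users can query without
# knowing where the files live. Paths and names are placeholders.
from pyspark.sql import functions as F

songs_raw = spark.read.json("/mnt/datalake/raw/million_songs/")

songs = (songs_raw
         .select("artist_name", "title", "year", "duration")
         .filter(F.col("year") > 0))

# The data stays in Azure Storage but is now queryable with plain SQL:
# SELECT * FROM analytics.songs (assumes the analytics database exists)
songs.write.mode("overwrite").saveAsTable("analytics.songs")
```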



  • Metadata portal – some type of data discovery and documentation is really beneficial. The tools that enable this are never out of the box; a lot of work has to happen to capture enough metadata for users to actually find what they want.

    Some processing is done and that processed data is stored back into the lake. Not all processed data needs to go back to the lake, but just dumping data into Azure Storage is not enough to get the results you expect from building a data lake.

    Have some certified datasets – for example, one that Finance has used for their monthly reporting, so you can count on it to be maintained and to align with the top-level numbers stakeholders have seen.

    Balanced access - few users have access to ALL data, but a good amount of data is available by default for analysts and users trained in data privacy and confidentiality. If you put all the data in the lake and make it a pain to get to, you will not get the experimentation and unplanned discoveries that are possible when data is made available to smart people.
  • Blobs can be one of three types:
    Block blobs
    Append blobs
    Page blobs
  • To store data we have to create an Azure Storage Account. You may think of it as a namespace or root directory. Within each storage account we may create many containers, similar to directories that help us organize data. Within a container we store our data in blobs, which are easiest to think of as files, though a blob is a bit more complex than that.


    Reference: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction
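
    To make the Storage Account > Container > Blob structure concrete, a short upload sketch with the azure-storage-blob Python SDK (v12); the connection string and names are placeholders:

```python
# Sketch of the account > container > blob hierarchy with the
# azure-storage-blob (v12) SDK. Connection string and names are
# placeholders, not values from the session.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# Containers sit directly under the storage account...
service.create_container("analytics-demo")

# ...and blobs sit inside containers; slashes in a blob name only
# simulate folders on plain Blob Storage (no hierarchical namespace).
blob = service.get_blob_client("analytics-demo", "raw/2019/trips.csv")
with open("trips.csv", "rb") as data:
    blob.upload_blob(data)
```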
  • "The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access. A common object store naming convention uses slashes in the name to mimic a hierarchical directory structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory rather than enumerating and processing all objects that share the name prefix of the directory." - https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
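
    What that atomic rename looks like from the azure-storage-filedatalake Python SDK, as a hedged sketch (account URL, key, file system, and paths are hypothetical):

```python
# With the hierarchical namespace, renaming a directory is a single
# metadata operation rather than a copy of every blob under the prefix.
# Account URL, credential, and paths are hypothetical.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>")

fs = service.get_file_system_client("demo")
directory = fs.get_directory_client("raw/2019")

# One atomic operation; new_name is given as "<filesystem>/<new path>"
directory.rename_directory(new_name="demo/archive/2019")
```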

  • "Hadoop compatible access: Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver is available within all Apache Hadoop environments, including Azure HDInsight, Azure Databricks, and SQL Data Warehouse to access data stored in Data Lake Storage Gen2."
    https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
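
    A hedged sketch of that HDFS-style access from Spark via the ABFS driver (account name, key, file system, and path are placeholders; a service principal or managed identity would be preferred over an account key in practice):

```python
# Reading ADLS Gen2 through the ABFS driver from Spark. Account name,
# key, file system ("demo"), and path are placeholders.
spark.conf.set(
    "fs.azure.account.key.<account>.dfs.core.windows.net",
    "<account-key>")

trips = spark.read.csv(
    "abfss://demo@<account>.dfs.core.windows.net/raw/trips/",
    header=True, inferSchema=True)
trips.show(5)
```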
  • The options for analytics will be discussed more in a later section, but a quick mention of how we expect to access ADLS data:

    PolyBase – there is no pushdown computation support, so PolyBase is mostly used for loading data from ADLS Gen2
    - https://www.jamesserra.com/archive/2019/09/ways-to-access-data-in-adls-gen2/

    Power BI – directly (beta) or in dataflows (preview)
  • 25:00
    Duration 5 min

    Should show uploading data and using Databricks
  • https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
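
    A small sketch of tiering a blob down to Archive with azure-storage-blob (v12); names are placeholders:

```python
# Moving an existing blob between access tiers. Names are placeholders.
# Archive is very cheap to hold but slow and costly to read back, and
# early deletion charges apply; rehydration to Hot/Cool can take hours.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client("analytics-demo", "raw/2018/trips.csv")

blob.set_standard_blob_tier("Archive")
```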


  • See FAQ on billing scenario for example (applies to each of the three cost comparison slides): https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/
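
    A back-of-the-envelope comparison using the list prices from the three cost slides (Blob Storage, LRS; real bills also include writes, metadata operations, and redundancy choices):

```python
# Worked example with the Blob Storage LRS list prices quoted on the
# cost comparison slides. Treat as an approximation: writes, metadata,
# and retrieval charges are ignored.
tiers = {                 # ($ per GB-month, $ per 10,000 reads)
    "hot":     (0.021, 0.004),
    "cool":    (0.015, 0.010),
    "archive": (0.002, 5.500),
}

size_gb = 1024            # 1 TB of data
reads_per_month = 1_000_000

for name, (per_gb, per_10k_reads) in tiers.items():
    storage = size_gb * per_gb
    reads = reads_per_month / 10_000 * per_10k_reads
    print(f"{name:>7}: ${storage:6.2f} storage + ${reads:6.2f} reads"
          f" = ${storage + reads:7.2f}/month")

# hot ~ $21.90/month vs archive ~ $552.05/month at this read volume:
# archive is ~10x cheaper to hold but far more expensive to read often.
```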
  • Raw and intermediate data is normally in a backend staging area (separate database or schema) that only the data warehouse development team can get to
    Star schema design to balance storage, indexing, and joining
    Conformed dimensions – one version of customer dimension data, common calendar table, etc.
    Slowly changing dimensions – there are techniques for tracking the history of dimension values (see the sketch after these notes). An example: a product is re-assigned to a new product category. You can choose to just overwrite the product category, which simplifies new queries, but then reports built on the old product category can no longer be recreated.

    Indexing or partitioning carefully considered
    Read replicas to reduce locking and resource contention for those reading data and the jobs writing data
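
    The slowly changing dimension sketch referenced above, in minimal PySpark (table and column names are hypothetical; a production system would typically MERGE into a transactional table rather than rebuild a DataFrame):

```python
# Minimal Type 2 slowly changing dimension sketch. Expire the current
# row and append a new version, so historical reports can still be
# recreated. All names and dates are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dim = spark.createDataFrame(
    [(1, "Bikes", "2018-01-01", None, True)],
    "product_id INT, category STRING, valid_from STRING,"
    " valid_to STRING, is_current BOOLEAN")
change = spark.createDataFrame(
    [(1, "E-Bikes")], "product_id INT, category STRING")

today = "2019-11-06"

# Close out the old version of any changed product...
expired = (dim.join(change.select("product_id"), "product_id")
              .withColumn("valid_to", F.lit(today))
              .withColumn("is_current", F.lit(False)))

# ...and append the new version as the current row.
current = (change.withColumn("valid_from", F.lit(today))
                 .withColumn("valid_to", F.lit(None).cast("string"))
                 .withColumn("is_current", F.lit(True)))

dim_v2 = (dim.join(change.select("product_id"), "product_id", "left_anti")
             .unionByName(expired)
             .unionByName(current))
dim_v2.show()
```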
  • Elastic querying with external tables

  • When to use: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-pool#when-should-you-consider-a-sql-database-elastic-pool

  • SQL Server Agent jobs
    Change Data Capture
    CLR enabled
    Cross database queries
    DB Mail enabled
    Service Broker
    Transactional Replication

    References:
    https://docs.microsoft.com/en-us/azure/sql-database/sql-database-paas-vs-sql-server-iaas
    https://docs.microsoft.com/en-us/azure/sql-database/sql-database-features
  • Scaling transactional systems horizontally is something that the industry has struggled with forever. Hyperscale is going to keep your data consistent while at the same time scaling storage and compute.

    Hyperscale – really cool technology where they separate the storage engine used by SQL Server and scale it out: page servers instead of a single storage engine. Each page server stores up to 128 GB of data pages and has a secondary. It scales out horizontally by adding more page servers.
    Multi-tiered architecture: SSD-based caching on the compute layer, SSD-based cache on each page server.
    Scale up by adding more cores very rapidly (spin up new compute in a couple of minutes; failover to the new compute is near instantaneous).
    Scale out with read-only compute.


    Built on SQL Server engine so same experience you are used to
    100 TB storage (will expand)
    Compute scales fast and independently of storage

    References:
    Kevin Farlee - https://www.youtube.com/watch?v=Z9AFnKI7sfI

  • 40:00
    Duration: 5 min

    Show options for General Purpose, Business Critical, and Hyperscale

  • SQL DW – you trade off some SQL features but gain MPP scale. You lose some things like foreign keys, which may not be required for analytics, but consider that carefully to make sure you are comfortable without the features that don't fit this MPP service. Usually cost is the main factor: it expects you to need to query multiple terabytes of structured data and get much faster performance than a standard database solution provides. It will not be the best option for random seeks, such as looking up a single item or a small number of items in a large dataset. It expects operations that would otherwise require table scans and is built to handle those far better by parallelizing the load.


  • 50:00
    Duration: 5 min
  • Document – Microsoft's own document API (recommended) or MongoDB (for migrations). The API is set at the collection level; you have to use that API for that collection.

    SQL API – works on top of the document model

    Cassandra – an eventually consistent option, with different tradeoffs than the document option

    Graph – Gremlin, etc.
    Key-value – Azure Table Storage API – highly consistent
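
    A sketch of the known-search-key access pattern with the azure-cosmos Python SDK (v4); the account URL, key, and all names are placeholders:

```python
# Point read by known key: the access pattern Cosmos DB serves best.
# Account URL, key, database, container, and ids are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<account>.documents.azure.com:443/", credential="<key>")
container = (client.get_database_client("sales")
                   .get_container_client("orders"))

# Cheap, fast lookup when the partition key (e.g. CustomerID) is known
order = container.read_item(item="order-1001", partition_key="CUST-42")

# Queries scoped to one partition stay efficient; cross-partition
# queries work but cost far more request units.
recent = container.query_items(
    query="SELECT * FROM c WHERE c.customerId = 'CUST-42'",
    partition_key="CUST-42")
```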

  • Typically you will build out a star schema in SQL DB and then import it into Analysis Services
  • Power BI can import data into its own data model, so you may skip the Analysis Services cube and store the model only in a Power BI dataset.
    It is possible to share via Power BI shared datasets, but the development experience is different from cubes and they can only be used from Power BI (though there are additional options with Power BI Premium, which may be a good fit for larger organizations).
  • 65:00
    Duration: 5 min

    https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-power-bi

    Path and key in presentationfolder/powerbi_adls_info.txt

    https://dvtrainingadls.dfs.core.windows.net/demo/spotify/




