Microsoft.com/Learn
Join the chat at https://aka.ms/LearnLiveTV
Read and write data in Azure Databricks
Speaker Name, Title
Learning objectives
 Use Azure Databricks to read multiple file types, both with and without a schema.
 Combine inputs from files and data stores, such as Azure SQL Database.
 Transform and store that data for advanced analytics.
Unit Prerequisites
Microsoft Azure Account: You will need a valid and active Azure
account for the Azure labs.
• If you are a Visual Studio Active Subscriber, you are entitled to monthly Azure credits. Refer to this link to find out more, including how to activate and start using your monthly Azure credit.
• If you are not a Visual Studio Subscriber, you can sign up for the FREE Visual Studio Dev Essentials program to create an Azure free account.
Create the required resources
To complete this lab, you will need to deploy an Azure Databricks
workspace in your Azure subscription.
Agenda
 Introduction
 Read data in CSV format
 Read data in JSON format
 Read data in Parquet format
 Read data stored in tables and views
Agenda (continued)
 Write data
 Exercises: Read and write data
 Knowledge check
 Summary
Introduction
Suppose you're working for a data analytics startup that's now
expanding along with its increasing customer base.
Creating your Databricks workspace
Deploy an Azure Databricks workspace
• Click the following button to open the Azure Resource Manager (ARM) template in the Azure portal: Deploy Databricks from the ARM Template.
• Provide the required values to create your Azure Databricks workspace:
• Subscription: Choose the Azure Subscription in which to deploy the workspace.
• Resource Group: Leave at Create new and provide a name for the new resource group.
• Location: Select a location near you for deployment. For the list of regions supported by Azure Databricks, see Azure
services available by region.
• Workspace Name: Provide a name for your workspace.
• Pricing Tier: Ensure Premium is selected.
• Accept the terms and conditions.
• Select Purchase.
• The workspace creation takes a few minutes. During workspace creation, the portal displays the Submitting deployment for Azure
Databricks tile on the right side. You may need to scroll right on your dashboard to see the tile. There is also a progress bar
displayed near the top of the screen. You can watch either area for progress.
Create a cluster
When your Azure Databricks workspace creation is complete, select
the link to go to the resource.
Clone the Databricks archive
If you do not currently have your Azure Databricks workspace open: in
the Azure portal, navigate to your deployed Azure Databricks
workspace and select Launch Workspace.
• Select Import.
• Select the 03-Reading-and-writing-data-in-Azure-Databricks folder that appears.
Read data in CSV format
In this unit, you need to complete the exercises within a Databricks
Notebook.
Complete the following notebook
Open the 1.Reading Data - CSV notebook.
• Start working with the API documentation
• Introduce the class SparkSession and other entry points
• Introduce the class DataFrameReader
• Read data from:
• CSV without a schema
• CSV with a schema (see the sketch below)
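To make the two read paths concrete, here is a minimal PySpark sketch; the file path, column names, and types are hypothetical placeholders rather than the module's dataset.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()  # Databricks notebooks already provide `spark`

csv_path = "/mnt/training/people.csv"  # hypothetical path

# Without a schema: inferSchema triggers an extra pass over the file to
# guess column types; otherwise every column is read as a string.
df_inferred = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(csv_path))

# With an explicit schema: no inference pass, and the types are guaranteed.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_typed = (spark.read
    .option("header", "true")
    .schema(schema)
    .csv(csv_path))

df_typed.printSchema()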
Read data in JSON format
In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder.
• Read data from:
• JSON without a schema
• JSON with a schema (see the sketch below)
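A minimal sketch, assuming a hypothetical line-delimited JSON path and fields; note that spark.read.json infers the schema by default, so the option mainly makes the intent explicit.

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# `spark` is provided by the Databricks notebook session.
json_path = "/mnt/training/events.json"  # hypothetical path to line-delimited JSON

# Without a schema: Spark samples the documents to infer field names and types.
df_inferred = spark.read.option("inferSchema", "true").json(json_path)

# With an explicit schema: skips the sampling pass and pins the expected shape.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("eventTime", TimestampType(), True),
])
df_typed = spark.read.schema(schema).json(json_path)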
Read data in Parquet format
In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder.
• Introduce the Parquet file format
• Read data from:
• Parquet files without a schema
• Parquet files with a schema (see the sketch below)
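As a sketch (hypothetical path and columns): Parquet is self-describing, so Spark reads the schema from the file footer rather than inferring it from the data.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# `spark` is provided by the Databricks notebook session.
parquet_path = "/mnt/training/events.parquet"  # hypothetical path

# Without an explicit schema: the schema comes from the Parquet footer,
# which is far cheaper than the inference pass CSV and JSON need.
df = spark.read.parquet(parquet_path)

# With an explicit schema: skips the schema-discovery step entirely.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("count", LongType(), True),
])
df_typed = spark.read.schema(schema).parquet(parquet_path)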
Read data stored in tables and views
In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder.
• Demonstrate how to pre-register data sources in Azure Databricks
• Introduce temporary views over files
• Read data from tables/views (see the sketch below)
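For instance, a DataFrame can be registered as a temporary view and then queried with SQL or read back through the reader API; the view name below is a hypothetical placeholder.

# `spark` is provided by the Databricks notebook session.
df = spark.range(10).toDF("id")              # toy DataFrame so the sketch stands alone

df.createOrReplaceTempView("events")         # session-scoped temporary view

spark.sql("SELECT COUNT(*) AS n FROM events").show()

# Pre-registered tables and views can also be read back as DataFrames.
df2 = spark.read.table("events")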
Write data
In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder.
• Write data to a Parquet file
• Read the Parquet file back and display the results (see the sketch below)
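A minimal sketch with a hypothetical output path; Snappy is Spark's default Parquet codec, spelled out here only to show where compression options go.

# `spark` is provided by the Databricks notebook session.
df = spark.range(10).toDF("id")                # toy DataFrame so the sketch stands alone
out_path = "/mnt/training/out/events.parquet"  # hypothetical path

(df.write
   .mode("overwrite")                  # replace the target directory if it exists
   .option("compression", "snappy")    # explicit, though snappy is the default
   .parquet(out_path))

# Read the Parquet file back and display the results.
spark.read.parquet(out_path).show()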
Exercises: Read and write data
In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder.
Knowledge check
Question 1
How do you list files in DBFS within a notebook?
A. ls /my-file-path
B. %fs dir /my-file-path
C. %fs ls /my-file-path
Question 1 (answer)
The correct answer is C: %fs ls /my-file-path. Adding the %fs file system magic to the cell lets the ls command run against DBFS.
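For reference, the %fs magic is shorthand for dbutils.fs, which Databricks notebooks expose by default; the path below is a placeholder.

# Equivalent to `%fs ls /my-file-path`, but callable from Python code:
for f in dbutils.fs.ls("/my-file-path"):
    print(f.path, f.size)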
Question 2
How do you infer the data types and column names when you read a
JSON file?
A. spark.read.option("inferSchema", "true").json(jsonFile)
B. spark.read.inferSchema("true").json(jsonFile)
C. spark.read.option("inferData", "true").json(jsonFile)
Question 2 (answer)
The correct answer is A: spark.read.option("inferSchema", "true").json(jsonFile). Passing the inferSchema option is the correct way to have Spark infer the data types and column names.
Summary
In this module, you learned the basics about reading and writing data
in Azure Databricks.
• Read data from CSV files into a Spark DataFrame
• Provide a schema when reading data into a Spark DataFrame
• Read data from JSON files into a Spark DataFrame
• Read data from Parquet files into a Spark DataFrame
• Create tables and views
• Write data from a Spark DataFrame
Clean up
If you plan on completing other Azure Databricks modules, don't
delete your Azure Databricks instance yet.
Delete the Azure Databricks instance
• Navigate to the Azure portal.
• Navigate to the resource group that contains your Azure Databricks instance.
• Select Delete resource group.
• Type the name of the resource group in the confirmation text box.
• Select Delete.
Next Steps
Practice your knowledge by trying these Learn modules: there is a slightly more advanced and involved learning path that covers Data Engineering with Azure Databricks.
Please tell us how you liked this workshop by filling out this survey: https://aka.ms/workshopomatic-feedback
© Copyright Microsoft Corporation. All rights reserved.
Editor's Notes
  1. Link to published module on Learn: https://docs.microsoft.com/en-us/learn/modules/learn-pr/wwl-data-ai/read-write-data-azure-databricks/
  2. Microsoft Azure Account: You will need a valid and active Azure account for the Azure labs. If you do not have one, you can sign up for a free trial. If you are a Visual Studio Active Subscriber, you are entitled to monthly Azure credits; you can refer to this link to find out more, including how to activate and start using your monthly Azure credit. If you are not a Visual Studio Subscriber, you can sign up for the FREE Visual Studio Dev Essentials program to create an Azure free account.
  3. To complete this lab, you will need to deploy an Azure Databricks workspace in your Azure subscription.
  4. Suppose you're working for a data analytics startup that's now expanding along with its increasing customer base. You receive customer data from multiple sources in different raw formats. To efficiently handle huge amounts of customer data, your company has decided to invest in Azure Databricks. Your team is responsible for analyzing how Databricks supports day-to-day data-handling functions, such as reads, writes, and queries. Your team performs these tasks to prepare the data for advanced analytics and machine learning operations.
  5. Click the following button to open the Azure Resource Manager (ARM) template in the Azure portal: Deploy Databricks from the ARM Template. Provide the required values to create your Azure Databricks workspace: Subscription (choose the Azure subscription in which to deploy the workspace); Resource Group (leave at Create new and provide a name for the new resource group); Location (select a location near you for deployment; for the list of regions supported by Azure Databricks, see Azure services available by region); Workspace Name (provide a name for your workspace); Pricing Tier (ensure Premium is selected). Accept the terms and conditions, then select Purchase. The workspace creation takes a few minutes. During workspace creation, the portal displays the Submitting deployment for Azure Databricks tile on the right side. You may need to scroll right on your dashboard to see the tile. There is also a progress bar displayed near the top of the screen. You can watch either area for progress.
  6. When your Azure Databricks workspace creation is complete, select the link to go to the resource. Select Launch Workspace to open your Databricks workspace in a new tab. In the left-hand menu of your Databricks workspace, select Clusters. Select Create Cluster to add a new cluster. Enter a name for your cluster; use your name or initials to easily differentiate your cluster from your coworkers'. Select the Cluster Mode: Single Node. Select the Databricks Runtime Version: Runtime: 7.3 LTS (Scala 2.12, Spark 3.0.1). Under Autopilot Options, leave the box checked and enter 45 in the text box. Select the Node Type: Standard_DS3_v2. Select Create Cluster.
  7. If you do not currently have your Azure Databricks workspace open: in the Azure portal, navigate to your deployed Azure Databricks workspace and select Launch Workspace. In the left pane, select Workspace > Users, and select your username (the entry with the house icon). In the pane that appears, select the arrow next to your name, and select Import. In the Import Notebooks dialog box, select the URL and paste in the following URL: https://github.com/solliancenet/microsoft-learning-paths-databricks-notebooks/blob/master/data-engineering/DBC/03-Reading-and-writing-data-in-Azure-Databricks.dbc?raw=true Select Import. Select the 03-Reading-and-writing-data-in-Azure-Databricks folder that appears.
  8. In this unit, you need to complete the exercises within a Databricks Notebook. To begin, you need to have access to an Azure Databricks workspace. If you do not have a workspace available, follow the instructions below. Otherwise, you can skip to the bottom of the page to Clone the Databricks archive.
  9. Open the 1.Reading Data - CSV notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within. Within the notebook, you will: start working with the API documentation; introduce the class SparkSession and other entry points; introduce the class DataFrameReader; and read data from CSV without a schema and from CSV with a schema. After you've completed the notebook, return to this screen, and continue to the next step.
  10. In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder. Open the 2.Reading Data - JSON notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within. Within the notebook, you will read data from JSON without a schema and from JSON with a schema. After you've completed the notebook, return to this screen, and continue to the next step.
  11. In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder. Open the 3.Reading Data - Parquet notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within. Within the notebook, you will: introduce the Parquet file format; and read data from Parquet files without a schema and from Parquet files with a schema. After you've completed the notebook, return to this screen, and continue to the next step.
  12. In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder. Open the 4.Reading Data - Tables and Views notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within. Within the notebook, you will: demonstrate how to pre-register data sources in Azure Databricks; introduce temporary views over files; and read data from tables/views. After you've completed the notebook, return to this screen, and continue to the next step.
  13. In your Azure Databricks workspace, open the 03-Reading-and-writing-data-in-Azure-Databricks folder that you imported within your user folder. Open the 5.Writing Data notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within. Within the notebook, you will: write data to a Parquet file; and read the Parquet file back and display the results. After you've completed the notebook, return to this screen, and continue to the next step.
  14. Link to published module on Learn: https://docs.microsoft.com/en-us/learn/modules/learn-pr/wwl-data-ai/read-write-data-azure-databricks/7-exercises
  15. https://docs.microsoft.com/en-us/learn/modules/learn-pr/wwl-data-ai/read-write-data-azure-databricks/7-exercises
  16. Explanation: Correct. You added the file system magic to the cell before executing the ls command.
  18. Explanation: Correct. This approach is the correct way to infer the file's schema.
  20. In this module, you learned the basics about reading and writing data in Azure Databricks. You now know how to read CSV, JSON, and Parquet file formats, and how to write Parquet files to the Databricks file system (DBFS) with compression options. Though you only wrote the files in Parquet format, you can use the same DataFrame.write method to output to other formats. Finally, you put your knowledge to the test by completing an exercise that required you to read a random file that you had not yet seen. Now that you have concluded this module, you should know how to: read data from CSV files into a Spark DataFrame; provide a schema when reading data into a Spark DataFrame; read data from JSON files into a Spark DataFrame; read data from Parquet files into a Spark DataFrame; create tables and views; and write data from a Spark DataFrame.
  21. If you plan on completing other Azure Databricks modules, don't delete your Azure Databricks instance yet. You can use the same environment for the other modules.
  22. Navigate to the Azure portal. Navigate to the resource group that contains your Azure Databricks instance. Select Delete resource group. Type the name of the resource group in the confirmation text box. Select Delete.