Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Ike Ellis, MVP, Crafting Bytes
Dive into the Azure
Data Lake
Please silence
cell phones
Please silence
cell phones
2
Free online webinar
events
Free 1-day local
training events
Local user groups
around the world
Online special
interest use...
Session evaluations
Download the GuideBook App
and search: PASS Summit 2017
Follow the QR code link
displayed on session s...
Ike Ellis
Partner, Crafting Bytes
Microsoft MVP
Since 2011
Frequent PASS Summit Speaker
And speak often at SQL Saturdays +...
Thank You
Learn more from Ike Ellis
ike@craftingbytes.co
m
@ike_ellis
What is a data lake as a data architecture
term?
“If you think of a datamart as a store of bottled water
– cleansed, packa...
Data Lake Attributes
Contains unstructured,
structured, semi-structured
and multi-structured data.
Data lakes have been
tr...
Complexities with Standard MPP
Typical HDFS installation
Typical Hadoop installation
Redshift
Azure Data Lake Store
Cloud based file system that is unlimited in size
Compatible with Hadoop/Spark
• Hive
• Pig
• Sqoop
Azure Data Lake Store
• Unlimited Storage – a single
petabyte file
• Tuned for batch jobs
• Use active directory for secur...
Import data
• Use the Azure Portal
• .NET SDK
• Data Factory
• DistCp (Hadoop
Distributed Copy)
• Apache Sqoop
Azure Data Lake Store vs Azure Blob
Storage
• ADLS is optimized for analytics
• Blob Storage holds tons of data inappropri...
DEMO
Provisioning Azure Data Lake Store
Provisioning Azure Data Lake Analytics
Connecting the two
Azure Data Lake Analytics
Service that queries data for analysis
ADLA runs jobs for:
querying
cleaning
aggregating
transfo...
U-SQL Job Syntax
U-SQL combines SQL and C#
SQL is a declarative language
• Don’t need to write all the steps
• Just write what data you wan...
U-SQL – Start with a simple script
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string...
Let’s run this script?
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int...
OK, here’s the good scripts
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Game...
We can do GROUP BY@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, Ho...
JOINS, TOO! Does this script work?@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
...
Nope! Need AS for alias, ambiguious
column names, INNER for JOINS@hitters =
EXTRACT Rank int
, PlayerName string
, Age int...
You can process every file in a folder like
this
@log = EXTRACT date string,
time string,
client_ip string,
username strin...
U-SQL Tips
Case Sensitive
C# Data Type
• String instead of varchar
C# Comparison - ==, !=
COMPARE NULLS!
Typical U-SQL Job
1) Extract
2) Assign results to variable
3) Create a schema on read
4) Use schema when we process it
5) ...
Visual Studio Code
28
VS Code Cool Features
• Generate Code Behind
• Format Code
• Submit Jobs
• Automatically takes you to the portal
29
Visual Studio
• Autocomplete
• Local Data Lake
for testing
• Show the
advanced tab
U-SQL Catalogs
CREATE EXTERNAL TABLE
Let’s you create a managed table that’s permanent
Can also create other metadata obje...
Demo U-SQL Catalog
Views
Functions
Procedures
U-SQL and C#
Use a code behind file
Or create an assembly and deploy it to use it all the time.
Deployment script
DEMO U-SQL Code Behind
DEMO U-SQL Custom Assemblies
1. Create a class library
2. Name project Data Utilities
3. Implement class
4. View the Cloud...
U-SQL Job Architecture
Each job step is divided into vertices.
Each vertex represents a single piece of work
Each unit (th...
Job Monitor
Job List
Replay Job
See Vertices
Can use the portal or Visual Studio
Designing for performance
Add ADLS data for Hive in HDInsight
• Connect ADLS with HDInsight
• Provide a security principal
• Query WebADFS with the ...
ALL DONE!
Ike Ellis
@ike_ellis
Crafting Bytes
We’re hiring a Data Experts!
Microsoft MVP
Chairperson of the San Diego TIG
...
Dive Into Azure Data Lake - PASS 2017
Upcoming SlideShare
Loading in …5
×

Dive Into Azure Data Lake - PASS 2017

763 views

Published on

My presentation on Azure Data Lake Store, Azure Data Lake Analytics, U-SQL, Visual Studio Code, Visual Studio, and HDInsight + ADLS.

Published in: Software
  • Be the first to comment

Dive Into Azure Data Lake - PASS 2017

  1. 1. Ike Ellis, MVP, Crafting Bytes Dive into the Azure Data Lake
  2. 2. Please silence cell phones Please silence cell phones 2
  3. 3. Free online webinar events Free 1-day local training events Local user groups around the world Online special interest user groups Business analytics training Free Online Resources PASS Blog White Papers Session Recordings Newsletter www.pass.org Explore everything PASS has to offer PASS Connector BA Insights Get involved
  4. 4. Session evaluations Download the GuideBook App and search: PASS Summit 2017 Follow the QR code link displayed on session signage throughout the conference venue and in the program guide Your feedback is important and valuable. Go to passSummit.com Submit by 5pm Friday, November 10th to win prizes. 3 Ways to Access:
  5. 5. Ike Ellis Partner, Crafting Bytes Microsoft MVP Since 2011 Frequent PASS Summit Speaker And speak often at SQL Saturdays + maker of youtube videos – Ike’s tips! Author of Developing Azure Solutions Now in 2nd Edition /ikeellis @ike_ellis ellisike
  6. 6. Thank You Learn more from Ike Ellis ike@craftingbytes.co m @ike_ellis
  7. 7. What is a data lake as a data architecture term? “If you think of a datamart as a store of bottled water – cleansed, packaged, and structured for easy consumption – the data lake is a large body of water in a more natural state.” – James Dixon – creator of the term
  8. 8. Data Lake Attributes Contains unstructured, structured, semi-structured and multi-structured data. Data lakes have been traditionally Azure Blob Storage, Amazon S3, and Hadoop HDFS. Data is transformed in place with no movement.
  9. 9. Complexities with Standard MPP Typical HDFS installation Typical Hadoop installation Redshift
  10. 10. Azure Data Lake Store Cloud based file system that is unlimited in size Compatible with Hadoop/Spark • Hive • Pig • Sqoop
  11. 11. Azure Data Lake Store • Unlimited Storage – a single petabyte file • Tuned for batch jobs • Use active directory for security • Store all data in native format • Uses ADL:// and a URI • Exposes WebHDFS API
  12. 12. Import data • Use the Azure Portal • .NET SDK • Data Factory • DistCp (Hadoop Distributed Copy) • Apache Sqoop
  13. 13. Azure Data Lake Store vs Azure Blob Storage • ADLS is optimized for analytics • Blob Storage holds tons of data inappropriate for analytics like VHDs • ADLS has folders, Blob Storage has containers • ADLS uses WebHDFS Rest API, Azure Blob Storage uses the Azure SDK Rest APIs • ADLS has no size limits • ADLS has no geo-redundancy yet • ADLS is more expensive • Azure Blob Storage can be better for analytics if there are a lot of small files
  14. 14. DEMO Provisioning Azure Data Lake Store Provisioning Azure Data Lake Analytics Connecting the two
  15. 15. Azure Data Lake Analytics Service that queries data for analysis ADLA runs jobs for: querying cleaning aggregating transforming ADLA uses U-SQL as a language
  16. 16. U-SQL Job Syntax
  17. 17. U-SQL combines SQL and C# SQL is a declarative language • Don’t need to write all the steps • Just write what data you want and let the optimizer get the data for you • Multi-threading is done for you in the background C# is a procedural language that is often a better fit • Custom Functions • Multi-threading is very difficult U-SQL unifies both of these languages • Highly customizable and extensible way to interact with data
  18. 18. U-SQL – Start with a simple script @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); OUTPUT @hitters TO "/output/hitters.txt" USING Outputters.Text(); 19
  19. 19. Let’s run this script? @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballS tats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = select * from @hitters where HomeRuns != null; OUTPUT @bestHitters TO "/output/hitters.txt" USING Outputters.Text(); 20
  20. 20. OK, here’s the good scripts @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT * FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0; OUTPUT @bestHitters TO "/output/besthitters.txt" USING Outputters.Text(); 21
  21. 21. We can do GROUP BY@hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; OUTPUT @bestHitters TO "/output/besthitters.txt" USING Outputters.Text(); 22
  22. 22. JOINS, TOO! Does this script work?@hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; @bestHitterWithName = SELECT PlayerName, Team, HomeRuns FROM @hitters h JOIN @bestHitters bh ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal OUTPUT @bestHitterWithName TO "/output/besthitterswithnames.txt" 23
  23. 23. Nope! Need AS for alias, ambiguious column names, INNER for JOINS@hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; @bestHitterWithName = SELECT h.PlayerName, h.Team, h.HomeRuns FROM @hitters AS h INNER JOIN @bestHitters AS bh ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal; OUTPUT @bestHitterWithName TO "/output/besthitterswithnames.txt" 24
  24. 24. You can process every file in a folder like this @log = EXTRACT date string, time string, client_ip string, username string, server_ip string, port int, method string, stem string, query string, status string, server_bytes int, client_bytes int, time_taken int, user_agent string, referrer string FROM "/iislogs/{*}.txt" USING Extractors.Text(' ', silent:true); @dailysummary = SELECT date, COUNT(*) AS hits, SUM(server_bytes) AS bytes_sent, SUM(client_bytes) AS bytes_received FROM @log GROUP BY date; 25
  25. 25. U-SQL Tips Case Sensitive C# Data Type • String instead of varchar C# Comparison - ==, != COMPARE NULLS!
  26. 26. Typical U-SQL Job 1) Extract 2) Assign results to variable 3) Create a schema on read 4) Use schema when we process it 5) Filter data that we need 6) Apply the results to a variable 7) Output the variable someplace, like a CSV file
  27. 27. Visual Studio Code 28
  28. 28. VS Code Cool Features • Generate Code Behind • Format Code • Submit Jobs • Automatically takes you to the portal 29
  29. 29. Visual Studio • Autocomplete • Local Data Lake for testing • Show the advanced tab
  30. 30. U-SQL Catalogs CREATE EXTERNAL TABLE Let’s you create a managed table that’s permanent Can also create other metadata objects • Views – Saved query • Table Valued Function • Stored Procedure
  31. 31. Demo U-SQL Catalog
  32. 32. Views
  33. 33. Functions
  34. 34. Procedures
  35. 35. U-SQL and C# Use a code behind file Or create an assembly and deploy it to use it all the time. Deployment script
  36. 36. DEMO U-SQL Code Behind
  37. 37. DEMO U-SQL Custom Assemblies 1. Create a class library 2. Name project Data Utilities 3. Implement class 4. View the Cloud Explorer pane 5. In the Solution Explorer pane, right-click the DataUtilities project and click Register Assembly. 6. select the webdata database. 7. In the Azure portal, on the blade for your Azure Data Lake Analytics account, click Data Explorer; and then browse to the Assemblies folder in your webdata database to verify that the assembly has been registered. 8. Use in a query
  38. 38. U-SQL Job Architecture Each job step is divided into vertices. Each vertex represents a single piece of work Each unit (the slider bar at the top) is functionally similar to a physical node. Each available unit is going to grab a vertex and work through it until the task is completed. Then it will move on to the next task.
  39. 39. Job Monitor Job List Replay Job See Vertices Can use the portal or Visual Studio
  40. 40. Designing for performance
  41. 41. Add ADLS data for Hive in HDInsight • Connect ADLS with HDInsight • Provide a security principal • Query WebADFS with the ADL:// prefix 42
  42. 42. ALL DONE! Ike Ellis @ike_ellis Crafting Bytes We’re hiring a Data Experts! Microsoft MVP Chairperson of the San Diego TIG Book co-author – Developing Azure Solutions Upcoming course on Azure Data Lake www.craftingbytes.com www.ikeellis.com

×