SlideShare a Scribd company logo
Any use of this material without the permission of Talavant is prohibited
PROCESSING BIG DATA
with Azure Data Lake Analytics
Sean Forgatch
Business Intelligence Consultant
Sean.Forgatch@talavant.com
About Me
Sean Forgatch
• Milwaukee, WI
• Business Intelligence Consultant
• Healthcare, Insurance, SaaS
• Integration and Analytics
• Microsoft Big Data Certified
• PASS
• Industry Speaker
• FoxPASS President
• Running, Craft Beers, Reading
About Talavant
There is a better way to make data work for companies. Better resources, strategy, sustainability, inclusion of the
organization as a whole, understanding of client needs, tools, outcomes, better ROI.
• Accelerated planning, implementation and results
• Sustainable solutions
• Increased company-wide buy-in & usage
VALUE WE PROVIDE
By providing a holistic approach inclusive of a
client’s people, processes and technologies -
built on investment in our own employees and
company growth.
HOW WE DO IT
STRATEGY
ARCHITECTURE
IMPLEMENTATION
1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
GDELT Analysis in PowerBI
Big Data Primer
THREE V s of Big
Data
VOLUME
VELOCITY
VARIETY
Big Data Primer: Azure Tools
THREE V s of Big
Data
VOLUME
VELOCITY Veracity
VARIETY
Azure Data
Warehouse
Azure Data
Lake
• Increasing Amount of Data and
Sources
• Increase of Data Asset
Acquisitions
• Streaming Data
• Real Time Analytics
• Increase in Demand
• Reliability of Trustworthy Data
• Timeliness of Data Operations
• New Data Formats Being Used
• Images, JSON, Free Text
• Avro, Parquet, ORC
Azure Data
Lake
Analytics
Event Hubs
IoT Hubs
Stream
Analytics
HDInsight
A Big Data Trends
“Variety, not volume or velocity, drives big-data investments”
A Big Data Trends
“Variety, not volume or velocity, drives big-data investments”
“Big data grows up: Hadoop adds to enterprise standards”
A Big Data Trends
“Variety, not volume or velocity, drives big-data investments”
“Big data grows up: Hadoop adds to enterprise standards”
“Rise of metadata catalogs helps people find analysis-worthy
big data.”
-TDWI: Top Ten Big Data Trends for 2017
1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
Data Lake Concepts
Data Lake Operations
RAW STAGE CURATED
Azure Data Lake
Store
RAW STAGE CURATED
EXPLORATORY
• Operational
• Value has been
identified
• Explorational
• Value is being
discovered
Azure Data Lake
Store
2
1
Data Lake Operations
RAW STAGE CURATED
EXPLORATORY
ERP
IoT
Devices
RDBS
• Operational
• Value has been
identified
• Explorational
• Value is being
discovered
Azure Data Lake
Store
2
1
Data Lake Operations
Data Lake Operations
RAW STAGE CURATED
EXPLORATORY
ERP
IoT
Devices
RDBS
• Operational
• Value has been
identified
• Explorational
• Value is being
discovered
Azure Data Lake
Store
2
1
RAW STAGE CURATED
EXPLORATORY
ERP
IoT
Devices
RDBS
• Operational
• Value has been
identified
• Explorational
• Value is being
discovered
Azure Data Lake
Store
2
1
• Tag /
Explore
• Discover
Data Lake Operations
Data Lake Organization
SUBJECT
DATA SOURCE
DATA SET
DATE
FILE
DATASET
DATE
FILE
?
DESTINATION
DATA SET
FILE
Data Lake Security
RAW (1) STAGE (2) CURATED (3) EXPLORATION (0)
Data Experts/Engineers Data
Experts/Engineers
ETL and BI Engineers /
SME’s / Analysts
Data Scientist / Analysts
TOOLS
Data Lake Tagging
RAW (1) STAGE (2) CURATED (3) EXPLORATION (0)
Data Experts/Engineers Data
Experts/Engineers
ETL and BI Engineers /
SME’s / Analysts
Data Scientist / Analysts
AUTOMATED SME N/A
TOOLS
Data Lake Processing
RAW (1) STAGE (2) CURATED (3) EXPLORATION (0)
Data Experts/Engineers Data
Experts/Engineers
ETL and BI Engineers /
SME’s / Analysts
Data Scientist / Analysts
AUTOMATED SME N/A
INGESTION CLEANSING DISTRIBUTION
TOOLS
Conceptual Architecture
I
ADF and
PolyBase
Files Tables
• Reference Data
• Data Warehouse
Data
• Archive Data
• Land quickly, no
transformations
• Includes Transient
Data Operations
• Original Format
• Stored by Date
Ingested
• Tag and Catalog Data
• Possibly Compressed
Data Lake Bypass - BATCH
• IoT
• Medical Devices
• Social Media
• Imaging
• Events
• Video
• Application Data
• Electronic
Medical Records
Systems
• Customer
Relationship
Management
• Financial
Systems
• Data
Marketplace
• Data Lake
Standard
• U-SQL/Hive
Tables
• General Staging
• Can bypass
Curated
• Decompress
• Conform
• Delta
Preparation
Data Lake Bypass - STREAM
R
• Exploratory
Analytics
• Sandbox Area
Data Lake Questions
?
1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
Azure Data Lake Store
• HDFS Data Store for housing data in it’s Native Raw Format
• Built on Apache YARN
• Process and store Petabyte size files
• Enterprise Security through Azure Active Directory
• No storage limit
Data Lake Store
VERTEXES
EXTENTS
SMALL FILE
V1
E1
Data Lake Store
VERTEXES
EXTENTS
SMALL FILE
V1
E1
V1 V2 V3 V4
E1 E2 E3 E4
BIG FILE
Data Lake Store
VERTEXES
EXTENTS
SMALL FILE
V1
E1
Data Lake Ingestion
• Visual Studio
• Azure Portal (Limit)
• SSIS Data Lake Destination and SSIS Data Lake Task
• Azure Data Factory
• Powershell
• ADLCopy
• Sqoop (HDInsight)
• Other Apache tools
1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
Azure Data Lake Analytics
• Big Data Queries as a Service
• Analytics Federation
• Develop in U-SQL, .NET, R, and Python
• Cognitive Services
• Scale Instantly
• Pay Per Job
1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
Intro to U-SQL
KEY FEATURES
• Combines SQL and C#
• Patterned File Processing – Thousands of Files
• Extensions: Python, R, Cognitive
• Query Data where it Lives (Federated Querying)
• Partition and Distribution of Data for Massive Parallelism
• Manage Structure and Shared Programming through U-SQL Catalog
• U-SQL Procedures
U-SQL : Extract Query
@MyExtract =
EXTRACT
Field1 string,
Field2 int,
Field 3 int?
FROM “/datalake/01_RAW/{*}.csv”
USING Extractors.Csv();
U-SQL T-SQL
INSERT INTO myTable
( Field1, Field2, Field3)
SELECT
CAST(Field1 as varchar(100) as Field1,
CAST(Field2 AS INT) as Field2,
CONVERT(INT, Field3) as Field 3
FROM myTable
CREATE TABLE myTable
(
Field1 VARCHAR(100),
Field2 INT,
Field3 INT NOT NULL
);
1
U-SQL : Extract Query
@MyExtract =
EXTRACT
Field1 string,
Field2 int,
Field 3 int?
FROM “/datalake/01_RAW/{*}.csv”
USING Extractors.Csv();
@MyAgg =
SELECT
Field1,
MAX(Field2) AS Field2
FROM @MyExtract
GROUP BY Field1;
U-SQL
1
2
T-SQL
CREATE TABLE myTable
(
Field1 VARCHAR(100),
Field2 INT,
Field3 INT NOT NULL
);
INSERT INTO myTable
( Field1, Field2, Field3)
SELECT
CAST(Field1 as varchar(100) as Field1,
CAST(Field2 AS INT) as Field2,
CONVERT(INT, Field3) as Field 3
FROM myTable
U-SQL : Extract Query
@MyExtract =
EXTRACT
Field1 string,
Field2 int,
Field 3 int?
FROM “/datalake/01_RAW/{*}.csv”
USING Extractors.Csv();
@MyAgg =
SELECT
Field1,
MAX(Field2) AS Field2
FROM @MyExtract
GROUP BY Field1;
OUTPUT @MyAgg
TO “/datalake/02_STAGE/MyOutput.csv”
USING Outputters.Csv();
1
2
3
U-SQL T-SQL
CREATE TABLE myTable
(
Field1 VARCHAR(100),
Field2 INT,
Field3 INT NOT NULL
);
INSERT INTO myTable
( Field1, Field2, Field3)
SELECT
CAST(Field1 as varchar(100) as Field1,
CAST(Field2 AS INT) as Field2,
CONVERT(INT, Field3) as Field 3
FROM myTable
U-SQL : Extractors and Outputters
CURRENT EXTRACTORS
• Csv()
• Tsv()
• Txt()
CURRENT OUTPUTTERS
• Csv()
• Tsv()
• Txt()
U-SQL : Extractor and Outputter Parameters
EXTRACT
…
FROM “/datalake/01_RAW/{*}.CSV
USING Extractors.Csv(silent : true , delimiter : “,”);
CSV(); PARAMETERS
• Delimiter
• Encoding
• escapeCharecter
• nullEscape
• Quoting
• rowDelimiter
• Silent
• skipFirstNRows
• charFormat
U-SQL : Extractors and Outputters
CURRENT EXTRACTORS
• Csv()
• Tsv()
• Txt()
CURRENT OUTPUTTERS
• Csv()
• Tsv()
• Txt()
CUSTOM EXTRACTORS and OUTPUTTERS
• FlexExtractor()
• XML()
• JSON()
• Avro()
U-SQL : Virtual Columns
DECLARE @IN =“/datalake/01_stage/2017/06/{FileName}.csv”;
@MyExtract =
EXTRACT
Field1 string,
Field2 int,
Field 3 int?,
FileName string //My Virtual Column
FROM @IN
USING Extractors.Csv()
WHERE FileName == “MyRedditFile_20170602”;
File Names :
MyRedditFile_20170601.csv
MyRedditFile_20170602.csv
etc…
U-SQL: Job Execution
U-SQL: Job Execution
U-SQL : Tables
GUIDELINES
1. Must Have Clustered Index
2. Utilize When Improving Performance with
Distribution/Partitioning
3. You have Multiple Large Files
4. Don’t Use when:
• No Filtering, Joining, Grouping
U-SQL : Tables
DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName;
CREATE TABLE <adla>.<database>.<schema>.tableName
(
Field1 int,
Field2 string,
Field3 int?
INDEX idx_1 CLUSTERED(Field1)
DISTRIBUTED BY HASH(Field2)
);
U-SQL : Tables
DROP TABLE IF EXISTS
<adla>.<database>.<schema>.tableName;
CREATE TABLE <adla>.<database>.<schema>.tableName
(
Field1 int,
Field2 string,
Field3 int?
INDEX idx_1 CLUSTERED(Field1)
DISTRIBUTED BY HASH(Field2)
)
AS SELECT …
U-SQL : Tables
DROP TABLE IF EXISTS
<adla>.<database>.<schema>.tableName;
CREATE TABLE <adla>.<database>.<schema>.tableName
(
Field1 int,
Field2 string,
Field3 int?
INDEX idx_1 CLUSTERED(Field1)
DISTRIBUTED BY HASH(Field2)
)
AS EXTRACT …
U-SQL : Tables
DROP TABLE IF EXISTS
<adla>.<database>.<schema>.tableName;
CREATE TABLE <adla>.<database>.<schema>.tableName
(
Field1 int,
Field2 string,
Field3 int?
INDEX idx_1 CLUSTERED(Field1)
DISTRIBUTED BY HASH(Field2)
)
AS TVF …
U-SQL : Tables
MANAGED
• Own Their Data
• No Heaps
• INSERT only
USQL
METADATA and DATA
U-SQL : Tables
EXTERNAL
• Stored Metadata
• Use VIEW or TVF over EXTRACT
• Data Lives on Source
• Azure SQL DB
• Azure SQL DW
• Azure SQL VM
USQL
METADATA DATA
U-SQL : Data Federation
CURRENT STATE
EXTRACTORS
U-SQL : Operators
COMPARISON OPERATORS LOGICAL OPERATORS
IS NULL AND
== BETWEEN
> IN, NOT IN
>= LIKE, NOT LIKE
!= NOT
OR
U-SQL : Functions
REPORTING FUNCTIONS
• COUNT
• SUM
• MIN
• MAX
• AVG
• STDEV
• VAR
RANKING FUNCTIONS
• RANK
• DENSE_RANK
• NTILE
• ROW_NUMBER
ANALYTIC FUNCTIONS
• CUME_DIST
• PERCENT_RANK
• PERCENTILE_CONT
• PERCENTILE_DISC
U-SQL : C# Functions
STRING METHODS
• Compare
• Concat
• Contains
• Equals
• Replace
• Split
• ToUpper
• Trim
• ..plus many more!
MATH METHODS
• Abs
• BigMul
• Floor
• Max/Min
• Round
• Sqrt
• ..plus many more!
Copyright Talavant, Inc. 2017 54
Advice
• Identify Value of Data Lake
Approach
• Data Lake: Invest Time and Strategy
into Data Lake Design
• U-SQL: Utilize U-SQL Constructs
before C#
• U-SQL: Understand and Control Data
through Partitioning
*as of 2017/07/25
Learn U-SQL !
•Michael Rys – LinkedIn Slide Share’s
•GitHub – U-SQL Repository
•SQL Server Central – Stairway to U-SQL
•Azure – Built in Example
Let’s Connect!
• https://www.linkedin.com/in/seanforgatch/
• Sean.Forgatch@Talavant.com
• @4gatchSQL

More Related Content

What's hot

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
Bryan Yang
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
Julian Hyde
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Michael Rys
 
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Data Con LA
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Don Demcsak
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
Christian Tzolov
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
Max De Marzi
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
DataStax Academy
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
Michael Mior
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Databricks
 
Spark sql
Spark sqlSpark sql
Spark sql
Freeman Zhang
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
 
Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.
Julian Hyde
 
How to integrate Splunk with any data solution
How to integrate Splunk with any data solutionHow to integrate Splunk with any data solution
How to integrate Splunk with any data solution
Julian Hyde
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeLessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
DataStax Academy
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Databricks
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Michael Rys
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
Victor Coustenoble
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax
 

What's hot (20)

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.
 
How to integrate Splunk with any data solution
How to integrate Splunk with any data solutionHow to integrate Splunk with any data solution
How to integrate Splunk with any data solution
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeLessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 

Similar to Talavant Data Lake Analytics

Azure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake AnalyticsAzure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake Analytics
Waqas Idrees
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
Rick van den Bosch
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
MS Cloud Summit
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
Erwin de Kreuk
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
BizTalk360
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
Revathiparamanathan
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Nilesh Shah
 
Data saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukData saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
Ike Ellis
 
Datasaturday Pordenone Azure Purview Erwin de Kreuk
Datasaturday Pordenone Azure Purview Erwin de KreukDatasaturday Pordenone Azure Purview Erwin de Kreuk
Datasaturday Pordenone Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
CAMMS
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
Antonios Chatzipavlis
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
Rick van den Bosch
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
Jean-Georges Perrin
 
Analytics in the Cloud
Analytics in the CloudAnalytics in the Cloud
Analytics in the Cloud
Ross McNeely
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Michael Rys
 
Azure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de KreukAzure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de Kreuk
Erwin de Kreuk
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
Accelerating Business Intelligence Solutions with Microsoft Azure pass
Accelerating Business Intelligence Solutions with Microsoft Azure   passAccelerating Business Intelligence Solutions with Microsoft Azure   pass
Accelerating Business Intelligence Solutions with Microsoft Azure pass
Jason Strate
 

Similar to Talavant Data Lake Analytics (20)

Azure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake AnalyticsAzure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake Analytics
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Data saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukData saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de Kreuk
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
 
Datasaturday Pordenone Azure Purview Erwin de Kreuk
Datasaturday Pordenone Azure Purview Erwin de KreukDatasaturday Pordenone Azure Purview Erwin de Kreuk
Datasaturday Pordenone Azure Purview Erwin de Kreuk
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Analytics in the Cloud
Analytics in the CloudAnalytics in the Cloud
Analytics in the Cloud
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
Azure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de KreukAzure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de Kreuk
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Accelerating Business Intelligence Solutions with Microsoft Azure pass
Accelerating Business Intelligence Solutions with Microsoft Azure   passAccelerating Business Intelligence Solutions with Microsoft Azure   pass
Accelerating Business Intelligence Solutions with Microsoft Azure pass
 

Recently uploaded

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 

Talavant Data Lake Analytics

  • 1. Any use of this material without the permission of Talavant is prohibited PROCESSING BIG DATA with Azure Data Lake Analytics Sean Forgatch Business Intelligence Consultant Sean.Forgatch@talavant.com
  • 2. About Me Sean Forgatch • Milwaukee, WI • Business Intelligence Consultant • Healthcare, Insurance, SaaS • Integration and Analytics • Microsoft Big Data Certified • PASS • Industry Speaker • FoxPASS President • Running, Craft Beers, Reading
  • 3. About Talavant There is a better way to make data work for companies. Better resources, strategy, sustainability, inclusion of the organization as a whole, understanding of client needs, tools, outcomes, better ROI. • Accelerated planning, implementation and results • Sustainable solutions • Increased company-wide buy-in & usage VALUE WE PROVIDE By providing a holistic approach inclusive of a client’s people, processes and technologies - built on investment in our own employees and company growth. HOW WE DO IT STRATEGY ARCHITECTURE IMPLEMENTATION
  • 4. 1. Big Data Overview 2. Data Lake Concepts 3. Azure Data Lake Store 4. Azure Data Lake Analytics 5. U-SQL
  • 6. Big Data Primer THREE V s of Big Data VOLUME VELOCITY VARIETY
  • 7. Big Data Primer: Azure Tools THREE V s of Big Data VOLUME VELOCITY Veracity VARIETY Azure Data Warehouse Azure Data Lake • Increasing Amount of Data and Sources • Increase of Data Asset Acquisitions • Streaming Data • Real Time Analytics • Increase in Demand • Reliability of Trustworthy Data • Timeliness of Data Operations • New Data Formats Being Used • Images, JSON, Free Text • Avro, Parquet, ORC Azure Data Lake Analytics Event Hubs IoT Hubs Stream Analytics HDInsight
  • 8. A Big Data Trends “Variety, not volume or velocity, drives big-data investments”
  • 9. A Big Data Trends “Variety, not volume or velocity, drives big-data investments” “Big data grows up: Hadoop adds to enterprise standards”
  • 10. A Big Data Trends “Variety, not volume or velocity, drives big-data investments” “Big data grows up: Hadoop adds to enterprise standards” “Rise of metadata catalogs helps people find analysis-worthy big data.” -TDWI: Top Ten Big Data Trends for 2017
  • 11. 1. Big Data Overview 2. Data Lake Concepts 3. Azure Data Lake Store 4. Azure Data Lake Analytics 5. U-SQL
  • 13. Data Lake Operations RAW STAGE CURATED Azure Data Lake Store
  • 14. RAW STAGE CURATED EXPLORATORY • Operational • Value has been identified • Explorational • Value is being discovered Azure Data Lake Store 2 1 Data Lake Operations
  • 15. RAW STAGE CURATED EXPLORATORY ERP IoT Devices RDBS • Operational • Value has been identified • Explorational • Value is being discovered Azure Data Lake Store 2 1 Data Lake Operations
  • 16. Data Lake Operations RAW STAGE CURATED EXPLORATORY ERP IoT Devices RDBS • Operational • Value has been identified • Explorational • Value is being discovered Azure Data Lake Store 2 1
  • 17. RAW STAGE CURATED EXPLORATORY ERP IoT Devices RDBS • Operational • Value has been identified • Explorational • Value is being discovered Azure Data Lake Store 2 1 • Tag / Explore • Discover Data Lake Operations
  • 18. Data Lake Organization SUBJECT DATA SOURCE DATA SET DATE FILE DATASET DATE FILE ? DESTINATION DATA SET FILE
  • 19. Data Lake Security RAW (1) STAGE (2) CURATED (3) EXPLORATION (0) Data Experts/Engineers Data Experts/Engineers ETL and BI Engineers / SME’s / Analysts Data Scientist / Analysts TOOLS
  • 20. Data Lake Tagging RAW (1) STAGE (2) CURATED (3) EXPLORATION (0) Data Experts/Engineers Data Experts/Engineers ETL and BI Engineers / SME’s / Analysts Data Scientist / Analysts AUTOMATED SME N/A TOOLS
  • 21. Data Lake Processing RAW (1) STAGE (2) CURATED (3) EXPLORATION (0) Data Experts/Engineers Data Experts/Engineers ETL and BI Engineers / SME’s / Analysts Data Scientist / Analysts AUTOMATED SME N/A INGESTION CLEANSING DISTRIBUTION TOOLS
  • 22. Conceptual Architecture I ADF and PolyBase Files Tables • Reference Data • Data Warehouse Data • Archive Data • Land quickly, no transformations • Includes Transient Data Operations • Original Format • Stored by Date Ingested • Tag and Catalog Data • Possibly Compressed Data Lake Bypass - BATCH • IoT • Medical Devices • Social Media • Imaging • Events • Video • Application Data • Electronic Medical Records Systems • Customer Relationship Management • Financial Systems • Data Marketplace • Data Lake Standard • U-SQL/Hive Tables • General Staging • Can bypass Curated • Decompress • Conform • Delta Preparation Data Lake Bypass - STREAM R • Exploratory Analytics • Sandbox Area
  • 24. 1. Big Data Overview 2. Data Lake Concepts 3. Azure Data Lake Store 4. Azure Data Lake Analytics 5. U-SQL
  • 25. Azure Data Lake Store • HDFS Data Store for housing data in it’s Native Raw Format • Built on Apache YARN • Process and store Petabyte size files • Enterprise Security through Azure Active Directory • No storage limit
  • 27. Data Lake Store VERTEXES EXTENTS SMALL FILE V1 E1 V1 V2 V3 V4 E1 E2 E3 E4 BIG FILE
  • 29. Data Lake Ingestion • Visual Studio • Azure Portal (Limit) • SSIS Data Lake Destination and SSIS Data Lake Task • Azure Data Factory • Powershell • ADLCopy • Sqoop (HDInsight) • Other Apache tools
  • 30. 1. Big Data Overview 2. Data Lake Concepts 3. Azure Data Lake Store 4. Azure Data Lake Analytics 5. U-SQL
  • 31. Azure Data Lake Analytics • Big Data Queries as a Service • Analytics Federation • Develop in U-SQL, .NET, R, and Python • Cognitive Services • Scale Instantly • Pay Per Job
  • 32. 1. Big Data Overview 2. Data Lake Concepts 3. Azure Data Lake Store 4. Azure Data Lake Analytics 5. U-SQL
  • 33. Intro to U-SQL KEY FEATURES • Combines SQL and C# • Patterned File Processing – Thousands of Files • Extensions: Python, R, Cognitive • Query Data where it Lives (Federated Querying) • Partition and Distribution of Data for Massive Parallelism • Manage Structure and Shared Programming through U-SQL Catalog • U-SQL Procedures
  • 34. U-SQL : Extract Query @MyExtract = EXTRACT Field1 string, Field2 int, Field 3 int? FROM “/datalake/01_RAW/{*}.csv” USING Extractors.Csv(); U-SQL T-SQL INSERT INTO myTable ( Field1, Field2, Field3) SELECT CAST(Field1 as varchar(100) as Field1, CAST(Field2 AS INT) as Field2, CONVERT(INT, Field3) as Field 3 FROM myTable CREATE TABLE myTable ( Field1 VARCHAR(100), Field2 INT, Field3 INT NOT NULL ); 1
  • 35. U-SQL : Extract Query @MyExtract = EXTRACT Field1 string, Field2 int, Field 3 int? FROM “/datalake/01_RAW/{*}.csv” USING Extractors.Csv(); @MyAgg = SELECT Field1, MAX(Field2) AS Field2 FROM @MyExtract GROUP BY Field1; U-SQL 1 2 T-SQL CREATE TABLE myTable ( Field1 VARCHAR(100), Field2 INT, Field3 INT NOT NULL ); INSERT INTO myTable ( Field1, Field2, Field3) SELECT CAST(Field1 as varchar(100) as Field1, CAST(Field2 AS INT) as Field2, CONVERT(INT, Field3) as Field 3 FROM myTable
  • 36. U-SQL : Extract Query @MyExtract = EXTRACT Field1 string, Field2 int, Field 3 int? FROM “/datalake/01_RAW/{*}.csv” USING Extractors.Csv(); @MyAgg = SELECT Field1, MAX(Field2) AS Field2 FROM @MyExtract GROUP BY Field1; OUTPUT @MyAgg TO “/datalake/02_STAGE/MyOutput.csv” USING Outputters.Csv(); 1 2 3 U-SQL T-SQL CREATE TABLE myTable ( Field1 VARCHAR(100), Field2 INT, Field3 INT NOT NULL ); INSERT INTO myTable ( Field1, Field2, Field3) SELECT CAST(Field1 as varchar(100) as Field1, CAST(Field2 AS INT) as Field2, CONVERT(INT, Field3) as Field 3 FROM myTable
  • 37. U-SQL : Extractors and Outputters CURRENT EXTRACTORS • Csv() • Tsv() • Txt() CURRENT OUTPUTTERS • Csv() • Tsv() • Txt()
  • 38. U-SQL : Extractor and Outputter Parameters EXTRACT … FROM “/datalake/01_RAW/{*}.CSV USING Extractors.Csv(silent : true , delimiter : “,”); CSV(); PARAMETERS • Delimiter • Encoding • escapeCharecter • nullEscape • Quoting • rowDelimiter • Silent • skipFirstNRows • charFormat
  • 39. U-SQL : Extractors and Outputters CURRENT EXTRACTORS • Csv() • Tsv() • Txt() CURRENT OUTPUTTERS • Csv() • Tsv() • Txt() CUSTOM EXTRACTORS and OUTPUTTERS • FlexExtractor() • XML() • JSON() • Avro()
  • 40. U-SQL : Virtual Columns DECLARE @IN =“/datalake/01_stage/2017/06/{FileName}.csv”; @MyExtract = EXTRACT Field1 string, Field2 int, Field 3 int?, FileName string //My Virtual Column FROM @IN USING Extractors.Csv() WHERE FileName == “MyRedditFile_20170602”; File Names : MyRedditFile_20170601.csv MyRedditFile_20170602.csv etc…
  • 43. U-SQL : Tables GUIDELINES 1. Must Have Clustered Index 2. Utilize When Improving Performance with Distribution/Partitioning 3. You have Multiple Large Files 4. Don’t Use when: • No Filtering, Joining, Grouping
  • 44. U-SQL : Tables DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName; CREATE TABLE <adla>.<database>.<schema>.tableName ( Field1 int, Field2 string, Field3 int? INDEX idx_1 CLUSTERED(Field1) DISTRIBUTED BY HASH(Field2) );
  • 45. U-SQL : Tables DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName; CREATE TABLE <adla>.<database>.<schema>.tableName ( Field1 int, Field2 string, Field3 int? INDEX idx_1 CLUSTERED(Field1) DISTRIBUTED BY HASH(Field2) ) AS SELECT …
  • 46. U-SQL : Tables DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName; CREATE TABLE <adla>.<database>.<schema>.tableName ( Field1 int, Field2 string, Field3 int? INDEX idx_1 CLUSTERED(Field1) DISTRIBUTED BY HASH(Field2) ) AS EXTRACT …
  • 47. U-SQL : Tables DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName; CREATE TABLE <adla>.<database>.<schema>.tableName ( Field1 int, Field2 string, Field3 int? INDEX idx_1 CLUSTERED(Field1) DISTRIBUTED BY HASH(Field2) ) AS TVF …
  • 48. U-SQL : Tables MANAGED • Own Their Data • No Heaps • INSERT only USQL METADATA and DATA
  • 49. U-SQL : Tables EXTERNAL • Stored Metadata • Use VIEW or TVF over EXTRACT • Data Lives on Source • Azure SQL DB • Azure SQL DW • Azure SQL VM USQL METADATA DATA
  • 50. U-SQL : Data Federation CURRENT STATE EXTRACTORS
  • 51. U-SQL : Operators COMPARISON OPERATORS LOGICAL OPERATORS IS NULL AND == BETWEEN > IN, NOT IN >= LIKE, NOT LIKE != NOT OR
  • 52. U-SQL : Functions REPORTING FUNCTIONS • COUNT • SUM • MIN • MAX • AVG • STDEV • VAR RANKING FUNCTIONS • RANK • DENSE_RANK • NTILE • ROW_NUMBER ANALYTIC FUNCTIONS • CUME_DIST • PERCENT_RANK • PERCENTILE_CONT • PERCENTILE_DISC
  • 53. U-SQL : C# Functions STRING METHODS • Compare • Concat • Contains • Equals • Replace • Split • ToUpper • Trim • ..plus many more! MATH METHODS • Abs • BigMul • Floor • Max/Min • Round • Sqrt • ..plus many more!
  • 54. Copyright Talavant, Inc. 2017 54 Advice • Identify Value of Data Lake Approach • Data Lake: Invest Time and Strategy into Data Lake Design • U-SQL: Utilize U-SQL Constructs before C# • U-SQL: Understand and Control Data through Partitioning
  • 56. Learn U-SQL ! •Michael Rys – LinkedIn Slide Share’s •GitHub – U-SQL Repository •SQL Server Central – Stairway to U-SQL •Azure – Built in Example
  • 57. Let’s Connect! • https://www.linkedin.com/in/seanforgatch/ • Sean.Forgatch@Talavant.com • @4gatchSQL