Integration Monday - Analysing StackExchange data with Azure Data Lake


Nice to meet you
Tom KERKHOVE
➔ Integration Professional
➔ IoT Competency Lead
➔ Windows Development & Microsoft Azure MVP
tom.kerkhove@codit.eu
+32 473 701 074
@TomKerkhove
be.linkedin.com/in/tomkerkhove
github.com/tomkerkhove

Agenda
• Why should we care about Big Data?
• Big Data in Azure
• Azure Data Lake
• Demo
• Q & A

Integration of Things
Internet of Things

➔ Connect and scale with efficiency
➔ Analyze and act on new data
➔ Integrate and transform business processes

Event producers & gateways | Ingestion & transformation | Report, Act, Predict

Microsoft Patterns & Practices – IoT Journey

[Diagram: the Azure services landscape, split into Platform Services and Infrastructure Services. Services shown: Web Apps, Mobile Apps, API Management, API Apps, Logic Apps, Notification Hubs, Content Delivery Network (CDN), Media Services, BizTalk Services, Hybrid Connections, Service Bus, Storage Queues, Hybrid Operations, Backup, StorSimple, Azure Site Recovery, Import/Export, SQL Database, DocumentDB, Redis Cache, Azure Search, Storage Tables, Data Warehouse, Azure AD Health Monitoring, AD Privileged Identity Management, Operational Analytics, Cloud Services, Batch, RemoteApp, Service Fabric, Visual Studio, App Insights, Azure SDK, VS Online, Domain Services, HDInsight, Machine Learning, Stream Analytics, Data Factory, Event Hubs, Mobile Engagement, Data Lake, IoT Hub, Data Catalog, Security & Management, Azure Active Directory, Multi-Factor Authentication, Automation, Portal, Key Vault, Store/Marketplace, VM Image Gallery & VM Depot, Azure AD B2C, Scheduler]

Overview in Azure
[Diagram: Azure data services arranged under the labels Data Ingestion, Data Storage, Data Pipelines, Machine Learning and Data Analytics. Services shown: IoT Hub, Event Hubs, Storage, SQL Database, SQL Data Warehouse, DocumentDB, Data Lake (Store & Analytics), Data Factory, Stream Analytics, HDInsight, Virtual Machine]

Analysing Big Data in Azure
Azure Data Lake Family
➔ Data Lake Store
  • Unlimited storage
  • WebHDFS Store
➔ HDInsight
  • Managed cluster service
  • Open-source technology
  • Runs on Windows or Linux
➔ Data Lake Analytics
  • Managed job service
  • U-SQL batch-processing

Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
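Because the store is WebHDFS compatible, a file or folder can be addressed through a REST-style URL. The sketch below only builds such a URL; the host pattern `<account>.azuredatalakestore.net`, the account name and the file path are assumptions for illustration and are not taken from the slides. `OPEN` and `LISTSTATUS` are standard WebHDFS operations.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for a path in a Data Lake Store account.

    The host pattern is an assumption; authentication is out of scope here.
    """
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    # The operation and any extra parameters go into the query string.
    query = urlencode({"op": op, **params})
    return f"{base}{quote(path.lstrip('/'))}?{query}"


if __name__ == "__main__":
    # Hypothetical account and path.
    print(webhdfs_url("myaccount", "/stackexchange/Posts.xml", "OPEN"))
```

A real call would also need an Azure Active Directory token in the request headers, which this sketch leaves out.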

Data Lake vs Data Warehousing
Characteristics
➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has to be transformed first
➔ Data Lake
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, it turns into a data swamp pretty fast

Martin Fowler on Data Lake & Data Warehouses

Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
➔ Don’t worry about scale
➔ Written in U-SQL
➔ SQL Syntax
➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only

Writing U-SQL scripts
➔ Extract from a data source by using built-in or custom extractors
➔ Transform / analyse the data using SQL syntax, in-line C# or C# method calls
➔ Output the result to a data source by using built-in or custom outputters
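The extract, transform, output shape is not specific to U-SQL. As a rough illustration, here is the same three-step flow in plain Python (not U-SQL); the column names `Site` and `PostId` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def extract(source: str) -> list[dict[str, str]]:
    """Extract: parse CSV text into rows (stands in for an extractor)."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Transform: count posts per site, most active site first."""
    counts = Counter(row["Site"] for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def output(result: list[tuple[str, int]]) -> str:
    """Output: serialise the result as CSV text (stands in for an outputter)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Site", "Posts"])
    writer.writerows(result)
    return buf.getvalue()


if __name__ == "__main__":
    sample = "Site,PostId\nstackoverflow,1\nsuperuser,2\nstackoverflow,3\n"
    print(output(transform(extract(sample))))
```

In a U-SQL script the first and last steps would be handled by extractors and outputters, and the middle step by a SQL-style query.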

Data Lake Analytics - Data Sources
U-SQL queries in Azure Data Lake Analytics can target:
➔ Azure Storage Blobs
➔ Azure Data Lake Store
➔ Azure SQL Database
➔ Azure SQL Data Warehouse
➔ Azure SQL in VMs

Meet StackExchange
➔ Over 280 subwebsites
➔ 150+ GB of open-source data
➔ Different kinds of data
➔ Posts
➔ Users
➔ Votes
➔ ...
➔ A big data sample data set

What Are We Going To Do?
➔ Acquiring The Data
  • Downloading the original data set
➔ Moving The Data
  • Upload data set to Azure
  • Determine what service to use
➔ Aggregating The Data
  • Merging data from each site into one file
  • Conversion from XML to CSV
➔ Analyzing The Data
  • Run business logic on it
  • Attempt to gain knowledge from it
➔ Visualizing The Data
  • Visualize what we’ve learned
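The aggregation step (merge every site into one file, convert XML to CSV) could look roughly like the sketch below. It assumes each site's export is an XML document whose records are `<row ... />` elements with the values stored as attributes; the field list, site names and sample data are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical subset of user attributes to keep.
FIELDS = ["Id", "Reputation", "DisplayName"]


def rows_from_xml(site: str, xml_text: str):
    """Yield one dict per <row> element, tagged with the site it came from."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        # Missing attributes become empty strings so every row has all columns.
        yield {"Site": site, **{field: row.get(field, "") for field in FIELDS}}


def merge_to_csv(sites: dict[str, str]) -> str:
    """Merge the XML of every site into a single CSV document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["Site", *FIELDS], lineterminator="\n")
    writer.writeheader()
    for site, xml_text in sites.items():
        writer.writerows(rows_from_xml(site, xml_text))
    return buf.getvalue()


if __name__ == "__main__":
    sample = {
        "coffee": '<users><row Id="1" Reputation="101" DisplayName="Ann" /></users>',
        "beer": '<users><row Id="1" Reputation="5" /></users>',
    }
    print(merge_to_csv(sample))
```

Adding a `Site` column keeps the origin of each row after the merge. For files of many gigabytes a streaming parser such as `xml.etree.ElementTree.iterparse` would be a better fit than loading each document in memory.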

Azure Data Lake Tools for Visual Studio
➔ Projects / Solutions / Source control
➔ Store Explorer
➔ Browse store
➔ Download complete / subset of file
➔ Preview
➔ Job Visualizer
➔ Determine bottlenecks by using heatmaps
➔ Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler
➔ Off-Line execution

Integration with Azure Services
➔ Integrate in your data pipelines in Azure Data Factory
➔ Move data from Azure Data Lake Store to other stores
➔ Move data to Azure Data Lake Store
➔ Run U-SQL query within pipeline
➔ Integration with Azure Data Catalog
➔ Register your Azure Data Lake Store assets

Pricing
➔ Data Lake Store
➔ $0.08/GB stored per month
➔ $0.14 per 1M transactions
• 1 transaction is a block of up to 128 kB
➔ Egress will be billed, but the price is not known yet
➔ Data Lake Analytics
➔ $0.05 per job
➔ $0.05 per minute per Analytics Unit for processing time
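To get a feel for these numbers, the prices above can be turned into a small estimate. This is a sketch only: it uses the prices quoted on this slide, ignores egress (price unknown), and assumes 1 kB = 1024 bytes; the function names are hypothetical.

```python
import math

# Prices as quoted on the slide (USD).
STORE_PER_GB_MONTH = 0.08
STORE_PER_MILLION_TX = 0.14
ANALYTICS_PER_JOB = 0.05
ANALYTICS_PER_AU_MINUTE = 0.05

# One transaction covers a block of up to 128 kB (assuming 1 kB = 1024 bytes).
TX_BLOCK_BYTES = 128 * 1024


def transactions_for(size_bytes: int) -> int:
    """Number of transactions needed to read or write size_bytes once."""
    return math.ceil(size_bytes / TX_BLOCK_BYTES)


def store_cost(gb_stored: float, transactions: int) -> float:
    """Monthly Data Lake Store cost, excluding egress."""
    return gb_stored * STORE_PER_GB_MONTH + transactions / 1_000_000 * STORE_PER_MILLION_TX


def job_cost(analytics_units: int, minutes: float) -> float:
    """Cost of one Data Lake Analytics job."""
    return ANALYTICS_PER_JOB + ANALYTICS_PER_AU_MINUTE * analytics_units * minutes


if __name__ == "__main__":
    print(f"Store: ${store_cost(150, 2_000_000):.2f}")
    print(f"Job:   ${job_cost(10, 6):.2f}")
```

With these prices, storing 150 GB for a month with two million transactions comes to about $12.28, and a 6-minute job on 10 Analytics Units to about $3.05.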

Azure Data Lake Store vs Blob Storage
➔ No Limitations: store whatever you want in any format
➔ Security: built-in Azure Active Directory support
➔ Pricing: more expensive than Storage RA-GRS
➔ Redundancy: it’s there, but you have no control over it
➔ Built for Scale: optimized for high-scale reads
➔ Integration: with Data Factory, Data Catalog & HDInsight

Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
➔ Analyse today & explore tomorrow
➔ Data Swamps
➔ Data Lake Analytics
➔ No cluster management
➔ Re-use existing skills
➔ Pay for what we use
➔ Big Data in Azure? Azure Data Lake family and it’s easy!
