Mapping data flows enable code-free data transformation through an intuitive visual interface. They provide resilient data flows that handle structured and unstructured data using an Apache Spark engine, and they cover common tasks such as data cleansing, validation, aggregation, and fact loading into a data warehouse. Data can be transformed at scale through an expressive language without knowing Spark, Scala, or Python, or managing clusters.
2. What are mapping data flows?
Code-free data transformation at scale
Serverless, scaled-out, ADF-managed
Apache Spark™ engine
Resilient flows handle structured and unstructured data
Operationalized as an ADF pipeline activity
3. Code-free data transformation at scale
Intuitive UX lets you focus on building transformation logic
Data cleansing
Data validation
Data aggregation
No need to know Spark, Scala, or Python, or to manage clusters
4. Modern Data Warehouse (MDW)
On-premises, cloud, and SaaS data flows through the stages ingest, prepare, transform/predict & enrich, serve, store, and visualize, with data pipeline orchestration & monitoring across all stages
11. Building transformation logic
Transformations: A ‘step’ in the data flow
Engine intelligently groups them at runtime
19 currently available
Core logic of data flow
Add/Remove/Alter Columns
Join or lookup data from datasets
Change number or order of rows
Aggregate data
Hierarchical to relational
12. Source transformation
Define the data read by your data flow
Import projection vs generic
Schema drift
Connector-specific properties and optimizations
Min: 1, Max: ∞
Define in-line or use dataset
13. Source: In-line vs dataset
Define all source properties within a data flow or use a separate entity to store them
Dataset:
Reusable in other ADF activities such as Copy
Not based in Spark -> some settings overridden
In-line:
Useful when using flexible schemas, one-off source instances or parameterized sources
Do not need “dummy” dataset object
Based in Spark, properties native to data flow
Most connectors are available in only one of the two
14. Supported connectors
File-based data stores (ADLS Gen1/Gen2, Azure Blob Storage)
Parquet, JSON, DelimitedText, Excel, Avro, XML
In-line only: Common Data Model, Delta Lake
SQL tables
Azure SQL Database
Azure Synapse Analytics (formerly SQL DW)
Cosmos DB
Coming soon: Snowflake
If a store is not supported, ingest it to a staging area via the Copy activity (90+ connectors supported natively)
15. Duplicating data streams
Duplicate data stream from any stage of your data flow
Select ‘New branch’
Operate on same data with different transformation requirements
Self-joins
Writing to different sinks
Aggregating in one branch
16. Joining two data streams together
Use Join transformation to append columns from incoming stream to any stream in your data flow
Join types: full outer, inner, left outer, right outer, cross
SQL Join equivalent
Match on computed columns or use non-equality conditions
Broadcast small data streams to cache data and improve performance
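Since mapping data flows run on a Spark engine, the join behaviour can be sketched with a rough PySpark analogue. This is illustrative only (not the code ADF generates); the table and column names are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "A", 250.0), (2, "B", 90.0), (3, "C", 40.0)],
    ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame(
    [("A", "Contoso"), ("B", "Fabrikam")],
    ["customer_id", "name"])

# Left outer join, hinting Spark to broadcast the small customer stream
joined = orders.join(broadcast(customers), on="customer_id", how="left_outer")

# Non-equality condition: match each order to a price tier by range
tiers = spark.createDataFrame(
    [(0.0, 100.0, "small"), (100.0, 10000.0, "large")], ["lo", "hi", "tier"])
tiered = orders.join(tiers, (orders.amount >= tiers.lo) & (orders.amount < tiers.hi))
```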
17. Lookup transformation
Similar to left outer join, but with more functionality
All incoming rows are passed through regardless of match
Matching conditions same as a join
Multi or single row lookup
Match on all, first, last, or any row that meets join conditions
isMatch() function can be used in downstream transformations to verify output
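The lookup pattern can be approximated in PySpark as a left outer join that lets every incoming row pass through, with a null check standing in for isMatch(). A minimal sketch with invented column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

incoming = spark.createDataFrame([(1, "A"), (2, "B"), (3, "Z")], ["id", "code"])
reference = spark.createDataFrame([("A", "Alpha"), ("B", "Beta")], ["code", "label"])

# Every incoming row passes through, matched or not
looked_up = incoming.join(reference, on="code", how="left_outer")

# Rough stand-in for isMatch(): flag whether the lookup found a row
looked_up = looked_up.withColumn("is_match", F.col("label").isNotNull())
```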
18. Exists transformation
Check for existence of a value in another stream
SQL Exists equivalent
See if any row matches in a subquery, just like SQL
Filter based on join matching conditions
Choose Exist or Not Exist for your filter conditions
Can specify a custom expression
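In Spark terms, the exists / not-exists filter corresponds to semi and anti joins. A conceptual PySpark sketch (not the ADF implementation; names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["order_id", "customer_id"])
active = spark.createDataFrame([("A",), ("B",)], ["customer_id"])

# "Exists": keep rows that have a match in the other stream (like SQL EXISTS)
exists_rows = orders.join(active, on="customer_id", how="left_semi")

# "Not exists": keep rows with no match (like SQL NOT EXISTS)
not_exists_rows = orders.join(active, on="customer_id", how="left_anti")
```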
19. Union transformation
Combine rows from multiple streams
Add as many streams as needed
Combine data based upon column name or ordinal column position
Use cases:
Similar data from different connections that undergo the same transformations
Writing multiple data streams into the same sink
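The name-based versus position-based combination can be illustrated with a small PySpark sketch (conceptual analogue only, invented column names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

web_sales = spark.createDataFrame([(1, 10.0)], ["id", "amount"])
store_sales = spark.createDataFrame([(20.0, 2)], ["amount", "id"])

# Combine based on column name (column order may differ between streams)
by_name = web_sales.unionByName(store_sales)

# Combine based on ordinal position (columns are matched positionally)
by_position = web_sales.union(store_sales.select("id", "amount"))
```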
20. Conditional split
Split data into separate streams based upon conditions
Use the data flow expression language to evaluate a boolean condition
Use cases:
Sinking a subset of data to different locations
Performing different calculations on data depending on a set of values
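Conceptually, each split condition is a boolean filter over the same stream; a minimal PySpark sketch of the idea (illustrative, not the ADF-generated code):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

payments = spark.createDataFrame(
    [(1, "visa", 120.0), (2, "cash", 15.0), (3, "visa", 80.0)],
    ["id", "method", "amount"])

# Each split condition is a boolean expression; unmatched rows fall to a default stream
card_stream = payments.where(F.col("method") == "visa")
default_stream = payments.where(F.col("method") != "visa")
```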
21. Derived column
Transform data at row and column level using expression language
Generate new or modify existing columns
Build expressions using the expression builder
Handle structured or unstructured data
Use column patterns to match on rules and regular expressions
Can be used to transform multiple columns in bulk
Most heavily used transformation
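A derived column adds or modifies columns with row-level expressions. A rough PySpark equivalent of the pattern (column names and expressions are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame([(" Ada ", "1985-12-01")], ["name", "dob"])

derived = (people
    .withColumn("name", F.trim(F.col("name")))           # modify an existing column
    .withColumn("dob", F.to_date("dob"))                  # cast string to date
    .withColumn("age_days",                               # add a new column
                F.datediff(F.current_date(), F.col("dob"))))
```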
22. Select transformation
Metadata and column maintenance
SQL SELECT equivalent
Alias or rename data streams and columns
Prune unwanted or duplicate columns
Common after joins and lookups
Rule-based mapping for flexible schemas, bulk mapping
Map hierarchical columns to a flat structure
23. Surrogate key transformation
Generate incrementing key to use as a non-business key in your data
To seed the starting value of your surrogate key, use derived column and a lookup from an existing table
Examples are in documentation
Useful for generating keys for star schema dimension tables
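The incrementing-key-plus-seed pattern can be sketched in PySpark as a row number offset by the highest existing key. Purely illustrative (the seed is hard-coded here; in practice it would come from a lookup against the dimension table):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

new_dim_rows = spark.createDataFrame([("US",), ("DE",), ("JP",)], ["country"])

# Seed: highest key already in the dimension table (hard-coded for illustration)
max_existing_key = 100

w = Window.orderBy("country")  # global ordering is acceptable for dimension-sized data
keyed = new_dim_rows.withColumn(
    "country_key", F.row_number().over(w) + F.lit(max_existing_key))
```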
24. Aggregate transformation
Aggregate data into groups using aggregate function
Like SQL GROUP BY clause in a Select statement
Aggregate functions include sum(), max(), avg(), first(), collect()
Choose columns to group by
One row for each unique group by column value
Only columns used in transformation are in output data stream
Use self-join to append to existing data
Supports pattern matching
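The GROUP BY behaviour maps directly onto Spark's groupBy/agg; a minimal PySpark sketch with invented column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("2020", "US", 10.0), ("2020", "US", 5.0), ("2020", "DE", 7.0)],
    ["year", "country", "amount"])

# One output row per unique group; only grouped and aggregated columns survive
summary = (sales.groupBy("year", "country")
    .agg(F.sum("amount").alias("total"),
         F.avg("amount").alias("avg_amount"),
         F.collect_list("amount").alias("all_amounts")))  # collect() analogue
```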
25. Pivot and unpivot transformations
Pivot row values into new columns and vice-versa
Both are aggregate transformations that require aggregate functions
If pivot key values not specified, all columns become drifted
Use map drifted quick action to add to schema quickly
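A conceptual PySpark sketch of pivoting row values into columns and folding them back again (not the ADF implementation; values and names are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("2020", "US", 10.0), ("2020", "DE", 7.0), ("2021", "US", 12.0)],
    ["year", "country", "amount"])

# Pivot: distinct "country" values become columns; an aggregate function is required
pivoted = sales.groupBy("year").pivot("country", ["US", "DE"]).agg(F.sum("amount"))

# Unpivot: fold the pivoted columns back into key/value rows
unpivoted = pivoted.selectExpr(
    "year", "stack(2, 'US', US, 'DE', DE) as (country, amount)")
```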
26. Window transformation
Aggregates data across “windows” of data partitions
Used to compare a row of data against others in its ‘group’
Group determined by group-by columns, sorting conditions, and range bounds
Used for ranking rows in a group and getting lead/lag
Sorting causes reshuffling of data
“Expensive” operation
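Window-style ranking and lead/lag can be sketched with Spark window functions; an illustrative PySpark analogue with invented columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("US", "2020-01", 10.0), ("US", "2020-02", 12.0), ("DE", "2020-01", 7.0)],
    ["country", "month", "amount"])

# The window is defined by partition columns and a sort order
w = Window.partitionBy("country").orderBy("month")

ranked = (sales
    .withColumn("rank_in_country", F.rank().over(w))
    .withColumn("prev_month_amount", F.lag("amount", 1).over(w)))
```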
28. Alter row transformation
Mark rows as Insert, Update, Delete, or Upsert
Like SQL MERGE statement
Insert by default
Define policies to update your database
Works with SQL DB, Synapse, Cosmos DB, and Delta Lake
Specify allowed update methods in each sink
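For the Delta Lake case, the insert/update/upsert policy corresponds conceptually to a MERGE. A minimal sketch using the delta-spark package, assuming a Spark session already configured for Delta Lake and an existing table at an illustrative path:

```python
from delta.tables import DeltaTable          # requires the delta-spark package
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes Delta Lake is already configured

updates = spark.createDataFrame([(1, "new"), (4, "brand new")], ["id", "value"])

# Upsert into an existing Delta table (path is illustrative)
target = DeltaTable.forPath(spark, "/mnt/lake/dim_example")
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # rows treated as updates/upserts
    .whenNotMatchedInsertAll()   # rows treated as inserts
    .execute())
```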
29. Flatten transformation
Unroll array values into individual rows
One row per value
Used to convert hierarchies to flat structures
Opposite of collect() aggregate function
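The unroll behaviour corresponds to Spark's explode; a small illustrative PySpark sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, ["pen", "ink"]), (2, ["paper"])],
    ["order_id", "items"])

# Unroll the array: one output row per array element
flat = orders.withColumn("item", F.explode("items")).drop("items")
```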
30. Sort transformation
Sort your data by column values
SQL Order By equivalent
Use sparingly: Reshuffles and coalesces data
Reduces effectiveness of data partitioning
Does not optimize speed like legacy ETL tools
Useful for data exploration and validation
31. Sink transformation
Define the properties for landing your data in your destination target data store
Define using dataset or in-line
Can map columns similar to select transformation
Import schema definition from destination
Set actions on destinations
Truncate table or clear folder, SQL pre/post actions, database update methods
Choose how the written data is partitioned
‘Use current partitioning’ is almost always fastest
Note: Writing to a single file can be very slow with large amounts of data
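The difference between keeping current partitioning and writing a single file shows up clearly in plain Spark; an illustrative PySpark sketch (paths are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Keep current partitioning: one output file per partition, written in parallel (fast)
df.write.mode("overwrite").parquet("/tmp/out_partitioned")

# Single-file output forces all data through one partition (slow for large data)
df.coalesce(1).write.mode("overwrite").csv("/tmp/out_single_file", header=True)
```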
33. Visual expression builder
List of columns being modified
All available functions, fields, parameters …
Build expressions with full auto-complete and syntax checking
View results of your expression in the data preview pane with live, interactive results
34. Expression language
Expressions are built using the data flow expression language
Expressions can reference:
Built-in expression functions
Defined input schema columns
Data flow parameters
Literals
Certain transformations have unique functions
count(), sum() in Aggregate, denseRank() in Window, etc.
Evaluates to Spark data types
35. Debug mode
Quickly verify logic during development on small interactive cluster
4 core, 60-minute time to live
Enables the following:
Get data preview snapshot at each transformation
Preview output of expression in expression builder
Run debug pipeline with no spin up
Import Spark projection of source schema
Rule of thumb: If developing Data Flows, turn on right away
Initial 3-5-minute start up time
39. Parameterizing data flows
Both dataset properties and data-flow expressions can be parameterized
Passed in via data flow activity
Can use data flow or pipeline expression language
Expressions can reference $parameterName
Can be literal values or column references
42. Schema drift
In real-world data integration solutions, source/target data stores change shape
Source data fields can change names
Number of columns can change over time
Traditional ETL processes break when schemas drift
Mapping data flow has built-in handling for flexible schemas
Patterns, rule-based mappings, byName(s) function, etc
Source: Read additional columns on top of what is defined in the source schema
Sink: Write additional columns on top of what is defined in the sink schema
46. Data flow activity
Run as activity in pipeline
Integrated with existing ADF control flow, scheduling, orchestration, monitoring, CI/CD
Choose which integration runtime (IR) to run on
# of cores, compute type, cluster time to live
Assign parameters
47. Data flow integration runtime
Integrated with existing Azure IR
Choose compute type, # of cores, time to live
Time to live: time a cluster is alive after last execution concludes
Minimal start up time for sequential data flows
Parameterize compute type, # of cores if using Auto Resolve
49. Data flow security considerations
All data stays inside VMs that run the Databricks cluster, which are spun up JIT for each job
• Azure Databricks attaches storage to the VMs for logging and spill-over from in-memory data frames during job operation. These storage accounts are fully encrypted and within the Microsoft tenant.
• Each cluster is single-tenant and specific to your data and job. This cluster is not shared with any other tenant.
• Data flow processes are completely ephemeral. Once a job is completed, all associated resources are destroyed.
• Both cluster and storage account are deleted
• Data transfers in data flows are protected using certificates
• Active telemetry is logged and maintained for 45 days for troubleshooting by the Azure Data Factory team
51. Best practices – Lifecycle
1. Test your transformation logic using debug mode and data preview
Limit source size or use sample files
2. Test end-to-end pipeline logic using pipeline debug
Verify data is read/written correctly
Used as a smoke test before merging your changes
3. Publish and trigger your pipelines within a Dev Factory
Test performance and cluster size
4. Promote pipelines to higher environments such as UAT and PROD using CI/CD
Increase size and scope of data as you get to higher environments
52. Best practices – Debug (Data Preview)
Data Preview
Data preview is inside the data flow designer transformation properties
Uses row limits and sampling techniques to preview a small sample of data
Allows you to build and validate units of logic with samples of data in real time
You have control over the size of the data limits under Debug Settings
If you wish to test with larger datasets, set a larger compute size in the Azure IR when switching on “Debug Mode”
Data Preview is only a snapshot of data in memory from Spark data frames. This feature does not write any data, so the sink drivers are not utilized and not tested in this mode.
53. Best practices – Debug (Pipeline Debug)
Pipeline Debug
Click debug button to test your data flow inside of a pipeline
Default debug limits the execution runtime so you will want to limit data sizes
Sampling can be applied here as well by using the “Enable Sampling” option in each Source
Use the debug button option of “use activity IR” when you wish to use a job execution compute environment
This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the default debug setting
54. Optimizing data flows
Transformation order generally does not matter
Data flows have a Spark optimizer that reorders logic to perform as best as it can
Repartitioning and reshuffling data negates optimizer
Each transformation has an ‘Optimize’ tab to control partitioning strategies
Generally do not need to alter
Altering cluster size and type has performance impact
Four components
1. Cluster startup time
2. Reading from sources
3. Transformation time
4. Writing to sinks
55. Identifying bottlenecks
1. Cluster startup time: Sequential executions can lower the cluster startup time by setting a TTL in the Azure IR
2. Sink processing time: Total time to process the stream from source to sink. There is also a post-processing time when you click on the sink that shows how much time Spark spent on partition and job clean-up. Writing to a single file and slow database connections will increase this time
3. Source read time: Shows how long it took to read data from the source. Optimize with different source partition strategies
4. Transformation stage time: Shows bottlenecks in your transformation logic. With larger general-purpose and memory-optimized IRs, most of these operations occur in memory in data frames and are usually the fastest operations in your data flow
56. Best practices - Sources
When reading from file-based sources, data flow automatically partitions the data based on size
~128 MB per partition, evenly distributed
‘Use current partitioning’ will be fastest for file-based sources and for Synapse using PolyBase
Enable staging for Synapse
For Azure SQL DB, use Source partitioning on a column with high cardinality
Improves performance, but can saturate your source database
Reading can be limited by the I/O of your source
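The effect of source partitioning on a column with high cardinality resembles a partitioned JDBC read in plain Spark. An illustrative PySpark sketch (server, table, credentials, and bounds are all made up; it assumes the SQL Server JDBC driver is on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partitioned read: Spark issues one range query per partition over the chosen column
orders = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.Orders")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .option("partitionColumn", "OrderId")   # high-cardinality numeric column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")           # 8 parallel connections to the source
    .load())
```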
57. Optimizing transformations
Each transformation has its own optimize tab
Generally better to not alter -> reshuffling is a relatively slow process
Reshuffling can be useful if data is very skewed
One node has a disproportionate amount of data
For Joins, Exists and Lookups:
If you have many of them, memory-optimized compute greatly increases performance
Can ‘Broadcast’ if the data on one side is small
Rule of thumb: Less than 50k rows
Increasing the integration runtime size can speed up transformations
Transformations that require reshuffling, like Sort, negatively impact performance
58. Best practices - Sinks
SQL:
Disable indexes on target with pre/post SQL scripts
Increase SQL capacity during pipeline execution
Enable staging when using Synapse
File-based sinks:
‘Use current partitioning’ allows Spark to create the output
Output to a single file is a very slow operation
Combines data into a single partition
Often unnecessary for whoever is consuming the data
Can set naming patterns or use data in a column
Any reshuffling of data is slow
Cosmos DB:
Set throughput and batch size to meet performance requirements
59. Azure Integration Runtime
Data Flows use JIT compute to minimize running expensive clusters when they are mostly idle
Generally more economical, but each cluster takes ~4 minutes to spin up
IR specifies what cluster type and core-count to use
Memory optimized is best, compute optimized doesn’t generally work for production workloads
When running sequential jobs, utilize Time to Live to reuse the cluster between executions
Keeps cluster alive for TTL minutes after execution for new job to use
Maximum one job per cluster
Rule of thumb: start small and scale up
61. Data flow script (DFS)
DFS defines the logical intent of your data transformations
Script is bundled and marshalled to Spark cluster as a job for execution
DFS can be auto-generated and used for programmatic creation of data flows
Access script behind UI via “Script” button
Click “Copy as Single Line” to save version of script that is ready for JSON
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-script
65. ETL Tool Migration Overview
Migrating from an existing large enterprise ETL installation to ADF and data flows requires adherence to a formal methodology that incorporates classic SDLC, change management, project management, and a deep understanding of your current data estate and ETL requirements.
Successful migration projects require project plans, executive sponsorship, budget, and a dedicated team to focus on rebuilding the ETL in ADF.
For existing on-prem ETL estates, it is very important to learn the basics of Cloud, Azure, and ADF generally before taking this Data Flows training.
68. Training
• On-prem to Cloud, Azure general training, ADF general training, Data Flows training
• A general understanding of the difference between legacy client/server on-prem ETL architectures and cloud-based Big Data processing is required
• ADF and Data Flows execute on Spark, so learn the fundamentals of the difference between row-by-row processing on a local server and batch/distributed computing on Spark in the Cloud
69. Execution
• Start with the top 10 mission-critical ETL mappings and list out the primary logical goals and steps achieved in each
• Use sample data and debug each scenario as new pipelines and data flows in ADF
• UAT each of those 10 mappings in ADF using sample data
• Lay out an end-to-end project plan for remaining mapping migrations
• Plan the remainder of the project into quarterly calendar milestones
• Expect each phase to take around 3 months
• Majority of large existing ETL infrastructure modernization migrations take 12-18 months to complete
70. Roadmap: 2020 H2
New connectors:
• Snowflake (r/w) for Data Flow (GA)
• Delta Lake (r/w) for Data Flow (GA)
• Common Data Model (CDM) format support for Mapping Data Flow (GA)
• Azure Database for PostgreSQL (r/w) for Data Flow (GA)
• Azure Database for MySQL (r/w) for Data Flow (GA)
• Dynamics 365/CDS (r/w) for Data Flow (GA)
• Error Row Handling (GA)
• Wide row completion (GA)
• Updated Expression Builder UX w/Local Vars (GA)
• Wrangling Data Flow (GA)