This document provides an overview of processing big data with Azure Data Lake Analytics. It discusses:
1. Sean Forgatch, a business intelligence consultant specializing in Azure big data solutions.
2. Talavant, Sean's company, which provides holistic big data strategies and implementations.
3. An introduction to big data concepts like volume, velocity, and variety, and how Azure tools like Data Warehouse, Data Lake, and Data Lake Analytics address them.
The document then goes into further detail on Data Lake concepts, Azure Data Lake Store, Azure Data Lake Analytics, and the U-SQL language for querying and analyzing data in the data lake.
Making Apache Spark Better with Delta Lake - Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
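For readers who want a feel for the conversion step, here is a minimal PySpark sketch of moving an existing Parquet table onto Delta Lake. It assumes the delta-spark package (or the equivalent jars) is available; the paths and app name are illustrative placeholders, not taken from the talk.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    # Standard settings that wire Delta Lake into Spark SQL.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Converting an existing job is often little more than a format swap.
df = spark.read.parquet("/data/events")               # hypothetical Parquet source
df.write.format("delta").save("/data/events_delta")   # same rows, now with ACID transactions

# Reads go through the same DataFrame API, so downstream code barely changes.
spark.read.format("delta").load("/data/events_delta").show()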
Building an ETL pipeline for Elasticsearch using Spark - Itai Yaffe
How we, at eXelate, built an ETL pipeline for Elasticsearch using Spark, including:
* Processing the data using Spark.
* Indexing the processed data directly into Elasticsearch using the elasticsearch-hadoop plug-in for Spark.
* Managing the flow using some of the services provided by AWS (EMR, Data Pipeline, etc.).
The presentation includes some tips and discusses some of the pitfalls we encountered while setting up this process.
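As a rough illustration of the indexing step above, the sketch below writes a Spark DataFrame straight into Elasticsearch through the elasticsearch-hadoop connector. It assumes the connector jar is on the Spark classpath; the host, index name, and columns are placeholders rather than eXelate's actual setup.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-etl").getOrCreate()

# Stand-in for data already processed by the Spark stage of the pipeline.
events = spark.createDataFrame([("u1", 3), ("u2", 7)], ["user_id", "clicks"])

(events.write
    .format("org.elasticsearch.spark.sql")  # data source registered by elasticsearch-hadoop
    .option("es.nodes", "es-host:9200")     # placeholder Elasticsearch endpoint
    .option("es.resource", "events")        # target index
    .mode("append")
    .save())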
Spark, the ultra-fast, general-purpose big data computing platform, provides some very flexible options for processing and accessing data. In a previous meetup we covered PySpark and the SchemaRDD. In this session we reviewed and expanded on this, with an in-depth exploration of Spark SQL.
- Overview of Spark in the Hadoop ecosystem
- Deep dive into Spark SQL with step-by-step guidance on how to implement and use it
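A minimal sketch of the Spark SQL pattern the session walks through: build a DataFrame, expose it as a temporary view, and query it with plain SQL. The data is invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

people = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
people.createOrReplaceTempView("people")   # make the DataFrame queryable by name

# The SQL and DataFrame APIs compile to the same optimized execution plan.
spark.sql("SELECT name FROM people WHERE age > 30").show()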
If you have questions about the presentation or want to learn more about our services, please visit our website: http://casertaconcepts.com/
SQL on Big Data is not "one size fits all". Optiq is a framework that lets you build a data management system on top of any back-end system, including NoSQL and Hadoop, with rules that optimize query processing for the capabilities of each data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.
This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji - Data Con LA
Abstract: Of all the things that delight developers, none is more attractive than a set of APIs that make them productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets - available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as a best practice, outline their performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
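To make the interoperation concrete, here is a small PySpark sketch (note that typed Datasets exist only in Scala and Java, so from Python you move between RDDs and DataFrames); the data is illustrative.

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("three-apis").getOrCreate()

# Low-level API: an RDD of Row objects.
rdd = spark.sparkContext.parallelize([Row(word="spark", n=1), Row(word="sql", n=2)])

df = spark.createDataFrame(rdd)   # RDD -> DataFrame: gains a schema and the Catalyst optimizer
back = df.rdd                     # DataFrame -> RDD when you need row-at-a-time control

df.groupBy("word").sum("n").show()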
SQL Now! How Optiq brings the best of SQL to NoSQL data - Julian Hyde
A talk given at the August, 2013 NoSQL Now! conference in San Jose, California.
2013 is the year that SQL meets Hadoop & NoSQL. There is a surprisingly good fit between thirty-something SQL and the upstart millennial data technologies. SQL helps you deliver faster value, improve integration with visualization and analytics tools, and bring your data to a larger audience.
Julian Hyde, author of Mondrian and Optiq, and a contributor to Apache Drill and Cascading Lingual, shows how to quickly build a SQL interface to a NoSQL system using the Optiq framework.
As a case study, he shows how Optiq has allowed the Mondrian in-memory analysis engine (part of Pentaho's business intelligence suite) to leverage the memory and compute power of a cluster and pull data from hybrid SQL and NoSQL sources.
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Training Day) - Michael Rys
From theory to implementation - follow the steps of implementing an end-to-end analytics solution illustrated with some best practices and examples in Azure Data Lake.
During this full training day we will share the architecture patterns, tooling, learnings and tips and tricks for building such services on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time and the ability to develop some of your own U-SQL scripts to process your data and discuss the pros and cons of files versus tables.
These are the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust - Data Con LA
Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* All technical aspects of Delta features
* What's coming
* How to get started using it
* How to contribute
Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.
https://fosdem.org/2017/schedule/event/hpc_bigdata_calcite/
When working with BigData & IoT systems we often feel the need for a common query language. The platform-specific languages are often harder to integrate with and require longer adoption time.
To fill this gap many NoSql (Not-only-Sql) vendors are building SQL layers for their platforms. It is worth exploring the driving forces behind this trend, how it fits in your BigData stacks, and how we can adopt it in our favorite tools. However, building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your big data system.
Calcite has been used to empower many Big-Data platforms such as Hive, Spark, Drill, and Phoenix, to name a few.
I will walk you through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls, and technical considerations like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala - DataStax Academy
What You Will Learn At This Meetup:
• Review of Cassandra analytics landscape: Hadoop & HIVE
• Custom input formats to extract data from Cassandra
• How Spark & Shark increase query speed & productivity over standard solutions
Abstract
This session covers our experience using the Spark and Shark frameworks to run real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over the standard solutions of today.
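The Shark engine described here has since been folded into Spark SQL, so as a present-day stand-in the sketch below reads Cassandra data into Spark with the DataStax spark-cassandra-connector; the host, keyspace, and table names are placeholders, not Ooyala's setup.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cassandra-analytics")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder contact point
    .getOrCreate()
)

events = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")  # placeholder keyspace/table
    .load())

# Once loaded, queries run in memory across the cluster.
events.groupBy("event_type").count().show()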
About Evan Chan
Evan Chan is a Software Engineer at Ooyala. In his own words: I love to design, build, and improve bleeding edge distributed data and backend systems using the latest in open source technologies. I am a big believer in GitHub, open source, and meetups, and have given talks at conferences such as the Cassandra Summit 2013.
South Bay Cassandra Meetup URL: http://www.meetup.com/DataStax-Cassandra-South-Bay-Users/events/147443722/
Designing and Building Next Generation Data Pipelines at Scale with Structured Streaming - Databricks
Lambda architectures, data warehouses, data lakes, on-premises Hadoop deployments, elastic cloud architecture… We’ve had to deal with most of these at one point or another when working with data. At Databricks, we have built data pipelines that leverage these architectures, and we work with hundreds of customers who build similar pipelines. We observed some common pain points along the way: the Hive Metastore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, and there’s no easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walk away with the essentials of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
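A minimal sketch of the pipeline shape described above, assuming PySpark with Delta available: a file-based streaming source feeding a transactional Delta sink. Paths and schema are placeholders, and features such as file-notification sources and Z-Order clustering are Databricks-specific extras not shown here.

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructType

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

schema = StructType().add("user_id", StringType()).add("ts", LongType())

# Incremental source: newly landed JSON files are picked up automatically.
stream = spark.readStream.schema(schema).json("/landing/events")

# Transactional sink: atomic commits keep garbage/partial data out of the table.
(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/events")  # enables exactly-once recovery
    .start("/tables/events"))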
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given by Julian Hyde at the Apache Big Data conference, Vancouver, on 2016/05/09.
Why is data independence (still) so important? Optiq and Apache Drill - Julian Hyde
Presentation to the Apache Drill Meetup in Sunnyvale, CA on 2012/9/13, framing the debate about Drill's goals in terms of a "typical" modern DBMS architecture and introducing the Optiq extensible query optimizer.
How to integrate Splunk with any data solution - Julian Hyde
A presentation Julian Hyde gave to the Splunk 2012 User conference in Las Vegas, Tue 2012/9/11. Julian demonstrated a new technology called Optiq, described how it could be used to integrate data in Splunk with other systems, and demonstrated several queries accessing data in Splunk via SQL and JDBC.
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office - DataStax Academy
This case study concerns moving large amounts of patent data from Cassandra to Solr. How we approached the problem, the introduction of Spark as a solution, and how to optimize Spark jobs. I will cover:
* Understanding the parts of a Spark Job. Which components run where and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
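As a taste of the "adding metrics" point above, this PySpark sketch uses an accumulator to surface how often a parsing pain point is hit; the records and counter are invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-metrics").getOrCreate()
sc = spark.sparkContext

malformed = sc.accumulator(0)  # counter updated on executors, readable on the driver

def parse(line):
    parts = line.split(",")
    if len(parts) != 2:
        malformed.add(1)       # record the pain point instead of failing silently
        return None
    return (parts[0], int(parts[1]))

rdd = sc.parallelize(["a,1", "bad-record", "b,2"])
good = rdd.map(parse).filter(lambda x: x is not None)
print("good records:", good.count())
print("malformed records:", malformed.value)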
Lessons from the Field, Episode II: Applying Best Practices to Your Apache Spark Applications - Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization.
As a follow-up of last year’s “Lessons From The Field”, this session will review some common anti-patterns I’ve seen in the field that could introduce performance or stability issues to your Spark jobs. We’ll look at ways of better understanding your Spark jobs and identifying solutions to these anti-patterns to help you write better performing and more stable applications.
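One example of the kind of anti-pattern such sessions cover (this specific snippet is mine, not taken from the talk): replacing a row-at-a-time Python UDF with a built-in expression so the optimizer can do its job.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("best-practices").getOrCreate()
df = spark.createDataFrame([("a", 10.0), ("b", None)], ["k", "amount"])

# Anti-pattern: an opaque Python UDF adds serialization overhead and blocks
# Catalyst optimizations.
#   clean = F.udf(lambda x: 0.0 if x is None else x, "double")
#   df = df.withColumn("amount", clean("amount"))

# Preferred: built-in functions stay inside the optimizer.
df = df.withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))
df.show()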
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform - Michael Rys
More and more customers looking to modernize their analytics workloads are exploring the data lake approach in Azure. Typically, they are most challenged by a bewildering array of poorly integrated technologies and a variety of data formats and data types, not all of which are conveniently handled by existing ETL technologies. In this session, we’ll explore the basic shape of a modern ETL pipeline through the lens of Azure Data Lake. We will explore how this pipeline can scale from one to thousands of nodes at a moment’s notice to respond to business needs, how its extensibility model allows pipelines to simultaneously integrate procedural code written in .NET languages or even Python and R, and how that same extensibility model allows pipelines to deal with a variety of formats such as CSV, XML, JSON, images, or any enterprise-specific document format. Finally, we will explore how the next generation of ETL scenarios is enabled through the integration of intelligence in the data layer, in the form of built-in Cognitive capabilities.
Development teams are facing an increased demand for building faster, smarter, more complex systems in record time. The DataStax development tools have been architected from the ground up to enable teams to deliver intelligent, large-scale distributed, high-performance applications while remaining familiar and easy to use.
This session will provide an overview of the DataStax Developer Tools and where and how they fit into the development cycle.
About the Speaker
Alex Popescu Senior Product Manager, DataStax
I'm a developer turned product manager building developer tools for Apache Cassandra and DSE. With an eye for simplicity, I focus on creating friendly developer solutions that enable building high-performance, scalable, and fault-tolerant applications. I'm passionate about open source, and over the years I have made numerous contributions to major projects like TestNG and Groovy.
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
Azure Data Factory is one of the newer data services in Microsoft Azure and part of the Cortana Analytics Suite, providing data orchestration and movement capabilities.
This session will describe the key components of Azure Data Factory and take a look at how you create data transformation and movement activities using the online tooling. Additionally, the new tooling that shipped with the recently updated Azure SDK 2.8 will be shown in order to provide a quickstart for your cloud ETL projects.
Move your on-prem data to a lake in the cloud - CAMMS
With the boom in data, in both volume and complexity, the trend is to move data to the cloud. Where and how do we do this? Azure gives you the answer. In this session, I will give you an introduction to Azure Data Lake and Azure Data Factory, and explain why they are good for the type of problem we are talking about. You will learn how large datasets can be stored on the cloud, and how you can transport your data to this store. The session will briefly cover Azure Data Lake as the modern warehouse for data on the cloud.
These are the slides for my talk "An intro to Azure Data Lake" at Azure Lowlands 2019. The session was held on Friday January 25th from 14:20 - 15:05 in room Santander.
Apache Spark v3 is a new milestone for the Big Data framework. In this session, you will (re)discover what Spark is, learn about the new features in its third major version, and go through a complete end-to-end project.
I like to call Spark an Analytics Operating System: it offers far more than just a framework or a library, and I will explain why. Spark v3 is the latest major evolution. It was released in mid-June 2020 and adds impressive new features. After looking at them from a high level, I will detail a few of my favorites.
Finally, as we all like code (well, at least I do), I will demonstrate a complete data & AI pipeline looking at Covid-19 data.
Key takeaways: Spark as an Analytics OS, Spark v3 highlights, building data/AI pipelines/models with Spark.
Audience: software engineers, data engineers, architects, data scientists.
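As one concrete Spark 3 highlight, here is a sketch of enabling Adaptive Query Execution; the configuration keys are standard Spark 3 settings, while the toy aggregation is just for demonstration.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark3-aqe")
    .config("spark.sql.adaptive.enabled", "true")                     # re-plan at runtime
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # right-size shuffles
    .getOrCreate()
)

df = spark.range(1_000_000)
# With AQE on, the post-shuffle partition count for this aggregation is chosen from
# runtime statistics instead of the static spark.sql.shuffle.partitions value.
df.groupBy((df.id % 10).alias("bucket")).count().show()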
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635) - Michael Rys
Data Lakes have become a new tool in building modern data warehouse architectures. In this presentation we will introduce Microsoft's Azure Data Lake offering and its new big data processing language, U-SQL, which makes big data processing easy by combining the declarativity of SQL with the extensibility of C#. We will explain why we introduced U-SQL, show how to analyze some tweet data with U-SQL and its extensibility capabilities, and take you on an introductory tour of U-SQL geared towards existing SQL users.
Slides from SQL Saturday 635, Vancouver BC, Aug 2017.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch, ...) - Databricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Accelerating Business Intelligence Solutions with Microsoft Azure (PASS) - Jason Strate
Business Intelligence (BI) solutions need to move at the speed of business. Unfortunately, roadblocks related to availability of resources and deployment often get in the way. What if you could accelerate the deployment of an entire BI infrastructure to just a couple of hours and start loading data into it by the end of the day? In this session, we'll demonstrate how to leverage Microsoft tools and the Azure cloud environment to build out a BI solution and begin providing analytics to your team with tools such as Power BI. By the end of the session, you'll understand the capabilities of Azure and how you can start building an end-to-end BI proof-of-concept today.
1. Any use of this material without the permission of Talavant is prohibited
PROCESSING BIG DATA
with Azure Data Lake Analytics
Sean Forgatch
Business Intelligence Consultant
Sean.Forgatch@talavant.com
2. About Me
Sean Forgatch
• Milwaukee, WI
• Business Intelligence Consultant
• Healthcare, Insurance, SaaS
• Integration and Analytics
• Microsoft Big Data Certified
• PASS
• Industry Speaker
• FoxPASS President
• Running, Craft Beers, Reading
3. About Talavant
There is a better way to make data work for companies: better resources, strategy, sustainability, inclusion of the organization as a whole, understanding of client needs, tools, outcomes, and better ROI.
VALUE WE PROVIDE
• Accelerated planning, implementation and results
• Sustainable solutions
• Increased company-wide buy-in & usage
HOW WE DO IT
By providing a holistic approach inclusive of a client’s people, processes and technologies, built on investment in our own employees and company growth.
STRATEGY | ARCHITECTURE | IMPLEMENTATION
4. 1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
7. Big Data Primer: Azure Tools
THE THREE V's OF BIG DATA (plus Veracity)
VOLUME
• Increasing amount of data and sources
• Increase of data asset acquisitions
VELOCITY
• Streaming data
• Real-time analytics
• Increase in demand
VERACITY
• Reliability of trustworthy data
• Timeliness of data operations
VARIETY
• New data formats being used
• Images, JSON, free text
• Avro, Parquet, ORC
Azure tools mapped to these in the original diagram: Azure Data Warehouse, Azure Data Lake, Azure Data Lake Analytics, Event Hubs, IoT Hubs, Stream Analytics, HDInsight.
8. Big Data Trends
"Variety, not volume or velocity, drives big-data investments"
9. Big Data Trends
"Variety, not volume or velocity, drives big-data investments"
"Big data grows up: Hadoop adds to enterprise standards"
10. Big Data Trends
"Variety, not volume or velocity, drives big-data investments"
"Big data grows up: Hadoop adds to enterprise standards"
"Rise of metadata catalogs helps people find analysis-worthy big data."
-TDWI: Top Ten Big Data Trends for 2017
11. 1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
14. Data Lake Operations (diagram)
Azure Data Lake Store zones: RAW → STAGE → CURATED, plus an EXPLORATORY area.
• Operational zones (RAW/STAGE/CURATED): value has been identified
• Exploratory zone: value is being discovered
16. Data Lake Operations (continued)
Source systems feeding the lake: ERP, IoT devices, RDBMS.
17. Data Lake Operations (continued)
Exploratory activities: tag/explore and discover.
19. Data Lake Security
Zone access by role:
RAW (1): Data Experts/Engineers
STAGE (2): Data Experts/Engineers
CURATED (3): ETL and BI Engineers / SMEs / Analysts
EXPLORATION (0): Data Scientists / Analysts
(The original diagram also shows a TOOLS row per zone.)
20. Data Lake Tagging
Adds a tagging row across the zones: AUTOMATED / SME / N/A.
21. Data Lake Processing
Adds the processing stages across the zones: INGESTION, CLEANSING, DISTRIBUTION.
22. Conceptual Architecture (diagram)
Sources: IoT, medical devices, social media, imaging, events, video, application data, electronic medical records systems, customer relationship management, financial systems, data marketplace.
RAW: land quickly, no transformations; original format; stored by date ingested; tag and catalog data; possibly compressed; archive data; includes transient data operations.
STAGE: general staging; decompress; conform; delta preparation; U-SQL/Hive tables; data lake standard; can bypass CURATED.
CURATED: files and tables; reference data; data warehouse data; loaded downstream via ADF and PolyBase.
EXPLORATORY: sandbox area for exploratory analytics.
Bypass paths around the lake: Data Lake Bypass - BATCH and Data Lake Bypass - STREAM.
24. 1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
25. Azure Data Lake Store
• HDFS-compatible data store for housing data in its native, raw format
• Built on Apache YARN
• Process and store petabyte-size files
• Enterprise security through Azure Active Directory
• No storage limit
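To make the store concrete, here is a minimal sketch of uploading a file into the RAW zone with the azure-datalake-store Python SDK (Data Lake Store Gen1); the tenant/client credentials, store name, and local path are placeholders.

from azure.datalake.store import core, lib, multithread

# Service-principal login; all IDs below are placeholders.
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<client-id>",
                 client_secret="<client-secret>")

adls = core.AzureDLFileSystem(token, store_name="<store-name>")

# Chunked, parallel upload of a local file into the RAW zone.
multithread.ADLUploader(adls, lpath="events.csv",
                        rpath="/datalake/01_RAW/events.csv", overwrite=True)

print(adls.ls("/datalake/01_RAW"))  # verify the file landed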
29. Data Lake Ingestion
• Visual Studio
• Azure Portal (limited)
• SSIS Data Lake Destination and SSIS Data Lake Task
• Azure Data Factory
• PowerShell
• ADLCopy
• Sqoop (HDInsight)
• Other Apache tools
30. 1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
31. Azure Data Lake Analytics
• Big Data Queries as a Service
• Analytics Federation
• Develop in U-SQL, .NET, R, and Python
• Cognitive Services
• Scale Instantly
• Pay Per Job
32. 1. Big Data Overview
2. Data Lake Concepts
3. Azure Data Lake Store
4. Azure Data Lake Analytics
5. U-SQL
33. Intro to U-SQL
KEY FEATURES
• Combines SQL and C#
• Patterned File Processing – Thousands of Files
• Extensions: Python, R, Cognitive
• Query Data where it Lives (Federated Querying)
• Partition and Distribution of Data for Massive Parallelism
• Manage Structure and Shared Programming through U-SQL Catalog
• U-SQL Procedures
34. U-SQL : Extract Query
U-SQL
// (1) Extract fields from every CSV file in the RAW zone into a rowset.
@MyExtract =
    EXTRACT
        Field1 string,
        Field2 int,
        Field3 int?
    FROM "/datalake/01_RAW/{*}.csv"
    USING Extractors.Csv();
T-SQL equivalent
CREATE TABLE myTable
(
    Field1 VARCHAR(100),
    Field2 INT,
    Field3 INT NOT NULL
);
INSERT INTO myTable (Field1, Field2, Field3)
SELECT
    CAST(Field1 AS VARCHAR(100)) AS Field1,
    CAST(Field2 AS INT) AS Field2,
    CONVERT(INT, Field3) AS Field3
FROM myTable;
35. U-SQL : Extract Query
U-SQL
// (1) Extract fields from every CSV file in the RAW zone into a rowset.
@MyExtract =
    EXTRACT
        Field1 string,
        Field2 int,
        Field3 int?
    FROM "/datalake/01_RAW/{*}.csv"
    USING Extractors.Csv();
// (2) Aggregate the extracted rowset.
@MyAgg =
    SELECT
        Field1,
        MAX(Field2) AS Field2
    FROM @MyExtract
    GROUP BY Field1;
(The T-SQL comparison is unchanged from slide 34.)
36. U-SQL : Extract Query
U-SQL
// (1) Extract fields from every CSV file in the RAW zone into a rowset.
@MyExtract =
    EXTRACT
        Field1 string,
        Field2 int,
        Field3 int?
    FROM "/datalake/01_RAW/{*}.csv"
    USING Extractors.Csv();
// (2) Aggregate the extracted rowset.
@MyAgg =
    SELECT
        Field1,
        MAX(Field2) AS Field2
    FROM @MyExtract
    GROUP BY Field1;
// (3) Write the result to the STAGE zone.
OUTPUT @MyAgg
TO "/datalake/02_STAGE/MyOutput.csv"
USING Outputters.Csv();
(The T-SQL comparison is unchanged from slide 34.)
37. U-SQL : Extractors and Outputters
CURRENT EXTRACTORS
• Csv()
• Tsv()
• Txt()
CURRENT OUTPUTTERS
• Csv()
• Tsv()
• Txt()
43. U-SQL : Tables
GUIDELINES
1. Tables must have a clustered index
2. Use tables when improving performance with distribution/partitioning
3. Use tables when you have multiple large files
4. Don't use tables when there is no filtering, joining, or grouping
44. U-SQL : Tables
DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName;
CREATE TABLE <adla>.<database>.<schema>.tableName
(
    Field1 int,
    Field2 string,
    Field3 int?,
    INDEX idx_1 CLUSTERED(Field1) DISTRIBUTED BY HASH(Field2)
);
45. U-SQL : Tables
// Same definition, populated from a query.
DROP TABLE IF EXISTS <adla>.<database>.<schema>.tableName;
CREATE TABLE <adla>.<database>.<schema>.tableName
(
    Field1 int,
    Field2 string,
    Field3 int?,
    INDEX idx_1 CLUSTERED(Field1) DISTRIBUTED BY HASH(Field2)
)
AS SELECT …
46. U-SQL : Tables
As slide 45, but populated from an extraction: … AS EXTRACT …
47. U-SQL : Tables
As slide 45, but populated from a table-valued function: … AS TVF …
49. U-SQL : Tables
EXTERNAL
• Metadata is stored in the U-SQL catalog; the data lives on the source
• Prefer a VIEW or TVF over EXTRACT
• Supported sources: Azure SQL DB, Azure SQL DW, Azure SQL VM
50. U-SQL : Data Federation
(Diagram: current state of data federation, reached through extractors.)
51. U-SQL : Operators
COMPARISON OPERATORS: IS NULL, ==, >, >=, !=
LOGICAL OPERATORS: AND, OR, NOT, BETWEEN, IN / NOT IN, LIKE / NOT LIKE
52. U-SQL : Functions
REPORTING FUNCTIONS
• COUNT
• SUM
• MIN
• MAX
• AVG
• STDEV
• VAR
RANKING FUNCTIONS
• RANK
• DENSE_RANK
• NTILE
• ROW_NUMBER
ANALYTIC FUNCTIONS
• CUME_DIST
• PERCENT_RANK
• PERCENTILE_CONT
• PERCENTILE_DISC
54. Advice (Copyright Talavant, Inc. 2017)
• Identify the value of the data lake approach
• Data Lake: invest time and strategy into data lake design
• U-SQL: utilize U-SQL constructs before C#
• U-SQL: understand and control data through partitioning