Big data processing increasingly needs to address not just querying big data but also applying domain-specific algorithms to large amounts of data at scale. This ranges from developing and applying machine learning models to custom, domain-specific processing of images, texts, and other data. Often the domain experts and programmers have a favorite language in which they implement their algorithms, such as Python, R, or C#. The Microsoft Azure Data Lake Analytics service makes it easy for customers to bring their domain expertise and their favorite languages to address their big data processing needs. In this session, I will showcase how you can bring your Python, R, and .NET code and apply it at scale using U-SQL.
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...Michael Rys
When analyzing big data, you often have to process data at scale that is not rectangular in nature, and you would like to scale out your existing programs and cognitive algorithms to analyze it. To address this need and make it easy for the programmer to add her domain-specific code, U-SQL includes a rich extensibility model that lets you process any kind of data, from CSV files through JSON and XML to image files, and add your own custom operators. In this presentation, we provide examples of how to use U-SQL to process interesting data formats with custom extractors and functions (including JSON and images), show how to use U-SQL's cognitive library, and finally show how U-SQL allows you to invoke custom code written in Python and R.
Slides for the SQL Saturday 635 presentation, Vancouver BC, Aug 2017.
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Michael Rys
Data Lakes have become a new tool in building modern data warehouse architectures. In this presentation we introduce Microsoft's Azure Data Lake offering and its big data processing language U-SQL, which makes big data processing easy by combining the declarativity of SQL with the extensibility of C#. We give an initial introduction to U-SQL by explaining why we introduced it, show with an example how to analyze some tweet data with U-SQL and its extensibility capabilities, and take you on an introductory tour of U-SQL geared towards existing SQL users.
Slides for SQL Saturday 635, Vancouver BC, Aug 2017.
Killer Scenarios with Data Lake in Azure with U-SQLMichael Rys
Presentation from Microsoft Data Science Summit 2016
Presents 4 examples of custom U-SQL data processing: Overlapping Range Aggregation, JSON Processing, Image Processing and R with U-SQL
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
When processing TBs and PBs of data, running your big data queries at scale and having them perform at their peak is essential. In this session, we show you some state-of-the-art tools for analyzing U-SQL job performance, and we discuss in-depth best practices for designing your data layout, both for files and tables, and for writing performant and scalable U-SQL queries. You will learn how to analyze performance and scale bottlenecks and pick up several tips on how to make your big data processing scripts both faster and more scalable.
U-SQL Query Execution and Performance TuningMichael Rys
This 400 level presentation explains the U-SQL Query Execution in Azure Data Lake and provides several Performance Tuning tips: What tools are available and some best practices.
A concentrated look at Apache Spark's Spark SQL library, including background information and numerous Scala code examples of using Spark SQL with CSV, JSON, and databases such as MySQL.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living in the age of the deconstructed database. Modern enterprises are powered by data, and that data lives in many formats and locations, in flight and at rest, but somewhat surprisingly, the lingua franca for data remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
A talk given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Cloudera, Inc.
Speaker: Marcel Kornacker
As data is ingested into Apache Hadoop at an increasing rate from a diverse range of data sources, it is becoming more and more important for users that new data be accessible for analysis as quickly as possible—because “data freshness” can have a direct impact on business results.
In the traditional ETL process, raw data is transformed from the source into a target schema, possibly requiring flattening and condensing, and then loaded into an MPP DBMS. However, this approach has multiple drawbacks that make it unsuitable for real-time, “at-source” analytics—for example, the “ETL lag” reduces data freshness, and the inherent complexity of the process makes it costly to deploy and maintain, and reduces the speed at which new analytic applications can be introduced.
In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics—including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines. This performance level is attained in spite of the most challenging and diverse input formats, which are addressed through an automated background conversion process into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem.
In this talk, attendees will learn about Impala’s upcoming features that will enable at-source analytics: support for nested structures such as JSON and XML documents, which allows direct querying of the source schema; automated background file format conversion into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem; and automated creation of declaratively-specified derived data for simplified data cleansing and transformation.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...Amazon Web Services
In this session, we’ll look at the AWS services that customers are using to build and deploy Microsoft-based solutions that use technologies like Windows, .NET, SQL Server, and PowerShell. We’ll start by showing you how to build a Windows-based CI/CD pipeline on AWS using AWS CodeDeploy, AWS CodePipeline, AWS CloudFormation, and PowerShell using an AWS Quick Start. We’ll also cover best practices for how you can create templates that let you automatically deploy ready-to-use Windows products by leveraging services and tools like AWS CloudFormation, PowerShell, and Git. Woot, an online retailer for electronics, will share how it moved from using a complex mix of custom PowerShell code for its DevOps processes to using services like Amazon EC2 Simple Systems Manager (SSM), AWS CodeDeploy, and AWS Directory Service. This migration eliminated the need for complex PowerShell scripts and reduced the operational complexity of performing operational tasks like renaming servers, joining domains, and securely handling keys.
Azure Resource Manager templates: Improve deployment time and reusabilityStephane Lapointe
Azure Resource Manager is the future of Azure, and its templating features are a big improvement and simplification of how you provision resources on Azure. See how you can use Visual Studio to create complex, multi-resource ARM templates and how they can be combined and reused. Learn the different template functions available and how they help you build more advanced templates.
Introduction to Azure Data Lake and U-SQL presented at Seattle Scalability Meetup, January 2016. Demo code available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.
Continuous Integration and the Data Warehouse - PASS SQL Saturday SloveniaDr. John Tunnicliffe
Continuous integration is not normally associated with data warehouse projects due to the perceived complexity of implementation. John shows how modern tools make it simple to apply CI to the data warehouse. The session covers:
* The benefits of the SQL Server Data Tools declarative model
* Using PowerShell and psake to automate your build and deployments
* Implementing the TeamCity build server
* Integration and regression testing
* Auto-code generation within SSDT using T4 templates and DacFx
Alex Thissen "Server-less compute with .NET based Azure Functions"Fwdays
Azure Functions offer serverless cloud services and are the next evolution in distributed computing and hosting after containers. Join this session to learn how to develop Azure Functions using C# and .NET, and how to test, build, and deploy them to Azure. We will cover .NET programming details, architecture, internals, and hosting. You will also learn how to set up a local development environment to build and host functions locally. After this session you will be ready to design and create state-of-the-art Azure Functions using .NET.
In this presentation you will learn about:
• CloudFormation 101
– The building block of Infrastructure as Code
• CodePipeline and CodeCommit 101
– Tools for our IaC pipeline
• Review of an example IaC Pipeline
– Automated validation
– Least privilege enforcement
– Manual review/approval
Getting started with Appcelerator TitaniumTechday7
Techday7: "Getting Started with Appcelerator Titanium," from the cross-platform application development with Appcelerator Titanium event, presented by Naga Harish M, Lead Developer at Anubavam Technologies.
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Michael Rys
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Michael Rys
Presentation by James Baker and me on running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage (ADLS) at Microsoft Ignite 2020. Covers the modern data warehouse architecture supported by Azure Synapse, the integration benefits with ADLS, and cost-reducing features such as Query Acceleration, integrated Spark and SQL processing with shared metadata, and .NET for Apache Spark support.
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
The presentation discusses how to migrate expensive open-source big data workloads to Azure and leverage the latest compute and storage innovations within Azure Synapse and Azure Data Lake Storage to develop powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
More and more customers who are looking to modernize their analytics are exploring the data lake approach in Azure. Typically, they are most challenged by a bewildering array of poorly integrated technologies and a variety of data formats and data types, not all of which are conveniently handled by existing ETL technologies. In this session, we'll explore the basic shape of a modern ETL pipeline through the lens of Azure Data Lake. We will explore how this pipeline can scale from one to thousands of nodes at a moment's notice to respond to business needs, how its extensibility model allows pipelines to simultaneously integrate procedural code written in .NET languages or even Python and R, how that same extensibility model allows pipelines to deal with a variety of formats such as CSV, XML, JSON, images, or any enterprise-specific document format, and finally how the next generation of ETL scenarios is enabled through the integration of intelligence in the data layer in the form of built-in cognitive capabilities.
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
From theory to implementation - follow the steps of implementing an end-to-end analytics solution illustrated with some best practices and examples in Azure Data Lake.
During this full training day we will share the architecture patterns, tooling, learnings, and tips and tricks for building such services on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time to develop some of your own U-SQL scripts to process your data, and discuss the pros and cons of files versus tables.
These are the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand grow and supply continue to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R code at scale with U-SQL (SQLBits and SQLKonferenz 2018)
1. Run Python, R and .NET
code at Data Lake scale
with U-SQL in Azure
Data Lake
Michael Rys
Principal Program Manager Big Data Team, Microsoft
@MikeDoesBigData
usql@microsoft.com
2. Agenda
• Characteristics of Big Data Analytics Programming
• Scaling out existing code with U-SQL:
• Scaling out Cognitive Libraries
• Introduction to U-SQL’s Extensibility Framework
• Scaling out .NET with U-SQL:
• Custom Image processing
• Scaling out Python with U-SQL
• Scaling out R with U-SQL:
• Model generation, Model testing and scoring
3. Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
Characteristics
of Big Data
Analytics
• Requires processing
of any type of data
• Allow use of custom
algorithms
• Scale to any size and
be efficient
Bring your own coding expertise and
existing code and scale it out?
4. Status Quo:
SQL for
Big Data
Declarativity does scaling and
parallelization for you
Extensibility is bolted on and
not “native”
hard to work with anything other than
structured data
difficult to extend with custom code:
complex installations and frameworks
Limited to one or two languages
5. Status Quo:
Programming
Languages for
Big Data
Extensibility through custom code
is “native”
Declarativity is bolted on and
not “native”
User often has to
care about scale and performance
SQL is 2nd class within string, only local
optimizations
Often no code reuse/
sharing across queries
6. Why U-SQL? Declarativity and Extensibility
are equally native!
Get benefits of both!
Scales out your custom imperative Code
(written in .NET, Python, R, and more to come)
in a declarative SQL-based framework
R
Python
.NET
U-SQL Framework
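For illustration, a minimal U-SQL sketch of this combination (the file path and column names are made up, not taken from the deck): the query is declarative SQL, while every expression in it is ordinary C# that the compiler parallelizes for you.
@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();
// The WHERE predicate and the SELECT expressions below are plain C#;
// U-SQL scales them out across vertices automatically.
@rs =
    SELECT Region.ToUpperInvariant() AS NormalizedRegion,
           Query,
           (Query ?? "").Length AS QueryLength
    FROM @searchlog
    WHERE Start >= DateTime.Parse("2018-01-01");
OUTPUT @rs
TO "/output/queries_by_region.csv"
USING Outputters.Csv();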
8. Scale Out Cognitive Library
https://github.com/Azure/usql/tree/master/Examples/ImageApp
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive
Car Green
Parked
Outdoor
Racing
9. REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
REFERENCE ASSEMBLY ImageOcr;
@imgs =
EXTRACT FileName string, ImgData byte[]
FROM @"/images/{FileName}.jpg"
USING new Cognition.Vision.ImageExtractor();
// Extract the number of objects on each image and tag them
@objects =
PROCESS @imgs
PRODUCE FileName,
NumObjects int,
Tags SqlMap<string, float?>
READONLY FileName
USING new Cognition.Vision.ImageTagger();
OUTPUT @objects
TO "/objects.tsv"
USING Outputters.Tsv();
Imaging
10. REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];
@WarAndPeace =
EXTRACT No int,
Year string,
Book string, Chapter string,
Text string
FROM @"/usqlext/samples/cognition/war_and_peace.csv"
USING Extractors.Csv();
@sentiment =
PROCESS @WarAndPeace
PRODUCE No,
Year,
Book, Chapter,
Text,
Sentiment string,
Conf double
USING new Cognition.Text.SentimentAnalyzer(true);
OUTPUT @sentiment
TO "/sentiment.tsv"
USING Outputters.Tsv();
Text Analysis
11. U-SQL/Cognitive
Example
• Identify objects in images (tags)
• Identify faces and emotions and images
• Join datasets – find out which tags are associated with happiness
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
@objects =
PROCESS MegaFaceView
PRODUCE FileName, NumObjects int, Tags SqlMap<string,float?>
READONLY FileName
USING new Cognition.Vision.ImageTagger();
@tags =
SELECT FileName, T.Tag
FROM @objects CROSS APPLY EXPLODE(Tags.Split) AS T(Tag, Conf)
WHERE Tag.Contains("dog") OR Tag.Contains("cat");
@emotion =
SELECT ImageName, Details.Emotion
FROM MegaFaceView
CROSS APPLY new Cognition.Vision.EmotionApplier(imgCol:"image")
AS Details(NumFaces int, FaceIndex int,
RectX float, RectY float, Width float, Height float,
Emotion string, Confidence float);
@correlation =
SELECT T.FileName, Emotion, Tag
FROM @emotion AS E
INNER JOIN
@tags AS T
ON E.FileName == T.FileName;
Images
Objects Emotions
filter
join
aggregate
12. U-SQL extensibility
Extend U-SQL with C#/.NET, Python, R etc.
Built-in operators,
functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
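A small sketch of the first two extension points (the assembly and function names MyDb.MyCode and MyCode.Formatters.CleanName are hypothetical, used only for illustration): an inline C# expression next to a user-defined function from a registered assembly.
REFERENCE ASSEMBLY MyDb.MyCode; // hypothetical assembly containing the UDF
@cleaned =
    SELECT CustomerId,
           Name.Trim().ToLowerInvariant() AS NormalizedName, // inline C# expression
           MyCode.Formatters.CleanName(Name) AS CleanName    // user-defined function
    FROM @customers; // @customers assumed extracted earlier in the script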
13. What are UDOs? • User-Defined Extractors
• Converts files into rowset
• User-Defined Outputters
• Converts rowset into files
• User-Defined Processors
• Take one row and produce one row
• Pass-through versus transforming
• User-Defined Appliers
• Take one row and produce 0 to n rows
• Used with OUTER/CROSS APPLY
• User-Defined Combiners
• Combines rowsets (like a user-defined join)
• User-Defined Reducers
• Take n rows and produce m rows (normally m<n)
• Scaled out with explicit U-SQL Syntax that takes a UDO
instance (created as part of the execution):
• EXTRACT
• OUTPUT
• CROSS APPLY
Custom Operator Extensions in
language of your choice
Scaled out by U-SQL
• PROCESS
• COMBINE
• REDUCE
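To make the list concrete, a minimal sketch of one UDO kind, a pass-through processor (class and column names are illustrative, not from the deck), together with the PROCESS statement that scales it out:
using Microsoft.Analytics.Interfaces;
[SqlUserDefinedProcessor]
public class CountryNameProcessor : IProcessor
{
    // Takes one row and produces one row: expands a country code into a name.
    public override IRow Process(IRow input, IUpdatableRow output)
    {
        string code = input.Get<string>("CountryCode");
        output.Set<string>("CountryCode", code);
        output.Set<string>("CountryName", code == "DE" ? "Germany" : "Unknown");
        return output.AsReadOnly();
    }
}
// Invocation in the U-SQL script:
@expanded =
    PROCESS @rows
    PRODUCE CountryCode string,
            CountryName string
    USING new CountryNameProcessor();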
14. Scaling out C# with U-SQL
https://github.com/Azure/usql/tree/master/Examples/ImageApp
Copyright Camera
Make
Camera
Model
Thumbnail
Michael Canon 70D
Michael Samsung S7
15. How to specify
.NET UDOs?
• .Net API provided to build UDOs
• Any .Net language usable
• however only C# is first-class in tooling
• Use U-SQL specific .Net DLLs
• Deploying UDOs
• Compile DLL
• Upload DLL to ADLS
• register with U-SQL script
• VisualStudio provides tool support
• UDOs can
• Invoke managed code
• Invoke native code deployed with UDO assemblies
• Invoke other language runtimes (e.g., Python, R)
• be scaled out by U-SQL execution framework
• UDOs cannot
• Communicate between different UDO invocations
• Call Webservices or Reach outside the vertex
boundary
17. • C# Class Project for U-SQL
How to specify UDOs?
18. [SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
private byte[] _row_delim;
private string _col_delim;
private Encoding _encoding;
// Define a non-default constructor since I want to pass in my own parameters
public DriverExtractor( string row_delim = "\r\n", string col_delim = ","
, Encoding encoding = null )
{
_encoding = encoding == null ? Encoding.UTF8 : encoding;
_row_delim = _encoding.GetBytes(row_delim);
_col_delim = col_delim;
} // DriverExtractor
// Converting text to target schema
private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
{
var schema = outputrow.Schema;
if (schema[i].Type == typeof(int))
{
var tmp = Convert.ToInt32(c);
outputrow.Set(i, tmp);
}
...
} //SerializeCol
public override IEnumerable<IRow> Extract( IUnstructuredReader input
, IUpdatableRow outputrow)
{
foreach (var row in input.Split(_row_delim))
{
using(var s = new StreamReader(row, _encoding))
{
int i = 0;
foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
{
OutputValueAtCol_I(c, i++, outputrow);
} // foreach
} // using
yield return outputrow.AsReadOnly();
} // foreach
} // Extract
} // class DriverExtractor
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing
pattern
• Rowsets and their schemas
in UDOs
• Setting results
• By position
• By name
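For illustration of the last bullet, the two ways a UDO can set an output cell (the column name "FileName" is just an example):
// by name
outputrow.Set<string>("FileName", fileName);
// by position (column 0 of the output schema)
outputrow.Set<string>(0, fileName);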
19. Managing Assemblies
• Create assemblies
• Reference assemblies
• Enumerate assemblies
• Drop assemblies
• VisualStudio makes registration easy!
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by system.dll, system.core.dll, system.data.dll,
System.Runtime.Serialization.dll, mscorlib.dll (e.g.,
System.Text, System.Text.RegularExpressions,
System.Linq)
• Add all other .Net Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• Powershell command
• U-SQL Studio Server Explorer and Azure Portal
• DROP ASSEMBLY db.assembly;
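A hedged end-to-end sketch of the assembly lifecycle (database, assembly, and path names are illustrative):
// Register a compiled UDO assembly previously uploaded to ADLS
CREATE ASSEMBLY IF NOT EXISTS MyDb.ImageOps FROM "/assemblies/ImageOps.dll";
// Reference it in the script that needs it
REFERENCE ASSEMBLY MyDb.ImageOps;
// Pull in a .NET Framework assembly that is not loaded by default
REFERENCE SYSTEM ASSEMBLY [System.Xml];
// Remove the assembly when it is no longer needed
DROP ASSEMBLY IF EXISTS MyDb.ImageOps;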
20. DEPLOY RESOURCE Syntax:
'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example:
DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Use Cases:
• Script specific configuration files (not stored with Asm)
• Script specific models
• Any other file you want to access from user code on all
vertices
Semantics:
• Files have to be in ADLS or WASB
• Files are deployed to vertex and are accessible from any custom
code
Limits:
• Single resource file limit is 400MB
• Overall limit for deployed resource files is 3GB
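Since deployed resources are copied to each vertex's working directory, user code can open them by file name. A small assumption-based sketch (reusing the configfile.xml name from the example above):
// Inside a UDO or code-behind method; "configfile.xml" was deployed
// with DEPLOY RESOURCE and sits in the vertex working directory.
var configXml = System.IO.File.ReadAllText("configfile.xml");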
21. U-SQL Vertex Code (.NET)
C#
C++
Algebra
Additional non-dll files &
Deployed resources
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ADLS DEPLOY RESOURCE
System files
(built-in Runtimes, Core DLLs, OS)
22. Scale Out Python With U-SQL
Python
Author Tweet
MikeDoesBigData @AzureDataLake: Come and see the #SQLKonferenz sessions on #USQL
AzureDataLake What are your recommendations for #SQLKonferenz? @MikeDoesBigData
Author Mentions Topics
MikeDoesBigData {@AzureDataLake} {#SQLKonferenz, #USQL}
AzureDataLake {@MikeDoesBigData} {#SQLKonferenz}
23. REFERENCE ASSEMBLY [ExtPython];
DECLARE @myScript = @"
def get_mentions(tweet):
return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )
def usqlml_main(df):
del df['time']
del df['author']
df['mentions'] = df.tweet.apply(get_mentions)
del df['tweet']
return df
";
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS D( date, time, author, tweet );
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer(pyScript:@myScript);
Use U-SQL to create a massively
distributed program.
Executing Python code across
many nodes.
Using standard libraries such as
numpy and pandas.
Documentation:
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
Python
Extensions
24. U-SQL Vertex Code (Python)
C#
C++
Algebra
Additional Python Libs and Script
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ExtPython
ADLS DEPLOY RESOURCE
Script.py
OtherLibs.zip
System files
(built-in Runtimes, Core DLLs, OS)
Python Python Engine & Libs
27. R running in U-
SQL
Generate a linear model
SampleScript_LM_Iris.R
REFERENCE ASSEMBLY [ExtR];
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFileModelSummary string =
@"/my/R/Output/LMModelSummaryCoefficientsIrisFromRCommand.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
#do not return readonly columns and make sure that the column names are
the same in usql and r scripts,
outputToUSQL=data.frame(summary(lm.fit)$coefficients)
colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"",
""Pr"")
outputToUSQL";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData = SELECT 0 AS Par, * FROM @InputData;
@ModelCoefficients = REDUCE @ExtendedData ON Par
PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript,
rReturnType:"dataframe");
OUTPUT @ModelCoefficients TO @OutputFileModelSummary USING Outputters.Tsv();
28. R running in U-
SQL
Use a previously
generated model
REFERENCE ASSEMBLY master.ExtR;
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda"; //
Prediction Model
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/Output/LMPredictionsIris.csv";
DECLARE @PartitionCount int = 10;
// R script to run
DECLARE @myRScript = @"
load(""my_model_LM_Iris.rda"")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
//Randomly partition the data to apply the model in parallel
@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
FROM @InputData;
// Predict Species
@RScriptOutput =
REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe",
stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions
USING Outputters.Csv(outputHeader:true);
29. U-SQL Vertex Code (R)
C#
C++
Algebra
Additional R Libs and Script
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ExtR
ADLS DEPLOY RESOURCE
Script.R
OtherLibs.zip
System files
(built-in Runtimes, Core DLLs, OS)
R R Engine & Libs
31. Scaling Out your Code and Language with U-SQL
Bring your Code or Write your Custom Operator Extensions in
.Net (C#, F#, etc)
Python
R
…
Scaled out by U-SQL
32. Additional
Resources
• Blogs and community page:
• http://usql.io (U-SQL Github)
• http://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.microsoft.com/mrys/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
• http://aka.ms/usql_reference
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• https://msdn.microsoft.com/magazine/mt790200
• http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
• http://aka.ms/adlfeedback
Continue your education
at Microsoft Virtual
Academy online.
33. Thank you for your attention!
usql@microsoft.com@MikeDoesBigData
http://aka.ms/azuredatalake
Editor's Notes
Add velocity?
Hard to operate on unstructured data: even Hive requires metadata to be created to operate on unstructured data. Adding custom Java functions, aggregators, and SerDes involves a lot of steps, often requires access to the server's head node, and differs based on the type of operation. Requires many tools and steps.
Some examples:
Hive UDAgg
Code and compile .java into .jar
Extend AbstractGenericUDAFResolver class: Does type checking, argument checking and overloading
Extend GenericUDAFEvaluator class: implements logic in 8 methods.
- Deploy:
Deploy jar into class path on server
Edit FunctionRegistry.java to register as built-in
Update the content of show functions with ant
Hive UDF (as of v0.13)
Code
Load JAR into head node or at URI
CREATE FUNCTION USING JAR to register and load jar into classpath for every function (instead of registering jar and just use the functions)
Spark supports Custom “inputters and outputters” for defining custom RDDs
No UDAGGs
Simple integration of UDFs but only for duration of program. No reuse/sharing.
Cloud Dataflow? Requires the user to care about scale and performance
Spark UDAgg
Is not yet supported ( SPARK-3947)
Spark UDF
Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
For SQL usage, register the table: customerTable.registerTempTable("customerTable")
Register each UDF: sqlContext.udf.register("westernState", westernState _)
Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
Makes it easy for you by unifying:
Declarative and imperative
Unstructured and structured data processing
Local and remote Queries
Increase productivity and agility from Day 1 and at Day 100 for YOU!
ADL uses U-SQL to create a distributed, parallel job using simple declarative statements and provides discrete points for attaching user code
U-SQL is built on top of existing frameworks and languages
Extensions require .NET assemblies to be registered with a database