The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
Spark SQL is a component of Apache Spark that introduces SQL support. It includes a DataFrame API that allows users to write SQL queries on Spark, a Catalyst optimizer that converts logical queries to physical plans, and data source APIs that provide a unified way to read/write data in various formats. Spark SQL aims to make SQL queries on Spark more efficient and extensible.
Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Spark SQL provides relational data processing capabilities in Spark. It introduces a DataFrame API that allows both relational operations on external data sources and Spark's built-in distributed collections. The Catalyst optimizer improves performance by applying database query optimization techniques. It is highly extensible, making it easy to add data sources, optimization rules, and data types for domains like machine learning. Spark SQL evaluation shows it outperforms alternative systems on both SQL query processing and Spark program workloads involving large datasets.
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
The document summarizes Spark SQL, which is a Spark module for structured data processing. It introduces key concepts like RDDs, DataFrames, and interacting with data sources. The architecture of Spark SQL is explained, including how it works with different languages and data sources through its schema RDD abstraction. Features of Spark SQL are covered such as its integration with Spark programs, unified data access, compatibility with Hive, and standard connectivity.
The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
Spark SQL is a component of Apache Spark that introduces SQL support. It includes a DataFrame API that allows users to write SQL queries on Spark, a Catalyst optimizer that converts logical queries to physical plans, and data source APIs that provide a unified way to read/write data in various formats. Spark SQL aims to make SQL queries on Spark more efficient and extensible.
Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Spark SQL provides relational data processing capabilities in Spark. It introduces a DataFrame API that allows both relational operations on external data sources and Spark's built-in distributed collections. The Catalyst optimizer improves performance by applying database query optimization techniques. It is highly extensible, making it easy to add data sources, optimization rules, and data types for domains like machine learning. Spark SQL evaluation shows it outperforms alternative systems on both SQL query processing and Spark program workloads involving large datasets.
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
The document summarizes Spark SQL, which is a Spark module for structured data processing. It introduces key concepts like RDDs, DataFrames, and interacting with data sources. The architecture of Spark SQL is explained, including how it works with different languages and data sources through its schema RDD abstraction. Features of Spark SQL are covered such as its integration with Spark programs, unified data access, compatibility with Hive, and standard connectivity.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
Jaws - Data Warehouse with Spark SQL by Ema OrhianSpark Summit
1) Jaws is a highly scalable and resilient data warehouse explorer that allows submitting Spark SQL queries concurrently and asynchronously through a RESTful API.
2) It provides features like persisted query logs, results pagination, and pluggable storage layers. Queries can be run on Spark SQL contexts configured to use data from HDFS, Cassandra, Parquet files on HDFS or Tachyon.
3) The architecture allows Jaws to scale on standalone, Mesos, or YARN clusters by distributing queries across multiple worker nodes, and supports canceling running queries.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
You've seen the basic 2-stage example Spark Programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
Spark SQL is a module for structured data processing in Spark. It provides DataFrames and the ability to execute SQL queries. Some key points:
- Spark SQL allows querying structured data using SQL, or via DataFrame/Dataset APIs for Scala, Java, Python, and R.
- It supports various data sources like Hive, Parquet, JSON, and more. Data can be loaded and queried using a unified interface.
- The SparkSession API combines SparkContext with SQL functionality and is used to create DataFrames from data sources, register databases/tables, and execute SQL queries.
Building a modern Application with DataFramesSpark Summit
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
In this talk, we’ll discuss technical designs of support of HBase as a “native” data source to Spark SQL to achieve both query and load performance and scalability: near-precise execution locality of query and loading, fine-tuned partition pruning, predicate pushdown, plan execution through coprocessor, and optimized and fully parallelized bulk loader. Point and range queries on dimensional attributes will benefit particularly well from the techniques. Preliminary test results vs. established SQL-on-HBase technologies will be provided. The speaker will also share the future plan and real-world use cases, particularly in the telecom industry.
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
The document provides an outline for the Spark Camp @ Strata CA tutorial. The morning session will cover introductions and getting started with Spark, an introduction to MLlib, and exercises on working with Spark on a cluster and notebooks. The afternoon session will cover Spark SQL, visualizations, Spark streaming, building Scala applications, and GraphX examples. The tutorial will be led by several instructors from Databricks and include hands-on coding exercises.
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible - there is an R package for almost any algorithm and the environment comes with integrated help and visualization. SparkR brings distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed within Apache Spark. It exposes Spark DataFrames, which was inspired by R data.frames, to R. With Spark DataFrames, and Spark’s in-memory computing engine, R users can interactively analyze and explore terabyte size data sets.
In this webinar, Hossein will introduce SparkR and how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools such as visualization and diverse packages in a real world data analysis project with big data.
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
Alpine academy apache spark series #1 introduction to cluster computing with python & a wee bit of scala. This is the first in the series and is aimed at the intro level, the next one will cover MLLib & ML.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
This Edureka Spark SQL Tutorial will help you to understand how Apache Spark offers SQL power in real-time. This tutorial also demonstrates an use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:
1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization.
As a follow-up of last year’s “Lessons From The Field”, this session will review some common anti-patterns I’ve seen in the field that could introduce performance or stability issues to your Spark jobs. We’ll look at ways of better understanding your Spark jobs and identifying solutions to these anti-patterns to help you write better performing and more stable applications.
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
This document introduces .NET for Apache Spark, which allows .NET developers to use the Apache Spark analytics engine for big data and machine learning. It discusses why .NET support is needed for Apache Spark given that much business logic is written in .NET. It provides an overview of .NET for Apache Spark's capabilities including Spark DataFrames, machine learning, and performance that is on par or faster than PySpark. Examples and demos are shown. Future plans are discussed to improve the tooling, expand programming experiences, and provide out-of-box experiences on platforms like Azure HDInsight and Azure Databricks. Readers are encouraged to engage with the open source project and provide feedback.
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
Jaws - Data Warehouse with Spark SQL by Ema OrhianSpark Summit
1) Jaws is a highly scalable and resilient data warehouse explorer that allows submitting Spark SQL queries concurrently and asynchronously through a RESTful API.
2) It provides features like persisted query logs, results pagination, and pluggable storage layers. Queries can be run on Spark SQL contexts configured to use data from HDFS, Cassandra, Parquet files on HDFS or Tachyon.
3) The architecture allows Jaws to scale on standalone, Mesos, or YARN clusters by distributing queries across multiple worker nodes, and supports canceling running queries.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
You've seen the basic 2-stage example Spark Programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
Spark SQL is a module for structured data processing in Spark. It provides DataFrames and the ability to execute SQL queries. Some key points:
- Spark SQL allows querying structured data using SQL, or via DataFrame/Dataset APIs for Scala, Java, Python, and R.
- It supports various data sources like Hive, Parquet, JSON, and more. Data can be loaded and queried using a unified interface.
- The SparkSession API combines SparkContext with SQL functionality and is used to create DataFrames from data sources, register databases/tables, and execute SQL queries.
Building a modern Application with DataFramesSpark Summit
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
In this talk, we’ll discuss technical designs of support of HBase as a “native” data source to Spark SQL to achieve both query and load performance and scalability: near-precise execution locality of query and loading, fine-tuned partition pruning, predicate pushdown, plan execution through coprocessor, and optimized and fully parallelized bulk loader. Point and range queries on dimensional attributes will benefit particularly well from the techniques. Preliminary test results vs. established SQL-on-HBase technologies will be provided. The speaker will also share the future plan and real-world use cases, particularly in the telecom industry.
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
The document provides an outline for the Spark Camp @ Strata CA tutorial. The morning session will cover introductions and getting started with Spark, an introduction to MLlib, and exercises on working with Spark on a cluster and notebooks. The afternoon session will cover Spark SQL, visualizations, Spark streaming, building Scala applications, and GraphX examples. The tutorial will be led by several instructors from Databricks and include hands-on coding exercises.
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible - there is an R package for almost any algorithm and the environment comes with integrated help and visualization. SparkR brings distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed within Apache Spark. It exposes Spark DataFrames, which was inspired by R data.frames, to R. With Spark DataFrames, and Spark’s in-memory computing engine, R users can interactively analyze and explore terabyte size data sets.
In this webinar, Hossein will introduce SparkR and how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools such as visualization and diverse packages in a real world data analysis project with big data.
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
Alpine academy apache spark series #1 introduction to cluster computing with python & a wee bit of scala. This is the first in the series and is aimed at the intro level, the next one will cover MLLib & ML.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
This Edureka Spark SQL Tutorial will help you to understand how Apache Spark offers SQL power in real-time. This tutorial also demonstrates an use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:
1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization.
As a follow-up of last year’s “Lessons From The Field”, this session will review some common anti-patterns I’ve seen in the field that could introduce performance or stability issues to your Spark jobs. We’ll look at ways of better understanding your Spark jobs and identifying solutions to these anti-patterns to help you write better performing and more stable applications.
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
This document introduces .NET for Apache Spark, which allows .NET developers to use the Apache Spark analytics engine for big data and machine learning. It discusses why .NET support is needed for Apache Spark given that much business logic is written in .NET. It provides an overview of .NET for Apache Spark's capabilities including Spark DataFrames, machine learning, and performance that is on par or faster than PySpark. Examples and demos are shown. Future plans are discussed to improve the tooling, expand programming experiences, and provide out-of-box experiences on platforms like Azure HDInsight and Azure Databricks. Readers are encouraged to engage with the open source project and provide feedback.
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks
This document discusses how SQL can be used in Lucidworks Fusion for various purposes like aggregating signals to compute relevance scores, ingesting and transforming data from various sources using Spark SQL, enabling self-service analytics through tools like Tableau and PowerBI, and running experiments to compare variants. It provides examples of using SQL for tasks like sessionization with window functions, joining multiple data sources, hiding complex logic in user-defined functions, and powering recommendations. The document recommends SQL in Fusion for tasks like analytics, data ingestion, machine learning, and experimentation.
Predicates allow filtering events based on:
- Event properties (fields)
- Session properties
- System properties
They are evaluated synchronously when the event fires. This allows filtering events and reducing overhead compared to capturing all events.
Common predicates:
- event_name = 'sql_statement_completed'
- database_id = 5
- cpu_time > 1000
Predicates give granular control over what events are captured.
This session introduces tools that can help you analyze and troubleshoot performance with SharePoint 2013. This sessions presents tools like perfmon, Fiddler, Visual Round Trip Analyzer, IIS LogParser, Developer Dashboard and of course we create Web and Load Tests in Visual Studio 2013.
At the end we also take a look at some of the tips and best practices to improve performance on SharePoint 2013.
CCI2019 - Monitorare SQL Server Senza Andare in Bancarottawalk2talk srl
Monitorare SQL Server può diventare un affare decisamente costoso. Certo, sul mercato ci sono moltissime soluzioni a pagamento, ma che fare se le istanze sono molte e i soldi sono pochi?
In questa sessione combineremo diversi strumenti open source (InfluxDB, Telegraf , Grafana, DbaTools and many more) per raccogliere metriche di performance significative, analizzare i dati raccolti, creare degli alert per gli eventi critici, fare troubleshooting dei problemi e pianificare le risorse per il futuro. Raggiungimi in questa sessione e vedrai che il monitoring non è un business per milionari.
By Gianluca Sartori
This document discusses observability and how to implement it using logging, metrics, and distributed tracing. It recommends using the three pillars together to gain insights into a distributed system. Spring Boot utilities like Actuator, Micrometer, and Spring Cloud Sleuth can provide much of the functionality out of the box. Centralized logging, metrics collection with Prometheus/Grafana, and distributed tracing with Zipkin are suggested for full observability.
Presto is a distributed SQL query engine that Treasure Data provides as a service. Taro Saito discussed the internals of the Presto service at Treasure Data, including how the TD Presto connector optimizes scan performance from storage systems and how the service manages multi-tenancy and resource allocation for customers. Key challenges in providing a database as a service were also covered, such as balancing cost and performance.
All we know that REST services are almost everywhere now and nearly all new projects use it.
But do we really know how to design proper interfaces? What are pitfalls and how to avoid them?
I did many REST service designs and have a bunch of tips and tricks you definitely would like to use.
It will save you and your team a lot of time in future.
This document outlines an agenda for an advanced Splunk user training workshop. The workshop covers topics like field aliasing, common information models, event types, tags, dashboard customization, index replication for high availability, report acceleration, and lookups. It provides overviews and examples for each topic and directs attendees to additional documentation resources for more in-depth learning. The workshop also includes demonstrations of dashboard customization techniques and discusses support options through the Splunk community.
The document discusses Parse's process for benchmarking MongoDB upgrades by replaying recorded production workloads on test servers. They found a 33-75% drop in throughput when upgrading from 2.4.10 to 2.6.3 due to query planner bugs. Working with MongoDB, they identified and helped fix several bugs, improving performance in 2.6.5 but still below 2.4.10 levels initially. Further optimization work increased throughput above 2.4.10 levels when testing with more workers and operations.
Upgrading an application’s database can be daunting.Doing this for tens ofthousands of apps at atime is downright scary.New bugs combined with unique edge cases can result in reduced performance,downtime, and plenty of frustration. Learn how Parse is working to avoid these issues as we upgrade to 2.6 with advanced benchmarking tools and aggressive troubleshooting
Learn how to track the user experience of your web applications across multiple geographical locations. Measure and understand the ways customers interact with your software. Discover the basic skills and information needed to get the most out of your synthetic monitoring.
Discover the server in your IT ecosystem automatically and understand how various app components interact with each other. Reach us out at appmanager-support@manageengine.com
This document summarizes key learnings from a presentation about SharePoint 2013 and Enterprise Search. It discusses how to run a successful search project through planning, development, testing and deployment. It also covers infrastructure needs and capacity testing findings. Additionally, it provides examples of how to customize the user experience through display templates and Front search. Methods for crawling thousands of file shares and enriching indexed content are presented. The document concludes with discussions on relevancy, managing property weighting, changing ranking models, and tuning search results.
Building high performance and scalable share point applicationsTalbott Crowell
SharePoint custom application development can sometimes be challenging. This presentation at SPS New Hampshire on October 18th, 2014 covers some techniques and strategies on improving performance and scalability of your applications.
Marco Pozzan
Power BI consultant & Trainer
Scenario di utilizzo del real-time di Power BI. In questa sessione verrà introdotta la teoria sul real-time dashboarding offerto da Power BI. Poi ci si focalizzerà sun un caso pratico di real-time dataset in modalità ibrida per la realizzazione di una dashboard di controllo con la possibilità di effettuare il write back e permettere all’utente di effettuare analisi what-if.
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
Think you have big data? What about high availability
requirements? At DataDog we process billions of data points every day including metrics and events, as we help the world
monitor the their applications and infrastructure. Being the world’s monitoring system is a big responsibility, and thanks to
Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads
Similar to Introduction to Spark SQL training workshop (20)
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Découvrez les dernières innovations de Neo4j, et notamment les dernières intégrations cloud et les améliorations produits qui font de Neo4j un choix essentiel pour les développeurs qui créent des applications avec des données interconnectées et de l’IA générative.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...kalichargn70th171
A dynamic process unfolds in the intricate realm of software development, dedicated to crafting and sustaining products that effortlessly address user needs. Amidst vital stages like market analysis and requirement assessments, the heart of software development lies in the meticulous creation and upkeep of source code. Code alterations are inherent, challenging code quality, particularly under stringent deadlines.
E-commerce Development Services- Hornet DynamicsHornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
Artificia Intellicence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows seven steps methods for delivering their services to their customers. They called it the Software development life cycle process (SDLC).
Requirement — Collecting the Requirements is the first Phase in the SSLC process.
Feasibility Study — after completing the requirement process they move to the design phase.
Design — in this phase, they start designing the software.
Coding — when designing is completed, the developers start coding for the software.
Testing — in this phase when the coding of the software is done the testing team will start testing.
Installation — after completion of testing, the application opens to the live server and launches!
Maintenance — after completing the software development, customers start using the software.
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Looking for a reliable mobile app development company in Noida? Look no further than Drona Infotech. We specialize in creating customized apps for your business needs.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsPeter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
3. Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
4. Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
5. Why learn Spark SQL?
• Most popular component in Spark
• Spark Survey 2015
• Use cases
• ETL
• Analytics
• Feature Extraction for machine learning
% of users
0 18 35 53 70
Spark SQL
DataFrames
MLlib, GraphX
Streaming
6. Use case: ETL & analytics
• Example: restaurant finder app
• Log data: Timestamp, UserID, Location, RestaurantType
• [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ]
• Analytics
• What time of day do users use the app?
• What is the most popular restaurant type in San Jose, CA?
Logs ETL Analytics
Spark SQL Spark SQL
7. How Spark SQL fits into Spark (2.0)
Spark Core (RDD)
Catalyst
SQL DataFrame / Dataset
ML Pipelines
Structured
Streaming
GraphFrames
Spark SQL
http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
9. SQL or DataFrame?
• Use SQL if you are already familiar with SQL
• Use DataFrame
• To write queries in a general-purpose programming language
(Scala, Python, …).
• Use DataFrame to catch syntax errors earlier:
SQL DataFrame
Syntax Error
Example
“SELEECT id FROM table” df.seleect(“id”)
Caught at Runtime Compile Time
10. Loading and examining a table, Query with SQL
• See Notebook: http://tinyurl.com/spark-nb1
11. Setup for Hands-on Training
1. Sign on to WiFi with your assigned access code
1. See slip of paper in front of your seat
2. Sign in to https://community.cloud.databricks.com/
3. Go to "Clusters" and create a Spark 2.0 cluster
1. This may take a minute.
4. Go to “Workspace” -> Users -> Home -> Create ->
Notebook
1. Select Language = Scala
2. Create
12. Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
14. Lazy Execution
• DataFrame operations are lazy
• Work is delayed until the last possible moment
• Transformations: DF -> DF
• select, groupBy; no computation done
• Actions: DF -> console or disk output
• show, collect, count, write; computation is done
https://www.flickr.com/photos/mtch3l/24491625352
15. Lazy Execution Example
1. val df1 = df.select(…)
2. val df2 = df1.groupBy (…)
3. .sum()
4. if (cond)
5. df2.show()
• Benefits of laziness
• Query optimization across lines 1-3
• If step 5 is not executed, then no unnecessary work was done
Transformation: no
computation done
Transformation: no
computation done
Action: performs the
select, groupBy at this
time, then shows the
results
16. Caching
• When querying the same data set over and over, caching it
in memory may speed up queries.
• Back to notebook …
Disk Memory Results
Memory Results
Without
caching:
With
caching:
17. Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
18. Use case: Feature Extraction for ML
• Example: restaurant finder app
• Log data: Timestamp, UserID, Location, RestaurantType
• [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ]
• Machine Learning to train a model of user preferences
• Use Spark SQL to extract features for the model
• Example features: hour of day, distance to a restaurant, restaurant
type
Logs ETL Features ML Training
Spark SQL Spark SQL
See Notebook …
20. Dataset (new in 2.0)
• DataFrames are untyped
• df.select($”col1” + 3)
• Useful when exploring new data
• Datasets are typed
• Dataset[T]
• Associates an object of type T with each row
• Catches type mismatches at compile time
• DataFrame = Dataset[Row]
• A DataFrame is one specific type of Dataset[T]
case class FarmersMarket(FMID: Int, MarketName: String)
val ds : Dataset[FarmersMarket] …
Numerical type assumed, but
not checked at compile time
21. Review
• Part 1: Spark SQL Overview, SQL Queries √
• Part 2: DataFrame Queries √
• Part 3: Additional DataFrame Functions √
22. References
• Spark SQL: http://spark.apache.org/docs/latest/sql-
programming-guide.html
• Spark Scala API docs: http://spark.apache.org/docs/latest/
api/scala/index.html#org.apache.spark.package
• Overview of DataFrames: http://
xinhstechblog.blogspot.com/2016/05/overview-of-spark-
dataframe-api.html
• Questions, comments:
• Spark user list: user@spark.apache.org
• Xinh’s contact: https://www.linkedin.com/in/xinh-huynh-317608
• Women in Big Data: https://www.womeninbigdata.org/