Instead of manually setting up a Spark Infrastructure, worrying about scaling out and scaling down, managing data stored in a shared memory, and taking care of it's back up too, Databrick provides an alternative. Databricks allows the developers to worry about the code that will run on the Spark cluster, and not the cluster's infrastructure. It creates an envelope around Spark, abstracting all the Infrastructure requirements from the Developer. Not only the cluster, Databricks also manages the shared FileSystem that will be used for storing the data. It provides notebooks that can be used to run spark code interactively, in multiple languages. A user can also leverage Databrick's Spark-submit job feature.
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouSpark Summit
Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next generation real-time applications is to use them together? In this session, Christos Erotocritou – Lead GridGain solutions architect, will explain in detail how IgniteRDD – an implementation of native Spark RDD and DataFrame APIs – shares the state of the RDD across other Spark jobs, applications and workers. Christos will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames. Furthermore we will be discussing the newest feature additions and what the future holds for this integration.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then you could run some simple queries or build stale dashboards? Those days are over, today you need instant analytics as the data is streaming in real-time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to analyze instantly data from IoT, sensors, transportation systems, Logs, REST endpoints, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trio = fast analytics at scale
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions on how to get tutorial materials will be covered in class.
What you’ll learn:
Understand the key data reliability challenges
How Delta Lake brings reliability to data lakes at scale
Understand how Delta Lake fits within an Apache Spark™ environment
How to use Delta Lake to realize data reliability improvements
Prerequisites
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Pre-register for Databricks Community Edition
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouSpark Summit
Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next generation real-time applications is to use them together? In this session, Christos Erotocritou – Lead GridGain solutions architect, will explain in detail how IgniteRDD – an implementation of native Spark RDD and DataFrame APIs – shares the state of the RDD across other Spark jobs, applications and workers. Christos will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames. Furthermore we will be discussing the newest feature additions and what the future holds for this integration.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then you could run some simple queries or build stale dashboards? Those days are over, today you need instant analytics as the data is streaming in real-time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to analyze instantly data from IoT, sensors, transportation systems, Logs, REST endpoints, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trio = fast analytics at scale
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions on how to get tutorial materials will be covered in class.
What you’ll learn:
Understand the key data reliability challenges
How Delta Lake brings reliability to data lakes at scale
Understand how Delta Lake fits within an Apache Spark™ environment
How to use Delta Lake to realize data reliability improvements
Prerequisites
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Pre-register for Databricks Community Edition
Delta Lake, an open-source innovations which brings new capabilities for transactions, version control and indexing your data lakes. We uncover how Delta Lake benefits and why it matters to you. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta lake provides snapshot isolation which helps concurrent read/write operations and enables efficient insert, update, deletes, and rollback capabilities. It allows background file optimization through compaction and z-order partitioning achieving better performance improvements. In this presentation, we will learn the Delta Lake benefits and how it solves common data lake challenges, and most importantly new Delta Time Travel capability.
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
Machine learning pipelines are a hot topic at the moment. Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of running machine learning models in production.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingMichael Rainey
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
Snowflake: The most cost-effective agile and scalable data warehouse ever!Visual_BI
In this webinar, the presenter will take you through the most revolutionary data warehouse, Snowflake with a live demo and technical and functional discussions with a customer. Ryan Goltz from Chesapeake Energy and Tristan Handy, creator of DBT Cloud and owner of Fishtown Analytics will also be joining the webinar.
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
The reality of most large scale data deployments includes storage decoupled from computation, pipelines operating directly on files and metadata services with no locking mechanisms or transaction tracking. For this reason attempts at achieving transactional behavior, snapshot isolation, safe schema evolution or performant support for CRUD operations has always been marred with tradeoffs.
This talk will focus on technical aspects, practical capabilities and the potential future of three table formats that have emerged in recent years as solutions to the issues mentioned above – ACID ORC (in Hive 3.x), Iceberg and Delta Lake. To provide a richer context, a comparison between traditional databases and big data tools as well as an overview of the reasons for the current state of affairs will be included.
After the talk, the audience is expected to have a clear understanding of the current development trends in large scale table formats, on the conceptual and practical level. This should allow the attendees to make better informed assessments about which approaches to data warehousing, metadata management and data pipelining they should adapt in their organizations.
In this session we take an in-depth look into the Apache Atlas open metadata and governance function.
Open metadata and governance is a moon-shot type of project to create a set of open APIs, types, and interchange protocols to allow all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks to automate the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
Apache Atlas is the reference implementation of the Open Metadata and Governance standards and framework (https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance). This function will enable an Apache Atlas server to synchronize and query metadata from any open metadata-compliant metadata repository.
In this session we will cover how Open Metadata and Governance works. This includes: (1) the key components in Atlas, (2) the different integration patterns and APIs that vendors can use to integrate their technology into the open metadata ecosystem, and (3) how common metadata use cases such as searching for data sets, managing security (through Atlas/Ranger integration), and automated metadata discovery work in the active ecosystem.
Speaker
Mandy Chessell, Distinguished Engineer, IBM
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
Many organizations using machine learning are facing challenges storing and versioning their complex ML data as well as a large number of models generated from those data. To simplify this process, organizations tend to start building their customized ‘ML platforms.’
Modularized ETL Writing with Apache SparkDatabricks
Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse.
Since all our writes to the Data Warehouse are through Apache Spark, we took advantage of that to add more modules that supplement ETL writing. Config driven and purposeful, these modules perform tasks onto a Spark Dataframe meant for a destination Hive table.
These are organized as a sequence of transformations on the Apache Spark dataframe prior to being written to the table.These include a process of journalizing. It is a process which helps maintain a non-duplicated historical record of mutable data associated with different parts of our business.
Data quality, another such module, is enabled on the fly using Apache Spark. Using Apache Spark we calculate metrics and have an adjacent service to help run quality tests for a table on the incoming data.
And finally, we cleanse data based on provided configurations, validate and write data into the warehouse. We have an internal versioning strategy in the Data Warehouse that allows us to know the difference between new and old data for a table.
Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the Data Warehouse thus relieving us, programmatically, of most of the data problems. This talk focuses on ETL writing in Stitch Fix and describes these modules that help our Data Scientists on a daily basis.
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads
Introducing Delta Live Tables: Make Reliable ETL Easy on Delta LakeDatabricks
As the size and complexity of data grows, building reliable data pipelines is increasingly important, but also complex and challenging. Learn how Delta Live Tables simplifies the ETL lifecycle on Delta Lake, helping data engineering teams greatly simplify ETL development and ongoing management to improve data quality, scale operations, and reduce cost.
Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
With Lakehouse as the future of data architecture, Delta becomes the de facto data storage format for all the data pipelines. By using delta, to build the curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end data pipeline, which are executed regularly (mostly daily) depending on the need. As data travels through each hop, its quality improves and becomes suitable for end-user consumption. On the other hand real-time capabilities are key for any business and an added advantage, luckily Delta has seamless integration with structured streaming which makes it easy for users to achieve real-time capability using Delta. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons and we are already seeing the rise in adoption among our users.
In this talk, we will discuss various functional components of structured streaming with Delta as a streaming source. Deep dive into Query Progress Logs(QPL) and their significance for operating streams in production. How to track the progress of any streaming job and map it with the source Delta table using QPL. What exactly gets persisted in the checkpoint directory and its details. Mapping the contents of the checkpoint directory with the QPL metrics and understanding the significance of contents in the checkpoint directory with respect to Delta streams.
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
Few solutions exist in the open-source community either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options – Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we’ll compare these different open-source products across different dimensions, like maturity, documentation, extensibility, features like data profiling and anomaly detection.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries everyday. Uber engineers will share the design, architecture & use-cases of the second generation of ‘Hudi’, a self contained Apache Spark library to build large scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) is created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems.
This presentation explains where it fits into the data eco system and how it helps implement your system in Rust
FIWARE Wednesday Webinars - How to Design DataModelsFIWARE
How to Design DataModels - 8th May 2019
Corresponding webinar recording: https://youtu.be/T_1DpKf6C_c
Understanding and applying Standard Data Models.
Chapter: Core
Difficulty: 3
Audience: Technical Domain Specific
Presenter: José Manuel Cantera (Senior Standardization Expert, FIWARE Foundation)
Azure Databricks (For Data Analytics).pptxKnoldus Inc.
Azure Databricks is a cloud-based platform that simplifies big data analytics. It's essential because it streamlines data processing, analysis, and visualization, crucial for deriving insights from vast datasets. It works by managing Spark clusters automatically, enabling users to focus on analysis. Different data bricks, like Azure Blob Storage, can be integrated seamlessly. Implementing Azure Databricks with Blob Storage involves creating a workspace, configuring Blob Storage, generating access keys, mounting Blob Storage as a filesystem, and utilizing data processing capabilities within Databricks notebooks. This integration enhances data accessibility and analysis, facilitating informed decision-making.
Delta Lake, an open-source innovations which brings new capabilities for transactions, version control and indexing your data lakes. We uncover how Delta Lake benefits and why it matters to you. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta lake provides snapshot isolation which helps concurrent read/write operations and enables efficient insert, update, deletes, and rollback capabilities. It allows background file optimization through compaction and z-order partitioning achieving better performance improvements. In this presentation, we will learn the Delta Lake benefits and how it solves common data lake challenges, and most importantly new Delta Time Travel capability.
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
Machine learning pipelines are a hot topic at the moment. Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of running machine learning models in production.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingMichael Rainey
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
Snowflake: The most cost-effective agile and scalable data warehouse ever!Visual_BI
In this webinar, the presenter will take you through the most revolutionary data warehouse, Snowflake with a live demo and technical and functional discussions with a customer. Ryan Goltz from Chesapeake Energy and Tristan Handy, creator of DBT Cloud and owner of Fishtown Analytics will also be joining the webinar.
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
The reality of most large scale data deployments includes storage decoupled from computation, pipelines operating directly on files and metadata services with no locking mechanisms or transaction tracking. For this reason attempts at achieving transactional behavior, snapshot isolation, safe schema evolution or performant support for CRUD operations has always been marred with tradeoffs.
This talk will focus on technical aspects, practical capabilities and the potential future of three table formats that have emerged in recent years as solutions to the issues mentioned above – ACID ORC (in Hive 3.x), Iceberg and Delta Lake. To provide a richer context, a comparison between traditional databases and big data tools as well as an overview of the reasons for the current state of affairs will be included.
After the talk, the audience is expected to have a clear understanding of the current development trends in large scale table formats, on the conceptual and practical level. This should allow the attendees to make better informed assessments about which approaches to data warehousing, metadata management and data pipelining they should adapt in their organizations.
In this session we take an in-depth look into the Apache Atlas open metadata and governance function.
Open metadata and governance is a moon-shot type of project to create a set of open APIs, types, and interchange protocols to allow all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks to automate the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
Apache Atlas is the reference implementation of the Open Metadata and Governance standards and framework (https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance). This function will enable an Apache Atlas server to synchronize and query metadata from any open metadata-compliant metadata repository.
In this session we will cover how Open Metadata and Governance works. This includes: (1) the key components in Atlas, (2) the different integration patterns and APIs that vendors can use to integrate their technology into the open metadata ecosystem, and (3) how common metadata use cases such as searching for data sets, managing security (through Atlas/Ranger integration), and automated metadata discovery work in the active ecosystem.
Speaker
Mandy Chessell, Distinguished Engineer, IBM
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
Many organizations using machine learning are facing challenges storing and versioning their complex ML data as well as a large number of models generated from those data. To simplify this process, organizations tend to start building their customized ‘ML platforms.’
Modularized ETL Writing with Apache SparkDatabricks
Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse.
Since all our writes to the Data Warehouse are through Apache Spark, we took advantage of that to add more modules that supplement ETL writing. Config driven and purposeful, these modules perform tasks onto a Spark Dataframe meant for a destination Hive table.
These are organized as a sequence of transformations on the Apache Spark dataframe prior to being written to the table.These include a process of journalizing. It is a process which helps maintain a non-duplicated historical record of mutable data associated with different parts of our business.
Data quality, another such module, is enabled on the fly using Apache Spark. Using Apache Spark we calculate metrics and have an adjacent service to help run quality tests for a table on the incoming data.
And finally, we cleanse data based on provided configurations, validate and write data into the warehouse. We have an internal versioning strategy in the Data Warehouse that allows us to know the difference between new and old data for a table.
Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the Data Warehouse thus relieving us, programmatically, of most of the data problems. This talk focuses on ETL writing in Stitch Fix and describes these modules that help our Data Scientists on a daily basis.
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads
Introducing Delta Live Tables: Make Reliable ETL Easy on Delta LakeDatabricks
As the size and complexity of data grows, building reliable data pipelines is increasingly important, but also complex and challenging. Learn how Delta Live Tables simplifies the ETL lifecycle on Delta Lake, helping data engineering teams greatly simplify ETL development and ongoing management to improve data quality, scale operations, and reduce cost.
Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
With Lakehouse as the future of data architecture, Delta becomes the de facto data storage format for all the data pipelines. By using delta, to build the curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end data pipeline, which are executed regularly (mostly daily) depending on the need. As data travels through each hop, its quality improves and becomes suitable for end-user consumption. On the other hand real-time capabilities are key for any business and an added advantage, luckily Delta has seamless integration with structured streaming which makes it easy for users to achieve real-time capability using Delta. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons and we are already seeing the rise in adoption among our users.
In this talk, we will discuss various functional components of structured streaming with Delta as a streaming source. Deep dive into Query Progress Logs(QPL) and their significance for operating streams in production. How to track the progress of any streaming job and map it with the source Delta table using QPL. What exactly gets persisted in the checkpoint directory and its details. Mapping the contents of the checkpoint directory with the QPL metrics and understanding the significance of contents in the checkpoint directory with respect to Delta streams.
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
Few solutions exist in the open-source community either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options – Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we’ll compare these different open-source products across different dimensions, like maturity, documentation, extensibility, features like data profiling and anomaly detection.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries everyday. Uber engineers will share the design, architecture & use-cases of the second generation of ‘Hudi’, a self contained Apache Spark library to build large scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) is created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems.
This presentation explains where it fits into the data eco system and how it helps implement your system in Rust
FIWARE Wednesday Webinars - How to Design DataModelsFIWARE
How to Design DataModels - 8th May 2019
Corresponding webinar recording: https://youtu.be/T_1DpKf6C_c
Understanding and applying Standard Data Models.
Chapter: Core
Difficulty: 3
Audience: Technical Domain Specific
Presenter: José Manuel Cantera (Senior Standardization Expert, FIWARE Foundation)
Azure Databricks (For Data Analytics).pptxKnoldus Inc.
Azure Databricks is a cloud-based platform that simplifies big data analytics. It's essential because it streamlines data processing, analysis, and visualization, crucial for deriving insights from vast datasets. It works by managing Spark clusters automatically, enabling users to focus on analysis. Different data bricks, like Azure Blob Storage, can be integrated seamlessly. Implementing Azure Databricks with Blob Storage involves creating a workspace, configuring Blob Storage, generating access keys, mounting Blob Storage as a filesystem, and utilizing data processing capabilities within Databricks notebooks. This integration enhances data accessibility and analysis, facilitating informed decision-making.
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
Yao Yao Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Data Engineering A Deep Dive into DatabricksKnoldus Inc.
During this session, you'll gain a comprehensive understanding of Databricks' capabilities for efficiently processing and managing data, with a focus on Apache Spark for data transformation. We'll cover data ingestion methods, storage, orchestration, and best practices to ensure your data engineering workflows are optimized for success.
Data Engineering with Databricks PresentationKnoldus Inc.
We will explore how to leverage the Databricks Lakehouse Platform to productionalize ETL pipelines and also learn how to use Delta Live Tables with Spark SQL and PySpark to define and schedule pipelines that incrementally process new data from a variety of data sources into the Lakehouse, orchestrate tasks with Databricks Workflows, and promote code with Databricks Repos.
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (Mllib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
This is an introductory tutorial to Apache Spark at the Lagos Scala Meetup II. We discussed the basics of processing engine, Spark, how it relates to Hadoop MapReduce. Little handson at the end of the session.
Come può .NET contribuire alla Data Science? Cosa è .NET Interactive? Cosa c'entrano i notebook? E Apache Spark? E il pythonismo? E Azure? Vediamo in questa sessione di mettere in ordine le idee.
The exponential growth of data is not a problem but processing/managing the huge diversity of data is a concern. In this session, we are gonna discuss about one of the most popular big data distributed processing framework "Spark" where we'll be developing API using "Scala", a gained prominence amid big data developers.
Similar to Databricks and Logging in Notebooks (20)
Using InfluxDB for real-time monitoring in JmeterKnoldus Inc.
Explore the integration of InfluxDB with JMeter for real-time performance monitoring. This session will cover setting up InfluxDB to capture JMeter metrics, configuring JMeter to send data to InfluxDB, and visualizing the results using Grafana. Learn how to leverage this powerful combination to gain real-time insights into your application's performance, enabling proactive issue detection and faster resolution.
Intoduction to KubeVela Presentation (DevOps)Knoldus Inc.
KubeVela is an open-source platform for modern application delivery and operation on Kubernetes. It is designed to simplify the deployment and management of applications in a Kubernetes environment. KubeVela is a modern software delivery platform that makes deploying and operating applications across today's hybrid, multi-cloud environments easier, faster and more reliable. KubeVela is infrastructure agnostic, programmable, yet most importantly, application-centric. It allows you to build powerful software, and deliver them anywhere!
Stakeholder Management (Project Management) PresentationKnoldus Inc.
A stakeholder is someone who has an interest in or who is affected by your project and its outcome. This may include both internal and external entities such as the members of the project team, project sponsors, executives, customers, suppliers, partners and the government. Stakeholder management is the process of managing the expectations and the requirements of these stakeholders.
Introduction To Kaniko (DevOps) PresentationKnoldus Inc.
Kaniko is an open-source tool developed by Google that enables building container images from a Dockerfile inside a Kubernetes cluster without requiring a Docker daemon. Kaniko executes each command in the Dockerfile in the user space using an executor image, which runs inside a container, such as a Kubernetes pod. This allows building container images in environments where the user doesn’t have root access, like a Kubernetes cluster.
Efficient Test Environments with Infrastructure as Code (IaC)Knoldus Inc.
In the rapidly evolving landscape of software development, the need for efficient and scalable test environments has become more critical than ever. This session, "Streamlining Development: Unlocking Efficiency through Infrastructure as Code (IaC) in Test Environments," is designed to provide an in-depth exploration of how leveraging IaC can revolutionize your testing processes and enhance overall development productivity.
Exploring Terramate DevOps (Presentation)Knoldus Inc.
Terramate is a code generator and orchestrator for Terraform that enhances Terraform's capabilities by adding features such as code generation, stacks, orchestration, change detection, globals, and more . It's primarily designed to help manage Terraform code at scale more efficiently . Terramate is particularly useful for managing multiple Terraform stacks, providing support for change detection and code generation 2. It allows you to create relationships between stacks to improve your understanding and control over your infrastructure . One of the key features of Terramate is its ability to detect changes at both the stack and module level. This capability allows you to identify which stacks and resources have been altered and selectively determine where you should execute commands.
Clean Code in Test Automation Differentiating Between the Good and the BadKnoldus Inc.
This session focuses on the principles of writing clean, maintainable, and efficient code in the context of test automation. The session will highlight the characteristics that distinguish good test automation code from bad, ultimately leading to more reliable and scalable testing frameworks.
Integrating AI Capabilities in Test AutomationKnoldus Inc.
Explore the integration of artificial intelligence in test automation. Understand how AI can enhance test planning, execution, and analysis, leading to more efficient and reliable testing processes. Explore the cutting-edge integration of Artificial Intelligence (AI) capabilities in Test Automation, a transformative approach shaping the future of software testing. This session will delve into practical applications, benefits, and considerations associated with infusing AI into test automation workflows.
State Management with NGXS in Angular.pptxKnoldus Inc.
NGXS is a state management pattern and library for Angular. NGXS acts as a single source of truth for your application's state - providing simple rules for predictable state mutations. In this session we will go through the main for components of NGXS -Store, Actions, State, and Select.
Authentication in Svelte using cookies.pptxKnoldus Inc.
Svelte streamlines authentication with cookies, offering a secure and seamless user experience. Effortlessly manage sessions by storing tokens in cookies, ensuring persistent logins. With Svelte's simplicity, implement robust authentication mechanisms, enhancing user security and interaction.
OAuth2 Implementation Presentation (Java)Knoldus Inc.
The OAuth 2.0 authorization framework is a protocol that allows a user to grant a third-party web site or application access to the user's protected resources, without necessarily revealing their long-term credentials or even their identity. It is commonly used in scenarios such as user authentication in web and mobile applications and enables a more secure and user-friendly authorization process.
Supply chain security with Kubeclarity.pptxKnoldus Inc.
Kube clarity is a comprehensive solution designed to enhance supply chain security within Kubernetes environments. Kube clarity enables organizations to identify and mitigate potential security threats throughout the software development and deployment process.
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingKnoldus Inc.
In this session, we will delve into the world of web scraping with JSoup, an open-source Java library. Here we are going to learn how to parse HTML effectively, extract meaningful data, and navigate the Document Object Model (DOM) for powerful web scraping capabilities.
Akka gRPC Essentials A Hands-On IntroductionKnoldus Inc.
Dive into the fundamental aspects of Akka gRPC and learn to leverage its power in building compact and efficient distributed systems. This session aims to equip attendees with the essential skills and knowledge to leverage Akka and gRPC effectively in building robust, scalable, and distributed applications.
Entity Core with Core Microservices.pptxKnoldus Inc.
How Developers can use Entity framework(ORM) which provides a structured and consistent way for microservices to interact with their respective database, prompting independence, scaliblity and maintainiblity in a distributed system, and also provide a high-level abstraction for data access.
Introduction to Redis and its features.pptxKnoldus Inc.
Join us for an interactive session where we'll cover the fundamentals of Redis, practical use cases, and best practices for incorporating Redis into your projects. Whether you're a developer, architect, or system administrator, this session will equip you with the knowledge to harness the full potential of Redis for your applications. Get ready to elevate your understanding of in-memory data storage and revolutionize the way you handle data in your projects with Redis
GraphQL with .NET Core Microservices.pdfKnoldus Inc.
In this Webinar, will talk on GraphQL with .NET, that provides a modern and flexible approach to building APIs. It empowers developers to create efficient and tailored APIs that meet the specific needs of their applications and clients.
NuGet Packages Presentation (DoT NeT).pptxKnoldus Inc.
These packages and topics cover various aspects of .NET development, offering solutions for common needs in software development, including logging, database interaction, API communication, testing, security, and more. Depending on the requirements of your project, incorporating these packages can significantly enhance the development process.
Data Quality in Test Automation Navigating the Path to Reliable TestingKnoldus Inc.
Data Quality in Test Automation: Navigating the Path to Reliable Testing" delves into the crucial role of data quality within the realm of test automation. It explores strategies and methodologies for ensuring reliable testing outcomes by addressing challenges related to the accuracy, completeness, and consistency of test data. The discussion encompasses techniques for managing, validating, and optimizing data sets to enhance the effectiveness and efficiency of automated testing processes, ultimately fostering confidence in the reliability of software systems.
K8sGPTThe AI way to diagnose KubernetesKnoldus Inc.
K8sGPT is a tool for scanning your Kubernetes clusters, diagnosing and triaging issues in simple english. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI. In the session we will integrate OpenAI with K8sGPT to diagnose our Kubernetes Cluster.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
2. Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Respect Knolx session timings, you
are requested not to join sessions
after a 5 minutes threshold post
the session start time.
Feedback
Make sure to submit a constructive
feedback for all sessions as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, feel free to move out of
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
3. Agenda
What is Databricks?
Reasons to use Azure Databricks
Azure Databricks Core Artifacts
Logging in Scala Notebooks
Workspace Clusters
Notebooks Libraries
Jobs Data
4. What is Databricks ?
Industry - leading, Zero-management cloud platform built around Spark.
Delivers
- fully managed Spark clusters
- an interactive workspace for exploration and visualization
- a production pipeline scheduler
- a platform for powering your favorite Spark-based applications
So instead of tackling headaches like setting up infrastructures, creating data backup, scaling
your nodes according to load, you can finally focus on finding answers that make an immediate
impact on your business.
It is a product offered by a third - party, but it is offered as a first - class service tightly integrated
with AWS and Azure
5. Reasons to use Azure Databricks
Familiar Languages and Environment
Higher Productivity and Collaboration
Easy integration with Microsoft Stack
Extensive List of Data Sources
Suitable for Small Jobs too
Extensive Documentation and Support Available
7. Azure Databricks Artifact - Workspace
An environment inside Databricks
service with access to all your
databricks resources
Organizes various objects, like
Notebooks, Jars into Folders.
Provides easy one-click access to
computational resources, like clusters
and data stored
8. Azure Databricks Artifact - Clusters
Core Component of Databricks
A set of computational resources and
configurations
Runs our Data Engineering, Data
Science and Data Analytics
workloads
Types of Clusters:
- Interactive Clusters
- Automated Clusters
9. Azure Databricks Artifact - Data
● Create Tables directly from imported data.
The table schema is stored in the internal
databricks metastore
● Use Apache Spark commands to read data
from supported data source
● Import data into DBFS and use the DBFS
CLI, DBFS API, DBFS utilities, Spark APIs,
and local file APIs to access the data.
10. Azure Databricks Artifact - Notebooks
Web-Based interface to a Document
containing
- Runnable Code
- Visualizations
- Narrations
Support for multiple languages in the
same notebook
Real-time collaboration on the same
notebook
Revision History of the notebook
11. Azure Databricks Artifact - Jobs
As an alternative to running notebooks
interactively, you can set for a notebook or Jar
to either run immediately or on a scheduled
basis.
Three types of task can be run as jobs:
- Notebooks
- JARs
- Spark Submit
Notebook Job and JAR jobs can be
configured by passing parameters too
Spark-Submit can also be configured
12. Azure Databricks Artifact - Libraries
A Library can be installed on a cluster to
make some third-party or custom code
available to the running notebook or JAR
Libraries can be installed in 3 modes:
- Workspace Libraries
- Cluster Libraries
- Notebook scoped Libraries
13. Spark Jobs work with large amount of data and the tasks involve time consuming
computations.
They run at remote locations too. So, it becomes difficult to track the execution step without
logs.
Logs help to track at what point is the execution at, they help the developer to debug at what
points is the job consuming maximum of it’s time.
Logs often contain vast amounts of metadata, including date stamps, logger name [that can be
set to be the name of the Logging Class], source information such as cluster name. This data
helps in the debugging process.
Using Real-time Logs and messages logged in them, certain Alerting techniques can be used
to create notifications if a log containing a particular message appears.
Importance of Logs
14. Logging in Databricks Scala Notebooks
Databricks’ Log Delivery System
Delivers Spark Driver, Executor and Event
Logs to a location specified during configuring
the cluster
Delivered every 5 minutes
Incase a cluster terminates, Databricks make
sure to deliver all logs up till the termination to
the delivery location.
15. Logging in Databricks Scala Notebooks
Default Log4j Properties in Databricks
Two log4j.properties files:
For Driver:
For Executors:
16. Logging in Databricks Scala Notebooks
Overwriting the Default Log4j Properties with init scripts
Drawback - Every time configuration is to be changed, cluster restart is required
17. Logging in Databricks Scala Notebooks
Using External Log4j configuration for
your job
- Create a Log4j properties file for your
custom loggers and appenders
- Upload the file to your required DBFS
location
- Use Log4j’s PropertyConfigurator object
to configure using your custom log4j
properties