SlideShare a Scribd company logo
Spark Structured APIs
Using Databricks
Presented By:
Raviyanshu Singh
Software Consultant
Knoldus Inc
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit a constructive
feedback for all sessions as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, feel free to move out of
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
Our Agenda
01 What is Spark
02 What’s an RDD
03 Dataframes
04 Datasets
Databricks
05
05
06 Demo
What is Spark?
Unified Analytics Engine
Apache Spark is a unified engine designed for large-scale distributed data
processing, on premises in data centers or in the cloud.
Spark’s design philosophy is based
on these principles:
● Speed
● Ease of Use
● Modularity
● Extensibility
00
Spark APIs Trio
RDD, Dataframe & Datasets
Distributed collections of
JVM objects
Functional Operators
(Map, filter etc)
2011
Distributed collections of
Row objects.
Expression based
operations and UDFs
Fast/Efficient and
internal representations
2013
Internally rows,
externally
JVM objects.
“Best of both the
worlds”:
type safe + fast
2015
RDD Dataframe Datasets
The Timeline of Three
Whatʼs RDD?
[Resilient Distributed Datasets]
2013 2017 2018
● An RDD represents an immutable, partitioned collection of records that can be operated on in
parallel.
● RDDs gives you complete control because every record in RDD is just a Java or Python object.
RDD
Dependencies Partitions
Compute Function
Partition => Iterator[T]
Characteristics of an RDD
RDD Characteristics
2013 2017 2018
1. Dependencies
➢ The List of dependencies that instructs spark how an RDD is constructed.
➢ Spark can recreate an RDD from these dependencies and replicate operations on them.
(This characteristic gives RDDs resiliency)
2. Partitions
➢ This provide spark the ability to distribute the work to parallelize computation across executors.
➢ Spark also uses locality information to send work to executors close to the data.
(This characteristic gives RDDs distribution)
3. Compute Function
➢ An abstract method that computes the input split partition in the TaskContext to produce a
collection of values (of type T)
compute(split: Partition, context: TaskContext): Iterator[T]
Visualizing RDD
Simple &
Elegant
Whatʼs the Problem?
RDDs Expresses How-to Not What-to
Compute Function (or computation)
is opaque to Spark
Slow for non JVM languages like
Python
No optimization by Spark
No data compression techniques
Leading to inadvertent
inefficiencies
Dataframe
Solution is in structuring
What we mean by Structuring?
● Ordering and Structuring for allowing to arrange your data in
tabular format.
● Expressing computation using patterns like filtering, selecting,
counting etc.
The DataFrame API
Distributed in-memory tables with named columns and schemas, (where each_column ==
specific_datatype[String, Int, Timestamp etc.] )
To Human Eye DataFrame is like a table.
Visualizing Dataframes
With Custom Data
Spark Operations on Data
Manoeuvring Data
Transformation
Spark
Operation Head of IT
Actions
Finance Manager
Marketing Manager
● Transforming a Spark DF into a new
DF without altering the original data.
● Giving Immutability property.
● Actions are operations that returns the
raw value.
● It triggers the Lazy Evaluation of all the
recorded transformation
Transformations Actions
show()
take()
count()
collect()
orderBy()
groupBy()
filter()
select()
Common Dataframe Ops
Projections & Filter
➢ A way to return only the rows matching a certain relational condition by using filters.
➢ Projections are done with the select() method, while filters can be expressed using the filter() or where() method.
val topHits = df.select("Id", "First", "Url")
.where($"Hits" > 10000)
Renaming, Adding, and Dropping Columns
➢ Using withColumnRenamed() we can rename the column, just withColumn() will add new column and
drop() will drop the column specified inside it.
val newDf = df.withColumnRenamed("First","First_Name").withColumnRenamed("Last", "Last_Name")
val dfWithTS = newDf.withColumn("Issued_Date", to_timestamp(col("Published"), "dd/MM/yyyy"))
.drop("Published")
Common Dataframe Ops
Aggregation
➢ Transformations and actions on DataFrames, such as groupBy(), orderBy(), and count(), offer the ability to aggregate by column names and
then aggregate counts across them.
val mostShare = dfWithTS.select("Campaigns","First_Name").where(col("Campaigns").isNotNull)
.groupBy("Campaigns")
.count()
.orderBy(desc("count"))
The Datasets API
A Type-Safe one
According to the Dataset Documentation:
➢ A strongly typed collection of domain-specific objects that can be
transformed in parallel using functional or relational operations. Each
Dataset [in Scala] also has an untyped view called a DataFrame, which
is a Dataset of Row.
DataFrame
DataSets
Structured
APIs
Untyped APIs
Typed APIs
● Dataframe = Dataset[Row]
● Alias in Scala
● Dataset[T]
● In Scala & Java
Visualizing Datasets
Case Class (Type-Safe Hero)
Datasets Ops
Databricks?
A LakeHouse Company
● The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and
maintaining enterprise-grade data solutions at scale.
● Databricks integrates with cloud storage and security in your cloud account, and manages and deploys cloud
infrastructure on your behalf.
Common Tools In Databricks
Core Data Tasks
REST API
Interactive
Notebooks
ML Model
Serving
Workflows
Scheduler
Source
Controlling
(GIt)
SQL Editor &
Dashboard
Compute
Management
Data
Ingestion
DEMO
Thank You !

More Related Content

Similar to Spark Structured APIs

Deeplearning in production
Deeplearning in productionDeeplearning in production
Deeplearning in production
Paris Data Engineers !
 
2015 02-09 - NoSQL Vorlesung Mosbach
2015 02-09 - NoSQL Vorlesung Mosbach2015 02-09 - NoSQL Vorlesung Mosbach
2015 02-09 - NoSQL Vorlesung Mosbach
Johannes Hoppe
 
Apache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best PractiseApache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best Practise
Knoldus Inc.
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Spark Summit
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
Jose Mº Muñoz
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native Compilation
PGConf APAC
 
Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2
Rajeev Rastogi (KRR)
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Addressing Scenario
Addressing ScenarioAddressing Scenario
Addressing Scenario
Tara Hardin
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
Adam Doyle
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Spark Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Databricks
 
SQL Windowing
SQL WindowingSQL Windowing
SQL Windowing
Sandun Perera
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 

Similar to Spark Structured APIs (20)

Deeplearning in production
Deeplearning in productionDeeplearning in production
Deeplearning in production
 
2015 02-09 - NoSQL Vorlesung Mosbach
2015 02-09 - NoSQL Vorlesung Mosbach2015 02-09 - NoSQL Vorlesung Mosbach
2015 02-09 - NoSQL Vorlesung Mosbach
 
Apache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best PractiseApache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best Practise
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native Compilation
 
Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Addressing Scenario
Addressing ScenarioAddressing Scenario
Addressing Scenario
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
SQL Windowing
SQL WindowingSQL Windowing
SQL Windowing
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 

More from Knoldus Inc.

Using InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in JmeterUsing InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in Jmeter
Knoldus Inc.
 
Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)
Knoldus Inc.
 
Stakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) PresentationStakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) Presentation
Knoldus Inc.
 
Introduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) PresentationIntroduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) Presentation
Knoldus Inc.
 
Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)
Knoldus Inc.
 
Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)
Knoldus Inc.
 
Clean Code in Test Automation Differentiating Between the Good and the Bad
Clean Code in Test Automation  Differentiating Between the Good and the BadClean Code in Test Automation  Differentiating Between the Good and the Bad
Clean Code in Test Automation Differentiating Between the Good and the Bad
Knoldus Inc.
 
Integrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test AutomationIntegrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test Automation
Knoldus Inc.
 
State Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptxState Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptx
Knoldus Inc.
 
Authentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptxAuthentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptx
Knoldus Inc.
 
OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)
Knoldus Inc.
 
Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
Knoldus Inc.
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Knoldus Inc.
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
Knoldus Inc.
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
Knoldus Inc.
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
Knoldus Inc.
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdf
Knoldus Inc.
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptx
Knoldus Inc.
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable Testing
Knoldus Inc.
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose Kubernetes
Knoldus Inc.
 

More from Knoldus Inc. (20)

Using InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in JmeterUsing InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in Jmeter
 
Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)
 
Stakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) PresentationStakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) Presentation
 
Introduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) PresentationIntroduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) Presentation
 
Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)
 
Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)
 
Clean Code in Test Automation Differentiating Between the Good and the Bad
Clean Code in Test Automation  Differentiating Between the Good and the BadClean Code in Test Automation  Differentiating Between the Good and the Bad
Clean Code in Test Automation Differentiating Between the Good and the Bad
 
Integrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test AutomationIntegrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test Automation
 
State Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptxState Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptx
 
Authentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptxAuthentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptx
 
OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)
 
Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdf
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptx
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable Testing
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose Kubernetes
 

Recently uploaded

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Spark Structured APIs

  • 1. Spark Structured APIs Using Databricks Presented By: Raviyanshu Singh Software Consultant Knoldus Inc
  • 2. Lack of etiquette and manners is a huge turn off. KnolX Etiquettes Punctuality Join the session 5 minutes prior to the session start time. We start on time and conclude on time! Feedback Make sure to submit a constructive feedback for all sessions as it is very helpful for the presenter. Silent Mode Keep your mobile devices in silent mode, feel free to move out of session in case you need to attend an urgent call. Avoid Disturbance Avoid unwanted chit chat during the session.
  • 3. Our Agenda 01 What is Spark 02 What’s an RDD 03 Dataframes 04 Datasets Databricks 05 05 06 Demo
  • 4. What is Spark? Unified Analytics Engine Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud. Spark’s design philosophy is based on these principles: ● Speed ● Ease of Use ● Modularity ● Extensibility
  • 5. 00 Spark APIs Trio RDD, Dataframe & Datasets Distributed collections of JVM objects Functional Operators (Map, filter etc) 2011 Distributed collections of Row objects. Expression based operations and UDFs Fast/Efficient and internal representations 2013 Internally rows, externally JVM objects. “Best of both the worlds”: type safe + fast 2015 RDD Dataframe Datasets The Timeline of Three
  • 6. Whatʼs RDD? [Resilient Distributed Datasets] 2013 2017 2018 ● An RDD represents an immutable, partitioned collection of records that can be operated on in parallel. ● RDDs gives you complete control because every record in RDD is just a Java or Python object. RDD Dependencies Partitions Compute Function Partition => Iterator[T] Characteristics of an RDD
  • 7. RDD Characteristics 2013 2017 2018 1. Dependencies ➢ The List of dependencies that instructs spark how an RDD is constructed. ➢ Spark can recreate an RDD from these dependencies and replicate operations on them. (This characteristic gives RDDs resiliency) 2. Partitions ➢ This provide spark the ability to distribute the work to parallelize computation across executors. ➢ Spark also uses locality information to send work to executors close to the data. (This characteristic gives RDDs distribution) 3. Compute Function ➢ An abstract method that computes the input split partition in the TaskContext to produce a collection of values (of type T) compute(split: Partition, context: TaskContext): Iterator[T]
  • 9. Whatʼs the Problem? RDDs Expresses How-to Not What-to Compute Function (or computation) is opaque to Spark Slow for non JVM languages like Python No optimization by Spark No data compression techniques Leading to inadvertent inefficiencies
  • 10. Dataframe Solution is in structuring What we mean by Structuring? ● Ordering and Structuring for allowing to arrange your data in tabular format. ● Expressing computation using patterns like filtering, selecting, counting etc. The DataFrame API Distributed in-memory tables with named columns and schemas, (where each_column == specific_datatype[String, Int, Timestamp etc.] ) To Human Eye DataFrame is like a table.
  • 12. Spark Operations on Data Manoeuvring Data Transformation Spark Operation Head of IT Actions Finance Manager Marketing Manager ● Transforming a Spark DF into a new DF without altering the original data. ● Giving Immutability property. ● Actions are operations that returns the raw value. ● It triggers the Lazy Evaluation of all the recorded transformation Transformations Actions show() take() count() collect() orderBy() groupBy() filter() select()
  • 13. Common Dataframe Ops Projections & Filter ➢ A way to return only the rows matching a certain relational condition by using filters. ➢ Projections are done with the select() method, while filters can be expressed using the filter() or where() method. val topHits = df.select("Id", "First", "Url") .where($"Hits" > 10000) Renaming, Adding, and Dropping Columns ➢ Using withColumnRenamed() we can rename the column, just withColumn() will add new column and drop() will drop the column specified inside it. val newDf = df.withColumnRenamed("First","First_Name").withColumnRenamed("Last", "Last_Name") val dfWithTS = newDf.withColumn("Issued_Date", to_timestamp(col("Published"), "dd/MM/yyyy")) .drop("Published")
  • 14. Common Dataframe Ops Aggregation ➢ Transformations and actions on DataFrames, such as groupBy(), orderBy(), and count(), offer the ability to aggregate by column names and then aggregate counts across them. val mostShare = dfWithTS.select("Campaigns","First_Name").where(col("Campaigns").isNotNull) .groupBy("Campaigns") .count() .orderBy(desc("count"))
  • 15. The Datasets API A Type-Safe one According to the Dataset Documentation: ➢ A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset [in Scala] also has an untyped view called a DataFrame, which is a Dataset of Row. DataFrame DataSets Structured APIs Untyped APIs Typed APIs ● Dataframe = Dataset[Row] ● Alias in Scala ● Dataset[T] ● In Scala & Java
  • 18. Databricks? A LakeHouse Company ● The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. ● Databricks integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf.
  • 19. Common Tools In Databricks Core Data Tasks REST API Interactive Notebooks ML Model Serving Workflows Scheduler Source Controlling (GIt) SQL Editor & Dashboard Compute Management Data Ingestion
  • 20. DEMO