Big Data
transformations
powered by Apache Spark
Mohika Rastogi
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
 Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
 Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session
if you need to take an urgent call.
 Avoid Disturbance
Avoid unwanted chit-chat during the session.
 Introduction to BIG DATA
 What is Big Data
 Challenges and Benefits Involving Big Data
 Apache Spark in the Big Data Industry
 Data Transformations
 What is Data Transformation and why do we need
to transform data?
 Why Spark?
 Spark Transformations
 RDD, DataFrame, Wide vs. Narrow Transformations
 Examples of transformations
 Aggregate Functions
 Array Functions
 Spark Joins
 DEMO
What is Big Data?
 Big data is data that arrives in greater
variety, in increasing volumes, and with
higher velocity. These are known as
the three Vs.
 Put simply, big data is larger, more complex data
sets, especially from new data sources. These data
sets are so voluminous that traditional data
processing software just can’t manage them. But
these massive volumes of data can be used to
address business problems you wouldn’t have
been able to tackle before.
Challenges With Big Data
 Scalability and storage bottlenecks
 Noise accumulation
 Fault tolerance
 Incidental endogeneity and measurement errors
 Data quality
 Storage
 Lack of data science professionals
 Validating data
 Big data technology is changing at a rapid pace. A
few years ago, Apache Hadoop was the popular
technology used to handle big data. Then Apache
Spark was introduced in 2014. Today, a
combination of the two frameworks appears to be
the best approach. Keeping up with big data
technology is an ongoing challenge.
Big Data Benefits
 Big data makes it possible for you to gain more
complete answers because you have more
information. More complete answers mean more
confidence in the data, which means a completely
different approach to tackling problems.
 Cost Savings: Optimizing processes based on big
data insights can result in cost savings. This
includes improvements in supply chain
management, resource utilization, and more
efficient business operations.
 Risk Management: Big data analytics helps
organizations identify and mitigate risks by
analyzing patterns and anomalies in data. This is
particularly valuable in financial services,
insurance, and other industries where risk
management is crucial.
HOW DOES APACHE SPARK IMPROVE BUSINESS IN THE BIG DATA INDUSTRY?
Powerful Data Processing
 Apache Spark is an ideal tool for companies that work
on the Internet of Things. With its low-latency in-
memory data processing capability, it can efficiently
handle a wide range of analytics problems. It
contains well-designed libraries for graph
analytics algorithms and machine learning.
Better Analytics
 Big data scientists use Apache Spark libraries
to improve their analyses, querying, and
data transformation. Spark helps them create complex
workflows in a smooth and seamless way, and is
used for tasks such as analysis, interactive
queries across large data sets, and more.
Real-time Processing
 Apache Spark enables organizations to analyze
data coming from IoT sensors. It enables
easy processing of continuous, low-latency
data streams, so organizations can use real-
time dashboards and data exploration to monitor
and optimize their business.
Flexibility
 Apache Spark is highly compatible with a variety of
programming languages and allows you to write
applications in Python, Scala, Java, and more.
What is Data Transformation?
 Data transformation is the technical
process of converting data from one format,
standard, or structure to another, without
changing the content of the datasets, typically to
prepare it for consumption by an application or a
user, or to improve data quality.
Why do we need to transform data?
 It is crucial for any organization that seeks to
leverage its data for timely business
insights. As the amount of data grows,
organizations need a reliable method for
putting it to good use in their operations.
 Data transformation is a key part of using this
data: when performed effectively, it ensures
that the information is accessible, consistent, secure,
and ultimately trusted by the targeted
business users.
Why Spark?
 In-memory computation
o Spark allows applications on Hadoop clusters to run up to 100 times faster in memory and
up to 10 times faster on disk
 Lazy Evaluation
 Supports SQL queries (see the sketch below)
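A minimal sketch (Scala) of the last two points: transformations build a lazy plan, and the same logic can be expressed as plain SQL. The session settings, data, and column names are illustrative assumptions, not part of the deck.

import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .appName("WhySpark")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "alice", 50000), (2, "bob", 60000)).toDF("id", "name", "salary")

// Lazy evaluation: filter/select only build a plan; nothing executes yet.
val highEarners = df.filter($"salary" > 55000).select("name")

// SQL support: register a view and query it with plain SQL.
df.createOrReplaceTempView("employees")
val viaSql = spark.sql("SELECT name FROM employees WHERE salary > 55000")

// Actions such as show() trigger execution of the lazy plan.
highEarners.show()
viaSql.show()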
SPARK TRANSFORMATIONS
 RDD: The Resilient Distributed Dataset (RDD) is Spark's fundamental data structure and the primary data abstraction in Apache Spark and Spark Core.
 DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a
relational database or a data frame in R/Python, but with richer optimizations under the hood.
Spark Transformations
Main Features:
 In-Memory Processing
 Immutability
 Fault Tolerance
 Lazy Evaluation
 Partitioning
 Parallelize
Wide vs Narrow
Transformations
Spark transformations create a new
Resilient Distributed Dataset (RDD) from
an existing one. Narrow transformations
(e.g. map, filter) compute each output
partition from a single input partition;
wide transformations (e.g. groupByKey,
join) shuffle data across partitions, as
the sketch below illustrates.
e.g. map, flatMap, groupByKey, filter, union
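A small sketch of narrow vs. wide transformations on an RDD, reusing the `spark` session from the earlier sketch; the data is made up for illustration.

// Narrow vs. wide transformations on an RDD.
val sc = spark.sparkContext
val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)

// Narrow: map and filter operate within each partition, no shuffle.
val upper = words.map(_.toUpperCase)
val noC = upper.filter(_ != "C")

// Wide: groupByKey must move equal keys to the same partition (a shuffle).
val grouped = noC.map(w => (w, 1)).groupByKey()
grouped.collect().foreach(println) // e.g. (A,CompactBuffer(1, 1)), (B,CompactBuffer(1))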
Spark Transformations Examples
 GroupBy: groupBy() is a transformation operation in Spark that is used to group the data in a Spark DataFrame or RDD
based on one or more specified columns.
It returns a grouped-data object which can then be used to perform aggregation operations such as count(), sum(), avg(), etc. on
the grouped data.
 Map: Spark map() is a transformation operation that applies a function to every element of an RDD, DataFrame,
or Dataset and returns a new RDD/Dataset.
 Union: This DataFrame method is used to combine two DataFrames of the same structure/schema. If the schemas are not
the same, it returns an error.
 Intersect: Returns a new Dataset containing only rows present in both this Dataset and another Dataset. This is equivalent to
INTERSECT in SQL.
 Where: The Spark where() function is used to filter rows from a DataFrame or Dataset based on a given condition or SQL
expression. A combined sketch of these operations follows.
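A hedged sketch exercising the transformations listed above, assuming a SparkSession `spark` is in scope; the DataFrames and column names are invented for illustration.

import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(("sales", 3000), ("hr", 4000), ("sales", 5000)).toDF("dept", "salary")
val df2 = Seq(("it", 7000), ("sales", 3000)).toDF("dept", "salary")

// groupBy returns a grouped object on which aggregations run.
df1.groupBy("dept").agg(count("*").as("n"), avg("salary").as("avg_salary")).show()

// where: filter rows by a Column condition or a SQL expression string.
df1.where($"salary" > 3500).show()
df1.where("salary > 3500").show()

// union requires identical schemas; intersect keeps rows present in both.
df1.union(df2).show()
df1.intersect(df2).show()

// map on a typed Dataset applies a function to every element.
val bumped = df1.as[(String, Int)].map { case (d, s) => (d, s + 100) }
bumped.show()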
Aggregate Functions
 Spark provides built-in standard
aggregate functions defined in the
DataFrame API; these come in
handy when we need to perform
aggregate operations on
DataFrame columns. Aggregate
functions operate on a group of
rows and calculate a single
return value for every group. All
these aggregate functions
accept a Column or a
column name as a string (plus
other arguments depending
on the function) and return a
Column.
 For example:
val avg_df = df.select(avg("salary"))
df.select(first("salary").as("Top Salary")).show(false)
Array Functions
 Spark SQL provides built-in
standard array functions defined in
the DataFrame API; these come in
handy when we need to perform
operations on array (ArrayType)
columns.
 All of these accept an array
column as input, plus other
arguments depending on the function.
 For example:
inputdf.withColumn("result",
array_contains(col("array_col2"),
3))
 inputdf.withColumn("result",
array_max(col("array_col2")))
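A runnable sketch of the array functions above (note that array_max takes a single column argument, so the extra argument in the original snippet was dropped); the data and column names mirror the slide's but are assumptions.

import org.apache.spark.sql.functions._
import spark.implicits._

val inputdf = Seq((Seq(1, 2, 3), Seq(4, 5, 6))).toDF("array_col1", "array_col2")

inputdf
  .withColumn("has_3", array_contains(col("array_col2"), 3))          // false for (4, 5, 6)
  .withColumn("max", array_max(col("array_col2")))                    // 6
  .withColumn("merged", concat(col("array_col1"), col("array_col2"))) // concat works on arrays in Spark 2.4+
  .show(false)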
Spark Joins
Inner Join
Inner join is Spark's default join and the
most commonly used. It joins two
DataFrames/Datasets on key
columns; rows whose keys don't match
are dropped from both
datasets.
Left Join
Spark's Left (a.k.a. Left Outer) join returns all
rows from the left DataFrame/Dataset
regardless of whether a match is found on the
right dataset. When the join expression doesn't
match, it assigns null for that record and
drops records from the right where no match is
found.
Full Outer Join
The outer (a.k.a. full or fullouter) join returns all rows
from both Spark DataFrames/Datasets;
where the join expression doesn't match, it
returns null in the respective record columns.
Right Outer Join
Spark Right a.k.a Right Outer join is
opposite of left join, here it returns all rows
from the right DataFrame/Dataset
regardless of math found on the left
dataset, when join expression doesn’t
match, it assigns null for that record and
drops records from left where match not
found.
Spark DataFrame supports all basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI,
CROSS, and SELF JOIN. Spark SQL joins are wide transformations that shuffle data over the network, so they can cause
serious performance issues when not designed with care. A short sketch of the basic join types follows.
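A minimal sketch of the basic join types; empDF and deptDF anticipate the slide's later joinWith example, but their schemas and data are assumptions.

import spark.implicits._

val empDF = Seq((1, "alice", 10), (2, "bob", 20), (3, "carol", 99))
  .toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "sales"), (20, "hr"), (30, "it"))
  .toDF("dept_id", "dept_name")
val cond = empDF("emp_dept_id") === deptDF("dept_id")

empDF.join(deptDF, cond, "inner").show() // drops carol (99) and it (30)
empDF.join(deptDF, cond, "left").show()  // keeps carol, with nulls on the right
empDF.join(deptDF, cond, "outer").show() // keeps carol and it, with nulls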
Joins Continued ...
 Left Semi Join :
The Spark Left Semi join is similar to an inner join, the difference being that a leftsemi join returns all columns from the left
DataFrame/Dataset and ignores all columns from the right dataset. In other words, this join returns columns from
only the left dataset for records that match the right dataset on the join expression; records that don't match the join
expression are ignored from both the left and right datasets.
** The same result can be achieved by using select on the result of an inner join; however, using this join is more
efficient.
 Left Anti Join :
The Left Anti join does the exact opposite of the Spark leftsemi join: a leftanti join returns only columns from the left
DataFrame/Dataset, and only for non-matched records.
 Self Join :
Spark joins are not complete without a self join. Though there is no self-join type available, we can use any of the above-
explained join types to join a DataFrame to itself. The example below uses an inner self join.
 Cross Join :
Returns the Cartesian product of both DataFrames, resulting in all possible combinations of rows: crossJoin(right:
Dataset[_]). If you don't specify any condition/join expression, Spark performs a cross join by default, or you can use the
crossJoin method explicitly. (A sketch of these join types follows.)
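A sketch of the remaining join types, reusing empDF, deptDF, and cond from the previous join sketch.

import org.apache.spark.sql.functions.col

empDF.join(deptDF, cond, "leftsemi").show() // left columns only, matched rows
empDF.join(deptDF, cond, "leftanti").show() // left columns only, unmatched rows

// Self join: alias the DataFrame so both sides can be referenced.
empDF.as("e1")
  .join(empDF.as("e2"), col("e1.emp_dept_id") === col("e2.emp_dept_id"), "inner")
  .show()

// Cross join: Cartesian product of all row combinations.
empDF.crossJoin(deptDF).show()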
Joins Continued ...
 Join With :
The joinWith method in Spark also performs a join operation between
two DataFrames based on a specified join condition. However, it returns
a Dataset of tuples representing the joined rows from both DataFrames.
The resulting Dataset contains tuples, where each tuple represents a
joined row consisting of the rows from the left DataFrame and the
right DataFrame that satisfy the join condition.
-> The key distinction is that joinWith provides a more structured
output in the form of a Dataset of tuples, while join returns a
DataFrame with merged columns from both DataFrames.
-> In most cases, join is used when you want to combine rows
from two DataFrames based on a join condition, and you are
interested in the merged columns in the output DataFrame. On
the other hand, joinWith is useful when you want to keep the
original structure of the data and work with tuples representing
the joined rows.
For example: empDF.joinWith(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)
(Slide shows the sample EmpDF and DeptDF tables and the resulting resultDF of joined tuples.)
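A short sketch of joinWith, reusing empDF and deptDF from the earlier join sketches, to show the tuple-shaped result.

// joinWith keeps both source rows intact as a tuple instead of merging columns.
val pairs = empDF.joinWith(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
// pairs: Dataset[(Row, Row)]; each element pairs a left row with its matching right row.
pairs.show(false)

// The two sides are exposed as struct columns _1 and _2.
pairs.select("_1.name", "_2.dept_name").show(false)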