Using spark data frame for sql


DaeMyung Kang

Simple SQL to Spark DataFrame


Basic
Using Spark DataFrame
For SQL
charsyam@naver.com

Create DataFrame From File
val path = "abc.txt"
val df = spark.read.text(path)

Create DataFrame From Kafka
val rdd = KafkaUtils.createRDD[String, String](...)
val logsDF = rdd.map { _.value }.toDF

Spark DataFrame Column
1) col("column name")
2) $"column name"
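1) and 2) are equivalent. col comes from org.apache.spark.sql.functions; $"..." needs import spark.implicits._ (spark-shell imports it automatically).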
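As a minimal sketch of the two column syntaxes side by side, the following builds a tiny in-memory DataFrame and selects the same column both ways. The object name, app name, and local master setting are assumptions made for illustration; they are not part of the original slides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ColumnSyntaxSketch {
  // Returns the PetalWidth values selected via col(...) and via $"...".
  def run(): (Seq[Int], Seq[Int]) = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .appName("column-syntax-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables $"..." and toDF

    val df = Seq((0, 2), (1, 24)).toDF("Type", "PetalWidth")

    val viaCol = df.select(col("PetalWidth")).as[Int].collect().sorted.toSeq
    val viaDollar = df.select($"PetalWidth").as[Int].collect().sorted.toSeq

    spark.stop()
    (viaCol, viaDollar)
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = run()
    println(s"col: $a, dollar: $b")
  }
}
```

Both selections resolve to the same column, so either form can be used wherever a Column is expected.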

Simple Iris TSV Logs
http://www.math.uah.edu/stat/data/Fisher.txt
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
1 19 51 27 58

Load TSV with StructType
import org.apache.spark.sql.types._
var irisSchema = StructType(Array(
StructField("Type", IntegerType, true),
StructField("PetalWidth", IntegerType, true),
StructField("PetalLength", IntegerType, true),
StructField("SepalWidth", IntegerType, true),
StructField("SepalLength", IntegerType, true)
))

Load TSV with Encoder #1
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
SepalWidth: Int, SepalLength: Int)
var irisSchema = Encoders.product[IrisSchema].schema

Load TSV
var irisDf = spark.read.format("csv"). // Use "csv" regardless of TSV or CSV.
option("header", "true"). // Does the file have a header line?
option("delimiter", "\t"). // Set delimiter to tab ("\t") or comma (",").
schema(irisSchema). // Schema that was built above.
load("Fisher.txt")
irisDf.show(5)

Load TSV - Show Results
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
| 0| 2| 14| 33| 50|
| 1| 24| 56| 31| 67|
| 1| 23| 51| 31| 69|
| 0| 2| 10| 36| 46|
| 1| 20| 52| 30| 65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
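The load step can be tried without downloading Fisher.txt. The sketch below writes a two-row tab-separated file to a temporary location and reads it back with an explicit schema and a tab delimiter. The temporary file, object name, and session settings are assumptions for illustration only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object LoadTsvSketch {
  // Returns the number of data rows read from the temporary TSV file.
  def run(): Long = {
    val spark = SparkSession.builder()
      .appName("load-tsv-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample file: a header line plus two rows from the Iris data.
    val tmp = Files.createTempFile("iris-sketch", ".tsv")
    val content = "Type\tPW\tPL\tSW\tSL\n0\t2\t14\t33\t50\n1\t24\t56\t31\t67\n"
    Files.write(tmp, content.getBytes("UTF-8"))

    // Same column names as the schema in the slides, all nullable integers.
    val schema = StructType(
      Seq("Type", "PetalWidth", "PetalLength", "SepalWidth", "SepalLength")
        .map(name => StructField(name, IntegerType, nullable = true))
    )

    val df = spark.read.format("csv")
      .option("header", "true")   // skip the header line
      .option("delimiter", "\t")  // a real tab character, not the letter t
      .schema(schema)
      .load(tmp.toString)

    df.show()
    val n = df.count()
    spark.stop()
    Files.deleteIfExists(tmp)
    n
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because a schema is supplied, the column names come from the schema rather than from the short names in the file's header line.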

Using sqlContext sql
Super easy way
df.createOrReplaceTempView("tmp_iris") // returns Unit, so there is nothing to assign
val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris") // spark.sql(...) is equivalent
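A self-contained sketch of the temp-view approach, assuming a local session and three in-memory rows taken from the Iris sample (Type and PetalWidth only). The object name and the where clause are illustrative additions.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  // Returns the PetalWidth values greater than 10, queried through plain SQL.
  def run(): Seq[Int] = {
    val spark = SparkSession.builder()
      .appName("temp-view-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0, 2), (1, 24), (1, 23)).toDF("Type", "PetalWidth")

    // Register the DataFrame under a name that SQL statements can refer to.
    df.createOrReplaceTempView("tmp_iris")

    val resultDF = spark.sql("select Type, PetalWidth from tmp_iris where PetalWidth > 10")
    val widths = resultDF.select("PetalWidth").as[Int].collect().sorted.toSeq

    spark.stop()
    widths
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The view lives only for the lifetime of the session, which is usually enough for ad hoc queries.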

Simple Select
SQL:
Select type, petalwidth + sepalwidth as sum_width from …
val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth"))
val resultDF = sumDF.selectExpr("Type", "sum_width")
val allDF = sumDF.selectExpr("*") // equivalent to select *

Select with where
SQL:
Select type, petalwidth from … where petalwidth > 10
val whereDF = df.filter($"petalwidth" > 10)
// where is an alias for filter, so this line is equivalent:
// val whereDF = df.where($"petalwidth" > 10)
val resultDF = whereDF.selectExpr("Type", "petalwidth")

Select with order by
SQL:
Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc
// 1), 2) and 3) are equivalent; orderBy is an alias for sort.
val sortDF = df.sort($"petalwidth", $"sepalwidth".desc) // 1)
// val sortDF = df.sort($"petalwidth", desc("sepalwidth")) // 2)
// val sortDF = df.orderBy($"petalwidth", desc("sepalwidth")) // 3)
val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")

Select with Group by
SQL:
Select type, max(petalwidth) A, min(sepalwidth) B from … group by type
val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"),
min($"sepalwidth").as("B"))
val resultDF = groupDF.selectExpr("type", "A", "B")
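The where, group by, and order by steps shown so far can also be chained into one pipeline. The sketch below does this on the six Iris rows from the sample, held in memory. The combined query, the object name, and the session settings are assumptions for illustration, corresponding roughly to: select Type, max(PetalWidth) A, min(SepalWidth) B from … where PetalLength > 12 group by Type order by Type desc.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

object ChainedQuerySketch {
  // Returns (Type, A, B) rows ordered by Type descending.
  def run(): List[(Int, Int, Int)] = {
    val spark = SparkSession.builder()
      .appName("chained-query-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The six sample rows from the Iris TSV shown in the slides.
    val df = Seq(
      (0, 2, 14, 33, 50),
      (1, 24, 56, 31, 67),
      (1, 23, 51, 31, 69),
      (0, 2, 10, 36, 46),
      (1, 20, 52, 30, 65),
      (1, 19, 51, 27, 58)
    ).toDF("Type", "PetalWidth", "PetalLength", "SepalWidth", "SepalLength")

    val resultDF = df
      .where($"PetalLength" > 12)                                   // where
      .groupBy($"Type")                                             // group by
      .agg(max($"PetalWidth").as("A"), min($"SepalWidth").as("B"))  // aggregates
      .orderBy($"Type".desc)                                        // order by

    val rows = resultDF.collect()
      .map(r => (r.getInt(0), r.getInt(1), r.getInt(2)))
      .toList

    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Each method returns a new DataFrame, so the order of the calls mirrors the logical order of the SQL clauses rather than their written order.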

Tip - Support MapType<String, String> like Hive
SQL in Hive:
Create table test (type map<string, string>);
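Hive supports str_to_map, but Spark does not offer it for DataFrames (Spark supports str_to_map only in HiveQL).
A UDF solves this.
val string_line = "A=1,B=2,C=3"
// A UDF takes Column arguments, so wrap the String with lit (org.apache.spark.sql.functions).
val df = logsDF.withColumn("type", str_to_map(lit(string_line)))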

UDF - str_to_map
val str_to_map = udf { text: String =>
  // Replace delimiter1|delimiter2 with the real delimiters, e.g. ",|=" for "A=1,B=2,C=3".
  val pairs = text.split("delimiter1|delimiter2").grouped(2)
  pairs.map { case Array(k, v) => k -> v }.toMap
}
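The parsing logic inside the UDF can be checked in plain Scala, without Spark. The sketch below is a hypothetical helper, not the author's code: the function name strToMap, its parameters, and the default delimiters "," and "=" are assumptions chosen to match the sample string "A=1,B=2,C=3". Unlike the UDF above, it uses collect instead of map, so a trailing key without a value is dropped rather than raising a MatchError.

```scala
import java.util.regex.Pattern

object StrToMapSketch {
  // Splits on both delimiters at once, then pairs the tokens up as key -> value.
  def strToMap(text: String, pairDelim: String = ",", kvDelim: String = "="): Map[String, String] = {
    // Quote the delimiters so characters such as "|" or "." are treated literally.
    val pattern = Pattern.quote(pairDelim) + "|" + Pattern.quote(kvDelim)
    text.split(pattern)
      .grouped(2)
      .collect { case Array(k, v) => k -> v } // ignore an incomplete final group
      .toMap
  }

  def main(args: Array[String]): Unit = {
    println(strToMap("A=1,B=2,C=3"))
  }
}
```

Wrapping the helper for DataFrame use would then be a matter of passing it to udf, for example udf((s: String) => StrToMapSketch.strToMap(s)).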

