Using Spark DataFrame for SQL

1) This deck gives examples of using Spark DataFrames and SQL to load and analyze Iris flower data. It shows how to load data from files and Kafka, define schemas, and select, filter, sort, and group DataFrames.
2) Methods such as spark.read, dataframe.select(), dataframe.filter(), and dataframe.groupBy() load and query the data. A StructType or a case class defines the schema. SQL statements can also be run through the sqlContext.
3) A user-defined function (UDF) is demonstrated for building a map-typed column. Together, the examples give an overview of basic Spark DataFrame and SQL functionality.

Basic
Using Spark DataFrame
For SQL
charsyam@naver.com

Create DataFrame From File
val path = "abc.txt"
val df = spark.read.text(path)
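
As a minimal sketch of what spark.read.text returns, the snippet below writes a small temporary file and reads it back. The file contents, the app name, and the local master are assumptions for illustration; in spark-shell the `spark` session already exists.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("read-text-demo").getOrCreate()

// Hypothetical stand-in for "abc.txt": a temp file with three lines.
val path = Files.createTempFile("abc", ".txt")
Files.write(path, "first\nsecond\nthird\n".getBytes("UTF-8"))

// Use a file: URI so the local file system is read regardless of the default FS.
val df = spark.read.text(path.toUri.toString)

// Each input line becomes one row in a single string column named "value".
df.printSchema()
val lineCount = df.count()
```

Because everything arrives in one "value" column, this loader suits raw log lines that are parsed in a later step.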

Create DataFrame From Kafka
val rdd = KafkaUtils.createRDD[String, String](...)
val logsDF = rdd.map { _.value }.toDF // toDF needs import spark.implicits._

Spark DataFrame Column
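1) col("column name") // needs import org.apache.spark.sql.functions.col
2) $"column name" // needs import spark.implicits._
1) and 2) are equivalent: both produce a Column.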
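
The two column notations can be compared side by side. This is a small sketch under stated assumptions: the two-row sample, the app name, and the local master are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("column-demo").getOrCreate()
import spark.implicits._ // enables the $"..." syntax and toDF

// Hypothetical two-row sample that reuses the Iris column names from this deck.
val demoDF = Seq((0, 2), (1, 24)).toDF("Type", "PetalWidth")

val byCol    = demoDF.select(col("PetalWidth"))
val byDollar = demoDF.select($"PetalWidth")

// Both notations select the same column, so the collected rows match.
val sameRows = byCol.collect().sameElements(byDollar.collect())
```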

Simple Iris TSV Logs
http://www.math.uah.edu/stat/data/Fisher.txt
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
1 19 51 27 58

Load TSV with StructType
import org.apache.spark.sql.types._
val irisSchema = StructType(Array(
  StructField("Type", IntegerType, true),
  StructField("PetalWidth", IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth", IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))

Load TSV with Encoder #1
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)
val irisSchema = Encoders.product[IrisSchema].schema

Load TSV
val irisDf = spark.read.format("csv"). // Use "csv" for both TSV and CSV.
  option("header", "true").            // The file has a header line.
  option("delimiter", "\t").           // Tab delimiter; use "," for CSV.
  schema(irisSchema).                  // Schema that was built above.
  load("Fisher.txt")
irisDf.show(5)

Load TSV - Show Results
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
|   0|         2|         14|        33|         50|
|   1|        24|         56|        31|         67|
|   1|        23|         51|        31|         69|
|   0|         2|         10|        36|         46|
|   1|        20|         52|        30|         65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
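
For readers without Fisher.txt at hand, the load can be tried end to end on a tiny file. This is a sketch, not the deck's exact setup: the temp file, its two data rows (copied from the sample above), the app name, and the local master are assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("tsv-demo").getOrCreate()

// Hypothetical stand-in for Fisher.txt: the header line plus two tab-separated rows.
val tsvPath = Files.createTempFile("iris", ".tsv")
Files.write(tsvPath, "Type\tPW\tPL\tSW\tSL\n0\t2\t14\t33\t50\n1\t24\t56\t31\t67\n".getBytes("UTF-8"))

val irisSchema = StructType(Array(
  StructField("Type", IntegerType, true),
  StructField("PetalWidth", IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth", IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))

val irisDf = spark.read.format("csv").
  option("header", "true").  // skip the header line
  option("delimiter", "\t"). // a real tab character, not the letter t
  schema(irisSchema).
  load(tsvPath.toUri.toString)

irisDf.show()
val rowCount = irisDf.count()
val firstPetalWidth = irisDf.first().getInt(1)
```

The column names come from the supplied schema, not from the short names (PW, PL, ...) in the file header.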

Using sqlContext sql
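The easiest way:
// df: the Iris DataFrame, e.g. irisDf loaded above
df.createOrReplaceTempView("tmp_iris") // returns Unit, so there is nothing to assign
val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris") // spark.sql(...) is equivalent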
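
A self-contained sketch of the temp-view approach follows. The in-memory rows are the six sample rows from this deck; the app name and the local master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("sql-demo").getOrCreate()
import spark.implicits._

// The six sample rows shown earlier, built in memory instead of loaded from a file.
val df = Seq(
  (0, 2, 14, 33, 50), (1, 24, 56, 31, 67), (1, 23, 51, 31, 69),
  (0, 2, 10, 36, 46), (1, 20, 52, 30, 65), (1, 19, 51, 27, 58)
).toDF("Type", "PetalWidth", "PetalLength", "SepalWidth", "SepalLength")

// Register the DataFrame under a name that SQL statements can refer to.
df.createOrReplaceTempView("tmp_iris")

val resultDF = spark.sql("select Type, PetalWidth from tmp_iris")
resultDF.show()
val resultCount = resultDF.count()
```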

Simple Select
SQL:
Select type, petalwidth + sepalwidth as sum_width from …
val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth"))
val resultDF = sumDF.selectExpr("Type", "sum_width")
val allDF = sumDF.selectExpr("*") // select *

Select with where
SQL:
Select type, petalwidth from … where petalwidth > 10
1) val whereDF = df.filter($"petalwidth" > 10)
2) val whereDF = df.where($"petalwidth" > 10)
// 1) and 2) are equivalent; where is an alias for filter.
val resultDF = whereDF.selectExpr("Type", "petalwidth")
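
The filter variants can be checked against each other on the sample rows. This is a sketch; the trimmed three-column sample, the app name, and the local master are assumptions. The third variant passes the condition as a SQL expression string, which filter also accepts.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("where-demo").getOrCreate()
import spark.implicits._

// Sample rows from this deck, trimmed to three columns for brevity.
val df = Seq(
  (0, 2, 33), (1, 24, 31), (1, 23, 31), (0, 2, 36), (1, 20, 30), (1, 19, 27)
).toDF("Type", "PetalWidth", "SepalWidth")

val byFilter = df.filter($"PetalWidth" > 10)
val byWhere  = df.where($"PetalWidth" > 10)
val byString = df.filter("PetalWidth > 10") // condition written as a SQL expression string

val counts = Seq(byFilter.count(), byWhere.count(), byString.count())
```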

Select with order by
SQL:
Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc
1) val sortDF = df.sort($"petalwidth", $"sepalwidth".desc)
2) val sortDF = df.sort($"petalwidth", desc("sepalwidth")) // needs import org.apache.spark.sql.functions.desc
3) val sortDF = df.orderBy($"petalwidth", desc("sepalwidth"))
1), 2) and 3) are equivalent; orderBy is an alias for sort.
val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")
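
A short sketch of the mixed ascending/descending sort on the sample rows. The two-column sample, the app name, and the local master are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("sort-demo").getOrCreate()
import spark.implicits._

// PetalWidth and SepalWidth of the six sample rows from this deck.
val df = Seq((2, 33), (24, 31), (23, 31), (2, 36), (20, 30), (19, 27)).
  toDF("PetalWidth", "SepalWidth")

// PetalWidth ascending (the default), ties broken by SepalWidth descending.
val sortDF = df.orderBy($"PetalWidth", desc("SepalWidth"))

val ordered = sortDF.collect().map(r => (r.getInt(0), r.getInt(1))).toList
```

The two rows with PetalWidth 2 show the tie-break: the row with SepalWidth 36 sorts ahead of the one with 33.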

Select with Group by
SQL:
Select type, max(petalwidth) A, min(sepalwidth) B from … group by type
val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"),
                                      min($"sepalwidth").as("B"))
val resultDF = groupDF.selectExpr("type", "A", "B")
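
The group-by can be run both ways and compared. This is a sketch under stated assumptions: the in-memory sample, the view name tmp_iris_group, the app name, and the local master are chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("groupby-demo").getOrCreate()
import spark.implicits._

// Sample rows from this deck, trimmed to three columns for brevity.
val df = Seq(
  (0, 2, 33), (1, 24, 31), (1, 23, 31), (0, 2, 36), (1, 20, 30), (1, 19, 27)
).toDF("Type", "PetalWidth", "SepalWidth")

// DataFrame API version.
val groupDF = df.groupBy($"Type").agg(max($"PetalWidth").as("A"), min($"SepalWidth").as("B"))

// SQL version over a temp view (hypothetical view name).
df.createOrReplaceTempView("tmp_iris_group")
val sqlDF = spark.sql(
  "select Type, max(PetalWidth) A, min(SepalWidth) B from tmp_iris_group group by Type")

// Row order after a group by is not guaranteed, so compare as maps keyed by Type.
val viaApi = groupDF.collect().map(r => r.getInt(0) -> (r.getInt(1), r.getInt(2))).toMap
val viaSql = sqlDF.collect().map(r => r.getInt(0) -> (r.getInt(1), r.getInt(2))).toMap
```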

Tip - Support MapType<String, String> like Hive
SQL in Hive:
Create table test (type map<string, string>);
Hive supports str_to_map, but Spark does not offer it as a DataFrame function
(Spark supports str_to_map only in HiveQL).
A UDF solves this; str_to_map below is the UDF defined on the next slide.
val string_line = "A=1,B=2,C=3"
val df = logsDF.withColumn("type", str_to_map(lit(string_line))) // a UDF takes Columns, so wrap the string in lit(...)

UDF - str_to_map
val str_to_map = udf { text: String =>
  // Split on both delimiters (e.g. "=|," for "A=1,B=2,C=3"), then pair up key and value.
  val pairs = text.split("delimiter1|delimiter2").grouped(2)
  pairs.map { case Array(k, v) => k -> v }.toMap
}
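
The parsing step inside the UDF can be tried without Spark. The helper below is a hypothetical variant, not the deck's code: it splits into pairs first and then into key and value, so a fragment with no key/value separator is skipped where the grouped(2) version would throw a MatchError. The name parsePairs and the default delimiters are assumptions.

```scala
import java.util.regex.Pattern

// Hypothetical helper: parse "A=1,B=2,C=3" into Map("A" -> "1", "B" -> "2", "C" -> "3").
def parsePairs(text: String, pairDelim: String = ",", kvDelim: String = "="): Map[String, String] =
  text.split(Pattern.quote(pairDelim)).iterator
    .map(_.split(Pattern.quote(kvDelim), 2)) // at most one split, so values may contain kvDelim
    .collect { case Array(k, v) => k -> v }  // skip fragments that have no key/value separator
    .toMap

val parsed  = parsePairs("A=1,B=2,C=3")
val lenient = parsePairs("A=1,B") // "B" has no value and is dropped
```

To use it on a DataFrame, it could be wrapped as udf { s: String => parsePairs(s) }, in the same way as the str_to_map UDF above.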
