Python For Data Science Cheat Sheet
PySpark - SQL Basics
Learn Python for data science Interactively at www.DataCamp.com
Initializing SparkSession
Spark SQL is Apache Spark's module for working with structured data.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
Stopping SparkSession
>>> spark.stop()
Write & Save to Files
>>> df.select("firstName", "city") \
       .write \
       .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
       .write \
       .save("namesAndAges.json", format="json")
Queries
>>> from pyspark.sql import functions as F
Select
>>> df.select("firstName").show()             # Show all entries in the firstName column
>>> df.select("firstName", "lastName") \
       .show()
>>> df.select("firstName",                    # Show all entries in firstName, age and type
              "age",
              F.explode("phoneNumber").alias("contactInfo")) \
       .select("contactInfo.type",
               "firstName",
               "age") \
       .show()
>>> df.select(df["firstName"], df["age"] + 1) \
       .show()                                # Show all entries in firstName and age, with 1 added to each age
>>> df.select(df['age'] > 24).show()          # Show a boolean column that is TRUE where age > 24
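`explode` turns each element of an array column into its own row and repeats the other selected columns alongside it; a row whose array is empty produces no output rows. A plain-Python sketch of that behaviour, runnable without Spark, on made-up sample records (the second phone entry and its `fax` type are illustrative assumptions, not taken from the source data):

```python
# Made-up records shaped like the customer data: phoneNumber is a list of dicts
people = [
    {"firstName": "John", "age": 25,
     "phoneNumber": [{"number": "212 555-1234", "type": "home"},
                     {"number": "646 555-4567", "type": "fax"}]},
    {"firstName": "Jane", "age": 21,
     "phoneNumber": [{"number": "322 888-1234", "type": "home"}]},
]

# One output row per array element, mirroring
# F.explode("phoneNumber").alias("contactInfo")
exploded = [
    {"firstName": p["firstName"], "age": p["age"], "contactInfo": phone}
    for p in people
    for phone in p["phoneNumber"]
]

for row in exploded:
    print(row["contactInfo"]["type"], row["firstName"], row["age"])
```

John appears twice in the result, once per phone entry, while Jane appears once.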
When
>>> df.select("firstName",                    # Show firstName and 0 or 1 depending on age > 30
              F.when(df.age > 30, 1)
               .otherwise(0)) \
       .show()
>>> df[df.firstName.isin("Jane", "Boris")] \
       .collect()                             # Return the rows whose firstName is one of the given options
Like
>>> df.select("firstName",                    # Show firstName, and TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
       .show()
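`like` follows SQL `LIKE` rules: `%` matches any run of characters, `_` matches exactly one character, and a pattern with no wildcard, such as `"Smith"`, matches only that exact string. A rough plain-Python sketch of those rules (the helper `sql_like` is hypothetical, not part of PySpark, and it ignores escape characters):

```python
import re

def sql_like(value, pattern):
    """Approximate SQL LIKE: % = any run of characters, _ = exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character matches literally
    # fullmatch: the pattern has to cover the whole value, as LIKE does
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

print(sql_like("Smith", "Smith"))      # True
print(sql_like("Smithson", "Smith"))   # False: no wildcard, so only an exact match counts
print(sql_like("Smithson", "Smith%"))  # True
```

To match every last name that begins with `Smith`, the pattern would therefore need a trailing `%`.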
Startswith - Endswith
>>> df.select("firstName",                    # Show firstName, and TRUE if lastName starts with Sm
              df.lastName.startswith("Sm")) \
       .show()
>>> df.select(df.lastName.endswith("th")) \
       .show()                                # Show TRUE for last names ending in th
Substring
>>> df.select(df.firstName.substr(1, 3).alias("name")) \
       .collect()                             # Return the first three characters of firstName as name
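`substr(1, 3)` takes a 1-based start position and a length, whereas Python slices are 0-based and take an end index. A plain-Python equivalent, assuming a start position of 1 or more (the helper name `substr_1based` is hypothetical and used only for illustration):

```python
def substr_1based(text, start, length):
    """Mimic substr(start, length): start counts from 1, length is a character count."""
    begin = start - 1                   # convert the 1-based position to a 0-based index
    return text[begin:begin + length]

print(substr_1based("Jonathan", 1, 3))  # Jon
print(substr_1based("Filip", 2, 3))     # ili
```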
Between
>>> df.select(df.age.between(22, 24)) \
       .show()                                # Show a boolean column that is TRUE if age is between 22 and 24 (inclusive)
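`between` includes both bounds, so ages of exactly 22 and 24 count as matches. A small plain-Python sketch of the same check (the function name and sample ages are made up for illustration):

```python
def between(value, lower, upper):
    """Inclusive range check, mirroring between(lower, upper)."""
    return lower <= value <= upper

ages = [21, 22, 24, 25]
print([between(age, 22, 24) for age in ages])  # [False, True, True, False]
```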
Running SQL Queries Programmatically
Query Views
>>> df5 = spark.sql("SELECT * FROM customer")                   # Keep the DataFrame; show() itself returns None
>>> df5.show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")
>>> peopledf2.show()
Add, Update & Remove Columns
Adding Columns
>>> df = df.withColumn('city', df.address.city) \
           .withColumn('postalCode', df.address.postalCode) \
           .withColumn('state', df.address.state) \
           .withColumn('streetAddress', df.address.streetAddress) \
           .withColumn('telePhoneNumber',
                       F.explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType',
                       F.explode(df.phoneNumber.type))
Removing Columns
>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)   # Same result with Column objects; use one form or the other
Duplicate Values
>>> df = df.dropDuplicates()
Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')
Creating DataFrames
From Spark Data Sources
JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
+--------------------+---+---------+--------+--------------------+
|             address|age|firstName|lastName|         phoneNumber|
+--------------------+---+---------+--------+--------------------+
|[New York,10021,N...| 25|     John|   Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21|     Jane|     Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
Parquet files
>>> df3 = spark.read.load("users.parquet")
TXT files
>>> df4 = spark.read.text("people.txt")
A SparkSession can be used to create DataFrames, register DataFrames as tables,
execute SQL over tables, cache tables, and read Parquet files.
From RDDs
>>> from pyspark.sql import Row
>>> from pyspark.sql.types import *
Infer Schema
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)
Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
                                     age=p[1].strip()))   # Keep age as a string to match the StringType fields below
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
+--------+---+
|    name|age|
+--------+---+
|    Mine| 28|
|   Filip| 29|
|Jonathan| 30|
+--------+---+
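Before a schema is attached, the `map` steps turn each comma-separated line of `people.txt` into a (name, age) record. A plain-Python sketch of that parsing step, keeping the age as a string so that it fits a schema whose fields are all `StringType`. The sample lines are an assumption reconstructed from the table shown, and `parse_person` is a hypothetical helper:

```python
def parse_person(line):
    """Split 'name, age' into a (name, age) tuple of strings, trimming whitespace."""
    name, age = line.split(",")
    return name.strip(), age.strip()

sample_lines = ["Mine, 28", "Filip, 29", "Jonathan, 30"]  # assumed file contents
people = [parse_person(line) for line in sample_lines]
print(people)  # [('Mine', '28'), ('Filip', '29'), ('Jonathan', '30')]
```

If the age should be numeric instead, the record would carry `int(age)` and the schema would need an integer field type for that column.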
Inspect Data
>>> df.dtypes                          # Return df column names and data types
>>> df.show()                          # Display the content of df (first 20 rows by default)
>>> df.head()                          # Return the first row (head(n) returns the first n rows)
>>> df.first()                         # Return the first row
>>> df.take(2)                         # Return the first 2 rows
>>> df.schema                          # Return the schema of df
>>> df.describe().show()               # Compute summary statistics
>>> df.columns                         # Return the columns of df
>>> df.count()                         # Count the number of rows in df
>>> df.distinct().count()              # Count the number of distinct rows in df
>>> df.printSchema()                   # Print the schema of df
>>> df.explain()                       # Print the physical plan (explain(True) also prints the logical plans)
GroupBy
>>> df.groupBy("age") \
       .count() \
       .show()                         # Group by age, count the members in the groups
Filter
>>> df.filter(df["age"] > 24).show()   # Keep only the rows where age is greater than 24
Missing & Replacing Values
>>> df.na.fill(50).show()              # Replace null values in numeric columns with 50
>>> df.na.drop().show()                # Return new df omitting rows with null values
>>> df.na \
       .replace(10, 20) \
       .show()                         # Return new df replacing one value with another
Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
Sort
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age", "city"], ascending=[0, 1]) \
       .collect()
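In the Sort section, `orderBy(["age", "city"], ascending=[0, 1])` sorts by age in descending order and breaks ties by city in ascending order. The same ordering on a small in-memory list, as a plain-Python sketch (the sample rows are made up for illustration):

```python
rows = [
    {"firstName": "John", "age": 25, "city": "New York"},
    {"firstName": "Jane", "age": 21, "city": "Boston"},
    {"firstName": "Boris", "age": 25, "city": "Albany"},
]

# age descending (negated key), then city ascending to break ties
ordered = sorted(rows, key=lambda r: (-r["age"], r["city"]))
print([r["firstName"] for r in ordered])  # ['Boris', 'John', 'Jane']
```

Boris and John share an age of 25, so the city decides their order.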
Output
Data Structures
>>> rdd1 = df.rdd                # Convert df into an RDD
>>> df.toJSON().first()          # Convert df into an RDD of JSON strings and return the first one
>>> df.toPandas()                # Return the contents of df as a pandas DataFrame
Repartitioning
>>> df.repartition(10) \
       .rdd \
       .getNumPartitions()                      # df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()       # df with 1 partition
