
Intro to PySpark: Python Data Analysis at scale in the Cloud


Why would you care? Because PySpark is a cloud-agnostic analytics tool for Big Data processing, "hidden" in:
* AWS Glue - Managed ETL Service
* Amazon EMR - Big Data Platform
* Google Cloud Dataproc - Cloud-native Spark and Hadoop
* Azure HDInsight - Microsoft implementation of Apache Spark in the cloud

In this #ServerlessTO talk, Jonathan Rioux - Head of Data Science at EPAM Canada & author of PySpark in Action book (https://www.manning.com/books/pyspark-in-action), will get you acquainted with PySpark - Python API for Spark.

Event details: https://www.meetup.com/Serverless-Toronto/events/269124392/
Event recording: https://youtu.be/QGxytMbrjGY

As always, a BIG thanks to our knowledge sponsor Manning Publications – who generously offered to raffle off not 1 but 3 of Jonathan's books!

RSVP for more exciting (online) events at https://www.meetup.com/Serverless-Toronto/events/


Intro to PySpark: Python Data Analysis at scale in the Cloud

  1. 1. Welcome to ServerlessToronto.org “Home of Less IT Mess” 1 Introduce Yourself ☺ - Why are you here? - Looking for work? - Offering work? Our feature presentation “Intro to PySpark” starts at 6:20pm…
  2. 2. Serverless is not just about the Tech: 2 Serverless is New Agile & Mindset Serverless Dev (gluing other people’s APIs and managed services) We're obsessed with creating business value (meaningful MVPs, products) by helping Startups & empowering Business users! We build bridges between the Serverless Community (“Dev leg”) and Front-end & Voice-First folks (“UX leg”), and empower UX developers Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (using bigger building blocks and less Ops)
  3. 3. Upcoming #ServerlessTO Online Meetups 3 1. Accelerating with a Cloud Contact Center – Patrick Kolencherry, Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio **JULY 9 @ 6pm** 2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
  4. 4. Feature Talk Jonathan Rioux, Head Data Scientist at EPAM Systems & author of the Manning book PySpark in Action 4
  5. 5. Getting acquainted with PySpark 1/49
  6. 6. If you have not filled the Meetup survey, now is the time to do it! (Also copied in the chat) https://forms.gle/6cyWGVY4L4GJvsXh7 2/49
  7. 7. Hi! I'm Jonathan 3/49
  8. 8. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast 3/49
  9. 9. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada 3/49
  10. 10. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada Author of PySpark in Action → 3/49
  11. 11. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada Author of PySpark in Action → <3 Spark, <3 <3 Python 3/49
  12. 12. 4/49
  13. 13. 5/49
  14. 14. Goals of this presentation 6/49
  15. 15. Goals of this presentation Share my love of (Py)Spark 6/49
  16. 16. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines 6/49
  17. 17. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop 6/49
  18. 18. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop Get you excited about using PySpark 6/49
  19. 19. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop Get you excited about using PySpark 36,000 ft overview: Managed Spark in the Cloud 6/49
  20. 20. What I expect from you 7/49
  21. 21. What I expect from you You know a little bit of Python 7/49
  22. 22. What I expect from you You know a little bit of Python You know what SQL is 7/49
  23. 23. What I expect from you You know a little bit of Python You know what SQL is You won't hesitate to ask questions :-) 7/49
  24. 24. What is Spark Spark is a unified analytics engine for large-scale data processing 8/49
  25. 25. What is Spark (bis) Spark can be thought of as a data factory that you (mostly) program like a cohesive computer. 9/49
  26. 26. Spark under the hood 10/49
  27. 27. Spark as an analytics factory 11/49
  28. 28. Why is PySpark cool? 12/49
  29. 29. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count() ) 13/49
  30. 30. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count() ) select 13/49
  31. 31. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count() ) where 13/49
  32. 32. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count() ) group by 13/49
  33. 33. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count() ) count 13/49
  34. 34. I mean, you can legitimately use SQL spark.sql(""" select count(*) from ( select id, first_name, last_name, age from my_table where age > 21 ) group by age""") 14/49
  35. 35. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) 15/49
  36. 36. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Read a text file 15/49
  37. 37. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Select the column value, where each element is split (space as a separator). Alias to line. 15/49
  38. 38. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Explode each element of line into its own record. Alias to word. 15/49
  39. 39. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Lower-case each word 15/49
  40. 40. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Extract only the first group of lower-case letters from each word. 15/49
  41. 41. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Keep only the records where the word is not the empty string. 15/49
  42. 42. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Group by word 15/49
  43. 43. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Count the number of records in each group 15/49
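
To round off the word-count example, here is a rough sketch (not in the deck) of how you would trigger the computation on the results frame built above; ordering-and-showing, or writing to disk, are actions, so this is where Spark actually runs the plan (the output path is hypothetical):

    # Actions that materialize the word-count plan assembled on the slides above.
    results.orderBy("count", ascending=False).show(10)   # top 10 most frequent words

    # Writing the counts out is another action (hypothetical output path).
    results.write.mode("overwrite").csv("./output/word_counts")
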
  44. 44. Scala is not the only player in town 16/49
  45. 45. Let's code! 17/49
  46. 46. 18/49
  47. 47. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() 19/49
  48. 48. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() A SparkSession is your entry point to distributed data manipulation 19/49
  49. 49. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() We create our SparkSession with an optional library to access BigQuery as a data source. 19/49
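
If you just want to experiment with PySpark locally, without the BigQuery connector, a bare-bones session is enough. A minimal sketch (the app name is made up):

    # Minimal local SparkSession, no extra packages (sketch, not from the deck).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pyspark-intro")   # hypothetical app name
        .master("local[*]")         # run locally on all available cores
        .getOrCreate()
    )
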
  50. 50. Reading data from functools import reduce from pyspark.sql import DataFrame     def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] ) ) 20/49
  51. 51. Reading data from functools import reduce from pyspark.sql import DataFrame     def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] ) ) We create a helper function to read our data from BigQuery. 20/49
  52. 52. Reading data def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] ) ) A DataFrame is a regular Python object. 20/49
  53. 53. Using the power of the schema gsod.printSchema() # root # |-- stn: string (nullable = true) # |-- wban: string (nullable = true) # |-- year: string (nullable = true) # |-- mo: string (nullable = true) # |-- da: string (nullable = true) # |-- temp: double (nullable = true) # |-- count_temp: long (nullable = true) # |-- dewp: double (nullable = true) # |-- count_dewp: long (nullable = true) # |-- slp: double (nullable = true) # |-- count_slp: long (nullable = true) # |-- stp: double (nullable = true) # |-- count_stp: long (nullable = true) # |-- visib: double (nullable = true) # [...] 21/49
  54. 54. Using the power of the schema gsod.printSchema() # root # |-- stn: string (nullable = true) # |-- wban: string (nullable = true) # |-- year: string (nullable = true) # |-- mo: string (nullable = true) # |-- da: string (nullable = true) # |-- temp: double (nullable = true) # |-- count_temp: long (nullable = true) # |-- dewp: double (nullable = true) # |-- count_dewp: long (nullable = true) # |-- slp: double (nullable = true) # |-- count_slp: long (nullable = true) # |-- stp: double (nullable = true) # |-- count_stp: long (nullable = true) # |-- visib: double (nullable = true) # [...] The schema will give us the column names and their types. 21/49
  55. 55. And showing data gsod = gsod.select("stn", "year", "mo", "da", "temp")   gsod.show(5)   # Approximately 5 seconds waiting # +------+----+---+---+----+ # | stn|year| mo| da|temp| # +------+----+---+---+----+ # |359250|2010| 02| 25|25.2| # |359250|2010| 05| 25|65.0| # |386130|2010| 02| 19|35.4| # |386130|2010| 03| 15|52.2| # |386130|2010| 01| 21|37.9| # +------+----+---+---+----+ # only showing top 5 rows 22/49
  56. 56. What happens behind the scenes? 23/49
  57. 57. Any data frame transformation is recorded, but not executed, until we actually need the data. Then, when we trigger an action, (Py)Spark optimizes the query plan, selects the best physical plan, and applies the transformations to the data. 24/49
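
As a small illustration of that laziness (a sketch assuming a DataFrame df and pyspark.sql.functions imported as F, as in the earlier slides): transformations only build up the plan, and nothing runs until an action is called.

    import pyspark.sql.functions as F

    # A transformation: returns immediately, nothing is computed yet.
    filtered = df.select("word").where(F.col("word") != "")

    # The plan can be inspected without running it.
    filtered.explain()

    # Only an action (count, show, write, toPandas, ...) triggers the optimizer
    # and the actual distributed job.
    n = filtered.count()
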
  58. 58. Transformations Actions 25/49
  59. 59. Transformations select Actions 25/49
  60. 60. Transformations select filter Actions 25/49
  61. 61. Transformations select filter group by Actions 25/49
  62. 62. Transformations select filter group by partition Actions 25/49
  63. 63. Transformations select filter group by partition Actions write 25/49
  64. 64. Transformations select filter group by partition Actions write show 25/49
  65. 65. Transformations select filter group by partition Actions write show count 25/49
  66. 66. Transformations select filter group by partition Actions write show count toPandas 25/49
  67. 67. Something a little more complex import pyspark.sql.functions as F   stations = ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.stations") .option("credentialsFile", "bq-key.json") .load() )   # We want to get the "hottest Countries" that have at least 60 measures answer = ( gsod.join(stations, gsod["stn"] == stations["usaf"]) .where(F.col("country").isNotNull()) .groupBy("country") .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count")) ).where(F.col("count") > 12 * 5) read, join, where, groupby, avg/count, where, orderby, show 26/49
  68. 68. read, join, where, groupby, avg/count, where, orderby, show 27/49
  69. 69. read, join, where, groupby, avg/count, where, orderby, show 28/49
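
The code two slides back stops right after the second where; a hedged completion of the remaining orderby/show steps in that chain would look roughly like this:

    # orderBy is one more transformation; show(5) is the action that runs the query.
    answer.orderBy("avg_temp", ascending=False).show(5)
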
  70. 70. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) 29/49
  71. 71. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) We register the data frames as Spark SQL tables. 29/49
  72. 72. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) We can then query using SQL without leaving Python! 29/49
  73. 73. Python and SQL! ( spark.sql( """ select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf group by country""" ) .where("country is not null") .where("count > (12 * 5)") .orderBy("avg_temp", ascending=False) .show(5) ) 30/49
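
Another way to mix the two, not shown in the deck, is to embed SQL fragments directly in DataFrame code through string predicates and F.expr, reusing the gsod and stations frames from the earlier slides. A sketch:

    import pyspark.sql.functions as F

    # The same "hottest countries" query, with SQL snippets inside the fluent API.
    hottest = (
        gsod.join(stations, gsod["stn"] == stations["usaf"])
        .where("country is not null")      # SQL predicate as a string
        .groupBy("country")
        .agg(F.expr("avg(temp) as avg_temp"), F.expr("count(*) as count"))
        .where("count > 12 * 5")
        .orderBy("avg_temp", ascending=False)
    )
    hottest.show(5)
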
  74. 74. Python ⇄Spark 31/49
  75. 75. 32/49
  76. 76. 33/49
  77. 77. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 34/49
  78. 78. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 PySpark types are objects in the pyspark.sql.types module. 34/49
  79. 79. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 We promote a regular Python function to a User Defined Function via a decorator. 34/49
  80. 80. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 A simple function on pandas Series 34/49
  81. 81. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 Still unit-testable :-) 34/49
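
A quick sanity check of that unit-testability claim, on two values with known conversions (a sketch, not from the deck):

    import pandas as pd

    # .func exposes the plain Python function wrapped by pandas_udf,
    # so it can be tested without a SparkSession.
    celsius = f_to_c.func(pd.Series([32.0, 212.0]))
    assert list(celsius) == [0.0, 100.0]   # freezing and boiling points
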
  82. 82. Scalar UDF gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp"))) gsod.select("temp", "temp_c").distinct().show(5)   # +-----+-------------------+ # | temp| temp_c| # +-----+-------------------+ # | 37.2| 2.8888888888888906| # | 85.9| 29.944444444444443| # | 53.5| 11.944444444444445| # | 71.6| 21.999999999999996| # |-27.6|-33.111111111111114| # +-----+-------------------+ # only showing top 5 rows 35/49
  83. 83. Scalar UDF gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp"))) gsod.select("temp", "temp_c").distinct().show(5)   # +-----+-------------------+ # | temp| temp_c| # +-----+-------------------+ # | 37.2| 2.8888888888888906| # | 85.9| 29.944444444444443| # | 53.5| 11.944444444444445| # | 71.6| 21.999999999999996| # |-27.6|-33.111111111111114| # +-----+-------------------+ # only showing top 5 rows A UDF can be used like any PySpark function. 35/49
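
The same UDF can also be exposed to the SQL side; a hedged sketch (re-registering gsod as a temp view first):

    # Register the pandas UDF for use in Spark SQL (sketch, not from the deck).
    spark.udf.register("f_to_c", f_to_c)
    gsod.createOrReplaceTempView("gsod")

    spark.sql("select temp, f_to_c(temp) as temp_c from gsod").show(5)
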
  84. 84. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) 36/49
  85. 85. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) A regular, fun, harmless function on (pandas) DataFrames 36/49
  86. 86. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) 36/49
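
Because scale_temperature is plain pandas, it can be unit-tested on a toy DataFrame before ever touching Spark; for example (made-up data):

    import pandas as pd

    # Three days at one station: the min maps to 0.0, the midpoint to 0.5, the max to 1.0.
    toy = pd.DataFrame(
        {
            "stn": ["X"] * 3,
            "year": ["2010"] * 3,
            "mo": ["01"] * 3,
            "da": ["01", "02", "03"],
            "temp": [10.0, 20.0, 30.0],
        }
    )

    out = scale_temperature(toy)
    assert list(out["temp_norm"]) == [0.0, 0.5, 1.0]
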
  87. 87. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| 37/49
  88. 88. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| We provide PySpark the schema we expect our function to return 37/49
  89. 89. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| We just have to partition (using groupby), and then applyInPandas! 37/49
  90. 90. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| 37/49
  91. 91. 38/49
  92. 92. You are not limited library-wise from sklearn.linear_model import LinearRegression     @F.pandas_udf(T.DoubleType()) def rate_of_change_temperature( day: pd.Series, temp: pd.Series ) -> float: """Returns the slope of the daily temperature for a given period of time.""" return ( LinearRegression() .fit(X=day.astype("int").values.reshape(-1, 1), y=temp) .coef_[0] ) 39/49
  93. 93. result = gsod.groupby("stn", "year", "mo").agg( rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias( "rt_chg_temp" ) )   result.show(5, False) # +------+----+---+---------------------+ # |stn |year|mo |rt_chg_temp | # +------+----+---+---------------------+ # |010250|2018|12 |-0.01014397905759162 | # |011120|2018|11 |-0.01704736746691528 | # |011150|2018|10 |-0.013510329829648423| # |011510|2018|03 |0.020159116598556657 | # |011800|2018|06 |0.012645501680677372 | # +------+----+---+---------------------+ # only showing top 5 rows 40/49
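
Since the aggregated result is small, it can be pulled back to the driver for plotting or further analysis; toPandas is an action, so this is the point where the job actually runs. A sketch:

    # Collect the per-station monthly temperature slopes as a pandas DataFrame.
    local_result = result.orderBy("rt_chg_temp").toPandas()
    print(local_result.head())
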
  94. 94. 41/49
  95. 95. Not a fan of the syntax? 42/49
  96. 96. 43/49
  97. 97. From the README.md import databricks.koalas as ks import pandas as pd   pdf = pd.DataFrame( { 'x':range(3), 'y':['a','b','b'], 'z':['a','b','b'], } )   # Create a Koalas DataFrame from pandas DataFrame df = ks.from_pandas(pdf)   # Rename the columns df.columns = ['x', 'y', 'z1'] 44/49
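
From there, the usual pandas-style operations run against the distributed frame; a minimal sketch, assuming the koalas package is installed:

    # pandas-style syntax, Spark execution underneath (sketch, not from the deck).
    print(df.head())                      # first rows of the distributed DataFrame
    print(df.groupby("y")["x"].sum())     # grouped aggregation, pandas-style
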
  98. 98. (Py)Spark in the cloud 45/49
  99. 99. 46/49
  100. 100. "Serverless" Spark? 47/49
  101. 101. "Serverless" Spark? Cost-effective for sporadic runs 47/49
  102. 102. "Serverless" Spark? Cost-effective for sporadic runs Scales easily 47/49
  103. 103. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance 47/49
  104. 104. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive 47/49
  105. 105. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive Sometimes confusing pricing model 47/49
  106. 106. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive Sometimes confusing pricing model Uneven documentation 47/49
  107. 107. Thank you! 48/49
  108. 108. Join www.ServerlessToronto.org Home of “Less IT Mess”
