Pyt(h)on vs. the Elephant: The Current State of Big Data Processing in Python
Jakub Nowacki
Yosh.AI, SigDelta, Sages
whoami
CTO @ Yosh.AI (yosh.ai)
Lead Data Scientist @ SigDelta (sigdelta.com)
Trainer @ Sages (sages.com.pl)
I can code, I do maths
@jsnowacki
How did it use to be?
Apache Spark!
Spark RDD
(sc.textFile("hdfs://...")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .saveAsTextFile("hdfs://..."))
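The same three steps can be traced in plain Python without a cluster. This is only a sketch of what flatMap, map and reduceByKey compute, not of how Spark executes them; the helper name `word_count` and the sample input are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

def word_count(lines):
    """Mimic flatMap -> map -> reduceByKey on a local list of strings."""
    # flatMap: split every line and flatten the results into one stream of words
    words = chain.from_iterable(line.split() for line in lines)
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: combine the values of equal keys with a + b
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] = counts[word] + one
    return dict(counts)

lines = ["to be or not", "to be"]  # hypothetical input
print(word_count(lines))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```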
Spark RDD: what runs where?
Source: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Spark SQL - DataFrame
Source: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
from pyspark.sql.functions import desc, explode, split

(spark.read.text('hdfs://...')
    .select(explode(split('value', r'\W+')).alias('word'))
    .groupBy('word')
    .count()
    .orderBy(desc('count'))
    .write.parquet('hdfs://...'))
UDF?!
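Solution: Java/Scala UDFs in Python: http://sigdelta.com/blog/scala-spark-udfs-in-python/
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def add_one(x):
    # Null inputs fall through and yield None (a null in the result column)
    if x is not None:
        return x + 1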
Vectorized UDFs
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

def multiply_func(a, b):
    return a * b

# Wrap the plain function as a vectorized (pandas) UDF
multiply = pandas_udf(multiply_func, returnType=LongType())

# The function works on local pandas Series...
pdf = pd.DataFrame([1, 2, 3], columns=["x"])
print(multiply_func(pdf.x, pdf.x))
# 0    1
# 1    4
# 2    9
# dtype: int64

# ...and, wrapped as a UDF, on Spark columns
df = spark.createDataFrame(pdf)
df.select(multiply(col("x"), col("x"))).show()
# +-------------------+
# |multiply_func(x, x)|
# +-------------------+
# |                  1|
# |                  4|
# |                  9|
# +-------------------+
Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
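The gain from vectorized UDFs comes from calling the Python function once per batch of rows instead of once per row. The sketch below counts calls in plain Python to make that visible. The helpers `apply_per_row` and `apply_per_batch`, the batch size of 4 and the sample rows are illustrative assumptions; lists stand in for the pandas Series that a real pandas UDF receives.

```python
def apply_per_row(func, rows):
    """Row-at-a-time UDF style: one Python call per row."""
    calls = 0
    out = []
    for a, b in rows:
        out.append(func(a, b))
        calls += 1
    return out, calls

def apply_per_batch(func, rows, batch_size=4):
    """Vectorized style: one Python call per batch of rows."""
    calls = 0
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        col_a = [a for a, _ in batch]
        col_b = [b for _, b in batch]
        out.extend(func(col_a, col_b))
        calls += 1
    return out, calls

def multiply(a, b):
    return a * b

def multiply_batch(col_a, col_b):
    # Stand-in for multiplying two whole columns at once
    return [a * b for a, b in zip(col_a, col_b)]

rows = [(x, x) for x in range(1, 9)]  # 8 hypothetical rows
row_result, row_calls = apply_per_row(multiply, rows)
batch_result, batch_calls = apply_per_batch(multiply_batch, rows)
print(row_calls, batch_calls)  # 8 2
```

Both variants return the same values; only the number of Python-level calls differs.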
Spark Structured Streaming
Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
PySpark on PyPI
pip install pyspark
conda install pyspark
...
Description: http://sigdelta.com/blog/how-to-install-pyspark-locally/
Dask!
Source: https://dask.pydata.org/
Dask Array
Source: https://dask.pydata.org/
import dask.array as da
import numpy as np
x = da.ones(10, chunks=(5,))
y = np.ones(10)
z = x + y
print(z)
# dask.array<add, shape=(10,),
# … dtype=float64, chunksize=(5,)>
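Printing z shows only metadata because Dask builds a task graph and computes nothing until asked. The toy evaluator below follows the idea of a dict-based graph, in which each key maps either to plain data or to a tuple of a callable and its arguments. The function `toy_get`, the helpers and the key names are hypothetical, and this is not Dask's scheduler.

```python
def toy_get(graph, key):
    """Recursively evaluate one key of a dict-based task graph."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are computed first
        resolved = [toy_get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task  # plain data, nothing to compute

def ones(n):
    return [1.0] * n

def add_lists(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

def concat(*chunks):
    return [v for chunk in chunks for v in chunk]

# Two chunks of five elements each, mirroring chunks=(5,) on a length-10 array
graph = {
    "x-0": (ones, 5),
    "x-1": (ones, 5),
    "y-0": [1.0] * 5,
    "y-1": [1.0] * 5,
    "z-0": (add_lists, "x-0", "y-0"),
    "z-1": (add_lists, "x-1", "y-1"),
    "z": (concat, "z-0", "z-1"),
}

result = toy_get(graph, "z")  # nothing ran until this call
print(result)
```

Because the two z chunks do not depend on each other, a real scheduler is free to compute them in parallel.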
Dask DataFrame
Source: https://dask.pydata.org/
import dask.dataframe as dd

posts = (dd.read_parquet('data/posts_tags.parq')
           .set_index('id'))
posts_count = (posts.creation_date.dt.date
                    .value_counts())
posts_count_df = posts_count.compute()
posts_count_df.head()
# 2017-08-23 9531
# 2017-07-27 9450
# 2017-08-24 9366
# 2017-08-03 9345
# 2017-03-22 9342
# Name: creation_date, dtype: int64
Example: http://sigdelta.com/blog/stackpverflow-tags-with-dask/
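A count like this parallelizes well because each partition can be counted independently and the partial counts merged afterwards. Below is a minimal standard-library sketch of that split-apply-combine pattern. The partitions and the helper `partitioned_value_counts` are made up, and Dask's actual implementation may differ.

```python
from collections import Counter
from datetime import date

def partitioned_value_counts(partitions):
    """Count values per partition, then merge the partial counts."""
    # Independent work: one Counter per partition
    partials = [Counter(part) for part in partitions]
    # Combine step: add the partial counts together
    total = Counter()
    for partial in partials:
        total.update(partial)
    # Sorted by count, descending
    return total.most_common()

partitions = [  # hypothetical creation dates, split into two partitions
    [date(2017, 8, 23), date(2017, 8, 23), date(2017, 7, 27)],
    [date(2017, 8, 23), date(2017, 7, 27), date(2017, 8, 24)],
]
counts = partitioned_value_counts(partitions)
print(counts[0])  # (datetime.date(2017, 8, 23), 3)
```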
Dask Bag
Example: http://sigdelta.com/blog/dask-introduction/
import dask.bag as db
tags_xml = db.read_text('data/Tags.xml', encoding='utf-8')
tags_xml.take(5)
# ('\ufeff<?xml version="1.0" encoding="utf-8"?>\n',
#  '<tags>\n',
#  '  <row Id="1" TagName=".net" Count="257092" … />\n',
#  '  <row Id="2" TagName="html" Count="683981" … />\n',
#  '  <row Id="3" TagName="javascript" Count="1457944" … />\n')
tags_rows = tags_xml.filter(lambda line: line.find('<row') >= 0)
tags_rows.take(5)
# ('  <row Id="1" TagName=".net" Count="257092" … />\n',
#  '  <row Id="2" TagName="html" Count="683981" … />\n',
#  '  <row Id="3" TagName="javascript" Count="1457944" … />\n',
#  '  <row Id="4" TagName="css" Count="490198" … />\n',
#  '  <row Id="5" TagName="php" Count="1114030" … />\n')
tags = tags_rows.map(extract_tags_columns).to_dataframe()
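The slide does not show `extract_tags_columns`. One way such a helper might look, assuming each line is a self-closing `<row ... />` element carrying the `Id`, `TagName` and `Count` attributes seen in the output above (the returned field names are guesses):

```python
import xml.etree.ElementTree as ET

def extract_tags_columns(line):
    """Parse one '<row ... />' line into a dict (hypothetical helper)."""
    row = ET.fromstring(line.strip())
    return {
        "id": int(row.attrib["Id"]),
        "tag_name": row.attrib["TagName"],
        "count": int(row.attrib["Count"]),
    }

sample = '  <row Id="1" TagName=".net" Count="257092" />\n'
print(extract_tags_columns(sample))
# {'id': 1, 'tag_name': '.net', 'count': 257092}
```

Returning one flat dict per record is what lets `to_dataframe()` infer column names from the keys.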
On a single machine or on many
Source: https://dask.pydata.org/
Dask?!
Source: https://github.com/dask/dask/issues/3038 (fixed)
...
t.reset_index().head()
# ------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-100-e6186d78fb03> in <module>()
# ----> 1 t.reset_index().head()
#
# ...
#
# ValueError: Length mismatch: Expected axis has 3 elements, new
# values have 2 elements
Ray
Source: https://rise.cs.berkeley.edu/blog/pandas-on-ray/
# import pandas as pd
import ray.dataframe as pd
stocks_df = pd.read_csv("all_stocks_5yr.csv")
print(type(stocks_df))
# <class 'ray.dataframe.dataframe.DataFrame'>
positive_stocks_df = stocks_df.query("close > open")
print(positive_stocks_df['date'].head(n=5))
# 0 2013-02-13
# 1 2013-02-15
# 2 2013-02-26
# 3 2013-02-27
# 4 2013-03-01
import time
import ray

@ray.remote
def f():
    time.sleep(1)
    return 1

ray.init()
# Launch four tasks in parallel, then block until all results arrive
results = ray.get([
    f.remote()
    for i in range(4)
])
Source: http://ray.readthedocs.io/en/latest/index.html
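The remote-call pattern (submit several tasks, then gather the results) can be tried without a Ray installation. The sketch below swaps in `concurrent.futures` from the standard library as a stand-in, so it is an analogy and not Ray's API; the shortened sleep is also an assumption made to keep the example quick.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def f():
    time.sleep(0.2)  # shortened from 1 s
    return 1

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submitting returns immediately with a future, similar in spirit to f.remote()
    futures = [pool.submit(f) for _ in range(4)]
    # Collecting blocks until each task is done, similar in spirit to ray.get([...])
    results = [fut.result() for fut in futures]

print(results)  # [1, 1, 1, 1]
```

Threads are enough here because the tasks only sleep; CPU-bound work would call for processes or for a framework such as Ray.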
Apache Arrow
Source: https://arrow.apache.org/
Cloud platforms
Google Cloud Platform
Google BigQuery
Google BigQuery vs Pandas
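# pip install pandas-gbq
import pandas as pd

projectid = "xxxxxxxx"
# Read a query result into a DataFrame
df = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                 project_id=projectid,
                 index_col='index_column_name',
                 col_order=['col1', 'col2', 'col3'])
# Write the DataFrame to a table; fail if the table already exists
df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')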
TensorFlow
Source: https://www.tensorflow.org/get_started/premade_estimators
TensorFlow Data
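import tensorflow as tf  # TensorFlow 1.x API

# dataset1 is not defined on the slide; the shapes printed below imply
# float32 elements of shape (10,), for example (an assumption):
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))

dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)   # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)   # ==> "(tf.float32, (tf.float32, tf.int32))"
print(dataset3.output_shapes)  # ==> "(10, ((), (100,)))"

dataset1 = dataset1.map(lambda x: ...)
dataset2 = dataset2.flat_map(lambda x, y: ...)
# Python 3 has no tuple parameters, so the nested (y, z) pair arrives as one argument
dataset3 = dataset3.filter(lambda x, y_z: ...)
Source: https://www.tensorflow.org/programmers_guide/datasets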
TensorFlow GPU & Distributed
Source: https://towardsdatascience.com/using-docker-to-set-up-a-deep-learning-environment-on-aws-6af37a78c551
Source: http://www.pittnuts.com/2016/08/glossary-in-distributed-tensorflow/
TensorFlow Serving
Source: https://www.tensorflow.org/serving/
Source: https://cloud.google.com/products/machine-learning/
What will the future bring? ¯\_(ツ)_/¯
Source: https://www.slideshare.net/AmazonWebServices/introducing-amazon-kinesis-realtime-processing-of-streaming-big-data-bdt103-aws-reinvent-2013
Functional programming
Input → Function → Output
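The RDD, Dask Bag and tf.data examples earlier in the deck all chain functions from input to output in this way. Here is a small sketch of the same shape using only built-ins; the `pipeline` helper and the sample data are invented for the example.

```python
from functools import reduce

def pipeline(*funcs):
    """Compose functions left to right: each output becomes the next input."""
    return lambda data: reduce(lambda acc, func: func(acc), funcs, data)

square_evens_sum = pipeline(
    lambda xs: filter(lambda x: x % 2 == 0, xs),  # keep even numbers
    lambda xs: map(lambda x: x * x, xs),          # square them
    sum,                                          # reduce to one value
)

print(square_evens_sum(range(10)))  # 120
```

Because no stage mutates shared state, each stage could be applied to separate chunks of the input independently, which is what makes this style a good fit for distributed engines.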
SQL
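The Spark word count shown earlier can also be phrased declaratively. Below is a sketch with the standard-library `sqlite3` module standing in for a big-data SQL engine; the table name, column name and sample sentence are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w,) for w in "to be or not to be".split()],
)
# Declarative counterpart of groupBy('word').count().orderBy(desc('count'))
rows = conn.execute(
    "SELECT word, COUNT(*) AS n FROM words "
    "GROUP BY word ORDER BY n DESC, word ASC"
).fetchall()
conn.close()

print(rows)  # [('be', 2), ('to', 2), ('not', 1), ('or', 1)]
```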
Your opinion about my talk is very important to me.
1. Open my talk in the agenda in the Eventory app.
2. Rate my talk and add a comment.
That way I will know what you liked and what I should improve!
4Developers 2018: Pyt(h)on vs. the Elephant: The Current State of Big Data Processing in Python (Jakub Nowacki)

More Related Content

What's hot

scalable machine learning
scalable machine learningscalable machine learning
scalable machine learningSamir Bessalah
 
Anwendungsfaelle für Elasticsearch
Anwendungsfaelle für ElasticsearchAnwendungsfaelle für Elasticsearch
Anwendungsfaelle für ElasticsearchFlorian Hopf
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
 
2014 spark with elastic search
2014   spark with elastic search2014   spark with elastic search
2014 spark with elastic searchHenry Saputra
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark ApplicationsTzach Zohar
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Sematext Group, Inc.
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Presto Overfview
Presto OverfviewPresto Overfview
Presto OverfviewMiguel Ping
 
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Konrad Malawski
 
MongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingMongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingJason Terpko
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
 
Simple search with elastic search
Simple search with elastic searchSimple search with elastic search
Simple search with elastic searchmarkstory
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Bruno Rocha
 
Powershell for Log Analysis and Data Crunching
 Powershell for Log Analysis and Data Crunching Powershell for Log Analysis and Data Crunching
Powershell for Log Analysis and Data CrunchingMichelle D'israeli
 

What's hot (20)

scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 
Anwendungsfaelle für Elasticsearch
Anwendungsfaelle für ElasticsearchAnwendungsfaelle für Elasticsearch
Anwendungsfaelle für Elasticsearch
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
2014 spark with elastic search
2014   spark with elastic search2014   spark with elastic search
2014 spark with elastic search
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
 
Web Scrapping with Python
Web Scrapping with PythonWeb Scrapping with Python
Web Scrapping with Python
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
Presto Overfview
Presto OverfviewPresto Overfview
Presto Overfview
 
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014
 
Spark Meetup
Spark MeetupSpark Meetup
Spark Meetup
 
MongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingMongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and Merging
 
Elk stack @inbot
Elk stack @inbotElk stack @inbot
Elk stack @inbot
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
 
Simple search with elastic search
Simple search with elastic searchSimple search with elastic search
Simple search with elastic search
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014
 
Powershell for Log Analysis and Data Crunching
 Powershell for Log Analysis and Data Crunching Powershell for Log Analysis and Data Crunching
Powershell for Log Analysis and Data Crunching
 

Similar to 4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych w Python (Jakub Nowacki)

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteSriram Natarajan
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkLi Ming Tsai
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 

Similar to 4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych w Python (Jakub Nowacki) (20)

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web site
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
Spark 101
Spark 101Spark 101
Spark 101
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych w Python (Jakub Nowacki)

  • 1. Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych w Python Jakub Nowacki Yosh.AI, SigDelta, Sages
  • 2. whoami CTO @ Yosh.AI (yosh.ai) Lead Data Scientist @ SigDelta (sigdelta.com) Trainer @ Sages (sages.com.pl) I can code, I do maths @jsnowacki
  • 3.
  • 4. Jak to było kiedyś?
  • 6. Spark RDD sc.textFile("hdfs://...") .flatMap(lambda line: line.split()) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://...")
  • 7. Spark RDD – co gdzie? Źródło: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
  • 8. Spark SQL - DataFrame Źródło: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html from pyspark.sql.functions import * spark.read.text('hdfs://...’) .select(explode(split('value', 'W+')).alias('word')) .groupBy('word') .count() .orderBy(desc('count')) .write.parquet('hdfs://...')
  • 9. UDF?! Rozwiązanie - Java/Scala UDF to Python: http://sigdelta.com/blog/scala-spark-udfs-in-python/ from pyspark.sql.types import IntegerType @udf(returnType=IntegerType()) def add_one(x): if x is not None: return x + 1
  • 10. Vectorized UDFs
      import pandas as pd
      from pyspark.sql.functions import col, pandas_udf
      from pyspark.sql.types import LongType
      def multiply_func(a, b):
          return a * b
      multiply = pandas_udf(multiply_func, returnType=LongType())
      pdf = pd.DataFrame([1, 2, 3], columns=["x"])
      print(multiply_func(pdf.x, pdf.x))
      # 0    1
      # 1    4
      # 2    9
      # dtype: int64
      df = spark.createDataFrame(pdf)
      df.select(multiply(col("x"), col("x"))).show()
      # +-------------------+
      # |multiply_func(x, x)|
      # +-------------------+
      # |                  1|
      # |                  4|
      # |                  9|
      # +-------------------+
      Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
  • 11. Spark Structured Streaming Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 12. PySpark on PyPI
      pip install pyspark
      conda install pyspark
      ...
      Description: http://sigdelta.com/blog/how-to-install-pyspark-locally/
  • 14. Dask Array
      Source: https://dask.pydata.org/
      import dask.array as da
      import numpy as np
      x = da.ones(10, chunks=(5,))
      y = np.ones(10)
      z = x + y
      print(z)
      # dask.array<add, shape=(10,),
      # … dtype=float64, chunksize=(5,)>
  • 15. Dask DataFrame
      Source: https://dask.pydata.org/
      import dask.dataframe as dd
      posts = (dd.read_parquet('data/posts_tags.parq')
                 .set_index('id'))
      posts_count = (posts.creation_date.dt.date
                       .value_counts())
      # compute() runs the lazy graph and returns a pandas object
      posts_count_df = posts_count.compute()
      posts_count_df.head()
      # 2017-08-23    9531
      # 2017-07-27    9450
      # 2017-08-24    9366
      # 2017-08-03    9345
      # 2017-03-22    9342
      # Name: creation_date, dtype: int64
      Example: http://sigdelta.com/blog/stackpverflow-tags-with-dask/
  • 16. Dask Bag
      Example: http://sigdelta.com/blog/dask-introduction/
      import dask.bag as db
      tags_xml = db.read_text('data/Tags.xml', encoding='utf-8')
      tags_xml.take(5)
      # ('\ufeff<?xml version="1.0" encoding="utf-8"?>\n',
      #  '<tags>\n',
      #  ' <row Id="1" TagName=".net" Count="257092" … />\n',
      #  ' <row Id="2" TagName="html" Count="683981" … />\n',
      #  ' <row Id="3" TagName="javascript" Count="1457944" … />\n')
      # keep only the lines that hold a <row> element
      tags_rows = tags_xml.filter(lambda line: line.find('<row') >= 0)
      tags_rows.take(5)
      # (' <row Id="1" TagName=".net" Count="257092" … />\n',
      #  ' <row Id="2" TagName="html" Count="683981" … />\n',
      #  ' <row Id="3" TagName="javascript" Count="1457944" … />\n',
      #  ' <row Id="4" TagName="css" Count="490198" … />\n',
      #  ' <row Id="5" TagName="php" Count="1114030" … />\n')
      # extract_tags_columns is not defined on this slide
      tags = tags_rows.map(extract_tags_columns).to_dataframe()
  • 17. On one machine or many Source: https://dask.pydata.org/
  • 18. Dask?!
      Source: https://github.com/dask/dask/issues/3038 (fixed)
      ...
      t.reset_index().head()
      # ------------------------------------------------------------------
      # ValueError                        Traceback (most recent call last)
      # <ipython-input-100-e6186d78fb03> in <module>()
      # ----> 1 t.reset_index().head()
      #
      # ...
      #
      # ValueError: Length mismatch: Expected axis has 3 elements, new
      # values have 2 elements
  • 19. Ray
      Source: https://rise.cs.berkeley.edu/blog/pandas-on-ray/
      # import pandas as pd
      import ray.dataframe as pd
      stocks_df = pd.read_csv("all_stocks_5yr.csv")
      print(type(stocks_df))
      # <class 'ray.dataframe.dataframe.DataFrame'>
      positive_stocks_df = stocks_df.query("close > open")
      print(positive_stocks_df['date'].head(n=5))
      # 0    2013-02-13
      # 1    2013-02-15
      # 2    2013-02-26
      # 3    2013-02-27
      # 4    2013-03-01
      import time
      import ray
      @ray.remote
      def f():
          time.sleep(1)
          return 1
      ray.init()
      # launch four remote tasks and wait for all results
      results = ray.get([f.remote() for i in range(4)])
      Source: http://ray.readthedocs.io/en/latest/index.html
  • 24. Google BigQuery vs Pandas
      # pip install pandas-gbq
      import pandas as pd
      projectid = "xxxxxxxx"
      df = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                       project_id=projectid,
                       index_col='index_column_name',
                       col_order=['col1', 'col2', 'col3'])
      # to_gbq is a DataFrame method, so the frame itself is not passed as an argument
      df.to_gbq('my_dataset.my_table', project_id=projectid, if_exists='fail')
  • 26. TensorFlow Data
      import tensorflow as tf
      # dataset1 is not defined on this slide
      dataset2 = tf.data.Dataset.from_tensor_slices(
          (tf.random_uniform([4]),
           tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
      print(dataset2.output_types)   # ==> "(tf.float32, tf.int32)"
      print(dataset2.output_shapes)  # ==> "((), (100,))"
      dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
      print(dataset3.output_types)   # ==> (tf.float32, (tf.float32, tf.int32))
      print(dataset3.output_shapes)  # ==> "(10, ((), (100,)))"
      dataset1 = dataset1.map(lambda x: ...)
      dataset2 = dataset2.flat_map(lambda x, y: ...)
      # Python 3 has no tuple parameters, so the (y, z) pair arrives as one argument
      dataset3 = dataset3.filter(lambda x, yz: ...)
      Source: https://www.tensorflow.org/programmers_guide/datasets
  • 27. TensorFlow GPU & Distributed Source: https://towardsdatascience.com/using-docker-to-set-up-a-deep-learning-environment-on-aws-6af37a78c551 Source: http://www.pittnuts.com/2016/08/glossary-in-distributed-tensorflow/
  • 28. TensorFlow Serving Source: https://www.tensorflow.org/serving/ Source: https://cloud.google.com/products/machine-learning/
  • 29. What will the future bring? ¯\_(ツ)_/¯ Source: https://www.slideshare.net/AmazonWebServices/introducing-amazon-kinesis-realtime-processing-of-streaming-big-data-bdt103-aws-reinvent-2013
  • 30. Functional programming [diagram, shown four times: Input → Function → Output]
  • 31. SQL
  • 32. Your opinion of my talk is very important to me. 1. Open my talk in the agenda in the Eventory app. 2. Rate my talk and add your comment. That way I will know what you liked and what I should improve!
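
Slide 6 chains flatMap, map and reduceByKey, and slide 30 reduces the whole approach to Input → Function → Output. As a rough illustration of what each stage does, here is a plain-Python sketch of the same word count on an in-memory list. It is not Spark code: the `word_count` function and the sample lines are made up for this example, and nothing here runs in parallel.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def word_count(lines):
    """Count words in the same three stages as the slide 6 pipeline."""
    # flatMap: split every line and flatten the per-line word lists
    words = chain.from_iterable(line.split() for line in lines)
    # map: pair every word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: group the values by key, then fold each group with a + b
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return {
        word: reduce(lambda a, b: a + b, ones)
        for word, ones in grouped.items()
    }


sample_lines = ["python vs elephant", "python and spark"]  # made-up input
counts = word_count(sample_lines)
print(counts)
```

Each stage is a pure function of its input, which is what lets an engine such as Spark split the work across partitions.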
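
Slide 16 maps `extract_tags_columns` over the `<row … />` lines but does not show the function. The version below is a hypothetical stand-in, not the author's code: it assumes each line is a self-contained XML element and that only the `Id`, `TagName` and `Count` attributes visible on the slide are needed. The output key names are also an assumption.

```python
import xml.etree.ElementTree as ET


def extract_tags_columns(line):
    """Parse one '<row ... />' line into a dict (hypothetical helper)."""
    row = ET.fromstring(line.strip())
    return {
        "id": int(row.attrib["Id"]),
        "tag_name": row.attrib["TagName"],
        "count": int(row.attrib["Count"]),
    }


# Sample line modelled on the slide output, without the elided attributes
sample = ' <row Id="1" TagName=".net" Count="257092" />\n'
record = extract_tags_columns(sample)
print(record)
```

Returning one dict per row is a shape that a Dask Bag can turn into a DataFrame, with the keys becoming column names.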
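
Slide 31 names SQL as the other lasting interface, and the DataFrame query on slide 8 is essentially a GROUP BY with a descending sort. A minimal sketch of that same aggregation in plain SQL follows, using the standard-library `sqlite3` module instead of Spark. The table name `words`, the alias `n` and the input lines are invented for this example, and dropping empty tokens is an extra step that the slide does not perform.

```python
import re
import sqlite3

lines = ["python vs elephant", "python and spark"]  # made-up input

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
# Split on runs of non-word characters, as on slide 8, and drop empty tokens
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w,) for line in lines for w in re.split(r"\W+", line) if w],
)
# groupBy('word').count().orderBy(desc('count')) expressed as SQL;
# the secondary sort on word only makes the output order deterministic
rows = conn.execute(
    "SELECT word, COUNT(*) AS n FROM words "
    "GROUP BY word ORDER BY n DESC, word"
).fetchall()
conn.close()
print(rows)
```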