Diving into Catalyst

Dataset and DataFrame have become the preferred interfaces for working with Spark, largely thanks to the active development of the Catalyst query optimizer. In this talk we look at the motivation behind Spark SQL and see why it matters so much for PySpark. We also examine in detail how Catalyst works internally and how its functionality can be extended.

Павел Клеменков
p.klemenkov@rambler-co.ru

192.168.0.38 WARNING Something bad could happen
192.168.0.88 INFO Just an info message passing by
192.168.0.5 WARNING Something bad could happen
192.168.0.36 ERROR When production fails in despair, whom you're gonna call?
192.168.0.27 INFO Just an info message passing by

192.168.0.38 USA
192.168.0.88 RUSSIA
192.168.0.5 CHINA
192.168.0.36 USA
192.168.0.27 RUSSIA

SELECT country, code FROM table1
JOIN table2
WHERE table1.ip = table2.ip
AND table1.code != "INFO"

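rdd1 = sc.textFile(...)  # log file; the path is omitted on the slide
rdd2 = sc.textFile(...)  # ip-to-country file; the path is omitted on the slide
# Split on tabs and key both datasets by ip so that join() can match them
table1 = rdd1.map(lambda x: x.split("\t")).map(lambda r: (r[0], r[1]))  # (ip, code)
table2 = rdd2.map(lambda x: x.split("\t")).map(lambda r: (r[0], r[1]))  # (ip, country)
(table1.join(table2)                        # (ip, (code, country))
    .map(lambda kv: (kv[1][1], kv[1][0]))   # (country, code)
    .filter(lambda cc: cc[1] != "INFO")     # drop INFO messages
    .collect())

[Diagram: computation scheme]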
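To check what the RDD pipeline above computes without a cluster, here is a minimal plain-Python sketch over the sample rows from the slides. The helper `rdd_style_join` is a hypothetical stand-in for an inner join on the first element of each pair; it is not Spark code, and the message column is dropped for brevity.

```python
from collections import defaultdict

# Sample rows from the slides, already split into fields.
table1 = [  # (ip, code)
    ("192.168.0.38", "WARNING"),
    ("192.168.0.88", "INFO"),
    ("192.168.0.5", "WARNING"),
    ("192.168.0.36", "ERROR"),
    ("192.168.0.27", "INFO"),
]
table2 = [  # (ip, country)
    ("192.168.0.38", "USA"),
    ("192.168.0.88", "RUSSIA"),
    ("192.168.0.5", "CHINA"),
    ("192.168.0.36", "USA"),
    ("192.168.0.27", "RUSSIA"),
]


def rdd_style_join(left, right):
    """Inner join of two lists of (key, value) pairs.

    Returns (key, (left_value, right_value)) tuples in the order of `left`.
    """
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


joined = rdd_style_join(table1, table2)                      # (ip, (code, country))
pairs = [(country, code) for _, (code, country) in joined]   # (country, code)
result = [(country, code) for country, code in pairs if code != "INFO"]
print(result)
```

The filter runs only after the full join here, exactly as written. With RDDs nothing can move it earlier, because the lambdas are opaque to Spark.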

SELECT country,
count(country) as messages
FROM table1 JOIN table2
WHERE table1.ip = table2.ip
AND table1.code != "INFO"
GROUP BY country
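As a quick sanity check of what this query returns on the sample rows, the sketch below runs it in SQLite from the Python standard library. SQLite only stands in for Spark SQL here; the single-quoted 'INFO' literal and the ORDER BY clause are additions made for portability and a deterministic row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (ip TEXT, code TEXT, msg TEXT)")
conn.execute("CREATE TABLE table2 (ip TEXT, country TEXT)")

# Sample rows from the slides.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("192.168.0.38", "WARNING", "Something bad could happen"),
        ("192.168.0.88", "INFO", "Just an info message passing by"),
        ("192.168.0.5", "WARNING", "Something bad could happen"),
        ("192.168.0.36", "ERROR", "When production fails in despair, whom you're gonna call?"),
        ("192.168.0.27", "INFO", "Just an info message passing by"),
    ],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [
        ("192.168.0.38", "USA"),
        ("192.168.0.88", "RUSSIA"),
        ("192.168.0.5", "CHINA"),
        ("192.168.0.36", "USA"),
        ("192.168.0.27", "RUSSIA"),
    ],
)

query = """
SELECT country, count(country) AS messages
FROM table1 JOIN table2
WHERE table1.ip = table2.ip
  AND table1.code != 'INFO'
GROUP BY country
ORDER BY country
"""
rows = conn.execute(query).fetchall()
print(rows)
```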

spark.sql(query).explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...

abstract class TreeNode[BaseType <: TreeNode[BaseType]]
extends Product

case class Literal (value: Any, dataType: DataType)
extends LeafExpression with CodegenFallback
abstract class Attribute
extends LeafExpression with NamedExpression
with NullIntolerant
case class Add(left: Expression, right: Expression)
extends BinaryArithmetic with NullIntolerant

x + (1 + 2)

Add
:- Attrib(x)
+- Add
   :- Literal(1)
   +- Literal(2)
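The expression tree for x + (1 + 2) can be modelled in a few lines of Python. This is a toy sketch, not the Scala classes from the slide: the three dataclasses and the `tree_lines` helper are hypothetical and only mimic the `:-` / `+-` layout that Spark uses when printing plans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str

    def children(self):
        return ()


@dataclass(frozen=True)
class Literal:
    value: int

    def children(self):
        return ()


@dataclass(frozen=True)
class Add:
    left: object
    right: object

    def children(self):
        return (self.left, self.right)


def label(node):
    """Short text shown for a single node."""
    if isinstance(node, Attribute):
        return f"Attribute({node.name})"
    if isinstance(node, Literal):
        return f"Literal({node.value})"
    return "Add"


def tree_lines(node):
    """Render a tree one node per line; the last child gets '+-', the others ':-'."""
    lines = [label(node)]
    kids = node.children()
    for i, child in enumerate(kids):
        last = i == len(kids) - 1
        branch, pad = ("+- ", "   ") if last else (":- ", ":  ")
        sub = tree_lines(child)
        lines.append(branch + sub[0])
        lines.extend(pad + line for line in sub[1:])
    return lines


expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))  # x + (1 + 2)
print("\n".join(tree_lines(expr)))
```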

abstract class TreeNode[BaseType <: TreeNode[BaseType]]
extends Product {
def transform(rule: PartialFunction[BaseType, BaseType]):
BaseType = {}
}

tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
}

Before:
Add
:- Attrib(x)
+- Add
   :- Literal(1)
   +- Literal(2)

After:
Add
:- Attrib(x)
+- Literal(3)
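The same rewrite can be sketched in Python to show the mechanics of `transform`: a rule is a partial function, applied to a node first and then to the children of whatever the rule returned. The classes and both functions below are hypothetical toy versions; returning `None` is used here to mean "the rule does not apply to this node".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform(node, rule):
    """Apply `rule` to `node`, then recurse into the children of the result (pre-order)."""
    rewritten = rule(node)
    current = node if rewritten is None else rewritten
    if isinstance(current, Add):
        return Add(transform(current.left, rule), transform(current.right, rule))
    return current  # leaves have no children to visit


def fold_constants(node):
    """Replace Add(Literal, Literal) with a single Literal; None means 'no match'."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))  # x + (1 + 2)
folded = transform(tree, fold_constants)
print(folded)
```

A single pre-order pass does not fully fold nested cases such as (1 + 2) + 3, since the outer Add is visited before its children are rewritten. Catalyst deals with this by running batches of rules repeatedly until the tree stops changing.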

SELECT country,
count(country) as messages
FROM table1 JOIN table2
WHERE table1.ip = table2.ip
AND table1.code != "INFO"
GROUP BY country
Aggregate
+- Project
   +- Filter
      +- Join
         :- Scan
         +- Scan

object PushDownPredicate extends Rule[LogicalPlan]
with PredicateHelper {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {}
}

Before:
Aggregate
+- Project
   +- Filter
      +- Join
         :- Scan
         +- Scan

After (Filter pushed below the Join):
Aggregate
+- Project
   +- Join
      :- Filter
      :  +- Scan
      +- Scan
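A predicate pushdown rule of this kind can be sketched on a toy logical plan. Everything below is hypothetical illustration code, not the body of `PushDownPredicate`: predicates carry the set of tables they mention, and the rule assumes an inner join, where moving a single-table predicate below the join does not change the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Predicate:
    text: str
    tables: frozenset  # names of the tables the predicate refers to


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition: tuple = ()


@dataclass(frozen=True)
class Filter:
    predicates: tuple
    child: object


def tables_of(plan):
    """Set of table names produced by a plan."""
    if isinstance(plan, Scan):
        return frozenset({plan.table})
    if isinstance(plan, Filter):
        return tables_of(plan.child)
    return tables_of(plan.left) | tables_of(plan.right)


def with_filter(predicates, child):
    """Wrap `child` in a Filter only if there is something to filter on."""
    return Filter(tuple(predicates), child) if predicates else child


def push_down(plan):
    """Rewrite Filter(Join(l, r)): one-sided predicates move below the join,
    the remaining ones become the join condition. Other plans are returned unchanged."""
    if not (isinstance(plan, Filter) and isinstance(plan.child, Join)):
        return plan
    join = plan.child
    left_tables, right_tables = tables_of(join.left), tables_of(join.right)
    left = [p for p in plan.predicates if p.tables <= left_tables]
    right = [p for p in plan.predicates if p.tables <= right_tables and p not in left]
    rest = [p for p in plan.predicates if p not in left and p not in right]
    return Join(
        with_filter(left, join.left),
        with_filter(right, join.right),
        join.condition + tuple(rest),
    )


ip_match = Predicate("table1.ip = table2.ip", frozenset({"table1", "table2"}))
not_info = Predicate("table1.code != 'INFO'", frozenset({"table1"}))
plan = Filter((ip_match, not_info), Join(Scan("table1"), Scan("table2")))
optimized = push_down(plan)
print(optimized)
```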

Logical plan -> Physical plan
Scan -> FileScan
Join -> BroadcastHashJoin
Filter -> Filter
Project -> Project
Aggregate -> HashAggregate
Scan -> FileScan
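How a logical Join might end up as a BroadcastHashJoin can be sketched as a size-based choice. This is a simplified illustration under stated assumptions: the classes, the size estimates and `plan_physical` are hypothetical, the 10 MB constant mirrors the default of Spark's `spark.sql.autoBroadcastJoinThreshold` setting, and SortMergeJoin is used as the fallback operator.

```python
from dataclasses import dataclass

# Assumed threshold, mirroring the default of spark.sql.autoBroadcastJoinThreshold.
BROADCAST_THRESHOLD = 10 * 1024 * 1024


@dataclass(frozen=True)
class Scan:
    table: str
    size_in_bytes: int  # hypothetical size estimate for the table


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def estimated_size(node):
    """Crude size estimate: a scan knows its size, a join sums its inputs."""
    if isinstance(node, Scan):
        return node.size_in_bytes
    return estimated_size(node.left) + estimated_size(node.right)


def plan_physical(node):
    """Return a nested tuple naming the physical operator chosen for each logical node."""
    if isinstance(node, Scan):
        return ("FileScan", node.table)
    smaller = min(estimated_size(node.left), estimated_size(node.right))
    operator = "BroadcastHashJoin" if smaller <= BROADCAST_THRESHOLD else "SortMergeJoin"
    return (operator, plan_physical(node.left), plan_physical(node.right))


# A large log table joined with a tiny ip-to-country table: broadcast the small side.
small_join = plan_physical(Join(Scan("table1", 50 * 1024**3), Scan("table2", 1024**2)))
# Two large tables: neither side fits under the threshold.
big_join = plan_physical(Join(Scan("table1", 50 * 1024**3), Scan("table2", 20 * 1024**3)))
print(small_join)
print(big_join)
```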

== Parsed Logical Plan ==
'Aggregate ['country], ['country, 'count('country) AS messages#14]
+- 'Filter (('table1.ip = 'table2.ip) && NOT ('table1.code = INFO))
   +- 'Join Inner
      :- 'UnresolvedRelation `table1`
      +- 'UnresolvedRelation `table2`

== Analyzed Logical Plan ==
country: string, messages: bigint
Aggregate [country#8], [country#8, count(country#8) AS messages#14L]
+- Filter ((ip#0 = ip#7) && NOT (code#1 = INFO))
   +- Join Inner
      :- SubqueryAlias table1
      :  +- Relation[ip#0,code#1,msg#2] parquet
      +- SubqueryAlias table2
         +- Relation[ip#7,country#8] parquet
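The step from the parsed plan to the analyzed plan replaces unresolved names such as 'table1.ip with attributes that carry unique ids (ip#0). A minimal sketch of that lookup, assuming a toy catalog given as a plain dict: the function names are hypothetical, and the ids are simply counted from zero, so they differ from the ones in the plan above.

```python
from itertools import count

# Toy catalog; the column names are taken from the analyzed plan.
catalog = {"table1": ["ip", "code", "msg"], "table2": ["ip", "country"]}


def resolve_relations(catalog, tables):
    """Give every column of every listed table a unique id: {table: {column: 'column#id'}}."""
    ids = count()
    return {t: {c: f"{c}#{next(ids)}" for c in catalog[t]} for t in tables}


def resolve_attribute(name, relations):
    """Resolve 'table.column' or a bare 'column'; fail on unknown or ambiguous names."""
    if "." in name:
        table, column = name.split(".", 1)
        try:
            return relations[table][column]
        except KeyError:
            raise LookupError(f"cannot resolve {name}") from None
    matches = [cols[name] for cols in relations.values() if name in cols]
    if len(matches) != 1:
        raise LookupError(f"{name} is unknown or ambiguous")
    return matches[0]


relations = resolve_relations(catalog, ["table1", "table2"])
print(resolve_attribute("table1.ip", relations))
print(resolve_attribute("country", relations))
```

A bare `ip` is rejected as ambiguous in this sketch, because both tables have a column with that name.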

== Optimized Logical Plan ==
Aggregate [country#8], [country#8, count(country#8) AS messages#14L]
+- Project [country#8]
   +- Join Inner, (ip#0 = ip#7)
      :- Project [ip#0]
      :  +- Filter ((isnotnull(code#1) && NOT (code#1 = INFO)) && isnotnull(ip#0))
      :     +- Relation[ip#0,code#1,msg#2] parquet
      +- Filter isnotnull(ip#7)
         +- Relation[ip#7, country#8] parquet

== Physical Plan ==
*HashAggregate(
+- Exchange hashpartitioning(country#8, 200)
   +- *HashAggregate
      +- *Project [country#8]
         +- *BroadcastHashJoin [ip#0], [ip#7], Inner, BuildRight
            :- *Project [ip#0]
            :  +- *Filter
            :     +- *FileScan parquet [ip#0,code#1], PushedFilters: [IsNotNull(code), Not(EqualTo(code,INFO)), IsNotNull(ip)]
            +- BroadcastExchange
               +- *Project [ip#7, country#8]
                  +- *Filter isnotnull(ip#7)
                     +- *FileScan parquet [ip#7,country#8], PushedFilters: [IsNotNull(ip)]
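In the physical plan the scan of table1 reads only [ip#0,code#1], although the relation also has msg#2. The idea behind this column pruning can be sketched as follows: collect every column some operator refers to and keep only those in each scan. The `operators` mapping and `required_columns` are hypothetical and hand-written for this query; they are not derived from a real plan.

```python
# Table schemas, as shown in the analyzed plan.
schemas = {"table1": ["ip", "code", "msg"], "table2": ["ip", "country"]}

# Columns each operator of the query refers to, written out by hand.
operators = {
    "aggregate": [("table2", "country")],
    "join": [("table1", "ip"), ("table2", "ip")],
    "filter": [("table1", "code")],
}


def required_columns(operators, schemas):
    """For each table keep only the columns that some operator uses, in schema order."""
    referenced = {ref for refs in operators.values() for ref in refs}
    return {
        table: [column for column in columns if (table, column) in referenced]
        for table, columns in schemas.items()
    }


pruned = required_columns(operators, schemas)
print(pruned)
```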

3. rdd.map(lambda x: x)

10. Catalyst pipeline

20. Example

22. Catalyst pipeline

25. Catalyst pipeline

27. Catalyst pipeline

28. Filter pushdown

30. Catalyst pipeline

32. Column pruning

34. Thank you!
