Welcome
Chicago Data Engineering Meetup
- Our First Event – November 2018
- Objectives
- Every 2 months
- Format
- sharing experiences (open for volunteers)
- new tools / demos
- Open for suggestions
Agenda
01 Who I am
02 QuantumBlack
03 Today’s topic: Spark UDF Performance
04 Background
05 Benchmarking – Live demo
06 Conclusion and Our Approach
07 Q&A
Who I am
01
All content copyright © 2017 QuantumBlack, a McKinsey company
BIOGRAPHY
Guilherme Braccialli
Principal Data Engineer, QuantumBlack, Chicago
Client case studies
Experience across several industry sectors, including telecoms, retail, financial services and pharmaceuticals.
Financial sector – Advanced Analytics projects for fraud detection in Internet banking and credit risk modelling.
Telecommunications – Petabyte-scale environment, delivering several use cases, including real-time failure detection using CDR data, customer profiling and marketing campaigns.
Manufacturing – Data wrangling in a failure detection project for computer parts manufacturing in Europe.
Pharmaceuticals – Site selection optimisation for a top pharma player.
Telematics (car insurance) – Machine learning model that estimates the probability of crashing for each driver, based on data obtained from on-board units installed in cars, covering geo-location, speed and acceleration for ~2 million drivers over a 2-year period. Complex feature creation using terabyte-scale data and external data sources such as weather, street and traffic data.
Education
Guilherme has a BSc in Data Processing from Mackenzie University and specialisations in Machine Learning and Business Intelligence.
Role
Big Data technology expert based in Chicago. Works with clients to translate business hypotheses into data requirements and technology solutions.
Expertise
Provides technical data engineering oversight on projects and advises other data engineers on architecture definition and performance optimization for large-scale data wrangling.
Professional experience
Prior to joining QuantumBlack, Guilherme specialised for over 18 years in Data Warehouse and Business Intelligence projects in large-scale environments. More recently, 6 years' experience in Big Data projects and architecture, many of them at petabyte scale, as well as real-time projects.
Previously led big data projects at Hortonworks, SAP and large financial institutions.
QuantumBlack
02
QB exploit data, analytics and
design to help our clients be the
best they can be
We were born and proven in
Formula One, where the smallest
margins are the difference
between winning and losing and
data has emerged as a
fundamental element of
competitive advantage
QuantumBlack
In elite sport the
smallest edge makes
the difference,
and the best teams
exploit this to outlearn
their rivals
Since then, we have applied our proven
methodology across multiple sectors
• Advanced Industries: Aerospace, Automotive, Semi-Conductors, Urban Infrastructure
• Financial Services: Asset Management, Payment Networks, Private Banking, Retail Banking
• Health & Wellbeing: Hospitals, Medical Devices, Pharmaceutical
• Natural Resources: Oil & Gas, Mining, Renewable Energy, Utilities
• Sports: Basketball, Baseball, Formula One, Soccer
Spark UDF Performance
03
- Share our learnings
- Running Spark at scale
- Practical Examples
- Live demo (code)
Background
04
Why we use Spark
BACKGROUND
• Open Source
‒ We are a consulting company; we enable our clients to use Advanced Analytics
‒ We don’t sell an out-of-the-box solution or licences
‒ Clients can run it anywhere because we use open source tools
• Scalable
‒ We deal with big data volumes
‒ Multiple TBs of data
‒ Spark has several options for running in distributed mode (Hadoop, Kubernetes, Standalone)
• Flexibility and Integration
‒ Supports multiple languages: Python, SQL, Scala, Java, R
‒ Batch, Streaming, Graph, Machine Learning
‒ Easy to integrate with data scientists' code in a single data pipeline
Where we run
BACKGROUND
• In the Cloud
‒ AWS (EMR)
‒ Azure (HDInsight)
‒ Google Cloud (Dataproc)
‒ Databricks (AWS or Azure)
• On-premises
‒ Some clients have an internal Hadoop cluster on premises
Why PySpark / Performance implications
BACKGROUND
• PySpark is the best choice for combining data engineering and data science work in a single data pipeline
• Same performance for DataFrame operations (PySpark is a wrapper that runs native Scala code)
• Performance hit when we use UDFs (each call goes Scala → Python → Scala)
• Pandas UDFs (Vectorized UDFs) + Arrow
‒ Announced in late 2017, shipped in Spark 2.3
https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
‒ but… where are the Scala numbers?
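The difference between the two Python UDF flavours can be sketched as follows. This is a minimal illustration, not the notebook from the live demo: the names `plus_one`, `plus_one_batch` and `time_udfs`, the synthetic column `v` and the row count are all assumptions made for this example. A plain UDF is called once per row, while a Pandas UDF is called once per Arrow batch with a pandas Series.

```python
import time


def plus_one(value):
    """Row-at-a-time logic: a plain PySpark UDF calls this once per row."""
    return value + 1.0


def plus_one_batch(values):
    """Batch logic: a Pandas UDF calls this once per batch with a pandas Series."""
    return values + 1.0


def time_udfs(spark, rows=1_000_000):
    """Time both UDF flavours on a synthetic column (hypothetical helper).

    Returns a dict mapping the variant name to elapsed seconds.
    """
    # Imported here so the pure functions above load without Spark installed.
    from pyspark.sql.functions import col, pandas_udf, udf
    from pyspark.sql.types import DoubleType

    df = spark.range(rows).select(col("id").cast("double").alias("v"))
    variants = {
        "python_udf": udf(plus_one, DoubleType()),
        "pandas_udf": pandas_udf(plus_one_batch, DoubleType()),
    }
    timings = {}
    for name, fn in variants.items():
        start = time.perf_counter()
        # Summing the result forces the UDF to run on every row.
        df.select(fn(col("v")).alias("out")).agg({"out": "sum"}).collect()
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `time_udfs(spark)` on an existing session gives a rough comparison only; the first variant also pays any warm-up cost, so repeated runs are more meaningful than a single one.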
Benchmarking – Live Demo
05
Databricks notebook (try it on the Community Edition)
LIVE DEMO
https://bit.ly/2E4ehIm
Conclusion and Our Approach
06
Best of both worlds: PySpark with Scala performance
CONCLUSION AND OUR APPROACH
• Conclusion
‒ PySpark Pandas UDFs (vectorized UDFs) can be faster than plain PySpark UDFs, but not always
‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs
• Our Approach
‒ We use PySpark UDFs when the data volume is small, or for quick insights on sample data
‒ Built an internal library with re-usable Scala UDFs
‒ Created Python wrappers to call Scala UDFs
‒ Demo
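One way a Python wrapper around a Scala UDF could look is sketched below. It is not the internal library mentioned above. It assumes the Scala code is compiled into a class implementing Spark's Java `UDF1` interface and shipped to the cluster in a jar. The class name `com.example.udfs.PlusOne`, the SQL name `plus_one` and all helper names are hypothetical. Registration uses the public `spark.udf.registerJavaFunction` call.

```python
def scala_udf_sql(name, *column_names):
    """Build the SQL expression that invokes a registered UDF on the given columns."""
    quoted = ", ".join("`{}`".format(c) for c in column_names)
    return "{}({})".format(name, quoted)


def register_scala_udf(spark, name, class_name):
    """Register a JVM UDF class under a SQL function name (hypothetical helper).

    Assumes the class returns a double and is already on the cluster classpath.
    """
    # Imported here so this module loads without Spark installed.
    from pyspark.sql.types import DoubleType

    spark.udf.registerJavaFunction(name, class_name, DoubleType())


def plus_one(column_name):
    """Python-side wrapper: returns a Column that evaluates the UDF on the JVM."""
    from pyspark.sql.functions import expr

    return expr(scala_udf_sql("plus_one", column_name))


# Usage, assuming a running session and a jar containing the hypothetical class:
#   register_scala_udf(spark, "plus_one", "com.example.udfs.PlusOne")
#   df.select(plus_one("v").alias("out"))
```

Because the wrapper only builds a column expression, the rows are processed on the JVM and are not serialised to Python workers.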
Q&A
07
Thank you!
- Would you like to share your
experiences at upcoming events?
and…
- We are hiring!!!

 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 

Meetup Spark UDF performance

  • 1. Welcome Chicago Data Engineering Meetup - Our First Event – November 2018 - Objectives - Every 2 months - Format - sharing experiences (open for volunteers) - new tools / demos - Open for suggestions
  • 2. 01 Who I am 02 QuantumBlack 03 Today’s topic: Spark UDF Performance 04 Background 05 Benchmarking – Live demo 06 Conclusion and Our Approach 07 Q&A Agenda
  • 4. All content copyright © 2017 QuantumBlack, a McKinsey company. Client case studies: Experience across several industry sectors, including telecoms, retail, financial services and pharmaceuticals. Financial sector – Advanced Analytics projects for fraud detection in Internet banking and credit risk modelling. Telecommunications – Petabyte-scale environment, delivering several use cases, including real-time failure detection using CDR data, customer profiling and marketing campaigns. Manufacturing – Data wrangling in a failure detection project for computer parts manufacturing in Europe. Pharmaceuticals – Site selection optimisation for a top pharma player. Telematics (car insurance) – Machine learning model that estimates the probability of crashing for each driver, based on data from on-board units installed in cars containing geolocation positions, speed and acceleration of ~2 million drivers over a 2-year period. Complex feature creation at terabyte scale using external data sources such as weather, street and traffic data. Education: Guilherme has a BSc in Data Processing from Mackenzie University and specialisations in Machine Learning and Business Intelligence. Role: Big Data technology expert based in Chicago. Works with clients to translate business hypotheses into data requirements and technology solutions. Expertise: Provides technical data engineering oversight on projects and advises other data engineers on architecture definition and performance optimization for large-scale data wrangling. Professional experience: Prior to joining QuantumBlack, Guilherme specialised for over 18 years in Data Warehouse and Business Intelligence projects in large-scale environments. More recently, 6 years' experience in Big Data projects and architecture, many of them at petabyte scale, as well as real-time projects. Previously led big data projects at Hortonworks, SAP and large financial institutions. BIOGRAPHY: Guilherme Braccialli, Principal Data Engineer, QuantumBlack, Chicago
  • 6. QuantumBlack: QB exploit data, analytics and design to help our clients be the best they can be. We were born and proven in Formula One, where the smallest margins are the difference between winning and losing and data has emerged as a fundamental element of competitive advantage.
  • 7. In elite sport the smallest edge makes the difference, and the best teams exploit this to outlearn their rivals
  • 8. Since then, we have applied our proven methodology across multiple sectors. Advanced Industries: Aerospace, Automotive, Semi-Conductors, Urban Infrastructure. Financial Services: Asset Management, Payment Networks, Private Banking, Retail Banking. Health & Wellbeing: Hospitals, Medical Devices, Pharmaceutical. Natural Resources: Oil & Gas, Mining, Renewable Energy, Utilities. Sports: Basketball, Baseball, Formula One, Soccer.
  • 9. 03 Spark UDF Performance - Share our learnings - Running Spark at scale - Practical examples - Live demo (code)
  • 11. BACKGROUND: Why we use Spark • Open Source ‒ We are a consulting company; we enable our clients to use Advanced Analytics ‒ We don’t sell an out-of-the-box solution or licensing ‒ Clients can run it anywhere because we use open source tools • Scalable ‒ We deal with big data volumes ‒ Multiple TBs of data ‒ Spark has several options for running in distributed mode (Hadoop, Kubernetes, Standalone) • Flexibility and Integration ‒ Supports multiple languages: Python, SQL, Scala, Java, R ‒ Batch, Streaming, Graph, Machine Learning ‒ Easy to integrate with Data Scientist code in a single data pipeline
  • 12. BACKGROUND: Where we run • In the Cloud ‒ AWS (EMR) ‒ Azure (HDInsight) ‒ Google Cloud (Dataproc) ‒ Databricks (AWS or Azure) • On-premises ‒ Some clients have their own internal Hadoop cluster on premises
  • 13. BACKGROUND: Why PySpark / Performance implications • PySpark is the best choice for integrating Data Engineering and Data Science work in one data pipeline • Same performance as Scala for DataFrame operations (PySpark is a wrapper that runs native Scala code) • Performance hit when we use a Python UDF (each call goes Scala - Python - Scala, so data is serialized across the JVM boundary in both directions) • Pandas UDFs (Vectorized UDFs) + Arrow ‒ Announced in late 2017, shipped in Spark 2.3 https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/ https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html ‒ but… where are the Scala numbers?
  • 15. LIVE DEMO: Databricks Notebook (try it on the Community version) https://bit.ly/2E4ehIm
  • 16. 06 Conclusion and Our Approach
  • 17. CONCLUSION AND OUR APPROACH: Best of both worlds: PySpark with Scala performance • Conclusion ‒ PySpark Pandas UDFs (Vectorized UDFs) can be faster than regular PySpark UDFs, but not always ‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs • Our Approach ‒ We use PySpark UDFs when the data volume is not big, or for quick insights on sample data ‒ Built an internal library of reusable Scala UDFs ‒ Created Python wrappers to call the Scala UDFs ‒ Demo
  • 19. Thank you! - Would you like to share your experiences at upcoming events? and… - We are hiring!!!
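
The options compared on slides 13 and 17 can be sketched roughly as follows. This is an illustrative sketch, not the code from the live demo or from QuantumBlack's internal library. It assumes PySpark 3.x with pandas and pyarrow installed. The names `fahrenheit_to_celsius`, `register_scala_udf` and the JVM class `com.example.udfs.ToCelsius` are hypothetical and exist only for this example. The only Spark-specific call in the wrapper is `spark.udf.registerJavaFunction`, which is available from Spark 2.3 onward.

```python
def fahrenheit_to_celsius(temp_f):
    """Shared logic; works on a single float or on a whole pandas Series."""
    return (temp_f - 32.0) * 5.0 / 9.0


def register_scala_udf(spark, sql_name, jvm_class, return_type=None):
    """Register a JVM (Scala/Java) UDF and return a helper that builds SQL strings.

    jvm_class is a HYPOTHETICAL class name. The class must be on the cluster
    classpath (for example via --jars) and implement one of the
    org.apache.spark.sql.api.java.UDF1..UDF22 interfaces.
    """
    spark.udf.registerJavaFunction(sql_name, jvm_class, return_type)

    def call(*column_names):
        # Spark SQL evaluates this string, so the UDF body runs inside the JVM.
        return "{}({})".format(sql_name, ", ".join(column_names))

    return call


def run_demo():
    """Compute the same column three ways on a small local DataFrame."""
    # Local imports: the helpers above stay importable without Spark installed.
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf, udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.master("local[2]").appName("udf-demo").getOrCreate()
    df = spark.range(0, 1000).withColumn("temp_f", col("id").cast("double"))

    # Regular Python UDF: the Python function is called once per row.
    row_udf = udf(fahrenheit_to_celsius, DoubleType())

    # Pandas (vectorized) UDF: called once per Arrow batch with a pandas Series.
    @pandas_udf("double")
    def vec_udf(temp_f: pd.Series) -> pd.Series:
        return fahrenheit_to_celsius(temp_f)

    result = df.select(
        "temp_f",
        row_udf("temp_f").alias("row_c"),
        vec_udf("temp_f").alias("vec_c"),
        # Built-in column expression: no Python round trip at all.
        ((col("temp_f") - 32.0) * 5.0 / 9.0).alias("native_c"),
    )
    rows = result.collect()
    spark.stop()
    return rows
```

Calling `run_demo()` exercises the two Python UDF flavours alongside a built-in expression; timing each column separately on larger data is what a benchmark would add. With a suitable JAR on the classpath, the wrapper would be used along the lines of `to_c = register_scala_udf(spark, "to_celsius", "com.example.udfs.ToCelsius", DoubleType())` followed by `df.selectExpr("temp_f", to_c("temp_f") + " AS scala_c")`.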