SlideShare a Scribd company logo

Big data made easy with a Spark

Big data made easy with a Spark is the presentation I did for Opensource 101 in Columbia, SC, on April 18th 2019. It is a hands-on tutorial on Apache Spark with Java, walking through 3 different labs.

1 of 79
Big Data made
easy with a
Open Source 101
Columbia, SC
April 18th 2019
Jean Georges Perrin
Software since 1983
Big Data since 1984

x11
@jgperrin • http://jgp.net [blog]
News…
๏ Director of Engineering for WeExperience
๏ Hiring a team of talented engineers to work with us
๏ Front end
๏ Mobile
๏ Back end & data
๏ AI
๏ Shoot at @jgperrin
Big data made easy with a Spark
Caution
Hands-on tutorial
Tons of content
Unknown crowd
Unknown setting
Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
Ad

Recommended

A Backpack to go the Extra-Functional Mile (a hitched hike by the PROWESS pro...
A Backpack to go the Extra-Functional Mile (a hitched hike by the PROWESS pro...A Backpack to go the Extra-Functional Mile (a hitched hike by the PROWESS pro...
A Backpack to go the Extra-Functional Mile (a hitched hike by the PROWESS pro...Laura M. Castro
 
In graph we trust: Microservices, GraphQL and security challenges
In graph we trust: Microservices, GraphQL and security challengesIn graph we trust: Microservices, GraphQL and security challenges
In graph we trust: Microservices, GraphQL and security challengesMohammed A. Imran
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
4Developers 2015: Lessons for Erlang VM - Michał Ślaski
4Developers 2015: Lessons for Erlang VM - Michał Ślaski4Developers 2015: Lessons for Erlang VM - Michał Ślaski
4Developers 2015: Lessons for Erlang VM - Michał ŚlaskiPROIDEA
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...Databricks
 
System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web ApplicationMichael Choi
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWork-Bench
 

More Related Content

What's hot

Which Freaking Database Should I Use?
Which Freaking Database Should I Use?Which Freaking Database Should I Use?
Which Freaking Database Should I Use?Great Wide Open
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Python Raster Function - Esri Developer Conference - 2015
Python Raster Function - Esri Developer Conference - 2015Python Raster Function - Esri Developer Conference - 2015
Python Raster Function - Esri Developer Conference - 2015akferoz07
 
DESIGN West 2013 Presentation: Accelerating Android Development and Delivery
DESIGN West 2013 Presentation: Accelerating Android Development and DeliveryDESIGN West 2013 Presentation: Accelerating Android Development and Delivery
DESIGN West 2013 Presentation: Accelerating Android Development and DeliveryDavid Rosen
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Andrei KUCHARAVY
 
De-mystifying contributing to PostgreSQL
De-mystifying contributing to PostgreSQLDe-mystifying contributing to PostgreSQL
De-mystifying contributing to PostgreSQLLætitia Avrot
 
DevSecCon Singapore 2018 - Remove developers’ shameful secrets or simply rem...
DevSecCon Singapore 2018 -  Remove developers’ shameful secrets or simply rem...DevSecCon Singapore 2018 -  Remove developers’ shameful secrets or simply rem...
DevSecCon Singapore 2018 - Remove developers’ shameful secrets or simply rem...DevSecCon
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard PafkaH2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard PafkaSri Ambati
 
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...NETWAYS
 
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Ted Drake
 
Introduction to Web Development with Ruby on Rails
Introduction to Web Development with Ruby on RailsIntroduction to Web Development with Ruby on Rails
Introduction to Web Development with Ruby on Railspmatsinopoulos
 
From NASA to Startups to Big Commerce
From NASA to Startups to Big CommerceFrom NASA to Startups to Big Commerce
From NASA to Startups to Big CommerceDaniel Greenfeld
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...Holden Karau
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupHolden Karau
 
Enterprise Search Europe 2015: Fishing the big data streams - the future of ...
Enterprise Search Europe 2015:  Fishing the big data streams - the future of ...Enterprise Search Europe 2015:  Fishing the big data streams - the future of ...
Enterprise Search Europe 2015: Fishing the big data streams - the future of ...Charlie Hull
 
Contributing to Apache Spark 3
Contributing to Apache Spark 3Contributing to Apache Spark 3
Contributing to Apache Spark 3Holden Karau
 

What's hot (20)

Which Freaking Database Should I Use?
Which Freaking Database Should I Use?Which Freaking Database Should I Use?
Which Freaking Database Should I Use?
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Python Raster Function - Esri Developer Conference - 2015
Python Raster Function - Esri Developer Conference - 2015Python Raster Function - Esri Developer Conference - 2015
Python Raster Function - Esri Developer Conference - 2015
 
DESIGN West 2013 Presentation: Accelerating Android Development and Delivery
DESIGN West 2013 Presentation: Accelerating Android Development and DeliveryDESIGN West 2013 Presentation: Accelerating Android Development and Delivery
DESIGN West 2013 Presentation: Accelerating Android Development and Delivery
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1
 
De-mystifying contributing to PostgreSQL
De-mystifying contributing to PostgreSQLDe-mystifying contributing to PostgreSQL
De-mystifying contributing to PostgreSQL
 
DevSecCon Singapore 2018 - Remove developers’ shameful secrets or simply rem...
DevSecCon Singapore 2018 -  Remove developers’ shameful secrets or simply rem...DevSecCon Singapore 2018 -  Remove developers’ shameful secrets or simply rem...
DevSecCon Singapore 2018 - Remove developers’ shameful secrets or simply rem...
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard PafkaH2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
 
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab...
 
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
 
Introduction to Web Development with Ruby on Rails
Introduction to Web Development with Ruby on RailsIntroduction to Web Development with Ruby on Rails
Introduction to Web Development with Ruby on Rails
 
From NASA to Startups to Big Commerce
From NASA to Startups to Big CommerceFrom NASA to Startups to Big Commerce
From NASA to Startups to Big Commerce
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018
 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March Meetup
 
Enterprise Search Europe 2015: Fishing the big data streams - the future of ...
Enterprise Search Europe 2015:  Fishing the big data streams - the future of ...Enterprise Search Europe 2015:  Fishing the big data streams - the future of ...
Enterprise Search Europe 2015: Fishing the big data streams - the future of ...
 
Deployments in one click!
Deployments in one click!Deployments in one click!
Deployments in one click!
 
Contributing to Apache Spark 3
Contributing to Apache Spark 3Contributing to Apache Spark 3
Contributing to Apache Spark 3
 
Apache Toree
Apache ToreeApache Toree
Apache Toree
 

Similar to Big data made easy with a Spark

Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Codemotion
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Jean-Georges Perrin
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneDatabricks
 
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...it-people
 
Overview of Modern Graph Analysis Tools
Overview of Modern Graph Analysis ToolsOverview of Modern Graph Analysis Tools
Overview of Modern Graph Analysis ToolsKeiichiro Ono
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning InfrastructureSigOpt
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programsgreenwop
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Project Flogo: An Event-Driven Stack for the Enterprise
Project Flogo: An Event-Driven Stack for the EnterpriseProject Flogo: An Event-Driven Stack for the Enterprise
Project Flogo: An Event-Driven Stack for the EnterpriseLeon Stigter
 
Introduction to Google App Engine with Python
Introduction to Google App Engine with PythonIntroduction to Google App Engine with Python
Introduction to Google App Engine with PythonBrian Lyttle
 
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysHacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Performance #5 cpu and battery
Performance #5  cpu and batteryPerformance #5  cpu and battery
Performance #5 cpu and batteryVitali Pekelis
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaLINE Corporation
 
Design for X: Exploring Product Design with Apache Spark and GraphLab
Design for X: Exploring Product Design with Apache Spark and GraphLabDesign for X: Exploring Product Design with Apache Spark and GraphLab
Design for X: Exploring Product Design with Apache Spark and GraphLabAmanda Casari
 

Similar to Big data made easy with a Spark (20)

Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
 
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
 
Overview of Modern Graph Analysis Tools
Overview of Modern Graph Analysis ToolsOverview of Modern Graph Analysis Tools
Overview of Modern Graph Analysis Tools
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programs
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Project Flogo: An Event-Driven Stack for the Enterprise
Project Flogo: An Event-Driven Stack for the EnterpriseProject Flogo: An Event-Driven Stack for the Enterprise
Project Flogo: An Event-Driven Stack for the Enterprise
 
Introduction to Google App Engine with Python
Introduction to Google App Engine with PythonIntroduction to Google App Engine with Python
Introduction to Google App Engine with Python
 
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysHacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
 
Performance #5 cpu and battery
Performance #5  cpu and batteryPerformance #5  cpu and battery
Performance #5 cpu and battery
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE Fukuoka
 
Design for X: Exploring Product Design with Apache Spark and GraphLab
Design for X: Exploring Product Design with Apache Spark and GraphLabDesign for X: Exploring Product Design with Apache Spark and GraphLab
Design for X: Exploring Product Design with Apache Spark and GraphLab
 

More from Jean-Georges Perrin

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the worldJean-Georges Perrin
 
The road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsThe road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsJean-Georges Perrin
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunityJean-Georges Perrin
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMJean-Georges Perrin
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)Jean-Georges Perrin
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...Jean-Georges Perrin
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseJean-Georges Perrin
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applicationsJean-Georges Perrin
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and servicesJean-Georges Perrin
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & servicesJean-Georges Perrin
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)Jean-Georges Perrin
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryJean-Georges Perrin
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryJean-Georges Perrin
 
Présentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysPrésentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysJean-Georges Perrin
 
Tendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoTendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoJean-Georges Perrin
 

More from Jean-Georges Perrin (20)

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the world
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Why i love Apache Spark?
Why i love Apache Spark?Why i love Apache Spark?
Why i love Apache Spark?
 
The road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsThe road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentions
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the Community
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASM
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applications
 
Vendre des produits techniques
Vendre des produits techniquesVendre des produits techniques
Vendre des produits techniques
 
Vendre plus sur le web
Vendre plus sur le webVendre plus sur le web
Vendre plus sur le web
 
Vendre plus sur le Web
Vendre plus sur le WebVendre plus sur le Web
Vendre plus sur le Web
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and services
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & services
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - Greenivory
 
Présentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysPrésentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT Days
 
Tendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoTendances Web 2011 San Francicsco
Tendances Web 2011 San Francicsco
 

Recently uploaded

AWS Identity and access management for users
AWS Identity and access management for usersAWS Identity and access management for users
AWS Identity and access management for usersStephenEfange3
 
SABARI PRIYAN's self introduction as reference
SABARI PRIYAN's self introduction as referenceSABARI PRIYAN's self introduction as reference
SABARI PRIYAN's self introduction as referencepriyansabari355
 
Industry 4.0 in IoT Transforming the Future.pptx
Industry 4.0 in IoT Transforming the Future.pptxIndustry 4.0 in IoT Transforming the Future.pptx
Industry 4.0 in IoT Transforming the Future.pptxMdRafiqulIslam403212
 
Soil Health Policy Map Years 2020 to 2023
Soil Health Policy Map Years 2020 to 2023Soil Health Policy Map Years 2020 to 2023
Soil Health Policy Map Years 2020 to 2023stephizcoolio
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)UNCResearchHub
 
Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...
Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...
Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...Cyber Security Experts
 
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfIIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfAustraliaChapterIIBA
 
Operations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensOperations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensKondapi V Siva Rama Brahmam
 
Tips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsTips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsDataArchiva
 
data analytics and tools from in2inglobal.pdf
data analytics  and tools from in2inglobal.pdfdata analytics  and tools from in2inglobal.pdf
data analytics and tools from in2inglobal.pdfdigimartfamily
 
Lies and Myths in InfoSec - 2023 Usenix Enigma
Lies and Myths in InfoSec - 2023 Usenix EnigmaLies and Myths in InfoSec - 2023 Usenix Enigma
Lies and Myths in InfoSec - 2023 Usenix EnigmaAdrian Sanabria
 
Big Data - large Scale data (Amazon, FB)
Big Data - large Scale data (Amazon, FB)Big Data - large Scale data (Amazon, FB)
Big Data - large Scale data (Amazon, FB)CUO VEERANAN VEERANAN
 
ppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxHizkiaJastis
 
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Thibaud Le Douarin
 
SABARI PRIYAN's self introduction as a reference
SABARI PRIYAN's self introduction as a referenceSABARI PRIYAN's self introduction as a reference
SABARI PRIYAN's self introduction as a referencepriyansabari355
 
Artificial Intelligence and its Impact on Society.pptx
Artificial Intelligence and its Impact on Society.pptxArtificial Intelligence and its Impact on Society.pptx
Artificial Intelligence and its Impact on Society.pptxVighnesh Shashtri
 
What is the value of your Data v3.0.pptx
What is the value of your Data v3.0.pptxWhat is the value of your Data v3.0.pptx
What is the value of your Data v3.0.pptxJose Briones
 

Recently uploaded (18)

AWS Identity and access management for users
AWS Identity and access management for usersAWS Identity and access management for users
AWS Identity and access management for users
 
SABARI PRIYAN's self introduction as reference
SABARI PRIYAN's self introduction as referenceSABARI PRIYAN's self introduction as reference
SABARI PRIYAN's self introduction as reference
 
Industry 4.0 in IoT Transforming the Future.pptx
Industry 4.0 in IoT Transforming the Future.pptxIndustry 4.0 in IoT Transforming the Future.pptx
Industry 4.0 in IoT Transforming the Future.pptx
 
Soil Health Policy Map Years 2020 to 2023
Soil Health Policy Map Years 2020 to 2023Soil Health Policy Map Years 2020 to 2023
Soil Health Policy Map Years 2020 to 2023
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)
 
Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...
Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...
Web 3.0 in Data Privacy and Security | Data Privacy |Blockchain Security| Cyb...
 
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfIIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
 
Operations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensOperations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample Screens
 
Tips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsTips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data Goals
 
data analytics and tools from in2inglobal.pdf
data analytics  and tools from in2inglobal.pdfdata analytics  and tools from in2inglobal.pdf
data analytics and tools from in2inglobal.pdf
 
Lies and Myths in InfoSec - 2023 Usenix Enigma
Lies and Myths in InfoSec - 2023 Usenix EnigmaLies and Myths in InfoSec - 2023 Usenix Enigma
Lies and Myths in InfoSec - 2023 Usenix Enigma
 
Big Data - large Scale data (Amazon, FB)
Big Data - large Scale data (Amazon, FB)Big Data - large Scale data (Amazon, FB)
Big Data - large Scale data (Amazon, FB)
 
ppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptx
 
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
 
Electricity Year 2023_updated_22022024.pptx
Electricity Year 2023_updated_22022024.pptxElectricity Year 2023_updated_22022024.pptx
Electricity Year 2023_updated_22022024.pptx
 
SABARI PRIYAN's self introduction as a reference
SABARI PRIYAN's self introduction as a referenceSABARI PRIYAN's self introduction as a reference
SABARI PRIYAN's self introduction as a reference
 
Artificial Intelligence and its Impact on Society.pptx
Artificial Intelligence and its Impact on Society.pptxArtificial Intelligence and its Impact on Society.pptx
Artificial Intelligence and its Impact on Society.pptx
 
What is the value of your Data v3.0.pptx
What is the value of your Data v3.0.pptxWhat is the value of your Data v3.0.pptx
What is the value of your Data v3.0.pptx
 

Big data made easy with a Spark

  • 1. Big Data made easy with a Open Source 101 Columbia, SC April 18th 2019
  • 2. Jean Georges Perrin Software since 1983 Big Data since 1984
 x11 @jgperrin • http://jgp.net [blog]
  • 3. News… ๏ Director of Engineering for WeExperience ๏ Hiring a team of talented engineers to work with us ๏ Front end ๏ Mobile ๏ Back end & data ๏ AI ๏ Shoot at @jgperrin
  • 5. Caution Hands-on tutorial Tons of content Unknown crowd Unknown setting
  • 6. Get all the S T U F F ๏ Go to http://jgp.net/ato2018 ๏ Install the software ๏ Access the source code
  • 7. Who are thou? ๏ Experience with Spark? ๏ Experience with Hadoop? ๏ Experience with Scala? ๏ Java? ๏ PHP guru? ๏ Front-end developer?
  • 8. But most importantly… ๏ … who is not a developer?
  • 9. ๏ What is Big Data? ๏ What is. ? ๏ What can I do with. ? ๏ What is a app, anyway? ๏ Install a bunch of software ๏ A first example ๏ Understand what just happened ๏ Another example, slightly more complex, because you are now ready ๏ But now, sincerely what just happened? ๏ Let’s do AI! ๏ Going further Agenda
  • 10. 3 V4 5 Biiiiiiiig Data ๏ volume ๏ variety ๏ velocity ๏ variability ๏ value Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
  • 11. Data is considered big when they need more than one computer to be processed Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
  • 13. Apps Analytics Distrib. An analytics operating system? Hardware OS Apps HardwareHardware OS OS Distributed OS Analytics OS Apps HardwareHardware OS OS
  • 14. An analytics operating system? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  • 15. An analytics operating system? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  • 16. Some use cases ๏ NCEatery.com ๏ Restaurant analytics ๏ 1.57×10^21 datapoints analyzed ๏ Lumeris ๏ General compute ๏ Distributed data transfer/pipeline ๏ CERN ๏ Analysis of the science experiments in the LHC - Large Hadron Collider ๏ IBM ๏ Watson Data Studio ๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/ ๏ And much more…
  • 17. What a typical app looks like? Connect to the cluster Load Data Do something with the data Share the results
  • 20. Get all the S T U F F ๏ Go to http://jgp.net/ato2018 ๏ Install the software ๏ Access the source code
  • 21. Download some tools ๏ Java JDK 1.8 ๏ http://bit.ly/javadk8 ๏ Eclipse Oxygen or later ๏ http://bit.ly/eclipseo2 ๏ Other nice to have ๏ Maven ๏ SourceTree or git (command line) http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html http://www.eclipse.org/downloads/eclipse-packages/
  • 22. Aren’t you glad we are using Java?
  • 23. Lab #1 - ingestion
  • 24. Lab #1 - ingestion ๏ Goal
 In a Big Data project, ingestion is the first operation. You get the data “in.” ๏ Source code
 https://github.com/jgperrin/ net.jgp.books.spark.ch01
  • 25. Getting deeper ๏ Go to net.jgp.books.spark.ch01 ๏ Open CsvToDataframeApp.java ๏ Right click, Run As, Java Application
  • 26. +---+--------+--------------------+-----------+--------------------+ | id|authorId| title|releaseDate| link| +---+--------+--------------------+-----------+--------------------+ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...| | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...| | 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...| | 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...| | 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...| +---+--------+--------------------+-----------+--------------------+ only showing top 5 rows
  • 27. package net.jgp.books.sparkWithJava.ch01; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; public class CsvToDataframeApp { public static void main(String[] args) { CsvToDataframeApp app = new CsvToDataframeApp(); app.start(); } private void start() { // Creates a session on a local master SparkSession spark = SparkSession.builder() .appName("CSV to Dataset") .master("local") .getOrCreate(); // Reads a CSV file with header, called books.csv, stores it in a dataframe Dataset<Row> df = spark.read().format("csv") .option("header", "true") .load("data/books.csv"); // Shows at most 5 rows from the dataframe df.show(5); } } /jgperrin/net.jgp.books.sparkWithJava.ch01
  • 28. So what happened? Let’s try to understand a little more
  • 30. Node 1 - OS Node 2 - OS Node 3 - OS Node 4 - OS Node 1 - HW Node 2 - HW Node 3 - HW Node 4 - HW Spark SQL Spark streaming Machine learning & deep learning & artificial intelligence GraphX Node 5 - OS Node 5 - HW Your application … … Unified API Node 6 - OS Node 6 - HW Node 7 - OS Node 7 - HW Node 8 - OS Node 8 - HW
  • 31. Spark SQL Spark streaming Machine learning & deep learning & artificial intelligence GraphX Your application Dataframe Node 1 - OS Node 2 - OS Node 3 - OS Node 4 - OS Node 5 - OS … Node 6 - OS Node 7 - OS Node 8 - OS Unified API
  • 32. Title Text Spark SQL Spark streaming Machine learning & deep learning & artificial intelligence GraphX Dataframe
  • 33. Lab #2 - a bit of analytics But really just a bit
  • 34. Lab #2 - a little bit of analytics ๏ Goal
 From two datasets, one containing books, the other authors, list the authors with most books, by number of books ๏ Source code
 https://github.com/jgperrin/net.jgp.labs.spark
  • 35. If it was in a relational database books.csv authors.csv id: integer name: string link: string wikipedia: string id: integer authorId: integer title: string releaseDate: string link: string
  • 36. Basic analytics ๏ Go to net.jgp.labs.spark.l200_join.l030_count_books ๏ Open AuthorsAndBooksCountBooksApp.java ๏ Right click, Run As, Java Application
  • 37. +---+-------------------+--------------------+-----+ | id| name| link|count| +---+-------------------+--------------------+-----+ | 1| J. K. Rowling|http://amzn.to/2l...| 4| | 12|William Shakespeare|http://amzn.to/2j...| 3| | 4| Denis Diderot|http://amzn.to/2i...| 2| | 6| Craig Walls|http://amzn.to/2A...| 2| | 2|Jean Georges Perrin|http://amzn.to/2w...| 2| | 3| Mark Twain|http://amzn.to/2v...| 2| | 11| Alan Mycroft|http://amzn.to/2A...| 1| | 10| Mario Fusco|http://amzn.to/2A...| 1| … +---+-------------------+--------------------+-----+ root |-- id: integer (nullable = true) |-- name: string (nullable = true) |-- link: string (nullable = true) |-- count: long (nullable = false)
  • 38. package net.jgp.labs.spark.l200_join.l030_count_books; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; public class AuthorsAndBooksCountBooksApp { public static void main(String[] args) { AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp(); app.start(); } private void start() { SparkSession spark = SparkSession.builder() .appName("Authors and Books") .master("local").getOrCreate(); String filename = "data/authors.csv"; Dataset<Row> authorsDf = spark.read() .format("csv") .option("inferSchema", "true") .option("header", "true") .load(filename); /jgperrin/net.jgp.labs.spark
  • 39. filename = "data/books.csv"; Dataset<Row> booksDf = spark.read() .format("csv") .option("inferSchema", "true") .option("header", "true") .load(filename); Dataset<Row> libraryDf = authorsDf .join( booksDf, authorsDf.col("id").equalTo(booksDf.col("authorId")), "left") .withColumn("bookId", booksDf.col("id")) .drop(booksDf.col("id")) .groupBy( authorsDf.col("id"), authorsDf.col("name"), authorsDf.col("link")) .count(); libraryDf = libraryDf .orderBy(libraryDf.col("count").desc()); libraryDf.show(); libraryDf.printSchema(); } } /jgperrin/net.jgp.labs.spark
  • 40. The art of delegating
  • 41. Slave (Worker) Driver Master Cluster Manager Slave (Worker) Your app Executor Task Task Executor Task Task
  • 42. Lab #3 - an even smaller bit of AI But really just a bit
  • 44. Popular beliefs ๏ Robot with human-like behavior ๏ HAL from 2001 ๏ Isaac Asimov ๏ Potential ethic problems General AI Narrow AI ๏ Lots of mathematics ๏ Heavy calculations ๏ Algorithms ๏ Self-driving cars Current state-of-the-art
  • 45. Title Text I am an expert in general AI ARTIFICIAL INTELLIGENCE is Machine Learning
  • 46. ๏ Common algorithms ๏Linear and logistic regressions ๏Classification and regression trees ๏K-nearest neighbors (KNN) ๏Deep learning ๏Subset of ML ๏Artificial neural networks (ANNs) ๏Super CPU intensive, use of GPU Machine learning
  • 47. There are two kinds of data scientists: 1) Those who can extrapolate from incomplete data.
  • 48. Title TextDATA Engineer DATA Scientist Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps. Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations. Match architecture with business needs. Develop processes for data modeling, mining, and pipelines. Improve data reliability and quality. Prepare data for predictive models. Explore data to find hidden gems and patterns. Tells stories to key stakeholders.
  • 49. Title Text Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer DATA Engineer DATA Scientist SQL
  • 50. All over again As goes the old adage: Garbage In, Garbage Out xkcd
  • 51. Lab #3 - correcting and extrapolating data
  • 52. Lab #3 - projecting data ๏ Goal
 As a restaurant manager, I want to predict how much revenue will bring a party of 40 ๏ Source code
 https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
  • 53. If everything was as simple… Dinner revenue per number of guests
  • 54. …as a visual representation Anomaly #1 Anomaly #2
  • 55. I love it when a plan comes together
  • 56. Load & Format +-----+-----+ |guest|price| +-----+-----+ | 1| 23.1| | 2| 30.0| … +-----+-----+ only showing top 20 rows ---- 1st DQ rule +-----+-----+------------+ |guest|price|price_no_min| +-----+-----+------------+ | 1| 23.1| 23.1| | 2| 30.0| 30.0| … | 25| 3.0| -1.0| | 26| 10.0| -1.0| … +-----+-----+------------+ … +-----+-----+-----+--------+ |guest|price|label|features| +-----+-----+-----+--------+ | 1| 23.1| 23.1| [1.0]| | 2| 30.0| 30.0| [2.0]| … +-----+-----+-----+--------+ only showing top 20 rows … RMSE: 2.802192495300457 r2: 0.9965340953376102 Intersection: 20.979190460591575 Regression parameter: 1.0 Tol: 1.0E-6 Prediction for 40.0 guests is 218.00351106373822
  • 57. Using existing data quality rules package net.jgp.labs.sparkdq4ml.dq.udf; 
 import org.apache.spark.sql.api.java.UDF1; import net.jgp.labs.sparkdq4ml.dq.service.*; 
 public class MinimumPriceDataQualityUdf implements UDF1< Double, Double > { public Double call(Double price) throws Exception { return MinimumPriceDataQualityService.checkMinimumPrice(price); } } /jgperrin/net.jgp.labs.sparkdq4ml If price is ok, returns price, if price is ko, returns -1
  • 58. Telling Spark to use my DQ rules SparkSession spark = SparkSession.builder() .appName("DQ4ML").master("local").getOrCreate(); spark.udf().register( "minimumPriceRule", new MinimumPriceDataQualityUdf(), DataTypes.DoubleType); spark.udf().register( "priceCorrelationRule", new PriceCorrelationDataQualityUdf(), DataTypes.DoubleType); /jgperrin/net.jgp.labs.sparkdq4ml
  • 59. Loading my dataset String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); Using CSV, but could be Hive, JDBC, name it… /jgperrin/net.jgp.labs.sparkdq4ml
  • 60. +-----+-----+ |guest|price| +-----+-----+ |   1|23.24| |    2|30.89| |    2|33.74| |    3|34.89| |    3|29.91| |    3| 38.0| |    4| 40.0| |    5|120.0| |    6| 50.0| |    6|112.0| |    8| 60.0| |    8|127.0| |    8|120.0| |    9|130.0| +-----+-----+ Raw data, contains the anomalies
  • 61. Apply the rules String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  • 62. +-----+-----+------------+ |guest|price|price_no_min| +-----+-----+------------+ |    1| 23.1|        23.1| |    2| 30.0|        30.0| |    2| 33.0|        33.0| |    3| 34.0|        34.0| |   24|142.0|       142.0| |   24|138.0|       138.0| |   25|  3.0|        -1.0| |   26| 10.0|        -1.0| |   25| 15.0|        -1.0| |   26|  4.0|        -1.0| |   28| 10.0|        -1.0| |   28|158.0|       158.0| |   30|170.0|       170.0| |   31|180.0|       180.0| +-----+-----+------------+ Anomalies are clearly identified by -1, so they can be easily filtered
  • 63. Filtering out anomalies String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  • 64. +-----+-----+ |guest|price| +-----+-----+ |    1| 23.1| |    2| 30.0| |    2| 33.0| |    3| 34.0| |    3| 30.0| |    4| 40.0| |   19|110.0| |   20|120.0| |   22|131.0| |   24|142.0| |   24|138.0| |   28|158.0| |   30|170.0| |   31|180.0| +-----+-----+ Useable data
  • 65. Format the data for ML ๏ Convert/Adapt dataset to Features and Label ๏ Required for Linear Regression in MLlib ๏Needs a column called label of type double ๏Needs a column called features of type VectorUDT
  • 66. Format the data for ML spark.udf().register( "vectorBuilder", new VectorBuilder(), new VectorUDT()); df = df.withColumn("label", df.col("price")); df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest"))); 
 // ... Lots of complex ML code goes here ... double p = model.predict(features); System.out.println("Prediction for " + feature + " guests is " + p); /jgperrin/net.jgp.labs.sparkdq4ml
  • 67. +-----+-----+-----+--------+------------------+ |guest|price|label|features|        prediction| +-----+-----+-----+--------+------------------+ |    1| 23.1| 23.1|   [1.0]|24.563807596513133| |    2| 30.0| 30.0|   [2.0]|29.595283312577884| |    2| 33.0| 33.0|   [2.0]|29.595283312577884| |    3| 34.0| 34.0|   [3.0]| 34.62675902864264| |    3| 30.0| 30.0|   [3.0]| 34.62675902864264| |    3| 38.0| 38.0|   [3.0]| 34.62675902864264| |    4| 40.0| 40.0|   [4.0]| 39.65823474470739| |   14| 89.0| 89.0|  [14.0]| 89.97299190535493| |   16|102.0|102.0|  [16.0]|100.03594333748444| |   20|120.0|120.0|  [20.0]|120.16184620174346| |   22|131.0|131.0|  [22.0]|130.22479763387295| |   24|142.0|142.0|  [24.0]|140.28774906600245| +-----+-----+-----+--------+------------------+ Prediction for 40.0 guests is 220.79136052303852 Prediction for 40 guests
  • 68. (the complex ML code) LinearRegression lr = new LinearRegression() .setMaxIter(40) .setRegParam(1) .setElasticNetParam(1); LinearRegressionModel model = lr.fit(df); Double feature = 40.0; Vector features = Vectors.dense(40.0); double p = model.predict(features); /jgperrin/net.jgp.labs.sparkdq4ml Define algorithms and its (hyper)parameters Created a model from our data Apply the model to a new dataset: predict
  • 69. It’s all about the base model Same model Trainer ModelDataset #1 ModelDataset #2 Predicted Data Step 1: Learning phase Step 2..n: Predictive phase
  • 71. A (Big) Data Scenario Data Raw Data Ingestion DataQuality Pure Data Transformation Rich Data Load/Publish Data
  • 72. Key takeaways ๏ Big Data is easier than one could think ๏ Java is the way to go (or Python) ๏ New vocabulary for using Spark ๏ You have a friend to help (ok, me) ๏ Spark is fun ๏ Spark is easily extensible
  • 73. Going further ๏ Contact me @jgperrin ๏ Join the Spark User mailing list ๏ Get help from Stack Overflow ๏ fb.com/TriangleSpark ๏ Start a Spark meetup in Columbia, SC?
  • 74. Going further Spark in action (Second edition, MEAP) by Jean Georges Perrin published by Manning http://jgp.net/sia sprkans-681D sprkans-7538 ctwopen10119 40% off One two free books
  • 77. Spark in Action Second edition, MEAP by Jean Georges Perrin published by Manning http://jgp.net/sia
  • 78. Credits Photos by Pexels IBM PC XT by Ruben de Rijcke - http://dendmedia.com/ vintage/ - Own work, CC BY 3.0, https:// commons.wikimedia.org/w/index.php?curid=3610862 Illustrations © Jean Georges Perrin
  • 79. No more slides You’re on your own!