SlideShare a Scribd company logo
1 of 18
Download to read offline
Scalable Data Science
with Spark and R
Zeydy Ortiz, Ph. D. & Rob Montalvo
DataCrunch Lab, LLC
Twitter: @DCrunchLab
PyData Carolinas 2016
DataCrunch
Lab
Tutorial Objectives
 Understand basic concepts of Spark
 Learn how to explore data in SparkR
 Learn how to perform interactive analysis
in SparkR
 Learn to run machine learning algorithms in
SparkR
DataCrunch
Lab
What is Spark?
 A distributed computing framework
 Provides programming abstraction and
parallel runtime to hide complexities of
fault tolerance and slow machines
 Partitions out big data across multiple
machines and stores the data on those
machines' memory
DataCrunch
Lab
Apache Spark Components
DataCrunch
Lab
Why Spark?
Spark is fast...
Massively parallel
Minimizes I/O bottlenecks by storing data in
memory
Spark is partitioning-aware to avoid network-
intensive shuffles
DataCrunch
Lab
Spark supports multiple languages
DataCrunch
Lab
Spark Concepts - Driver and Executors
A Spark program consists of two programs:
 Driver program
runs on one machine ("main")
 Executor program
runs either on cluster nodes or in local threads
on the same machine
DataCrunch
Lab
Spark Concepts – Resilient Distributed
Dataset (RDD)
“The main abstraction Spark provides is
a resilient distributed dataset (RDD), which
is a collection of elements partitioned across
the nodes of the cluster that can be operated
on in parallel.”
DataCrunch
Lab
Spark Concepts - SparkDataFrame
"A SparkDataFrame is a distributed collection
of data organized into named columns. It is
conceptually equivalent to a table in a
relational database or a data frame in R, but
with richer optimizations under the hood."
DataCrunch
Lab
Spark Concepts – SparkDataFrame
Properties
 Immutable; once created cannot be changed
 Distributed across all Executors
 Can be created from many sources (HDFS, text
files, JSON, Parquet, Hive,...
 Can be cached in memory for later reuse
(optimization by you)
 Must have a schema (columns, each with name
and type)
DataCrunch
Lab
Spark Concepts - Operations
Transformations
 Are lazily evaluated (part of an
execution plan); are not
immediately executed
 Are executed only when an
action is invoked
 Create a new SparkDataFrame
from an existing one
Actions
 The mechanism to get results
out of Spark
 Trigger the execution of "the
execution plan"
DataCrunch
Lab
Setting up for the Tutorial
Pre-requisites:
 Databricks Community Edition account
https://databricks.com/ce
1. Log into your Databricks account
https://community.cloud.databricks.com/
2. Import tutorial notebook
http://bit.ly/DCL-SparkR
DataCrunch
Lab
Import Notebook
1. On a separate window (or tab), point your browser to
http://bit.ly/DCL-SparkR
2. Copy to the clipboard the URL to which the previous
bit.ly evaluates.
3. On the Databricks UI, click on Workspace. The
Workspace view comes up.
4. Click on the right-most dropdown arrow (next to your
User ID).
5. Select Import. The Import Item window comes up.
6. Select the URL radio button, and paste onto the space
the URL obtained on step 2 above.
7. Click Import.
DataCrunch
Lab
SparkR
DataCrunch
Lab
Exploring Old Faithful Geyser Data
Old Faithful, named by members of the
1870 Washburn Expedition, was once called
“Eternity’s Timepiece” because of the
regularity of its eruptions. Despite the
myth, this geyser has never erupted at
exact hourly intervals, nor is it the largest
or most regular geyser in Yellowstone.
Questions to explore:
• Historically, how long has been the wait between eruptions?
• How long does an eruption usually last?
• What is the most common wait time between eruptions?
• How long eruptions last for the most common wait time?
DataCrunch
Lab
Identifying Irises
Use machine learning algorithms to classify irises based on
their measured features
DataCrunch
Lab
Summary
 Explained basic concepts of Spark
 Learned how to explore data in SparkR
 Learned how to perform interactive
analysis in SparkR
 Learned to run machine learning algorithms
in SparkR
Thank You!
Zeydy Ortiz, Ph. D. & Rob Montalvo
zortiz @ datacrunchlab.com
rmontalvo @ datacrunchlab.com
Twitter: @DCrunchLab

More Related Content

What's hot

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesDatabricks
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingDavid Lauzon
 
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...Databricks
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksAnyscale
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...Spark Summit
 
Typesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayTypesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayLuka Zakrajšek
 
Time Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureTime Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureMarco Parenzan
 
Deep Dive into Spark
Deep Dive into SparkDeep Dive into Spark
Deep Dive into SparkEric Xiao
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisJen Aman
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Distributed ML/DL with Ignite ML Module Using Apache Spark as Database
Distributed ML/DL with Ignite ML Module Using Apache Spark as DatabaseDistributed ML/DL with Ignite ML Module Using Apache Spark as Database
Distributed ML/DL with Ignite ML Module Using Apache Spark as DatabaseDatabricks
 
SnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationSnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationNicolas Fränkel
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingDatabricks
 

What's hot (18)

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 Debriefing
 
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at Databricks
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
 
Typesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayTypesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and Play
 
Time Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureTime Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and Azure
 
Deep Dive into Spark
Deep Dive into SparkDeep Dive into Spark
Deep Dive into Spark
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph Analysis
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Distributed ML/DL with Ignite ML Module Using Apache Spark as Database
Distributed ML/DL with Ignite ML Module Using Apache Spark as DatabaseDistributed ML/DL with Ignite ML Module Using Apache Spark as Database
Distributed ML/DL with Ignite ML Module Using Apache Spark as Database
 
SnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationSnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy application
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time Bidding
 

Similar to Scalable Data Science with Spark and R

Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Lillian Pierson
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierDatabricks
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Synopsis on online shopping by sudeep singh
Synopsis on online shopping by  sudeep singhSynopsis on online shopping by  sudeep singh
Synopsis on online shopping by sudeep singhSudeep Singh
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Deploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesDeploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesPetteriTeikariPhD
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 

Similar to Scalable Data Science with Spark and R (20)

NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Synopsis on online shopping by sudeep singh
Synopsis on online shopping by  sudeep singhSynopsis on online shopping by  sudeep singh
Synopsis on online shopping by sudeep singh
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Deploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesDeploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and Kubernetes
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 

More from Zeydy Ortiz, Ph. D.

Best practices in building machine learning models in Azure ML
Best practices in building machine learning models in Azure MLBest practices in building machine learning models in Azure ML
Best practices in building machine learning models in Azure MLZeydy Ortiz, Ph. D.
 
Coverting data into business value
Coverting data into business valueCoverting data into business value
Coverting data into business valueZeydy Ortiz, Ph. D.
 
Analytics>Forward - Design Thinking for Data Science
Analytics>Forward - Design Thinking for Data ScienceAnalytics>Forward - Design Thinking for Data Science
Analytics>Forward - Design Thinking for Data ScienceZeydy Ortiz, Ph. D.
 

More from Zeydy Ortiz, Ph. D. (6)

Bias in AI
Bias in AIBias in AI
Bias in AI
 
Best practices in building machine learning models in Azure ML
Best practices in building machine learning models in Azure MLBest practices in building machine learning models in Azure ML
Best practices in building machine learning models in Azure ML
 
Coverting data into business value
Coverting data into business valueCoverting data into business value
Coverting data into business value
 
Analytics>Forward - Design Thinking for Data Science
Analytics>Forward - Design Thinking for Data ScienceAnalytics>Forward - Design Thinking for Data Science
Analytics>Forward - Design Thinking for Data Science
 
Data Science for Social Good
Data Science for Social GoodData Science for Social Good
Data Science for Social Good
 
Zeydy Ortiz _66
Zeydy Ortiz _66Zeydy Ortiz _66
Zeydy Ortiz _66
 

Recently uploaded

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 

Recently uploaded (20)

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 

Scalable Data Science with Spark and R

  • 1. Scalable Data Science with Spark and R Zeydy Ortiz, Ph. D. & Rob Montalvo DataCrunch Lab, LLC Twitter: @DCrunchLab PyData Carolinas 2016
  • 2. DataCrunch Lab Tutorial Objectives  Understand basic concepts of Spark  Learn how to explore data in SparkR  Learn how to perform interactive analysis in SparkR  Learn to run machine learning algorithms in SparkR
  • 3. DataCrunch Lab What is Spark?  A distributed computing framework  Provides programming abstraction and parallel runtime to hide complexities of fault tolerance and slow machines  Partitions out big data across multiple machines and stores the data on those machines' memory
  • 5. DataCrunch Lab Why Spark? Spark is fast... Massively parallel Minimizes I/O bottlenecks by storing data in memory Spark is partitioning-aware to avoid network- intensive shuffles
  • 7. DataCrunch Lab Spark Concepts - Driver and Executors A Spark program consists of two programs:  Driver program runs on one machine ("main")  Executor program runs either on cluster nodes or in local threads on the same machine
  • 8. DataCrunch Lab Spark Concepts – Resilient Distributed Dataset (RDD) “The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”
  • 9. DataCrunch Lab Spark Concepts - SparkDataFrame "A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood."
  • 10. DataCrunch Lab Spark Concepts – SparkDataFrame Properties  Immutable; once created cannot be changed  Distributed across all Executors  Can be created from many sources (HDFS, text files, JSON, Parquet, Hive,...  Can be cached in memory for later reuse (optimization by you)  Must have a schema (columns, each with name and type)
  • 11. DataCrunch Lab Spark Concepts - Operations Transformations  Are lazily evaluated (part of an execution plan); are not immediately executed  Are executed only when an action is invoked  Create a new SparkDataFrame from an existing one Actions  The mechanism to get results out of Spark  Trigger the execution of "the execution plan"
  • 12. DataCrunch Lab Setting up for the Tutorial Pre-requisites:  Databricks Community Edition account https://databricks.com/ce 1. Log into your Databricks account https://community.cloud.databricks.com/ 2. Import tutorial notebook http://bit.ly/DCL-SparkR
  • 13. DataCrunch Lab Import Notebook 1. On a separate window (or tab), point your browser to http://bit.ly/DCL-SparkR 2. Copy to the clipboard the URL to which the previous bit.ly evaluates. 3. On the Databricks UI, click on Workspace. The Workspace view comes up. 4. Click on the right-most dropdown arrow (next to your User ID). 5. Select Import. The Import Item window comes up. 6. Select the URL radio button, and paste onto the space the URL obtained on step 2 above. 7. Click Import.
  • 15. DataCrunch Lab Exploring Old Faithful Geyser Data Old Faithful, named by members of the 1870 Washburn Expedition, was once called “Eternity’s Timepiece” because of the regularity of its eruptions. Despite the myth, this geyser has never erupted at exact hourly intervals, nor is it the largest or most regular geyser in Yellowstone. Questions to explore: • Historically, how long has been the wait between eruptions? • How long does an eruption usually last? • What is the most common wait time between eruptions? • How long eruptions last for the most common wait time?
  • 16. DataCrunch Lab Identifying Irises Use machine learning algorithms to classify irises based on their measured features
  • 17. DataCrunch Lab Summary  Explained basic concepts of Spark  Learned how to explore data in SparkR  Learned how to perform interactive analysis in SparkR  Learned to run machine learning algorithms in SparkR
  • 18. Thank You! Zeydy Ortiz, Ph. D. & Rob Montalvo zortiz @ datacrunchlab.com rmontalvo @ datacrunchlab.com Twitter: @DCrunchLab