2019-05-25
Oops!
I wrote my data
science in COBOL
Which language?
Context | Technical | Economic
Program
Data Science
for us
The use of Advanced
Analytics to help with
complex business
decisions or problems
Advanced
Analytics
for us
Handle Large, Complex
Datasets
Statistical Algorithms
Visualizations, Heuristics
Artificial Intelligence
Machine Learning
Prediction Machine
What languages should we
choose from?
Our Question
so far
? ?
? ?
? ?
? ?
? ?
? ?
? ?
GitHub – Top Languages
JavaScript
Java
Python
PHP
C++
C#
TypeScript
Shell
C
Ruby
Source: Octoverse 2018
Stack Overflow –
Top Languages
Source: Stack Overflow Insights 2019
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
so far
What languages should we choose
from?
Context | Technical | Economic
O
Obtain
S
Scrub
E
Explore
M
Model
N
iNterpret
O
Obtain
S
Scrub
E
Explore
M
Model
N
iNterpret
•
Source: A Taxonomy of Data Science
Source: 2018 Kaggle ML & DS Survey
O S E M N
Flat Tables
Related Tables
CSV, TSV, fixed-width
JSON, XML
HTML, CSS
O
S
E
M
N
Our Question
so far
What language is good for:
- Querying related tables?
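As a rough illustration of what "querying related tables" means in practice, here is a minimal sketch that runs a SQL join from Python using the standard-library `sqlite3` module. The table names, columns and rows are hypothetical (loosely modelled on the customer examples used later in this deck), and an in-memory database stands in for a real application database or warehouse.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT PRIMARY KEY, province TEXT);
    CREATE TABLE orders (customer TEXT, item TEXT, price REAL);
    INSERT INTO customers VALUES
        ('Wonka Industries', 'AB'),
        ('Wayne Enterprises', 'BC');
    INSERT INTO orders VALUES
        ('Wonka Industries', 'Toffee', 5.0),
        ('Wayne Enterprises', 'Vitamin D', 25.0),
        ('Wonka Industries', 'Toffee', 5.0);
""")


def total_spend_by_province(connection):
    """Join orders to customers and sum the order price per province."""
    rows = connection.execute("""
        SELECT c.province, SUM(o.price)
        FROM orders AS o
        JOIN customers AS c ON c.name = o.customer
        GROUP BY c.province
        ORDER BY c.province
    """).fetchall()
    return dict(rows)


totals = total_spend_by_province(conn)
print(totals)
```

The relationship between the two tables is expressed declaratively in the `JOIN ... ON` clause; the host language only ships the query and collects the rows.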
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
O
S
E
M
N
Flat Tables
Related Tables
CSV, TSV, fixed-width
JSON, XML
HTML, CSS
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML
and scraping websites?
Our Question
so far
COBOL
SQL
What we have What we need
O
S
E
M
N
O
S
E
M
N
Inconsistencies
Alberta = AB
Customer | Province
Wonka Industries | Alberta
Stark Industries | AB
Wayne Enterprises | BC
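One way to scrub this kind of inconsistency is to map every known spelling to a single canonical code. The sketch below assumes a small, hand-maintained lookup table; the `PROVINCE_CODES` mapping and the `normalize_province` helper are hypothetical names introduced for illustration.

```python
# Hypothetical lookup: every known spelling maps to one canonical code.
PROVINCE_CODES = {
    "alberta": "AB",
    "ab": "AB",
    "british columbia": "BC",
    "bc": "BC",
}


def normalize_province(value):
    """Return the canonical code, or the input unchanged if it is unknown."""
    key = value.strip().lower()
    return PROVINCE_CODES.get(key, value)


customers = [
    ("Wonka Industries", "Alberta"),
    ("Stark Industries", "AB"),
    ("Wayne Enterprises", "BC"),
]
cleaned = [(name, normalize_province(province)) for name, province in customers]
print(cleaned)
```

Leaving unknown values untouched, rather than guessing, keeps the remaining inconsistencies visible for a later pass.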
O
S
E
M
N
Categorical Data
Customer | Death By
Wonka Industries | Chocolate
Stark Industries | Plasma Burns
Wayne Enterprises | Multiple Contusions

Customer | Death by Chocolate | Death by Plasma Burns | Death by Multiple Contusions
Wonka Industries | 1 | 0 | 0
Stark Industries | 0 | 1 | 0
Wayne Enterprises | 0 | 0 | 1
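The transformation between the two tables above is commonly called one-hot encoding. A minimal pure-Python sketch follows; the `one_hot` helper and the column-naming scheme are assumptions made for illustration, and in practice a data-frame library would usually provide this.

```python
def one_hot(records, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({record[column] for record in records})
    encoded = []
    for record in records:
        # Copy every field except the categorical one.
        row = {key: value for key, value in record.items() if key != column}
        for category in categories:
            row[f"{column}: {category}"] = 1 if record[column] == category else 0
        encoded.append(row)
    return encoded


records = [
    {"Customer": "Wonka Industries", "Death By": "Chocolate"},
    {"Customer": "Stark Industries", "Death By": "Plasma Burns"},
    {"Customer": "Wayne Enterprises", "Death By": "Multiple Contusions"},
]
encoded = one_hot(records, "Death By")
for row in encoded:
    print(row)
```

Each row ends up with exactly one flag set, so no artificial ordering or averaging between categories is implied.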
O
S
E
M
N
Flatten (Denormalize)
Customer | Province
Wonka Industries | Alberta
Stark Industries | AB
Wayne Enterprises | BC

Customer | Item | Price | Date
Wonka Industries | Toffee | 5.00 | 2018-12-31
Stark Industries | Iron | 15.00 | 2018-03-30
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31
Wonka Industries | Toffee | 5.00 | 2019-01-04
Stark Industries | Iron | 15.00 | 2018-04-15
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01

Customer | Death By
Wonka Industries | Chocolate
Stark Industries | Plasma Burns
Wayne Enterprises | Multiple Contusions
O
S
E
M
N
Flatten (Denormalize)
Customer | Item | Price | Date | Province | Death By
Wonka Industries | Toffee | 5.00 | 2018-12-31 | Alberta | Chocolate
Stark Industries | Iron | 15.00 | 2018-03-30 | AB | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31 | BC | Multiple Contusions
Wonka Industries | Toffee | 5.00 | 2019-01-04 | Alberta | Chocolate
Stark Industries | Iron | 15.00 | 2018-04-15 | AB | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01 | BC | Multiple Contusions
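Flattening amounts to joining the lookup tables onto the transaction table, keyed by customer. The sketch below does this with plain dictionaries; the `flatten` function and its argument names are hypothetical, and only two of the six order rows are reproduced to keep it short.

```python
def flatten(orders, provinces, causes):
    """Attach each customer's province and cause to every order row."""
    return [
        {
            **order,
            "Province": provinces[order["Customer"]],
            "Death By": causes[order["Customer"]],
        }
        for order in orders
    ]


provinces = {"Wonka Industries": "Alberta", "Stark Industries": "AB"}
causes = {"Wonka Industries": "Chocolate", "Stark Industries": "Plasma Burns"}
orders = [
    {"Customer": "Wonka Industries", "Item": "Toffee", "Price": 5.00, "Date": "2018-12-31"},
    {"Customer": "Stark Industries", "Item": "Iron", "Price": 15.00, "Date": "2018-03-30"},
]
flat = flatten(orders, provinces, causes)
for row in flat:
    print(row)
```

The result has one wide row per order, which is the shape most modelling libraries expect.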
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables /
DataFrames?
Our Question
so far
C
COBOL
SQL
Define or
observe
process
Codify process
Repeatable
Outcome
Verify
Result
O
S
E
M
N
Programming an Application
Define or
observe
problem
Experiment
Observe
Result
Exp.
Exp.
Exp.
Strong library of math algorithms and visualizations
O
S
E
M
N
Programming in Data Science
Interactive (REPL) languages
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for math analysis and
in-flight visualizations?
Our Question
so far
Source: Wikipedia – Interactive Languages
C
C# COBOL
GO
Kotlin
Rust
C++
Java
SQL
O
S
E
M
N
O
S
E
M
N
$y = ax^2 + bx + c$
$y = 10x^2 + 5x + 12$
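To make the step from the general form to the populated form concrete, here is a small sketch that "trains" the quadratic by ordinary least squares and recovers the coefficients 10, 5 and 12. The training points are synthetic (generated from the second equation itself), and `fit_quadratic` is a hypothetical helper that solves the normal equations with exact fractions; a real project would more likely call a numerical library.

```python
from fractions import Fraction


def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    # Each design-matrix row is [x^2, x, 1].
    rows = [[Fraction(x) ** 2, Fraction(x), Fraction(1)] for x, _ in points]
    ys = [Fraction(y) for _, y in points]
    # Augmented 3x4 matrix for (A^T A) p = A^T y.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with a simple non-zero pivot search.
    for col in range(3):
        pivot = next(r for r in range(col, 3) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [value / m[col][col] for value in m[col]]
        for r in range(3):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] for i in range(3))


# Synthetic training data drawn from y = 10x^2 + 5x + 12.
training_data = [(x, 10 * x**2 + 5 * x + 12) for x in range(-2, 3)]
a, b, c = fit_quadratic(training_data)
print(a, b, c)
```

The algorithm (a quadratic with unknown coefficients) stays the same; training only fills in `a`, `b` and `c` from data.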
O
S
E
M
N
O
S
E
M
N
Source:PeekabooVision
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Libraries for machine learning?
- Distributed modeling on Spark?
Python R
Scala
C
C# COBOL
GO Java
Kotlin
Rust
C++
SQL
Ruby
Context | Technical | Economic
Metcalfe’s
Law
The value of a network grows as
the square of the number of its
users
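A quick numeric sketch of the law as stated: if value grows with the square of the user count, doubling the users quadruples the value. The proportionality constant `k` is an arbitrary assumption, since the law only fixes the shape of the curve.

```python
def network_value(users, k=1.0):
    """Metcalfe-style value: proportional to the square of the user count."""
    return k * users ** 2


small = network_value(100)
large = network_value(200)
growth = large / small  # doubling the users multiplies the value by four
print(small, large, growth)
```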
Network Effects
Metcalfe’s
Law
Network Effects
Users /
Nodes
Value
Network
Network Effects
Source: Stack Overflow Insights 2019
Network Effects
Source: Stack Overflow Insights 2019
Network Effects
Source: Stack Overflow Tags and GitHub Stars
Network Effects
Source: Stack Overflow Tags and GitHub Stars
7 years old
15 years old
25 years old
29 years old
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Reducing the amount of time spent
debugging and writing code that
already exists?
C
C# COBOL
GO Java
Julia Kotlin
Ruby Rust
SQL
C++
Python R
Scala
Quantity
Price
Supply
Demand
Discount
Premium
Supply and Demand
Supply and Demand
Source: Supply & Demand by Vilmos Müller
Language | Supply | Demand | Fulfillment
Python | 86% | 34% | 2.5x
R | 38% | 8% | 4.8x
Scala | <10% | 12% | 0.8x
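The fulfillment column appears to be supply divided by demand. The sketch below recomputes it from the percentages in the table, treating Scala's "<10%" supply as an upper bound of 10 (an assumption, since the exact figure is not given).

```python
# (supply %, demand %) taken from the table; Scala supply is an upper bound.
market = {
    "Python": (86, 34),
    "R": (38, 8),
    "Scala (upper bound)": (10, 12),
}

fulfillment = {lang: supply / demand for lang, (supply, demand) in market.items()}

for lang, ratio in fulfillment.items():
    status = "oversupplied" if ratio > 1 else "undersupplied"
    print(f"{lang}: {ratio:.2f}x ({status})")
```

A ratio below 1 means demand outstrips supply, which is the situation the deck flags for Scala.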
Source: Stack Overflow Insights 2019
Supply and Demand
Language | Cost | Premium
Python | $63k | baseline
R | $64k | ~1.6%
Scala | $78k | 24%
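The premium column can be checked as the relative difference from the Python baseline. This sketch uses the rounded salary figures shown on the slide (in thousands of dollars), so the results are only as precise as those figures.

```python
# Rounded salaries from the slide, in thousands of dollars.
salaries = {"Python": 63, "R": 64, "Scala": 78}
baseline = salaries["Python"]


def premium(cost, base):
    """Relative premium over the baseline, as a fraction (0.24 means 24%)."""
    return (cost - base) / base


premiums = {lang: premium(cost, baseline) for lang, cost in salaries.items()}

for lang, value in premiums.items():
    print(f"{lang}: {value:.1%}")
```

With these inputs the Scala premium comes out at roughly 24%, while the R premium is between 1% and 2%.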
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
Python R
C
C# COBOL
GO Java
Julia Kotlin
Ruby Rust
SQL
C++
Scala
Time
Knowledge
Learning Curve
Fast Learning Curve
Typical Learning Curve
Practitioner
Novice
Expert
Time Savings = Cost Savings
Learning Curve
Source: Coding Dojo
Learning Curve
Source: WP Engine
Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
- Reducing training costs?
Python R
C
C# COBOL
GO Java
Julia Kotlin
Ruby Rust
SQL
C++
Scala
R
Python
Solution
Source: 2018 Kaggle ML & DS Survey
O S E M N
Python | SQL | Algorithms
O
Obtain
S
Scrub
E
Explore
M
Model
N
iNterpret
O
Obtain
S
Scrub
E
Explore
M
Model
N
iNterpret
O
S
E
M
N
O
S
E
M
N
Python | SQL | Algorithms
Storytelling

Oops! I Wrote my Data Science in COBOL

Editor's Notes

  1. An individual can
  2. Led to the question, what is the right language for data science?
  3. We’ll start by exploring some technical aspects, then some economic ones … but first, some context
  4. Building a program, and building a team around the program. This is not about you as an individual. Will need to consider not just the current business problem, but the many business problems you will face.
  5. Needs refinement
  6. Underlying these are statistical analyses, algorithms, models, visualizations … anything that results in a prediction machine
  7. Based on Github and SO, here’s our list. Now I’ll ask your indulgence here, I’ve added Julia and Scala because they are highly relevant in the context of data science. And we can’t forget COBOL, which we are already starting to see was a mistake in my fictitious data science project. Let’s continue to build our question
  8. Remove anything dedicated to web-programming. Remove anything dedicated to shell scripting.
  9. Let’s get rid of: what we already cleared out; anything exclusively for web or mobile apps. Take everything above VBA, because I hate VBA and I never want to talk about VBA again and I’ve already said VBA too many times in this sentence.
  10. Based on Github and SO, here’s our list. Now I’ll ask your indulgence here, I’ve added Julia and Scala because they are highly relevant in the context of data science. And we can’t forget COBOL, which we are already starting to see was a mistake in my fictitious data science project. Let’s continue to build our question
  11. …with a focus on some technical elements. We’ll use the OSEMN model from earlier to walk through the technical gauntlet.
  12. OSEMN model (“Awesome”), 2010, by Hilary Mason and Chris Wiggins. Simplified, but it does a good job of capturing the essence of data science. http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492 https://medium.com/@randylaosat/life-of-data-data-science-is-osemn-f453e1febc10
  13. OBTAIN. Although many datasets are obtained from APIs and from scraping websites, the vast majority still come from databases that house the data in a structured form. These might be application databases, ODS, data warehouses, semantic layers ... Regardless, they’re treated as structured databases. Databases contain almost all of our contextual (reference) data, and almost all of our industry-secret data. Websites contain a wealth of data when trying to extrapolate information that the world, in general, has to offer.
  14. SQL was designed to query tables! In fact, most languages have abstraction libraries that allow you to write SQL or almost-SQL … and most of those are translated into SQL when executed against databases. It is the de facto standard for extracting data from databases, and this point must not be understated.
  15. All of them can handle Excel / CSV / JSON / XML, even COBOL! Not a helpful question to ask. Let’s ignore it and carry on.
  16. Reduce; clean; transform; categorize / label; observe / take notes … what might be a good feature? What is unnecessary noise? What might be an outcome?
  17. 1-hot encoding / binarize. If we simply provide a numerical category, then the average of “Chocolate” and “Multiple Contusions” = “Plasma Burns”.
  18. http://elitedatascience.com/data-cleaning Reduce to what you need; remove outliers; handle missing data.
  19. Highlight SQL for O and S … it’s so valuable that in the very early days of big data, SQL interpreters were essential to adoption. This is a show-stopper: if there aren’t native objects or generally accepted libraries that help a language manage data natively as a table, then there’s nowhere to go. C is really close to bare metal (i.e. a low-level language), making it non-ideal. It’s possible, not pragmatic.
  20. Very All about workflows
  21. Could write our own libraries, but this is an immensely costly effort … and our objective is to make this a cost-effective team / program. Read – Evaluate – Print – Loop https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Interactive_mode_languages
  22. What is modeling? A trained model = a populated algorithm
  23. Knowing the algorithm and its purpose; applying the right one to the problem at hand; training a model; testing the model’s validity; tuning the parameters. Training and testing need large datasets. Some algorithms are complex and need mucho data. These require a distributed environment: distributed data frames. The person filling this role needs to know which algorithms are available, and when to apply the appropriate ones.
  24. Julia could, but it’s still really, really new. We are now down to the most relevant languages for data science. For anyone who’s familiar with the field, this is where the question gets difficult. Good time to call out productionalization of a trained model… can rewrite it in a low-level language for efficiency, or can scale the solution in the cloud. I’ve intentionally ignored that
  25. Let’s see if we can use some economic principles to help us expand our question. Network Effects Supply / Demand Learning Curve
  26. Network effect (value of X is amplified by Y connected nodes). Number of developers that know the language (SO survey + Google searches); number of Google pages / SO answers / GitHub libraries in the language; number of libraries; number of developers.
  27. Network effect (value of X is amplified by Y connected nodes). Number of developers that know the language; number of SO answers; number of GitHub libraries in the language.
  28. Observations: Python’s network with frameworks Python’s response size (network of people using it) Relationship to data-science frameworks Pandas, PyTorch, Tensorflow All interlinked with Jupyter acting as a node
  29. Observations: Python’s network with frameworks Python’s response size (network of people using it) Relationship to data-science frameworks Pandas, PyTorch, Tensorflow All interlinked with Jupyter acting as a node
  30. Y: Github Repos (including data-sci / machine-learning libraries, categorized by target language) X: StackOverflow questions (including libraries, categorized by target language) Bubblesize: Language popularity Why no SQL?
  31. I don’t have to re-invent what I can re-use Someone else is bound to have hit the problem I’m facing
  32. Data based on Kaggle datasets including the 2018 Kaggle survey and a job-demand dataset. We can’t fulfill Scala … in economic theory, if supply is below demand, we have to pay a premium to get it. This is really important, let’s validate this.
  33. Here’s SO’s pay-by-technology breakdown. Let’s zoom in on the relevant entries Note: Doesn’t account for cross-training.
  34. Also, timely staffing when turnover occurs, and reduced poaching
  35. On the fast learning curve, we get to being a practitioner much faster. Even if reaching expert takes around the same time, the developer can be useful much sooner. The faster something can be learned: the lower the up-front cost; the lower the barrier to entry; the greater the adoption. This leads to an amplified network effect. Virtuous cycle!
  36. The faster something can be learned: the lower the up-front cost; the lower the barrier to entry; the greater the adoption. This leads to an amplified network effect. Virtuous cycle! https://www.codingdojo.com/blog/python-perfect-beginners
  37. The faster something can be learned: the lower the up-front cost; the lower the barrier to entry; the greater the adoption. This leads to an amplified network effect. Virtuous cycle! https://www.codingdojo.com/blog/python-perfect-beginners
  38. Don’t need a homogeneous team! Remember: not mutually exclusive! Depending on the size of the team and the problem at hand … team makeup can vary significantly. Technical conclusion: Roles <-> Languages and Knowledge. Which language has the best combination of features and is most pliable across the data science process?
  39. Our team, as a team, needs to know The syntax, patterns, principles and utilization of Python To understand which algorithms are appropriate to the problem
  40. Remember the OSEMN model? We didn’t talk about the last step – Interpreting! This is what makes it real for stakeholders. If you can’t explain what you did, why you did it, and what the results imply … then it was all for nought.
  41. TRUST Stakeholders and users of our model want to trust it. If they don’t understand it, they don’t trust it. What data did we obtain? What did we do to scrub it? Why did we choose these algorithms, and this training data? What biases could remain? Under what conditions does this start to break down?
  42. The answer is actually this: storytelling is clear communication in a natural language (English). I hope you have enjoyed my storytelling today. Thank you.
  43. Interesting chart: Stack Overflow (vertical axis) vs. GitHub (horizontal axis)
  44. https://www.tiobe.com/tiobe-index/
  45. https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/
  46. https://insights.stackoverflow.com/survey/2019#technology