A technical and economic breakdown of what language to choose as the core for your data science team.
The accompanying slides to my presentation at the 2019 DMC conference in Calgary, AB.
17. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
Our Question so far
COBOL
SQL
20. OSEMN
Categorical Data

Customer          | Death By
Wonka Industries  | Chocolate
Stark Industries  | Plasma Burns
Wayne Enterprises | Multiple Contusions

Customer          | Death by Chocolate | Death by Plasma Burns | Death by Multiple Contusions
Wonka Industries  | 1                  | 0                     | 0
Stark Industries  | 0                  | 1                     | 0
Wayne Enterprises | 0                  | 0                     | 1
21. OSEMN
Flatten (Denormalize)
Customer          | Province
Wonka Industries  | Alberta
Stark Industries  | AB
Wayne Enterprises | BC

Customer          | Item      | Price | Date
Wonka Industries  | Toffee    | 5.00  | 2018-12-31
Stark Industries  | Iron      | 15.00 | 2018-03-30
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31
Wonka Industries  | Toffee    | 5.00  | 2019-01-04
Stark Industries  | Iron      | 15.00 | 2018-04-15
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01

Customer          | Death By
Wonka Industries  | Chocolate
Stark Industries  | Plasma Burns
Wayne Enterprises | Multiple Contusions
22. OSEMN
Flatten (Denormalize)
Customer          | Item      | Price | Date       | Province | Death By
Wonka Industries  | Toffee    | 5.00  | 2018-12-31 | Alberta  | Chocolate
Stark Industries  | Iron      | 15.00 | 2018-03-30 | AB       | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31 | BC       | Multiple Contusions
Wonka Industries  | Toffee    | 5.00  | 2019-01-04 | Alberta  | Chocolate
Stark Industries  | Iron      | 15.00 | 2018-04-15 | AB       | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01 | BC       | Multiple Contusions
23. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
Our Question so far
C
COBOL
SQL
26. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for math analysis and in-flight visualizations?
Our Question so far
Source: Wikipedia – Interactive Languages
C C++ C# COBOL GO Java Kotlin Rust SQL
29. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Libraries for machine learning?
- Distributed modeling on Spark?
Python R Scala
C C++ C# COBOL GO Java Kotlin Ruby Rust SQL
37. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Reducing the amount of time spent debugging and writing code that already exists?
C C++ C# COBOL GO Java Julia Kotlin Ruby Rust SQL
Python R Scala
41. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
Python R
C C++ C# COBOL GO Java Julia Kotlin Ruby Rust Scala SQL
45. Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
- Reducing training costs?
Python
C C++ C# COBOL GO Java Julia Kotlin R Ruby Rust Scala SQL
This led to the question: what is the right language for data science?
We’ll start by exploring some technical aspects, then some economic ones … but first, some context
Building a program, and building a team around the program.
This is not about you as an individual.
Will need to consider not just the current business problem, but the many business problems you will face.
Needs refinement
Underlying these are statistical analyses, algorithms, models, visualizations … anything that results in a prediction machine
Based on Github and SO, here’s our list.
Now I’ll ask your indulgence here, I’ve added Julia and Scala because they are highly relevant in the context of data science.
And we can’t forget COBOL, which we are already starting to see was a mistake in my fictitious data science project.
Let’s continue to build our question
Remove anything dedicated to web-programming.
Remove anything dedicated to shell scripting.
Let’s get rid of:
What we already cleared out
Exclusive for web or mobile app
Take everything above VBA because I hate VBA and I never want to talk about VBA again and I’ve already said VBA too many times in this sentence.
Let’s continue to build our question
…with a focus on some technical elements.
We’ll use the OSEMN model from earlier to walk through the technical gauntlet.
The OSEMN model (pronounced "awesome"), introduced in 2010 by Hilary Mason and Chris Wiggins.
Simplified, but it does a good job of capturing the essence of data science.
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492
https://medium.com/@randylaosat/life-of-data-data-science-is-osemn-f453e1febc10
OBTAIN
Although there are many datasets obtained from APIs and from scraping websites, the vast majority still comes from databases that house the data in a structured form.
These might be application databases, ODS, data warehouses, semantic layers ... Regardless, they’re treated as structured databases
Databases contain almost all of our contextual (reference) data, and almost all of our industry-secret data
Websites contain a wealth of data when you need to draw on information the world at large has to offer.
SQL was designed to query tables!
In fact, most languages have abstraction libraries that allow you to write SQL or almost-SQL … and most of those are translated into SQL when executed against databases.
It is the de facto standard for extracting data from databases, and this point cannot be overstated.
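To make that concrete, here is a minimal sketch of querying two related tables with plain SQL, using Python's built-in sqlite3 module. The table names, columns, and rows are invented to mirror the toy tables in the slides, not taken from any real system.

```python
# Sketch: querying two related tables with plain SQL, via Python's
# built-in sqlite3 module. Schema and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, province TEXT);
    CREATE TABLE orders    (customer_id INTEGER, item TEXT, price REAL);
    INSERT INTO customers VALUES (1, 'Wonka Industries', 'AB');
    INSERT INTO orders    VALUES (1, 'Toffee', 5.00);
""")

# The join pulls the related tables together into a single result set
rows = conn.execute("""
    SELECT c.name, c.province, o.item, o.price
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchall()
# rows == [('Wonka Industries', 'AB', 'Toffee', 5.0)]
```

This is exactly the kind of query that the abstraction libraries mentioned above ultimately translate into.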
All of them can handle Excel / CSV / JSON / XML, even COBOL!
Not a helpful question to ask. Let’s ignore it and carry on.
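For what it's worth, even Python's standard library handles these formats directly; a minimal sketch with a made-up record, showing CSV and JSON parsing to the same logical data:

```python
# Sketch: parsing the same made-up record from CSV and from JSON,
# with nothing but the standard library.
import csv
import io
import json

csv_text  = "customer,item,price\nWonka Industries,Toffee,5.00\n"
json_text = '[{"customer": "Wonka Industries", "item": "Toffee", "price": 5.0}]'

csv_rows  = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# Same logical record; note that CSV delivers every value as a string
assert csv_rows[0]["customer"] == json_rows[0]["customer"]
assert float(csv_rows[0]["price"]) == json_rows[0]["price"]
```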
Reduce
Clean
Transform
Categorize / Label
Observe / Take notes … what might be a good feature? What is unnecessary noise? What might be an outcome?
1-hot encoding / binarize
If we simply provide a numerical category, then the average of “Chocolate” and “Multiple Contusions” = “Plasma Burns”
http://elitedatascience.com/data-cleaning
Reduce to what you need
Remove outliers
Handle missing data
Highlight SQL for O and S … it's so valuable that in the very early days of big data, SQL interpreters were essential to adoption.
This is a show-stopper. If there aren't native objects or generally accepted libraries that help a language manage data natively as a table, then there's nowhere to go.
C is really close to bare metal (i.e. low-level language), making it non-ideal. It’s possible, not pragmatic.
It's all about workflows.
Could write our own libraries, but this is an immensely costly effort … and our objective is to make this a cost-effective team / program.
Read – Evaluate – Print – Loop
https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop
https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Interactive_mode_languages
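To make the read–evaluate–print–loop concrete, here is a toy sketch of the loop itself. A real REPL handles statements, errors, and history; this one only evaluates expressions, and the session data is made up:

```python
# A tiny REPL of our own: read a line, evaluate it, record the result, loop.
# Sketch only -- the real Python REPL does far more than this.
def mini_repl(lines):
    env = {}
    outputs = []
    for line in lines:                 # Read (the loop itself is the L)
        result = eval(line, env)       # Evaluate
        outputs.append(result)         # Print (collected here instead)
    return outputs

session = mini_repl(["2 + 2", "sum([5.00, 15.00, 25.00]) / 3"])
# session == [4, 15.0]
```

This tight feedback cycle is what makes interactive languages so productive for exploratory analysis.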
What is modeling?
A trained model = a populated algorithm
Knowing the algorithm and the purpose;
Applying the right one to the problem at hand;
Training a model
Testing the model’s validity
Tuning the parameters
Training and testing need large datasets
Some algorithms are complex and need mucho data
These require a distributed environment
Distributed data frames
The person filling this role needs to know which algorithms are available, and when to apply the appropriate ones.
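The train/test cycle described above can be sketched in a few lines. This is a toy illustration with made-up numbers, and ordinary least squares standing in for a real algorithm:

```python
# Toy train/test split: fit y = slope*x + intercept on training points by
# least squares, then measure error on held-out test points. Data is made up.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
test_x,  test_y  = [5, 6], [10.2, 11.8]

slope, intercept = fit(train_x, train_y)                 # training
predictions = [slope * x + intercept for x in test_x]    # apply to unseen data
test_error = sum(abs(p - y) for p, y in zip(predictions, test_y)) / len(test_y)
# a small mean absolute error on the held-out points = the model generalizes
```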
Julia could, but it’s still really, really new.
We are now down to the most relevant languages for data science.
For anyone who’s familiar with the field, this is where the question gets difficult.
Good time to call out productionalization of a trained model… can rewrite it in a low-level language for efficiency, or can scale the solution in the cloud.
I’ve intentionally ignored that
Let’s see if we can use some economic principles to help us expand our question.
Network Effects
Supply / Demand
Learning Curve
Network effect (value of X is amplified by Y connected nodes)
Number of developers that know the language (SO survey + google searches)
Number of google pages / SO answers / github libraries in the language
Number of libraries
Number of developers
Network effect (value of X is amplified by Y connected nodes)
Number of developers that know the language
Number of SO answers
Number of github libraries in the language
Observations:
Python’s network with frameworks
Python’s response size (network of people using it)
Relationship to data-science frameworks
Pandas, PyTorch, Tensorflow
All interlinked with Jupyter acting as a node
Y: Github Repos (including data-sci / machine-learning libraries, categorized by target language)
X: StackOverflow questions (including libraries, categorized by target language)
Bubblesize: Language popularity
Why no SQL?
I don’t have to re-invent what I can re-use
Someone else is bound to have hit the problem I’m facing
Data based on Kaggle datasets including the 2018 Kaggle survey and a job-demand dataset.
We can't fill the demand for Scala … in economic theory, if supply is below demand, we have to pay a premium to get it.
This is really important, let’s validate this.
Here’s SO’s pay-by-technology breakdown.
Let’s zoom in on the relevant entries
Note: Doesn’t account for cross-training.
Also, timely staffing when turnover occurs, and reduced poaching
On the fast learning curve, we get to being a practitioner much faster.
Even if reaching expert takes around the same time, the developer can be useful much sooner.
The faster something can be learned:
- The lower the up-front cost;
- The lower the barrier to entry;
- The greater the adoption
Leading to an amplified network effect
Virtuous cycle!
https://www.codingdojo.com/blog/python-perfect-beginners
Don’t need a homogeneous team!
Remember! Not mutually exclusive! Depending on the size of the team and the problem at hand … team makeup can vary significantly
Technical conclusion:
Roles <-> languages and knowledge
Which language has the right combination of features and is the most pliable across the data science process?
Our team, as a team, needs to know
The syntax, patterns, principles and utilization of Python
To understand which algorithms are appropriate to the problem
Remember the OSEMN model? We didn’t talk about the last step – Interpreting!
This is what makes it real for stakeholders. If you can’t explain what you did, why you did it, and what the results imply … then it was all for nought.
TRUST
Stakeholders and users of our model want to trust it.
If they don’t understand it, they don’t trust it.
What data did we obtain?
What did we do to scrub it?
Why did we choose these algorithms, and this training data?
What biases could remain?
Under what conditions does this start to break down?
The answer is actually this
Storytelling is clear communication in a natural language (English).
I hope you have enjoyed my storytelling today. Thank you.