A technical and economic breakdown of what language to choose as the core for your data science team.
The accompanying slides to my presentation at the 2019 DMC conference in Calgary, AB.
17. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
Our Question so far
COBOL
SQL
20. OSEMN
Categorical Data

Customer          | Death By
Wonka Industries  | Chocolate
Stark Industries  | Plasma Burns
Wayne Enterprises | Multiple Contusions

Customer          | Death by Chocolate | Death by Plasma Burns | Death by Multiple Contusions
Wonka Industries  | 1                  | 0                     | 0
Stark Industries  | 0                  | 1                     | 0
Wayne Enterprises | 0                  | 0                     | 1
21. OSEMN
Flatten (Denormalize)
Customer          | Province
Wonka Industries  | Alberta
Stark Industries  | AB
Wayne Enterprises | BC

Customer          | Item      | Price | Date
Wonka Industries  | Toffee    | 5.00  | 2018-12-31
Stark Industries  | Iron      | 15.00 | 2018-03-30
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31
Wonka Industries  | Toffee    | 5.00  | 2019-01-04
Stark Industries  | Iron      | 15.00 | 2018-04-15
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01

Customer          | Death By
Wonka Industries  | Chocolate
Stark Industries  | Plasma Burns
Wayne Enterprises | Multiple Contusions
22. OSEMN
Flatten (Denormalize)
Customer          | Item      | Price | Date       | Province | Death By
Wonka Industries  | Toffee    | 5.00  | 2018-12-31 | Alberta  | Chocolate
Stark Industries  | Iron      | 15.00 | 2018-03-30 | AB       | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31 | BC       | Multiple Contusions
Wonka Industries  | Toffee    | 5.00  | 2019-01-04 | Alberta  | Chocolate
Stark Industries  | Iron      | 15.00 | 2018-04-15 | AB       | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01 | BC       | Multiple Contusions
23. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
Our Question so far
C
COBOL
SQL
26. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for math analysis and in-flight visualizations?
Our Question so far
Source: Wikipedia – Interactive Languages
C C++ C# COBOL GO Java Kotlin Rust SQL
29. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Libraries for machine learning?
- Distributed modeling on Spark?
Python R Scala
C C++ C# COBOL GO Java Kotlin Ruby Rust SQL
37. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Reducing the amount of time spent debugging and writing code that already exists?
C C++ C# COBOL GO Java Julia Kotlin Ruby Rust SQL
Python R Scala
41. C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
Python R
C C++ C# COBOL GO Java Julia Kotlin Ruby Rust Scala SQL
45. Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
- Reducing training costs?
Python
C C++ C# COBOL GO Java Julia Kotlin R Ruby Rust Scala SQL
This led to the question: what is the right language for data science?
We’ll start by exploring some technical aspects, then some economic ones … but first, some context
Building a program, and building a team around the program.
This is not about you as an individual.
Will need to consider not just the current business problem, but the many business problems you will face.
Needs refinement
Underlying these are statistical analyses, algorithms, models, visualizations … anything that results in a prediction machine
Based on Github and SO, here’s our list.
Now I’ll ask your indulgence here, I’ve added Julia and Scala because they are highly relevant in the context of data science.
And we can’t forget COBOL, which we are already starting to see was a mistake in my fictitious data science project.
Let’s continue to build our question
Remove anything dedicated to web-programming.
Remove anything dedicated to shell scripting.
Let’s get rid of:
What we already cleared out
Exclusive for web or mobile app
Take everything above VBA because I hate VBA and I never want to talk about VBA again and I’ve already said VBA too many times in this sentence.
Let’s continue to build our question
…with a focus on some technical elements.
We’ll use the OSEMN model from earlier to walk through the technical gauntlet.
The OSEMN model (pronounced "awesome"), introduced in 2010 by Hilary Mason and Chris Wiggins.
Simplified, but it does a good job of capturing the essence of data science.
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492
https://medium.com/@randylaosat/life-of-data-data-science-is-osemn-f453e1febc10
OBTAIN
Although there are many datasets obtained from APIs and from scraping websites, the vast majority still comes from databases that house the data in a structured form.
These might be application databases, ODS, data warehouses, semantic layers ... Regardless, they’re treated as structured databases
Databases contain almost all of our contextual (reference) data, and almost all of our industry-secret data
Websites contain a wealth of data when you need to draw on information the world at large has to offer.
SQL was designed to query tables!
In fact, most languages have abstraction libraries that allow you to write SQL or almost-SQL … and most of those are translated into SQL when executed against databases.
It is the de facto standard for extracting data from databases, and this point cannot be overstated.
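To make that concrete, here is a minimal sketch of querying two related tables with plain SQL, using Python's built-in sqlite3 module. The table names, columns, and rows are invented to mirror the toy tables in the slides, not taken from any real system.

```python
# Sketch: querying two related tables with plain SQL, via Python's
# built-in sqlite3 module. Schema and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, province TEXT);
    CREATE TABLE orders    (customer_id INTEGER, item TEXT, price REAL);
    INSERT INTO customers VALUES (1, 'Wonka Industries', 'AB');
    INSERT INTO orders    VALUES (1, 'Toffee', 5.00);
""")

# The join pulls the related tables together into a single result set
rows = conn.execute("""
    SELECT c.name, c.province, o.item, o.price
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchall()
# rows == [('Wonka Industries', 'AB', 'Toffee', 5.0)]
```

This is exactly the kind of query that the abstraction libraries mentioned above ultimately translate into.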
All of them can handle Excel / CSV / JSON / XML, even COBOL!
Not a helpful question to ask. Let’s ignore it and carry on.
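For what it's worth, even Python's standard library handles these formats directly; a minimal sketch with a made-up record, showing CSV and JSON parsing to the same logical data:

```python
# Sketch: parsing the same made-up record from CSV and from JSON,
# with nothing but the standard library.
import csv
import io
import json

csv_text  = "customer,item,price\nWonka Industries,Toffee,5.00\n"
json_text = '[{"customer": "Wonka Industries", "item": "Toffee", "price": 5.0}]'

csv_rows  = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# Same logical record; note that CSV delivers every value as a string
assert csv_rows[0]["customer"] == json_rows[0]["customer"]
assert float(csv_rows[0]["price"]) == json_rows[0]["price"]
```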
Reduce
Clean
Transform
Categorize / Label
Observe / Take notes … what might be a good feature? What is unnecessary noise? What might be an outcome?
1-hot encoding / binarize
If we simply provide a numerical category, then the average of “Chocolate” and “Multiple Contusions” = “Plasma Burns”
http://elitedatascience.com/data-cleaning
Reduce to what you need
Remove outliers
Handle missing data
Highlight SQL for O and S … it's so valuable that in the very early days of big data, SQL interpreters were essential to adoption.
This is a show-stopper. If there aren't native objects or generally accepted libraries that help a language manage data natively as a table, then there's nowhere to go.
C is really close to bare metal (i.e. low-level language), making it non-ideal. It’s possible, not pragmatic.
It's all about workflows.
Could write our own libraries, but this is an immensely costly effort … and our objective is to make this a cost-effective team / program.
Read – Evaluate – Print – Loop
https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop
https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Interactive_mode_languages
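To make the read–evaluate–print–loop concrete, here is a toy sketch of the loop itself. A real REPL handles statements, errors, and history; this one only evaluates expressions, and the session data is made up:

```python
# A tiny REPL of our own: read a line, evaluate it, record the result, loop.
# Sketch only -- the real Python REPL does far more than this.
def mini_repl(lines):
    env = {}
    outputs = []
    for line in lines:                 # Read (the loop itself is the L)
        result = eval(line, env)       # Evaluate
        outputs.append(result)         # Print (collected here instead)
    return outputs

session = mini_repl(["2 + 2", "sum([5.00, 15.00, 25.00]) / 3"])
# session == [4, 15.0]
```

This tight feedback cycle is what makes interactive languages so productive for exploratory analysis.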
What is modeling?
A trained model = a populated algorithm
Knowing the algorithm and the purpose;
Applying the right one to the problem at hand;
Training a model
Testing the model’s validity
Tuning the parameters
Training and testing need large datasets
Some algorithms are complex and need mucho data
These require a distributed environment
Distributed data frames
The person filling this role needs to know which algorithms are available, and when to apply the appropriate ones.
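The train/test cycle described above can be sketched in a few lines. This is a toy illustration with made-up numbers, and ordinary least squares standing in for a real algorithm:

```python
# Toy train/test split: fit y = slope*x + intercept on training points by
# least squares, then measure error on held-out test points. Data is made up.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
test_x,  test_y  = [5, 6], [10.2, 11.8]

slope, intercept = fit(train_x, train_y)                 # training
predictions = [slope * x + intercept for x in test_x]    # apply to unseen data
test_error = sum(abs(p - y) for p, y in zip(predictions, test_y)) / len(test_y)
# a small mean absolute error on the held-out points = the model generalizes
```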
Julia could, but it’s still really, really new.
We are now down to the most relevant languages for data science.
For anyone who’s familiar with the field, this is where the question gets difficult.
Good time to call out productionalization of a trained model… can rewrite it in a low-level language for efficiency, or can scale the solution in the cloud.
I’ve intentionally ignored that
Let’s see if we can use some economic principles to help us expand our question.
Network Effects
Supply / Demand
Learning Curve
Network effect (value of X is amplified by Y connected nodes)
Number of developers that know the language (SO survey + google searches)
Number of google pages / SO answers / github libraries in the language
Number of libraries
Number of developers
Network effect (value of X is amplified by Y connected nodes)
Number of developers that know the language
Number of SO answers
Number of github libraries in the language
Observations:
Python’s network with frameworks
Python’s response size (network of people using it)
Relationship to data-science frameworks
Pandas, PyTorch, Tensorflow
All interlinked with Jupyter acting as a node
Y: Github Repos (including data-sci / machine-learning libraries, categorized by target language)
X: StackOverflow questions (including libraries, categorized by target language)
Bubblesize: Language popularity
Why no SQL?
I don’t have to re-invent what I can re-use
Someone else is bound to have hit the problem I’m facing
Data based on Kaggle datasets including the 2018 Kaggle survey and a job-demand dataset.
We can't fill the demand for Scala … in economic theory, if supply is below demand, we have to pay a premium to get it.
This is really important, let’s validate this.
Here’s SO’s pay-by-technology breakdown.
Let’s zoom in on the relevant entries
Note: Doesn’t account for cross-training.
Also, timely staffing when turnover occurs, and reduced poaching
On the fast learning curve, we get to being a practitioner much faster.
Even if reaching expert takes around the same time, the developer can be useful much sooner.
The faster something can be learned:
- The lower the up-front cost;
- The lower the barrier to entry;
- The greater the adoption
Leading to an amplified network effect
Virtuous cycle!
https://www.codingdojo.com/blog/python-perfect-beginners
Don’t need a homogeneous team!
Remember! Not mutually exclusive! Depending on the size of the team and the problem at hand … team makeup can vary significantly
Technical conclusion:
Roles <-> languages and knowledge
Which language has the right combination of features and is the most pliable across the data science process?
Our team, as a team, needs to know
The syntax, patterns, principles and utilization of Python
To understand which algorithms are appropriate to the problem
Remember the OSEMN model? We didn’t talk about the last step – Interpreting!
This is what makes it real for stakeholders. If you can’t explain what you did, why you did it, and what the results imply … then it was all for nought.
TRUST
Stakeholders and users of our model want to trust it.
If they don’t understand it, they don’t trust it.
What data did we obtain?
What did we do to scrub it?
Why did we choose these algorithms, and this training data?
What biases could remain?
Under what conditions does this start to break down?
The answer is actually this
Storytelling is clear communication in a natural language (English).
I hope you have enjoyed my storytelling today. Thank you.