Data Science in
Ruby? Is it possible?
Is it Fast? Should we
use it?
• Rodrigo Urubatan
• rodrigo@urubatan.dev
• http://urubatan.dev
• http://twitter.com/urubatan
Anyone here work
with Data Science?
• Data Scientist?
• Data Engineer?
• Developers of applications that use data?
• Statisticians?
What exactly
is Data
Science?
The process of extracting meaning from data and
interpreting it
The usage of statistics and machine learning to clean
and manipulate data
The usage of computer software to collect, clean,
manipulate and interpret data
A cool name for the combination of Data Mining and
Business Intelligence (other buzz words that were used
for a long time for exactly what we call Data Science
today, but with more expensive tool sets)
Can Ruby do Data
Science?
Can Ruby do
Data Science?
(Long Answer)
INTEGRATION WITH
OTHER TOOLS
DATA
MANIPULATION
DISTRIBUTED
COMPUTING
DATA STRUCTURES
DATA SETS
STATISTICS
VISUALIZATION
INTERACTIVE
COMPUTING
Interactive
Computing
iruby — Ruby kernel
for Jupyter.
iruby-rails —
Integration library for
IRuby and Rails.
Standing on
the shoulders
of giants
(integration)
pycall — Bridge into
the Python world.
rserve-client — Ruby
connector for Rserve,
R's binary server.
Data
manipulation
kiba — lightweight Ruby ETL
(Extract-Transform-Load)
framework.
jongleur — Workflow
manager using DAG
definitions to execute ETL
tasks
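As a rough illustration of the Extract-Transform-Load idea behind these tools, here is a minimal sketch in plain Ruby. It does not use Kiba's or Jongleur's actual DSL; the method names (extract, transform, load_rows, run_etl) and the sample rows are made up for the example.

```ruby
# Plain-Ruby ETL sketch; hypothetical names, not the Kiba or Jongleur API.

# Extract: a source yields raw rows (hard-coded here; normally a file or database).
RAW_ROWS = [
  { "name" => " Alice ", "amount" => "10.5" },
  { "name" => "Bob",     "amount" => "n/a"  },
  { "name" => "Carol",   "amount" => "7"    }
].freeze

def extract
  RAW_ROWS.each
end

# Transform: clean one row; return nil to drop rows whose amount cannot be parsed.
def transform(row)
  amount = Float(row["amount"], exception: false)
  return nil if amount.nil?

  { name: row["name"].strip, amount: amount }
end

# Load: write the cleaned rows to a destination (an in-memory array here).
def load_rows(rows, destination)
  rows.each { |row| destination << row }
  destination
end

def run_etl
  cleaned = extract.lazy.map { |row| transform(row) }.reject(&:nil?)
  load_rows(cleaned, [])
end

etl_result = run_etl
puts etl_result.inspect
```

Rows flow lazily through the three stages, so a larger source would not need to fit in memory all at once.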
Distributed
Computing
ruby-spark — Ruby
Interface to Apache
Spark 1.x.x.
jruby-spark — JRuby
based bindings
for Apache Spark.
Data
Structures
daru — Data Frame and Vector
structures with comprehensive
manipulation and visualization methods.
numo-narray — n-dimensional
Numerical Array for Ruby.
nmatrix — dense and sparse linear
algebra library for Ruby via SciRuby.
Data Sets
rdatasets — Data sets
available in R via Rdatasets.
red-datasets — Growing
collection of publicly
available data sets such as
CIFAR-10, Iris, MNIST etc
Statistics
rb-gsl — Ruby interface to the GNU
Scientific Library. [dep: GSL]
simple_stats — Enumerable patches
for descriptive statistics.
enumerable-statistics — fast
implementation of descriptive
statistics for the Enumerable module.
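To make "descriptive statistics" concrete, here is a small sketch in plain Ruby. The module SimpleDescriptiveStats and its method names are hypothetical and written only for this example; the gems listed above expose their own APIs, which may differ.

```ruby
# Hypothetical helper module for illustration; not the API of any gem above.
module SimpleDescriptiveStats
  module_function

  def mean(values)
    values.sum.to_f / values.size
  end

  # Sample variance (divides by n - 1).
  def variance(values)
    m = mean(values)
    values.sum { |v| (v - m)**2 } / (values.size - 1)
  end

  def stddev(values)
    Math.sqrt(variance(values))
  end

  def median(values)
    sorted = values.sort
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
puts SimpleDescriptiveStats.mean(data)
puts SimpleDescriptiveStats.median(data)
puts SimpleDescriptiveStats.stddev(data)
```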
Visualization
• matplotlib — Ruby based wrapper
around matplotlib. [dep: matplotlib]
• mathematical — PNG and MathML
renderings for your equations.
• daru-view — daru-view is
an interactive plotting gem for web
applications (any Ruby web
application framework like
Rails/Sinatra/Nanoc/Hanami) and
IRuby notebooks. It is a plugin gem
for daru.
• daru-plotly — Plotly based
visualization for Daru.
The 3 Major
Ruby Data
Science
Projects
SciRuby project
NMatrix-centric gems
NMatrix
Daru
GnuplotRB
Statsample
Ruby Numo project
Numo::NArray-centric gems
Numo::NArray
Numo::FFTE
Numo::FFTW
Numo::Gnuplot
RedDataTools project
Apache Arrow centric gems
RedArrow
RedChainer
RedArrowGSL
RedArrowNMatrix
RedArrowNumoNArray
Doing data science in
Ruby is Hard!
Ruby
X
Python
Ruby
Daru
NMatrix/NArray
Python
Pandas
Numpy
Simple number operation with numpy
“Same” number operation with NMatrix
Simple DataFrame operation with Pandas
“Same” DataFrame operation with Daru
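As a rough idea of the kind of DataFrame operation being compared, here is a filter, a group-and-aggregate, and a column mean written in plain Ruby. This is only a sketch: it uses neither Daru nor Pandas, and the rows and column names are invented for the example.

```ruby
# Made-up rows standing in for a small data frame (one hash per row).
rows = [
  { city: "Porto Alegre", product: "book", sales: 120.0 },
  { city: "Porto Alegre", product: "pen",  sales: 30.0 },
  { city: "Denver",       product: "book", sales: 80.0 },
  { city: "Denver",       product: "pen",  sales: 20.0 }
]

# Filter rows, similar in spirit to a boolean mask on a data frame.
big_sales = rows.select { |row| row[:sales] >= 50.0 }

# Group by one column and sum another, similar in spirit to a group-by aggregate.
total_by_city = rows.group_by { |row| row[:city] }
                    .transform_values { |group| group.sum { |row| row[:sales] } }

# Pull out a single column as a plain array and compute its mean.
sales_column = rows.map { |row| row[:sales] }
mean_sales = sales_column.sum / sales_column.size

puts big_sales.inspect
puts total_by_city.inspect
puts mean_sales
```

A data frame library wraps these patterns behind column-oriented methods and adds indexing, missing-value handling and plotting on top.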
Ruby and Ruby on Rails are
way better for writing business
web applications!
We can even do
really good Machine
Learning with Ruby
(but that is a subject
for another
presentation)
And my objective is to
help Ruby developers use
the best tools for each job so
they can solve hard
problems with fewer bugs and
have more free time.
pycall to the
rescue
pycall lets you use Python libraries from
your Ruby code very naturally, as if you
were calling a Ruby library
pycall consists of a Ruby binding
library for libpython.so and an object-oriented
protocol for communication
between Ruby and Python
Simple pycall
code
Ok, so what
are the best
work
patterns?
Python is way better than Ruby for
Data Science
Ruby is better for web business
applications
Best patterns for integration are
(IMHO)
• Pointing both applications to the same
database
• Exchanging data through JSON or some similar
serialization
• Calling Python directly through pycall
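The JSON-exchange pattern from the list above can be sketched from the Ruby side as follows. This is a minimal sketch under assumptions: the field names ("model", "rows", "predictions"), the model name "churn", and the sample response string are all hypothetical, and the transport (HTTP, a queue, a file) is left out.

```ruby
require "json"

# Ruby side: serialize the observations the Python service should score.
# The "model" and "rows" keys are hypothetical, chosen for this example.
def build_request(observations)
  JSON.generate("model" => "churn", "rows" => observations)
end

# Ruby side: parse the reply and coerce each prediction to a Float.
# Assumes the Python side answers with {"predictions": [...]}.
def parse_response(body)
  payload = JSON.parse(body)
  payload.fetch("predictions").map { |value| Float(value) }
end

request_body = build_request([
  { "customer_id" => 1, "visits" => 12 },
  { "customer_id" => 2, "visits" => 3 }
])

# Stand-in for what a Python service might send back.
fake_response = '{"predictions": [0.12, 0.87]}'
predictions = parse_response(fake_response)

puts request_body
puts predictions.inspect
```

Keeping the contract this small means the Rails application and the Python code can be deployed and scaled separately, at the cost of serialization overhead on each call.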
References
• Ruby Conf 2017 – Using Ruby in Data Science by Kenta Murata (@mrkn)
• Big Data analysis in Ruby
• Lets do some (Data) Science in Ruby by Dan Carpenter (@dan_alyst)
• Progress of Ruby/Numo: Numerical Computing for Ruby
• SciRuby
• Ruby::Numo
• Ruby Machine Learning resources
• Ruby Data Science Resources
• PyCall
Any questions? Talk to
me!
• @urubatan
• https://urubatan.dev
• rodrigo@urubatan.dev
Other Data
Structure
Libraries
• spreadsheet — manipulation library for MS
Excel spreadsheets
• mdarray — Array structure for JRuby
• cumo — CUDA-aware numerical Array library
with an NArray-like interface.
Other statistics libraries
statsample — basic and advanced statistics for Ruby. [dep: GSL]
statsample-glm — extension of statsample by Generalized Linear Models.
statsample-bivariate-extension — extension of statsample by Bivariate Correlations.
statsample-timeseries — extension of statsample by Time Series estimators.
pca — Principal Component Analysis (PCA) in Ruby.
descriptive-statistics — descriptive extensions for the Enumerable module or standalone usage.
distribution — probabilistic distributions and descriptive measures for them.
statistics2 — Normal, Chi-square, t- and F- probability distributions for Ruby.
General
Format IO
• https://github.com/fiksu/rcsv
• ox — Optimized for speed XML parser and
object marshaller.
• oj — High-speed JSON parser.
• Markdown
• Nokogiri
• CSV
Database
Adapters
• pg
• Mongo
• MySQL

Editor's Notes

  • #2 53s
  • #3 Try to interact with the audience 41s (1:34)
  • #4 Quick comment of what is data science 1:44s (3:15)
  • #5 Quick answer: Yes, but let's dive a little into that. You can do everything, but whether you should depends on what you want to do 43s (3:58)
  • #6 1:53s (5:51) There are lots of data science libraries for Ruby: for statistics, data manipulation, data visualization, integration with Python and R, distributed computing, and machine learning. It appears we have everything we need! But not everything is as great as it seems, so let's check some of the options in depth.
  • #7 45s (6:36)
  • #8 38s (7:14)
  • #9 1:14 (8:28)
  • #10 44s (10:12)
  • #11 25s (10:37)
  • #12 28s (11:05)
  • #13 1:25 (12:30)
  • #14 1:45 (14:15)
  • #15 4:48 (19:03)
    SciRuby
    Drawbacks:
    - NMatrix is slow for large amounts of data (there is a bug open for that)
    - Daru has less functionality than Pandas for practical DS work
    - There is a lot less documentation
    Benefits:
    - You only need Ruby
    - NMatrix supports in-memory sparse matrices
    - You can use data frames with Daru. Data frames are the basic data structure to manipulate and visualize living data in data science: a 2D table data structure, like a SQL table
    Ruby Numo
    Benefits:
    - You only need Ruby
    - Numo::NArray is faster than NMatrix and pure Ruby
    Drawbacks:
    - No sparse matrix support
    - No data frame support
    - Even less documented
    In summary: for data science, SciRuby is better because it has Daru; for scientific computing, Numo is better because NMatrix is too slow.
    But I didn't forget about RedDataTools. It supports Apache Arrow, and the core developer, Kouhei Sutou, is also a member of the Apache Arrow PMC. But it is too young to use in production, and right now it only supports data I/O; manipulation is not supported.
  • #16 10s (19:13)
  • #17 54s (20:07) The most used libraries for data cleaning and transformation in Python are Pandas and NumPy, and we have the corresponding Daru and NMatrix/NArray. But there are some problems. For starters, the documentation of the Ruby versions is ages behind the Python libraries, mainly because there are a lot fewer users. Also, Daru has fewer features than Pandas, NMatrix gets slow for big amounts of data, and NArray is a lot faster but not compatible with Daru. But things are improving.
  • #18 50s (20:57)
  • #19 1:36s (21:33)
  • #20 31s (22:04)
  • #21 51s (22:55)
  • #22 20s (23:15)
  • #23 10s (23:25)
  • #24 15s (23:40)
  • #25 1:13 (24:53)
  • #26 1:08s (26:01) PyCall can work with most Python libraries, but to make our lives easier it already has wrappers for numpy, pandas, matplotlib, seaborn, scikit-learn, and tensorflow. Even wrapping Python libraries, it is a lot faster than using the native Ruby libraries (thanks, Kenta Murata, for this great project)
  • #27 1:45s (27:46)
  • #28 1:11 (28:57)
  • #29 40s