4. What is the R language?
• Open source: the "lingua franca" of analytics, computing, and modeling
• Ecosystem: a global community with millions of users and 7000+ algorithms, test data, and evaluations
• Scalability: can be scaled to big data and big analytics
5. Polls of data miners and analytics professionals on their software
choices since 2007
Source: http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
6. R is developed and maintained by the open source community
CRAN – the Comprehensive R Archive
Network
Package repository of R
7500+ packages, covering all aspects of
statistical analysis, machine learning, natural
language processing …
Still growing exponentially
Free!
Source: http://r4stats.com/2014/04/07/r-continues-its-rapid-growth/
13. Mean Error (ME) - Average forecasting error (an error is the difference between the
predicted value and the actual value) on the test dataset
Root Mean Squared Error (RMSE) - The square root of the average of squared errors of
predictions made on the test dataset.
Mean Absolute Error (MAE) - The average of absolute errors
Mean Percentage Error (MPE) - The average of percentage errors
Mean Absolute Percentage Error (MAPE) - The average of absolute percentage errors
Mean Absolute Scaled Error (MASE)
Symmetric Mean Absolute Percentage Error (sMAPE)
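The measures above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code: the function name `forecast_errors` is hypothetical, and the slide leaves several conventions open, so the sketch assumes error = actual - predicted, percentage errors relative to the actual value, MASE scaled by the in-sample MAE of a one-step naive forecast, and sMAPE with the (|actual| + |predicted|) / 2 denominator.

```python
import math


def forecast_errors(actual, predicted, train):
    """Accuracy measures for one test set (hypothetical helper).

    Assumed conventions (not stated on the slide):
      - error = actual - predicted
      - percentage errors are relative to the actual value (must be non-zero)
      - MASE scales MAE by the in-sample MAE of the naive forecast y[t] = y[t-1]
      - sMAPE uses the (|actual| + |predicted|) / 2 denominator
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    mae = sum(abs(e) for e in errors) / n

    # In-sample MAE of the one-step naive forecast on the training series
    naive_mae = sum(
        abs(train[i] - train[i - 1]) for i in range(1, len(train))
    ) / (len(train) - 1)

    smape = sum(
        200.0 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)
    ) / n

    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": mae,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
        "MASE": mae / naive_mae,
        "sMAPE": smape,
    }


# Usage with made-up numbers
metrics = forecast_errors(
    actual=[100, 200], predicted=[90, 220], train=[10, 12, 14, 16]
)
```

With the opposite sign convention (predicted - actual), ME and MPE flip sign while the other measures are unchanged.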
15. CRAN R | Microsoft R Open | Microsoft R Server
Data size: In-memory | In-memory | In-memory or disk based
Speed of analysis: Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers
Support: Community | Community | Community + commercial
Analytic breadth & depth: 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
License: Open source | Open source | Commercial license. Supported release with indemnity
17. Supports standard Python library types such as pandas data frames and NumPy arrays.
Python code execution is based on Anaconda 2.1, which comes with close to 200 of the most common Python packages (such as NumPy, SciPy, and scikit-learn).
Can output images generated with Matplotlib.
22. Data is growing faster than processing
speeds
The only solution is to parallelize data
processing on large clusters
Example: HDInsight
23. What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
• In-memory computing primitives
• General computation graphs
Up to 100× faster
Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Often 2-10× less code
Spark was started by Matei Zaharia at the UC Berkeley AMPLab in 2009, open sourced in 2010, and donated to Apache in 2013
34. Microsoft won first place in every ImageNet 2015 competition category using deep learning
ImageNet classification top-5 error (%):
• ILSVRC 2010, NEC America: 28.2
• ILSVRC 2011, Xerox: 25.8
• ILSVRC 2012, AlexNet: 16.4
• ILSVRC 2013, Clarifai: 11.7
• ILSVRC 2014, VGG: 7.3
• ILSVRC 2014, GoogLeNet: 6.7
• ILSVRC 2015, MSRA ResNet: 3.5
All 5 of Microsoft's entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation
35. CNTK At the Heart: Computational Networks
•A generalization of machine learning models that can be
described as a series of computational steps.
• E.g., DNN, CNN, RNN, LSTM, DSSM, Seq2Seq, Log-linear model
•Representation:
• A list of computational nodes denoted as
n = {node name : operation name}
• The parent-children relationship describing the operands
{n : c_1, ···, c_{K_n}}
• K_n is the number of children of node n. For leaf nodes K_n = 0.
• Order of the children matters: e.g., XY is different from YX
• Given the inputs (operands) the value of the node can be computed.
•Can flexibly describe deep learning models.
• Adopted by many other popular tools as well
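The representation above can be sketched as two dictionaries plus a recursive evaluator. This is an illustrative sketch only, not CNTK code: the operation names ("Input", "Times", "Plus"), the node names, and the helper functions are all hypothetical, and matrices are plain nested lists to keep the example self-contained.

```python
def matmul(x, y):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def plus(x, y):
    """Element-wise sum of two nested-list matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Hypothetical operation table; the names are illustrative, not CNTK's API
OPERATIONS = {"Times": matmul, "Plus": plus}


def evaluate(node, ops, children, inputs, cache=None):
    """Compute a node's value from its operands, memoizing shared nodes."""
    cache = {} if cache is None else cache
    if node not in cache:
        operands = children.get(node, [])
        if not operands:  # leaf node: no children, value comes from inputs
            cache[node] = inputs[node]
        else:
            values = [evaluate(c, ops, children, inputs, cache) for c in operands]
            cache[node] = OPERATIONS[ops[node]](*values)
    return cache[node]


# {node name : operation name}
ops = {"X": "Input", "Y": "Input", "XY": "Times", "YX": "Times", "S": "Plus"}
# {node : ordered children}; the order matters, so XY and YX differ
children = {"XY": ["X", "Y"], "YX": ["Y", "X"], "S": ["XY", "YX"]}
inputs = {"X": [[1, 2], [3, 4]], "Y": [[0, 1], [1, 0]]}

xy = evaluate("XY", ops, children, inputs)  # [[2, 1], [4, 3]]
yx = evaluate("YX", ops, children, inputs)  # [[3, 4], [1, 2]]
s = evaluate("S", ops, children, inputs)    # [[5, 5], [5, 5]]
```

Swapping the order of the children of a "Times" node changes the result, which is why the child list is ordered rather than a set.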
37. “CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
[Figure: speed comparison (samples/second), higher = better, of CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); note: December 2015]
Theano only supports 1 GPU
Achieved with 1-bit gradient quantization algorithm
* TensorFlow added distributed compute support in April 2016
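The slide names 1-bit gradient quantization without describing it. The sketch below shows the general idea under stated assumptions: each gradient value is reduced to one bit, the two reconstruction levels are the means of the non-negative and negative values, and the quantization error is carried over to the next minibatch. The function names are hypothetical and this is not CNTK's implementation.

```python
def one_bit_quantize(gradient, residual):
    """Quantize (gradient + carried-over residual) to one bit per value.

    Returns (bits, (neg_level, pos_level), new_residual). Assumed scheme:
    the two reconstruction levels are the means of the negative and the
    non-negative corrected values; the error goes into new_residual.
    """
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [1 if v >= 0 else 0 for v in corrected]

    positives = [v for v in corrected if v >= 0]
    negatives = [v for v in corrected if v < 0]
    pos_level = sum(positives) / len(positives) if positives else 0.0
    neg_level = sum(negatives) / len(negatives) if negatives else 0.0

    decoded = dequantize(bits, (neg_level, pos_level))
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, (neg_level, pos_level), new_residual


def dequantize(bits, levels):
    """Rebuild approximate gradient values from bits and the two levels."""
    neg_level, pos_level = levels
    return [pos_level if b else neg_level for b in bits]


# Usage: only the bits and the two levels would need to be exchanged
# between workers; the residual stays local and is added to the next gradient.
gradient = [0.5, -1.0, 1.5, -3.0]
bits, levels, residual = one_bit_quantize(gradient, [0.0] * len(gradient))
```

Carrying the residual forward means the quantization error is delayed rather than discarded, which is the property usually credited for keeping accuracy intact in this kind of scheme.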
38. Microsoft researcher Slawek Smyl won CIF 2016 using an LSTM neural network powered by CNTK
39. CIF Competition 2016 – Final Results
• Contestant 1 – Slawek Smyl (LSTM-based NN on deseasonalized data)
• Contestant 2 – Slawek Smyl (weighted average of his 3 methods)
• Contestant 3 – Prof. Sven Crone (Multilayer Perceptron with a thorough feature search)
• Contestant 4 – Mikhail Artyukhov (previous competition winner, ensemble models)
• Contestant 5 – Joerg Wichard, Bayer Healthcare AG (Adaptive Forecasting Strategy with Hybrid Ensemble Models)
• Contestant 6 – Slawek Smyl (LSTM-based NN)
42. (1) Kai Chen and Qiang Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering”, in International Conference on Acoustics, Speech and Signal Processing, March 2016, Shanghai, China.
43. CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux
CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy, helped by a new reader design
The network definition language, macros, and model editing language (as well as Python and C++ bindings in the future) make network design and modification easy
Compared to other tools, CNTK strikes a good balance between efficiency, performance, and flexibility
45. Mahout | Spark ML | Azure ML | R Server
Shared Service: No | No | Yes | No
Deployment Model: PaaS | PaaS | PaaS | IaaS
Extensibility: High | High | Medium | High
Deployment Complexity: Medium | High | Low | Medium
Cost: High | High | Low | High
Programming Languages: Java/Scala | Scala/Java/Python | Python/R | R
Algorithms: Limited (growing) | MLlib/scikit | Many (scikit/CRAN) | Many (CRAN)
Scalability: High | High | Medium | Medium
55. ConnectR
• High-speed & direct
connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed
format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-Use high-performance
big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R
algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
Available on:
• Windows Servers
• Red Hat and SuSE Linux Servers
• Teradata Database
• Cloudera Hadoop
• Hortonworks Hadoop
• MapR Hadoop
R+CRAN
• Open source R interpreter
• R 3.2.2
• Freely-available huge range of R
algorithms
• Algorithms callable by RevoR
• 100% Compatible with existing R scripts,
functions and packages
RevoR
• Performance enhanced R
interpreter
• Based on open source R
• Adds high-performance
math library to speed up
linear algebra functions
R Open, Microsoft R Server
DeployR, DevelopR
56. Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
• rxDataStep
• rxExec
• PEMA-R API Custom Algorithms
57. Additional Resources
•CNTK:
• https://github.com/Microsoft/CNTK
• Contains all the source code and example setups
• You may understand better how CNTK works by reading the source code
• New features are added constantly
•How to contact:
• CNTK team: ask a question on CNTK GitHub!
• Alexey:
• Email: alexey.kamenev@microsoft.com
• LinkedIn: https://www.linkedin.com/in/alexeykamenev