Armando Vieira
Closer
Armando.lidinwise.com
1. Machine Learning: finding features, patterns & representations
2. The connectionist approach: Neural Networks
3. Applications
4. The Deep Learning “revolution”: a step closer to the brain?
5. Applications
6. The Big Data deluge: better algorithms & more data
Was “Deep Blue” intelligent?
How about Watson?
Or Google?
Have machines reached the intelligence level of a rat?
…
Let’s be pragmatic: I’ll call “intelligent” any device capable of surprising me!

[Diagram: Hebb, concept, MLP, Deep]

Architectures








1943 – McCulloch & Pitts + Hebb
1968 – Rosenblatt’s perceptron and the Minsky argument, or why a good theory may kill an even better idea
1985 – Rumelhart and the multilayer perceptron
2006 – Hinton and Deep Learning (Boltzmann networks)
All together: Watson, Google et al.






Input builds up on the receptors (dendrites)
The cell has an input threshold
When the cell’s threshold is breached, an activation is fired down the axon.
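A minimal sketch of this threshold behaviour in Python; the weights and the threshold value are arbitrary choices for illustration:

    import numpy as np

    def threshold_neuron(inputs, weights, threshold=1.0):
        """Fire (output 1) only when the accumulated input breaches the threshold."""
        activation = np.dot(inputs, weights)  # input builds up on the "dendrites"
        return 1 if activation >= threshold else 0

    # Hypothetical example: two excitatory inputs and one inhibitory input
    print(threshold_neuron([1, 1, 0], weights=[0.6, 0.6, -1.0]))  # 1 -> fires
    print(threshold_neuron([1, 1, 1], weights=[0.6, 0.6, -1.0]))  # 0 -> stays silent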
The visual cortex


A step closer to success thanks to a training algorithm: backpropagation
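A minimal sketch of what backpropagation does, assuming a single sigmoid hidden layer, squared-error loss and the classic XOR toy problem (not any particular application from these slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: learn XOR with one hidden layer
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
    W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    lr = 2.0
    for epoch in range(10000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: propagate the output error back to the hidden layer
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # gradient-descent updates
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

    print(out.round(2))  # should approach [0, 1, 1, 0] (some seeds may need more epochs)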




Training is nothing more than fitting: regression, classification, recommendations
The problem is that we have to find a way to represent the world (extract features)
FRUSTRATION
[Plot: a two-feature classification example, Age vs. Money]
A simpler hypothesis has a lower error rate
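One way to see this claim in code: fit a simple and a very flexible model to the same noisy data and compare them on held-out points. The data below is synthetic and the exact numbers are illustrative only:

    import numpy as np
    from numpy.polynomial import polynomial as P

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 40)
    y = 2 * x + rng.normal(scale=0.2, size=x.size)   # the true relation is linear + noise
    x_tr, y_tr = x[::2], y[::2]                      # train on every other point
    x_te, y_te = x[1::2], y[1::2]                    # test on the rest

    for degree in (1, 9):
        coefs = P.polyfit(x_tr, y_tr, degree)
        test_mse = np.mean((P.polyval(x_te, coefs) - y_te) ** 2)
        print(f"degree {degree}: test MSE = {test_mse:.3f}")
    # The simpler (degree-1) hypothesis typically generalizes better here.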






ANNs are very hard to optimize
Lots of local minima (traps for stochastic gradient descent)
Permutation invariance (no unique solution)
When to stop training?
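One practical answer to the stopping question is early stopping on a held-out validation split; a sketch using scikit-learn's MLPClassifier on synthetic data (the hyperparameters are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    clf = MLPClassifier(
        hidden_layer_sizes=(50,),
        early_stopping=True,       # hold out part of the training data...
        validation_fraction=0.1,   # ...and monitor the validation score
        n_iter_no_change=10,       # stop when it stops improving for 10 epochs
        max_iter=500,
        random_state=0,
    ).fit(X, y)

    print("stopped after", clf.n_iter_, "iterations")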








Neural networks are incredibly powerful algorithms
But they are also wild beasts that should be treated with great care
It’s very easy to fall into the GIGO trap
Problems like overfitting, suboptimal solutions, bad conditioning and wrong interpretation are common
Interpretation of outputs:
- Loss function
- Outputs ≠ probabilities
- Where to draw the line? (see the sketch below)
- Be VERY careful when interpreting the outputs of ML algorithms: you do not always get what you see
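A sketch of the "where to draw the line" point: a raw score only becomes a decision once you pick a threshold, and the right threshold depends on the costs of the two kinds of error, not on a default 0.5. The scores, labels and costs below are made up for illustration:

    import numpy as np

    scores = np.array([0.1, 0.4, 0.45, 0.6, 0.8, 0.95])  # model outputs (not calibrated probabilities)
    truth  = np.array([0,   0,   1,    0,   1,   1])

    def cost(threshold, fn_cost=5.0, fp_cost=1.0):
        pred = (scores >= threshold).astype(int)
        fn = np.sum((pred == 0) & (truth == 1))   # missed positives
        fp = np.sum((pred == 1) & (truth == 0))   # false alarms
        return fn * fn_cost + fp * fp_cost

    for t in (0.3, 0.5, 0.7):
        print(f"threshold {t}: expected cost = {cost(t)}")
    # With asymmetric costs the best cut-off is usually not 0.5.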
Input preparation:
- Clean & balance the data
- Normalize it properly
- Remove unneeded features, create new ones
- Handle missing values
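A sketch of a typical input-preparation pipeline for the points above, using scikit-learn; the column names and values are hypothetical, and class balancing (e.g. by resampling) is not shown:

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical raw data with missing values and an uninformative column
    df = pd.DataFrame({
        "age":      [25, 40, np.nan, 33, 58],
        "income":   [30_000, 52_000, 41_000, np.nan, 75_000],
        "constant": [1, 1, 1, 1, 1],            # carries no information
    })

    prep = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("drop",   VarianceThreshold()),               # remove zero-variance features
        ("scale",  StandardScaler()),                  # normalize properly
    ])

    X = prep.fit_transform(df)
    print(X.shape)  # (5, 2): the constant column is gone, the rest is scaled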






Rutherford Backscattering (RBS)
Credit Risk & Scoring
Churn prediction (CDR)
Prediction of hotel demand with Google trends
Adwords Optimization
Ion beam analysis
[Diagram: ion beam analysis techniques at MeV/amu energies: RBS, channelling, NRA, PIXE, ERDA]
[Figure: RBS spectra of a 25 Å Ge layer under 400 nm of Si; yield (arb. units) vs. channel (0–400) for (a) beam energies of 1.2, 1.6 and 2 MeV, (b) scattering angles of 120°, 140° and 180°, and (c) angles of incidence of 0°, 25° and 50°]
architecture                  train set error   test set error
(I, 100, O)                   6.3               11.7
(I, 250, O)                   5.2               10.1
(I, 100, 80, O)               3.6               5.3
(I, 100, 50, 20, O)           4.2               5.1
(I, 100, 80, 50, O)           3.0               4.1
(I, 100, 80, 80, O)           2.8               4.7
(I, 100, 50, 100, O)          3.0               4.2
(I, 100, 80, 80, 50, O)       3.2               4.1
(I, 100, 80, 50, 30, 20, O)   3.8               5.3
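As a rough illustration of the notation, an (I, 100, 80, O) network is a multilayer perceptron with two hidden layers of 100 and 80 units. A present-day sketch with scikit-learn follows; this is only an analogy for the architecture column, not the code that produced the errors in the table:

    from sklearn.neural_network import MLPRegressor

    # (I, 100, 80, O): input layer, hidden layers of 100 and 80 units, output layer
    model = MLPRegressor(hidden_layer_sizes=(100, 80), activation="logistic",
                         solver="adam", max_iter=1000, random_state=0)
    # model.fit(X_train, y_train)  # e.g. X: RBS spectra, y: layer depth and dose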
[Figure: ANN predictions vs. nominal values for the RBS analysis: (a) Dose_ANN vs. Dose_data and (b) Depth_ANN vs. Depth_data, both in units of 10^15 at/cm²]
[Figure: credit score surface Score(EBIT, current ratio) plotted over the eb and cr axes, with a before/after comparison]




Neural networks are good when: a lot of training data is available; variables are continuous; the relevant features are known; the mapping is unique.

Neural networks are less useful when: the problem is linear; there are few data compared to the size of the search space; the data are high-dimensional; there are long-range correlations.

They are black boxes
Characteristic                 Traditional methods (Von Neumann)          Artificial neural networks
Logic                          Deductive                                  Inductive
Processing principle           Logical                                    Gestalt
Processing style               Sequential                                 Distributed (parallel)
Functions realised through     Concepts, rules, calculations              Concepts, images, categories, maps
Connections between concepts   Programmed a priori                        Dynamic, evolving
Programming                    Through a limited set of rigid rules       Self-programmable (given an appropriate architecture)
Learning                       By rules                                   By examples (analogies)
Self-learning                  Through internal algorithmic parameters    Continuously adaptable
Tolerance to errors            Mostly none                                Inherent






ANNs are massive correlation & feature-extraction machines: isn’t that what intelligence is all about?
Knowledge is embedded in a messy network of weights
Capable of modelling an arbitrarily complex mapping






We need thousands of examples for training.
Why?
Prior

Algorithms are simple: complexity lies in the
data
Hinton et al, 2006









“Quasi”-unsupervised machines (see the sketch below)
Extract and combine subtle features in the data
Build high-level representations (abstractions)
Capable of knowledge transfer
Can handle (very) high-dimensional data
Are deep and broad: millions of synapses
Work both ways: up and down
Learn features that are not mutually exclusive
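A minimal sketch of the “quasi-unsupervised” idea in the spirit of Hinton’s 2006 work: a restricted Boltzmann machine learns features from digit images without using the labels, and those features then feed a supervised classifier. The scikit-learn components and hyperparameters below are illustrative:

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0                      # scale pixel values to [0, 1]

    model = Pipeline([
        ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X, y)                   # the RBM learns its features without looking at y
    print("training accuracy:", round(model.score(X, y), 3))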












Top on image identification (in some cases it beats humans)
Top on video classification
Top on real-time translation
Top on gene identification
Reverse engineering: can replicate complex human behaviour, like walking
Data visualization and text disambiguation (river-bank / bank-bailout)
Kaggle
[Diagram: two pipelines, Data → Features → Results]
[Diagram: Better Algorithms + Powerful Computers + Available Data = MAGIC]



In 2 years we produce more data (and garbage) than was accumulated over all of history
Zettabytes of data: 10^21 bytes produced every year
Machine learning molecules (**)







Most ML algorithms work better (sometimes much better) simply by throwing more data at them (see the sketch below)
And now we have more data. Plenty of it!
Which is signal and which is noise? Let the machines decide (they are good at it)
Where do humans stand in this equation? We are feeding the machines!
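A sketch of the “more data usually helps” point, using scikit-learn’s learning_curve on a synthetic classification problem; the model and dataset are arbitrary choices for illustration:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)

    sizes, _, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f"{n:5d} training examples -> cv accuracy {score:.3f}")
    # Accuracy typically keeps improving as more data is thrown at the model.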











Don’t look for causation; welcome correlations
Messiness: prepare to get your hands dirty
Don’t expect definitive answers. Only communists have them!
Stop searching for God’s equation
Keep theories at bay and let the data speak
Exactitude may not be better than “estimations”
Forget about keeping data clean and organized
Data is alive and wild. Don’t imprison it








Flu prediction
Netflix movie rating contest
New York City building security
Used cars
Veg food -> airport
Predicting rare events (fraud) and why it’s important










A step closer to the brain? Yes and No
What is missing?
Predictive analytics (crime before it occurs)?
Algorithms that learn & adapt
Replace humans?
Augment reality
Big Data & algorithms are revolutionizing the
world. Fast!









Recommendations (Amazon, Netflix, Facebook)
Trading (70% of Wall Street trading is done by them)
Identifying your partner, recruiting, votes
Images, video, voice, translation (real time)

Where are we heading?
NSA?
Black boxes?







Deeplearning.net
Hinton Google talks
“Too big to know”
Big Data: a new revolution that will transform
business
Machine Learning in R








Matlab (several codebases – google for them)
R (CRAN repository), Rminer
Python (scikit-learn)
C++ (mainly on GitHub)
Torch
More on Deeplearning.net
Recommend an unseen item i to a user u based on the engagement of other users with items 1 to 8.
Items recommended in this case are i2 followed by i1.
Item-based recommendation for a user u_a based on a neighbourhood of k = 3.
Items recommended in this case are i3 followed by i4.

(Item-based CF is superior to user-based CF, but it requires a lot of information, such as ratings or user interaction with the product.)
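A minimal sketch of item-based collaborative filtering with cosine similarity and k = 3 neighbours; the rating matrix and the numbers below are made up for illustration and are not the example from the figures above:

    import numpy as np

    # rows = users, columns = items; 0 means "not rated"
    R = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [1, 0, 0, 4],
        [0, 1, 5, 4],
    ], dtype=float)

    def cosine(a, b):
        mask = (a > 0) & (b > 0)                  # compare only co-rated entries
        if not mask.any():
            return 0.0
        return a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]) + 1e-9)

    def predict(user, item, k=3):
        sims = np.array([cosine(R[:, item], R[:, j]) if j != item else -1.0
                         for j in range(R.shape[1])])
        rated = [j for j in np.argsort(-sims) if R[user, j] > 0][:k]   # k most similar items the user rated
        if not rated:
            return 0.0
        w = sims[rated]
        return float(w @ R[user, rated] / (w.sum() + 1e-9))

    print(predict(user=1, item=2))   # predicted rating of the unseen item for user 1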

Machine Learning in the Age of Big Data: New Approaches and Business Applications


Editor's Notes

  • #2 Story telling : gripe google
  • #8 Invariants, parity, connexity