1
Natalino Busa - @natbusa
Natalino Busa
Head of Data Science Teradata
2
Natalino Busa - @natbusa
3
Natalino Busa - @natbusa
4
Natalino Busa - @natbusa
5
Natalino Busa - @natbusa
6
Natalino Busa - @natbusa
What about (data) science?
- technologies and tools are driving innovation in data analytics -
7
Natalino Busa - @natbusa
Man - Machine
as cognitive systems
8
Natalino Busa - @natbusa
Learning: The Scientific Method
Ørsted's "First Introduction to General Physics" (1811)
https://en.m.wikipedia.org/wiki/History_of_scientific_method
observation hypothesis deduction synthesis
Hans Christian Ørsted
experiment
Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY
9
Natalino Busa - @natbusa
Innovation in Data Analytics
Cloud Community AI & ML
10
Natalino Busa - @natbusa
Cloud
11
Natalino Busa - @natbusa
“we live in an age of open source datacenters, so
we can stack all these things together and we
have open source from the ground to ceiling.”
Sam Ramji, CEO of Cloud Foundry
https://www.youtube.com/watch?v=7oCSFcUW-Qk
12
Natalino Busa - @natbusa
Analytics in the cloud
Bare Metal: Physical Machines
IAAS: Virtual Resources
CAAS: Containers,
dPAAS: Datastores, Data Engines
iPAAS: Tools Integration, Flows & Processes
DAAAS: Data Analytics as a Service
13
Natalino Busa - @natbusa
DAAAS: AI and ML API’s
Cloud Computing for Deep Neural Networks
> Models, Compute (Train, Score), and Data
AI and ML models for:
● Speech (audio)
● Language (text)
● Vision (images/video)
● Data (classification, regression, clustering, anomaly detection)
14
Natalino Busa - @natbusa
Ephemeral Computing Clusters on a Cloud
data
create load compute store
timeline
destroy
15
Natalino Busa - @natbusa
dPaaS: Analytical clusters
Ephemeral
Short-Lived
Data Exploration
Isolated, Personal
Simple Access Management
Permanent
Long Lived
Production / Operations
Co-Ordinated
Complex Access Management
vs
16
Natalino Busa - @natbusa
GPU’s and Distributed Computing
GPU support is coming in Kubernetes, Mesos, Spark
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
http://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
out
up
CPU
R,Python
Spark
TensorFrames
17
Natalino Busa - @natbusa
Community
18
Natalino Busa - @natbusa
Community
Develop - Use - Share
19
Natalino Busa - @natbusa
Sharing is caring … speed
github.com + Jupyter notebooks,
share ideas, code, and data
arxiv.org
share innovation and scientific results
20
Natalino Busa - @natbusa
Artificial Intelligence
Machine Learning
21
Natalino Busa - @natbusa
Google: open-sources NLP parser
scoring 95% in grammar accuracy
https://github.com/tensorflow/models/tree/master/syntaxnet
22
Natalino Busa - @natbusa
Deep Learning in Language Parsing
https://github.com/tensorflow/models/blob/master/syntaxnet/ff_nn_schematic.png
23
Natalino Busa - @natbusa
Semantic Search: TDA + NNs Word2Vec, Par2Vec, Doc2Vec
https://arxiv.org/pdf/1405.4053v2.pdf
https://arxiv.org/pdf/1301.3781v3.pdf
24
Natalino Busa - @natbusa
Lip reading
LipNet achieves 93.4% accuracy,
on GRID corpus.
https://arxiv.org/pdf/1611.01599v1.pdf
25
Natalino Busa - @natbusa
Ask me Anything
Dynamic Memory Networks
for Natural Language
Processing
https://arxiv.org/pdf/1603.01417v1.pdf
https://youtu.be/oGk1v1jQITw
Caiming Xiong,
Stephen Merity,
Richard Socher
26
Natalino Busa - @natbusa
Ask me Anything
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
Dynamic Memory Networks for Natural Language Processing
https://arxiv.org/pdf/1603.01417v1.pdf
http://www.socher.org/
Local
context
Wider
context
NLP, Attention Masks
Semantic Embeddings from Text, Images
27
Natalino Busa - @natbusa
Network Traffic Patterns Classification
28
Natalino Busa - @natbusa
Network Intrusion Detection
http://billsdata.net/?p=105
It contains 130 million flow records involving
12,027 distinct computers over 36 days (not
the full 58 days claimed for the entire data
release).
Each record consists of: time (to nearest
second), duration, source and destination
computer ids, source and destination ports,
protocol, number of packets and number of
bytes
Techniques: TDA, Dimensionality Reduction
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
29
Natalino Busa - @natbusa
Approaching (Almost) Any Machine Learning Problem
- Abhishek Thakur, Kaggle Grandmaster -
data labels
raw data: tables, files Useful dataData munging Feature
Engineering
Tabular Data ready for ML
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-l
earning-problem-abhishek-thakur/
30
Natalino Busa - @natbusa
AutoML challenge
- based on scikit-learn
- 15 classifiers,
- 14 feature preprocessing methods
- 4 data preprocessing methods
- 110 hyperparameters
- Supervised classification challenge:
100 different datasets
Natalino Busa - @natbusa
31
Natalino Busa - @natbusa
Artificial + Human Intelligence
32
Natalino Busa - @natbusa
Human cognitive biases :
Too much information
Not enough meaning
What should we
remember?
Need to act fast
https://en.wikipedia.org/wiki/List_of_cognitive_biases
33
Natalino Busa - @natbusa
Man vs Machine cognitive limits
Model generation
Explanation
Unsupervised
Planning
Too much information
Not enough meaning
Need to act quickly
Memory limits
34
Natalino Busa - @natbusa
Theorems often tell us complex truths about the simple things,
but only rarely tell us simple truths about the complex ones
Marvin Minsky
K-Linesː A Theory of Memory (1980)
35
Natalino Busa - @natbusa
Data Science: wear the AI/ML Lenses
We are entering a new era of intelligent machines
Boost our understanding of data
Focus on higher level analyses
36
Natalino Busa - @natbusa
Intelligent Data Systems:
Long live the “database”
Wikipedia:
A database is an organized collection of data.
DATA
New-SQL
ML
AI
SQL
Python - Scala - R
NLP
UX
Speech
COG
37
Natalino Busa - @natbusa
The Database.
is never going to be the same.
38
Natalino Busa - @natbusa
Thank you.
@natbusa
39
Natalino Busa - @natbusa
credits
40
Natalino Busa - @natbusa
bonus slides

Are we reaching a Data Science Singularity? How Cognitive Computing is emerging from Machine Learning Algorithms, Big Data Tools, and Cloud Services by Natalino Busa

  • 2.
    1 Natalino Busa -@natbusa Natalino Busa Head of Data Science Teradata
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
    6 Natalino Busa -@natbusa What about (data) science? - technologies and tools are driving innovation in data analytics -
  • 8.
    7 Natalino Busa -@natbusa Man - Machine as cognitive systems
  • 9.
    8 Natalino Busa -@natbusa Learning: The Scientific Method Ørsted's "First Introduction to General Physics" (1811) https://en.m.wikipedia.org/wiki/History_of_scientific_method observation hypothesis deduction synthesis Hans Christian Ørsted experiment Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY
  • 10.
    9 Natalino Busa -@natbusa Innovation in Data Analytics Cloud Community AI & ML
  • 11.
    10 Natalino Busa -@natbusa Cloud
  • 12.
    11 Natalino Busa -@natbusa “we live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.” Sam Ramji, CEO of Cloud Foundry https://www.youtube.com/watch?v=7oCSFcUW-Qk
  • 13.
    12 Natalino Busa -@natbusa Analytics in the cloud Bare Metal: Physical Machines IAAS: Virtual Resources CAAS: Containers, dPAAS: Datastores, Data Engines iPAAS: Tools Integration, Flows & Processes DAAAS: Data Analytics as a Service
  • 14.
    13 Natalino Busa -@natbusa DAAAS: AI and ML API’s Cloud Computing for Deep Neural Networks > Models, Compute (Train, Score), and Data AI and ML models for: ● Speech (audio) ● Language (text) ● Vision (images/video) ● Data (classification, regression, clustering, anomaly detection)
  • 15.
    14 Natalino Busa -@natbusa Ephemeral Computing Clusters on a Cloud data create load compute store timeline destroy
  • 16.
    15 Natalino Busa -@natbusa dPaaS: Analytical clusters Ephemeral Short-Lived Data Exploration Isolated, Personal Simple Access Management Permanent Long Lived Production / Operations Co-Ordinated Complex Access Management vs
  • 17.
    16 Natalino Busa -@natbusa GPU’s and Distributed Computing GPU support is coming in Kubernetes, Mesos, Spark https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus http://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark out up CPU R,Python Spark TensorFrames
  • 18.
    17 Natalino Busa -@natbusa Community
  • 19.
    18 Natalino Busa -@natbusa Community Develop - Use - Share
  • 20.
    19 Natalino Busa -@natbusa Sharing is caring … speed github.com + Jupyter notebooks, share ideas, code, and data arxiv.org share innovation and scientific results
  • 21.
    20 Natalino Busa -@natbusa Artificial Intelligence Machine Learning
  • 22.
    21 Natalino Busa -@natbusa Google: open-sources NLP parser scoring 95% in grammar accuracy https://github.com/tensorflow/models/tree/master/syntaxnet
  • 23.
    22 Natalino Busa -@natbusa Deep Learning in Language Parsing https://github.com/tensorflow/models/blob/master/syntaxnet/ff_nn_schematic.png
  • 24.
    23 Natalino Busa -@natbusa Semantic Search: TDA + NNs Word2Vec, Par2Vec, Doc2Vec https://arxiv.org/pdf/1405.4053v2.pdf https://arxiv.org/pdf/1301.3781v3.pdf
  • 25.
    24 Natalino Busa -@natbusa Lip reading LipNet achieves 93.4% accuracy, on GRID corpus. https://arxiv.org/pdf/1611.01599v1.pdf
  • 26.
    25 Natalino Busa -@natbusa Ask me Anything Dynamic Memory Networks for Natural Language Processing https://arxiv.org/pdf/1603.01417v1.pdf https://youtu.be/oGk1v1jQITw Caiming Xiong, Stephen Merity, Richard Socher
  • 27.
    26 Natalino Busa -@natbusa Ask me Anything http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial Dynamic Memory Networks for Natural Language Processing https://arxiv.org/pdf/1603.01417v1.pdf http://www.socher.org/ Local context Wider context NLP, Attention Masks Semantic Embeddings from Text, Images
  • 28.
    27 Natalino Busa -@natbusa Network Traffic Patterns Classification
  • 29.
    28 Natalino Busa -@natbusa Network Intrusion Detection http://billsdata.net/?p=105 It contains 130 million flow records involving 12,027 distinct computers over 36 days (not the full 58 days claimed for the entire data release). Each record consists of: time (to nearest second), duration, source and destination computer ids, source and destination ports, protocol, number of packets and number of bytes Techniques: TDA, Dimensionality Reduction https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
  • 30.
    29 Natalino Busa -@natbusa Approaching (Almost) Any Machine Learning Problem - Abhishek Thakur, Kaggle Grandmaster - data labels raw data: tables, files Useful dataData munging Feature Engineering Tabular Data ready for ML http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-l earning-problem-abhishek-thakur/
  • 31.
    30 Natalino Busa -@natbusa AutoML challenge - based on scikit-learn - 15 classifiers, - 14 feature preprocessing methods - 4 data preprocessing methods - 110 hyperparameters - Supervised classification challenge: 100 different datasets Natalino Busa - @natbusa
  • 32.
    31 Natalino Busa -@natbusa Artificial + Human Intelligence
  • 33.
    32 Natalino Busa -@natbusa Human cognitive biases : Too much information Not enough meaning What should we remember? Need to act fast https://en.wikipedia.org/wiki/List_of_cognitive_biases
  • 34.
    33 Natalino Busa -@natbusa Man vs Machine cognitive limits Model generation Explanation Unsupervised Planning Too much information Not enough meaning Need to act quickly Memory limits
  • 35.
    34 Natalino Busa -@natbusa Theorems often tell us complex truths about the simple things, but only rarely tell us simple truths about the complex ones Marvin Minsky K-Linesː A Theory of Memory (1980)
  • 36.
    35 Natalino Busa -@natbusa Data Science: wear the AI/ML Lenses We are entering a new era of intelligent machines Boost our understanding of data Focus on higher level analyses
  • 37.
    36 Natalino Busa -@natbusa Intelligent Data Systems: Long live the “database” Wikipedia: A database is an organized collection of data. DATA New-SQL ML AI SQL Python - Scala - R NLP UX Speech COG
  • 38.
    37 Natalino Busa -@natbusa The Database. is never going to be the same.
  • 39.
    38 Natalino Busa -@natbusa Thank you. @natbusa
  • 40.
    39 Natalino Busa -@natbusa credits
  • 41.
    40 Natalino Busa -@natbusa bonus slides