Open Data, Big Data
and Machine Learning
Steven Van Vaerenbergh
Universidad de Cantabria
May 31, 2016
#EMWeek16 - Santander
About me
Researcher in machine learning
gtas.unican.es/people/steven
twitter.com/steven2358
1. Open Data
Denmark’s Open Address Data Set
• Making public data “free of charge”

Period                        Benefits   Costs    Return on investment
2004-2009 (including setup)   >€60M      ~€2M     22:1
2010 (steady state)           ~€14M      €0.2M    70:1

Source: http://odimpact.org/static/files/case-study-denmark.pdf
Open Data in Santander
• Santander Datos Abiertos http://datos.santander.es/
• FIWARE lab: https://www.fiware.org/lab/
• FIWARE Academy: http://edu.fiware.org
Open data
• “A data set is open if it is available under a free
license to everyone”.
• Providers: Governments, public services,
companies, individuals.
• Tendency: Many data providers stop making apps
and leave this to third parties.
Open data improves transparency
Not all data should be open though (privacy)
2. Big Data
Big Data
• Scientific definition: “Data sets that are so large
that traditional data processing techniques cannot
be applied to them”.
• Terabytes, Petabytes, Exabytes, etc.
• “Big Data” is also used to refer to novel analysis
techniques for such data.
• Typically not open data.
Big Data = Data Science with Lots of Data
Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Big Data
• Many frameworks are being developed:
• Apache Hadoop
• Apache Mahout
• NoSQL
• Caution: The science behind big data is in its infancy.
E.g. most methods cannot produce error bars, which
are essential in many applications.
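To make "error bars" concrete, here is a minimal sketch of one common way to attach an uncertainty interval to an estimate: the percentile bootstrap. It is not taken from the talk; the function name `bootstrap_interval`, its parameters, and the sample values are assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_interval(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for the mean of `values` (illustrative helper)."""
    rng = random.Random(seed)
    # Resample the data with replacement many times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    # Cut off an equal share of resampled means on each side.
    tail = (1.0 - confidence) / 2.0
    lower = means[round(tail * n_resamples)]
    upper = means[round((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, for illustration only.
measurements = [2.1, 2.5, 1.9, 2.3, 2.8, 2.0, 2.4, 2.6]
low, high = bootstrap_interval(measurements)
print(f"mean = {statistics.fmean(measurements):.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

The interval needs many passes over the data, which hints at why producing error bars gets expensive when the data set is very large.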
Big Data
• Media and press often use “big data” to refer to
data science even if the amount of data is relatively
small.
→ “Big data” is often simply a marketing term.
Big Data = Data Science with Lots of Data?
Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
3. Machine Learning
Traditional Machine Intelligence
• Example: decision tree for determining access
Input: age, gender, occupation, …
Output: permission to enter Juanito’s tree house? (Yes / No)
Program consists of a set of rules (logic)
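A hand-written rule set of this kind can be sketched in a few lines. The slide does not state the actual rules, so the age limit and the excluded occupation below are hypothetical, as is the function name `may_enter_tree_house`.

```python
def may_enter_tree_house(age: int, occupation: str) -> bool:
    """Hand-designed decision tree; every rule here is a made-up example."""
    # Rule 1 (hypothetical): the tree house is for children only.
    if age > 12:
        return False
    # Rule 2 (hypothetical): pirates are not welcome.
    if occupation == "pirate":
        return False
    # Everyone else may enter.
    return True


print(may_enter_tree_house(8, "student"))   # child, allowed occupation
print(may_enter_tree_house(30, "teacher"))  # too old under rule 1
```

Each rule was chosen by the programmer. That works for a problem this small, and it is exactly what stops scaling in the next examples.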
Traditional Machine Intelligence
• Example: decision tree for digit recognition
Input: images (MNIST)
Output: which digit is represented?
Set of rules is very hard to design by hand
Traditional Machine Intelligence
• Example: decision tree for image recognition
Input: images (CIFAR10)
Output: what does the image represent?
Set of rules is impossible to design by hand
Machine Learning
• Solution: Let the program itself determine its
internal set of rules.
• Provide the program with inputs and the correct
answers for those inputs, and let it “learn”.
“Machine Learning is the study of computer algorithms that
improve their performance on a task automatically
through experience.”
- Tom Mitchell
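As a rough illustration of a program determining its own rule, the sketch below learns a single threshold rule (a "decision stump") from labelled examples. The function name `learn_threshold` and the age data are assumptions made up for this example, not material from the talk.

```python
def learn_threshold(examples):
    """Learn the rule 'x >= threshold means True' from (x, label) pairs.

    Tries every observed value as a threshold and keeps the one that
    misclassifies the fewest training examples.
    """
    observed = sorted({x for x, _ in examples})
    # Also try a threshold above every value, i.e. "always answer False".
    candidates = observed + [observed[-1] + 1]
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum((x >= threshold) != label for x, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Hypothetical training data: (age, was the person allowed in?)
training_data = [(5, False), (7, False), (8, True), (10, True), (12, True)]
print(learn_threshold(training_data))  # -> 8
```

Nobody typed the number 8 into the program. It was derived from the inputs and their correct answers, which is the sense of "learning" used on this slide.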
Traditional Machine Intelligence:
Input + Program → Computer → Output
Machine Learning (ML):
Input + Output → ML algorithm → Program
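The second arrangement, where example inputs and outputs go in and a program comes out, can be sketched with a function that returns another function. This sketch uses a 1-nearest-neighbour rule as the learning method, which is one possible choice and not something the talk prescribes. The name `ml_algorithm` and the step-frequency numbers are hypothetical.

```python
def ml_algorithm(inputs, outputs):
    """Take example inputs and outputs, return a program (a function)."""
    examples = list(zip(inputs, outputs))

    def program(x):
        # 1-nearest-neighbour: answer with the output of the closest known input.
        _, nearest_output = min(examples, key=lambda example: abs(example[0] - x))
        return nearest_output

    return program


# Hypothetical data: steps per second and the activity that produced them.
steps_per_second = [1.6, 1.8, 2.0, 2.9, 3.1, 3.3]
activities = ["walking", "walking", "walking", "running", "running", "running"]

program = ml_algorithm(steps_per_second, activities)
print(program(1.7))  # -> walking
print(program(3.0))  # -> running
```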
Machine Learning Applications
• Spam filters detect unsolicited emails
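One classic way to build such a filter is a naive Bayes classifier over word counts. The sketch below is a toy version under strong assumptions: the four training messages are invented, the name `train_spam_filter` is hypothetical, and both classes must appear in the training data. A real filter would need far more data and better text processing.

```python
import math
from collections import Counter


def train_spam_filter(messages):
    """messages: list of (text, is_spam) pairs. Returns a classifier function."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    vocabulary = set(word_counts[True]) | set(word_counts[False])

    def log_score(words, label):
        total_words = sum(word_counts[label].values())
        # Start from the prior: the share of training messages with this label.
        score = math.log(class_counts[label] / len(messages))
        for word in words:
            # Laplace (add-one) smoothing avoids taking the log of zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        return score

    def is_spam(text):
        words = text.lower().split()
        return log_score(words, True) > log_score(words, False)

    return is_spam


# Invented training messages, for illustration only.
is_spam = train_spam_filter([
    ("win money now", True),
    ("cheap money offer", True),
    ("meeting at noon", False),
    ("lunch at noon tomorrow", False),
])
print(is_spam("win cheap money"))  # -> True
print(is_spam("lunch at noon"))    # -> False
```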
Machine Learning Applications
• Biomedicine: pattern detection in images
Machine Learning Applications
• Computer Vision: Kinect body tracking
Machine Learning Applications
• Natural Language Processing (NLP)
Machine Learning Applications
1996: IBM’s Deep Blue (Chess)
• Intelligence based on manually-entered rules
2016: Google DeepMind’s AlphaGo (Go)
• Program learns autonomously
Machine Learning Applications
• Human activity recognition (e.g. running vs. walking)
Internal representation
How to represent the function from input to output?
• Neural networks
• Support vector machines
• Sets of rules / Logic programs
• Bayes/Markov nets
• Model ensembles
• Decision trees
• Etc.
Neural net demo: http://playground.tensorflow.org/
Tools and Frameworks
• Machine learning toolkits:
• scikit-learn (Python) http://scikit-learn.org/
• Weka (Java) http://www.cs.waikato.ac.nz/ml/weka/
• Shogun http://www.shogun-toolbox.org/
• Cloud-based machine learning:
• IBM Watson https://developer.ibm.com/watson/
• Amazon ML https://aws.amazon.com/machine-learning/
• Microsoft Azure ML https://azure.microsoft.com/en-us/services/machine-learning/
• Google Cloud ML https://cloud.google.com/products/machine-learning/
Takeaways
• Open data, big data and machine learning are
components of the current technological wave that
resembles an industrial revolution.
• Big data requires a rigorous scientific engineering
framework that is currently unfinished.
• Machine learning algorithms create intelligent
programs by automatically learning from example
data.
Join us on Meetup
Meetup group for people
in Santander & Cantabria
interested in everything
related to data science
www.meetup.com/Data-Science-Santander
