Big Data and Machine Learning
An Introduction to Key Ideas
Mauritian JEDI
Bruce Bassett
bruce@saao.ac.za
AIMS/SAAO/UCT
Jan 2015
History of the JEDI concept
We developed the format at several SA workshops (2005-
2008)
NRF-Royal Society 5 year Bilateral with Portsmouth, Sussex
and Oxford: train new researchers & do excellent
cosmology research
• JEDI 1 – Langebaan 2008
• JEDI 2 – STIAS/Avalon 2008
• We are now past JEDI X…
Aim of the JEDI series: explore to find the most efficient way
of teaching & learning research, building new
collaborations and doing excellent research
“Sciama” Principles
• Creativity has to be nurtured creatively
• Ideas are a non-linear function of interaction – want as much
discussion/interaction as possible
• Learning is most efficient when it is fun, informal and playful.
• Academia is a small-world network…
• Hence personal contacts and networking are crucial for progress
• Being part of the “fratelli fisici” (Coleman) is important. People
need to know and trust you…
“Google” Principles
• Take good people and treat them really well.
• Trust that good things will come out…things that you can’t
predict beforehand.
• Get out of your comfort zone!
“Creativity requires chaos”. Talk to people you would not
normally talk to. Do things that scare you!
• Attitude and atmosphere are crucial: be friendly, have fun,
relax, enjoy yourself, be proactive, interact, work hard.
How does the JEDI work?
• Research is best learned by doing it with people who
do it better than, or differently from, you.
• Work with a “screw-it let’s do it” attitude
• Work on coming up with and evaluating new ideas
• Work on real research projects in teams.
• You choose the projects you are interested in and how
you spend your time.
Success on different timescales
• 1-3 years: are there any ongoing projects between people
who met at the JEDI?
• 10-20 years: successful if two people can look back and say,
“actually, I first worked with or became good friends with X at JEDI,
and we have since written papers together; they took my
students as post-docs, wrote a letter of reference for
me, examined my student’s thesis, refereed my grant,
helped get me promoted, etc…”
Brain Teaser
• A man tosses a coin 30 times and it comes up
heads 30 times in a row.
• What is the probability that it comes up heads
on the 31st coin toss?
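One way to formalize the intuition behind the teaser: a fair coin always gives 1/2, but 30 heads in a row is strong evidence the coin is biased. The sketch below (an addition, not from the slides) assumes a uniform prior over the coin's bias, under which the posterior predictive probability of heads is given by Laplace's rule of succession, (k+1)/(n+2):

```python
from fractions import Fraction

def posterior_predictive_heads(heads: int, tosses: int) -> Fraction:
    """Probability that the next toss is heads, under a uniform
    Beta(1, 1) prior on the coin's bias (Laplace's rule of succession)."""
    return Fraction(heads + 1, tosses + 2)

# A fair coin would give 1/2, but 30 heads in 30 tosses shifts the
# posterior heavily toward a head-biased coin:
p = posterior_predictive_heads(30, 30)
print(p, float(p))  # 31/32 0.96875
```

So a Bayesian who starts agnostic about the coin's bias would bet heads at about 97%, not 50%.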
What is the scientific method?
• What is the first thing we do when we try to
understand something with physics/applied
mathematics?
• We build a toy model of it, a representation,
that we can study.
• We then study this simplified model and make
predictions.
Machine Learning
• In machine learning, we do the same. We
must choose a set of features that we think
are the most important to achieve our goals
• We then train the machine learning algorithm, and use
it to make predictions.
Data Science in 3 nutshells
www.quora.com
The Deeper Drivers
Data Science is really driven by the intersection of:
• Moore’s Law – cheaper, faster, smaller…
• Development of powerful, fast new algorithms that
take advantage of the computing power (e.g. Bayesian
methods)
• Turing completeness, which allows near-universal
application of the algorithms…
Moore’s Law applies to lots of things…
250,000× more storage and
about 10× cheaper!
The Lean Startup Model
• What we are trying to do is very close to running a
startup in a competitive landscape
• In Lean Startup, the Minimum Viable Product
is central… test basic assumptions!
• The same is true in data science – start with
something very basic. You will learn a lot…
then build a better model.
A Very Simple & Brief Intro to
Machine Learning
Typically there are two classes of
problems people want solved…
• Classification – what group does this data fall
into? (e.g. male vs female, big spender vs
thrifty shopper, etc…)
• Regression – predict the value of this variable.
(e.g. how much money will our store make
next year?)
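The two problem classes can be made concrete with a toy sketch (my addition, stdlib-only): a 1-nearest-neighbour classifier predicts a discrete label, while an ordinary least-squares line fit predicts a continuous value.

```python
# Classification: predict a discrete label.
# A minimal 1-nearest-neighbour classifier for 1-D features.
def classify_1nn(train, x):
    """train is a list of (feature, label) pairs; return the label
    of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Regression: predict a continuous value.
# Ordinary least-squares fit of y = a*x + b.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx  # slope, intercept
```

For example, `classify_1nn([(5, "small"), (500, "big")], 20)` answers a "which group?" question, while `fit_line` answers a "how much?" question.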
Separate these two classes…
Campbell et al, 2012
There are two basic steps in machine
learning
1. Feature extraction – what information do you pull
from the data to learn from?
(e.g. “you dunt neid atl the leytirs to reqd tjis”)
2. Apply the learning algorithm – feed the features to
the algorithm you have chosen and get the answers.
You can play with either step to get better results (and
there are algorithms that do both in one step, e.g.
deep learning, convnets).
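The two steps can be sketched end to end (my own toy example, not from the slides — the "spam" task and the particular features are illustrative assumptions): first turn raw text into a handful of numbers, then hand those feature vectors to a simple learning algorithm, here a 1-nearest-neighbour classifier.

```python
import math

# Step 1: feature extraction -- turn raw text into numbers.
# (Toy features; real problems need domain-driven choices.)
def extract_features(text):
    return [
        len(text),                       # message length
        sum(c.isdigit() for c in text),  # digit count
        text.count("!"),                 # exclamation marks
    ]

# Step 2: the learning algorithm -- a 1-nearest-neighbour
# classifier over the extracted feature vectors.
def train_and_predict(labelled_texts, new_text):
    feats = [(extract_features(t), y) for t, y in labelled_texts]
    target = extract_features(new_text)
    return min(feats, key=lambda p: math.dist(p[0], target))[1]
```

Improving either step helps: better features make even a crude algorithm work, and a better algorithm can squeeze more out of the same features.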
There are typically two types of ML
problems…
• Supervised – “here are some examples with the
model answers. Learn from these and apply to
new examples…” (labeled data). Just like school.
Learn from training set → apply to test data set
• Unsupervised – ‘Here is some data. I don’t know
anything, figure everything out yourself.’
(unlabeled data). This is basically clustering →
Nadeem’s dataset.
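To see what "figure everything out yourself" looks like in practice, here is a minimal unsupervised sketch (my addition): 1-D k-means, which discovers groups in unlabeled numbers with no model answers supplied.

```python
import statistics

def kmeans_1d(data, k=2, iters=20):
    """Minimal 1-D k-means: given only unlabeled numbers, the
    algorithm discovers k clusters by itself (unsupervised)."""
    centres = sorted(data)[:: max(1, len(data) // k)][:k]  # crude init
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)
        centres = [statistics.mean(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sorted(centres)
```

Feeding it `[1.0, 1.2, 0.9, 10.0, 10.5, 9.8]` recovers two cluster centres near 1 and 10 — no labels were ever provided.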
Pitfalls and Warnings
https://www.topstocks.com.au/
1. Correlation is not causation…
If you look through enough correlations (and algorithms),
some of them will appear significant, just by chance…
But they have no real value.
2. Representative training data
• If the data you train on is not similar to the
test data, you will usually get very bad results!
Representative Training
The Ugly Duckling lacked representative training data…
3. Overfitting
If your friend says “I know how to get to the
supermarket, follow me” and then goes to the
toilet before getting in the car, you probably
don’t need to follow them into the
bathroom…
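The bathroom analogy in code (my own toy illustration): the most extreme overfitter is a model that simply memorizes the training data, copying every irrelevant detail. It scores perfectly on the training set and is useless everywhere else.

```python
def memorise(train):
    """An extreme overfitter: a lookup table of the training data."""
    table = dict(train)
    return lambda x: table.get(x, 0.0)  # clueless off the training set

# True relationship (unknown to the model): y = 2x
train = [(1, 2), (2, 4), (3, 6)]
model = memorise(train)

train_error = sum(abs(model(x) - y) for x, y in train)       # zero: "perfect"
test_error = sum(abs(model(x) - 2 * x) for x in (4, 5, 6))   # large
```

A straight-line fit through the same three points would have generalized perfectly; the memorizer followed its training data into the bathroom.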
Robust Classification…
Overfitting
Data Science: First Steps
Step 1. Determine sample size, an indicator of data depth.
Step 2. Know the number of numeric and character variables, an indicator
of data breadth.
Step 3. Calculate the percentage of missing data for each numeric variable.
Step 4. Histogram, plot or otherwise map each variable.
Step 5. Search each variable for unexpected values: improbable
values, and undefined values due to division by zero.
Step 6. Know the nature of numeric variables, i.e. declare the formats of
the numerics as decimal, integer or date.
If your data has some nasty peculiarities you don’t know about, it can
really upset a clever algorithm.
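Steps 1-5 can be sketched in a few lines (my addition; the three-row dataset is hypothetical, with `None` marking a missing value):

```python
# A hypothetical three-row dataset; None marks a missing value.
rows = [
    {"age": 29,   "fare": 7.25, "name": "A"},
    {"age": None, "fare": 71.3, "name": "B"},
    {"age": 40,   "fare": None, "name": "C"},
]

# Step 1: sample size (data depth).
n = len(rows)

# Step 2: numeric vs character variables (data breadth).
numeric = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
char = [k for k, v in rows[0].items() if isinstance(v, str)]

# Step 3: percentage of missing data per numeric variable.
missing = {k: 100 * sum(r[k] is None for r in rows) / n for k in numeric}

# Steps 4-5: look over each variable's values and hunt for the
# unexpected (improbable values, division-by-zero artefacts, ...).
for k in numeric:
    values = sorted(r[k] for r in rows if r[k] is not None)
    print(f"{k}: {values}, missing {missing[k]:.0f}%")
```

Ten minutes of profiling like this is cheap insurance against the "nasty peculiarities" that silently wreck a clever algorithm.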
• Machine learning competition site
(kaggle.com)
• They give a training dataset and a test set for
which we need to predict the answers.
• We can submit up to 5 test submissions per
day until the competition closes.
• The final score is based on an unknown subset of
the test data.
The Titanic Problem
• Start with: https://www.kaggle.com/c/titanic-
gettingStarted
• Do the tutorials!
• Read the forums (https://www.kaggle.com/c/titanic-
gettingStarted/forums)
• Download the ipython notebook:
https://www.kaggle.com/c/titanic-
gettingStarted/forums/t/5105/ipython-notebook-
tutorial-for-titanic-machine-learning-from-disaster
• This is a classification problem (0 = died, 1 = survived)
• Good luck!
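A sensible first submission — not from the slides, but a widely used baseline for this competition — is the "gender model": predict survival from the `Sex` column alone, before any real feature engineering. A sketch (the two-row CSV is a stand-in for Kaggle's real train.csv):

```python
import csv
import io

def predict_survival(passenger):
    """Baseline 'gender model': predict 1 (survived) for female
    passengers, 0 (died) otherwise."""
    return 1 if passenger["Sex"] == "female" else 0

# The real train.csv comes from Kaggle; a two-row stand-in here.
sample = io.StringIO("PassengerId,Sex\n1,male\n2,female\n")
preds = {row["PassengerId"]: predict_survival(row)
         for row in csv.DictReader(sample)}
```

Beating this one-line model with real features (class, age, fare, …) is the actual exercise — the Lean Startup "minimum viable product" idea applied to a Kaggle entry.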

Mauritius Big Data and Machine Learning JEDI workshop