Yo. Big Data
understanding data science in the era of big data.
Natalino Busa
@natalinobusa

Yo. big data. understanding data science in the era of big data.


We talk a lot these days about data science, and how it will pave our paths with beautiful insights and unexpected new relations and connections in our given datasets, and even across datasets.

But how do we maintain the "Science" part in "Data Science"? After some time working in this field, I appreciate more and more the critical thinking that has characterized progress in science.

Hypothesis, facts, then proving and/or disproving the thesis: this is how science has progressed over the past centuries. Popper formalized this method and categorized as non-science all disciplines whose statements cannot be falsified. In other words, if a statement cannot be disproved, we cannot talk of science, since there is no mechanism left to verify the solution or to prove it wrong.

When that happens, the argument can still be accepted, but not scientifically accepted. A non-falsifiable statement can be accepted or refuted on, for instance, aesthetic, authority-based, pragmatic, or philosophical considerations. All are valid, but not scientific. This applies, for instance, to statements in politics, theology, ethics, etc.

Science has definitely progressed since then. For instance, Bayesian networks and statistical induction are now part of the (data) scientist's arsenal. But no matter how the baseline is set, critical thinking and a rigorous method are definitely helpful in understanding the results produced by science, in particular when it is based on large amounts of data and is computational in nature rather than formula/model driven.

Data Science currently has many different connotations. On one side, it praises the "artistry", the genius of laying out connections between disciplines and concepts. This is a truly great aspect of scientists, and creativity is definitely very welcome in all data science profiles.

Along with the fun of creating new insights and new data golden eggs, a data scientist has to put up with those annoying criteria of reproducibility, falsifiability and peer review. Sometimes these elements are postponed or left behind in the name of artistry. Granted, it is hard to find metrics and baselines for comparing models and data science solutions. But the scientific method has proven solid over the centuries: it allows factual discussion between scientists and selection between models based on objective, agreed criteria.
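To make the idea of "objective agreed criteria" concrete, here is a minimal sketch, not taken from the slides, of comparing a model against a baseline with k-fold cross-validation and one agreed metric (mean squared error). The synthetic data, the helper names (`fit_mean`, `fit_line`, `cross_val_mse`) and the choice of five folds are all assumptions made for illustration.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error: the metric both models are judged on."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate model: ordinary least-squares line through the data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out MSE over k interleaved folds."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        model = fit(train_x, train_y)
        scores.append(mse(test_y, [model(x) for x in test_x]))
    return mean(scores)


# Synthetic data (assumption): a noisy straight line, seeded for reproducibility.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

baseline_score = cross_val_mse(fit_mean, xs, ys)
line_score = cross_val_mse(fit_line, xs, ys)
print(f"baseline MSE: {baseline_score:.3f}  line MSE: {line_score:.3f}")
```

Both models are scored on data they did not see during fitting, so the comparison is reproducible and can be challenged by anyone who reruns it with the same seed.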

Published in: Science, Technology, Education


  1. Yo. Big Data: understanding data science in the era of big data. Natalino Busa, @natalinobusa
  2. Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing. Natalino Busa, @natalinobusa, www.natalinobusa.com
  3. Understanding Big Data
  4. What is life?
  5. Why are we?
  6. What is reality?
  7. ● (Almost) everything is a number ● A few guys came up with some good ideas: Aristotle, Galileo, Popper, Fisher, Pearson, Bayes. What has changed in 2500 years?
  8. Aristotle: analytical reasoning, induction, deduction, causality, ontology
  9. Galileo: scientific method, experiment, reproducibility, math formulas as models
  10. Popper: falsification, exact sciences, models have to adhere to reality. Statistical inference: can we falsify beliefs?
  11. Pearson: statistical method, null hypothesis, hypothesis testing, Principal Component Analysis, correlation coefficient
  12. Fisher: statistical method, likelihood function, significance, distribution, sufficient statistics
  13. Bayes: math of belief, belief inference, network of beliefs, hypothesis -> beliefs
  14. What about it? The shocking truth: 1) we use these concepts every day; 2) we have a pre-scientific intuition of these ideas
  15. Why do we bother? New problems are related to understanding human behavior: understanding needs, desires, dreams, ambitions, cravings, and hopes. Models have a great side effect: they help us predict the future. Three weapons: Processing power: models become faster and can run for everybody's profiles. Sources: extract more data features, use different data. Context: explore information in order to understand the person.
  16. So, why data? Data is our way of understanding life and reality.
  17. How to deal with it? Well, it's quite simple, in a nutshell. This is what (data) science is about: data -> hypothesis -> validation
  18. … but what we (mostly) really do is: use very little data -> apply it to pre-formulated beliefs -> come up with some “gut feeling”. Validate it: it didn't work? “Well, I am still right.”
  19. Just buy the damned thing.
  20. What's the problem with it? ● Context ○ we could use some more data ○ insufficient feature engineering ● Add more hypotheses ○ we could explore more scenarios, “pivoting” ○ look at the problem from other angles ○ need data “artistry”
  21. Big data to the rescue? Big Data is the domain which transforms numbers into insights and services into experiences
  22. Big data to the rescue? By aggregating data sources across users, across applications, across domains
  23. Big data to the rescue? In order to provide personalized and relevant results to the consumer of the given service, anywhere, anytime.
  24. Some small headaches: users != consumers; N=all doesn't mean you don't need to clean it; not all data is born equal; you don't know what you don't know
  25. Keep exploring. Your problem might not be captured by your data features.
  26. Some small headaches: tough to inspect big data; tough to reason about big data; representativity/bias, support, and segmentation; signal-to-noise ratio: look at GFT (Google Flu Trends), for instance
  27. Diminishing returns: most of the models are pretty good after a few weeks; the winner added just about 5% more after 1 year and a 300-model ensemble. Moral: move on, get a new angle
  28. How to compare? You know the answer (supervised methods): confusion matrix, ROC (Receiver Operating Characteristic), Mean Square Error (MSE). You don't know the answer (unsupervised methods): objective function, access ground truth, A/B testing
  29. Which is right?
  30. Beware the modeling risks: overfitting the training data; not enough “support” in the population; not enough features available/discovered; objective function not well defined
  31. Objective functions: “you can please some of the people some of the time”
  32. Objective functions: many want a slice of the cake when it comes to objective functions ● what the user wants ● what the community wants ● what marketing wants ● what business wants ● what finance/monetization wants
  33. Data scientists: data artists, data analysts, data scientists, data engineers. Confirmatory analysis: domain knowledge, statisticians and data analysts. Exploratory analysis: data artists/scientists. Operational analysis: data engineers, data technologists
  34. When is data science cool?
  35. What do we look for in the haystack? Outliers: outliers are indicators and/or noise. Groups (similarity metrics, PCA, SVD). Big data as a pragmatic approach to: cheap storage, distributed computing
  36. How to enjoy and compare data science? Enjoy the artistry; appreciate the genius; cross-validation; avoid falling into the trap of over-fitted models; define a baseline; avoid qualitative methods; define a metric, put the models on the bench, compare results
  37. Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing. Natalino Busa, @natalinobusa, www.natalinobusa.com. Thanks! Any questions?
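
The supervised metrics named on the "How to compare?" slide can be sketched in a few lines. This is an illustrative example only, not code from the talk: the labels, scores, threshold and helper names (`confusion_matrix`, `roc_point`, `mse`) are assumptions, and labels are assumed to be binary (0 or 1).

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0, given binary labels
            fn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def roc_point(y_true, scores, threshold):
    """One point of the ROC curve: (false positive rate, true positive rate)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    cm = confusion_matrix(y_true, y_pred)
    tpr = cm["tp"] / (cm["tp"] + cm["fn"])
    fpr = cm["fp"] / (cm["fp"] + cm["tn"])
    return fpr, tpr


def mse(y_true, y_pred):
    """Mean squared error for regression-style predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and classifier scores, made up for this example.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

cm = confusion_matrix(y_true, [1 if s >= 0.5 else 0 for s in scores])
fpr, tpr = roc_point(y_true, scores, threshold=0.5)
reg_error = mse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(cm, fpr, tpr, reg_error)
```

Sweeping `threshold` from 1 down to 0 and collecting the resulting points traces out the full ROC curve; a single threshold gives only one operating point.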
