“DATA IN THE WILD” –
BEGINNER STEPS INTO DATA
MARTA FAJLHAUER, GSTATS, BSC
DATA ANALYST AT BRIGHTBLUE CONSULTING,
PROFESSIONAL FELLOW OF THE ROYAL STATISTICAL SOCIETY
POSTGRADUATE STUDENT AT QUEEN MARY UNIVERSITY OF LONDON
What I learned from analysing 250 profiles of my LinkedIn
connections working in Data Science
What I learned during my work in Data Engineering
What I learn when I work in Data Analytics
Bayesian reasoning for social media
• Curiosity, understanding, asking questions, looking for
answers to business and personal questions.
I want to work in Data
Science (£75,000 - £100,000)
Procurement / IT Service Desk /
Threat Intel Librarian / Audit / PMO /
Corporate / Business System /
Business / Technical / Analyst
Data / Analytics Consultant
Analytics and Business Intelligence
Analytical storyteller
AI and Advanced Analytics
Econometrician
Statistician
Mathematician
Software / Cloud / Mathematical / Data /
Linux Operation / System / Service /
Marketing / Backend / Blockchain / Splunk
/ Oracle / Machine Learning / AI Engineer
Data and Software / System / Enterprise /
Data Solution / Cloud Architect
Lead Software crafter
Software / Full Stack / Software developer
Cloud / AI / Computer Vision / Machine
Learning Consultant
Applied Machine Learning Scientists
Deep learning specialist
Enterprise data strategy
Machine Learning / AI / Robotics /
Researcher
Big Data Developer
Oracle DBA
DevOps
-> Machine Learning
-> R
-> Python
-> Deep Learning
-> NLP
-> AI
-> Advanced Statistics
241 profiles
86 Data Scientists (27 PhD and 13 BSc)
64 Data Analysts (1 PhD and 35 BSc)
64 Engineers
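The counts above can be turned into shares with a few lines of arithmetic. This is only a sketch: the three named groups add up to 214, so the code assumes (it is not stated on the slide) that the remaining 27 profiles carry other job titles. The names `groups`, `share` and `summarise` are made up for illustration.

```python
# Counts as read from the slide; the "Other titles" bucket is an assumption
# obtained by subtracting the three named groups from the 241 profiles.
TOTAL_PROFILES = 241
groups = {
    "Data Scientists": {"count": 86, "PhD": 27, "BSc": 13},
    "Data Analysts": {"count": 64, "PhD": 1, "BSc": 35},
    "Engineers": {"count": 64},
}


def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarise() -> dict:
    """Share of each group in all profiles, plus degree shares within a group."""
    summary = {}
    for name, group in groups.items():
        row = {"share_of_profiles": share(group["count"], TOTAL_PROFILES)}
        for degree in ("PhD", "BSc"):
            if degree in group:
                row[f"{degree}_share"] = share(group[degree], group["count"])
        summary[name] = row
    named = sum(group["count"] for group in groups.values())
    summary["Other titles"] = {
        "share_of_profiles": share(TOTAL_PROFILES - named, TOTAL_PROFILES)
    }
    return summary


for name, row in summarise().items():
    print(name, row)
```

On these numbers roughly a third of the Data Scientists hold a PhD, against under 2% of the Data Analysts.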
• Computer Science or Mathematics background.
• Others in every single category.
• Mathematics for Data Analytics and Computer Science for Data Engineering.
Data
Scientists
less than 20% computer science
60% degree in computer science
But…
Lead Software Crafter: BSc Health Science
DevOps: BSc Applied Linguistics
Marketing Engineer: English Literature
Senior Analytics Consultant: BSc Music
Software Engineer: Public Relations
Data Engineer: Anthropology
Data Manager: BSc Arts
Cloud Consultant: Advanced Aeronautical Engineering
Data Engineer: Public Health
You need to choose what you want to specialise in:
They are all called doctors, but does that mean that one can do the work of another?
Does it mean that one is more important than another? No. It means that each one
decided to concentrate on a specific area after an exploration stage.
EBOV virus for charity helping people in Africa. Crime Data mining using USA census
data
DATA ENGINEERING – FIRST JOB:
“DATA SOMETHING”
IT Ops and Security
Machine data
Real time visibility
Forwarding data in real time.
Collect and visualise
Forward data in real time to indexes
Scales from single server to distributed deployment
Accepts any text data as input, parses the data into
events, stores events in indexes, searches and reports
• Writing configuration files <TCP / UDP, SSL, HEC>
• Set up receiving ports on indexers, add inputs to forwarders
• Compress the feed to save money on pre-processing data from Hadoop clusters
• Lesson 0: Where is the coffee machine?
• Lesson 1: Not many girls in Data Engineering work: the only girl, the only
non-technical person.
• Lesson 2: Stack Overflow and Google are my best friends.
• Lesson 3: How to set up a Splunk image in a Docker container.
• Lesson 4: Setting up a distributed, global deployment – it is very important to set
the correct time and time zone to correlate events across multiple sources, and to set up
alerts in case of anomalies.
• Lesson 5: Data encryption and different levels of access are very important in
finance – regex, Bash, Linux.
• Dashboards and automatic pivots using Splunk's Search Processing Language (SPL).
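Lesson 4 is easy to illustrate outside Splunk. The sketch below is plain Python, not Splunk configuration, and the two events, their timestamps and the fixed summer-time offsets are all invented for the example. It shows why timestamps from different sources should be converted to one reference zone (here UTC) before they are correlated.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sources that log in their own local time (fixed summer offsets).
LONDON_SUMMER = timezone(timedelta(hours=1), "BST")
NEW_YORK_SUMMER = timezone(timedelta(hours=-4), "EDT")


def to_utc(local_text: str, tz: timezone) -> datetime:
    """Parse a naive 'YYYY-MM-DD HH:MM:SS' string logged in `tz`; return it in UTC."""
    naive = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


def seconds_apart(a: datetime, b: datetime) -> float:
    """Absolute gap between two timezone-aware timestamps, in seconds."""
    return abs((a - b).total_seconds())


# On the wall clock these two events look five hours apart.
login_failed = to_utc("2019-07-01 14:00:05", LONDON_SUMMER)
firewall_block = to_utc("2019-07-01 09:00:07", NEW_YORK_SUMMER)

gap = seconds_apart(login_failed, firewall_block)
print(login_failed.isoformat(), firewall_block.isoformat(), gap)
```

Once both events are expressed in UTC they turn out to be two seconds apart, which is the kind of link a mis-set time zone would hide.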
No time to carefully check all the details: the analytics of this kind of data is
completely different from that of static data.
In a static .csv file you can check whether you have missing data, visualise all
the details and understand the data, but with real-time rolling data it's completely
different. You have dashboards already set up to concentrate on the most important
bits. In Splunk, you can set up an alert.
When you deal with this kind of data you don't concentrate on the statistics behind
it; you only choose an algorithm, from a selection, that you think will best meet the
conditions. With static data you think about R^2, coefficients and so much more.
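The contrast can be sketched in a few lines of Python. This is an illustration only, with made-up data, column names and threshold. A static file can be read in full and checked cell by cell, while a rolling feed is only ever seen through a small window, so the check becomes an alert rule.

```python
import csv
import io
from collections import deque


def count_missing(csv_text: str) -> dict:
    """Static data: read the whole file, then count empty cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            if (row[name] or "").strip() == "":
                missing[name] += 1
    return missing


def rolling_alerts(stream, window=3, threshold=100.0):
    """Rolling data: keep only the last `window` values and yield the position
    of every value at which the window mean exceeds `threshold`."""
    recent = deque(maxlen=window)
    for position, value in enumerate(stream):
        recent.append(value)
        if len(recent) == window and sum(recent) / window > threshold:
            yield position


static_csv = "user,latency_ms\nanna,80\nben,\n,120\n"
print(count_missing(static_csv))

feed = [80, 90, 95, 130, 140, 60]
print(list(rolling_alerts(feed)))
```

The alert function never holds more than `window` values, which is the point: with rolling data you decide in advance what to watch for, rather than inspecting everything afterwards.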
Read code written by someone else
Modify the elements for your own purpose
Write your own code
• In languages like R, it is sometimes much more efficient to use a package that is
already available.
• When you set up a loop over millions of records, first check on a smaller dataset
that the loop gives the expected output and runs smoothly. Once you have checked that,
remember to add a loop counter so you can track progress, and set up automatic
saving of the output.
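The loop advice can be sketched as follows. The per-record function `process`, the checkpoint file name and the reporting intervals are all placeholders chosen for the example, not part of any particular project.

```python
import json
import os
import tempfile


def process(record: int) -> int:
    """Hypothetical per-record work; stands in for the real computation."""
    return record * record


def save(results: list, path: str) -> None:
    """Write results to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(results, handle)
    os.replace(tmp_path, path)


def run(records, checkpoint_path, report_every=100_000, save_every=100_000):
    """Process records with a progress counter and periodic automatic saving."""
    results = []
    for done, record in enumerate(records, start=1):
        results.append(process(record))
        if done % report_every == 0:
            print(f"processed {done} records")
        if done % save_every == 0:
            save(results, checkpoint_path)
    save(results, checkpoint_path)  # final save once the loop has finished
    return results


checkpoint = os.path.join(tempfile.mkdtemp(), "loop_checkpoint.json")

# Step 1: dry run on a small sample and check the output.
sample = run(range(5), checkpoint, report_every=2, save_every=2)
assert sample == [0, 1, 4, 9, 16]

# Step 2: only then start the full job, for example:
# full = run(range(5_000_000), checkpoint)
```

Saving through a temporary file means an interrupted run leaves the previous checkpoint intact instead of a half-written one.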
DATA ANALYTICS
Algorithms, R&D, statistical thinking
• Lesson 1: Do not rely completely on statistical knowledge without asking whether
correlation really implies causation (and not only in regression).
• Whatever you can plot, plot it to visualise the data.
• R, Python, Excel, SAS: whatever works for the given purpose – you choose.
• Different models for different kinds of data.
• With smaller, static datasets you may have much more fun from an
analytics point of view than with real-time rolling data coming from
different sources.
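Lesson 1 can be demonstrated with simulated data. In this sketch, which uses invented variables and coefficients, temperature drives both ice-cream sales and sunburn cases. The two outcomes are strongly correlated overall, but the correlation largely disappears once temperature is held roughly fixed, so neither causes the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


rng = random.Random(42)

# Hypothetical confounder and two outcomes that both depend on it.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn = [1.5 * t + rng.gauss(0, 5) for t in temperature]

r_all = pearson(ice_cream, sunburn)

# Hold the confounder roughly constant: days between 20 and 21 degrees only.
band = [(i, s) for t, i, s in zip(temperature, ice_cream, sunburn) if 20 <= t < 21]
r_band = pearson([i for i, _ in band], [s for _, s in band])

print(f"all days: r = {r_all:.2f}; 20-21 degree days only: r = {r_band:.2f}")
```

A regression of sunburn on ice-cream sales would fit this data well, which is exactly why the statistical result alone is not enough.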
Bayesian Reasoning for Social Data
Sherlock Holmes and Watson
• It's July, and mostly sunny <- prior. Predict: mostly sunny
• Someone carries an umbrella <- likelihood. Predict: rainy
• What if this is a country where people carry umbrellas on hot days? What if people
carry umbrellas only when it's raining?
• Update belief <- posterior
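The umbrella reasoning can be written as a two-hypothesis update. All the probabilities below are invented for illustration: a prior of 80% sunny for July, and two different assumptions about how likely someone is to carry an umbrella in each kind of weather.

```python
def update(prior: dict, likelihood: dict) -> dict:
    """Posterior over hypotheses: prior times likelihood, then normalise."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalised.values())
    return {h: value / evidence for h, value in unnormalised.items()}


# Prior belief: it's July, so mostly sunny (assumed numbers).
prior = {"sunny": 0.8, "rainy": 0.2}

# P(see an umbrella | weather) where umbrellas come out only when it rains.
umbrella_rain_only = {"sunny": 0.05, "rainy": 0.9}

# P(see an umbrella | weather) where umbrellas are also used as sunshades.
umbrella_sunshade = {"sunny": 0.5, "rainy": 0.9}

rain_only_posterior = update(prior, umbrella_rain_only)
sunshade_posterior = update(prior, umbrella_sunshade)

print("rain-only country:", rain_only_posterior)
print("sunshade country: ", sunshade_posterior)
```

The same observation moves the belief to "probably rainy" in the first country and leaves it at "probably sunny" in the second, because the likelihood differs.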
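If an absent-minded professor takes his umbrella into a classroom, there's a probability of 1/4 that he'll
absent-mindedly leave it there. One day, he sets off with his umbrella, teaches in three classrooms, and
comes back to his office... without his umbrella. What's the probability he left the umbrella in the first classroom?
\[ P(\text{left in 1st}) = \tfrac{1}{4} = \tfrac{16}{64}, \qquad P(\text{left in 2nd}) = \tfrac{3}{4} \cdot \tfrac{1}{4} = \tfrac{12}{64}, \qquad P(\text{left in 3rd}) = \tfrac{3}{4} \cdot \tfrac{3}{4} \cdot \tfrac{1}{4} = \tfrac{9}{64} \]
P(left in the first classroom, given that he left it somewhere) =
P(left it in the first classroom and he left it somewhere) / P(he left it somewhere):
\[ \frac{1/4}{1 - 27/64} = \frac{16}{16 + 12 + 9} = \frac{16}{37} \approx 43\% \]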
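The professor's problem can be checked with exact fractions. This is a small sketch. The function name `left_in` is made up, and the code computes the conditional probability for each of the three classrooms, given that the umbrella was left somewhere.

```python
from fractions import Fraction

LEAVE = Fraction(1, 4)  # chance of leaving the umbrella in any one classroom


def left_in(room: int, p: Fraction = LEAVE) -> Fraction:
    """Probability he keeps the umbrella through room-1 rooms, then leaves it."""
    return (1 - p) ** (room - 1) * p


lost = [left_in(room) for room in (1, 2, 3)]
lost_somewhere = sum(lost)
kept_it = 1 - lost_somewhere

# Condition on the observation: he came back without the umbrella.
given_lost = [value / lost_somewhere for value in lost]

for room, value in enumerate(given_lost, start=1):
    print(f"classroom {room}: {value} = {float(value):.1%}")
```

Unconditionally the first classroom has probability 1/4. Knowing that the umbrella is gone raises that to 16/37.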
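\[ P(\text{Truth} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Truth})\, P(\text{Truth})}{P(\text{Data})} \]
P(Truth) = the prior = what we believe in
P(Data | Truth) = the likelihood = the data collected to confirm our belief
P(Truth | Data) = the posterior = the updated belief
\[ P(\text{Truth} \mid \text{Data}) \propto P(\text{Truth})\, P(\text{Data} \mid \text{Truth}) \]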
[Diagram: prior belief, the data collected, updated belief]
posterior ∝ prior × likelihood
ROI, customer retention, losing an umbrella: everything is based on some
previous belief.
Why might we prefer Bayesian to classical approaches to the data?
The "small n, large p" problem
Limited influence over which features are selected in classical approaches
The power to decide which coefficients go into the model, and
how strongly they go into it.
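One way to see how a prior controls "how strongly" a coefficient enters the model is a single-slope regression with a Normal prior centred on zero. This is a deliberately small sketch with made-up data. For the model y = b·x + noise with prior b ~ Normal(0, prior_sd²), the posterior mean of b has a closed form that matches ridge regression with penalty (noise_sd / prior_sd)².

```python
def ols_slope(xs, ys):
    """Classical least-squares slope for y = b * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def posterior_mean_slope(xs, ys, prior_sd, noise_sd=1.0):
    """Posterior mean of b when b ~ Normal(0, prior_sd^2) and the noise is
    Normal(0, noise_sd^2). A tighter prior pulls the estimate towards zero."""
    penalty = (noise_sd / prior_sd) ** 2
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


# Only three observations: the prior matters a lot when n is small.
xs = [1.0, 2.0, 3.0]
ys = [2.5, 3.5, 7.0]

classical = ols_slope(xs, ys)
sceptical = posterior_mean_slope(xs, ys, prior_sd=0.1)  # "this effect is probably tiny"
open_minded = posterior_mean_slope(xs, ys, prior_sd=10.0)  # "almost anything goes"

print(f"OLS {classical:.3f} | tight prior {sceptical:.3f} | loose prior {open_minded:.3f}")
```

With a loose prior the estimate is almost the classical one. With a tight prior the same data moves the coefficient only a little away from zero, which is the lever the analyst controls.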
Why we are so different yet so similar: no two people are exactly alike
and no two people are exactly different
Preferences
• Bayesian statistics allows you to be subjective, to better connect the real world with the data.
• P-values and confidence intervals vs the posterior distribution <all outcomes and their probabilities>.
• The answers we look for do not match the answers from classical models.
• Important question: what is the probability of an event when the p-value is less than 0.005?
• A is better than B with a p-value of 0.001. A is more expensive.
• You have the predicted probability of the quality guarantee in hand, and the expected prices on the market.
• Bayesian methods support complex decision-making under uncertainty.
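The "A is better than B, but A is more expensive" point can be sketched with a posterior distribution instead of a p-value. The counts below are invented, the priors are uniform Beta(1, 1), and the ten-point lift needed to justify A's price is an assumed business rule. The function name `prob_lift_exceeds` is made up for the example.

```python
import random


def prob_lift_exceeds(success_a, trials_a, success_b, trials_b,
                      min_lift=0.0, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_A - rate_B > min_lift | data),
    with independent uniform Beta(1, 1) priors on both success rates."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + success_a, 1 + trials_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + trials_b - success_b)
        if rate_a - rate_b > min_lift:
            hits += 1
    return hits / draws


# Hypothetical trial: A passes the quality check 560 times in 1000, B 500 in 1000.
p_a_better = prob_lift_exceeds(560, 1000, 500, 1000)

# Assumed business rule: A is only worth its price if it is at least
# 10 percentage points better than B.
p_worth_price = prob_lift_exceeds(560, 1000, 500, 1000, min_lift=0.10)

print(f"P(A better than B)          = {p_a_better:.3f}")
print(f"P(A better by 10+ points)   = {p_worth_price:.3f}")
```

A significance test only says that A and B differ. The posterior answers the question the decision needs, namely how probable it is that A is better by enough to pay for itself.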
Bayesian methods provide trade-offs between speed and generality
Don't know the priors? Are you sure?
Multiple-module analysis with different levels of priors.
• Business rules influencing decision
• Movement of needs depending on price
• We need to think about competitors,
situation on the market, prices of other
products within the store
We try to measure the return on investment by media type.
We have cross-sectional units: regions, markets, trade areas, channels, brands, competitor brands.
Another dimension is time: the series can be weekly or monthly, with at least 5 years of monthly data and 2 years of weekly data.
The dependent variable has to be in units, not currency, because of price elasticity.
Marketing Mix Modelling
• the theory that will never die
• Bayesian Methods for Hackers - http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
• Think Bayes – Bayesian Statistics in Python - https://greenteapress.com/wp/think-bayes/
• Statistical Computing for Scientists and Engineers - https://www.zabaras.com/statistical-computing-2017
• Chris Bishop, Introduction to Bayesian Inference: http://videolectures.net/mlss09uk_bishop_ibi/?q=mlss+2009
• Statistical Rethinking: Ebook: http://xcelab.net/rmpubs/rethinking/Statistical_Rethinking_sample.pdf Videos: https://www.youtube.com/watch?v=oy7Ks3YfbDg&list=PLDcUM9US4XdM9_N6XUUFrhghGJ4K25bFc
MARTA FAJLHAUER
Email: fajlhauermarta@gmail.com
LinkedIn: https://www.linkedin.com/in/martafajlhauer/

Editor's Notes

  • #3 Structure of the talk.
  • #4 Statement: “I want to work in Data Science”, based on salary. Explosion of information. First conference and where my friends work.