Bayesian reasoning

“DATA IN THE WILD” –
BEGINNER STEPS INTO DATA
MARTA FAJLHAUER, GSTATS, BSC
DATA ANALYST AT BRIGHTBLUE CONSULTING,
PROFESSIONAL FELLOW OF THE ROYAL STATISTICAL SOCIETY
POSTGRADUATE STUDENT AT QUEEN MARY UNIVERSITY OF LONDON

What I learned from analysing 250 profiles of my LinkedIn connections working in Data Science
What I learned during my work in Data Engineering
What I learn when I work in Data Analytics
Bayesian reasoning for social media
• Curiosity, understanding, asking questions, looking for answers to business and personal questions

I want to work in Data Science (£75,000 - £100,000)
Procurement / IT Service Desk / Threat Intel Librarian / Audit / PMO / Corporate / Business System / Business / Technical / Analyst
Data / Analytics Consultant
Analytics and Business Intelligence
Analytical storyteller
AI and Advanced Analytics
Econometrician
Statistician
Mathematician
Software / Cloud / Mathematical / Data / Linux Operation / System / Service / Marketing / Backend / Blockchain / Splunk / Oracle / Machine Learning / AI Engineer
Data and Software / System / Enterprise / Data Solution / Cloud Architect
Lead Software crafter
Software / Full Stack / Software developer
Cloud / AI / Computer Vision / Machine Learning Consultant
Applied Machine Learning Scientists
Deep learning specialist
Enterprise data strategy
Machine Learning / AI / Robotics / Researcher
Big Data Developer
Oracle DBA
DevOps
-> Machine Learning
-> R
-> Python
-> Deep Learning
-> NLP
-> AI
-> Advanced Statistics

241 profiles
86 Data Scientists (27 PhD and 13 BSc)
64 Data Analysts (1 PhD and 35 BSc)
64 Engineers

• Computer Science or Mathematics background.
• Others in every single category
• Mathematics for Data Analytics and Computer Science for Data Engineering
Data Scientists

less than 20% computer science
60% degree in computer science
But….
Lead Software Crafter: BSc Health Science
DevOps: BSc Applied Linguistics
Marketing Engineer: English Literature
Senior Analytics Consultant: BSc Music
Software Engineer: Public Relations
Data Engineer: Anthropology
Data Manager: BSc Arts
Cloud Consultant: Advanced Aeronautical Engineering
Data Engineer: Public Health

You need to choose what you want to specialise in:
They are all called doctors, but does that mean one can do the work of another? Does it mean that one is more important than another? No. It means that each decided to concentrate on a specific thing after an exploration stage.
EBOV virus for a charity helping people in Africa. Crime data mining using USA census data.

DATA ENGINEERING – FIRST JOB:
“DATA SOMETHING”

IT Ops and Security
Machine data
Real time visibility
Forwarding data in real time.
Collect and visualise
Forward data in real time to indexes
Scales from single server to distributed deployment
Accepts any text data as input, parses the data into
events, stores events in indexes, searches and reports

• Writing configuration files <TCP / UDP, SSL, HEC>
• Setting up receiving ports on indexers, adding inputs to forwarders
• Compressing the feed to save money on pre-processing data from Hadoop clusters
• Lesson 0: where the coffee machine is
• Lesson 1: not many girls work in Data Engineering: I was the only girl and the only non-technical person.
• Lesson 2: Stack Overflow and Google are my best friends.
• Lesson 3: how to set up a Splunk image in a Docker container
• Lesson 4: when setting up a distributed, global deployment, it is very important to set the proper time and time zone so that events can be correlated across multiple sources, and to set up alerts in case of anomalies
• Lesson 5: data encryption and different levels of access are very important in finance – REGEX, Bash, Linux
• Dashboards and automatic pivots using Splunk's Search Processing Language (SPL).
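Lesson 4 can be illustrated outside Splunk with a minimal Python sketch. It is not Splunk configuration: the function names `to_utc` and `merge_events`, the timestamps and the messages are all hypothetical. It only shows why events from different sites need a common time zone before they are correlated.

```python
from datetime import datetime, timezone

def to_utc(timestamp):
    """Parse an ISO 8601 timestamp that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # A timestamp without an offset cannot be placed on a shared timeline.
        raise ValueError(f"timestamp has no time zone: {timestamp}")
    return parsed.astimezone(timezone.utc)

def merge_events(*sources):
    """Combine (timestamp, message) pairs from several sources, ordered by UTC time."""
    events = [(to_utc(ts), msg) for source in sources for ts, msg in source]
    return sorted(events, key=lambda event: event[0])

if __name__ == "__main__":
    # Hypothetical events logged in local time at two sites.
    london = [("2019-07-01T09:00:05+01:00", "London: login failed")]
    new_york = [("2019-07-01T04:00:03-04:00", "New York: login failed")]
    for when, message in merge_events(london, new_york):
        print(when.isoformat(), message)
```

Sorted by local clock time the two events look five hours apart; on the UTC timeline they are two seconds apart, which is the kind of correlation an alert would rely on.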

No time to carefully check all the details: the analytics of this kind of data is completely different from that of static data.
In a static .csv file you can check whether you have missing data, visualise all the details and understand the data, but with real-time rolling data it is completely different. You have already set up dashboards to concentrate on the most important bits. In Splunk, you can set up an alert.
When you deal with this kind of data you don't concentrate on the statistics behind it; you only choose an algorithm from a selection that you think will best meet the conditions. With static data you think about R^2, coefficients and so much more.

Read code written by someone else -> modify the elements for your own purpose -> write your own code
• There are languages like R where it is sometimes much more efficient to use a package that is already available.
• When you set up a loop over millions of records, first check that the loop gives the expected output and runs smoothly on a smaller dataset. Once you have checked that, remember to add a loop counter so you can track progress, and set up automatic saving of the output.
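The loop advice above can be sketched in Python as follows. This is only an illustration under assumptions: `process` stands in for the real per-record work, and the function names, file name and reporting intervals are hypothetical.

```python
import json
import os
import tempfile

def process(record):
    # Hypothetical per-record work; replace with the real computation.
    return record * 2

def save(results, out_path):
    # Write to a temporary file first, then rename it, so an interrupted
    # save never overwrites the last good output with a partial file.
    tmp_path = out_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(results, handle)
    os.replace(tmp_path, out_path)

def run(records, out_path, report_every=1000, save_every=5000):
    results = []
    for count, record in enumerate(records, start=1):  # count is the loop counter
        results.append(process(record))
        if count % report_every == 0:
            print(f"processed {count} of {len(records)} records")
        if count % save_every == 0:
            save(results, out_path)  # automatic intermediate save
    save(results, out_path)  # final save once the loop has finished
    return results

if __name__ == "__main__":
    # Step 1: check the loop on a small sample before running it on everything.
    sample = list(range(10))
    out_file = os.path.join(tempfile.gettempdir(), "loop_output.json")
    assert run(sample, out_file, report_every=5, save_every=5) == [x * 2 for x in sample]
```

Running the small sample first costs seconds; finding a bug after several hours of a full run costs the whole run.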

DATA ANALYTICS
Algorithms, R&D, statistical thinking

• Lesson 1: do not rely completely on statistical knowledge without asking whether correlation really implies causation (and not only in regression).
• Whatever you can plot, plot it to visualise the data.
• R, Python, Excel, SAS: whatever works for the given purpose. You choose.
• Different models for different kinds of data
• With smaller, static datasets you may have much more fun from an analytics point of view than with real-time rolling data coming from different sources.

Bayesian Reasoning for Social Data
Sherlock Holmes and Watson

• It's July, and mostly sunny <- prior. Predict: mostly sunny
• Someone carries an umbrella <- likelihood. Predict: rainy
• What if this is a country where you carry an umbrella on hot days? What if you carry an umbrella only when it's raining?
• Update belief <- posterior
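The umbrella reasoning above can be written as a single Bayes update. The sketch below is illustrative only: the prior of 0.2 for rain in July and both sets of umbrella probabilities are made-up numbers, chosen to contrast a country where umbrellas are carried only in rain with one where they are also carried on hot days.

```python
def posterior_rain(prior_rain, p_umbrella_if_rain, p_umbrella_if_sunny):
    """Return P(rain | umbrella seen) using Bayes' theorem."""
    evidence = (p_umbrella_if_rain * prior_rain
                + p_umbrella_if_sunny * (1 - prior_rain))
    return p_umbrella_if_rain * prior_rain / evidence

if __name__ == "__main__":
    prior = 0.2  # assumed belief that a July day is rainy

    # Country 1 (assumed): umbrellas are rarely carried unless it rains.
    print(round(posterior_rain(prior, 0.9, 0.1), 3))

    # Country 2 (assumed): umbrellas are also used as sunshades on hot days.
    print(round(posterior_rain(prior, 0.9, 0.6), 3))
```

With the same prior and the same observation, the updated belief in rain is high in the first country and stays close to the prior in the second, because there the umbrella carries little information.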

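If an absent-minded professor takes his umbrella into a classroom, there's a probability of 1/4 that he'll absent-mindedly leave it there. One day, he sets off with his umbrella, teaches in three classrooms, and comes back to his office... without his umbrella. What's the probability that he left the umbrella in the first classroom?
He can only leave the umbrella in a classroom if he still has it when he gets there, so:
$P(\text{left in 1st}) = \frac{1}{4} = \frac{16}{64}$
$P(\text{left in 2nd}) = \frac{3}{4} \cdot \frac{1}{4} = \frac{12}{64}$
$P(\text{left in 3rd}) = \left(\frac{3}{4}\right)^2 \cdot \frac{1}{4} = \frac{9}{64}$
$P(\text{left in 1st} \mid \text{left somewhere}) = \frac{P(\text{left in 1st and left somewhere})}{P(\text{left somewhere})} = \frac{1/4}{1 - 27/64} = \frac{16}{16 + 12 + 9} = \frac{16}{37} \approx 43\%$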
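The professor's problem can be checked with exact fractions. This is a small sketch; the function name `umbrella_posterior` and its parameters are illustrative, not part of the original slides.

```python
from fractions import Fraction

def umbrella_posterior(p_leave=Fraction(1, 4), classrooms=3):
    """P(left in classroom k | umbrella was left somewhere), for each k."""
    still_carrying = Fraction(1)  # probability he still has it on arrival
    left_in = []
    for _ in range(classrooms):
        left_in.append(still_carrying * p_leave)
        still_carrying *= 1 - p_leave
    lost_somewhere = sum(left_in)  # equals 1 - (3/4)**3 for the defaults
    return [p / lost_somewhere for p in left_in]

if __name__ == "__main__":
    first, second, third = umbrella_posterior()
    print(first, second, third)  # 16/37 12/37 9/37
    print(float(first))          # about 0.43
```

The three conditional probabilities sum to 1, as they must once we know the umbrella was left in one of the three rooms.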

$P(\text{Truth} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Truth}) \, P(\text{Truth})}{P(\text{Data})}$
$P(\text{Truth})$ = the prior = what we believe in
$P(\text{Data} \mid \text{Truth})$ = the likelihood = the data collected to confirm our belief
$P(\text{Truth} \mid \text{Data})$ = the posterior = the updated belief
$P(\text{Truth} \mid \text{Data}) \propto P(\text{Truth}) \, P(\text{Data} \mid \text{Truth})$
[Diagram: Prior belief -> The data collected -> Updated belief]

$\text{posterior} \propto \text{prior} \times \text{likelihood}$
ROI, customer retention, losing an umbrella: all are based on some previous belief
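For customer retention, "posterior ∝ prior × likelihood" has a simple closed form when the prior is a Beta distribution and the data are retained / lost counts. The sketch below uses that standard Beta-Binomial update; the prior Beta(8, 2) and the customer counts are invented for illustration.

```python
def update_beta(prior_a, prior_b, retained, lost):
    """Beta(a, b) prior plus binomial data gives a Beta(a + retained, b + lost) posterior."""
    return prior_a + retained, prior_b + lost

def beta_mean(a, b):
    return a / (a + b)

if __name__ == "__main__":
    # Assumed previous belief: retention is around 80% (Beta(8, 2)).
    prior = (8, 2)
    # Hypothetical new data: 30 customers retained, 20 lost.
    posterior = update_beta(*prior, retained=30, lost=20)

    print("prior mean:    ", beta_mean(*prior))
    print("observed rate: ", 30 / 50)
    print("posterior mean:", round(beta_mean(*posterior), 3))
```

The posterior mean lands between the previous belief and the observed rate, and more data pulls it further towards the observed rate.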

Why might we prefer Bayesian to classical approaches to the data?
• The problem of small n and large p (few observations, many features).
• In classical approaches we have limited influence on which features are selected.
• With priors we have the power to decide which coefficients go into the model, or how strongly they go into the model.
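How a prior controls "how strongly" a coefficient enters the model can be shown with the standard normal-normal update for a single coefficient. This is a simplified sketch, not a full regression: it assumes a normal prior and a normally distributed estimate with known standard error, and all numbers are made up.

```python
import math

def shrink(estimate, std_error, prior_mean, prior_sd):
    """Posterior mean and sd of a coefficient under a normal prior.

    Each source of information is weighted by its precision (1 / variance).
    """
    data_precision = 1 / std_error ** 2
    prior_precision = 1 / prior_sd ** 2
    post_precision = data_precision + prior_precision
    post_mean = (data_precision * estimate
                 + prior_precision * prior_mean) / post_precision
    return post_mean, math.sqrt(1 / post_precision)

if __name__ == "__main__":
    estimate, std_error = 2.0, 1.0  # hypothetical noisy estimate from small n

    # Tight prior around zero: the coefficient barely enters the model.
    print(shrink(estimate, std_error, prior_mean=0.0, prior_sd=0.5))

    # Wide prior: the data dominate and the estimate is almost unchanged.
    print(shrink(estimate, std_error, prior_mean=0.0, prior_sd=10.0))
```

With a tight prior the coefficient is pulled most of the way to zero, while a wide prior leaves it close to the raw estimate. That is the lever the analyst controls.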

Why we are so different yet so similar: no two people are exactly alike, and no two people are exactly different.
Preferences

• Bayesian statistics allows you to be subjective, to better connect the real world with the data.
• P-values and confidence intervals vs the posterior distribution <all outcomes and their probabilities>.
• The answers we look for do not match the answers from classical models.
• Important question: what is the probability of an event when the p-value is less than 0.005?
• A is better than B with a p-value of 0.001, but A is more expensive.
• You have the predicted probability of the quality guarantee in hand, together with the expected prices on the market.
• Bayesian methods support complex decision-making under uncertainty.
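The "A is better than B, but more expensive" question can be sketched with posterior samples instead of a p-value. Everything below is an assumption for illustration: the success counts, the uniform Beta(1, 1) priors, the value per success and the two costs. The sampling uses the standard library's `random.betavariate`.

```python
import random

def compare(successes_a, trials_a, successes_b, trials_b, draws=20000, seed=0):
    """Sample both posteriors (uniform Beta(1, 1) priors) and compare them."""
    rng = random.Random(seed)
    a = [rng.betavariate(1 + successes_a, 1 + trials_a - successes_a)
         for _ in range(draws)]
    b = [rng.betavariate(1 + successes_b, 1 + trials_b - successes_b)
         for _ in range(draws)]
    p_a_better = sum(x > y for x, y in zip(a, b)) / draws
    return p_a_better, sum(a) / draws, sum(b) / draws

def expected_net_value(quality, value_per_success, cost):
    """Expected value of choosing an option once its price is taken into account."""
    return quality * value_per_success - cost

if __name__ == "__main__":
    # Hypothetical trial results: A succeeded 60/100 times, B 45/100 times.
    p_a_better, quality_a, quality_b = compare(60, 100, 45, 100)
    print("P(A better than B):", p_a_better)

    # Hypothetical economics: a success is worth 100, A costs 20, B costs 10.
    print("A:", expected_net_value(quality_a, 100, 20))
    print("B:", expected_net_value(quality_b, 100, 10))
```

The first number answers the question people usually want answered (how probable is it that A is better?), and the last two turn that belief into a decision that includes price. With different assumed costs, the cheaper option B could win even though A is very probably better.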

Bayesian methods provide trade-offs between speed and generality

Don't know the priors?
Are you sure?
Multiple module analysis with different levels of priors.

• Business rules influencing decision
• Movement of needs depending on price
• We need to think about competitors, the situation on the market, and prices of other products within the store

Marketing Mix Modelling
We try to measure the return on investment by media type.
We have cross-sectional units: regions, markets, trade areas, channels, brands, competitor brands.
Another dimension is the time series, which can be weekly or monthly: at least 5 years of monthly data and 2 years of weekly data.
The dependent variable would have to be units, not currency, due to price elasticity.

• The Theory That Would Not Die (Sharon Bertsch McGrayne)
• Bayesian Methods for Hackers - http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
• Think Bayes: Bayesian Statistics in Python - https://greenteapress.com/wp/think-bayes/
• Statistical Computing for Scientists and Engineers - https://www.zabaras.com/statistical-computing-2017
• Chris Bishop, Introduction to Bayesian Inference - http://videolectures.net/mlss09uk_bishop_ibi/?q=mlss+2009
• Statistical Rethinking - Ebook: http://xcelab.net/rmpubs/rethinking/Statistical_Rethinking_sample.pdf Videos: https://www.youtube.com/watch?v=oy7Ks3YfbDg&list=PLDcUM9US4XdM9_N6XUUFrhghGJ4K25bFc

MARTA FAJLHAUER
Email: fajlhauermarta@gmail.com
LinkedIn: https://www.linkedin.com/in/martafajlhauer/
