Introducing Agile teams to statistical analysis: it's the tool that will help them self-manage, and I introduce simple methods to measure efficacy. We also compare and contrast the traditional use of mathematics for command and control with statistics and learning for contemporary agile development and EA.
LKNA 2014 Risk and Impediment Analysis and Analytics - Troy Magennis
Software risk impact is more predictable than you might think. This session discusses similarities of uncertainty in various industries and relates this back to how we can measure and analyze impediments and risk for agile software teams.
I love the smell of data in the morning (getting started with data science) ... - Troy Magennis
Data Science 101 for software development. I know it misses the purist view of Data Science, but this is intended to get you started! First presented at Agile 2017 in Florida.
Forecasting using data workshop slides for the Deliver conference in Winnipeg October 2016. This session introduces practical exercises for probabilistic forecasting. http://www.prdcdeliver.com
Risk Management and Reliable Forecasting using Un-reliable Data (magennis) ... - Troy Magennis
To meet expectations and optimize flow, managing risk is an important part of Kanban. Anticipating and adapting to things that "go wrong" and the uncertainty they cause is the topic of this session. We look at techniques for quantifying which risks should be considered important to deal with.
Although discouraged, forecasting size, effort, staffing and cost is sometimes necessary. Of course we have to do as little of this as possible, but when we do, we have to do it well with the data available. Forecasting is made difficult by unreliable inputs to our process: the amount of work is uncertain, and the historical data we base our forecasts on is biased and tainted, so the situation seems hopeless. But it isn't. Good decisions can be made on imperfect data, and this session discusses how. It shows immediately usable and simple techniques to capture, analyze, cleanse and assess data, and then use that data for reliable forecasting.
Second and hopefully final draft of the LKCE 2014 talk.
Object Automation Software Solutions Pvt Ltd, in collaboration with SRM Ramapuram, delivered a workshop on skill development in Artificial Intelligence.
Uncertain Knowledge and Reasoning, by Mr. Abhishek Sharma, Research Scholar from Object Automation.
Making Sense of Statistics in HCI: part 2 - doing it - Alan Dix
Doing it – if not p then what
http://alandix.com/statistics/course/
In this part we will look at the major kinds of statistical analysis methods:
* Hypothesis testing (the dreaded p!) – robust but confusing
* Confidence intervals – powerful but underused
* Bayesian stats – mathematically clean but fragile
None of these is a magic bullet; all need care and a level of statistical understanding to apply.
We will discuss how these are related, including the relationship between ‘likelihood’ in hypothesis testing and conditional probability as used in Bayesian analysis. There are common issues, including the need to clearly report the numbers and tests/distributions used, avoiding cherry picking, dealing with outliers, non-independent effects and correlated features. However, there are also specific issues for each method.
Classic statistical methods used in hypothesis testing and confidence intervals depend on ideas of ‘worse’ for measures, which are sometimes obvious, sometimes need thought (one- vs. two-tailed tests), and sometimes outright confusing. In addition, care is needed in hypothesis testing to avoid classic fails such as treating non-significant as no-effect, and inflated effect sizes.
In Bayesian statistics different problems arise: the need to decide, in a robust and defensible manner, the expected likelihoods of different hypotheses before an experiment; and the dangers of common causes leading to inflated probability estimates due to a single initial fluke event or an optimistic prior.
Crucially, while all methods have problems that need to be avoided, we will see how not using statistics at all can be far worse.
Making Sense of Statistics in HCI: part 3 - gaining power - Alan Dix
Gaining power – the dreaded ‘too few participants’
http://alandix.com/statistics/course/
Statistical power is about whether an experiment or study is likely to reveal an effect if it is present. Without a sufficiently ‘powerful’ study, you risk being in the middle ground of ‘not proven’, not being able to make a strong statement either for or against whatever effect, system, or theory you are testing.
In HCI studies the greatest problem is often finding sufficient participants to do meaningful statistics. For professional practice we hear that ‘five users are enough’, but less often that this figure was based on particular historical contingencies and in the context of single iterations, not summative evaluations, which still need the equivalent of ‘power’ to be reliable.
However, power arises from a combination of the size of the effect you are trying to detect, the size of the study (number of trials/participants) and the size of the ‘noise’ (the random or uncontrolled factors).
Increasing number of participants is not the only way to increase power and we will discuss various ways in which careful design, selection of subjects and tasks can increase the power of your study albeit sometimes requiring care in interpreting results. For example, using a very narrow user group can reduce individual differences in knowledge and skill (reduce noise) and make it easier to see the effect of a novel interaction technique, but also reduces generalisation beyond that group. In another example, we will also see how careful choice of a task can even be used to deal with infrequent expert slips.
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru... - tboubez
My presentation from Velocity Europe 2013 in London: Beyond Pretty Charts…. Analytics for the cloud infrastructure.
IT Ops collect tons of data on the status of their data center or cloud environment. Much of that data ends up as graphs on big screens so ops folks can keep an eye on the behavior of their systems. But unless a threshold is crossed, behavioral issues will often fall through the cracks. Thresholds are reactive, and humans are, well, human. Applying analytics and machine learning to detect anomalies in dynamic infrastructure environments can catch these behavioral changes before they become critical.
Current tools used to monitor web environments rely on fundamental assumptions that are no longer true, such as assuming that the underlying system being monitored is relatively static or that the behavioral limits of these systems can be defined by static rules and thresholds. Thus, interest in applying analytics and machine learning to predict and detect anomalies in these dynamic environments is gaining steam. However, understanding which algorithms should be used to identify and predict anomalies accurately within all that data we generate is not so easy.
This talk will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection such as:
Understanding your data’s characteristics
The two main approaches for analyzing operations data: parametric and non-parametric methods
Simple data transformations that can give you powerful results
By the end of this talk, attendees will understand the pros and cons of the key statistical analysis techniques and walk away with examples as well as practical rules of thumb and usage patterns.
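As a minimal, generic illustration of the parametric vs. non-parametric split mentioned above (not material from the talk; the threshold of 3 and the sample numbers are arbitrary), the sketch below flags anomalies first with a Gaussian z-score and then with a median/MAD rule:

    # Illustrative sketch only: two common ways to flag anomalous metric samples.
    # The data is a plain list of floats; thresholds are conventional, not from the talk.
    import statistics

    def zscore_anomalies(values, threshold=3.0):
        """Parametric: assumes roughly Gaussian data; flags points far from the mean."""
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values) or 1e-9
        return [v for v in values if abs(v - mean) / stdev > threshold]

    def mad_anomalies(values, threshold=3.0):
        """Non-parametric: uses median and median absolute deviation, robust to outliers."""
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1e-9
        return [v for v in values if abs(v - med) / mad > threshold]

    if __name__ == "__main__":
        cpu_load = [0.31, 0.29, 0.35, 0.30, 0.33, 0.95, 0.32]  # made-up sample metrics
        print(zscore_anomalies(cpu_load), mad_anomalies(cpu_load))

On this made-up series the z-score test misses the spike because the outlier itself inflates the standard deviation, while the MAD rule flags it, which is one reason robust, non-parametric methods are attractive for noisy operations data.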
Making Sense of Statistics in HCI: From P to Bayes and Beyond – introduction - Alan Dix
Many find statistics confusing, and perhaps more so given recent publicity of problems with traditional p-values and alternative statistical techniques including confidence intervals and Bayesian statistics. This course aims to help attendees navigate this morass: to understand the debates and more importantly make appropriate choices when designing and analysing experiments, empirical studies and other forms of quantitative data.
Making Sense of Statistics in HCI: part 4 - so what - Alan Dix
So what? – making sense of results
http://alandix.com/statistics/course/
You have done your experiment or study and have your data – what next, how do you make sense of the results? In fact one of the best ways to design a study is to imagine this situation before you start!
This part will address a number of questions to think about during analysis (or design) including: Whether your work is to test an existing hypothesis (validation) or to find out what you should be looking for (exploration)? Whether it is a one-off study, or part of a process (e.g. ‘5 users’ for iterative development)? How to make sure your results and data can be used by others (e.g. repeatability, meta analysis)? Looking at the data, and asking if it makes sense given your assumptions (e.g. Fitts’ Law experiments that assume index of difficulty is all that matters). Thinking about the conditions – what have you really shown – some general result or simply that one system or group of users is better than another?
Lecture 2: Data, pre-processing and post-processing
Chapters 2,3 from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Chapter 1 from the book Mining Massive Datasets by Anand Rajaraman and Jeff Ullman
Learning Objective: Increase professional effectiveness, data management, and analytical skills
With evolving technology, many people are overloaded and overwhelmed with information and data. Businesses now have access to large amounts of feedback from internal and external sources. How do we make sense of all of the information? Is the data reliable? How can we manage and utilize the data in order to impact business goals, visions and mission? This seminar will help you turn your information overload into powerful and reliable data that you can use to meet organizational goals.
At the end of this seminar, participants will be able to:
a. Assess and categorize data and information.
b. Identify tools and techniques to organize and interpret data.
c. Explore productivity tools and techniques.
d. Examine common data management challenges and solutions.
Understanding Deep Learning Requires Rethinking Generalization - Ahmet Kuzubaşlı
My presentation of one of the ICLR 2017 best papers, by Google Brain (arxiv.org/abs/1611.03530). I believe that generalization deserves more attention as we go deeper into the over-parameterization zone.
This presentation was delivered by Dr. Mehdi Alizadeh at the attention rehabilitation workshop. To obtain the rest of the files presented at this workshop, visit the Farvardin website.
https://farvardin-group.com
We've been taught that "data science" is the esoteric domain of PhDs, but like anything else, it's easy once you understand it. This talk explains the basics of data science, covering concepts in supervised learning (including a detailed explanation of decision trees and random forests) as well as examples of unsupervised learning algorithms. Far from being a dry and academic topic, data science and machine learning are useful and practical analytical tools. (This talk is intended for a general audience.)
Topics will include:
1) An introduction to supervised learning using the popular decision tree algorithm
2) The concepts of training and scoring, and the meaning of "real time" machine learning
3) Model validation using holdout sets
4) Model complexity and overfitting; understanding bias and variance; using ensembles to reduce variance
5) An overview of unsupervised learning models including clustering, topic modeling and anomaly detection
and more!
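For readers who want to see topics 1-3 above in code, here is a minimal sketch (my own, not from the talk), using scikit-learn's bundled iris dataset: train a depth-limited decision tree and score it on a holdout set.

    # Minimal sketch: train a decision tree and score it on a holdout set.
    # Uses scikit-learn's bundled iris dataset purely for illustration.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    # Holdout validation: keep 30% of the data unseen during training.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(max_depth=3)  # limiting depth curbs overfitting (bias/variance trade-off)
    model.fit(X_train, y_train)                  # "training"
    print("holdout accuracy:", model.score(X_test, y_test))  # "scoring"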
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ... - tboubez
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
Health-Economics has been our staple evaluation method for technology and drugs. Yet, during procurement, these evaluations often lack multi-axial, nonlinear determinants of health, despite the influence such determinants have on the prognosis of individual patients through respiratory treatments, heatstroke or waterborne disease.
Health-Climate-Economics extends health-economics by adding sustainability criteria that factor climate impact into the care of patients and help quantify how much of an impact health services have on their own demand.
BarCamp Manchester 2016: Neuro, fuzzyio, logical - Axelisys Limited
A short lightning talk at BarCamp Manchester 2016 covering 3 different types of Artificial Intelligence concepts: Neural Networks, Fuzzy Logic and Logic Programming.
The talk and bonus material slides from Ethar Alali's Agile Yorkshire talk in September 2015. Covering Business Agility and Lean thinking and why talking about [no]estimates is the wrong question.
It used to be that 90% of new businesses failed in their first 2 years. Lean Startup’s explosion onto the scene provided a step-change in mainstream entrepreneurship. However, for modern startups to avoid the same fate, A/B tests need to yield credible results, especially in more uncertain environments.
How do you repeatedly design good, credible experiments to tame the unpredictable beast? This deck bridges theory and practice to provide useful tips, tools and techniques to apply to business, social media profiles or anything else you dare!
Do we in IT really know what a system is? If we answer with Servers, Applications, Hardware, we need to review our understanding, since this is far from the reality.
IT's A/B-testing and lean-startup techniques can learn a lot from experimental design and statistics. For those of you not that confident or familiar with such techniques, here is a little intro to help you on your way :)
Cloud computing has been around a while now. However, some business and technology managers still find the concepts baffling or struggle to decide how to move to the cloud.
This informal presentation introduces the basics of cloud computing and gives you the top-3 tips to decide if cloud is right for you and how to migrate effectively.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, and faster batch ingestion.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, the other tries to reduce the number of iterations, and these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether. A minimal sketch of the first idea (skipping converged vertices) appears below.
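The following Python is a rough illustration only (not the STICD algorithm itself, and not taken from the source above). It assumes a small adjacency-list graph with no dangling nodes and shows power-iteration PageRank that freezes vertices once their rank stops changing, so they are not recomputed but still feed their frozen rank to their out-neighbours.

    # Minimal sketch: power-iteration PageRank that freezes converged vertices.
    # Frozen vertices are no longer recomputed (an approximation that saves work)
    # but still contribute their last rank to neighbours.
    # Assumes every vertex has at least one out-link (no dangling nodes).
    def pagerank_skip_converged(out_links, damping=0.85, tol=1e-8, max_iter=100):
        n = len(out_links)
        rank = {v: 1.0 / n for v in out_links}
        converged = set()
        # Pre-compute in-links once, since each update pulls from incoming edges.
        in_links = {v: [] for v in out_links}
        for u, outs in out_links.items():
            for v in outs:
                in_links[v].append(u)
        for _ in range(max_iter):
            new_rank = dict(rank)
            for v in out_links:
                if v in converged:          # skip work for already-converged vertices
                    continue
                new_rank[v] = (1 - damping) / n + damping * sum(
                    rank[u] / len(out_links[u]) for u in in_links[v])
                if abs(new_rank[v] - rank[v]) < tol:
                    converged.add(v)
            rank = new_rank
            if len(converged) == n:
                break
        return rank

    # Tiny example graph: adjacency list of out-links.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank_skip_converged(graph))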
2. Analysis for Dummy’s, Dummies
• Most Agile Teams
– Can’t Identify Influential Delivery Factors Plus…
– …Over-reliance on Cycle-Time & Throughput
– Equals Shooting in the Dark!
• Little’s Law Applies Only When ‘Predictable’
• ‘Fixed’ Mathematics Doesn’t Adequately Facilitate Self-Organisation
– Morphogenesis & Chaos
• Too hard for most
• We don’t know enough (yet)
• Enterprise Mathematical Models too Hard or Based on Unrealistic Assumptions
– e.g. the Efficient Market Hypothesis, which requires a rational investor
• What can Agilists Do?
4. Traditional Mathematical Analysis
Modelled the environment in its entirety
Every variable identified and mapped
Every factor had to be understood in detail
…and managed
Fits command-and-control really well!
Provided an Exact answer
Useful comfort blanket
Exclusivity - Very few people understood it
Needed Masters & PhDs in numerate subjects
MBAs not always enough
Mathematics, Physics, Operations Research, Engineering…
5. Area of a Circle (Traditional Way)
• Given origin (h,k) & radius r
• Typically learned for GCSE
• Have to know:
– equation
– r is a factor & how to get it
– What ‘squared’ means
– Pi is a constant
– Know maths
• What if you didn’t?
Source: Google Images
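For reference, the standard formulas the slide alludes to (well-known results, not recovered from the slide image): a circle centred at (h, k) with radius r satisfies (x - h)^2 + (y - k)^2 = r^2, and its area is A = Pi x r^2. With r = 4cm this gives A = Pi x 16 ≈ 50.27cm2, the "Actual Area" quoted on the statistical slide below.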
6. Statistical Analysis
Doesn’t require exact model
Doesn’t produce an exact answer
Do you need one?
Can you rely on one?
Isn’t variable/factor centric
Though they may come out
Looks for correlations
Which tell you where else to look for more
CAREFUL! Correlations aren’t causations!
If you find a link, it doesn’t necessarily mean it’s so
Can be refined, akin to ‘learning’
Increasing number of samples in known range
…akin to reducing Kanban batch size or story size
Can also use Bayesian Inference
Fits Lean-Agility really well
A 10-year-old can often do it!
7. Area of a Circle (Statistical Way)
• Grid around the Circle
• Count Squares at least half inside the circle
• Need more accuracy? Easy! Use a finer grid!
• Typically learned at 10 years old!
Question: Take a look at the examples on the right; which grid is closer to the Actual Area?
8 x 8 x 1cm Grid
Diameter = 8 x 1cm squares = 8cm
Radius = Half diameter, i.e. 8/2 = 4cm
Area is the number of squares at least half inside the circle.
52 squares: 52 x (1 x 1) = 52cm2
20 x 20 x 0.4cm Grid
Diameter = 20 x 0.4cm squares = 8cm
Radius = Half diameter, i.e. 8/2 = 4cm
Area is the number of squares at least half inside the circle.
312 squares: 312 x (0.4 x 0.4) = 49.92cm2
Actual Area
When r = 4: Area = Pi x (4 x 4) = 50.27cm2
Image Source: Google Images
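A minimal sketch of the grid-counting idea (mine, not from the deck): it uses each cell's centre as a stand-in for the slide's "at least half inside" rule, so its counts can differ slightly from the hand-counted 52 and 312, but it shows how refining the grid tightens the estimate.

    # Estimate the area of a circle of radius 4 cm by counting grid cells.
    # A cell is counted if its centre lies inside the circle - a simple proxy for
    # the slide's "at least half inside" rule, so counts may differ slightly.
    import math

    def grid_area(radius=4.0, cells_per_side=8):
        cell = (2 * radius) / cells_per_side          # cell edge length in cm
        inside = 0
        for i in range(cells_per_side):
            for j in range(cells_per_side):
                cx = (i + 0.5) * cell - radius        # cell centre, circle centred at origin
                cy = (j + 0.5) * cell - radius
                if cx * cx + cy * cy <= radius * radius:
                    inside += 1
        return inside * cell * cell                   # number of cells x area per cell

    print(grid_area(cells_per_side=8))    # coarse 8 x 8 grid of 1 cm squares
    print(grid_area(cells_per_side=20))   # finer 20 x 20 grid of 0.4 cm squares
    print(math.pi * 4 ** 2)               # exact area, ~50.27 cm^2

Refining the grid from 8 x 8 to 20 x 20 narrows the range the estimate can be wrong by, which is the same lever as the smaller batches and shorter sprints on the next slide.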
8. Compare to Kanban
• Backlog the Tickets
• Batch together related epic tickets
• If you need more accuracy, make the batches smaller!
– …and/or sprints shorter
9. Technical Note!
• Statistical form is standard in Monte Carlo Algorithms
– Always Fast to run…
– …But ‘probably’ correct
• In any case, accurate to a particular range
• If that range is good enough, use it!
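The same estimate can be made in the classic hit-or-miss Monte Carlo style the note refers to (again a generic sketch, not from the deck): throw random points at the bounding square and scale by the hit ratio; the answer is only "probably" correct, but its accuracy range tightens as the number of samples grows.

    # Hit-or-miss Monte Carlo estimate of the same circle's area (radius 4 cm).
    # Fast to run, "probably" correct: the estimate converges as samples increase.
    import math
    import random

    def monte_carlo_area(radius=4.0, samples=100_000, seed=42):
        rng = random.Random(seed)
        hits = 0
        for _ in range(samples):
            x = rng.uniform(-radius, radius)
            y = rng.uniform(-radius, radius)
            if x * x + y * y <= radius * radius:
                hits += 1
        square_area = (2 * radius) ** 2
        return square_area * hits / samples

    for n in (100, 10_000, 1_000_000):
        print(n, monte_carlo_area(samples=n))     # estimates tighten around Pi * r^2
    print(math.pi * 16)                           # ~50.27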
11. Definition of Good Enough?
Definitions
What I tell Managers: “Any measure with an accuracy matching your ability to change is good enough.”
What I tell Techies: “Sampling twice as frequently as the change is good enough.”
- Ethar Alali
• Any more accurate/frequent is waste
• Any less and you can’t make decisions
– So a risk mitigation strategy may be necessary
12. Example: CD Quality Sound
In ye olden days we had these
• 44.1kHz sample rate
• Stereo Sound
• 16-bit Digital Sampling
• CD stores 650MB
Compact Disc
Image Source: Google Images
13. Example: Compact Disc Encoding
Focusing on Useful Data Storage (ignoring Reed-Solomon error correction & detection)
A signed 16-bit number can segment audio into ~65,536 parts (a resolution of 1/65,536)
44.1kHz means it takes one 16-bit number in this range every 1/44,100th of a second
Stereo sound means two sets of microphones and hence 2 sample channels
Total storage needed for a 3 minute song:
• 44,100 samples x 2 bytes per sample x 2 channels x 3 minutes x 60 seconds = 31.752MB raw per song.
• Album = 20 songs = 635 MB of digital data, which fills a 650MB CD
Great for music :-)
Attribution: Image Courtesy of Grahammitchell.com
14. What About: Telephone Voice on CD?
Voice on telephones is mono, not stereo – it needs only one channel!
Telephone quality only changes pitch within about 3kHz at worst!
Voice doesn’t have the refined nature of music, so it can be recorded in 8-bit (256 parts)
3kHz means it takes one 8-bit number in this range every 1/3,000th of a second
Total storage needed for a 3 minute conversation:
• 3,000 samples x 1 byte per sample x 1 channel x 3 minutes x 60 seconds = 540KB raw.
• Album = 20 songs = 10.8 MB of digital data
Stored on a 650MB CD, you have almost 640MB of WASTE!
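The arithmetic on the last two slides can be reproduced directly; a small sketch (using 1 MB = 1,000,000 bytes, matching the figures quoted on the slides):

    # Reproduce the storage arithmetic from slides 13 and 14.
    def raw_audio_bytes(sample_rate_hz, bytes_per_sample, channels, minutes):
        return sample_rate_hz * bytes_per_sample * channels * minutes * 60

    cd_song = raw_audio_bytes(44_100, 2, 2, 3)     # 16-bit stereo at 44.1 kHz, 3-minute song
    phone_call = raw_audio_bytes(3_000, 1, 1, 3)   # 8-bit mono at 3 kHz, 3-minute conversation

    print(cd_song / 1e6, "MB per CD-quality song")        # 31.752 MB
    print(20 * cd_song / 1e6, "MB per 20-song album")     # ~635 MB, fills a 650 MB CD
    print(phone_call / 1e3, "KB per phone-quality call")  # 540 KB
    print(20 * phone_call / 1e6, "MB for 20 such calls")  # 10.8 MB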
15. What If: We sampled less?
Attribution: Image Courtesy of Grahammitchell.com
• Not an Accurate Picture!
Note: the dashed red edge case samples exactly at transition points. In real scenarios this never happens with sound, since the change isn’t periodic.
RED = 2/3 as fast sampling
AMBER = Twice as frequent sampling
GREEN = 4 times as frequent
16. Which is Closer to Actual?
RED = 2/3 as fast sampling
AMBER = Twice as frequent sampling
GREEN = 4 times as frequent
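As a rough numeric companion to the picture above (my own illustration, not from the deck; the 1 Hz signal, the phase offset and the 0.67x/2x/4x rates are arbitrary stand-ins for the RED/AMBER/GREEN ratios), the sketch below samples a sine wave at each rate, rebuilds it by joining the samples with straight lines, and reports the worst-case error against the true signal:

    # Rough illustration: sample a 1 Hz sine wave at different rates, rebuild it by
    # joining the samples with straight lines, and measure the worst-case error.
    import math

    def worst_case_error(signal_hz=1.0, sample_hz=2.0, duration_s=2.0, checks=2000):
        phase = 0.7                                   # avoid the 'transition point' edge case
        true = lambda t: math.sin(2 * math.pi * signal_hz * t + phase)
        period = 1.0 / sample_hz
        times = [i * period for i in range(int(duration_s / period) + 2)]
        samples = [true(t) for t in times]
        worst = 0.0
        for k in range(checks):
            t = duration_s * k / checks
            i = min(int(t / period), len(times) - 2)
            frac = (t - times[i]) / period
            approx = samples[i] + frac * (samples[i + 1] - samples[i])  # linear interpolation
            worst = max(worst, abs(approx - true(t)))
        return worst

    for rate in (0.67, 2.0, 4.0):     # ~2/3x, 2x and 4x the signal frequency
        print(rate, round(worst_case_error(sample_hz=rate), 2))

Sampling slower than the signal changes leaves large worst-case errors; doubling and quadrupling the rate shrinks them, which is the picture the slide is drawing.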
17. Traditional Samples in Business
• Annual Accounts
– PLCs have mid-term or quarterly accounts
– If they want to be more agile, make it monthly
• Regulatory Reporting
• Charity Commission Reports
• Franchise Brand Inspections
– Once every 2-3 years, inspected annually
• FCA
• …
Identify: Easy! Usually associated with an ‘Audit’ of some kind.
• Self-governing/managing teams Sample themselves!
19. Causation
• One thing occurs as a deterministic consequence of something else
– Fingers in a high-voltage socket cause death
• Link a number of causes to establish behaviour
• Needs Two Factors
– Functional process, including all variables
– Initial condition (aka Pre-condition)
• ‘Given’ in Gherkin syntax
• Great for Forecasting…
– As long as the causal chain always happens
• Near useless in chaotic environments
– Depending on when you look at it
• Initial condition may not be known
• Sensitive dependence + Feedback injects uncertainty!
• Code runs deterministically; teams normally work chaotically…
• …until they reach predictability, then Little’s Law can apply
20. Example: Causation
• y = 2 + x <- function/process
• x = 3 <- Initial [pre]condition
• y = 5 <- Final outcome/post-condition
• Post-condition = acceptance test criteria
– ‘Then’ in Gherkin Syntax
• Really easy for code! Mostly predictable
– Fits Gherkin, OCL, VDM, Z etc. perfectly
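The same example written as a plain unit test (a sketch; the Given/When/Then comments map onto the Gherkin steps mentioned above, and the function name `process` is just an illustrative stand-in):

    # The causation example as an acceptance-style test: precondition in, postcondition out.
    def process(x):
        return 2 + x          # the function/process under test

    def test_process():
        x = 3                 # Given: the initial (pre)condition
        y = process(x)        # When: the process runs
        assert y == 5         # Then: the post-condition / acceptance test criterion

    test_process()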
21. Correlation
• Aims to find [statistical] links between samples
– When causal links not known or samples appear ‘random’
– Also shows strength of relationship
• First step in Factor Analysis
– Locate influential factors for dependent variables
• Cycle-time
• Throughput
• Value delivered
• Can be plotted on graph
• Needs Manipulation to Fit Gherkin :(
• All aim to locate where to sniff next!
22. Correlations Can Be Seen
• Correlations can be modelled with Linear Regression
• Seen when an increase in one variable increases/decreases another
Source: Scatterplot Image from knottwiki teaching
Source: Image from Utah.edu Mesowest weather
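A small sketch of the correlate-then-regress step (with obviously made-up numbers for blockers per sprint and average cycle time, purely to show the mechanics):

    # Sketch of the correlation -> linear regression step, using made-up illustrative
    # data: blockers raised per sprint vs. average cycle time in days. Not real team data.
    import numpy as np

    blockers   = np.array([1, 2, 2, 3, 4, 5, 6, 7])          # hypothetical samples
    cycle_time = np.array([3.1, 3.4, 3.9, 4.2, 5.0, 5.8, 6.1, 7.2])

    r = np.corrcoef(blockers, cycle_time)[0, 1]               # Pearson correlation
    slope, intercept = np.polyfit(blockers, cycle_time, 1)    # least-squares straight line

    print(f"correlation r = {r:.2f}")                         # strength of the (linear) link
    print(f"cycle_time ~= {slope:.2f} * blockers + {intercept:.2f}")
    # Remember: a strong r says where to look next, not that blockers *cause* long cycle times.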
24. Example: Burnage Library
• Manchester City Council claim: Library closure based on 11 variables for deprivation
– Tasked with saving £80 million a year
• Correlation matrix showed strong correlations between Population of Library catchment area &:
– Total Library Visitors – larger catchments correlate with more library visitors
– Active users – larger catchments correlate with more active users
– Participation in Events
– …
• But all factors correlated with each other!
25. Dependent v Independent Correlation
Very high correlations of the dependent combined score & other allegedly independent factors with catchment population
27. Correlation: Deprivation
Q: Was deprivation a factor? A: Deprivation wasn’t a significant consideration, despite the claims of the Council
28. Example: Burnage Library Conclusion
• Basics showed that the claims weren’t supported
– Could have done better with a Null Hypothesis test
• Interdependence of allegedly independent variables meant the weighting of catchment area was 5x more important than deprivation
– Not likely based on the deprivation index, as was claimed
– Potentially hinting at a political decision
• Controversial ;)
29. NEXT TIME: Agile Teams
• In Part 2, we examine how this applies to teams.
• In summary:
– Gather Cycle-time, Throughput & Value delivered across a few sprints
– Match & Correlate Respective:
• Bugs
• Blockers
• Days of week
• Team size
• Story
• Anything else you already have data for
• Don’t
– Make too many inferences early on
30. Thanks for Viewing
Further Reading
Business Planning Example – Monte Carlo Simulation Tutorial in Excel: http://www.solver.com/monte-carlo-simulation-example
“Statistics in Psychosocial Research, Lecture 8: Factor Analysis I”, Johns Hopkins University: http://ocw.jhsph.edu/courses/statisticspsychosocialresearch/pdfs/lecture8.pdf
“Correlation & Dependence”, Wikipedia: http://en.wikipedia.org/wiki/Correlation_and_dependence
Ethar Alali @EtharUK @Dynacognetics
Managing Director & Chief Architect
Polymath-MathMo. Programming since the age of 9. TOGAF 9 Certified, change agent.
Blog: GoadingtheITGeek.blogspot.co.uk
About Us
Specialist ICT Strategists & Advisors.
Member of the HiveMind Network for some of the biggest household and corporate multi-nationals.
Accredited Growth Voucher Advisors, certified to deliver IT & Web Growth Consultancy as part of the government’s Growth Voucher Scheme.
Accreditations & Associations