This document discusses survival analysis and its application to analyzing the departure dynamics of Wikipedia editors. It begins by defining survival analysis and its goal of modeling time-to-event data using techniques that account for censoring. A case study is presented on analyzing data from 110,000 Wikipedia editors to determine who is likely to stop editing, how long they will continue editing, and why they stop. Statistical techniques like the Kaplan-Meier estimator, Cox proportional hazards models, and adjusted survival curves are used to analyze editing durations and identify covariates that impact the hazard rate of editors stopping contributions.
4. Time-To-Event Data
• Survival Analysis is a branch of statistics which deals with the modelling of time-to-event data.
– The outcome variable of interest is the time until an event occurs.
• death, disease, failure
• recovery, marriage
– It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.
5-7. Y, X
• How to build a probabilistic model of Y?
• How to build a probabilistic model of Y given X?
8. Censoring
• A key problem in survival analysis
– It occurs when we have some information about an individual’s survival time, but we don’t know the survival time exactly.
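To make this concrete, here is a minimal sketch (in Python, with made-up numbers) of how right-censored data are typically represented: each subject gets a recorded duration plus a flag saying whether the event was actually observed or the subject was censored first.

```python
import numpy as np

# Illustrative durations (e.g., weeks on study); the numbers are made up.
durations = np.array([5.0, 6.0, 6.0, 2.5, 4.0, 4.0, 9.0, 12.0])

# 1 = the event (e.g., death) was observed; 0 = right-censored,
# i.e., the subject left the study (or the study ended) while still
# event-free, so we only know the true event time exceeds the duration.
observed = np.array([1, 0, 1, 1, 1, 0, 1, 1])
```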
10. Y, X
Options:
1) Wait for those patients to die?
2) Discard the censored data?
3) Use the censored data as if they were not censored?
4) ……
11. Goals
• Survival Analysis attempts to answer questions such as
– What is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die?
– Can multiple causes of death be taken into account?
– How do particular circumstances or characteristics increase or decrease the odds of survival?
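As an illustration of the first question, the Kaplan-Meier estimator computes the survival function from (duration, event) pairs while properly accounting for censoring. A minimal sketch using the third-party lifelines library (one option alongside R's 'survival' package and Python's 'statsmodels', which the talk's software slide lists), reusing the made-up numbers above:

```python
import numpy as np
from lifelines import KaplanMeierFitter  # pip install lifelines

durations = np.array([5.0, 6.0, 6.0, 2.5, 4.0, 4.0, 9.0, 12.0])
observed = np.array([1, 0, 1, 1, 1, 0, 1, 1])  # 0 = right-censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)

# S(t): estimated fraction of the population surviving past each time t.
print(kmf.survival_function_)
print(kmf.median_survival_time_)  # time at which S(t) drops to 0.5
```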
12.
• Censoring of data
• Comparing groups
– (1 treatment vs. 2 placebo)
• Confounding or Interaction factors
– Log WBC
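Comparing groups such as the treatment vs. placebo setup above is commonly done with a log-rank test. A hedged sketch with lifelines; the two groups' arrays are hypothetical, not the actual study data:

```python
from lifelines.statistics import logrank_test

# Hypothetical data: group 1 = treatment, group 2 = placebo.
t_treat = [6, 6, 6, 7, 10, 13, 16, 22, 23]
e_treat = [1, 1, 1, 1, 0, 1, 1, 1, 1]   # 0 = censored
t_plac  = [1, 1, 2, 2, 3, 4, 4, 5, 5]
e_plac  = [1, 1, 1, 1, 1, 1, 1, 1, 1]

result = logrank_test(t_treat, t_plac,
                      event_observed_A=e_treat, event_observed_B=e_plac)
print(result.test_statistic, result.p_value)
```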
14. The Data Are There
• Events meaningful to online marketing
– Time to Clicking the Ad
– Informational: Time to Finding the Wanted Info
– Transactional: Time to Buying the Product
– Social: Time to Joining/Leaving the Community
– ……
Time Matters!
15. Evidence-Based Marketing
• Let’s work as (real) doctors
– Users = Patients
– Advertisement (Marketing) = Treatment
Survival Analysis brings the time dimension back to the centre stage.
23. Departure Dynamics
• Who are likely to “die”?
• How soon will they “die”?
• Why do they “die”?
“live” = stay in the editors’ community = keep editing
“die” = leave the editors’ community = stop editing (for 5 months)
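A sketch of how such (duration, event) pairs might be derived from raw edit logs under the 5-month inactivity rule. The column names (editor_id, edit_time), the snapshot date, the 150-day approximation of 5 months, and the pandas-based approach are all illustrative assumptions, not the talk's actual pipeline:

```python
import pandas as pd

INACTIVE = pd.Timedelta(days=150)  # roughly 5 months

# Hypothetical edit log: one row per edit.
log = pd.DataFrame({
    "editor_id": [1, 1, 1, 2, 2],
    "edit_time": pd.to_datetime([
        "2010-01-01", "2010-02-10", "2010-03-01",
        "2010-01-15", "2011-06-20",
    ]),
})
snapshot = pd.Timestamp("2011-09-01")  # when the data were collected

per_editor = log.groupby("editor_id")["edit_time"].agg(["min", "max"])
per_editor["duration_days"] = (per_editor["max"] - per_editor["min"]).dt.days
# "Dead" (stopped editing) if no edit within 5 months of the snapshot;
# otherwise the editor is still alive and the observation is censored.
per_editor["event"] = (snapshot - per_editor["max"]) > INACTIVE
print(per_editor[["duration_days", "event"]])
```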
33. Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably attributable to
– its ability to capture the complex nonlinear relationship between the target variable and the features,
– its insensitivity to different feature value ranges as well as outliers, and
– its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b).
• GBT vs RF
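A hedged sketch of the technique using scikit-learn's GradientBoostingRegressor, showing where the two regularisation mechanisms named above appear as hyperparameters; the feature matrix and targets below are placeholders, not the WikiChallenge features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # placeholder behavioural features
y = X[:, 0] ** 2 + rng.normal(size=500)  # placeholder nonlinear target

gbt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,  # shrinkage
    subsample=0.8,       # stochastic gradient boosting (subsampling)
    max_depth=3,
)
gbt.fit(X, y)
print(gbt.score(X, y))
```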
38. Final Result
• The 2nd best valid algorithm in the WikiChallenge
– RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution
– Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features
– WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
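For reference, RMSLE (root mean squared logarithmic error), the WikiChallenge metric quoted above, can be computed as follows; the sample arrays are made up:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error:
    sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([3, 0, 12], [2, 1, 10]))  # made-up example
```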
56. Hazard Function
Of those that survive, at what rate will they die?
The instantaneous potential per unit time for the event to occur, given that the individual has survived to time t.
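Stated formally (the standard definition, added here for completeness): if T is the survival time with density f(t) and survival function S(t) = P(T > t), then

```latex
h(t) \;=\; \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \;=\; \frac{f(t)}{S(t)}
```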
60. Conclusions
• For customary Wikipedia editors,
– the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days);
– there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases;
– more active editors tend to keep active in editing for a longer time.
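A sketch of fitting a Weibull survival model to censored durations with lifelines' WeibullFitter, one way to obtain the kind of fit and median lifetime reported above; the data here are placeholders, not the Wikipedia dataset. lifelines parameterises S(t) = exp(-(t/λ)^ρ), so the median lifetime is λ·(ln 2)^(1/ρ).

```python
import numpy as np
from lifelines import WeibullFitter

# Placeholder editor lifetimes in days (not the actual Wikipedia data).
durations = np.array([10, 35, 53, 53, 80, 120, 200, 365])
observed = np.array([1, 1, 1, 0, 1, 1, 0, 0])  # 0 = still active (censored)

wf = WeibullFitter()
wf.fit(durations, event_observed=observed)
print(wf.lambda_, wf.rho_)       # scale and shape of S(t) = exp(-(t/lambda)^rho)
print(wf.median_survival_time_)  # lambda * ln(2) ** (1 / rho)
```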
66. Semi-Parametric
• The semi-parametric property of the Cox model => its popularity
– The baseline hazard is unspecified
– Robust: it will closely approximate the correct parametric model
– Using a minimum of assumptions
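A sketch of fitting a Cox proportional hazards model with lifelines' CoxPHFitter; the covariate names and values are placeholders. The baseline hazard is left unspecified, exactly as described above: only the covariate coefficients enter the partial likelihood, which depends on the order of event times rather than their exact values.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Placeholder data: duration, event flag, and two illustrative covariates.
df = pd.DataFrame({
    "T": [5, 6, 6, 2, 4, 4, 9, 12],
    "E": [1, 0, 1, 1, 1, 0, 1, 1],
    "edits_per_week": [3, 10, 1, 0.5, 2, 8, 4, 15],
    "registered_2007": [0, 1, 0, 0, 1, 1, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E")
cph.print_summary()  # hazard ratio for each covariate
```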
75. Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
– He was struck by lightning 7 times
• 1942 (lost big-toe nail)
• 1969 (lost eyebrows)
• 1970 (left shoulder seared)
• 1972 (hair set on fire)
• 1973 (hair set on fire & legs seared)
• 1976 (ankle injured)
• 1977 (chest & stomach burned)
– He committed suicide in September 1983.
76. A Lot More To Do
• Multiple Occurrences of “Death”
– Recurrent Event Survival Analysis (e.g., based on Counting Process)
• Multiple Types of “Death”
– Competing Risks Survival Analysis
77. Software Tools
• R
– The ‘survival’ package
• Matlab
– The ‘statistics’ toolbox
• Python
– The ‘statsmodels’ module (statsmodels.duration), or the dedicated ‘lifelines’ package
78. References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR), abs/1110.5051, Oct 2011. http://goo.gl/s2Dex