Kaggle is a community of nearly 400,000 data scientists who have built almost 2 million machine learning models to participate in our competitions. Data scientists come to Kaggle to learn, collaborate, and develop the state of the art in machine learning. This talk covers some of the lessons we have learned from the Kaggle community.
From an Olympic-sized thought-powered light show to your very own brain sensing headband, InteraXon CEO and neuroscientist Ariel Garten speaks to the sensor revolution and to the democratization of brain sensing technology. With the launch of their first consumer-ready product, Muse: the brain sensing headband, Ariel shares how more and more people are turning to brain sensing tech to gain immediate insight and personalized data to track and explore their inner technologies and enable them to do more with their minds than they ever thought possible.
• Ariel Garten - Co-Founder and CEO, InteraXon
New York eHealth Collaborative Digital Health Conference
November 18, 2014
Meaningful (meta)data at scale: removing barriers to precision medicine research - Nolan Nichols
Randomized controlled trials (RCTs) are the gold standard for evaluating therapeutics in patient populations. The data collected during RCTs include a wealth of clinical measures, biomarkers, and tissue samples – the analysis of which can lead to the approval of new medicines that improve the lives of patients. The secondary use of these data can also fuel the discovery of novel targets and biomarkers that support precision medicine, but a lack of metadata standards creates substantial barriers to reuse.
In this talk, I will discuss the challenges that arise when aggregating diverse types of data from a large number of RCTs, and present a case study on how to apply (meta)data standards for the scalable curation and integration of these data into an analysis-ready form.
Elsevier Medical Graph – Machine Learning for Precision Medicine - Rising Media Ltd.
Elsevier Health Analytics is developing the Medical Knowledge Graph, which represents correlations between diseases, and between diseases and treatments. On a dataset of six million anonymized patients, observable over six years, we built more than 2,000 models that forecast the development of diseases. Each model is adjusted for more than 3,000 covariates, using a boosting algorithm with variable selection. The betas of the selected variables were extracted and tested for causality and significance, and from these the first version of the Medical Graph was built, with more than 2,000 disease nodes and 25,000 effect edges. The graph is currently being tested in practice, with the goal of giving physicians patient-specific decision support for treatment.
An example of a high-ROI data science project I carried out from start to finish. Please contact me if you are a hiring manager at GOOG, AMZN, McKinsey/BCG/Bain, or Booz Allen Hamilton and need someone who can do something like this!
Impact.Tech "Statistical Literacy for Deep Tech" - Impact.Tech
Understanding how to effectively discuss and interpret statistics and scientific data is incredibly important for both investors and founders. This seminar is meant to arm investors with basic statistical literacy when deciding to partner with a company during due diligence. It is also meant to help founders understand how investors assess statistics and scientific data. Increasing literacy and comfort with scientific terminology among the broader community will enable investors to better communicate with and support these founders.
Using life science case studies, this seminar will communicate in clear terms some of the most important measurements and tests applied by deep tech start-ups, such as: sensitivity vs specificity, false positive vs negative rate, prospective vs retrospective studies, multiple hypothesis corrections, regression and other basic statistical models (p-value, t-test, etc).
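Two of the measures listed above, sensitivity and specificity, reduce to simple ratios over a diagnostic test's confusion matrix. A minimal sketch (the counts below are made up for illustration):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate: fraction of sick patients the test catches
    specificity = tn / (tn + fp)   # true negative rate: fraction of healthy patients correctly cleared
    return sensitivity, specificity

# Hypothetical counts: 90 sick patients flagged, 10 missed; 80 healthy cleared, 20 false alarms.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.90, specificity=0.80
```

Note that the false negative rate is simply 1 - sensitivity, and the false positive rate is 1 - specificity.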
This seminar will be produced and presented by Noel Jee, a Principal at Illumina Ventures with a focus in therapeutics and diagnostics. Prior to joining the fund, Noel worked at L.E.K. Consulting as a management consultant specializing in the life sciences. He has consulted on strategy engagements for companies in the pharmaceuticals, biotech, and diagnostics industries. He obtained a dual B.S. degree from the University of Maryland College Park, and his PhD in Chemistry and Chemical Biology from the University of California San Francisco.
Using Spark in Healthcare Predictive Analytics in the OR - Data Science Pop-u... - Domino Data Lab
The prevailing issue with operating room (OR) scheduling in a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waits for patients awaiting procedures. Using multivariate linear regression with Spark MLlib, we will show how hospitals can predict available OR block times, resulting in better OR utilization and shorter wait times for patients. Presented by Denny Lee, Data Scientist and Evangelist at Databricks.
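The underlying technique is ordinary multivariate linear regression. The talk uses Spark MLlib; the sketch below shows the same idea with plain NumPy least squares on synthetic data, with made-up feature names, purely to illustrate the modeling setup:

```python
import numpy as np

# Illustrative sketch only: the talk's actual pipeline uses Spark MLlib.
rng = np.random.default_rng(0)

# Synthetic case history: [scheduled_minutes, surgeon_avg_overrun, case_complexity]
X = rng.uniform([30, 0, 1], [240, 60, 5], size=(200, 3))
true_w = np.array([1.05, 0.8, 12.0])           # pretend relationship
y = X @ true_w + rng.normal(0, 10, size=200)   # actual minutes the OR was occupied

# Fit ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict how long a new 120-minute scheduled case will actually run.
new_case = np.array([1.0, 120, 15, 3])
predicted_minutes = new_case @ coef
```

The predicted durations can then be compared against scheduled block lengths to spot blocks likely to free up early.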
abhealth.us working with Videntity.com and NewWave.io have created ShareMyHealth and VerifyMyIdentity to enable Medicaid Beneficiaries to share their health data from the local Health Information Exchange (HIE) using FHIR and #BlueButton 2.0 style sharing. Everything is open source and built using standards (OAuth2.0, OpenID Connect and FHIR) and available from GitHub.com/transparenthealth
Instead of talking about artificial intelligence at the organizational level in hospitals and research laboratories, the focus for non-machine-learning practitioners should be on understanding the data pipelines and what is involved around model training.
alternative download link:
https://www.dropbox.com/s/9tv673sxkxcnojj/dataStrategyForOphthalmology.pdf?dl=0
Open Source Pharma: Crowd computing: A new approach to predictive modeling - Open Source Pharma
Presentation about "Predictive in silico models," given by Joerg Bentzien at the Open Source Pharma Conference. The event took place at Rockefeller Foundation Bellagio Center in July 2014.
Joerg Bentzien Bio:
http://www.opensourcepharma.net/participants/jorg-bentzien
Conference Agenda (see Day 1, Session 2):
http://www.opensourcepharma.net/agenda.html
ML practitioners and advocates are increasingly finding themselves becoming gatekeepers of the modern world. The models you create have the power to get people arrested or vindicated, get loans approved or rejected, determine the interest rate charged on such loans, decide who is shown to you in your long list of pursuits on Tinder, what news you read, and who gets called for a job phone screen or even a college admission... the list goes on. My goal in this talk is to summarize the kinds of disparate outcomes caused by cargo-cult machine learning, and recent academic efforts to address some of them.
These slides use concepts from my (Jeff Funk) course, Biz Models for Hi-Tech Products, to analyze the business model of Kaggle's crowdsourcing service for data analytics. Kaggle connects data scientists with organizations that have data analysis problems. Kaggle helps organizations define their data analytic problems, present them to data scientists, and organize and evaluate competitions between data analytic solutions. Its data ensemble technique also evaluates the effectiveness of the various solutions. These slides describe the specific value proposition for organizations and data scientists, along with other aspects of the business model such as the method of value capture, scope of activities, and method of strategic control.
Frankie Rybicki slide set for Deep Learning in Radiology / Medicine - Frank Rybicki
These are my #AI slides for medical deep learning, using #radiology and medical imaging examples. Please use and modify them to teach your own group about medical AI.
Counter Intuitive Machine Learning for the Industrial Internet of Things - June Andrews
The Industrial Internet of Things (IIoT) is the infrastructure and data flow built around the world's really valuable things like airplane engines, medical scanners, nuclear power plants, and oil pipelines. These machines and systems require far greater uptime, security, governance, and regulation than the IoT landscape based around consumer activity. In the IIoT, the cost of being wrong can be as dramatic as the catastrophic loss of life on a massive scale. Nevertheless, given the growing scale powered by the digitalization of industrial assets, there is clearly an increased role for machine learning to help automate and augment human decision making for the IIoT. It is against this backdrop that traditional machine learning techniques must necessarily be adapted and new approaches must be innovated. We see industrial machine learning as distinct from consumer machine learning and in this talk we will cover the counterintuitive changes of featurization, metrics for model performance, and human-in-the-loop design changes for using machine learning in an industrial environment.
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science... - Edureka!
This Edureka Random Forest tutorial will help you understand the basics of the random forest machine learning algorithm. It is ideal for both beginners and professionals who want to learn or brush up on their data science concepts, and it covers random forest analysis along with examples. Below are the topics covered in this tutorial:
1) Introduction to Classification
2) Why Random Forest?
3) What is Random Forest?
4) Random Forest Use Cases
5) How Does Random Forest Work?
6) Demo in R: Diabetes Prevention Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
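The tutorial's demo is in R; for reference, the same random-forest classification workflow can be sketched in Python with scikit-learn. The "diabetes risk" features and threshold rule below are synthetic stand-ins, not the tutorial's dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Made-up features: [glucose, bmi, age]; label follows a simple threshold rule.
X = rng.uniform([70, 18, 20], [200, 45, 80], size=(500, 3))
y = ((X[:, 0] > 140) | (X[:, 1] > 35)).astype(int)

# Hold out a test set, fit an ensemble of 100 trees, and measure accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)   # high, since the rule is axis-aligned and learnable
```

Each tree sees a bootstrap sample and a random feature subset; the forest's prediction is the majority vote, which is what makes it robust to individual-tree overfitting.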
Data-driven models for efficient diagnosis and disease management. From Academia to Startups.
Talk given at Crabb Lab Meeting, City University, London UK – Wed 23 August 2017
Most data scientists focus on predictive (i.e., supervised) models, yet real growth depends on estimating the effect of an action and optimizing action policies. To this end, I will present causal inference and related packages.
There are three layers of analytics: descriptive (BI), predictive (supervised modeling), and prescriptive. The latter, less-known layer focuses on answering the most important business questions, for example, "what was the effect of giving a discount?" or "whom should we call first?" In this talk, we will first discuss which frameworks are used to answer these questions, namely causal inference and reinforcement learning. Then we will take a deep dive into causal inference and why it is important. Last but not least, we will present some code.
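To make the discount example concrete, here is a minimal sketch of why naive predictive-style comparison fails and how a simple causal adjustment fixes it. This is an illustrative NumPy toy (regression adjustment on simulated data), not the speaker's actual packages:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
confounder = rng.normal(size=n)                 # e.g. customer engagement
treated = (rng.normal(size=n) + confounder > 0).astype(float)  # discount given
# Outcome: the true effect of the discount is +2.0; engagement also raises it.
outcome = 2.0 * treated + 3.0 * confounder + rng.normal(size=n)

# Naive difference in means is biased: treated customers differ at baseline.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Regressing on the treatment AND the confounder recovers the causal effect.
A = np.column_stack([np.ones(n), treated, confounder])
coef, *_ = np.linalg.lstsq(A, outcome, rcond=None)
ate_estimate = coef[1]   # close to the true effect of 2.0, unlike `naive`
```

The gap between `naive` and `ate_estimate` is exactly the prescriptive-vs-predictive distinction the talk is about: predicting the outcome of treated customers is not the same as estimating what the treatment caused.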
Natural Language Processing on Non-Textual Data - gpano
Talk by Casey Stella, presented at the SF Data Mining Hadoop Summit Meetup, on June 8, 2015. Notebook available at https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/ipython/clinical2vec.ipynb
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation that decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method, where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contains no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace is a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub of its kind for data enthusiasts to collaborate and innovate. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools, so you can effortlessly explore, discover, and access the data you need and focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By combining distributed ledger technology with rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and can thus also reduce iteration time. Road networks often contain chains that can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
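For reference, all of these optimizations accelerate the same baseline: power-iteration PageRank. A minimal sketch of that baseline (the graph and function name are illustrative; dangling nodes are handled by spreading their rank uniformly):

```python
import numpy as np

def pagerank(out_links, n, damping=0.85, tol=1e-10):
    """Power-iteration PageRank. `out_links` maps each vertex 0..n-1 to its target list."""
    ranks = np.full(n, 1.0 / n)
    while True:
        next_ranks = np.full(n, (1.0 - damping) / n)   # teleport term
        for u, targets in out_links.items():
            if targets:                                # distribute rank along out-links
                next_ranks[targets] += damping * ranks[u] / len(targets)
            else:                                      # dangling node: spread everywhere
                next_ranks += damping * ranks[u] / n
        if np.abs(next_ranks - ranks).sum() < tol:     # L1 convergence check
            return next_ranks
        ranks = next_ranks

# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
r = pagerank({0: [1, 2], 1: [2], 2: [0]}, n=3)
```

The convergence check over all vertices on every pass is exactly the per-iteration work that skipping converged vertices, short-circuiting chains, and component-wise (STICD-style) ordering aim to cut down.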
2. Example Kaggle competitions:
GE Flight Quest 2 - optimize flight routes based on weather & traffic ($250,000, 122 teams)
Hewlett Foundation: Automated Essay Scoring - develop an automated scoring algorithm for student-written essays ($100,000, 155 teams)
Allstate Purchase Prediction Challenge - predict which insurance options customers will purchase ($50,000, 1,570 teams)
Merck Molecular Activity Challenge - help develop safe and effective medicines by predicting molecular activity ($40,000, 236 teams)
Higgs Boson Machine Learning Challenge - use ATLAS experiment data to identify the Higgs boson ($13,000, 1,302 teams)
3. The Kaggle Approach

Training Data:
Age    Income     Default
58     $95,824    True
73     $20,708    False
59     $82,152    False
66     $25,334    True

Test Data:
Age    Income     Default
73     $53,445    ?
61     $36,679    ?
47     $90,422    ?
44     $79,040    ?
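The slide's setup is the standard supervised-learning split: fit a model on the labelled training rows, then predict the missing Default labels for the test rows. A minimal sketch of that workflow (the choice of a decision tree here is illustrative, not the slide's model):

```python
from sklearn.tree import DecisionTreeClassifier

# The labelled rows from the slide's training table: [age, income].
train_X = [[58, 95824], [73, 20708], [59, 82152], [66, 25334]]
train_y = [True, False, False, True]                 # Default column

# The unlabelled rows from the test table, whose Default values are unknown.
test_X = [[73, 53445], [61, 36679], [47, 90422], [44, 79040]]

model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
predictions = model.predict(test_X)   # one True/False prediction per test row
```

In a Kaggle competition, these test labels are withheld; submissions are scored against them on the leaderboard.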
5. Mapping Dark Matter: competition progress
[Chart: accuracy (lower is better), ~.0170 down to ~.0150, Week 1 through end of competition]
Martin O'Leary - PhD student in Glaciology, Cambridge U
6. "In less than a week, Martin O'Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms"
"The world's brightest physicists have been working for decades on solving one of the great unifying problems of our universe"
7. Mapping Dark Matter: competition progress
[Chart: accuracy (lower is better), ~.0170 down to ~.0150, Week 1 through end of competition]
Martin O'Leary - PhD student in Glaciology, Cambridge U
Marius Cobzarenco - grad student in computer vision, UC London
Ali Haissaine & Eu Jin Loc - signature verification, Qatar U & grad student @ Deloitte
deepZot (David Kirkby & Daniel Margala) - particle physicist & cosmologist
Other
8. We can work with difficult data

EXAMPLE ESSAY QUESTION:
We all understand the benefits of laughter. For example, someone once said, "Laughter is the shortest distance between two people." Many other people believe that laughter is an important part of any relationship. Tell a true story in which laughter was one element or part.
9. Mayo Clinic: seizure detection from EEG readings

The winning model correctly predicted seizures 82% of the time. Until that point, researchers had struggled to develop an algorithm that did better than chance.
10. We've worked with many of the world's largest companies

[Client logos by sector: Healthcare & Pharma, Consumer Internet, Finance, Industrial, Consumer Marketing, Oil & Gas - including a $50B+ beverage co., a global bank, a top credit card issuer, and top 5 and top 20 E&P companies]