The Present and Future of Digital Medicine:
Focusing on Clinical Neurophysiology
Professor, SAHIST, Sungkyunkwan University
Director, Digital Healthcare Institute
Yoon Sup Choi, Ph.D.
Disclaimer
I disclose that I have financial interests, such as equity stakes and advisory relationships, with the companies above.
Startups
Venture capital
“It's in Apple's DNA that technology alone is not enough. 

It's technology married with liberal arts.”
The Convergence of IT, BT and Medicine
Medical Artificial Intelligence (의료 인공지능), written by Yoon Sup Choi; cover design by Seung-hyup Choi
[Book-jacket author biography, truncated in extraction: it sketches Yoon Sup Choi's training in computer science and systems bioengineering at POSTECH, research positions including Stanford University and KT, his role in first introducing digital healthcare in Korea, his founding of the Digital Healthcare Institute and co-founding of the accelerator Digital Healthcare Partners, advisory roles at startups such as VUNO, Zikto, 3billion, Soulling, and MediHere, his books and columns, and his blog, Facebook, and e-mail contacts.]
Medical AI is driving innovation that will reshape a conservative healthcare system. Its rapid development and broad impact are hard to grasp for modern medical professionals, who have trained along ever more specialized and subdivided paths, and it is hard even to know where to start studying. In this situation, this book, which clearly explains the concepts and applications of medical AI and its relationship with physicians, will be an excellent guide. It is an especially useful introduction for medical students and young clinicians who will lead the future.
━ Joon Beom Seo, Professor of Radiology, Asan Medical Center; Director, Medical Imaging AI Research Center
Few people would disagree that AI will fundamentally change the paradigm of medicine. But medicine poses many hard problems for AI, and the solutions vary enormously. The all-purpose medical AI that people commonly imagine does not exist. This book offers a balanced analysis of the development, application, and potential of a wide range of medical AI. I recommend it both to clinicians who want to adopt AI and to AI researchers venturing into the unfamiliar territory of medicine.
━ Jihoon Jeong, MD; Senior Lecturing Professor, Department of Media Communication, Kyung Hee Cyber University
As a professor responsible for basic medical education at Seoul National University College of Medicine, I keenly feel that today's medical education, essentially unchanged since industrialization, cannot prepare medical students for the rapidly changing era of AI. This book carries the expert analysis and forward-looking perspective of Director Yoon Sup Choi, who is pioneering AI education in medical school together with me. I recommend it to medical students and professors preparing for an AI future, and to students and parents considering medical school.
━ Hyung Jin Choi, Professor, Department of Anatomy, Seoul National University College of Medicine; internist
Extreme views and attitudes coexist around the recent introduction of medical AI. Through diverse cases and deep insight, this book provides a balanced view of the present and future of medical AI and opens the forum for discussion needed for AI to be adopted in medicine in earnest. Looking back ten years from now, when medical AI has become routine, I hope we will find that this book served as the guide that led the way.
━ Kyu-Hwan Jung, CTO, VUNO
Medical AI requires a more fundamental understanding than AI in other fields, because it goes beyond simply replacing human work to shifting the paradigm of medicine toward being data-driven. We therefore need a balanced understanding of AI and careful thought about how it can help doctors and patients. That is why this book, which brings together the results of such efforts from around the world, is so welcome.
━ Seung Wook Paek, CEO, Lunit
This book covers not only the latest trends in medical AI but also its significance, limitations, and outlook, along with plenty to think about. On contentious issues, the author presents his own views persuasively, grounded in clear evidence. Personally, I plan to use this book as a textbook for my graduate course.
━ Soo-Yong Shin, Professor, Department of Digital Health, Sungkyunkwan University
Medical Artificial Intelligence, written by Yoon Sup Choi
Price: KRW 20,000 / ISBN 979-11-86269-99-2
[Back-cover text, truncated in extraction: described as the first Korean book on medical artificial intelligence, written in accessible language with minimal jargon, discussing the utility, safety, and accountability of medical AI and what it means for medical education and for current and future physicians.]
The present and future of medical artificial intelligence, presented by medical futurist Dr. Yoon Sup Choi
The current state of deep learning in medicine and IBM Watson
Will artificial intelligence replace doctors?
Inevitable Tsunami of Change
https://rockhealth.com/reports/2018-year-end-funding-report-is-digital-health-in-a-bubble/
• In 2018, $8.1B was invested, setting another all-time record (a 42% increase over the previous year)
• A total of 368 deals (up slightly from 359 the year before): individual deals are getting larger
• Half of all deals were seed or series A rounds
• Early-stage companies are raising "the largest rounds ever", "more often than ever"
https://rockhealth.com/reports/digital-health-funding-2015-year-in-review/
2014: Life Science & Health 36%, Mobile 27%, Enterprise & Data 24%, Consumer 8%, Commerce 5%
2015: Life Science & Health 31%, Consumer 24%, Enterprise 23%, Data & AI 13%, Others 9%
Investment of GoogleVentures in 2014-2015
startuphealth.com/reports
THE TOP INVESTORS OF 2017 YTD
[Table, firm names shown as logos in the original: two firms tied at rank 1 with 7 deals each, two firms at rank 2 with 6 deals each, and four firms at rank 3 with 5 deals each, across early-, mid-, and late-stage rounds.]
We are seeing huge strides in new investors pouring money into the digital health market; however, all of the top 10 investors of 2017 year to date are either maintaining or increasing their investment activity.
Source: StartUp Health Insights | startuphealth.com/insights Note: Report based on public data on seed, venture, corporate venture and private equity funding only. © 2017 StartUp Health LLC
• Looking at individual investors, the traditional leaders in this space, Google Ventures and Khosla Ventures, tied for first with 7 deals each,
• while GE Ventures and Accel Partners tied for second with 6 deals each
• Companies GV invested in:
• ClassPass, a New York company building a virtual fitness membership network
• Science 37, a remote clinical trial company
• ZappRx, a digital specialty prescribing platform
• Companies Khosla Ventures invested in:
• TwoPoreGuys, which makes single-molecule testing devices
• Catalia Health, which makes Mabu, an AI-powered patient engagement robot
Healthcare
Broad health management that does not involve digital technology and is not professional medical care
e.g., exercise, nutrition, sleep
Digital healthcare
Health management that uses digital technology
e.g., Internet of Things, artificial intelligence, 3D printing, VR/AR
Mobile healthcare
The subset of digital healthcare that uses mobile technology
e.g., smartphones, IoT, social media
Personal genome analysis
e.g., cancer genomics, disease risk, carrier status, drug sensitivity
e.g., wellness, ancestry analysis
Map of healthcare-related fields (ver 0.3)
Medicine
The professional medical domain: disease prevention, treatment, prescription, and management
Telehealth
Telemedicine
EDITORIAL OPEN
Digital medicine, on its way to being just plain medicine
npj Digital Medicine (2018) 1:20175; doi:10.1038/s41746-017-0005-1
There are already nearly 30,000 peer-reviewed English-language
scientific journals, producing an estimated 2.5 million articles a year.1
So why another, and why one focused specifically on digital
medicine?
To answer that question, we need to begin by defining what
“digital medicine” means: using digital tools to upgrade the
practice of medicine to one that is high-definition and far more
individualized. It encompasses our ability to digitize human beings
using biosensors that track our complex physiologic systems, but
also the means to process the vast data generated via algorithms,
cloud computing, and artificial intelligence. It has the potential to
democratize medicine, with smartphones as the hub, enabling
each individual to generate their own real world data and being
far more engaged with their health. Add to this new imaging
tools, mobile device laboratory capabilities, end-to-end digital
clinical trials, telemedicine, and one can see there is a remarkable
array of transformative technology which lays the groundwork for
a new form of healthcare.
As is obvious by its definition, the far-reaching scope of digital
medicine straddles many and widely varied expertise. Computer
scientists, healthcare providers, engineers, behavioral scientists,
ethicists, clinical researchers, and epidemiologists are just some of
the backgrounds necessary to move the field forward. But to truly
accelerate the development of digital medicine solutions in health
requires the collaborative and thoughtful interaction between
individuals from several, if not most of these specialties. That is the
primary goal of npj Digital Medicine: to serve as a cross-cutting
resource for everyone interested in this area, fostering collabora-
tions and accelerating its advancement.
Current systems of healthcare face multiple insurmountable
challenges. Patients are not receiving the kind of care they want
and need, caregivers are dissatisfied with their role, and in most
countries, especially the United States, the cost of care is
unsustainable. We are confident that the development of new
systems of care that take full advantage of the many capabilities
that digital innovations bring can address all of these major issues.
Researchers too, can take advantage of these leading-edge
technologies as they enable clinical research to break free of the
confines of the academic medical center and be brought into the
real world of participants’ lives. The continuous capture of multiple
interconnected streams of data will allow for a much deeper
refinement of our understanding and definition of most pheno-
types, with the discovery of novel signals in these enormous data
sets made possible only through the use of machine learning.
Our enthusiasm for the future of digital medicine is tempered by
the recognition that presently too much of the publicized work in
this field is characterized by irrational exuberance and excessive
hype. Many technologies have yet to be formally studied in a
clinical setting, and for those that have, too many began and
ended with an under-powered pilot program. In addition, there are
more than a few examples of digital “snake oil” with substantial
uptake prior to their eventual discrediting.2
Both of these practices
are barriers to advancing the field of digital medicine.
Our vision for npj Digital Medicine is to provide a reliable,
evidence-based forum for all clinicians, researchers, and even
patients, curious about how digital technologies can transform
every aspect of health management and care. Being open source,
as all medical research should be, allows for the broadest possible
dissemination, which we will strongly encourage, including
through advocating for the publication of preprints.
And finally, quite paradoxically, we hope that npj Digital
Medicine is so successful that in the coming years there will no
longer be a need for this journal, or any journal specifically
focused on digital medicine. Because if we are able to meet our
primary goal of accelerating the advancement of digital medicine,
then soon, we will just be calling it medicine. And there are
already several excellent journals for that.
ACKNOWLEDGEMENTS
Supported by the National Institutes of Health (NIH)/National Center for Advancing
Translational Sciences grant UL1TR001114 and a grant from the Qualcomm Foundation.
ADDITIONAL INFORMATION
Competing interests: The authors declare no competing financial interests.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history: The original version of this Article had an incorrect Article number of 5 and an incorrect Publication year of 2017. These errors have now been corrected in the PDF and HTML versions of the Article.
Steven R. Steinhubl and Eric J. Topol
Scripps Translational Science Institute, 3344 North Torrey Pines Court, Suite 300, La Jolla, CA 92037, USA
Correspondence: Steven R. Steinhubl (steinhub@scripps.edu) or Eric J. Topol (etopol@scripps.edu)
REFERENCES
1. Ware, M. & Mabe, M. The STM report: an overview of scientific and scholarly journal
publishing 2015 [updated March]. http://digitalcommons.unl.edu/scholcom/92017
(2015).
2. Plante, T. B., Urrea, B. & MacFarlane, Z. T. et al. Validation of the instant blood
pressure smartphone App. JAMA Intern. Med. 176, 700–702 (2016).
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article’s Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article’s Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this license, visit http://creativecommons.
org/licenses/by/4.0/.
© The Author(s) 2018
Received: 19 October 2017 Accepted: 25 October 2017
www.nature.com/npjdigitalmed
Published in partnership with the Scripps Translational Science Institute
What is the future of digital medicine?
Becoming just plain, everyday medicine
What is the most important factor in digital medicine?
“Data! Data! Data!” he cried.“I can’t
make bricks without clay!”
- Sherlock Holmes,“The Adventure of the Copper Beeches”
New data,
in new ways,
by new players,
is measured, stored, integrated, and analyzed.
Types of data
Quality and quantity of data
Wearable devices
Smartphones
Genome analysis
Artificial intelligence
Social media
Users / patients
The general public
Three Steps to Implement Digital Medicine
• Step 1. Measure the Data
• Step 2. Collect the Data
• Step 3. Insight from the Data
Digital Healthcare Industry Landscape
Data Measurement Data Integration Data Interpretation Treatment
Smartphone Gadget/Apps
DNA
Artificial Intelligence
2nd Opinion
Wearables / IoT
(ver. 3)
EMR/EHR 3D Printer
Counseling
Data Platform
Accelerator/early-VC
Telemedicine
Device
On Demand (O2O)
VR
Digital Healthcare Institute
Director, Yoon Sup Choi, Ph.D.
yoonsup.choi@gmail.com
Step 1. Measure the Data
Smartphone: the origin of healthcare innovation
2013?
The Election of Pope Benedict vs. the Election of Pope Francis
Sci Transl Med 2015
Otoscope, dermatoscope, eye disease, skin cancer
Parasites, respiratory, ECG, sleep
Diet, physical activity, fever, menstruation/pregnancy
CellScope’s iPhone-enabled otoscope
Illegal in Korea
"Looking at the video of the left ear, there is fluid visible behind the eardrum. The eardrum is not particularly bulging or abnormally shaped, so it does not look severely inflamed.
Given that you had difficulty equalizing pressure while scuba diving, it would also be a good idea to see a doctor in person who can test the movement of the eardrum. ..."
Illegal in Korea
First Derm
Illegal in Korea
AliveCor Heart Monitor (Kardia)
"Your heart rhythm is stable, so you do not need to go to the hospital right away.
Still, if something feels wrong, please see a specialist."
Illegal in Korea
2015 vs. 2017
Reports roughly 30 minutes to 1 hour of ordinary snoring per night
How can you trust this?
It provides the audio recordings.
Has analytical validity against PSG (polysomnography) been demonstrated?
• Users can share their own medical/health data, measured with the iPhone's sensors, to the platform
• Uses the accelerometer, microphone, gyroscope, GPS sensor, and more
• Steps, physical activity, memory, voice tremor, and so on (see the step-counting sketch below)
• Addresses a long-standing problem of medical research: securing enough medical data
• Removes the physical and temporal barriers to enrolling study participants (once per 3 months ➞ once per second)
• Encourages the public to take part in medical research: more study participants
• Tens of thousands of participants signed up within 24 hours of launch
• Conducted with the users' own consent
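As a toy illustration of the kind of signal processing behind these measurements (ResearchKit itself is a Swift/Objective-C framework; this is not Apple's code), the hedged Python sketch below estimates a step count from a raw tri-axial accelerometer trace by peak-picking the acceleration magnitude. The sampling rate and thresholds are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_steps(acc_xyz: np.ndarray, fs: float = 50.0) -> int:
    """Estimate a step count from a tri-axial accelerometer trace.

    acc_xyz: array of shape (n_samples, 3); fs: assumed sampling rate in Hz.
    """
    # Acceleration magnitude with the DC (gravity) component removed.
    mag = np.linalg.norm(acc_xyz, axis=1)
    mag = mag - np.mean(mag)
    # Each step shows up as a peak; enforce a ~0.3 s refractory period and
    # a minimum prominence so sensor noise is not counted as steps.
    peaks, _ = find_peaks(mag, distance=int(0.3 * fs), prominence=0.2)
    return len(peaks)

if __name__ == "__main__":
    # Synthetic 10-second walk at ~2 steps/s, as a quick sanity check.
    fs = 50.0
    t = np.arange(0, 10, 1 / fs)
    walk = 1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t)        # vertical axis
    acc = np.stack([0.05 * np.random.randn(t.size),        # x: noise
                    0.05 * np.random.randn(t.size),        # y: noise
                    walk], axis=1)                          # z: walking signal
    print("estimated steps:", estimate_steps(acc, fs))
```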
ResearchKit
• The initial release introduced five apps for five diseases
http://www.roche.com/media/store/roche_stories/roche-stories-2015-08-10.htm
pRED app to track Parkinson’s symptoms in drug trial
Autism and Beyond: measuring facial expressions of young patients with autism
Mole Mapper: measuring morphological changes of moles
EpiWatch: measuring behavioral data of epilepsy patients
• myHeart, Stanford's cardiovascular disease research app
• 11,000 participants enrolled within a day of launch
• Alan Yeung, the Stanford study lead: "To enroll 11,000 participants the conventional way, we would have to recruit at 50 hospitals across the US for a year."
• mPower, a Parkinson's disease research app
• 5,589 participants enrolled within a day of launch
• A previous effort that spent $60 million over five years had recruited only 800 patients
The mPower study, Parkinson disease mobile data collected using ResearchKit
Brian M. Bot, Christine Suver, Elias Chaibub Neto, Michael Kellen, Arno Klein, Christopher Bare, Megan Doerr, Abhishek Pratap, John Wilbanks, E. Ray Dorsey, Stephen H. Friend & Andrew D. Trister
Current measures of health and disease are often insensitive, episodic, and subjective. Further, these
measures generally are not designed to provide meaningful feedback to individuals. The impact of high-
resolution activity data collected from mobile phones is only beginning to be explored. Here we present
data from mPower, a clinical observational study about Parkinson disease conducted purely through an
iPhone app interface. The study interrogated aspects of this movement disorder through surveys and
frequent sensor-based recordings from participants with and without Parkinson disease. Benefitting from
large enrollment and repeated measurements on many individuals, these data may help establish baseline
variability of real-world activity measurement collected via mobile phones, and ultimately may lead to
quantification of the ebbs-and-flows of Parkinson symptoms. App source code for these data collection
modules are available through an open source license for use in studies of other conditions. We hope that
releasing data contributed by engaged research participants will seed a new community of analysts working
collaboratively on understanding mobile health data to advance human health.
Design Type(s) observation design • time series design • repeated measure design
Measurement Type(s) disease severity measurement
Technology Type(s) Patient Self-Report
Factor Type(s)
Sample Characteristic(s) Homo sapiens
OPEN
SUBJECT CATEGORIES
» Research data
» Neurology
» Parkinson’s disease
» Medical research
Received: 07 December 2015
Accepted: 02 February 2016
Published: 3 March 2016
www.nature.com/scientificdata
Wearable Devices
http://www.rolls-royce.com/about/our-technology/enabling-technologies/engine-health-management.aspx#sense
250 sensors to monitor the “health” of the GE turbines
Fig 1. What can consumer wearables do? Heart rate can be measured with an oximeter built into a ring [3], muscle activity with an electromyographic sensor embedded into clothing [4], stress with an electrodermal sensor incorporated into a wristband [5], and physical activity or sleep patterns via an accelerometer in a watch [6,7]. In addition, a female's most fertile period can be identified with detailed body temperature tracking [8], while levels of mental attention can be monitored with a small number of non-gelled electroencephalogram (EEG) electrodes [9]. Levels of social interaction can also be captured (caption truncated in extraction).
PLOS Medicine 2016
PwC Health Research Institute, Health wearables: Early days
HRI's Consumer Intelligence Series survey sought to better understand American consumers' attitudes toward wearables and what could be done with the data; insurers offering incentives for use may gain traction.
Figure 2: Wearables are not mainstream – yet. Just one in five US consumers say they own a wearable device.
21% of US consumers currently own a wearable technology product: 10% wear it every day, 7% a few times a week, 2% a few times a month, and 2% no longer use it.
Source: HRI/CIS Wearables consumer survey 2014
PwC, Health wearables: early days, 2014
PwC, The Wearable Life 2.0: 49% of US consumers now own at least one wearable device (up from 21% in 2014), and 36% own more than one; we didn't even ask this question in our previous survey since it wasn't relevant at the time. That's how far we've come. Millennials are far more likely to own wearables than older adults, and adoption declines with age, although consumers aged 35 to 49 are more likely to own smart watches. Across the board for gender, age, and ethnicity, fitness wearable technology is most popular.
Fitness runs away with it: [chart of the share of respondents who own each type of wearable device, among those owning at least one (pre-quota sample, n=700). Fitness bands lead at 45%; smart video/photo devices (e.g. GoPro), smart watches, smart clothing, and smart glasses (including VR/AR glasses) each fall in the 12–27% range.]
PwC, The Wearable Life 2.0, 2016
• 49% own at least one wearable device (up from 21% in 2014)
• 36% own more than one device.
Hype or Hope?
Source: Gartner
Fitbit
Apple Watch
https://clinicaltrials.gov/ct2/results?term=fitbit&Search=Search
• Although it is not a medical device, Fitbit is already widely used in clinical research
• Clinical researchers adopted it on their own, without Fitbit promoting such use
• The number of clinical studies using Fitbit keeps growing: 80 (Mar 2016), 113 (Aug 2016), 173 (Jul 2017)
• Fitbit is used in clinical research in two main ways
• As an intervention itself, to test whether it can increase activity or improve treatment outcomes
• As a means of monitoring study participants' activity
• 1. Studies using Fitbit to increase patients' activity
• Whether Fitbit increases activity in children with obesity
• Whether Fitbit increases activity in patients who have undergone sleeve gastrectomy
• Whether Fitbit increases activity in young patients with cystic fibrosis
• Whether Fitbit motivates cancer patients to increase their physical activity
• 2. Studies using Fitbit to monitor the activity of enrolled patients
• Using Fitbit to assess the health and prognosis of patients who received chemotherapy
• Using Fitbit to determine whether cash incentives increase children's/parents' activity
• Using Fitbit, alongside other surveys, to measure quality of life in brain tumor patients
• Using Fitbit to assess the activity of patients with peripheral artery disease
• A study of whether weight loss affects breast cancer recurrence
• About 20% of breast cancer patients relapse, mostly with metastatic disease
• Being overweight has long been known to raise the risk of breast cancer,
• and obesity is known to worsen the prognosis of early breast cancer
• But there has been no study of the relationship between weight loss and recurrence risk
• 3,200 overweight or obese early-stage breast cancer patients will participate for two years
• Depending on the results, weight loss could become part of the standard of care for breast cancer patients worldwide
• Fitbit is supporting the weight-loss program
• Fitbit Charge HR: tracks activity, calories burned, and heart rate
• Fitbit Aria Wi-Fi Smart Scale: a smart body-weight scale
• FitStar: a personalized video exercise-coaching service
2016. 4. 27.
http://nurseslabs.tumblr.com/post/82438508492/medical-surgical-nursing-mnemonics-and-tips-2
• Biogen Idec uses Fitbit to monitor patients with multiple sclerosis
• The aim is to demonstrate the effectiveness of an expensive drug and thereby maintain its reimbursement price
• Could more precise measurement enable early detection of MS prodromal symptoms?
Dec 23, 2014
Zikto: Your Walking Coach
("Free vertical moments and transverse forces in human walking and their role in relation to arm-swing", Yu Li, Weijie Wang, Robin H. Crompton and Michael M. Gunther)
("Synthesis of natural arm swing motion in human bipedal walking", Jaeheung Park)
[Diagram: the right arm swings with the left foot, and the left arm swings with the right foot]
"Arm swing during walking is an automatic movement that keeps the body mechanically balanced, and serves as an indicator of the movement of the opposite foot."
Changes in body motion trajectories by gait type (foot placement and arm-swing trajectory): normal gait, out-toed gait, stooped gait
Data collected by the Zikto Walk (a simplified asymmetry-score sketch in Python follows below):
• Impact: the impact transmitted to the foot (Impact Score)
• Gait cycle: analysis of the walking cycle (Interval Score)
• Stride: distance per step (for future, more advanced gait analysis)
• 3-D arm trajectory: the arm's movement path while walking (aggregated accelerometer/gyroscope data)
• Walking posture: classification of the above data into 8 posture types
• Asymmetry index: asymmetry scores by body part (shoulder, waist, pelvis); requires wearing the band on the opposite wrist once a week
• Gait template: a per-person template built from distinctive features of the gait, for biometric authentication
with the courtesy of ZIKTO, Inc
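To make the asymmetry idea concrete, here is a hedged Python sketch (not Zikto's actual algorithm) that compares arm-swing amplitude between two wrist gyroscope traces, such as recordings taken while wearing the band on each wrist, and turns the difference into a simple symmetry score. The amplitude measure and scoring are assumptions for illustration.

```python
import numpy as np

def swing_amplitude(gyro_pitch: np.ndarray) -> float:
    """Robust peak-to-peak arm-swing amplitude of a wrist gyroscope trace."""
    return float(np.percentile(gyro_pitch, 95) - np.percentile(gyro_pitch, 5))

def symmetry_score(left: np.ndarray, right: np.ndarray) -> float:
    """Symmetry score in [0, 100]; 100 means perfectly symmetric arm swing.

    left/right: gyroscope traces recorded on each wrist (e.g., by wearing the
    band on the opposite wrist once, as described above).
    """
    a_l, a_r = swing_amplitude(left), swing_amplitude(right)
    asym = abs(a_l - a_r) / max(a_l + a_r, 1e-9)   # 0 = symmetric, 1 = one-sided
    return 100.0 * (1.0 - asym)

if __name__ == "__main__":
    t = np.linspace(0, 10, 500)
    left = 1.0 * np.sin(2 * np.pi * 1.0 * t)        # normal swing
    right = 0.6 * np.sin(2 * np.pi * 1.0 * t)       # reduced swing on one side
    print("symmetry score:", round(symmetry_score(left, right), 1))
```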
Empatica Embrace: Smart Band for epilepsy
https://www.empatica.com/science
Monitoring the Autonomic Nervous System
“Sympathetic activation increases when you experience excitement or
stress whether physical, emotional, or cognitive.The skin is the only organ
that is purely innervated by the sympathetic nervous system.”
https://www.empatica.com/science
Convulsive seizure detection using a wrist-worn electrodermal
activity and accelerometry biosensor
Ming-Zher Poh, Tobias Loddenkemper, Claus Reinsberger, Nicholas C. Swenson, Shubhi Goyal, Mangwe C. Sabtala, Joseph R. Madsen, and Rosalind W. Picard
Harvard-MIT Division of Health Sciences and Technology, Cambridge, Massachusetts, U.S.A.; MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, U.S.A.; Division of Epilepsy and Clinical Neurophysiology, Department of Neurology, Children's Hospital Boston, Harvard Medical School, Boston, Massachusetts, U.S.A.; Department of Neurology, Division of Epilepsy, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, U.S.A.; and Department of Neurosurgery, Children's Hospital Boston, Harvard Medical School, Boston, Massachusetts, U.S.A.
SUMMARY
The special requirements for a seizure detector suitable
for everyday use in terms of cost, comfort, and social
acceptance call for alternatives to electroencephalogra-
phy (EEG)–based methods. Therefore, we developed an
algorithm for automatic detection of generalized tonic–
clonic (GTC) seizures based on sympathetically mediated
electrodermal activity (EDA) and accelerometry mea-
sured using a novel wrist-worn biosensor. The problem of
GTC seizure detection was posed as a supervised learning
task in which the goal was to classify 10-s epochs as a
seizure or nonseizure event based on 19 extracted fea-
tures from EDA and accelerometry recordings using a
Support Vector Machine. Performance was evaluated
using a double cross-validation method. The new seizure
detection algorithm was tested on >4,213 h of recordings
from 80 patients and detected 15 (94%) of 16 of the GTC
seizures from seven patients with 130 false alarms (0.74
per 24 h). This algorithm can potentially provide a convul-
sive seizure alarm system for caregivers and objective
quantification of seizure frequency.
KEY WORDS: Seizure alarm, Electrodermal activity,
Accelerometry, Wearable sensor, Epilepsy.
Although combined electroencephalography (EEG) and
video-monitoring remain the gold standard for seizure
detection in clinical routine, most patients are opposed to
wearing scalp EEG electrodes to obtain seizure warnings
for everyday use (Schulze-Bonhage et al., 2010). Accele-
rometry recordings offer a less-obtrusive method for detect-
ing seizures with motor accompaniments (Nijsen et al.,
2005). Previously, we showed that electrodermal activity
(EDA), which reflects the modulation of sweat gland activ-
ity by the sympathetic nervous system, increases during
convulsive seizures (Poh et al., 2010a). Herein we describe
a novel methodology for generalized tonic–clonic (GTC)
seizure detection using information from both EDA and
accelerometry signals recorded with a wrist-worn sensor.
Methods
This study was approved by the institutional review
boards of Massachusetts Institute of Technology and Chil-
dren’s Hospital Boston. We recruited patients with epilepsy
who were admitted to the long-term video-EEG monitoring
(LTM) unit. All participants (or their caregivers) provided
written informed consent. Custom-built EDA and accele-
rometry biosensors were placed on the wrists (Fig. S1) such
that the electrodes were in contact with the ventral side of
the forearms (Poh et al., 2010b).
The various stages of the GTC seizure detector are
depicted in Fig. 1A. A sliding window was used to extract
10-s epochs from both accelerometry and EDA recordings
for each 2.5-s increment (75% overlap). The data were then
preprocessed to remove nonmotor and nonrhythmic epochs.
A total of 19 features including time, frequency, and nonlin-
ear features were extracted from remaining epochs of the
accelerometry and EDA signals to form feature vectors.
Finally, each feature vector was assigned to a seizure or
nonseizure class using a Support Vector Machine (SVM).
We implemented a non–patient-specific seizure detection
algorithm that excluded all data from a test patient in the
training phase (double leave-one-patient-out cross-valida-
tion). To allow the SVM to learn from previous examples of
seizures from the test patient if that patient had more than a
single GTC seizure recording available, we also imple-
mented double leave-one-seizure-out cross-validation.
Because the detector was not trained solely on data from a
Accepted February 3, 2012; Early View publication March 20, 2012.
Address correspondence to Ming-Zher Poh, Ph.D., MIT Media Lab,
Massachusetts Institute of Technology, Room E14-374B, 75 Amherst St.,
Cambridge, MA 02139, U.S.A. E-mail: zher@mit.edu
Wiley Periodicals, Inc.
ª 2012 International League Against Epilepsy
Epilepsia, 53(5):e93–e97, 2012
doi: 10.1111/j.1528-1167.2012.03444.x
BRIEF COMMUNICATION
e93
• A smart band with a built-in accelerometer and EDA (electrodermal activity) sensor
• Monitored 80 epilepsy patients for a total of 4,213 hours
• Successfully detected 94% of generalized tonic-clonic seizures (15 out of 16)
• 19 features computed over 10-second epochs, classified with machine learning (an SVM); a simplified sketch follows below
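The pipeline described in the paper (overlapping 10-second epochs extracted every 2.5 seconds, features from accelerometry and EDA, an SVM classifier) can be sketched with scikit-learn. The following is a hedged, simplified illustration rather than Empatica's production algorithm: only three stand-in features are computed instead of the paper's 19, the sampling rate is assumed, and the training data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

FS = 32  # assumed wrist-sensor sampling rate (Hz)

def epoch_features(acc_mag: np.ndarray, eda: np.ndarray) -> np.ndarray:
    """A few stand-in features for one 10-s epoch (the paper extracts 19)."""
    return np.array([
        np.std(acc_mag),                     # movement intensity
        np.sum(np.abs(np.diff(acc_mag))),    # accelerometer line length
        eda[-1] - eda[0],                    # rise in electrodermal activity
    ])

def sliding_epochs(acc_mag, eda, epoch_s=10.0, step_s=2.5):
    """Features from overlapping 10-s epochs every 2.5 s (75% overlap)."""
    n, step, width = len(acc_mag), int(step_s * FS), int(epoch_s * FS)
    feats = [epoch_features(acc_mag[i:i + width], eda[i:i + width])
             for i in range(0, n - width + 1, step)]
    return np.vstack(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic training features: label 1 = convulsive epochs, 0 = rest.
    X = np.vstack([rng.normal(0.2, 0.05, (200, 3)),
                   rng.normal(1.0, 0.30, (40, 3))])
    y = np.array([0] * 200 + [1] * 40)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
    clf.fit(X, y)

    # Score one minute of new (synthetic) wrist data, epoch by epoch.
    acc = np.abs(rng.normal(0.2, 0.05, 60 * FS))
    eda = np.cumsum(rng.normal(0, 1e-3, 60 * FS))
    print(clf.predict(sliding_epochs(acc, eda)))   # one 0/1 decision per epoch
```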
• A later multi-center trial with 135 patients
• 272 days and 6,530 hours of monitoring
• Detected 100% of the 40 generalized tonic-clonic seizures
• January 2018: FDA clearance for adult epilepsy patients (prescription-only)
• January 2019: FDA clearance for pediatric patients aged 6–21 (prescription-only)
Cardiogram
• Cardiogram, a Silicon Valley company, builds its service on heart-rate data measured with the Apple Watch
• Raised a $2M investment from Andreessen Horowitz in October 2016
https://blog.cardiogr.am/what-do-normal-and-abnormal-heart-rhythms-look-like-on-apple-watch-7b33b4a8ecfa
• Cardiogram argues that heart rate reflects exercise, sleep, emotion, and medical conditions
• In particular, it attempts to detect atrial fibrillation and atrial flutter from heart-rate data
Cardiogram
• Cardiogram claims it can detect atrial fibrillation from heart-rate data alone
• "Irregularly irregular"
• high absolute variability (a range of 30+ bpm)
• a higher fraction of missing measurements
• a lack of periodicity in heart rate variability
• Think of it as detecting the characteristically irregular rhythm of atrial fibrillation (a rule-of-thumb sketch follows below)
• "Can it be distinguished from other (non-AF) arrhythmias that also produce irregular rhythms?" (probably not easy)
• Therefore, patients flagged from heart-rate data need to be confirmed with an electrocardiogram (ECG)
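The published criteria above suggest a simple rule-of-thumb screen. The following is a minimal Python sketch, not Cardiogram's actual model (which uses a deep neural network): it flags a window of watch-derived heart-rate samples as "suspicious for AF" using the three heuristics listed. All thresholds other than the 30 bpm range are assumptions.

```python
import numpy as np

def suspicious_for_afib(bpm: np.ndarray, expected_samples: int) -> bool:
    """Flag a window of smartwatch heart-rate samples as 'irregularly irregular'.

    bpm: heart-rate samples (NaN where the watch failed to record a reading).
    expected_samples: how many samples the window should contain.
    """
    observed = bpm[~np.isnan(bpm)]
    if observed.size < 10:
        return False  # not enough data to judge

    # 1) High absolute variability: a range of 30+ bpm within the window.
    wide_range = (observed.max() - observed.min()) >= 30

    # 2) A higher fraction of missing measurements (assumed cutoff of 30%).
    missing_frac = 1.0 - observed.size / expected_samples
    many_missing = missing_frac > 0.3

    # 3) Lack of periodicity: weak autocorrelation at short lags.
    x = observed - observed.mean()
    ac = np.correlate(x, x, mode="full")[x.size:] / (x.var() * x.size)
    aperiodic = np.max(ac[:8]) < 0.3 if ac.size >= 8 else True

    # Require the wide range plus at least one supporting sign.
    return wide_range and (many_missing or aperiodic)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regular = 70 + 2 * rng.standard_normal(60)
    irregular = 75 + 20 * rng.standard_normal(60)
    irregular[rng.choice(60, 25, replace=False)] = np.nan   # dropped readings
    print(suspicious_for_afib(regular, 60), suspicious_for_afib(irregular, 60))
```

In practice, as noted above, any window flagged this way would still need confirmation with an ECG.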
Cardiogram for A.Fib
Passive Detection of Atrial Fibrillation
Using a Commercially Available Smartwatch
Geoffrey H. Tison, MD, MPH; José M. Sanchez, MD; Brandon Ballinger, BS; Avesh Singh, MS; Jeffrey E. Olgin, MD;
Mark J. Pletcher, MD, MPH; Eric Vittinghoff, PhD; Emily S. Lee, BA; Shannon M. Fan, BA; Rachel A. Gladstone, BA;
Carlos Mikell, BS; Nimit Sohoni, BS; Johnson Hsieh, MS; Gregory M. Marcus, MD, MAS
IMPORTANCE Atrial fibrillation (AF) affects 34 million people worldwide and is a leading cause
of stroke. A readily accessible means to continuously monitor for AF could prevent large
numbers of strokes and death.
OBJECTIVE To develop and validate a deep neural network to detect AF using smartwatch
data.
DESIGN, SETTING, AND PARTICIPANTS In this multinational cardiovascular remote cohort study
coordinated at the University of California, San Francisco, smartwatches were used to obtain
heart rate and step count data for algorithm development. A total of 9750 participants
enrolled in the Health eHeart Study and 51 patients undergoing cardioversion at the
University of California, San Francisco, were enrolled between February 2016 and March 2017.
A deep neural network was trained using a method called heuristic pretraining in which the
network approximated representations of the R-R interval (ie, time between heartbeats)
without manual labeling of training data. Validation was performed against the reference
standard 12-lead electrocardiography (ECG) in a separate cohort of patients undergoing
cardioversion. A second exploratory validation was performed using smartwatch data from
ambulatory individuals against the reference standard of self-reported history of persistent
AF. Data were analyzed from March 2017 to September 2017.
MAIN OUTCOMES AND MEASURES The sensitivity, specificity, and receiver operating
characteristic C statistic for the algorithm to detect AF were generated based on the
reference standard of 12-lead ECG–diagnosed AF.
RESULTS Of the 9750 participants enrolled in the remote cohort, including 347 participants
with AF, 6143 (63.0%) were male, and the mean (SD) age was 42 (12) years. There were more
than 139 million heart rate measurements on which the deep neural network was trained. The
deep neural network exhibited a C statistic of 0.97 (95% CI, 0.94-1.00; P < .001) to detect AF
against the reference standard 12-lead ECG–diagnosed AF in the external validation cohort of
51 patients undergoing cardioversion; sensitivity was 98.0% and specificity was 90.2%. In an
exploratory analysis relying on self-report of persistent AF in ambulatory participants, the C
statistic was 0.72 (95% CI, 0.64-0.78); sensitivity was 67.7% and specificity was 67.6%.
CONCLUSIONS AND RELEVANCE This proof-of-concept study found that smartwatch
photoplethysmography coupled with a deep neural network can passively detect AF but with
some loss of sensitivity and specificity against a criterion-standard ECG. Further studies will
help identify the optimal role for smartwatch-guided rhythm assessment.
JAMA Cardiol. doi:10.1001/jamacardio.2018.0136
Published online March 21, 2018.
Editorial
Supplemental content and
Audio
Author Affiliations: Division of
Cardiology, Department of Medicine,
University of California, San Francisco
(Tison, Sanchez, Olgin, Lee, Fan,
Gladstone, Mikell, Marcus);
Cardiogram Incorporated, San
Francisco, California (Ballinger, Singh,
Sohoni, Hsieh); Department of
Epidemiology and Biostatistics,
University of California, San Francisco
(Pletcher, Vittinghoff).
Corresponding Author: Gregory M.
Marcus, MD, MAS, Division of
Cardiology, Department of Medicine,
University of California, San
Francisco, 505 Parnassus Ave,
M1180B, San Francisco, CA 94143-
0124 (marcusg@medicine.ucsf.edu).
Research
JAMA Cardiology | Original Investigation
(Reprinted) E1
© 2018 American Medical Association. All rights reserved.
• The Health eHeart Study at UCSF
• A total of 9,750 participants
• 51 patients undergoing cardioversion
• Validated against standard 12-lead ECG
Figure 2. Accuracy of Detecting Atrial Fibrillation in the Cardioversion Cohort. A, Receiver operating characteristic curve among 51 individuals undergoing in-hospital cardioversion. The curve demonstrates a C statistic of 0.97 (95% CI, 0.94-1.00), and the point on the curve indicates a sensitivity of 98.0% and a specificity of 90.2%. B, Receiver operating characteristic curve among 1617 individuals in the ambulatory subset of the remote cohort. The curve demonstrates a C statistic of 0.72 (95% CI, 0.64-0.78), and the point on the curve indicates a sensitivity of 67.7% and a specificity of 67.6%.
Table 3. Performance Characteristics of Deep Neural Network in Validation Cohorts
Cardioversion cohort (sedentary): Sensitivity 98.0%, Specificity 90.2%, PPV 90.9%, NPV 97.8%, AUC 0.97
Subset of remote cohort (ambulatory): Sensitivity 67.7%, Specificity 67.6%, PPV 7.9%, NPV 98.1%, AUC 0.72
Abbreviations: AUC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
Note: In the cardioversion cohort, the atrial fibrillation reference standard was 12-lead electrocardiography diagnosis; in the remote cohort, the atrial fibrillation reference standard was limited to self-reported history of persistent atrial fibrillation.
AUC = 0.97 (cardioversion cohort) vs. AUC = 0.72 (ambulatory subset)
• In external validation using standard 12-lead ECG, algorithm
performance achieved a C statistic of 0.97.
• The passive detection of AF from free-living smartwatch data
has substantial clinical implications.
• Importantly, the accuracy of detecting self-reported AF in an
ambulatory setting was more modest (C statistic of 0.72)
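As a reminder of how validation numbers like those in Table 3 are computed, here is a hedged Python sketch (assuming scikit-learn is available) that derives sensitivity, specificity, PPV, NPV, and the C statistic (ROC AUC) from reference labels and model scores. The arrays in the example are made up for illustration, not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def validation_metrics(y_true: np.ndarray, score: np.ndarray, threshold: float = 0.5):
    """Sensitivity, specificity, PPV, NPV, and ROC AUC against a reference label."""
    y_pred = (score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),       # of true AF, how many were flagged
        "specificity": tn / (tn + fp),       # of non-AF, how many were cleared
        "ppv": tp / (tp + fp),               # of flagged, how many truly had AF
        "npv": tn / (tn + fn),               # of cleared, how many truly did not
        "auc": roc_auc_score(y_true, score)  # threshold-free C statistic
    }

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, 500)                          # toy reference labels
    s = np.clip(0.7 * y + 0.3 * rng.random(500), 0, 1)   # toy model scores
    print(validation_metrics(y, s))
```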
Apple Watch Series 4: ECG, arrhythmia detection, and fall detection
FDA medical device clearance
• Cleared via the De Novo pathway (as a new type of medical device)
• Announced in September, but the arrhythmia-related features were only activated in December
• Available only on Apple Watches in the US, not in Korea (a watch purchased in the US works even with a Korean App Store ID)
• With the Apple Watch Series 4 arrhythmia feature,
• a user detected his own atrial fibrillation on the very day the feature was activated
• After seeing the Apple Watch result, he went to the emergency room,
• where he was in fact diagnosed with atrial fibrillation
• The Apple Watch Series 4 arrhythmia (atrial fibrillation) feature
• is intended not for 'diagnosis' or for 'managing' already-diagnosed patients,
• but for 'detection'
• The goal is to identify people with undiagnosed atrial fibrillation
• and connect them to a hospital
• Has its accuracy really been rigorously validated?
• About 20% of the atrial fibrillation episodes detected by the Apple Watch
• were not seen on a patch-type ECG monitor
• In other words, there may be many false alarms,
• raising concerns about unnecessary hospital visits, tests, and medical costs
https://www.scripps.edu/science-and-medicine/translational-institute/about/news/oran-ecg-app/index.html?fbclid=IwAR02Z8SG679-svCkyxBhv3S1JUOSFQlI6UCvNu3wvUgyRmc1r2ft963MFmM
• A warning about the 'risks' of the Apple Watch Series 4 atrial fibrillation feature
• The risk of false positives when screening the general population
• (cases reported as atrial fibrillation when none is actually present)
• Explained by comparison with the PSA test, which also produces many false positives
• Unlike PSA, though, the Apple Watch does not even have long-term accuracy data
• It has received medical device clearance,
• but no one yet knows how accurate the Apple Watch Series 4 really is
Early detection of prostate cancer with PSA testing and a digital rectal exam
Numbers for men aged 50 years or older who either did or did not participate in prostate cancer screening for approximately 11 years (per 1,000 men without screening vs. 1,000 men with screening):
• How many men died from prostate cancer? 7 vs. 7
• How many men died from any cause? 210 vs. 210
• How many men without prostate cancer experienced false alarms and unnecessarily had tissue samples removed (biopsy)? — vs. 160
• How many men with non-progressive prostate cancer were unnecessarily diagnosed or treated*? — vs. 20
*E.g. treatments that include removal of the prostate gland (prostatectomy) or radiation therapy, which can lead to incontinence and impotence.
Source: Ilic et al. Cochrane Database Syst Rev 2013(1):CD004876. Last update: November 2017. www.harding-center.mpg.de/en/fact-boxes
https://www.scripps.edu/science-and-medicine/translational-institute/about/news/oran-ecg-app/index.html?fbclid=IwAR02Z8SG679-svCkyxBhv3S1JUOSFQlI6UCvNu3wvUgyRmc1r2ft963MFmM
Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: The Apple Heart Study
Mintu P. Turakhia, MD, MAS; Manisha Desai, PhD; Haley Hedlin, PhD; Amol Rajmane, MD, MBA; Nisha Talati, MBA; Todd Ferris, MD, MS; Sumbul Desai, MD; Divya Nag; Mithun Patel, MD; Peter Kowey, MD; John S. Rumsfeld, MD, PhD; Andrea M. Russo, MD; Mellanie True Hills, BS; Christopher B. Granger, MD; Kenneth W. Mahaffey, MD; and Marco V. Perez, MD
Stanford, Palo Alto, Cupertino, CA; Philadelphia, PA; Denver, CO; Camden, NJ; Decatur, TX; Durham, NC
Background Smartwatch and fitness band wearable consumer electronics can passively measure pulse rate from the
wrist using photoplethysmography (PPG). Identification of pulse irregularity or variability from these data has the potential to
identify atrial fibrillation or atrial flutter (AF, collectively). The rapidly expanding consumer base of these devices allows for
detection of undiagnosed AF at scale.
Methods The Apple Heart Study is a prospective, single arm pragmatic study that has enrolled 419,093 participants
(NCT03335800). The primary objective is to measure the proportion of participants with an irregular pulse detected by the
Apple Watch (Apple Inc, Cupertino, CA) with AF on subsequent ambulatory ECG patch monitoring. The secondary objectives
are to: 1) characterize the concordance of pulse irregularity notification episodes from the Apple Watch with simultaneously
recorded ambulatory ECGs; 2) estimate the rate of initial contact with a health care provider within 3 months after notification
of pulse irregularity. The study is conducted virtually, with screening, consent and data collection performed electronically from
within an accompanying smartphone app. Study visits are performed by telehealth study physicians via video chat through the
app, and ambulatory ECG patches are mailed to the participants.
Conclusions The results of this trial will provide initial evidence for the ability of a smartwatch algorithm to identify pulse
irregularity and variability which may reflect previously unknown AF. The Apple Heart Study will help provide a foundation for
how wearable technology can inform the clinical approach to AF identification and screening. (Am Heart J 2019;207:66-75.)
Atrial fibrillation and atrial flutter (AF, collectively)
together represent the most common cardiac arrhythmia,
currently affecting over 5 million people in the United
States1,2
with projected estimates up to 12 million
persons by 2050.3
AF increases the risk of stroke 5-fold4
and is responsible for at least 15% to 25% of strokes in the
United States.5
Oral anticoagulation can substantially
reduce the relative risk of stroke in patients with AF by
49% to 74%, with absolute risk reductions of 2.7% for
primary stroke prevention and 8.4% for secondary
prevention.6
Unfortunately, 18% of AF-associated strokes
present with AF that is newly detected at the time of
stroke.7
AF can be subclinical due to minimal symptom severity,
frank absence of symptoms, or paroxysmal nature, even
in the presence of tachycardia during AF episodes. It is
estimated that 700,000 people in the United States may
have previously unknown AF, with an incremental cost
burden of 3.2 billion dollars.8,9
Asymptomatic AF is
associated with similar risk of all-cause death, cardiovas-
cular death, and stroke/thromboembolism compared to
symptomatic AF.10
Minimally symptomatic patients have
been shown to derive significant symptom relief follow-
ing rate or rhythm control of AF.11
Undiagnosed or
untreated AF can also lead to development of heart failure
From the Center for Digital Health, Stanford University, Stanford, CA; VA Palo Alto Health Care System, Palo Alto, CA; Quantitative Sciences Unit, Stanford University, Stanford, CA; Stanford Center for Clinical Research, Stanford University, Stanford, CA; Information Resources and Technology, Stanford University, Stanford, CA; Apple Inc., Cupertino, CA; Lankenau Heart Institute and Jefferson Medical College, Philadelphia, PA; University of Colorado School of Medicine, Denver, CO; Division of Cardiovascular Disease, Cooper Medical School of Rowan University, Camden, NJ; StopAfib.org, American Foundation for Women's Health, Decatur, TX; Duke Clinical Research Institute, Duke University, Durham, NC; and Division of Cardiovascular Medicine, Stanford University, Stanford, CA.
Peter Alexander Noseworthy, MD served as guest editor for this article. RCT# NCT03335800. Submitted August 13, 2018; accepted September 4, 2018.
Reprint requests: Mintu Turakhia, Marco Perez, Stanford Center for Clinical Research, Stanford University, 1070 Arastradero Rd., Palo Alto, CA, 94304. E-mail: mintu@stanford.edu
© 2018 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). https://doi.org/10.1016/j.ahj.2018.09.002
Trial Design
American Heart Journal, 2019
• Apple Heart Study
• A remote clinical trial run by Stanford and sponsored by Apple
• Measures heart rate and pulse regularity via PPG
• If the PPG shows an irregularity suspicious for atrial fibrillation,
• the next step is ambulatory ECG monitoring with an ePatch
• The ePatch recording is compared with the simultaneously recorded Apple Watch result
• Telemedicine is used to dispense the ePatch and review the results
• Enrollment of 400,000 participants is complete, and follow-up is under way
• Presented at the American College of Cardiology's 68th Annual Scientific Session
• Only 0.5% of all study participants received an irregular pulse notification
• When the Apple Watch and an ECG patch were worn at the same time, the positive predictive value was 71%
• Of those who received an irregular pulse notification, 84% were in atrial fibrillation at that moment
• Of those who wore an ECG patch for the following week as follow-up, 34% were found to have atrial fibrillation
• Only 57% of those who received an irregular pulse notification actually saw a doctor (0.3% of the whole cohort)
(a brief positive-predictive-value sketch follows below)
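To make the relationship between these percentages concrete, here is a hedged Python sketch of how a positive predictive value is computed from notification counts. The numbers in the example are illustrative only, not the Apple Heart Study's raw counts.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): of all positive alerts, the fraction that are real."""
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    # Illustrative: suppose 1,000 simultaneous watch notifications were checked
    # against the ECG patch and 710 of them showed AF on the patch.
    print(positive_predictive_value(710, 290))   # -> 0.71
```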
Nat Biotech 2015
The Digital Phenotype
Digital Phenotype:
Your smartphone knows if you are depressed
Ginger.io
Digital Phenotype:
Your smartphone knows if you are depressed
J Med Internet Res. 2015 Jul 15;17(7):e175.
The correlation analysis between the features and the PHQ-9 scores revealed that 6 of the 10
features were significantly correlated to the scores:
• strong correlation: circadian movement, normalized entropy, location variance
• correlation: phone usage features, usage duration and usage frequency
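As a hedged illustration of the kind of analysis reported above (not the authors' code), the Python sketch below computes Pearson correlations between a few hypothetical smartphone-derived features, loosely named after those in the study (location variance, circadian movement, usage duration), and PHQ-9 depression scores. The feature names and data are made up.

```python
import numpy as np
from scipy.stats import pearsonr

def feature_phq9_correlations(features: dict[str, np.ndarray], phq9: np.ndarray):
    """Pearson r and p-value of each smartphone feature against PHQ-9 scores."""
    return {name: pearsonr(values, phq9) for name, values in features.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n = 28                                      # hypothetical participants
    phq9 = rng.integers(0, 27, n).astype(float)
    features = {
        # Made-up features loosely named after those in the study.
        "location_variance": -0.8 * phq9 + rng.normal(0, 3, n),
        "circadian_movement": -0.6 * phq9 + rng.normal(0, 4, n),
        "usage_duration_min": 5.0 * phq9 + rng.normal(0, 20, n),
    }
    for name, (r, p) in feature_phq9_correlations(features, phq9).items():
        print(f"{name:20s} r={r:+.2f} p={p:.3g}")
```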
The digital phenotype
Sachin H Jain, Brian W Powers, Jared B Hawkins & John S Brownstein
In the coming years, patient phenotypes captured to enhance health and wellness will extend to human interactions with
digital technology.
In 1982, the evolutionary biologist Richard Dawkins introduced the concept of the "extended phenotype"1, the idea that phenotypes should not be limited just to biological processes, such as protein biosynthesis or tissue growth, but extended to include all effects that a gene has on its environment inside or outside of the body of the individual organism. Dawkins stressed that many delineations of phenotypes are arbitrary. Animals and humans can modify their environments, and these modifications and associated behaviors are expressions of one's genome and, thus, part of their extended phenotype. In the animal kingdom, he cites dam building by beavers as an example of the beaver's extended phenotype1.
As personal technology becomes increasingly embedded in human lives, we think there is an important extension of Dawkins's theory: the notion of a 'digital phenotype'. Can aspects of our interface with technology be somehow diagnostic and/or prognostic for certain conditions? Can one's clinical data be linked and analyzed together with online activity and behavior data to create a unified, nuanced view of human disease? Here, we describe the concept of the digital phenotype. Although several disparate studies have touched on this notion, the framework for medicine has yet to be described. We attempt to define digital phenotype and further describe the opportunities and challenges in incorporating these data into healthcare.
Figure 1: Timeline of insomnia-related tweets from representative individuals. Density distributions (probability density functions) are shown for seven individual users (User 1 through User 7) over a two-year period (January 2013 to July 2014). Density on the y axis highlights periods of relative activity for each user. A representative tweet from each user is shown as an example.
http://www.nature.com/nbt/journal/v33/n5/full/nbt.3223.html
Your Twitter knows if you cannot sleep
Timeline of insomnia-related tweets from representative individuals.
Nat. Biotech. 2015
Reece & Danforth, “Instagram photos reveal predictive markers of depression” (2016)
higher Hue (bluer)
lower Saturation (grayer)
lower Brightness (darker)
Digital Phenotype:
Your Instagram knows if you are depressed
Rao (MVR) (24).
Results
Both All-data and Pre-diagnosis models were decisively superior to a null model (K_All = 157.5; K_Pre = 149.8). All-data predictors were significant with 99% probability. Pre-diagnosis and All-data confidence levels were largely identical, with two exceptions: Pre-diagnosis Brightness decreased to 90% confidence, and Pre-diagnosis posting frequency dropped to 30% confidence, suggesting a null predictive value in the latter case.
Increased hue, along with decreased brightness and saturation, predicted depression. This means that photos posted by depressed individuals tended to be bluer, darker, and grayer (see Fig. 2). The more comments Instagram posts received, the more likely they were posted by depressed participants, but the opposite was true for likes received. In the All-data model, higher posting frequency was also associated with depression. Depressed participants were more likely to post photos with faces, but had a lower average face count per photograph than healthy participants. Finally, depressed participants were less likely to apply Instagram filters to their posted photos.
 
Fig. 2. Magnitude and direction of regression coefficients in All-data (N=24,713) and Pre-diagnosis (N=18,513) models. X-axis values represent the adjustment in odds of an observation belonging to depressed individuals, per [caption truncated in extraction]
Reece & Danforth, “Instagram photos reveal predictive markers of depression” (2016)
 
 
Fig. 1. Comparison of HSV values. Right photograph has higher Hue (bluer), lower Saturation (grayer), and lower 
Brightness (darker) than left photograph. Instagram photos posted by depressed individuals had HSV values 
shifted towards those in the right photograph, compared with photos posted by healthy individuals. 
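To make the HSV features concrete, here is a hedged Python sketch (using Pillow; not the authors' pipeline) that extracts the mean hue, saturation, and brightness of a photo, the three pixel-level features the study found shifted in photos posted by depressed users. The file name is a placeholder.

```python
from PIL import Image
import numpy as np

def mean_hsv(path: str) -> dict:
    """Mean hue, saturation, and value (brightness) of an image, each scaled to [0, 1]."""
    img = Image.open(path).convert("HSV")   # Pillow's HSV mode, 8-bit channels
    h, s, v = (np.asarray(c, dtype=float) / 255.0 for c in img.split())
    return {"hue": h.mean(), "saturation": s.mean(), "brightness": v.mean()}

if __name__ == "__main__":
    # Placeholder file name. Per the study, depressed users' photos tended toward
    # higher hue (bluer), lower saturation (grayer), and lower brightness (darker).
    print(mean_hsv("example_instagram_photo.jpg"))
```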
 
Units of observation 
In determining the best time span for this analysis, we encountered a difficult question: 
When and for how long does depression occur? A diagnosis of depression does not indicate the 
persistence of a depressive state for every moment of every day, and to conduct analysis using an 
individual’s entire posting history as a single unit of observation is therefore rather specious. At 
the other extreme, to take each individual photograph as units of observation runs the risk of 
being too granular. DeChoudhury et al. (5) looked at all of a given user’s posts in a single day, 
and aggregated those data into per­person, per­day units of observation. We adopted this 
precedent of "user-days" as a unit of analysis.
 
Statistical framework 
We used Bayesian logistic regression with uninformative priors to determine the strength 
of individual predictors. Two separate models were trained. The All­data model used all 
collected data to address Hypothesis 1. The Pre­diagnosis model used all data collected from 
higher Hue (bluer)
lower Saturation (grayer)
lower Brightness (darker)
Digital Phenotype:
Your Instagram knows if you are depressed
Reece & Danforth, “Instagram photos reveal predictive markers of depression” (2016)
(χ²_All = 907.84, p = 9.17e−64; χ²_Pre = 813.80, p = 2.87e−44). In particular, depressed participants were less likely than healthy participants to use any filters at all. When depressed
participants did employ filters, they most disproportionately favored the “Inkwell” filter, which 
converts color photographs to black­and­white images. Conversely, healthy participants most 
disproportionately favored the Valencia filter, which lightens the tint of photos. Examples of 
filtered photographs are provided in SI Appendix VIII.  
 
Fig. 3. Instagram filter usage among depressed and healthy participants. Bars indicate difference between observed 
and expected usage frequencies, based on a Chi­squared analysis of independence. Blue bars indicate 
disproportionate use of a filter by depressed compared to healthy participants, orange bars indicate the reverse. 
Digital Phenotype:
Your Instagram knows if you are depressed
Reece & Danforth, “Instagram photos reveal predictive markers of depression” (2016)
 
VIII. Instagram filter examples 
 
Fig. S8. Examples of Inkwell and Valencia Instagram filters.  Inkwell converts 
color photos to black-and-white, Valencia lightens tint. Depressed participants 
most favored Inkwell compared to healthy participants; healthy participants 
Mindstrong Health
• Based on smartphone usage patterns,

• measures cognitive function, depression, schizophrenia, bipolar disorder, PTSD, and more

• Co-founded by Thomas Insel, former director of the US National Institute of Mental Health (NIMH)

• Backed by an investment from Amazon's Jeff Bezos
BRIEF COMMUNICATION OPEN
Digital biomarkers of cognitive function
Paul Dagum1
To identify digital biomarkers associated with cognitive function, we analyzed human–computer interaction from 7 days of
smartphone use in 27 subjects (ages 18–34) who received a gold standard neuropsychological assessment. For several
neuropsychological constructs (working memory, memory, executive function, language, and intelligence), we found a family of
digital biomarkers that predicted test scores with high correlations (p < 10−4). These preliminary results suggest that passive
measures from smartphone use could be a continuous ecological surrogate for laboratory-based neuropsychological assessment.
npj Digital Medicine (2018)1:10 ; doi:10.1038/s41746-018-0018-4
INTRODUCTION
By comparison to the functional metrics available in other
disciplines, conventional measures of neuropsychiatric disorders
have several challenges. First, they are obtrusive, requiring a
subject to break from their normal routine, dedicating time and
often travel. Second, they are not ecological and require subjects
to perform a task outside of the context of everyday behavior.
Third, they are episodic and provide sparse snapshots of a patient
only at the time of the assessment. Lastly, they are poorly scalable,
taxing limited resources including space and trained staff.
In seeking objective and ecological measures of cognition, we
attempted to develop a method to measure memory and
executive function not in the laboratory but in the moment,
day-to-day. We used human–computer interaction on smart-
phones to identify digital biomarkers that were correlated with
neuropsychological performance.
RESULTS
In 2014, 27 participants (ages 27.1 ± 4.4 years, education
14.1 ± 2.3 years, M:F 8:19) volunteered for neuropsychological
assessment and a test of the smartphone app. Smartphone
human–computer interaction data from the 7 days following
the neuropsychological assessment showed a range of correla-
tions with the cognitive scores. Table 1 shows the correlation
between each neurocognitive test and the cross-validated
predictions of the supervised kernel PCA constructed from
the biomarkers for that test. Figure 1 shows each participant
test score and the digital biomarker prediction for (a) digits
backward, (b) symbol digit modality, (c) animal fluency,
(d) Wechsler Memory Scale-3rd Edition (WMS-III) logical
memory (delayed free recall), (e) brief visuospatial memory test
(delayed free recall), and (f) Wechsler Adult Intelligence Scale-
4th Edition (WAIS-IV) block design. Construct validity of the
predictions was determined using pattern matching that
computed a correlation of 0.87 with p < 10−59 between the covariance matrix of the predictions and the covariance matrix of the tests.
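The evaluation pairs a kernel-PCA-style embedding of the 45 biomarkers with cross-validated prediction of each test score. scikit-learn has no supervised kernel PCA, so the sketch below substitutes plain KernelPCA followed by ridge regression with leave-one-out cross-validation, on synthetic data (an illustration, not the paper's pipeline):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 45))                              # 27 subjects x 45 digital biomarkers (toy)
y = X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=27)  # stand-in for one neuropsychological score

model = make_pipeline(
    StandardScaler(),
    KernelPCA(n_components=5, kernel="rbf"),   # unsupervised stand-in for supervised kernel PCA
    Ridge(alpha=1.0),
)
y_hat = cross_val_predict(model, X, y, cv=LeaveOneOut())
r, p = pearsonr(y, y_hat)
print(f"cross-validated r = {r:.2f} (p = {p:.1e})")   # compare with the R column of Table 1
```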
Table 1. Fourteen neurocognitive assessments covering five cognitive domains and dexterity were performed by a neuropsychologist. Shown are the group mean and standard deviation, range of scores, and the correlation between each test and the cross-validated prediction constructed from the digital biomarkers for that test.

Test — Mean (SD) | Range | R (predicted), p-value
Working memory
Digits forward: 10.9 (2.7) | 7–15 | 0.71 ± 0.10, 10−4
Digits backward: 8.3 (2.7) | 4–14 | 0.75 ± 0.08, 10−5
Executive function
Trail A: 23.0 (7.6) | 12–39 | 0.70 ± 0.10, 10−4
Trail B: 53.3 (13.1) | 37–88 | 0.82 ± 0.06, 10−6
Symbol digit modality: 55.8 (7.7) | 43–67 | 0.70 ± 0.10, 10−4
Language
Animal fluency: 22.5 (3.8) | 15–30 | 0.67 ± 0.11, 10−4
FAS phonemic fluency: 42 (7.1) | 27–52 | 0.63 ± 0.12, 10−3
Dexterity
Grooved pegboard test (dominant hand): 62.7 (6.7) | 51–75 | 0.73 ± 0.09, 10−4
Memory
California verbal learning test (delayed free recall): 14.1 (1.9) | 9–16 | 0.62 ± 0.12, 10−3
WMS-III logical memory (delayed free recall): 29.4 (6.2) | 18–42 | 0.81 ± 0.07, 10−6
Brief visuospatial memory test (delayed free recall): 10.2 (1.8) | 5–12 | 0.77 ± 0.08, 10−5
Intelligence scale
WAIS-IV block design: 46.1 (12.8) | 12–61 | 0.83 ± 0.06, 10−6
WAIS-IV matrix reasoning: 22.1 (3.3) | 12–26 | 0.80 ± 0.07, 10−6
WAIS-IV vocabulary: 40.6 (4.0) | 31–50 | 0.67 ± 0.11, 10−4
• 45 smartphone usage patterns in total: typing, scrolling, screen touches

• e.g., tapping the next character after pressing the spacebar

• pressing backspace right after a backspace

• how a user looks up a person in their contacts

• Correlation between these smartphone usage patterns and cognitive ability (a feature-extraction sketch follows below)

• 27 participants in their 20s and 30s

• Working memory, language, dexterity, etc.
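A minimal sketch of how latency features like these could be computed from a timestamped touch/keystroke event log; the event schema and values are hypothetical and are not Mindstrong's implementation:

```python
import pandas as pd

# Toy event log; in practice this would come from on-device instrumentation
events = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2019-01-01 09:00:00.000", "2019-01-01 09:00:00.350",
        "2019-01-01 09:00:02.000", "2019-01-01 09:00:02.600",
        "2019-01-01 10:00:00.000", "2019-01-01 10:00:00.900",
        "2019-01-01 10:00:01.400",
    ]),
    "event": ["space", "char", "backspace", "backspace", "space", "char", "char"],
}).sort_values(["user_id", "timestamp"])

def median_latency(df, first, second):
    """Median seconds between an event of type `first` and an immediately following `second`."""
    nxt = df["event"].shift(-1)
    gap = (df["timestamp"].shift(-1) - df["timestamp"]).dt.total_seconds()
    return gap[(df["event"] == first) & (nxt == second)].median()

features = events.groupby("user_id").apply(
    lambda d: pd.Series({
        "space_to_next_char": median_latency(d, "space", "char"),
        "backspace_to_backspace": median_latency(d, "backspace", "backspace"),
    })
)
print(features)
```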
Fig. 1 A blue square represents a participant test Z-score normed to the 27 participant scores and a red circle represents the digital biomarker
prediction Z-score normed to the 27 predictions. Test scores and predictions shown are a digits backward, b symbol digit modality, c animal
fluency, d Wechsler memory Scale-3rd Edition (WMS-III) logical memory (delayed free recall), e brief visuospatial memory test (delayed free
recall), and f Wechsler adult intelligence scale-4th Edition (WAIS-IV) block design
• Correlation between smartphone usage patterns and cognitive ability

• Blue: scores on standard cognitive tests

• Red: Mindstrong's predictions from smartphone usage patterns
Patient Generated Health Data
Step 2. Collect the Data
Sci Transl Med 2015
[Diagram: patient-generated health data flow — devices and apps (Withings scale, Dexcom CGM, Apple Watch, Google Fit, Samsung SAMI) feed a platform such as Apple HealthKit, which connects to hospital EHRs (Epic MyChart / Epic EHR) across Hospital A, Hospital B, and Hospital C for interoperability.]
• At launch in January 2018, it connected to 12 hospitals, including Johns Hopkins and UC San Diego

• (As of February 2019) connected to more than 200 hospitals within a year (see the FHIR sketch below)

• Also announced integration with the VA (with 9 million veterans)

• In 2008, Google Health managed to connect to only 12 hospitals in 3 years
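Apple Health Records exchanges data with hospital EHRs over SMART on FHIR APIs. A minimal Python sketch of a FHIR Observation search; the base URL, access token, and patient ID are placeholders, not a real endpoint:

```python
import requests

BASE = "https://fhir.example-hospital.org/api/FHIR/R4"   # hypothetical FHIR server
headers = {"Authorization": "Bearer <access-token>", "Accept": "application/fhir+json"}

# Fetch laboratory observations for one patient (standard FHIR search parameters)
resp = requests.get(
    f"{BASE}/Observation",
    params={"patient": "example-patient-id", "category": "laboratory", "_count": 50},
    headers=headers,
)
resp.raise_for_status()
bundle = resp.json()   # a FHIR Bundle resource

for entry in bundle.get("entry", []):
    obs = entry["resource"]
    code = obs["code"]["coding"][0].get("display", "unknown")
    value = obs.get("valueQuantity", {}).get("value")
    print(code, value)
```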
Two strategies for data-driven medicine
• Top-down: set up a hypothesis first, then collect the matching kinds of data and test it.

• Bottom-up: just collect as much of "all" the data as possible, and expect something big to emerge.
In order to understand the basis of wellness and disease, we and
others have pursued a global and holistic approach termed ‘systems
medicine’1. The defining feature of systems medicine is the collec-
tion of diverse longitudinal data for each individual. These data sets
can be used to unravel the complexity of human biology and dis-
ease by assessing both genetic and environmental determinants of
health and their interactions. We refer to such data as personal, dense,
dynamic data clouds: personal, because each data cloud is unique to
an individual; dense, because of the high number of measurements;
and dynamic, because we monitor longitudinally. The convergence
of advances in systems medicine, big data analysis, individual meas-
urement devices, and consumer-activated social networks has led
to a vision of healthcare that is predictive, preventive, personalized,
and participatory (P4)2, also known as ‘precision medicine’. Personal,
dense, dynamic data clouds are indispensable to realizing this vision3.
The US healthcare system invests 97% of its resources on disease
care4, with little attention to wellness and disease prevention. Here
we investigate scientific wellness, which we define as a quantitative
data-informed approach to maintaining and improving health and
avoiding disease.
Several recent studies have illustrated the utility of multi-omic lon-
gitudinal data to look for signs of reversible early disease or disease
risk factors in single individuals. The dynamics of human gut and sali-
vary microbiota in response to travel abroad and enteric infection was
characterized in two individuals using daily stool and saliva samples5.
Daily multi-omic data collection from one individual over 14 months
identified signatures of respiratory infection and the onset of type 2
diabetes6. Crohn’s disease progression was tracked over many years
in one individual using regular blood and stool measurements7. Each
of these studies yielded insights into system dynamics even though
they had only one or two participants.
We report the generation and analysis of personal, dense, dynamic
data clouds for 108 individuals over the course of a 9-month study that
we call the Pioneer 100 Wellness Project (P100). Our study included
whole genome sequences; clinical tests, metabolomes, proteomes, and
microbiomes at 3-month intervals; and frequent activity measure-
ments (i.e., wearing a Fitbit). This study takes a different approach
from previous studies, in that a broad set of assays were carried out less
frequently in a (comparatively) large number of people. Furthermore,
we identified ‘actionable possibilities’ for each individual to enhance
her/his health. Risk factors that we observed in participants’ clinical
markers and genetics were used as a starting point to identify action-
able possibilities for behavioral coaching.
We report the correlations among different data types and identify
population-level changes in clinical markers. This project is the pilot
for the 100,000 (100K) person wellness project that we proposed
in 2014 (ref. 8). An increased scale of personal, dense, dynamic
data clouds in future holds the potential to improve our under-
standing of scientific wellness and delineate early warning signs for
human diseases.
RESULTS
The P100 study had four objectives. First, establish cost-efficient
procedures for generating, storing, and analyzing multiple sources
A wellness study of 108 individuals using personal, dense, dynamic data clouds
Nathan D Price, Andrew T Magis, John C Earls, Gustavo Glusman, Roie Levy, Christopher Lausted, Daniel T McDonald, Ulrike Kusebauch, Christopher L Moss, Yong Zhou, Shizhen Qin, Robert L Moritz, Kristin Brogaard, Gilbert S Omenn, Jennifer C Lovejoy & Leroy Hood
Personal data for 108 individuals were collected during a 9-month period, including whole genome sequences; clinical tests,
metabolomes, proteomes, and microbiomes at three time points; and daily activity tracking. Using all of these data, we generated
a correlation network that revealed communities of related analytes associated with physiology and disease. Connectivity within
analyte communities enabled the identification of known and candidate biomarkers (e.g., gamma-glutamyltyrosine was densely
interconnected with clinical analytes for cardiometabolic disease). We calculated polygenic scores from genome-wide association
studies (GWAS) for 127 traits and diseases, and used these to discover molecular correlates of polygenic risk (e.g., genetic risk
for inflammatory bowel disease was negatively correlated with plasma cystine). Finally, behavioral coaching informed by personal
data helped participants to improve clinical biomarkers. Our results show that measurement of personal data clouds over time can
improve our understanding of health and disease, including early transitions to disease states.
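The polygenic scores mentioned in the abstract are, at their simplest, weighted sums of effect-allele dosages using GWAS effect sizes. A minimal sketch with toy numbers (not the study's pipeline):

```python
import numpy as np

# dosages: individuals x variants, each entry 0, 1, or 2 copies of the effect allele (toy data)
dosages = np.array([
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 0],
])
# beta: per-variant effect sizes taken from a GWAS summary table (toy values)
beta = np.array([0.12, -0.05, 0.30, 0.08])

prs = dosages @ beta                      # raw polygenic score per individual
prs_z = (prs - prs.mean()) / prs.std()    # standardized, as typically reported
print(prs_z)
```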
[Figure (study design, panels a and b): a 9-month study with three rounds of coaching sessions (Months 1–9). Data types collected: clinical labs — cardiovascular (HDL/LDL cholesterol, triglycerides, particle profiles, and other markers), diabetes risk (fasting glucose, HbA1c, insulin), inflammation (IL-6, IL-8), nutrition and toxins (ferritin, vitamin D, glutathione, mercury, lead) — all from blood samples; metabolomics (xenobiotics and metabolism-related small molecules; blood); genetics (whole genome sequence; blood); proteomics (inflammation, cardiovascular, liver, brain, and heart-related proteins; blood); gut microbiome (16S rRNA sequencing; stool); quantified self (daily activity from an activity tracker); stress (four-point cortisol; saliva).]
Measure every available type of multi-dimensional data
[Figure 2 (correlation network): nodes are analytes from proteomics, clinical labs, metabolomics, the gut microbiome, and genetic traits; edges are significant inter-omic correlations. Individual analyte labels omitted.]
Figure 2. Top 100 correlations per pair of data types. Subset of top statistically significant Spearman inter-omic cross-sectional correlations between all data sets collected in our cohort. Each line represents one correlation that was significant after adjustment for multiple hypothesis testing using the method of Benjamini and Hochberg at p_adj < 0.05. The mean of all three time points was used to compute the correlations between analytes. Up to 100 correlations per pair of data types are shown in this figure. See Supplementary Figure 1 and Supplementary Table 2 for the complete inter-omic cross-sectional network.
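The network in Figure 2 rests on pairwise Spearman correlations with Benjamini–Hochberg correction. A minimal sketch for two data types on synthetic participant-by-analyte tables (an illustration, not the study's code):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Toy stand-ins: 108 participants, mean of the three time points per analyte
clinical = pd.DataFrame(rng.normal(size=(108, 20)), columns=[f"clin_{i}" for i in range(20)])
metabolites = pd.DataFrame(rng.normal(size=(108, 30)), columns=[f"met_{i}" for i in range(30)])

rows = []
for c in clinical.columns:
    for m in metabolites.columns:
        rho, p = spearmanr(clinical[c], metabolites[m])
        rows.append((c, m, rho, p))
edges = pd.DataFrame(rows, columns=["clinical", "metabolite", "rho", "p"])

# Benjamini-Hochberg adjustment as in the figure legend; keep p_adj < 0.05,
# then take up to 100 edges for this pair of data types
edges["p_adj"] = multipletests(edges["p"], method="fdr_bh")[1]
network = edges[edges["p_adj"] < 0.05].nsmallest(100, "p_adj")
print(len(network), "significant correlations (toy data, likely zero)")
```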
Nature Biotechnology 2017
Among all the measured data types, the 100 most highly correlated pairs per pair of data types were selected
• Verily (Google)'s Project Baseline

• A project to redefine health and disease

• Closely tracks the health of 10,000 individuals over 4 years to accumulate data

• Heart rate, sleep patterns, genomic information, emotional state, medical records, family history, urine/saliva/blood tests, etc.
iCarbonX

• Founded by Jun Wang, former head of China's BGI

• Plans to "measure all data" and apply it to precision medicine

• Invests in and acquires companies with the capability to measure such data

• SomaLogic, HealthTell, PatientsLikeMe

• Plans to collect data from 1–10 million people over the next 5 years

• The data will be analyzed with artificial intelligence
• Precision Medicine Initiative Cohort Program

• US$215 million invested

• Recruiting at least one million American volunteers

• EMR, family history, genomic data, blood and urine test results,

• imaging data such as MRI, and data from wearable devices
The Future of Individualized Medicine, 2019 @San Diego
The Future of Individualized Medicine, 2019 @San Diego
Step 3. Insight from the Data
Data Overload
How to Analyze and Interpret Big Data?
and/or
Two ways to get insights from big data
No choice but to bring AI into medicine
Martin Duggan, “IBM Watson Health - Integrated Care & the Evolution to Cognitive Computing”
• Analysis of complex medical data to derive insights

• Analysis and interpretation of medical imaging and pathology data

• Monitoring of continuous data for prevention and prediction
Three types of medical artificial intelligence
Jeopardy!
In 2011, IBM Watson competed against two human champions on the quiz show and won by an overwhelming margin
ARTICLE OPEN
Scalable and accurate deep learning with electronic health
records
Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Yi Zhang, Gerardo Flores, Gavin E. Duggan, Jamie Irvine, Quoc Le, Kurt Litsch, Alexander Mossin, Justin Tansuwan, De Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L. Volchenboum, Katherine Chou, Michael Pearson, Srinivasan Madabushi, Nigam H. Shah, Atul J. Butte, Michael D. Howell, Claire Cui, Greg S. Corrado and Jeffrey Dean
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare
quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR
data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation
of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that
deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple
centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two US academic
medical centers with 216,221 adult patients hospitalized for at least 24 h. In the sequential format we propose, this volume of EHR
data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for
tasks such as predicting: in-hospital mortality (area under the receiver operator curve [AUROC] across sites 0.93–0.94), 30-day
unplanned readmission (AUROC 0.75–0.76), prolonged length of stay (AUROC 0.85–0.86), and all of a patient’s final discharge
diagnoses (frequency-weighted AUROC 0.90). These models outperformed traditional, clinically-used predictive models in all cases.
We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios. In a case
study of a particular prediction, we demonstrate that neural networks can be used to identify relevant information from the
patient’s chart.
npj Digital Medicine (2018)1:18 ; doi:10.1038/s41746-018-0029-1
INTRODUCTION
The promise of digital medicine stems in part from the hope that,
by digitizing health data, we might more easily leverage computer
information systems to understand and improve care. In fact,
routinely collected patient healthcare data are now approaching
the genomic scale in volume and complexity.1
Unfortunately,
most of this information is not yet used in the sorts of predictive
statistical models clinicians might use to improve care delivery. It
is widely suspected that use of such efforts, if successful, could
provide major benefits not only for patient safety and quality but
also in reducing healthcare costs.2–6
In spite of the richness and potential of available data, scaling
the development of predictive models is difficult because, for
traditional predictive modeling techniques, each outcome to be
predicted requires the creation of a custom dataset with specific
variables.7
It is widely held that 80% of the effort in an analytic
model is preprocessing, merging, customizing, and cleaning
datasets,8,9
not analyzing them for insights. This profoundly limits
the scalability of predictive models.
Another challenge is that the number of potential predictor
variables in the electronic health record (EHR) may easily number
in the thousands, particularly if free-text notes from doctors,
nurses, and other providers are included. Traditional modeling
approaches have dealt with this complexity simply by choosing a
very limited number of commonly collected variables to consider.7
This is problematic because the resulting models may produce
imprecise predictions: false-positive predictions can overwhelm
physicians, nurses, and other providers with false alarms and
concomitant alert fatigue,10
which the Joint Commission identified
as a national patient safety priority in 2014.11
False-negative
predictions can miss significant numbers of clinically important
events, leading to poor clinical outcomes.11,12
Incorporating the
entire EHR, including clinicians’ free-text notes, offers some hope
of overcoming these shortcomings but is unwieldy for most
predictive modeling techniques.
Recent developments in deep learning and artificial neural
networks may allow us to address many of these challenges and
unlock the information in the EHR. Deep learning emerged as the
preferred machine learning approach in machine perception
problems ranging from computer vision to speech recognition,
but has more recently proven useful in natural language
processing, sequence prediction, and mixed modality data
settings.13–17
These systems are known for their ability to handle
large volumes of relatively messy data, including errors in labels
• In January 2018, Google announced an AI that analyzes electronic medical records (EMR) to predict patient outcomes:

• whether the patient will die during the hospital stay

• whether the hospital stay will be prolonged

• whether the patient will be readmitted within 30 days of discharge

• the diagnoses at discharge

• The distinguishing feature of this study: scalability

• Unlike previous studies, it did not pre-process and hand-pick parts of the EMR;

• instead, the entire EMR was analyzed as a whole: UCSF and UCM (University of Chicago Medicine)

• In particular, unstructured data such as physicians' notes were also analyzed (a toy sequence-model sketch follows below)
Nat Digi Med 2018
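The paper's models consume each patient's record as a time-ordered sequence of FHIR events. The toy PyTorch sketch below shows the general shape of such a sequence classifier for in-hospital mortality; the tokenization, vocabulary size, and LSTM architecture are illustrative assumptions, not Google's model:

```python
import torch
import torch.nn as nn

class EHRSequenceClassifier(nn.Module):
    """Embed tokenized EHR events and classify the sequence with an LSTM (toy version)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)    # logit for in-hospital mortality

model = EHRSequenceClassifier(vocab_size=50_000)
tokens = torch.randint(1, 50_000, (8, 512))    # 8 fake patients, 512 EHR events each
labels = torch.randint(0, 2, (8,)).float()
loss = nn.BCEWithLogitsLoss()(model(tokens), labels)
loss.backward()                                # one toy training step's backward pass
```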
Nat Digi Med 2018
[Cropped excerpt from the Discussion (the right edge of each line is cut off in the slide): because the authors were interested in whether deep learning could scale to produce valid predictions across divergent healthcare domains, a single data structure was used to make predictions for an important clinical outcome (death), a standard measure of quality of care (readmissions), a measure of resource utilization (length of stay), and a measure of understanding of a patient's problems (diagnoses). Using the entirety of a patient's chart for each prediction does more than promote scalability; it exposes more data with which to make an accurate prediction. For predictions made at discharge, the deep learning models considered more than 46 billion pieces of EHR data and achieved more accurate predictions, earlier in the hospital stay, than did traditional models. To the authors' knowledge, the models outperform existing EHR models in the medical literature for predicting mortality (0.92–0.94 vs 0.91), unexpected readmission (0.75–0.76 vs 0.69), and increased length of stay (0.85–0.86 vs 0.77). Direct comparisons to other studies are difficult because of differing study designs, incomplete definitions of cohorts and outcomes, restrictions to disease-specific cohorts, or the use of data unavailable in real time; baselines based on the HOSPITAL score, an early warning score (aEWS), and Liu's model were therefore implemented on the same data and showed worse performance. No prior study predicts as many ICD codes, and the micro-F1 score exceeds that reported on the smaller MIMIC-III dataset for predicting discharge diagnoses (0.40 vs 0.28). The clinical impact is suggested, for example, by the number needed to evaluate for inpatient mortality: the deep learning model would fire half the number of alerts of a traditional predictive model, resulting in many fewer false positives. However, the novelty of the approach does not lie solely ...]
[Figure caption fragment: each token is considered as a potential predictor by the deep learning model. The line within the boxplot represents the median, the box the interquartile range (IQR), and the whiskers 1.5 times the IQR. The number of tokens increased steadily from admission to discharge. At discharge, the median number of tokens for Hospital A was 86,477 and for Hospital B was 122,961.]
Table 2. Prediction accuracy of each task made at different time points (AUROC, 95% CI)

Inpatient mortality — Hospital A | Hospital B
24 h before admission: 0.87 (0.85–0.89) | 0.81 (0.79–0.83)
At admission: 0.90 (0.88–0.92) | 0.90 (0.86–0.91)
24 h after admission: 0.95 (0.94–0.96) | 0.93 (0.92–0.94)
Baseline (aEWS) at 24 h after admission: 0.85 (0.81–0.89) | 0.86 (0.83–0.88)

30-day readmission — Hospital A | Hospital B
At admission: 0.73 (0.71–0.74) | 0.72 (0.71–0.73)
At 24 h after admission: 0.74 (0.72–0.75) | 0.73 (0.72–0.74)
At discharge: 0.77 (0.75–0.78) | 0.76 (0.75–0.77)
Baseline (mHOSPITAL) at discharge: 0.70 (0.68–0.72) | 0.68 (0.67–0.69)

Length of stay at least 7 days — Hospital A | Hospital B
At admission: 0.81 (0.80–0.82) | 0.80 (0.80–0.81)
At 24 h after admission: 0.86 (0.86–0.87) | 0.85 (0.85–0.86)
Baseline (Liu) at 24 h after admission: 0.76 (0.75–0.77) | 0.74 (0.73–0.75)

Discharge diagnoses (weighted AUROC) — Hospital A | Hospital B
At admission: 0.87 | 0.86
At 24 h after admission: 0.89 | 0.88
At discharge: 0.90 | 0.90

AUROC: area under the receiver operator curve; aEWS: augmented early warning system score; mHOSPITAL: modified HOSPITAL score for readmission; Liu: modified Liu score for long length of stay.
• Predicting "whether a first cardiovascular event will occur within the next 10 years"

• Prospective cohort study: 378,256 patients in the UK

• The first large-scale study to predict disease with machine learning from routine clinical data

• Compared the accuracy of the existing ACC/AHA guidelines against four machine-learning algorithms (see the sketch below)

• Random forest; logistic regression; gradient boosting; neural network
Can machine-learning improve cardiovascular
risk prediction using routine clinical data?
Stephen F. Weng et al., PLoS ONE 2017
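A minimal sketch of the kind of comparison the study performs: fit the four model families and compare validation AUC with roc_auc_score. The feature matrix below is synthetic; the real study used routine clinical variables and an ACC/AHA baseline, which are not reproduced here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the routine-care feature matrix (age, cholesterol, smoking, ...)
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.22, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")   # the study compares these against ACC/AHA
```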
in a sensitivity of 62.7% and PPV of 17.1%. The random forest algorithm resulted in a net
increase of 191 CVD cases from the baseline model, increasing the sensitivity to 65.3% and
PPV to 17.8% while logistic regression resulted in a net increase of 324 CVD cases (sensitivity
67.1%; PPV 18.3%). Gradient boosting machines and neural networks performed best, result-
ing in a net increase of 354 (sensitivity 67.5%; PPV 18.4%) and 355 CVD (sensitivity 67.5%;
PPV 18.4%) cases correctly predicted, respectively.
The ACC/AHA baseline model correctly predicted 53,106 non-cases from 75,585 total non-
cases, resulting in a specificity of 70.3% and NPV of 95.1%. The net increase in non-cases
Table 3. Top 10 risk factor variables for the CVD algorithms, listed in descending order of coefficient effect size (ACC/AHA; logistic regression), weighting (neural networks), or selection frequency (random forest, gradient boosting machines). Algorithms were derived from the training cohort of 295,267 patients.

ACC/AHA (Men): Age; Total Cholesterol; HDL Cholesterol; Smoking; Age x Total Cholesterol; Treated Systolic Blood Pressure; Age x Smoking; Age x HDL Cholesterol; Untreated Systolic Blood Pressure; Diabetes
ACC/AHA (Women): Age; HDL Cholesterol; Total Cholesterol; Smoking; Age x HDL Cholesterol; Age x Total Cholesterol; Treated Systolic Blood Pressure; Untreated Systolic Blood Pressure; Age x Smoking; Diabetes
ML: Logistic Regression: Ethnicity; Age; SES: Townsend Deprivation Index; Gender; Smoking; Atrial Fibrillation; Chronic Kidney Disease; Rheumatoid Arthritis; Family history of premature CHD; COPD
ML: Random Forest: Age; Gender; Ethnicity; Smoking; HDL Cholesterol; HbA1c; Triglycerides; SES: Townsend Deprivation Index; BMI; Total Cholesterol
ML: Gradient Boosting Machines: Age; Gender; Ethnicity; Smoking; HDL Cholesterol; Triglycerides; Total Cholesterol; HbA1c; Systolic Blood Pressure; SES: Townsend Deprivation Index
ML: Neural Networks: Atrial Fibrillation; Ethnicity; Oral Corticosteroid Prescribed; Age; Severe Mental Illness; SES: Townsend Deprivation Index; Chronic Kidney Disease; BMI missing; Smoking; Gender

Italics in the original indicate protective factors.
https://doi.org/10.1371/journal.pone.0174944.t003
• Only some of the risk factors from the existing ACC/AHA guidelines were also selected by the machine-learning algorithms

• However, diabetes was not included in any of the four models.

• New factors that are absent from existing risk prediction tools were included, such as:

• COPD, severe mental illness, prescribing of oral corticosteroids

• and biomarkers such as triglyceride level
Stephen F. Weng et al., PLoS ONE 2017
Can machine-learning improve cardiovascular
risk prediction using routine clinical data?
correctly predicted compared to the baseline ACC/AHA model ranged from 191 non-cases for
the random forest algorithm to 355 non-cases for the neural networks. Full details on classifi-
cation analysis can be found in S2 Table.
Discussion
Compared to an established AHA/ACC risk prediction algorithm, we found all machine-
learning algorithms tested were better at identifying individuals who will develop CVD and
those that will not. Unlike established approaches to risk prediction, the machine-learning
methods used were not limited to a small set of risk factors, and incorporated more pre-exist-
Table 4. Performance of the machine-learning (ML) algorithms predicting 10-year cardiovascular disease (CVD) risk, derived by applying the training algorithms to the validation cohort of 82,989 patients. Higher c-statistics indicate better discrimination. The baseline (BL) ACC/AHA 10-year risk prediction algorithm is provided for comparison.

Algorithm — AUC c-statistic | standard error* | 95% CI (LCL–UCL) | absolute change from baseline
BL: ACC/AHA — 0.728 | 0.002 | 0.723–0.735 | —
ML: Random Forest — 0.745 | 0.003 | 0.739–0.750 | +1.7%
ML: Logistic Regression — 0.760 | 0.003 | 0.755–0.766 | +3.2%
ML: Gradient Boosting Machines — 0.761 | 0.002 | 0.755–0.766 | +3.3%
ML: Neural Networks — 0.764 | 0.002 | 0.759–0.769 | +3.6%

*Standard error estimated by jack-knife procedure [30]
https://doi.org/10.1371/journal.pone.0174944.t004
Can machine-learning improve cardiovascular risk prediction using routine clinical data?
• All four machine-learning models were more accurate than the existing ACC/AHA guidelines.

• Neural networks were the most accurate, with AUC = 0.764.

• "Had this model been used, an additional 355 cardiovascular events could have been prevented."

• Accuracy could be pushed higher with deep learning.

• Additional risk factors such as genetic information could also be incorporated.
LETTERS
https://doi.org/10.1038/s41591-018-0335-9
1
Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou, China. 2
Institute for Genomic Medicine, Institute of
Engineering in Medicine, and Shiley Eye Institute, University of California, San Diego, La Jolla, CA, USA. 3
Hangzhou YITU Healthcare Technology Co. Ltd,
Hangzhou, China. 4
Department of Thoracic Surgery/Oncology, First Affiliated Hospital of Guangzhou Medical University, China State Key Laboratory and
National Clinical Research Center for Respiratory Disease, Guangzhou, China. 5
Guangzhou Kangrui Co. Ltd, Guangzhou, China. 6
Guangzhou Regenerative
Medicine and Health Guangdong Laboratory, Guangzhou, China. 7
Veterans Administration Healthcare System, San Diego, CA, USA. 8
These authors contributed
equally: Huiying Liang, Brian Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, Guangjian Liu. *e-mail: kang.zhang@gmail.com; xiahumin@hotmail.com
Artificial intelligence (AI)-based methods have emerged as
powerful tools to transform medical care. Although machine
learning classifiers (MLCs) have already demonstrated strong
performance in image-based diagnoses, analysis of diverse
and massive electronic health record (EHR) data remains chal-
lenging. Here, we show that MLCs can query EHRs in a manner
similar to the hypothetico-deductive reasoning used by physi-
cians and unearth associations that previous statistical meth-
ods have not found. Our model applies an automated natural
language processing system using deep learning techniques
to extract clinically relevant information from EHRs. In total,
101.6 million data points from 1,362,559 pediatric patient
visits presenting to a major referral center were analyzed to
train and validate the framework. Our model demonstrates
high diagnostic accuracy across multiple organ systems and is
comparable to experienced pediatricians in diagnosing com-
mon childhood diseases. Our study provides a proof of con-
cept for implementing an AI-based system as a means to aid
physicians in tackling large amounts of data, augmenting diagnostic evaluations, and to provide clinical decision support in
cases of diagnostic uncertainty or complexity. Although this
impact may be most evident in areas where healthcare provid-
ers are in relative shortage, the benefits of such an AI system
are likely to be universal.
Medical information has become increasingly complex over
time. The range of disease entities, diagnostic testing and biomark-
ers, and treatment modalities has increased exponentially in recent
years. Subsequently, clinical decision-making has also become more
complex and demands the synthesis of decisions from assessment
of large volumes of data representing clinical information. In the
current digital age, the electronic health record (EHR) represents a
massive repository of electronic data points representing a diverse
array of clinical information1–3
. Artificial intelligence (AI) methods
have emerged as potentially powerful tools to mine EHR data to aid
in disease diagnosis and management, mimicking and perhaps even
augmenting the clinical decision-making of human physicians1
.
To formulate a diagnosis for any given patient, physicians fre-
quently use hypothetico-deductive reasoning. Starting with the chief
complaint, the physician then asks appropriately targeted questions
relating to that complaint. From this initial small feature set, the
physician forms a differential diagnosis and decides what features
(historical questions, physical exam findings, laboratory testing,
and/or imaging studies) to obtain next in order to rule in or rule
out the diagnoses in the differential diagnosis set. The most use-
ful features are identified, such that when the probability of one of
the diagnoses reaches a predetermined level of acceptability, the
process is stopped, and the diagnosis is accepted. It may be pos-
sible to achieve an acceptable level of certainty of the diagnosis with
only a few features without having to process the entire feature set.
Therefore, the physician can be considered a classifier of sorts.
In this study, we designed an AI-based system using machine
learning to extract clinically relevant features from EHR notes to
mimic the clinical reasoning of human physicians. In medicine,
machine learning methods have already demonstrated strong per-
formance in image-based diagnoses, notably in radiology2
, derma-
tology4
, and ophthalmology5–8
, but analysis of EHR data presents
a number of difficult challenges. These challenges include the vast
quantity of data, high dimensionality, data sparsity, and deviations
Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence
Huiying Liang, Brian Y. Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, Guangjian Liu, Wenjia Cai, Daniel S. Kermany, Xin Sun, Jiancong Chen, Liya He, Jie Zhu, Pin Tian, Hua Shao, Lianghong Zheng, Rui Hou, Sierra Hewett, Gen Li, Ping Liang, Xuan Zang, Zhiqi Zhang, Liyan Pan, Huimin Cai, Rujuan Ling, Shuhua Li, Yongwang Cui, Shusheng Tang, Hong Ye, Xiaoyan Huang, Waner He, Wenqing Liang, Qing Zhang, Jianmin Jiang, Wei Yu, Jianqun Gao, Wanxing Ou, Yingmin Deng, Qiaozhen Hou, Bei Wang, Cuichan Yao, Yan Liang, Shu Zhang, Yaou Duan, Runze Zhang, Sarah Gibson, Charlotte L. Zhang, Oulan Li, Edward D. Zhang, Gabriel Karin, Nathan Nguyen, Xiaokang Wu, Cindy Wen, Jie Xu, Wenqin Xu, Bochu Wang, Winston Wang, Jing Li, Bianca Pizzato, Caroline Bao, Daoman Xiang, Wanting He, Suiqin He, Yugui Zhou, Weldon Haw, Michael Goldbaum, Adriana Tremoulet, Chun-Nan Hsu, Hannah Carter, Long Zhu, Kang Zhang* and Huimin Xia*
Nat Med 2019 Feb
• Analyzed 101.6 million EMR data points from 1.3 million pediatric patients

• Deep-learning-based natural language processing

• Mimics physicians' hypothetico-deductive reasoning

• An AI that diagnoses common diseases in pediatric patients
Nat Med 2019 Feb
[Cropped excerpt and Figure 2 from the paper. Excerpt: "... examination, laboratory testing, and PACS (picture archiving and communication systems) reports), the F1 scores exceeded 90% except in one instance, which was for categorical variables detected ... tree, similar to how a human physician might evaluate a patient's features to achieve a diagnosis based on the same clinical data incorporated into the information model. Encounters labeled by ..."
Figure 2 diagram (branches shown): systemic generalized diseases (varicella without complication, influenza, infectious mononucleosis, sepsis, exanthema subitum); neuropsychiatric diseases (tic disorder, attention-deficit hyperactivity disorders, bacterial meningitis, encephalitis, convulsions); genitourinary diseases; respiratory diseases, split into upper respiratory (acute upper respiratory infection, sinusitis, acute laryngitis, acute pharyngitis) and lower respiratory (bronchitis, bronchiolitis, pneumonia, mycoplasma infection, asthma, acute tracheitis); and gastrointestinal diseases (diarrhea, mouth-related diseases such as enteroviral vesicular stomatitis with exanthem).]
Fig. 2 | Hierarchy of the diagnostic framework in a large pediatric cohort. A hierarchical logistic regression classifier was used to establish a diagnostic system based on anatomic divisions. An organ-based approach was used, wherein diagnoses were first separated into broad organ systems, then subsequently divided into organ subsystems and/or into more specific diagnosis groups.
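A minimal sketch of the organ-based, two-level routing that the Fig. 2 caption describes, using scikit-learn logistic regressions; the TF-IDF features, toy notes, and labels are stand-ins, not the paper's NLP pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy free-text snippets with a top-level organ label and a specific diagnosis group
notes = [
    "cough and wheezing for three days",
    "fever, crackles on auscultation, chest infiltrate",
    "nasal congestion and facial pain for two weeks",
    "vomiting and watery diarrhea",
    "mouth ulcers and refusal to eat",
    "loose stools after a daycare outbreak",
]
organ = ["respiratory", "respiratory", "respiratory",
         "gastrointestinal", "gastrointestinal", "gastrointestinal"]
dx = ["asthma", "pneumonia", "sinusitis", "diarrhea", "stomatitis", "diarrhea"]

vec = TfidfVectorizer(max_features=5000)
X = vec.fit_transform(notes)

# Level 1: route the encounter to a broad organ system
top = LogisticRegression(max_iter=1000).fit(X, organ)

# Level 2: one classifier per organ system for the specific diagnosis group
sub = {}
for system in set(organ):
    idx = [i for i, o in enumerate(organ) if o == system]
    sub[system] = LogisticRegression(max_iter=1000).fit(X[idx], [dx[i] for i in idx])

def predict(text):
    x = vec.transform([text])
    system = top.predict(x)[0]
    return system, sub[system].predict(x)[0]

print(predict("three days of wheezing and nighttime cough"))
```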
Nat Med 2019 Feb
[Cropped excerpt from the Discussion: "... of our system was especially strong for the common conditions of acute upper respiratory infection and sinusitis, both of which were diagnosed with an accuracy of 0.95 between the machine-predicted diagnosis and the human physician-generated diagnosis. In contrast, dangerous conditions tend to be less common and would have ... The diagnostic hierarchy decision tree can be adjusted to what is most appropriate for the clinical situation. In terms of implementation, we foresee this type of AI-assisted diagnostic system being integrated into clinical practice in several ways. First, it could assist with triage procedures. For example, ..."]
Table 2 | Illustration of diagnostic performance of our AI model and physicians
Disease conditions | Our model | Physician group 1 | Physician group 2 | Physician group 3 | Physician group 4 | Physician group 5
Asthma | 0.920 | 0.801 | 0.837 | 0.904 | 0.890 | 0.935
Encephalitis | 0.837 | 0.947 | 0.961 | 0.950 | 0.959 | 0.965
Gastrointestinal disease | 0.865 | 0.818 | 0.872 | 0.854 | 0.896 | 0.893
Group: 'Acute laryngitis' | 0.786 | 0.808 | 0.730 | 0.879 | 0.940 | 0.943
Group: 'Pneumonia' | 0.888 | 0.829 | 0.767 | 0.946 | 0.952 | 0.972
Group: 'Sinusitis' | 0.932 | 0.839 | 0.797 | 0.896 | 0.873 | 0.870
Lower respiratory | 0.803 | 0.803 | 0.815 | 0.910 | 0.903 | 0.935
Mouth-related diseases | 0.897 | 0.818 | 0.872 | 0.854 | 0.896 | 0.893
Neuropsychiatric disease | 0.895 | 0.925 | 0.963 | 0.960 | 0.962 | 0.906
Respiratory | 0.935 | 0.808 | 0.769 | 0.890 | 0.907 | 0.917
Systemic or generalized | 0.925 | 0.879 | 0.907 | 0.952 | 0.907 | 0.944
Upper respiratory | 0.929 | 0.817 | 0.754 | 0.884 | 0.916 | 0.916
Root | 0.889 | 0.843 | 0.863 | 0.908 | 0.903 | 0.912
Average F1 score | 0.885 | 0.841 | 0.839 | 0.907 | 0.915 | 0.923
We used the F1 score to evaluate diagnostic performance across the different groups (rows): our model, two junior physician groups (groups 1 and 2), and three senior physician groups (groups 3, 4, and 5) (see Methods section for description). We observed that our model performed better than the junior physician groups but slightly worse than the three experienced physician groups. Root is the first level of the diagnosis classification.
• Across multiple organ systems,

• higher accuracy than the junior staff groups,

• but lower accuracy than the senior staff groups
• Analysis of complex medical data and derivation of insights

• Analysis and interpretation of medical imaging and pathology data

• Monitoring of continuous data for prevention and prediction
Three types of medical AI
REVIEW ARTICLE | FOCUS
https://doi.org/10.1038/s41591-018-0300-7
Department of Molecular Medicine, Scripps Research, La Jolla, CA, USA. e-mail: etopol@scripps.edu
Medicine is at the crossroad of two major trends. The first
is a failed business model, with increasing expenditures
and jobs allocated to healthcare, but with deteriorating key
outcomes, including reduced life expectancy and high infant, child-
hood, and maternal mortality in the United States1,2. This exem-
plifies a paradox that is not at all confined to American medicine:
investment of more human capital with worse human health out-
comes. The second is the generation of data in massive quantities,
from sources such as high-resolution medical imaging, biosensors
with continuous output of physiologic metrics, genome sequenc-
ing, and electronic medical records. The limits on analysis of such
data by humans alone have clearly been exceeded, necessitating
an increased reliance on machines. Accordingly, at the same time
that there is more dependence than ever on humans to provide
healthcare, algorithms are desperately needed to help. Yet the inte-
gration of human and artificial intelligence (AI) for medicine has
barely begun.
Looking deeper, there are notable, longstanding deficiencies in
healthcare that are responsible for its path of diminishing returns.
These include a large number of serious diagnostic errors, mis-
takes in treatment, an enormous waste of resources, inefficiencies
in workflow, inequities, and inadequate time between patients and
clinicians3,4. Eager for improvement, leaders in healthcare and com-
puter scientists have asserted that AI might have a role in address-
ing all of these problems. That might eventually be the case, but
researchers are at the starting gate in the use of neural networks to
ameliorate the ills of the practice of medicine. In this Review, I have
gathered much of the existing base of evidence for the use of AI in
medicine, laying out the opportunities and pitfalls.
Artificial intelligence for clinicians
Almost every type of clinician, ranging from specialty doctor to
paramedic, will be using AI technology, and in particular deep
learning, in the future. This largely involves pattern recognition
using deep neural networks (DNNs) (Box 1) that can help interpret
medical scans, pathology slides, skin lesions, retinal images, electro-
cardiograms, endoscopy, faces, and vital signs. The neural net inter-
pretation is typically compared with physicians’ assessments using a
plot of true-positive versus false-positive rates, known as a receiver
operating characteristic (ROC), for which the area under the curve
(AUC) is used to express the level of accuracy (Box 1).
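As a concrete illustration of the ROC/AUC comparison described above, a minimal sketch with made-up labels and model scores (scikit-learn assumed) is:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels (1 = disease present) and model probabilities for eight scans.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.30, 0.60])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # paired false- and true-positive rates
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print(round(auc, 3))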
Radiology. One field that has attracted particular attention for
application of AI is radiology5. Chest X-rays are the most common
type of medical scan, with more than 2 billion performed worldwide
per year. In one study, the accuracy of one algorithm, based on a
121-layer convolutional neural network, in detecting pneumonia in
over 112,000 labeled frontal chest X-ray images was compared with
that of four radiologists, and the conclusion was that the algorithm
outperformed the radiologists. However, the algorithm’s AUC of
0.76, although somewhat better than that for two previously tested
DNN algorithms for chest X-ray interpretation5, is far from optimal.
In addition, the test used in this study is not necessarily comparable
with the daily tasks of a radiologist, who will diagnose much more
than pneumonia in any given scan. To further validate the conclu-
sions of this study, a comparison with results from more than four
radiologists should be made. A team at Google used an algorithm
that analyzed the same image set as in the previously discussed
study to make 14 different diagnoses, resulting in AUC scores that
ranged from 0.63 for pneumonia to 0.87 for heart enlargement or
a collapsed lung6. More recently, in another related study, it was
shown that a DNN that is currently in use in hospitals in India for
interpretation of four different chest X-ray key findings was at least
as accurate as four radiologists7. For the narrower task of detecting
cancerous pulmonary nodules on a chest X-ray, a DNN that retro-
spectively assessed scans from over 34,000 patients achieved a level
of accuracy exceeding 17 of 18 radiologists8. It can be difficult for
emergency room doctors to accurately diagnose wrist fractures,
but a DNN led to marked improvement, increasing sensitivity from
81% to 92% and reducing misinterpretation by 47% (ref. 9).
Similarly, DNNs have been applied across a wide variety of medical scans, including bone films for fractures and estimation of aging10–12, classification of tuberculosis13, and vertebral compression fractures14; computed tomography (CT) scans for lung nodules15, liver masses16, pancreatic cancer17, and coronary calcium score18; brain scans for evidence of hemorrhage19, head trauma20, and acute referrals21; magnetic resonance imaging22; echocardiograms23,24; and mammographies25,26. A unique imaging-recognition study focusing on the breadth of acute neurologic events, such as stroke or head trauma, was carried out on over 37,000 head CT 3-D scans, which the algorithm analyzed for 13 different anatomical findings versus gold-standard labels (annotated by expert radiologists) and achieved an AUC of 0.73 (ref. 27). A simulated prospective, double-blind, randomized control trial was conducted with real cases from the dataset and showed that the deep-learning algorithm could interpret scans 150 times faster than radiologists (1.2 versus 177 seconds). But the conclusion that the algorithm's diagnostic accuracy in screening acute neurologic scans was poorer than human […]
High-performance medicine: the convergence of
human and artificial intelligence
Eric J. Topol
The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along
with markedly enhanced computing power and cloud storage, across all sectors. In medicine, this is beginning to have an impact
at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow
and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health.
The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these
applications will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely
be actualized, but whether that will be used to improve the patient–doctor relationship or facilitate its erosion remains to be seen.
NATURE MEDICINE | VOL 25 | JANUARY 2019 | 44–56 | www.nature.com/naturemedicine
Table 1 | Peer-reviewed publications of AI algorithms compared with doctors
Specialty | Images | Publication
Radiology/neurology | CT head, acute neurological events | Titano et al.27
 | CT head for brain hemorrhage | Arbabshirani et al.19
 | CT head for trauma | Chilamkurthy et al.20
 | CXR for metastatic lung nodules | Nam et al.8
 | CXR for multiple findings | Singh et al.7
 | Mammography for breast density | Lehman et al.26
 | Wrist X-ray* | Lindsey et al.9
Pathology | Breast cancer | Ehteshami Bejnordi et al.41
 | Lung cancer (+driver mutation) | Coudray et al.33
 | Brain tumors (+methylation) | Capper et al.45
 | Breast cancer metastases* | Steiner et al.35
 | Breast cancer metastases | Liu et al.34
Dermatology | Skin cancers | Esteva et al.47
 | Melanoma | Haenssle et al.48
 | Skin lesions | Han et al.49
Ophthalmology | Diabetic retinopathy | Gulshan et al.51
 | Diabetic retinopathy* | Abramoff et al.31
 | Diabetic retinopathy* | Kanagasingam et al.32
 | Congenital cataracts | Long et al.38
 | Retinal diseases (OCT) | De Fauw et al.56
 | Macular degeneration | Burlina et al.52
 | Retinopathy of prematurity | Brown et al.60
 | AMD and diabetic retinopathy | Kermany et al.53
Gastroenterology | Polyps at colonoscopy* | Mori et al.36
 | Polyps at colonoscopy | Wang et al.37
Cardiology | Echocardiography | Madani et al.23
 | Echocardiography | Zhang et al.24
ORIGINAL RESEARCH • THORACIC IMAGING
Development and Validation of Deep
Learning–based Automatic Detection
Algorithm for Malignant Pulmonary Nodules
on Chest Radiographs
Ju Gang Nam, MD* • Sunggyun Park, PhD* • Eui Jin Hwang, MD • Jong Hyuk Lee, MD • Kwang-Nam Jin, MD,
PhD • KunYoung Lim, MD, PhD • Thienkai HuyVu, MD, PhD • Jae Ho Sohn, MD • Sangheum Hwang, PhD • Jin
Mo Goo, MD, PhD • Chang Min Park, MD, PhD
Radiology 2018; 00:1–11 • https://doi.org/10.1148/radiol.2018180237
Purpose: To develop and validate a deep learning–based automatic detection algorithm (DLAD) for malignant pulmonary nodules
on chest radiographs and to compare its performance with physicians including thoracic radiologists.
Materials and Methods: For this retrospective study, DLAD was developed by using 43292 chest radiographs (normal radiograph–
to–nodule radiograph ratio, 34067:9225) in 34676 patients (healthy-to-nodule ratio, 30784:3892; 19230 men [mean age, 52.8
years; age range, 18–99 years]; 15446 women [mean age, 52.3 years; age range, 18–98 years]) obtained between 2010 and 2015,
which were labeled and partially annotated by 13 board-certified radiologists, in a convolutional neural network. Radiograph clas-
sification and nodule detection performances of DLAD were validated by using one internal and four external data sets from three
South Korean hospitals and one U.S. hospital. For internal and external validation, radiograph classification and nodule detection
performances of DLAD were evaluated by using the area under the receiver operating characteristic curve (AUROC) and jackknife
alternative free-response receiver-operating characteristic (JAFROC) figure of merit (FOM), respectively. An observer performance
test involving 18 physicians, including nine board-certified radiologists, was conducted by using one of the four external validation
data sets. Performances of DLAD, physicians, and physicians assisted with DLAD were evaluated and compared.
Results: According to one internal and four external validation data sets, radiograph classification and nodule detection perfor-
mances of DLAD were a range of 0.92–0.99 (AUROC) and 0.831–0.924 (JAFROC FOM), respectively. DLAD showed a higher
AUROC and JAFROC FOM at the observer performance test than 17 of 18 and 15 of 18 physicians, respectively (P < .05), and all physicians showed improved nodule detection performances with DLAD (mean JAFROC FOM improvement, 0.043; range, 0.006–0.190; P < .05).
Conclusion: This deep learning–based automatic detection algorithm outperformed physicians in radiograph classification and nod-
ule detection performance for malignant pulmonary nodules on chest radiographs, and it enhanced physicians’ performances when
used as a second reader.
©RSNA, 2018
Online supplemental material is available for this article.
• 43,292 chest PA (normal:nodule=34,067:9225)

• labeled/annotated by 13 board-certified radiologists.

• DLAD was validated on 1 internal + 4 external datasets

• Seoul National University Hospital / Boramae Medical Center / National Cancer Center / UCSF

• Classification / Lesion localization

• AI vs. physicians vs. AI + physicians

• Compared against physicians of varying experience levels

• Non-radiology physicians / radiology residents

• Board-certified radiologists / Thoracic radiologists
Nam et al
Figure 1: Images in a 78-year-old female patient with a 1.9-cm part-solid nodule at the left upper lobe. (a) The nodule was faintly visible on the
chest radiograph (arrowheads) and was detected by 11 of 18 observers. (b) At contrast-enhanced CT examination, biopsy confirmed lung adeno-
carcinoma (arrow). (c) DLAD reported the nodule with a confidence level of 2, resulting in its detection by an additional five radiologists and an
elevation in its confidence by eight radiologists.
Figure 2: Images in a 64-year-old male patient with a 2.2-cm lung adenocarcinoma at the left upper lobe. (a) The nodule was faintly visible on
the chest radiograph (arrowheads) and was detected by seven of 18 observers. (b) Biopsy confirmed lung adenocarcinoma in the left upper lobe
on contrast-enhanced CT image (arrow). (c) DLAD reported the nodule with a confidence level of 2, resulting in its detection by an additional two
radiologists and an elevated confidence level of the nodule by two radiologists.
• AI that reads hand X-ray images to estimate a patient's bone age

• Conventionally, physicians read bone age by comparing the X-ray against standard reference images, e.g. with the Greulich-Pyle method

• The AI finds sex- and age-specific patterns in the reference standard images, reports the similarity as probabilities, and retrieves matching standard images

• It can help physicians diagnose precocious puberty or growth delay
Development and Validation of a Deep Learning Algorithm
for Detection of Diabetic Retinopathy
in Retinal Fundus Photographs
Varun Gulshan, PhD; Lily Peng, MD, PhD; Marc Coram, PhD; Martin C. Stumpe, PhD; Derek Wu, BS; Arunachalam Narayanaswamy, PhD;
Subhashini Venugopalan, MS; Kasumi Widner, MS; Tom Madams, MEng; Jorge Cuadros, OD, PhD; Ramasamy Kim, OD, DNB;
Rajiv Raman, MS, DNB; Philip C. Nelson, BS; Jessica L. Mega, MD, MPH; Dale R. Webster, PhD
IMPORTANCE Deep learning is a family of computational methods that allow an algorithm to
program itself by learning from a large set of examples that demonstrate the desired
behavior, removing the need to specify rules explicitly. Application of these methods to
medical imaging requires further assessment and validation.
OBJECTIVE To apply deep learning to create an algorithm for automated detection of diabetic
retinopathy and diabetic macular edema in retinal fundus photographs.
DESIGN AND SETTING A specific type of neural network optimized for image classification
called a deep convolutional neural network was trained using a retrospective development
data set of 128 175 retinal images, which were graded 3 to 7 times for diabetic retinopathy,
diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists
and ophthalmology senior residents between May and December 2015. The resultant
algorithm was validated in January and February 2016 using 2 separate data sets, both
graded by at least 7 US board-certified ophthalmologists with high intragrader consistency.
EXPOSURE Deep learning–trained algorithm.
MAIN OUTCOMES AND MEASURES The sensitivity and specificity of the algorithm for detecting
referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy,
referable diabetic macular edema, or both, were generated based on the reference standard
of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2
operating points selected from the development set, one selected for high specificity and
another for high sensitivity.
RESULTS The EyePACS-1 data set consisted of 9963 images from 4997 patients (mean age, 54.4 years; 62.2% women; prevalence of RDR, 683/8878 fully gradable images [7.8%]); the Messidor-2 data set had 1748 images from 874 patients (mean age, 57.6 years; 42.6% women; prevalence of RDR, 254/1745 fully gradable images [14.6%]). For detecting RDR, the algorithm had an area under the receiver operating curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and 0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high specificity, for EyePACS-1, the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity was 98.1% (95% CI, 97.8%-98.5%). For Messidor-2, the sensitivity was 87.0% (95% CI, 81.1%-91.0%) and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and specificity was 93.4% and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%.
CONCLUSIONS AND RELEVANCE In this evaluation of retinal fundus photographs from adults
with diabetes, an algorithm based on deep machine learning had high sensitivity and
specificity for detecting referable diabetic retinopathy. Further research is necessary to
determine the feasibility of applying this algorithm in the clinical setting and to determine
whether use of the algorithm could lead to improved care and outcomes compared with
current ophthalmologic assessment.
JAMA. doi:10.1001/jama.2016.17216
Published online November 29, 2016.
• AUC = 0.991 on EyePACS-1 and 0.990 on Messidor-2

• Sensitivity and specificity on par with the panel of 7-8 ophthalmologists

• F-score: 0.95 (vs. 0.91 for the human ophthalmologists)
Figure 2. Validation Set Performance for Referable Diabetic Retinopathy. [ROC curves: A, EyePACS-1: AUC, 99.1% (95% CI, 98.8%-99.3%); B, Messidor-2: AUC, 99.0% (95% CI, 98.6%-99.5%); the high-sensitivity and high-specificity operating points are marked on each curve.]
Performance of the algorithm (black curve) and ophthalmologists (colored
circles) for the presence of referable diabetic retinopathy (moderate or worse
diabetic retinopathy or referable diabetic macular edema) on A, EyePACS-1
(8788 fully gradable images) and B, Messidor-2 (1745 fully gradable images).
The black diamonds on the graph correspond to the sensitivity and specificity of
the algorithm at the high-sensitivity and high-specificity operating points.
In A, for the high-sensitivity operating point, specificity was 93.4% (95% CI,
92.8%-94.0%) and sensitivity was 97.5% (95% CI, 95.8%-98.7%); for the
high-specificity operating point, specificity was 98.1% (95% CI, 97.8%-98.5%)
and sensitivity was 90.3% (95% CI, 87.5%-92.7%). In B, for the high-sensitivity
operating point, specificity was 93.9% (95% CI, 92.4%-95.3%) and sensitivity
was 96.1% (95% CI, 92.4%-98.3%); for the high-specificity operating point,
specificity was 98.5% (95% CI, 97.7%-99.1%) and sensitivity was 87.0% (95%
CI, 81.1%-91.0%). There were 8 ophthalmologists who graded EyePACS-1 and 7
ophthalmologists who graded Messidor-2. AUC indicates area under the
receiver operating characteristic curve.
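The two operating points are simply thresholds chosen on the development-set ROC curve. A minimal sketch of how such thresholds could be selected is shown below; the helper function and its targets are illustrative assumptions, not the authors' code.

import numpy as np
from sklearn.metrics import roc_curve

def pick_operating_points(y_true, y_score, min_spec=0.98, min_sens=0.975):
    """Return (high-specificity, high-sensitivity) score thresholds from a development set;
    y_true / y_score are hypothetical labels and model scores for referable DR."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    spec = 1.0 - fpr
    hs_mask = spec >= min_spec                              # candidates meeting the specificity target
    sn_mask = tpr >= min_sens                               # candidates meeting the sensitivity target
    high_spec_thr = thr[hs_mask][np.argmax(tpr[hs_mask])]   # most sensitive among high-specificity points
    high_sens_thr = thr[sn_mask][np.argmax(spec[sn_mask])]  # most specific among high-sensitivity points
    return high_spec_thr, high_sens_thr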
Accuracy of AI for reading retinal fundus images
LETTER doi:10.1038/nature21056
Dermatologist-level classification of skin cancer with deep neural networks
Andre Esteva*, Brett Kuprel*, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau & Sebastian Thrun
Skin cancer, the most common human malignancy1–3, is primarily diagnosed visually, beginning with an initial clinical screening and followed potentially by dermoscopic analysis, a biopsy and histopathological examination. Automated classification of skin lesions using images is a challenging task owing to the fine-grained variability in the appearance of skin lesions. Deep convolutional neural networks (CNNs)4,5 show potential for general and highly variable tasks across many fine-grained object categories6–11.
Here we demonstrate classification of skin lesions using a single
CNN, trained end-to-end from images directly, using only pixels
and disease labels as inputs. We train a CNN using a dataset of
129,450 clinical images—two orders of magnitude larger than
previous datasets12—consisting of 2,032 different diseases. We
test its performance against 21 board-certified dermatologists on
biopsy-proven clinical images with two critical binary classification
use cases: keratinocyte carcinomas versus benign seborrheic
keratoses; and malignant melanomas versus benign nevi. The first
case represents the identification of the most common cancers, the
second represents the identification of the deadliest skin cancer.
The CNN achieves performance on par with all tested experts
across both tasks, demonstrating an artificial intelligence capable
of classifying skin cancer with a level of competence comparable to
dermatologists. Outfitted with deep neural networks, mobile devices
can potentially extend the reach of dermatologists outside of the
clinic. It is projected that 6.3 billion smartphone subscriptions will
exist by the year 2021 (ref. 13) and can therefore potentially provide
low-cost universal access to vital diagnostic care.
There are 5.4 million new cases of skin cancer in the United States2
every year. One in five Americans will be diagnosed with a cutaneous
malignancy in their lifetime. Although melanomas represent fewer than
5% of all skin cancers in the United States, they account for approxi-
mately 75% of all skin-cancer-related deaths, and are responsible for
over 10,000 deaths annually in the United States alone. Early detection
is critical, as the estimated 5-year survival rate for melanoma drops
from over 99% if detected in its earliest stages to about 14% if detected
in its latest stages. We developed a computational method which may
allow medical practitioners and patients to proactively track skin
lesions and detect cancer earlier. By creating a novel disease taxonomy,
and a disease-partitioning algorithm that maps individual diseases into
training classes, we are able to build a deep learning system for auto-
mated dermatology.
Previous work in dermatological computer-aided classification12,14,15
has lacked the generalization capability of medical practitioners
owing to insufficient data and a focus on standardized tasks such as
dermoscopy16–18 and histological image classification19–22. Dermoscopy
images are acquired via a specialized instrument and histological
images are acquired via invasive biopsy and microscopy; whereby
both modalities yield highly standardized images. Photographic
images (for example, smartphone images) exhibit variability in factors
such as zoom, angle and lighting, making classification substantially
more challenging23,24. We overcome this challenge by using a data-
driven approach—1.41 million pre-training and training images
make classification robust to photographic variability. Many previous
techniques require extensive preprocessing, lesion segmentation and
extraction of domain-specific visual features before classification. By
contrast, our system requires no hand-crafted features; it is trained
end-to-end directly from image labels and raw pixels, with a single
network for both photographic and dermoscopic images. The existing
body of work uses small datasets of typically less than a thousand
images of skin lesions16,18,19, which, as a result, do not generalize well
to new images. We demonstrate generalizable classification with a new
dermatologist-labelled dataset of 129,450 clinical images, including
3,374 dermoscopy images.
Deep learning algorithms, powered by advances in computation
and very large datasets25, have recently been shown to exceed human performance in visual tasks such as playing Atari games26, strategic board games like Go27 and object recognition6. In this paper we
outline the development of a CNN that matches the performance of
dermatologists at three key diagnostic tasks: melanoma classification,
melanoma classification using dermoscopy and carcinoma
classification. We restrict the comparisons to image-based classification.
We utilize a GoogleNet Inception v3 CNN architecture9 that was pretrained on approximately 1.28 million images (1,000 object categories) from the 2014 ImageNet Large Scale Visual Recognition Challenge6, and train it on our dataset using transfer learning28. Figure 1 shows the
working system. The CNN is trained using 757 disease classes. Our
dataset is composed of dermatologist-labelled images organized in a
tree-structured taxonomy of 2,032 diseases, in which the individual
diseases form the leaf nodes. The images come from 18 different
clinician-curated, open-access online repositories, as well as from
clinical data from Stanford University Medical Center. Figure 2a shows
a subset of the full taxonomy, which has been organized clinically and
visually by medical experts. We split our dataset into 127,463 training
and validation images and 1,942 biopsy-labelled test images.
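As an illustration of the transfer-learning setup described above, a minimal PyTorch/torchvision sketch follows; it is an assumption-laden re-creation (the paper used Google's Inception v3), not the authors' implementation.

import torch.nn as nn
from torchvision import models

# Load Inception v3 pretrained on ImageNet and replace its heads with 757-way classifiers,
# 757 being the number of fine-grained training classes reported in the paper.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 757)                       # main classification head
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 757)   # auxiliary head used during training
# All layers remain trainable, so the pretrained weights are fine-tuned end to end on the new images.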
To take advantage of fine-grained information contained within the
taxonomy structure, we develop an algorithm (Extended Data Table 1)
to partition diseases into fine-grained training classes (for example,
amelanotic melanoma and acrolentiginous melanoma). During
inference, the CNN outputs a probability distribution over these fine
classes. To recover the probabilities for coarser-level classes of interest
(for example, melanoma) we sum the probabilities of their descendants
(see Methods and Extended Data Fig. 1 for more details).
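The descendant-summing step can be illustrated with a small sketch; the class names, mapping and probabilities below are hypothetical.

import numpy as np

fine_classes = ["amelanotic melanoma", "acrolentiginous melanoma", "blue nevus"]
fine_to_coarse = {"amelanotic melanoma": "melanoma",
                  "acrolentiginous melanoma": "melanoma",
                  "blue nevus": "benign nevus"}

def coarse_probabilities(fine_probs):
    """Sum the CNN's fine-class probabilities over the descendants of each coarse class."""
    out = {}
    for name, p in zip(fine_classes, fine_probs):
        coarse = fine_to_coarse[name]
        out[coarse] = out.get(coarse, 0.0) + float(p)
    return out

print(coarse_probabilities(np.array([0.125, 0.25, 0.625])))   # {'melanoma': 0.375, 'benign nevus': 0.625}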
We validate the effectiveness of the algorithm in two ways, using
nine-fold cross-validation. First, we validate the algorithm using a
three-class disease partition—the first-level nodes of the taxonomy,
which represent benign lesions, malignant lesions and non-neoplastic
*These authors contributed equally to this work.
Skin cancer classification accuracy: deep learning vs. dermatologists
[Figure: sensitivity–specificity curves comparing the CNN with dermatologists on six test sets: melanoma (130 images), melanoma (225 images), melanoma (111 dermoscopy images), carcinoma (707 images), melanoma (1,010 dermoscopy images) and carcinoma (135 images). Reported algorithm AUCs: 0.96 (vs. 25 dermatologists), 0.94 (vs. 22 dermatologists) and 0.91 (vs. 21 dermatologists); the average dermatologist is also marked on each plot.]
A substantial number of the 21 dermatologists were less accurate than the AI

The dermatologists' average performance was also below that of the AI
ARTICLES
https://doi.org/10.1038/s41591-018-0177-5
According to the American Cancer Society and the Cancer
Statistics Center (see URLs), over 150,000 patients with lung
cancer succumb to the disease each year (154,050 expected
for 2018), while another 200,000 new cases are diagnosed on a
yearly basis (234,030 expected for 2018). It is one of the most widely
spread cancers in the world because of not only smoking, but also
exposure to toxic chemicals like radon, asbestos and arsenic. LUAD
and LUSC are the two most prevalent types of non–small cell lung
cancer1, and each is associated with discrete treatment guidelines. In
the absence of definitive histologic features, this important distinc-
tion can be challenging and time-consuming, and requires confir-
matory immunohistochemical stains.
Classification of lung cancer type is a key diagnostic process
because the available treatment options, including conventional
chemotherapy and, more recently, targeted therapies, differ for
LUAD and LUSC2. Also, a LUAD diagnosis will prompt the search for molecular biomarkers and sensitizing mutations and thus has a great impact on treatment options3,4. For example, epidermal
growth factor receptor (EGFR) mutations, present in about 20% of
LUAD, and anaplastic lymphoma receptor tyrosine kinase (ALK)
rearrangements, present in <5% of LUAD5, currently have targeted therapies approved by the Food and Drug Administration (FDA)6,7. Mutations in other genes, such as KRAS and tumor pro-
tein P53 (TP53) are very common (about 25% and 50%, respec-
tively) but have proven to be particularly challenging drug targets
so far5,8. Lung biopsies are typically used to diagnose lung cancer
type and stage. Virtual microscopy of stained images of tissues is
typically acquired at magnifications of 20× to 40×, generating very large two-dimensional images (10,000 to >100,000 pixels in each
dimension) that are oftentimes challenging to visually inspect in
an exhaustive manner. Furthermore, accurate interpretation can be
difficult, and the distinction between LUAD and LUSC is not always
clear, particularly in poorly differentiated tumors; in this case, ancil-
lary studies are recommended for accurate classification9,10. To assist experts, automatic analysis of lung cancer whole-slide images has been recently studied to predict survival outcomes11 and classification12. For the latter, Yu et al.12
combined conventional thresholding
and image processing techniques with machine-learning methods,
such as random forest classifiers, support vector machines (SVM) or
Naive Bayes classifiers, achieving an AUC of ~0.85 in distinguishing
normal from tumor slides, and ~0.75 in distinguishing LUAD from
LUSC slides. More recently, deep learning was used for the classi-
fication of breast, bladder and lung tumors, achieving an AUC of
0.83 in classification of lung tumor types on tumor slides from The
Cancer Genome Atlas (TCGA)13. Analysis of plasma DNA values was also shown to be a good predictor of the presence of non–small cell cancer, with an AUC of ~0.94 (ref. 14) in distinguishing LUAD from LUSC, whereas the use of immunochemical markers yields an AUC of ~0.941 (ref. 15).
Here, we demonstrate how the field can further benefit from deep
learning by presenting a strategy based on convolutional neural
networks (CNNs) that not only outperforms methods in previously
Classification and mutation prediction from
non–small cell lung cancer histopathology
images using deep learning
Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L. Moreira, Narges Razavian* and Aristotelis Tsirigos*
Visual inspection of histopathology slides is one of the main methods used by pathologists to assess the stage, type and sub-
type of lung tumors. Adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) are the most prevalent subtypes of lung
cancer, and their distinction requires visual inspection by an experienced pathologist. In this study, we trained a deep con-
volutional neural network (inception v3) on whole-slide images obtained from The Cancer Genome Atlas to accurately and
automatically classify them into LUAD, LUSC or normal lung tissue. The performance of our method is comparable to that of
pathologists, with an average area under the curve (AUC) of 0.97. Our model was validated on independent datasets of frozen
tissues, formalin-fixed paraffin-embedded tissues and biopsies. Furthermore, we trained the network to predict the ten most
commonly mutated genes in LUAD. We found that six of them—STK11, EGFR, FAT1, SETBP1, KRAS and TP53—can be pre-
dicted from pathology images, with AUCs from 0.733 to 0.856 as measured on a held-out population. These findings suggest
that deep-learning models can assist pathologists in the detection of cancer subtype or gene mutations. Our approach can be
applied to any cancer type, and the code is available at https://github.com/ncoudray/DeepPATH.
• Accurately distinguishes normal tissue, adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC)

• AUC above 0.99 and 0.95 for tumor vs. normal and LUAD vs. LUSC, respectively

• Distinguishing any one of normal/LUAD/LUSC from the other two also reaches AUC above 0.9 at both 5x and 20x magnification

• This accuracy is on par with three pathologists

• Of the cases the deep learning model misclassified, 50% were also misclassified by at least one of the three pathologists,

• and of the cases that at least one of the three pathologists misclassified, 83% were classified correctly by the deep learning model.
• Furthermore, when the AI developed on TCGA data was applied to

• completely independent LUAD/LUSC data obtained in three different ways (fresh frozen, FFPE and biopsy specimens),

• it still read most of them correctly, with AUCs mostly above 0.9 (a per-tile aggregation sketch follows below).
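A slide-level call in this kind of pipeline is derived from many tile-level predictions. The sketch below shows one simple aggregation (averaging tile probabilities and taking the argmax); the class names come from the paper, but the numbers and the averaging rule are illustrative assumptions rather than the authors' exact procedure.

import numpy as np

# Hypothetical per-tile probabilities (columns: normal, LUAD, LUSC) for one whole-slide image.
tile_probs = np.array([[0.10, 0.70, 0.20],
                       [0.20, 0.60, 0.20],
                       [0.80, 0.10, 0.10]])

slide_prob = tile_probs.mean(axis=0)                               # average the per-tile probabilities
slide_call = ["normal", "LUAD", "LUSC"][int(slide_prob.argmax())]  # take the most probable class
print(slide_prob.round(3), slide_call)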
fibrosis, inflammation or blood was also present, but also in very
poorly differentiated tumors. Sections obtained from biopsies are
usually much smaller, which reduces the number of tiles per slide,
but the performance of our model remains consistent for the 102
samples tested (AUC ~0.834–0.861 using 20× magnification and 0.871–0.928 using 5× magnification; Fig. 2c), and the accuracy
of the classification does not correlate with the sample size or the
size of the area selected by our pathologist (Supplementary Fig. 4;
the tumor area on the frozen and FFPE samples, then applied this
model to the biopsies and finally applied the TCGA-trained three-
way classifier on the tumor area selected by the automatic tumor
selection model. The per-tile AUC of the automatic tumor selection
model (using the pathologist’s tumor selection as reference) was
0.886 (CI, 0.880–0.891) for the biopsies, 0.797 (CI, 0.795–0.800)
for the frozen samples, and 0.852 (CI, 0.808–0.895) for the FFPE
samples. As demonstrated in Supplementary Fig. 3a (right-most bar
[Figure 2 ROC summary. Frozen sections: LUAD at 5×, AUC = 0.919 (CI 0.861–0.949); LUSC at 5×, AUC = 0.977 (CI 0.949–0.995); LUAD at 20×, AUC = 0.913 (CI 0.849–0.963); LUSC at 20×, AUC = 0.941 (CI 0.894–0.977). FFPE sections: LUAD at 5×, AUC = 0.861 (CI 0.792–0.919); LUSC at 5×, AUC = 0.975 (CI 0.945–0.996); LUAD at 20×, AUC = 0.833 (CI 0.762–0.894); LUSC at 20×, AUC = 0.932 (CI 0.884–0.971). Biopsies: LUAD at 5×, AUC = 0.871 (CI 0.784–0.938); LUSC at 5×, AUC = 0.928 (CI 0.871–0.972); LUAD at 20×, AUC = 0.834 (CI 0.743–0.909); LUSC at 20×, AUC = 0.861 (CI 0.780–0.928).]
Fig. 2 | Classification of presence and type of tumor on alternative cohorts. a–c, Receiver operating characteristic (ROC) curves (left) from tests on
frozen sections (n=98 biologically independent slides) (a), FFPE sections (n=140 biologically independent slides) (b) and biopsies (n=102 biologically
independent slides) from NYU Langone Medical Center (c). On the right of each plot, we show examples of raw images with an overlap in light gray of the
mask generated by a pathologist and the corresponding heatmaps obtained with the three-way classifier. Scale bars, 1mm.
ARTICLES
https://doi.org/10.1038/s41551-018-0301-3
Colonoscopy is the gold-standard screening test for colorectal
cancer1–3, one of the leading causes of cancer death in both the United States4,5 and China6. Colonoscopy can reduce the risk of death from colorectal cancer through the detection of tumours at an earlier, more treatable stage as well as through the removal of precancerous adenomas3,7. Conversely, failure to detect adenomas may lead to the development of interval cancer. Evidence has shown that each 1.0% increase in adenoma detection rate (ADR) leads to a 3.0% decrease in the risk of interval colorectal cancer8.
Although more than 14 million colonoscopies are performed in the United States annually2, the adenoma miss rate (AMR) is estimated to be 6–27%9. Certain polyps may be missed more frequently, including smaller polyps10,11, flat polyps12 and polyps in the left colon13. There are two independent reasons why a polyp may
be missed during colonoscopy: (i) it was never in the visual field or
(ii) it was in the visual field but not recognized. Several hardware
innovations have sought to address the first problem by improv-
ing visualization of the colonic lumen, for instance by providing a
larger, panoramic camera view, or by flattening colonic folds using a
distal-cap attachment. The problem of unrecognized polyps within
the visual field has been more difficult to address14. Several studies have shown that observation of the video monitor by either nurses or gastroenterology trainees may increase polyp detection by up to 30%15–17. Ideally, a real-time automatic polyp-detection system
could serve as a similarly effective second observer that could draw
the endoscopist’s eye, in real time, to concerning lesions, effec-
tively creating an ‘extra set of eyes’ on all aspects of the video data
with fidelity. Although automatic polyp detection in colonoscopy
videos has been an active research topic for the past 20 years, per-
formance levels close to that of the expert endoscopist18–20 have not been achieved. Early work in automatic polyp detection has focused on applying deep-learning techniques to polyp detection, but most published works are small in scale, with small development and/or training validation sets19,20.
Here, we report the development and validation of a deep-learn-
ing algorithm, integrated with a multi-threaded processing system,
for the automatic detection of polyps during colonoscopy. We vali-
dated the system in two image studies and two video studies. Each
study contained two independent validation datasets.
Results
We developed a deep-learning algorithm using 5,545 colonoscopy images from colonoscopy reports of 1,290 patients that underwent a colonoscopy examination in the Endoscopy Center of Sichuan Provincial People's Hospital between January 2007 and December 2015. Out of the 5,545 images used, 3,634 images contained polyps (65.54%) and 1,911 images did not contain polyps (34.46%). For algorithm training, experienced endoscopists annotated the presence of each polyp in all of the images in the development dataset. We validated the algorithm on four independent datasets. Datasets A and B were used for image analysis, and datasets C and D were used for video analysis.
Dataset A contained 27,113 colonoscopy images from colonoscopy reports of 1,138 consecutive patients who underwent a colonoscopy examination in the Endoscopy Center of Sichuan Provincial People's Hospital between January and December 2016 and who were found to have at least one polyp. Out of the 27,113 images, 5,541 images contained polyps (20.44%) and 21,572 images did not contain polyps (79.56%). All polyps were confirmed histologically after biopsy. Dataset B is a public database (CVC-ClinicDB; […]
Development and validation of a deep-learning
algorithm for the detection of polyps during
colonoscopy
Pu Wang, Xiao Xiao, Jeremy R. Glissen Brown, Tyler M. Berzin, Mengtian Tu, Fei Xiong, Xiao Hu, Peixi Liu, Yan Song, Di Zhang, Xue Yang, Liangping Li, Jiong He, Xin Yi, Jingjia Liu and Xiaogang Liu*
The detection and removal of precancerous polyps via colonoscopy is the gold standard for the prevention of colon cancer.
However, the detection rate of adenomatous polyps can vary significantly among endoscopists. Here, we show that a machine-learning algorithm can detect polyps in clinical colonoscopies, in real time and with high sensitivity and specificity. We developed the deep-learning algorithm by using data from 1,290 patients, and validated it on newly collected 27,113 colonoscopy images
from 1,138 patients with at least one detected polyp (per-image-sensitivity, 94.38%; per-image-specificity, 95.92%; area under
the receiver operating characteristic curve, 0.984), on a public database of 612 polyp-containing images (per-image-sensitiv-
ity, 88.24%), on 138 colonoscopy videos with histologically confirmed polyps (per-image-sensitivity of 91.64%; per-polyp-sen-
sitivity, 100%), and on 54 unaltered full-range colonoscopy videos without polyps (per-image-specificity, 95.40%). By using a
multi-threaded processing system, the algorithm can process at least 25 frames per second with a latency of 76.80±5.60ms
in real-time video analysis. The software may aid endoscopists while performing colonoscopies, and help assess differences in
polyp and adenoma detection performance among endoscopists.
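The real-time behaviour comes from decoupling frame capture from model inference. A minimal multi-threaded sketch of that general pattern is shown below; run_polyp_detector is a stub standing in for the trained detector, and nothing here is the authors' actual system.

import queue
import threading

def run_polyp_detector(frame):
    """Stub standing in for the trained CNN detector; returns a list of bounding boxes."""
    return []

frames = queue.Queue(maxsize=64)     # frames grabbed from the colonoscopy video feed
results = queue.Queue()              # (frame index, detections) pairs for the display loop

def worker():
    while True:
        idx, frame = frames.get()
        results.put((idx, run_polyp_detector(frame)))
        frames.task_done()

# Several worker threads run the detector so that a ~25 frames-per-second capture loop is never
# blocked by the model's per-frame latency; the display thread overlays whatever is in `results`.
for _ in range(2):
    threading.Thread(target=worker, daemon=True).start()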
• AI that uses deep learning to detect polyps accurately during clinical colonoscopy

• Runs in real time: processes 25 frames per second

• Validated on both images and videos

• Sensitivity and specificity mostly above 90%
• Some polyps were detected with only partial appearance.
• Polyps were detected in both normal and insufficient light conditions.
• Polyps were detected under both qualified and suboptimal bowel preparations.
from patients who underwent colonoscopy examinations up to 2
years later.
Also, we demonstrated high per-image-sensitivity (94.38% and 91.64%) in both the image (dataset A) and video (dataset C) analyses. Datasets A and C included large variations of polyp morphology and image quality (Fig. 3, Supplementary Figs. 2–5 and Supplementary Videos 3 and 4). For images with only flat and iso- […] datasets are often small and do not represent the full range of colon conditions encountered in the clinical setting, and there are often discrepancies in the reporting of clinical metrics of success such as sensitivity and specificity19,20,26. Compared with other metrics such
as precision, we believe that sensitivity and specificity are the most
appropriate metrics for the evaluation of algorithm performance
because of their independence on the ratio of positive to negative
Fig. 3 | Examples of polyp detection for datasetsA and C. Polyps of different morphology, including flat isochromatic polyps (left), dome-shaped polyps
(second from left, middle), pedunculated polyps (second from right) and sessile serrated adenomatous polyps (right), were detected by the algorithm
(as indicated by the green tags in the bottom set of images) in both normal and insufficient light conditions, under both qualified and suboptimal bowel
preparations. Some polyps were detected with only partial appearance (middle, second from right). See Supplementary Figs 2–6 for additional examples.
Examples of Polyp Detection for Datasets A and C
• Analysis of complex medical data and derivation of insights

• Analysis and interpretation of medical imaging and pathology data

• Monitoring of continuous data for prevention and prediction
Three types of medical AI
http://www.rolls-royce.com/about/our-technology/enabling-technologies/engine-health-management.aspx#sense
250 sensors to monitor the “health” of the GE turbines
Fig 1. What can consumer wearables do? Heart rate can be measured with an oximeter built into a ring [3], muscle activity with an electromyographic sensor embedded into clothing [4], stress with an electrodermal sensor incorporated into a wristband [5], and physical activity or sleep patterns via an accelerometer in a watch [6,7]. In addition, a female's most fertile period can be identified with detailed body temperature tracking [8], while levels of mental attention can be monitored with a small number of non-gelled electroencephalogram (EEG) electrodes [9]. Levels of social interaction (also known to […]
PLOS Medicine 2016
SEPSIS
A targeted real-time early warning score (TREWScore) for septic shock
Katharine E. Henry, David N. Hager, Peter J. Pronovost, Suchi Saria*
Sepsis is a leading cause of death in the United States, with mortality highest among patients who develop septic
shock. Early aggressive treatment decreases morbidity and mortality. Although automated screening tools can detect
patients currently experiencing severe sepsis and septic shock, none predict those at greatest risk of developing
shock. We analyzed routinely available physiological and laboratory data from intensive care unit patients and devel-
oped “TREWScore,” a targeted real-time early warning score that predicts which patients will develop septic shock.
TREWScore identified patients before the onset of septic shock with an area under the ROC (receiver operating
characteristic) curve (AUC) of 0.83 [95% confidence interval (CI), 0.81 to 0.85]. At a specificity of 0.67, TREWScore
achieved a sensitivity of 0.85 and identified patients a median of 28.2 [interquartile range (IQR), 10.6 to 94.2] hours
before onset. Of those identified, two-thirds were identified before any sepsis-related organ dysfunction. In compar-
ison, the Modified Early Warning Score, which has been used clinically for septic shock prediction, achieved a lower
AUC of 0.73 (95% CI, 0.71 to 0.76). A routine screening protocol based on the presence of two of the systemic inflam-
matory response syndrome criteria, suspicion of infection, and either hypotension or hyperlactatemia achieved a low-
er sensitivity of 0.74 at a comparable specificity of 0.64. Continuous sampling of data from the electronic health
records and calculation of TREWScore may allow clinicians to identify patients at risk for septic shock and provide
earlier interventions that would prevent or mitigate the associated morbidity and mortality.
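Two of the quantities reported above, sensitivity at a fixed specificity and the median lead time before onset, can be computed with a short sketch; the inputs are hypothetical per-patient risk scores and times, not the study data.

import numpy as np

def sensitivity_at_specificity(scores_pos, scores_neg, target_spec=0.67):
    """Choose the threshold below which `target_spec` of non-shock patients fall,
    then report the fraction of shock patients scoring at or above it."""
    thr = np.quantile(scores_neg, target_spec)
    sensitivity = float(np.mean(np.asarray(scores_pos) >= thr))
    return thr, sensitivity

def median_lead_time(alert_times, onset_times):
    """Median hours between the first alert and the onset of septic shock."""
    return float(np.median(np.asarray(onset_times) - np.asarray(alert_times)))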
INTRODUCTION
Seven hundred fifty thousand patients develop severe sepsis and septic
shock in the United States each year. More than half of them are
admitted to an intensive care unit (ICU), accounting for 10% of all
ICU admissions, 20 to 30% of hospital deaths, and $15.4 billion in an-
nual health care costs (1–3). Several studies have demonstrated that
morbidity, mortality, and length of stay are decreased when severe sep-
sis and septic shock are identified and treated early (4–8). In particular,
one study showed that mortality from septic shock increased by 7.6%
with every hour that treatment was delayed after the onset of hypo-
tension (9).
More recent studies comparing protocolized care, usual care, and
early goal-directed therapy (EGDT) for patients with septic shock sug-
gest that usual care is as effective as EGDT (10–12). Some have inter-
preted this to mean that usual care has improved over time and reflects
important aspects of EGDT, such as early antibiotics and early ag-
gressive fluid resuscitation (13). It is likely that continued early identi-
fication and treatment will further improve outcomes. However, the
Acute Physiology Score (SAPS II), Sequential Organ Failure Assessment
(SOFA) scores, Modified Early Warning Score (MEWS), and Simple
Clinical Score (SCS) have been validated to assess illness severity and
risk of death among septic patients (14–17). Although these scores
are useful for predicting general deterioration or mortality, they typical-
ly cannot distinguish with high sensitivity and specificity which patients
are at highest risk of developing a specific acute condition.
The increased use of electronic health records (EHRs), which can be
queried in real time, has generated interest in automating tools that
identify patients at risk for septic shock (18–20). A number of “early
warning systems,” “track and trigger” initiatives, “listening applica-
tions,” and “sniffers” have been implemented to improve detection
and timeliness of therapy for patients with severe sepsis and septic shock
(18, 20–23). Although these tools have been successful at detecting pa-
tients currently experiencing severe sepsis or septic shock, none predict
which patients are at highest risk of developing septic shock.
The adoption of the Affordable Care Act has added to the growing
excitement around predictive models derived from electronic health
Fig. 2. ROC for detection of septic shock before onset in the validation
set. The ROC curve for TREWScore is shown in blue, with the ROC curve for
MEWS in red. The sensitivity and specificity performance of the routine
screening criteria is indicated by the purple dot. Normal 95% CIs are shown
for TREWScore and MEWS. TPR, true-positive rate; FPR, false-positive rate.
A targeted real-time early warning score (TREWScore)
for septic shock
AUC = 0.83
At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85
and identified patients a median of 28.2 hours before onset.
ADA 2018
• Launched as an iPhone app in the US

• The key question is how cumbersome it will be to use

• How long does it have to be used to be effective: two weeks? for life?

• How will food logging and similar inputs be handled?

• The pricing model does not appear to have been disclosed yet
ADA 2018
An Algorithm Based on Deep Learning for Predicting In-Hospital
Cardiac Arrest
Joon-myoung Kwon, MD;* Youngnam Lee, MS;* Yeha Lee, PhD; Seungwoo Lee, BS; Jinsik Park, MD, PhD
Background-—In-hospital cardiac arrest is a major burden to public health, which affects patient safety. Although traditional track-
and-trigger systems are used to predict cardiac arrest early, they have limitations, with low sensitivity and high false-alarm rates.
We propose a deep learning–based early warning system that shows higher performance than the existing track-and-trigger
systems.
Methods and Results-—This retrospective cohort study reviewed patients who were admitted to 2 hospitals from June 2010 to July
2017. A total of 52 131 patients were included. Specifically, a recurrent neural network was trained using data from June 2010 to
January 2017. The result was tested using the data from February to July 2017. The primary outcome was cardiac arrest, and the
secondary outcome was death without attempted resuscitation. As comparative measures, we used the area under the receiver
operating characteristic curve (AUROC), the area under the precision–recall curve (AUPRC), and the net reclassification index.
Furthermore, we evaluated sensitivity while varying the number of alarms. The deep learning–based early warning system (AUROC:
0.850; AUPRC: 0.044) significantly outperformed a modified early warning score (AUROC: 0.603; AUPRC: 0.003), a random forest
algorithm (AUROC: 0.780; AUPRC: 0.014), and logistic regression (AUROC: 0.613; AUPRC: 0.007). Furthermore, the deep learning–
based early warning system reduced the number of alarms by 82.2%, 13.5%, and 42.1% compared with the modified early warning
system, random forest, and logistic regression, respectively, at the same sensitivity.
Conclusions-—An algorithm based on deep learning had high sensitivity and a low false-alarm rate for detection of patients with
cardiac arrest in the multicenter study. (J Am Heart Assoc. 2018;7:e008678. DOI: 10.1161/JAHA.118.008678.)
Key Words: artificial intelligence • cardiac arrest • deep learning • machine learning • rapid response system • resuscitation
In-hospital cardiac arrest is a major burden to public health,
which affects patient safety.1–3
More than a half of cardiac
arrests result from respiratory failure or hypovolemic shock,
and 80% of patients with cardiac arrest show signs of
deterioration in the 8 hours before cardiac arrest.4–9
However,
209 000 in-hospital cardiac arrests occur in the United States
each year, and the survival discharge rate for patients with
cardiac arrest is <20% worldwide.10,11
Rapid response systems
(RRSs) have been introduced in many hospitals to detect
cardiac arrest using the track-and-trigger system (TTS).12,13
Two types of TTS are used in RRSs. For the single-parameter
TTS (SPTTS), cardiac arrest is predicted if any single vital sign
(eg, heart rate [HR], blood pressure) is out of the normal
range.14
The aggregated weighted TTS calculates a weighted
score for each vital sign and then finds patients with cardiac
arrest based on the sum of these scores.15
The modified early
warning score (MEWS) is one of the most widely used
approaches among all aggregated weighted TTSs (Table 1)16; however, traditional TTSs including MEWS have limitations, with low sensitivity or high false-alarm rates.14,15,17
Sensitivity and
false-alarm rate interact: Increased sensitivity creates higher
false-alarm rates and vice versa.
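The difference between the two TTS families can be sketched in a few lines. The scoring bands below are illustrative only and are not the validated MEWS cut-offs; a real deployment must use the locally approved scoring table (which also includes the level of consciousness).

def aggregated_weighted_score(hr, rr, sbp, temp):
    """Toy aggregated weighted track-and-trigger score with made-up bands."""
    score = 0
    score += 0 if 51 <= hr <= 100 else (1 if 41 <= hr <= 110 else 2)
    score += 0 if 9 <= rr <= 14 else (1 if 15 <= rr <= 20 else 2)
    score += 0 if 101 <= sbp <= 199 else (1 if 81 <= sbp <= 100 else 2)
    score += 0 if 35.0 <= temp <= 38.4 else 2
    return score

def single_parameter_trigger(hr, rr, sbp, temp):
    """A single-parameter TTS fires whenever any one vital sign leaves its normal range."""
    return not (51 <= hr <= 100 and 9 <= rr <= 14 and 101 <= sbp <= 199 and 35.0 <= temp <= 38.4)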
Current RRSs suffer from low sensitivity or a high false-
alarm rate. An RRS was used for only 30% of patients before
unplanned intensive care unit admission and was not used for
22.8% of patients, even if they met the criteria.18,19
From the Departments of Emergency Medicine (J.-m.K.) and Cardiology (J.P.), Mediplex Sejong Hospital, Incheon, Korea; VUNO, Seoul, Korea (Youngnam L., Yeha L.,
S.L.).
*Dr Kwon and Mr Youngnam Lee contributed equally to this study.
Correspondence to: Joon-myoung Kwon, MD, Department of Emergency medicine, Mediplex Sejong Hospital, 20, Gyeyangmunhwa-ro, Gyeyang-gu, Incheon 21080,
Korea. E-mail: kwonjm@sejongh.co.kr
Received January 18, 2018; accepted May 31, 2018.
• Number of patients: 86,290

• Cardiac arrest events: 633

• Input: heart rate, respiratory rate, body temperature, systolic blood pressure
(source: VUNO)
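Conceptually, such a model consumes a time series of these four vital signs and emits a deterioration risk at every time step. Below is a minimal PyTorch sketch of that idea; the GRU cell and layer sizes are illustrative assumptions, not the published DEWS architecture.

```python
# Minimal sketch of a recurrent early-warning model over four vital signs
# (heart rate, respiratory rate, body temperature, systolic BP).
# Cell type and layer sizes are illustrative, not the published DEWS architecture.
import torch
import torch.nn as nn

class VitalSignRNN(nn.Module):
    def __init__(self, n_vitals: int = 4, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_vitals, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # per-time-step risk of deterioration

    def forward(self, x):                     # x: (batch, time, n_vitals)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # (batch, time) risk in [0, 1]

model = VitalSignRNN()
vitals = torch.randn(8, 24, 4)                # 8 patients, 24 hourly measurements, 4 vitals
risk = model(vitals)
print(risk.shape)                             # torch.Size([8, 24])
```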
Cardiac Arrest Prediction Accuracy
• At alarm volumes that a university-hospital rapid response team can realistically handle (operating points A and B), the accuracy gap is even larger

•A: DEWS 33.0%, MEWS 0.3%

•B: DEWS 42.7%, MEWS 4.0%
(source: VUNO)
APPH (Alarms Per Patient Per Hour)
(source: VUNO)
Fewer False Alarms
(source: VUNO)
Change in DEWS predictions over time
https://doi.org/10.1038/s41591-018-0268-3
1 Department of Computer Science, Stanford University, Stanford, CA, USA. 2 iRhythm Technologies Inc., San Francisco, CA, USA. 3 Division of Cardiology, Department of Medicine, University of California San Francisco, San Francisco, CA, USA. 4 Department of Medicine and Center for Digital Health, Stanford University School of Medicine, Stanford, CA, USA. 5 Veterans Affairs Palo Alto Health Care System, Palo Alto, CA, USA. 6 These authors contributed equally: Awni Y. Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H. Tison. *e-mail: awni@cs.stanford.edu
Computerized electrocardiogram (ECG) interpretation plays a critical role in the clinical ECG workflow1. Widely available digital ECG data and the algorithmic paradigm of deep learning2 present an opportunity to substantially improve the accuracy and scalability of automated ECG analysis. However, a comprehensive evaluation of an end-to-end deep learning approach for ECG analysis across a wide variety of diagnostic classes has not been previously reported. Here, we develop a deep neural network (DNN) to classify 12 rhythm classes using 91,232 single-lead ECGs from 53,549 patients who used a single-lead ambulatory ECG monitoring device. When validated against an independent test dataset annotated by a consensus committee of board-certified practicing cardiologists, the DNN achieved an average area under the receiver operating characteristic curve (ROC) of 0.97. The average F1 score, which is the harmonic mean of the positive predictive value and sensitivity, for the DNN (0.837) exceeded that of average cardiologists (0.780). With specificity fixed at the average specificity achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes. These findings demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach could reduce the rate of misdiagnosed computerized ECG interpretations and improve the efficiency of expert human ECG interpretation by accurately triaging or prioritizing the most urgent conditions.
The electrocardiogram is a fundamental tool in the everyday practice of clinical medicine, with more than 300 million ECGs obtained annually worldwide3. The ECG is pivotal for diagnosing a wide spectrum of abnormalities from arrhythmias to acute coronary syndrome4. Computer-aided interpretation has become increasingly important in the clinical ECG workflow since its introduction over 50 years ago, serving as a crucial adjunct to physician interpretation in many clinical settings1. However, existing commercial ECG interpretation algorithms still show substantial rates of misdiagnosis1,5–7. The combination of widespread digitization of ECG data and the development of algorithmic paradigms that can benefit from large-scale processing of raw data presents an opportunity to reexamine the standard approach to algorithmic ECG analysis and may provide substantial improvements to automated ECG interpretation.
Substantial algorithmic advances in the past five years have been driven largely by a specific class of models known as deep neural networks2. DNNs are computational models consisting of multiple processing layers, with each layer being able to learn increasingly abstract, higher-level representations of the input data relevant to perform specific tasks. They have dramatically improved the state of the art in speech recognition8, image recognition9, strategy games such as Go10, and in medical applications11,12. The ability of DNNs to recognize patterns and learn useful features from raw input data without requiring extensive data preprocessing, feature engineering or handcrafted rules2 makes them particularly well suited to interpret ECG data. Furthermore, since DNN performance tends to increase as the amount of training data increases2, this approach is well positioned to take advantage of the widespread digitization of ECG data.
A comprehensive evaluation of whether an end-to-end deep learning approach can be used to analyze raw ECG data to classify a broad range of diagnoses remains lacking. Much of the previous work to employ DNNs toward ECG interpretation has focused on single aspects of the ECG processing pipeline, such as noise reduction13,14 or feature extraction15,16, or has approached limited diagnostic tasks, detecting only a handful of heartbeat types (normal, ventricular or supraventricular ectopic, fusion, and so on)17–20 or rhythm diagnoses (most commonly atrial fibrillation or ventricular tachycardia)21–25. Lack of appropriate data has limited many efforts beyond these applications. Most prior efforts used data from the MIT-BIH Arrhythmia database (PhysioNet)26, which is limited by the small number of patients and rhythm episodes present in the dataset.
In this study, we constructed a large, novel ECG dataset that underwent expert annotation for a broad range of ECG rhythm classes. We developed a DNN to detect 12 rhythm classes from raw single-lead ECG inputs using a training dataset consisting of 91,232 ECG records from 53,549 patients. The DNN was designed to classify 10 arrhythmias as well as sinus rhythm and noise for a total of 12 output rhythm classes (Extended Data Fig. 1). ECG data were recorded by the Zio monitor, which is a Food and Drug Administration (FDA)-cleared, single-lead, patch-based ambulatory ECG monitor27 that continuously records data from a single vector (modified Lead II) at 200 Hz. The mean and median wear time of the Zio monitor in our dataset was 10.6 and 13.0 days, respectively. Mean age was 69±16 years and 43% were women. We validated the DNN on a test dataset that consisted of 328 ECG records collected from 328 unique patients, which was annotated by a consensus committee of expert cardiologists (see Methods). Mean age on the test dataset was 70±17 years and 38% were women. The mean inter-annotator agreement on the test dataset was 72.8%.
Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network
Awni Y. Hannun1,6*, Pranav Rajpurkar1,6, Masoumeh Haghpanahi2,6, Geoffrey H. Tison3,6, Codie Bourn2, Mintu P. Turakhia4,5 and Andrew Y. Ng1
Nature Medicine, Vol 25, January 2019, 65–69. https://doi.org/10.1038/s41591-018-0268-3
• 91,232 single-lead ECG records obtained from 53,549 patients

• ZIO patch (FDA-cleared, single-lead, ambulatory ECG monitor)

• A DNN (34-layer network) developed to classify the ECGs into 12 rhythm classes
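For orientation, a toy 1D convolutional classifier over 200 Hz single-lead ECG is sketched below; it is far shallower than the 34-layer residual network reported in the paper and is meant only to illustrate the input and output shapes.

```python
# Rough sketch of a 1D CNN rhythm classifier for single-lead ECG sampled at 200 Hz.
# Much shallower than the 34-layer residual network described above; shapes only.
import torch
import torch.nn as nn

N_CLASSES = 12            # 10 arrhythmias + sinus rhythm + noise

class TinyECGNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=16, stride=2, padding=7), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=7), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=16, stride=2, padding=7), nn.BatchNorm1d(64), nn.ReLU(),
        )
        self.classifier = nn.Conv1d(64, N_CLASSES, kernel_size=1)   # class logits per output step

    def forward(self, x):                 # x: (batch, 1, samples)
        return self.classifier(self.features(x))                    # (batch, 12, samples / 8)

ecg = torch.randn(2, 1, 30 * 200)         # two 30-second records at 200 Hz
logits = TinyECGNet()(ecg)
print(logits.shape)                       # torch.Size([2, 12, 750])
```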
Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks
•Validation

• Compared against the average performance of 6 independent cardiologists

• Compared using the F1 score (the harmonic mean of precision and recall)
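As a reminder of the metric, the F1 score combines precision and recall via their harmonic mean; the small example below uses made-up confusion-matrix counts.

```python
# F1 is the harmonic mean of precision (PPV) and recall (sensitivity). Counts are made up.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

tp, fp, fn = 80, 20, 25        # hypothetical confusion-matrix counts for one rhythm class
precision = tp / (tp + fp)     # 0.800
recall = tp / (tp + fn)        # ~0.762
print(f"{f1(precision, recall):.3f}")   # 0.780 -- the scale on which the DNN (0.837) exceeded cardiologists (0.780)
```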
Supplementary Table 1 shows the number of unique patients exhibiting each rhythm class. We first compared the performance of the DNN against the gold standard cardiologist consensus committee diagnoses by calculating the AUC (Table 1a). Since the DNN algorithm was designed to make a rhythm class prediction approximately once per second (see Methods), we report performance both as assessed once every second—which we call “sequence-level” and consists of one rhythm class per interval—and once per record, which we call “set-level” and consists of the unique set of rhythm classes present in the record. AUC and F1 scores on the 10% development dataset (n=8,761) were materially unchanged from the test dataset results, although they were slightly higher (Supplementary Tables 3 and 4). In addition, we retrained the DNN holding out an additional 10% of the training dataset as a second held-out test dataset (n=8,768); the AUC and F1 scores for all rhythms were materially unchanged (Supplementary Tables 5 and 6). We note that unlike the primary test dataset, which has gold-standard annotations from a committee of cardiologists, both sensitivity analysis datasets are annotated by certified ECG technicians.
Table 1 | Diagnostic performance of the DNN and averaged individual cardiologists compared to the cardiologist committee consensus (n=328)

Rhythm class | Algorithm AUC (95% CI), sequence | Algorithm AUC (95% CI), set | Algorithm F1, sequence | Algorithm F1, set | Average cardiologist F1, sequence | Average cardiologist F1, set
Atrial fibrillation and flutter | 0.973 (0.966–0.980) | 0.965 (0.932–0.998) | 0.801 | 0.831 | 0.677 | 0.686
AVB | 0.988 (0.983–0.993) | 0.981 (0.953–1.000) | 0.828 | 0.808 | 0.772 | 0.761
Bigeminy | 0.997 (0.991–1.000) | 0.996 (0.976–1.000) | 0.847 | 0.870 | 0.842 | 0.853
EAR | 0.913 (0.889–0.937) | 0.940 (0.870–1.000) | 0.541 | 0.596 | 0.482 | 0.536
IVR | 0.995 (0.989–1.000) | 0.987 (0.959–1.000) | 0.761 | 0.818 | 0.632 | 0.720
Junctional rhythm | 0.987 (0.980–0.993) | 0.979 (0.946–1.000) | 0.664 | 0.789 | 0.692 | 0.679
Noise | 0.981 (0.973–0.989) | 0.947 (0.898–0.996) | 0.844 | 0.761 | 0.768 | 0.685
Sinus rhythm | 0.975 (0.971–0.979) | 0.987 (0.976–0.998) | 0.887 | 0.933 | 0.852 | 0.910
SVT | 0.973 (0.960–0.985) | 0.953 (0.903–1.000) | 0.488 | 0.693 | 0.451 | 0.564
Trigeminy | 0.998 (0.995–1.000) | 0.997 (0.979–1.000) | 0.907 | 0.864 | 0.842 | 0.812
Ventricular tachycardia | 0.995 (0.980–1.000) | 0.980 (0.934–1.000) | 0.541 | 0.681 | 0.566 | 0.769
Wenckebach | 0.978 (0.967–0.989) | 0.977 (0.938–1.000) | 0.702 | 0.780 | 0.591 | 0.738
Frequency-weighted average | 0.978 | 0.977 | 0.807 | 0.837 | 0.753 | 0.780

AUC: DNN algorithm area under the ROC compared to the cardiologist committee consensus. F1: DNN algorithm and averaged individual cardiologist F1 scores compared to the cardiologist committee consensus. Sequence-level describes the algorithm predictions that are made once every 256 input samples (approximately every 1.3 s) and are compared against the gold-standard committee consensus at the same intervals. Set-level describes the unique set of algorithm predictions that are present in the 30-s record. Sequence AUC prediction, n=7,544; set AUC prediction, n=328.
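The sequence-level versus set-level distinction in the footnote can be illustrated with a short sketch that collapses per-interval predictions into the unique set of rhythms for a 30-s record; the labels below are hypothetical.

```python
# Sketch of the sequence-level vs set-level evaluation described in the footnote above.
# One prediction per 256 input samples (~1.3 s at 200 Hz) gives ~23 sequence-level labels
# per 30-s record; the set level keeps only the unique labels. Labels are hypothetical.
SAMPLE_RATE_HZ = 200
SAMPLES_PER_PREDICTION = 256

sequence_preds = ["SINUS"] * 15 + ["AFIB"] * 6 + ["SINUS"] * 2   # 23 per-interval predictions
set_pred = sorted(set(sequence_preds))                           # ["AFIB", "SINUS"]

print(len(sequence_preds), "sequence-level predictions ->", set_pred, "at the set level")
```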
• Set-level average F1 score: the AI showed better overall performance

• DNN (0.837) > cardiologist (0.780)

• The DNN and the cardiologists showed similar trends in F1 score across rhythm classes

• Both scored low on VT, EAR, and similar classes
Fig. 1 | ROC and precision-recall curves. a, Examples of ROC curves calculated at the sequence level for atrial fibrillation (AF), trigeminy, and AVB. b, Examples of precision-recall curves calculated at the sequence level for atrial fibrillation, trigeminy, and AVB. Individual cardiologist performance is indicated by the red crosses and averaged cardiologist performance is indicated by the green dot. The line represents the ROC (a) or precision-recall curve (b) achieved by the DNN model. n=7,544 where each of the 328 30-s ECGs received 23 sequence-level predictions.
• DNN model met or exceeded the averaged cardiologist performance for all rhythm classes.
Our study is the first comprehensive demonstration of a deep learning approach to ECG classification across a broad range of important ECG rhythm diagnoses. The DNN achieved a frequency-weighted AUC of 0.97, with higher average F1 scores than cardiologists. These findings suggest that an end-to-end DNN approach has the potential to improve the accuracy of algorithmic ECG interpretation. Such computational advances compel us to reexamine the standard approach to automated ECG interpretation: approaches whose performance improves with more data, such as deep learning2, can leverage widely available digital ECG data and provide clear opportunities toward the ideal of a learning health care system. Strengths of this study include a dataset large enough to allow a deep learning approach to predict multiple rhythm classes and validation against the high standard of a cardiologist consensus committee, which we believe is the appropriate gold standard, since cardiologists perform ECG interpretation in virtually all clinical settings.
The traditional approach to automated ECG interpretation proceeds across a series of steps that include signal preprocessing, feature extraction, feature selection/reduction, and classification, often relying on hand-engineered heuristics and derived rules developed with the ultimate aim of detecting a specific rhythm, such as atrial fibrillation31,32. In contrast, DNNs enable an approach that is fundamentally different, since a single algorithm can accomplish all of these steps ‘end-to-end’ without requiring class-specific feature extraction; in other words, the DNN can accept the raw ECG data as input and output diagnostic probabilities.
Table 2 | DNN algorithm and cardiologist sensitivity compared to the cardiologist committee consensus, with specificity fixed at the average specificity level achieved by cardiologists

Rhythm class | Specificity | Average cardiologist sensitivity | DNN algorithm sensitivity
Atrial fibrillation and flutter | 0.941 | 0.710 | 0.861
AVB | 0.981 | 0.731 | 0.858
Bigeminy | 0.996 | 0.829 | 0.921
EAR | 0.993 | 0.380 | 0.445
IVR | 0.991 | 0.611 | 0.867
Junctional rhythm | 0.984 | 0.634 | 0.729
Noise | 0.983 | 0.749 | 0.803
Sinus rhythm | 0.859 | 0.901 | 0.950
SVT | 0.983 | 0.408 | 0.487
Ventricular tachycardia | 0.996 | 0.652 | 0.702
Wenckebach | 0.986 | 0.541 | 0.651
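Behind Table 2 is an operating-point choice: for each rhythm class, the DNN threshold is set so that its specificity matches the cardiologists' average, and sensitivity is read off at that threshold. The sketch below illustrates this selection with simulated scores, not the study's outputs.

```python
# Sketch of the operating-point selection behind Table 2: fix the DNN's specificity at the
# cardiologists' average specificity for a class, then read off the DNN's sensitivity there.
# Scores and labels are hypothetical.
import numpy as np

def sensitivity_at_specificity(y_true, scores, target_specificity):
    neg_scores = np.sort(scores[y_true == 0])
    # Smallest threshold that keeps at least `target_specificity` of negatives at or below it.
    idx = int(np.ceil(target_specificity * len(neg_scores))) - 1
    threshold = neg_scores[idx]
    sensitivity = float((scores[y_true == 1] > threshold).mean())
    return threshold, sensitivity

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.1, size=5_000)
scores = np.where(y == 1, rng.beta(5, 2, y.size), rng.beta(2, 5, y.size))   # fake DNN outputs

thr, sens = sensitivity_at_specificity(y, scores, target_specificity=0.941)  # e.g. the AF row
print(f"threshold={thr:.3f}  DNN sensitivity at matched specificity: {sens:.3f}")
```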
• Comparison of sensitivity between cardiologists and the DNN

• For the DNN, sensitivity is reported with specificity fixed at the cardiologists' average level

• The DNN showed higher sensitivity for all 12 rhythm classes
• Analysis of complex medical data and derivation of insights

• Analysis and interpretation of medical imaging and pathology data

• Monitoring of continuous data for prevention and prediction
Three Types of Medical AI
Three Steps to Implement Digital Medicine
• Step 1. Measure the Data
• Step 2. Collect the Data
• Step 3. Insight from the Data
Feedback/Questions
• E-mail: yoonsup.choi@gmail.com 

• Blog: http://www.yoonsupchoi.com

• Facebook: 최윤섭 디지털 헬스케어 연구소

  • 12.
    startuphealth.com/reports Firm 2017 YTDDeals Stage Early Mid Late 1 7 1 7 2 6 2 6 3 5 3 5 3 5 3 5 THE TOP INVESTORS OF 2017 YTD We are seeing huge strides in new investors pouring money into the digital health market, however all the top 10 investors of 2017 year to date are either maintaining or increasing their investment activity. Source: StartUp Health Insights | startuphealth.com/insights Note: Report based on public data on seed, venture, corporate venture and private equity funding only. © 2017 StartUp Health LLC DEALS & FUNDING GEOGRAPHY INVESTORSMOONSHOTS 20 •개별 투자자별로 보자면, 이 분야 전통의 강자(?)인 Google Ventures와 Khosla Ventures가 각각 7개로 공동 1위, •GE Ventures와 Accel Partners가 6건으로 공동 2위를 기록
 •GV 가 투자한 기업 •virtual fitness membership network를 만드는 뉴욕의 ClassPass •Remote clinical trial 회사인 Science 37 •Digital specialty prescribing platform ZappRx 등에 투자.
 •Khosla Ventures 가 투자한 기업 •single-molecule 검사 장비를 만드는 TwoPoreGuys •Mabu라는 AI-powered patient engagement robot 을 만드 는 Catalia Health에 투자.
  • 14.
    헬스케어넓은 의미의 건강관리에는 해당되지만, 디지털 기술이 적용되지 않고, 전문 의료 영역도 아닌 것 예) 운동, 영양, 수면 디지털 헬스케어 건강 관리 중에 디지털 기술이 사용되는 것 예) 사물인터넷, 인공지능, 3D 프린터, VR/AR 모바일 헬스케어 디지털 헬스케어 중 모바일 기술이 사용되는 것 예) 스마트폰, 사물인터넷, SNS 개인 유전정보분석 예) 암유전체, 질병위험도, 보인자, 약물 민감도 예) 웰니스, 조상 분석 헬스케어 관련 분야 구성도(ver 0.3) 의료 질병 예방, 치료, 처방, 관리 등 전문 의료 영역 원격의료 원격진료
  • 15.
    EDITORIAL OPEN Digital medicine,on its way to being just plain medicine npj Digital Medicine (2018)1:20175 ; doi:10.1038/ s41746-017-0005-1 There are already nearly 30,000 peer-reviewed English-language scientific journals, producing an estimated 2.5 million articles a year.1 So why another, and why one focused specifically on digital medicine? To answer that question, we need to begin by defining what “digital medicine” means: using digital tools to upgrade the practice of medicine to one that is high-definition and far more individualized. It encompasses our ability to digitize human beings using biosensors that track our complex physiologic systems, but also the means to process the vast data generated via algorithms, cloud computing, and artificial intelligence. It has the potential to democratize medicine, with smartphones as the hub, enabling each individual to generate their own real world data and being far more engaged with their health. Add to this new imaging tools, mobile device laboratory capabilities, end-to-end digital clinical trials, telemedicine, and one can see there is a remarkable array of transformative technology which lays the groundwork for a new form of healthcare. As is obvious by its definition, the far-reaching scope of digital medicine straddles many and widely varied expertise. Computer scientists, healthcare providers, engineers, behavioral scientists, ethicists, clinical researchers, and epidemiologists are just some of the backgrounds necessary to move the field forward. But to truly accelerate the development of digital medicine solutions in health requires the collaborative and thoughtful interaction between individuals from several, if not most of these specialties. That is the primary goal of npj Digital Medicine: to serve as a cross-cutting resource for everyone interested in this area, fostering collabora- tions and accelerating its advancement. Current systems of healthcare face multiple insurmountable challenges. Patients are not receiving the kind of care they want and need, caregivers are dissatisfied with their role, and in most countries, especially the United States, the cost of care is unsustainable. We are confident that the development of new systems of care that take full advantage of the many capabilities that digital innovations bring can address all of these major issues. Researchers too, can take advantage of these leading-edge technologies as they enable clinical research to break free of the confines of the academic medical center and be brought into the real world of participants’ lives. The continuous capture of multiple interconnected streams of data will allow for a much deeper refinement of our understanding and definition of most pheno- types, with the discovery of novel signals in these enormous data sets made possible only through the use of machine learning. Our enthusiasm for the future of digital medicine is tempered by the recognition that presently too much of the publicized work in this field is characterized by irrational exuberance and excessive hype. Many technologies have yet to be formally studied in a clinical setting, and for those that have, too many began and ended with an under-powered pilot program. In addition, there are more than a few examples of digital “snake oil” with substantial uptake prior to their eventual discrediting.2 Both of these practices are barriers to advancing the field of digital medicine. 
Our vision for npj Digital Medicine is to provide a reliable, evidence-based forum for all clinicians, researchers, and even patients, curious about how digital technologies can transform every aspect of health management and care. Being open source, as all medical research should be, allows for the broadest possible dissemination, which we will strongly encourage, including through advocating for the publication of preprints And finally, quite paradoxically, we hope that npj Digital Medicine is so successful that in the coming years there will no longer be a need for this journal, or any journal specifically focused on digital medicine. Because if we are able to meet our primary goal of accelerating the advancement of digital medicine, then soon, we will just be calling it medicine. And there are already several excellent journals for that. ACKNOWLEDGEMENTS Supported by the National Institutes of Health (NIH)/National Center for Advancing Translational Sciences grant UL1TR001114 and a grant from the Qualcomm Foundation. ADDITIONAL INFORMATION Competing interests:The authors declare no competing financial interests. Publisher's note:Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Change history:The original version of this Article had an incorrect Article number of 5 and an incorrect Publication year of 2017. These errors have now been corrected in the PDF and HTML versions of the Article. Steven R. Steinhubl1 and Eric J. Topol1 1 Scripps Translational Science Institute, 3344 North Torrey Pines Court, Suite 300, La Jolla, CA 92037, USA Correspondence: Steven R. Steinhubl (steinhub@scripps.edu) or Eric J. Topol (etopol@scripps.edu) REFERENCES 1. Ware, M. & Mabe, M. The STM report: an overview of scientific and scholarly journal publishing 2015 [updated March]. http://digitalcommons.unl.edu/scholcom/92017 (2015). 2. Plante, T. B., Urrea, B. & MacFarlane, Z. T. et al. Validation of the instant blood pressure smartphone App. JAMA Intern. Med. 176, 700–702 (2016). Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/. © The Author(s) 2018 Received: 19 October 2017 Accepted: 25 October 2017 www.nature.com/npjdigitalmed Published in partnership with the Scripps Translational Science Institute 디지털 의료의 미래는? 일상적인 의료가 되는 것
  • 16.
    What is mostimportant factor in digital medicine?
  • 17.
    “Data! Data! Data!”he cried.“I can’t make bricks without clay!” - Sherlock Holmes,“The Adventure of the Copper Beeches”
  • 19.
    새로운 데이터가 새로운 방식으로 새로운주체에 의해 측정, 저장, 통합, 분석된다. 데이터의 종류 데이터의 질적/양적 측면 웨어러블 기기 스마트폰 유전 정보 분석 인공지능 SNS 사용자/환자 대중
  • 20.
    Three Steps toImplement Digital Medicine • Step 1. Measure the Data • Step 2. Collect the Data • Step 3. Insight from the Data
  • 21.
    Digital Healthcare IndustryLandscape Data Measurement Data Integration Data Interpretation Treatment Smartphone Gadget/Apps DNA Artificial Intelligence 2nd Opinion Wearables / IoT (ver. 3) EMR/EHR 3D Printer Counseling Data Platform Accelerator/early-VC Telemedicine Device On Demand (O2O) VR Digital Healthcare Institute Diretor, Yoon Sup Choi, Ph.D. yoonsup.choi@gmail.com
  • 22.
    Data Measurement DataIntegration Data Interpretation Treatment Smartphone Gadget/Apps DNA Artificial Intelligence 2nd Opinion Device On Demand (O2O) Wearables / IoT Digital Healthcare Institute Diretor, Yoon Sup Choi, Ph.D. yoonsup.choi@gmail.com EMR/EHR 3D Printer Counseling Data Platform Accelerator/early-VC VR Telemedicine Digital Healthcare Industry Landscape (ver. 3)
  • 23.
  • 24.
    Smartphone: the originof healthcare innovation
  • 25.
    Smartphone: the originof healthcare innovation
  • 26.
    2013? The election ofPope Benedict The Election of Pope Francis
  • 27.
    The Election ofPope Francis The Election of Pope Benedict
  • 28.
  • 31.
  • 32.
    검이경 더마토스코프 안과질환피부암 기생충 호흡기 심전도 수면 식단 활동량 발열 생리/임신
  • 33.
  • 34.
  • 35.
  • 36.
    “왼쪽 귀에 대한비디오를 보면 고막 뒤에 액체가 보인다. 고막은 특별히 부어 있거 나 모양이 이상하지는 않다. 그러므로 심한 염증이 있어보이지는 않는다. 네가 스쿠버 다이빙 하면서 압력평형에 어 려움을 느꼈다는 것을 감안한다면, 고막의 움직임을 테스트 할 수 있는 의사에게 직접 진찰 받는 것도 좋겠다. ...” 한국에서는 불법
  • 37.
  • 39.
  • 40.
  • 43.
    “심장박동은 안정적이기 때문에,
 당장 병원에 갈 필요는 없겠습니다. 
 그래도 이상이 있으면 전문의에게 
 진료를 받아보세요. “ 한국에서는 불법
  • 46.
  • 48.
    30분-1시간 정도 일상적인코골이가 있음 이걸 어떻게 믿나?
  • 49.
    녹음을 해줌. PGS와의analytical validity의 증명?
  • 51.
    • 아이폰의 센서로측정한 자신의 의료/건강 데이터를 플랫폼에 공유 가능 • 가속도계, 마이크, 자이로스코프, GPS 센서 등을 이용 • 걸음, 운동량, 기억력, 목소리 떨림 등등 • 기존의 의학연구의 문제를 해결: 충분한 의료 데이터의 확보 • 연구 참여자 등록에 물리적, 시간적 장벽을 제거 (1번/3개월 ➞ 1번/1초) • 대중의 의료 연구 참여 장려: 연구 참여자의 수 증가 • 발표 후 24시간 내에 수만명의 연구 참여자들이 지원 • 사용자 본인의 동의 하에 진행 ResearchKit
  • 52.
    •초기 버전으로, 5가지질환에 대한 앱 5개를 소개 ResearchKit
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
    Autism and BeyondEpiWatchMole Mapper measuring facial expressions of young patients having autism measuring morphological changes of moles measuring behavioral data of epilepsy patients
  • 58.
    •스탠퍼드의 심혈관 질환연구 앱, myHeart • 발표 하루만에 11,000 명의 참가자가 등록 • 스탠퍼드의 해당 연구 책임자 앨런 영,
 “기존의 방식으로는 11,000명 참가자는 
 미국 전역의 50개 병원에서 1년간 모집해야 한다”
  • 59.
    •파킨슨 병 연구앱, mPower • 발표 하루만에 5,589 명의 참가자가 등록 • 기존에 6000만불을 들여 5년 동안 모집한
 환자의 수는 단 800명
  • 60.
    The mPower study,Parkinson disease mobile data collected using ResearchKit Brian M. Bot1 , Christine Suver1 , Elias Chaibub Neto1 , Michael Kellen1 , Arno Klein1 , Christopher Bare1 , Megan Doerr1 , Abhishek Pratap1 , John Wilbanks1 , E. Ray Dorsey2 , Stephen H. Friend1 & Andrew D. Trister1 Current measures of health and disease are often insensitive, episodic, and subjective. Further, these measures generally are not designed to provide meaningful feedback to individuals. The impact of high- resolution activity data collected from mobile phones is only beginning to be explored. Here we present data from mPower, a clinical observational study about Parkinson disease conducted purely through an iPhone app interface. The study interrogated aspects of this movement disorder through surveys and frequent sensor-based recordings from participants with and without Parkinson disease. Benefitting from large enrollment and repeated measurements on many individuals, these data may help establish baseline variability of real-world activity measurement collected via mobile phones, and ultimately may lead to quantification of the ebbs-and-flows of Parkinson symptoms. App source code for these data collection modules are available through an open source license for use in studies of other conditions. We hope that releasing data contributed by engaged research participants will seed a new community of analysts working collaboratively on understanding mobile health data to advance human health. Design Type(s) observation design • time series design • repeated measure design Measurement Type(s) disease severity measurement Technology Type(s) Patient Self-Report Factor Type(s) Sample Characteristic(s) Homo sapiens OPEN SUBJECT CATEGORIES » Research data » Neurology » Parkinson’s disease » Medical research Received: 07 December 2015 Accepted: 02 February 2016 Published: 3 March 2016 www.nature.com/scientificdata
  • 61.
  • 64.
  • 65.
    Fig 1. Whatcan consumer wearables do? Heart rate can be measured with an oximeter built into a ring [3], muscle activity with an electromyographi sensor embedded into clothing [4], stress with an electodermal sensor incorporated into a wristband [5], and physical activity or sleep patterns via an accelerometer in a watch [6,7]. In addition, a female’s most fertile period can be identified with detailed body temperature tracking [8], while levels of me attention can be monitored with a small number of non-gelled electroencephalogram (EEG) electrodes [9]. Levels of social interaction (also known to a PLOS Medicine 2016
  • 66.
    PwC Health ResearchInstitute Health wearables: Early days2 insurers—offering incentives for use may gain traction. HRI’s survey Source: HRI/CIS Wearables consumer survey 2014 21% of US consumers currently own a wearable technology product 2% wear it a few times a month 2% no longer use it 7% wear it a few times a week 10% wear it everyday Figure 2: Wearables are not mainstream – yet Just one in five US consumers say they own a wearable device. Intelligence Series sought to better understand American consumers’ attitudes toward wearables through done with the data. PwC, Health wearables: early days, 2014
  • 67.
    PwC | TheWearable Life | 3 device (up from 21% in 2014). And 36% own more than one. We didn’t even ask this question in our previous survey since it wasn’t relevant at the time. That’s how far we’ve come. millennials are far more likely to own wearables than older adults. Adoption of wearables declines with age. Of note in our survey findings, however: Consumers aged 35 to 49 are more likely to own smart watches. Across the board for gender, age, and ethnicity, fitness wearable technology is most popular. Fitness band Smart clothing Smart video/ photo device (e.g. GoPro) Smart watch Smart glasses* 45% 14% 27% 15% 12% Base: Respondents who currently own at least one device (pre-quota sample, n=700); Q10A/B/C/D/E. Please tell us your relationship with the following wearable technology products. *Includes VR/AR glasses Fitness runs away with it % respondents who own type of wearable device PwC,The Wearable Life 2.0, 2016 • 49% own at least one wearable device (up from 21% in2014) • 36% own more than one device.
  • 68.
  • 69.
  • 71.
  • 72.
    https://clinicaltrials.gov/ct2/results?term=fitbit&Search=Search •의료기기가 아님에도 Fitbit은 이미 임상 연구에 폭넓게 사용되고 있음 •Fitbit 이 장려하지 않았음에도, 임상 연구자들이 자발적으로 사용 •Fitbit 을 이용한 임상 연구 수는 계속 증가하는 추세 (16.3(80), 16.8(113), 17.7(173))
  • 74.
    •Fitbit이 임상연구에 활용되는것은 크게 두 가지 경우 •Fitbit 자체가 intervention이 되어서 활동량이나 치료 효과를 증진시킬 수 있는지 여부 •연구 참여자의 활동량을 모니터링 하기 위한 수단
 •1. Fitbit으로 환자의 활동량을 증가시키기 위한 연구들 •Fitbit이 소아 비만 환자의 활동량을 증가시키는지 여부를 연구 •Fitbit이 위소매절제술을 받은 환자들의 활동량을 증가시키는지 여부 •Fitbit이 젊은 낭성 섬유증 (cystic fibrosis) 환자의 활동량을 증가시키는지 여부 •Fitbit이 암 환자의 신체 활동량을 증가시키기 위한 동기부여가 되는지 여부 •2. Fitbit으로 임상 연구에 참여하는 환자의 활동량을 모니터링 •항암 치료를 받은 환자들의 건강과 예후를 평가하는데 fitbit을 사용 •현금이 자녀/부모의 활동량을 증가시키는지 파악하기 위해 fitbit을 사용 •Brain tumor 환자의 삶의 질 측정을 위해 다른 survey 결과와 함께 fitbit을 사용 •말초동맥 질환(Peripheral Artery Disease) 환자의 활동량을 평가하기 위해
  • 75.
    •체중 감량이 유방암재발에 미치는 영향을 연구 •유방암 환자들 중 20%는 재발, 대부분이 전이성 유방암 •과체중은 유방암의 위험을 높인다고 알려져 왔으며, •비만은 초기 유방암 환자의 예후를 좋지 않게 만드는 것도 알려짐 •하지만, 체중 감량과 유방암 재발 위험도의 상관관계 연구는 아직 없음 •3,200 명의 과체중, 초기 비만 유방암 환자들이 2년간 참여 •결과에 따라 전세계 유방암 환자의 표준 치료에 체중 감량이 포함될 가능성 •Fitbit 이 체중 감량 프로그램에 대한 지원 •Fitbit Charge HR: 운동량, 칼로리 소모, 심박수 측정 •Fitbit Aria Wi-Fi Smart Scale: 스마트 체중계 •FitStar: 개인 맞춤형 동영상 운동 코칭 서비스 2016. 4. 27.
  • 77.
  • 78.
    •Biogen Idec, 다발성경화증 환자의 모니터링에 Fitbit을 사용 •고가의 약 효과성을 검증하여 보험 약가 유지 목적 •정교한 측정으로 MS 전조 증상의 조기 발견 가능? Dec 23, 2014
  • 79.
  • 84.
    (“FREE VERTICAL MOMENTSAND TRANSVERSE FORCES IN HUMAN WALKING AND THEIR ROLE IN RELATION TO ARM-SWING”, YU LI*, WEIJIE WANG, ROBIN H. CROMPTON AND MICHAEL M. GUNTHER) (“SYNTHESIS OF NATURAL ARM SWING MOTION IN HUMAN BIPEDAL WALKING”, JAEHEUNG PARK)︎ Right Arm Left Foot Left Arm Right Foot “보행 시 팔의 움직임은 몸의 역학적 균형을 맞추기 위한 자동적인 행동 으로, 반대쪽 발의 움직임을 관찰할 수 있는 지표” 보행 종류에 따른 신체 운동 궤도의 변화 발의 모양 팔의 스윙 궤도 일반 보행 팔자 걸음 구부린 걸음 직토 워크에서 수집하는 데이터 종류 설명 비고 충격량 발에 전해지는 충격량 분석 Impact Score 보행 주기 보행의 주기 분석 Interval Score 보폭 단위 보행 시의 거리 Stride(향후 보행 분석 고도화용) 팔의 3차원 궤도 걸음에 따른 팔의 움직임 궤도 팔의 Accel,Gyro Data 취합 보행 자세 상기 자료를 분석한 보행 자세 분류 총 8가지 종류로 구분 비대칭 지수 신체 부위별(어깨, 허리, 골반) 비대칭 점수 제공 1주일 1회 반대쪽 손 착용을 통한 데이터 취득 필요 걸음걸이 템플릿 보행시 발생하는 특이점들을 추출하여 개인별 템플릿 저장 생체 인증 기능용 with the courtesy of ZIKTO, Inc
  • 85.
    Empatica Embrace: SmartBand for epilepsy
  • 86.
    Empatica Embrace: SmartBand for epilepsy
  • 87.
    https://www.empatica.com/science Monitoring the AutonomicNervous System “Sympathetic activation increases when you experience excitement or stress whether physical, emotional, or cognitive.The skin is the only organ that is purely innervated by the sympathetic nervous system.” https://www.empatica.com/science
  • 88.
  • 89.
  • 90.
    Convulsive seizure detectionusing a wrist-worn electrodermal activity and accelerometry biosensor *yMing-Zher Poh, zTobias Loddenkemper, xClaus Reinsberger, yNicholas C. Swenson, yShubhi Goyal, yMangwe C. Sabtala, {Joseph R. Madsen, and yRosalind W. Picard *Harvard-MIT Division of Health Sciences and Technology, Cambridge, Massachusetts, U.S.A.; yMIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, U.S.A.; zDivision of Epilepsy and Clinical Neurophysiology, Department of Neurology, Children’s Hospital Boston, Harvard Medical School, Boston, Massachusetts, U.S.A.; xDepartment of Neurology, Division of Epilepsy, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, U.S.A.; and {Department of Neurosurgery, Children’s Hospital Boston, Harvard Medical School, Boston, Massachusetts, U.S.A. SUMMARY The special requirements for a seizure detector suitable for everyday use in terms of cost, comfort, and social acceptance call for alternatives to electroencephalogra- phy (EEG)–based methods. Therefore, we developed an algorithm for automatic detection of generalized tonic– clonic (GTC) seizures based on sympathetically mediated electrodermal activity (EDA) and accelerometry mea- sured using a novel wrist-worn biosensor. The problem of GTC seizure detection was posed as a supervised learning task in which the goal was to classify 10-s epochs as a seizure or nonseizure event based on 19 extracted fea- tures from EDA and accelerometry recordings using a Support Vector Machine. Performance was evaluated using a double cross-validation method. The new seizure detection algorithm was tested on >4,213 h of recordings from 80 patients and detected 15 (94%) of 16 of the GTC seizures from seven patients with 130 false alarms (0.74 per 24 h). This algorithm can potentially provide a convul- sive seizure alarm system for caregivers and objective quantification of seizure frequency. KEY WORDS: Seizure alarm, Electrodermal activity, Accelerometry, Wearable sensor, Epilepsy. Although combined electroencephalography (EEG) and video-monitoring remain the gold standard for seizure detection in clinical routine, most patients are opposed to wearing scalp EEG electrodes to obtain seizure warnings for everyday use (Schulze-Bonhage et al., 2010). Accele- rometry recordings offer a less-obtrusive method for detect- ing seizures with motor accompaniments (Nijsen et al., 2005). Previously, we showed that electrodermal activity (EDA), which reflects the modulation of sweat gland activ- ity by the sympathetic nervous system, increases during convulsive seizures (Poh et al., 2010a). Herein we describe a novel methodology for generalized tonic–clonic (GTC) seizure detection using information from both EDA and accelerometry signals recorded with a wrist-worn sensor. Methods This study was approved by the institutional review boards of Massachusetts Institute of Technology and Chil- dren’s Hospital Boston. We recruited patients with epilepsy who were admitted to the long-term video-EEG monitoring (LTM) unit. All participants (or their caregivers) provided written informed consent. Custom-built EDA and accele- rometry biosensors were placed on the wrists (Fig. S1) such that the electrodes were in contact with the ventral side of the forearms (Poh et al., 2010b). The various stages of the GTC seizure detector are depicted in Fig. 1A. A sliding window was used to extract 10-s epochs from both accelerometry and EDA recordings for each 2.5-s increment (75% overlap). 
The data were then preprocessed to remove nonmotor and nonrhythmic epochs. A total of 19 features including time, frequency, and nonlin- ear features were extracted from remaining epochs of the accelerometry and EDA signals to form feature vectors. Finally, each feature vector was assigned to a seizure or nonseizure class using a Support Vector Machine (SVM). We implemented a non–patient-specific seizure detection algorithm that excluded all data from a test patient in the training phase (double leave-one-patient-out cross-valida- tion). To allow the SVM to learn from previous examples of seizures from the test patient if that patient had more than a single GTC seizure recording available, we also imple- mented double leave-one-seizure-out cross-validation. Because the detector was not trained solely on data from a Accepted February 3, 2012; Early View publication March 20, 2012. Address correspondence to Ming-Zher Poh, Ph.D., MIT Media Lab, Massachusetts Institute of Technology, Room E14-374B, 75 Amherst St., Cambridge, MA 02139, U.S.A. E-mail: zher@mit.edu Wiley Periodicals, Inc. ª 2012 International League Against Epilepsy Epilepsia, 53(5):e93–e97, 2012 doi: 10.1111/j.1528-1167.2012.03444.x BRIEF COMMUNICATION e93 •가속도계와 EDA 센서가 내장된 스마트 밴드 •뇌전증 환자 80명을 총 4,213 시간 모니터링 •대발작을 94% detection 성공 (15 out of 16) •19개의 feature를 10초마다 측정: 기계학습 (SVM)으로 분석
  • 91.
    •135명의 환자 대상,multi-center trial •272일, 6530시간 모니터링 •총 40번의 대발작을 100% detection 성공
 
 •2018년 1월 성인 epilepsy 환자 대상의 FDA 인허가 (prescription-only) •2019년 1월 6~21세 소아청소년 환자 대상의 FDA 인허가 (prescription-only)
  • 92.
    Cardiogram •실리콘밸리의 Cardiogram 은애플워치로 측정한 심박수 데이터를 바탕으로 서비스 •2016년 10월 Andressen Horowitz 에서 $2m의 투자 유치
  • 93.
    https://blog.cardiogr.am/what-do-normal-and-abnormal-heart-rhythms-look-like-on-apple-watch-7b33b4a8ecfa •Cardiogram은 심박수에 운동,수면, 감정, 의료적인 상태가 반영된다고 주장 •특히, 심박 데이터를 기반으로 심방세동(atrial fibrillation)과 심방 조동(atrial flutter)의 detection 시도 Cardiogram
  • 94.
    •Cardiogram은 심박 데이터만으로심방세동을 detection할 수 있다고 주장 •“Irregularly irregular” •high absolute variability (a range of 30+ bpm) •a higher fraction missing measurements •a lack of periodicity in heart rate variability •심방세동 특유의 불규칙적인 리듬을 detection 하는 정도로 생각하면 될 듯 •“불규칙적인 리듬을 가지는 (심방세동이 아닌) 다른 부정맥과 구분 가능한가?” (쉽지 않을듯) •따라서, 심박으로 detection한 환자를 심전도(ECG)로 confirm 하는 것이 필요 Cardiogram for A.Fib
  • 95.
    Passive Detection ofAtrial Fibrillation Using a Commercially Available Smartwatch Geoffrey H. Tison, MD, MPH; José M. Sanchez, MD; Brandon Ballinger, BS; Avesh Singh, MS; Jeffrey E. Olgin, MD; Mark J. Pletcher, MD, MPH; Eric Vittinghoff, PhD; Emily S. Lee, BA; Shannon M. Fan, BA; Rachel A. Gladstone, BA; Carlos Mikell, BS; Nimit Sohoni, BS; Johnson Hsieh, MS; Gregory M. Marcus, MD, MAS IMPORTANCE Atrial fibrillation (AF) affects 34 million people worldwide and is a leading cause of stroke. A readily accessible means to continuously monitor for AF could prevent large numbers of strokes and death. OBJECTIVE To develop and validate a deep neural network to detect AF using smartwatch data. DESIGN, SETTING, AND PARTICIPANTS In this multinational cardiovascular remote cohort study coordinated at the University of California, San Francisco, smartwatches were used to obtain heart rate and step count data for algorithm development. A total of 9750 participants enrolled in the Health eHeart Study and 51 patients undergoing cardioversion at the University of California, San Francisco, were enrolled between February 2016 and March 2017. A deep neural network was trained using a method called heuristic pretraining in which the network approximated representations of the R-R interval (ie, time between heartbeats) without manual labeling of training data. Validation was performed against the reference standard 12-lead electrocardiography (ECG) in a separate cohort of patients undergoing cardioversion. A second exploratory validation was performed using smartwatch data from ambulatory individuals against the reference standard of self-reported history of persistent AF. Data were analyzed from March 2017 to September 2017. MAIN OUTCOMES AND MEASURES The sensitivity, specificity, and receiver operating characteristic C statistic for the algorithm to detect AF were generated based on the reference standard of 12-lead ECG–diagnosed AF. RESULTS Of the 9750 participants enrolled in the remote cohort, including 347 participants with AF, 6143 (63.0%) were male, and the mean (SD) age was 42 (12) years. There were more than 139 million heart rate measurements on which the deep neural network was trained. The deep neural network exhibited a C statistic of 0.97 (95% CI, 0.94-1.00; P < .001) to detect AF against the reference standard 12-lead ECG–diagnosed AF in the external validation cohort of 51 patients undergoing cardioversion; sensitivity was 98.0% and specificity was 90.2%. In an exploratory analysis relying on self-report of persistent AF in ambulatory participants, the C statistic was 0.72 (95% CI, 0.64-0.78); sensitivity was 67.7% and specificity was 67.6%. CONCLUSIONS AND RELEVANCE This proof-of-concept study found that smartwatch photoplethysmography coupled with a deep neural network can passively detect AF but with some loss of sensitivity and specificity against a criterion-standard ECG. Further studies will help identify the optimal role for smartwatch-guided rhythm assessment. JAMA Cardiol. doi:10.1001/jamacardio.2018.0136 Published online March 21, 2018. Editorial Supplemental content and Audio Author Affiliations: Division of Cardiology, Department of Medicine, University of California, San Francisco (Tison, Sanchez, Olgin, Lee, Fan, Gladstone, Mikell, Marcus); Cardiogram Incorporated, San Francisco, California (Ballinger, Singh, Sohoni, Hsieh); Department of Epidemiology and Biostatistics, University of California, San Francisco (Pletcher, Vittinghoff). Corresponding Author: Gregory M. 
Marcus, MD, MAS, Division of Cardiology, Department of Medicine, University of California, San Francisco, 505 Parnassus Ave, M1180B, San Francisco, CA 94143- 0124 (marcusg@medicine.ucsf.edu). Research JAMA Cardiology | Original Investigation (Reprinted) E1 © 2018 American Medical Association. All rights reserved.
  • 96.
    Passive Detection ofAtrial Fibrillation Using a Commercially Available Smartwatch Geoffrey H. Tison, MD, MPH; José M. Sanchez, MD; Brandon Ballinger, BS; Avesh Singh, MS; Jeffrey E. Olgin, MD; Mark J. Pletcher, MD, MPH; Eric Vittinghoff, PhD; Emily S. Lee, BA; Shannon M. Fan, BA; Rachel A. Gladstone, BA; Carlos Mikell, BS; Nimit Sohoni, BS; Johnson Hsieh, MS; Gregory M. Marcus, MD, MAS IMPORTANCE Atrial fibrillation (AF) affects 34 million people worldwide and is a leading cause of stroke. A readily accessible means to continuously monitor for AF could prevent large numbers of strokes and death. OBJECTIVE To develop and validate a deep neural network to detect AF using smartwatch data. DESIGN, SETTING, AND PARTICIPANTS In this multinational cardiovascular remote cohort study coordinated at the University of California, San Francisco, smartwatches were used to obtain heart rate and step count data for algorithm development. A total of 9750 participants enrolled in the Health eHeart Study and 51 patients undergoing cardioversion at the University of California, San Francisco, were enrolled between February 2016 and March 2017. A deep neural network was trained using a method called heuristic pretraining in which the network approximated representations of the R-R interval (ie, time between heartbeats) without manual labeling of training data. Validation was performed against the reference standard 12-lead electrocardiography (ECG) in a separate cohort of patients undergoing cardioversion. A second exploratory validation was performed using smartwatch data from ambulatory individuals against the reference standard of self-reported history of persistent AF. Data were analyzed from March 2017 to September 2017. MAIN OUTCOMES AND MEASURES The sensitivity, specificity, and receiver operating characteristic C statistic for the algorithm to detect AF were generated based on the reference standard of 12-lead ECG–diagnosed AF. RESULTS Of the 9750 participants enrolled in the remote cohort, including 347 participants with AF, 6143 (63.0%) were male, and the mean (SD) age was 42 (12) years. There were more than 139 million heart rate measurements on which the deep neural network was trained. The deep neural network exhibited a C statistic of 0.97 (95% CI, 0.94-1.00; P < .001) to detect AF against the reference standard 12-lead ECG–diagnosed AF in the external validation cohort of 51 patients undergoing cardioversion; sensitivity was 98.0% and specificity was 90.2%. In an exploratory analysis relying on self-report of persistent AF in ambulatory participants, the C statistic was 0.72 (95% CI, 0.64-0.78); sensitivity was 67.7% and specificity was 67.6%. CONCLUSIONS AND RELEVANCE This proof-of-concept study found that smartwatch photoplethysmography coupled with a deep neural network can passively detect AF but with some loss of sensitivity and specificity against a criterion-standard ECG. Further studies will help identify the optimal role for smartwatch-guided rhythm assessment. JAMA Cardiol. doi:10.1001/jamacardio.2018.0136 Published online March 21, 2018. Editorial Supplemental content and Audio Author Affiliations: Division of Cardiology, Department of Medicine, University of California, San Francisco (Tison, Sanchez, Olgin, Lee, Fan, Gladstone, Mikell, Marcus); Cardiogram Incorporated, San Francisco, California (Ballinger, Singh, Sohoni, Hsieh); Department of Epidemiology and Biostatistics, University of California, San Francisco (Pletcher, Vittinghoff). Corresponding Author: Gregory M. 
Marcus, MD, MAS, Division of Cardiology, Department of Medicine, University of California, San Francisco, 505 Parnassus Ave, M1180B, San Francisco, CA 94143- 0124 (marcusg@medicine.ucsf.edu). Research JAMA Cardiology | Original Investigation (Reprinted) E1 © 2018 American Medical Association. All rights reserved. • eHeart Study in UCSF • A total of 9,750 participants • 51 patients undergoing cardio version • Validated against standard 12-lead ECG
  • 97.
    Passive Detection ofAtrial Fibrillation Using a Commercially Available Smartwatch Geoffrey H. Tison, MD, MPH; José M. Sanchez, MD; Brandon Ballinger, BS; Avesh Singh, MS; Jeffrey E. Olgin, MD; Mark J. Pletcher, MD, MPH; Eric Vittinghoff, PhD; Emily S. Lee, BA; Shannon M. Fan, BA; Rachel A. Gladstone, BA; Carlos Mikell, BS; Nimit Sohoni, BS; Johnson Hsieh, MS; Gregory M. Marcus, MD, MAS IMPORTANCE Atrial fibrillation (AF) affects 34 million people worldwide and is a leading cause of stroke. A readily accessible means to continuously monitor for AF could prevent large numbers of strokes and death. OBJECTIVE To develop and validate a deep neural network to detect AF using smartwatch data. DESIGN, SETTING, AND PARTICIPANTS In this multinational cardiovascular remote cohort study coordinated at the University of California, San Francisco, smartwatches were used to obtain heart rate and step count data for algorithm development. A total of 9750 participants enrolled in the Health eHeart Study and 51 patients undergoing cardioversion at the University of California, San Francisco, were enrolled between February 2016 and March 2017. A deep neural network was trained using a method called heuristic pretraining in which the network approximated representations of the R-R interval (ie, time between heartbeats) without manual labeling of training data. Validation was performed against the reference standard 12-lead electrocardiography (ECG) in a separate cohort of patients undergoing cardioversion. A second exploratory validation was performed using smartwatch data from ambulatory individuals against the reference standard of self-reported history of persistent AF. Data were analyzed from March 2017 to September 2017. MAIN OUTCOMES AND MEASURES The sensitivity, specificity, and receiver operating characteristic C statistic for the algorithm to detect AF were generated based on the reference standard of 12-lead ECG–diagnosed AF. RESULTS Of the 9750 participants enrolled in the remote cohort, including 347 participants with AF, 6143 (63.0%) were male, and the mean (SD) age was 42 (12) years. There were more than 139 million heart rate measurements on which the deep neural network was trained. The deep neural network exhibited a C statistic of 0.97 (95% CI, 0.94-1.00; P < .001) to detect AF against the reference standard 12-lead ECG–diagnosed AF in the external validation cohort of 51 patients undergoing cardioversion; sensitivity was 98.0% and specificity was 90.2%. In an exploratory analysis relying on self-report of persistent AF in ambulatory participants, the C statistic was 0.72 (95% CI, 0.64-0.78); sensitivity was 67.7% and specificity was 67.6%. CONCLUSIONS AND RELEVANCE This proof-of-concept study found that smartwatch photoplethysmography coupled with a deep neural network can passively detect AF but with some loss of sensitivity and specificity against a criterion-standard ECG. Further studies will help identify the optimal role for smartwatch-guided rhythm assessment. JAMA Cardiol. doi:10.1001/jamacardio.2018.0136 Published online March 21, 2018. Editorial Supplemental content and Audio Author Affiliations: Division of Cardiology, Department of Medicine, University of California, San Francisco (Tison, Sanchez, Olgin, Lee, Fan, Gladstone, Mikell, Marcus); Cardiogram Incorporated, San Francisco, California (Ballinger, Singh, Sohoni, Hsieh); Department of Epidemiology and Biostatistics, University of California, San Francisco (Pletcher, Vittinghoff). Corresponding Author: Gregory M. 
Marcus, MD, MAS, Division of Cardiology, Department of Medicine, University of California, San Francisco, 505 Parnassus Ave, M1180B, San Francisco, CA 94143- 0124 (marcusg@medicine.ucsf.edu). Research JAMA Cardiology | Original Investigation (Reprinted) E1 © 2018 American Medical Association. All rights reserved. tion from the participant (dependent on user adherence) and by the episodic nature of data obtained. A Samsung Simband (Samsung) exhibited high sensitivity and specificity for AF de- 32 costs associated with the care of those patients, the potential reduction in stroke could ultimately provide cost savings. SeveralfactorsmakedetectionofAFfromambulatorydata Figure 2. Accuracy of Detecting Atrial Fibrillation in the Cardioversion Cohort 100 80 60 40 20 0 0 10080 Sensitivity,% 1 –Specificity, % 604020 Cardioversion cohortA 100 80 60 40 20 0 0 10080 Sensitivity,% 1 –Specificity, % 604020 Ambulatory subset of remote cohortB A, Receiver operating characteristic curve among 51 individuals undergoing in-hospital cardioversion. The curve demonstrates a C statistic of 0.97 (95% CI, 0.94-1.00), and the point on the curve indicates a sensitivity of 98.0% and a specificity of 90.2%. B, Receiver operating characteristic curve among 1617 individuals in the ambulatory subset of the remote cohort. The curve demonstrates a C statistic of 0.72 (95% CI, 0.64-0.78), and the point on the curve indicates a sensitivity of 67.7% and a specificity of 67.6%. Table 3. Performance Characteristics of Deep Neural Network in Validation Cohortsa Cohort % AUCSensitivity Specificity PPV NPV Cardioversion cohort (sedentary) 98.0 90.2 90.9 97.8 0.97 Subset of remote cohort (ambulatory) 67.7 67.6 7.9 98.1 0.72 Abbreviations: AUC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value. a In the cardioversion cohort, the atrial fibrillation reference standard was 12-lead electrocardiography diagnosis; in the remote cohort, the atrial fibrillation reference standard was limited to self-reported history of persistent atrial fibrillation. Research Original Investigation Passive Detection of Atrial Fibrillation Using a Commercially Available Smartwatch AUC=0.98 AUC=0.72 • In external validation using standard 12-lead ECG, algorithm performance achieved a C statistic of 0.97. • The passive detection of AF from free-living smartwatch data has substantial clinical implications. • Importantly, the accuracy of detecting self-reported AF in an ambulatory setting was more modest (C statistic of 0.72)
  • 98.
    애플워치4: 심전도, 부정맥,낙상 측정 FDA 의료기기 인허가 •De Novo 의료기기로 인허가 받음 (새로운 종류의 의료기기) •9월에 발표하였으나, 부정맥 관련 기능은 12월에 활성화 •미국 애플워치에서만 가능하고, 한국은안 됨 (미국에서 구매한 경우, 한국 앱스토어 ID로 가능)
  • 104.
    • 애플워치4의 부정맥측정 기능으로, • 기능이 활성화된 당일에 자신의 심방세동을 측정한 사용자 • 애플워치 결과 보고, 응급실에 갔더니, • 실제로 심방세동을 진단 받게 되었음
  • 105.
    • 애플워치4 부정맥(심방세동) 측정 기능 • ‘진단’이나 기존 환자의 ‘관리’ 목적이 아니라, • ‘측정’ 목적 • 기존에 진단 받지 않은 환자 중에, • 심방세동이 있는 사람을 확인하여 병원으로 연결
 • 정확성을 정말 철저하게 검증했는가? • 애플워치에 의해서 측정된 심방세동의 20% 정도가 • 패치 형태의 ECG 모니터에서 측정되지 않음 • 즉, false alarm 이 많을 수 있음 
 • 불필요한 병원 방문, 검사, 의료 비용 발생 등을 우려하고 있음
  • 106.
    https://www.scripps.edu/science-and-medicine/translational-institute/about/news/oran-ecg-app/index.html?fbclid=IwAR02Z8SG679-svCkyxBhv3S1JUOSFQlI6UCvNu3wvUgyRmc1r2ft963MFmM • 애플워치4의 심방세동측정 기능의 ‘위험성’ 경고 • 일반인을 대상의 측정에서 false positive의 위험 • (실제로는 심방세동 없는데, 있는 것으로 잘못 나온 케이스) • False positive가 많은 PSA 검사와 비교하여 설명 • 특히, 애플워치는 PSA와 달리 장기적인 정확성 데이터조차 없음 • 의료기기 인허가를 받기는 했으나, • 애플워치4가 얼마나 정확한지는 아무도 모름..
  • 107.
Early detection of prostate cancer with PSA testing and a digital rectal exam (Harding Center fact box)
Numbers are for men aged 50 years or older who either did or did not participate in prostate cancer screening for approximately 11 years.
Outcome | 1,000 men without screening | 1,000 men with screening
How many men died from prostate cancer? | 7 | 7
How many men died from any cause? | 210 | 210
How many men without prostate cancer experienced false alarms and unnecessarily had tissue samples removed (biopsy)? | - | 160
How many men with non-progressive prostate cancer were unnecessarily diagnosed or treated*? | - | 20
* E.g. treatments that include removal of the prostate gland (prostatectomy) or radiation therapy, which can lead to incontinence and impotence.
Source: Ilic et al. Cochrane Database Syst Rev 2013(1):CD004876. Last update: November 2017. www.harding-center.mpg.de/en/fact-boxes
https://www.scripps.edu/science-and-medicine/translational-institute/about/news/oran-ecg-app/index.html
  • 108.
    Rationale and designof a large-scale, app- based study to identify cardiac arrhythmias using a smartwatch: The Apple Heart Study Mintu P. Turakhia, MD, MAS, a,b Manisha Desai, PhD, c Haley Hedlin, PhD, c Amol Rajmane, MD, MBA, d Nisha Talati, MBA, d Todd Ferris, MD, MS, e Sumbul Desai, MD, f Divya Nag f Mithun Patel, MD, f Peter Kowey, MD, g John S. Rumsfeld, MD, PhD, h Andrea M. Russo, MD, i Mellanie True Hills, BS, j Christopher B. Granger, MD, k Kenneth W. Mahaffey, MD, d and Marco V. Perez, MD l Stanford, Palo Alto, Cupertino, CA; Philadelphia PA; Denver Colorado; Camden NJ; Decatur TX; Durham NC Background Smartwatch and fitness band wearable consumer electronics can passively measure pulse rate from the wrist using photoplethysmography (PPG). Identification of pulse irregularity or variability from these data has the potential to identify atrial fibrillation or atrial flutter (AF, collectively). The rapidly expanding consumer base of these devices allows for detection of undiagnosed AF at scale. Methods The Apple Heart Study is a prospective, single arm pragmatic study that has enrolled 419,093 participants (NCT03335800). The primary objective is to measure the proportion of participants with an irregular pulse detected by the Apple Watch (Apple Inc, Cupertino, CA) with AF on subsequent ambulatory ECG patch monitoring. The secondary objectives are to: 1) characterize the concordance of pulse irregularity notification episodes from the Apple Watch with simultaneously recorded ambulatory ECGs; 2) estimate the rate of initial contact with a health care provider within 3 months after notification of pulse irregularity. The study is conducted virtually, with screening, consent and data collection performed electronically from within an accompanying smartphone app. Study visits are performed by telehealth study physicians via video chat through the app, and ambulatory ECG patches are mailed to the participants. Conclusions The results of this trial will provide initial evidence for the ability of a smartwatch algorithm to identify pulse irregularity and variability which may reflect previously unknown AF. The Apple Heart Study will help provide a foundation for how wearable technology can inform the clinical approach to AF identification and screening. (Am Heart J 2019;207:66-75.) Atrial fibrillation and atrial flutter (AF, collectively) together represent the most common cardiac arrhythmia, currently affecting over 5 million people in the United States1,2 with projected estimates up to 12 million persons by 2050.3 AF increases the risk of stroke 5-fold4 and is responsible for at least 15% to 25% of strokes in the United States.5 Oral anticoagulation can substantially reduce the relative risk of stroke in patients with AF by 49% to 74%, with absolute risk reductions of 2.7% for primary stroke prevention and 8.4% for secondary prevention.6 Unfortunately, 18% of AF-associated strokes present with AF that is newly detected at the time of stroke.7 AF can be subclinical due to minimal symptom severity, frank absence of symptoms, or paroxysmal nature, even in the presence of tachycardia during AF episodes. 
It is estimated that 700,000 people in the United States may have previously unknown AF, with an incremental cost burden of 3.2 billion dollars.8,9 Asymptomatic AF is associated with similar risk of all-cause death, cardiovas- cular death, and stroke/thromboembolism compared to symptomatic AF.10 Minimally symptomatic patients have been shown to derive significant symptom relief follow- ing rate or rhythm control of AF.11 Undiagnosed or untreated AF can also lead to development of heart failure From the a Center for Digital Health, Stanford University Stanford, CA, b VA Palo Alto Health Care System, Palo Alto, CA, c Quantitative Sciences Unit, Stanford University, Stanford, CA, d Stanford Center for Clinical Research, Stanford University, Stanford, CA, e Information Resources and Technology, Stanford University, Stanford, CA, f Apple Inc. Cupertino, CA, g Lankenau Heart Institute and Jefferson Medical College, Philadelphia, PA, h University of Colorado School of Medicine, Denver, CO, i Division of Cardiovascular Disease, Cooper Medical School of Rowan University, Camden, NJ, j StopAfib.org, American Foundation for Women's Health, Decatur, TX, k Duke Clinical Research Institute, Duke University, Durham, NC, and l Division of Cardiovascular Medicine, Stanford University, Stanford, CA. Peter Alexander Noseworthy, MD served as guest editor for this article. RCT# NCT03335800 Submitted August 13, 2018; accepted September 4, 2018. Reprint requests: Mintu Turakhia, Marco Perez, Stanford Center for Clinical Research, Stanford University, 1070 Arastradero Rd., Palo Alto, CA, 94304. E-mail: mintu@stanford.edu 0002-8703 © 2018 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). https://doi.org/10.1016/j.ahj.2018.09.002 Trial Design American Heart Journal, 2019
  • 109.
  • 110.
American Heart Journal, 2019 (Figure 1)
• Apple Heart Study
• A remote (siteless) clinical trial run by Stanford and sponsored by Apple
• PPG is used to measure heart rate and rhythm regularity
• If the PPG flags an irregularity suggestive of atrial fibrillation,
the next step is ambulatory ECG measurement with an ePatch
• The ePatch recording is compared with the simultaneously recorded Apple Watch results
• Telemedicine is used for ePatch deployment and review of results
• Enrollment of about 400,000 participants is complete and follow-up is ongoing (a simple pulse-irregularity sketch follows below)
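For intuition only: Apple's notification algorithm is proprietary, but a pulse-irregularity check of the general kind described above can be illustrated by thresholding the variability of PPG-derived inter-beat intervals. The threshold and the interval data below are assumptions, not validated values:

```python
# Illustrative sketch only -- not Apple's proprietary algorithm. A common,
# simple way to flag an irregular pulse from PPG-derived inter-beat intervals
# (IBIs) is a variability statistic such as RMSSD compared to a threshold.
import numpy as np

def rmssd(ibis_ms: np.ndarray) -> float:
    """Root mean square of successive differences of inter-beat intervals."""
    diffs = np.diff(ibis_ms)
    return float(np.sqrt(np.mean(diffs ** 2)))

def irregular_pulse(ibis_ms: np.ndarray, threshold_ms: float = 100.0) -> bool:
    # threshold_ms is an assumed illustrative cutoff, not a validated one.
    return rmssd(ibis_ms) > threshold_ms

regular = np.array([820, 810, 815, 825, 818, 812], dtype=float)     # sinus-like
irregular = np.array([640, 910, 720, 1050, 590, 880], dtype=float)  # AF-like
print(irregular_pulse(regular), irregular_pulse(irregular))  # False True
```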
  • 111.
• American College of Cardiology's 68th Annual Scientific Session
• Only about 0.5% of all trial participants received an irregular pulse notification
• When the Apple Watch and the ECG patch were worn simultaneously, the positive predictive value was 71%
• 84% of those who received an irregular pulse notification were in atrial fibrillation at that time
• In follow-up, 34% of those who wore an ECG patch over the following week were found to have atrial fibrillation
• Of those who received an irregular pulse notification, 57% actually saw a physician (0.3% of the entire cohort)
  • 112.
  • 113.
  • 114.
Digital Phenotype: Your smartphone knows if you are depressed (Ginger.io)
  • 115.
Digital Phenotype: Your smartphone knows if you are depressed
J Med Internet Res. 2015 Jul 15;17(7):e175. The correlation analysis between the features and the PHQ-9 scores revealed that 6 of the 10 features were significantly correlated to the scores:
• strong correlation: circadian movement, normalized entropy, location variance
• correlation: phone usage features, usage duration and usage frequency
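A minimal sketch of the kind of analysis summarized above: correlating passively sensed smartphone features with PHQ-9 scores. The dataframe, column names, and values are hypothetical, not the study's data:

```python
# Minimal sketch (hypothetical data, not the study's): correlating passively
# sensed smartphone features with PHQ-9 depression scores, as in the
# J Med Internet Res 2015 analysis summarized above.
import pandas as pd
from scipy.stats import pearsonr

# Assumed columns: one row per participant.
df = pd.DataFrame({
    "phq9":               [3, 14, 7, 18, 5, 11, 2, 16],
    "circadian_movement": [0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.95, 0.25],
    "location_variance":  [1.2, 0.4, 0.9, 0.3, 1.1, 0.5, 1.3, 0.35],
    "usage_duration_min": [95, 260, 140, 310, 120, 220, 80, 290],
})

for feature in ["circadian_movement", "location_variance", "usage_duration_min"]:
    r, p = pearsonr(df[feature], df["phq9"])
    print(f"{feature:>20}: r={r:+.2f}, p={p:.3f}")
```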
  • 116.
The digital phenotype
Sachin H Jain, Brian W Powers, Jared B Hawkins & John S Brownstein
In the coming years, patient phenotypes captured to enhance health and wellness will extend to human interactions with digital technology.
In 1982, the evolutionary biologist Richard Dawkins introduced the concept of the "extended phenotype": the idea that phenotypes should not be limited just to biological processes, such as protein biosynthesis or tissue growth, but extended to include all effects that a gene has on its environment, inside or outside of the body of the individual organism. Dawkins stressed that many delineations of phenotypes are arbitrary. Animals and humans can modify their environments, and these modifications and associated behaviors are expressions of one's genome and, thus, part of their extended phenotype. In the animal kingdom, he cites dam building by beavers as an example of the beaver's extended phenotype.
As personal technology becomes increasingly embedded in human lives, we think there is an important extension of Dawkins's theory: the notion of a 'digital phenotype'. Can aspects of our interface with technology be somehow diagnostic and/or prognostic for certain conditions? Can one's clinical data be linked and analyzed together with online activity and behavior data to create a unified, nuanced view of human disease? Here, we describe the concept of the digital phenotype. Although several disparate studies have touched on this notion, the framework for medicine has yet to be described. We attempt to define the digital phenotype and further describe the opportunities and challenges in incorporating these data into healthcare.
"... the manifestations of disease by providing a more comprehensive and nuanced view of the experience of illness. Through the lens of the digital phenotype, an individual's interaction ..."
Figure 1. Timeline of insomnia-related tweets from representative individuals. Density distributions (probability density functions) are shown for seven individual users over a two-year period. Density on the y axis highlights periods of relative activity for each user. A representative tweet from each user is shown as an example.
http://www.nature.com/nbt/journal/v33/n5/full/nbt.3223.html
  • 117.
Your twitter knows if you cannot sleep
Jain, Powers, Hawkins & Brownstein, "The digital phenotype", Nat. Biotech. 2015
Figure 1. Timeline of insomnia-related tweets from representative individuals. Density distributions (probability density functions) are shown for seven individual users over a two-year period. Density on the y axis highlights periods of relative activity for each user.
  • 118.
Reece & Danforth, "Instagram photos reveal predictive markers of depression" (2016)
higher Hue (bluer), lower Saturation (grayer), lower Brightness (darker)
  • 119.
Digital Phenotype: Your Instagram knows if you are depressed
Reece & Danforth, "Instagram photos reveal predictive markers of depression" (2016)
Results: Both All-data and Pre-diagnosis models were decisively superior to a null model (K_All = 157.5; K_Pre = 149.8). All-data predictors were significant with 99% probability. Pre-diagnosis and All-data confidence levels were largely identical, with two exceptions: Pre-diagnosis Brightness decreased to 90% confidence, and Pre-diagnosis posting frequency dropped to 30% confidence, suggesting a null predictive value in the latter case. Increased hue, along with decreased brightness and saturation, predicted depression. This means that photos posted by depressed individuals tended to be bluer, darker, and grayer (see Fig. 2). The more comments Instagram posts received, the more likely they were posted by depressed participants, but the opposite was true for likes received. In the All-data model, higher posting frequency was also associated with depression. Depressed participants were more likely to post photos with faces, but had a lower average face count per photograph than healthy participants. Finally, depressed participants were less likely to apply Instagram filters to their posted photos.
Fig. 2. Magnitude and direction of regression coefficients in All-data (N=24,713) and Pre-diagnosis (N=18,513) models. X-axis values represent the adjustment in odds of an observation belonging to depressed individuals, per ...
Fig. 1. Comparison of HSV values. The right photograph has higher Hue (bluer), lower Saturation (grayer), and lower Brightness (darker) than the left photograph. Instagram photos posted by depressed individuals had HSV values shifted towards those in the right photograph, compared with photos posted by healthy individuals.
Units of observation: In determining the best time span for this analysis, we encountered a difficult question: When and for how long does depression occur? A diagnosis of depression does not indicate the persistence of a depressive state for every moment of every day, and to conduct analysis using an individual's entire posting history as a single unit of observation is therefore rather specious. At the other extreme, to take each individual photograph as a unit of observation runs the risk of being too granular. De Choudhury et al. looked at all of a given user's posts in a single day, and aggregated those data into per-person, per-day units of observation. We adopted this precedent of "user-days" as a unit of analysis.
Statistical framework: We used Bayesian logistic regression with uninformative priors to determine the strength of individual predictors. Two separate models were trained. The All-data model used all collected data to address Hypothesis 1. The Pre-diagnosis model used all data collected from ...
higher Hue (bluer), lower Saturation (grayer), lower Brightness (darker)
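The color features behind these findings are straightforward to compute. A sketch (not the authors' code) that extracts mean hue, saturation, and brightness from a photo with Pillow; "photo.jpg" is a placeholder path:

```python
# Sketch of the kind of color features used above (not the authors' code):
# mean Hue, Saturation, and Value (brightness) of an Instagram photo.
# "photo.jpg" is a placeholder path.
import numpy as np
from PIL import Image

def mean_hsv(path: str) -> dict:
    img = Image.open(path).convert("HSV")      # Pillow converts RGB -> HSV
    h, s, v = np.asarray(img, dtype=float).transpose(2, 0, 1) / 255.0
    return {"hue": h.mean(), "saturation": s.mean(), "brightness": v.mean()}

features = mean_hsv("photo.jpg")
# Per the findings above, depression was associated with higher hue (bluer)
# and lower saturation/brightness (grayer, darker) on average.
print(features)
```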
  • 120.
Digital Phenotype: Your Instagram knows if you are depressed
Reece & Danforth, "Instagram photos reveal predictive markers of depression" (2016)
Instagram filter usage differed between depressed and healthy participants (χ²_All = 907.84, p = 9.17e-64; χ²_Pre = 813.80, p = 2.87e-44). In particular, depressed participants were less likely than healthy participants to use any filters at all. When depressed participants did employ filters, they most disproportionately favored the "Inkwell" filter, which converts color photographs to black-and-white images. Conversely, healthy participants most disproportionately favored the Valencia filter, which lightens the tint of photos. Examples of filtered photographs are provided in SI Appendix VIII.
Fig. 3. Instagram filter usage among depressed and healthy participants. Bars indicate the difference between observed and expected usage frequencies, based on a Chi-squared analysis of independence. Blue bars indicate disproportionate use of a filter by depressed compared to healthy participants; orange bars indicate the reverse.
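The filter-usage result rests on a chi-squared test of independence between participant group and filter choice. A sketch with assumed counts (not the study's data):

```python
# Sketch (hypothetical counts, not the study's data): a chi-squared test of
# independence for filter usage by depressed vs. healthy participants, the
# kind of analysis behind the chi-squared statistics quoted above.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: depressed, healthy. Columns: no filter, Inkwell, Valencia, other.
observed = np.array([
    [620, 140,  35, 205],   # depressed (assumed counts)
    [480,  60, 150, 310],   # healthy (assumed counts)
])
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2e}")
# (observed - expected) per cell gives the over/under-use plotted in Fig. 3.
```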
  • 121.
Digital Phenotype: Your Instagram knows if you are depressed
Reece & Danforth, "Instagram photos reveal predictive markers of depression" (2016)
VIII. Instagram filter examples
Fig. S8. Examples of the Inkwell and Valencia Instagram filters. Inkwell converts color photos to black-and-white; Valencia lightens tint. Depressed participants most favored Inkwell compared to healthy participants; healthy participants most favored Valencia.
  • 122.
Mindstrong Health
• Measures cognitive function, depression, schizophrenia, bipolar disorder, PTSD, and more
• based on patterns of smartphone use
• Co-founded by Thomas Insel, former director of the US National Institute of Mental Health
• Backed by an investment from Amazon's Jeff Bezos
  • 123.
    BRIEF COMMUNICATION OPEN Digitalbiomarkers of cognitive function Paul Dagum1 To identify digital biomarkers associated with cognitive function, we analyzed human–computer interaction from 7 days of smartphone use in 27 subjects (ages 18–34) who received a gold standard neuropsychological assessment. For several neuropsychological constructs (working memory, memory, executive function, language, and intelligence), we found a family of digital biomarkers that predicted test scores with high correlations (p < 10−4 ). These preliminary results suggest that passive measures from smartphone use could be a continuous ecological surrogate for laboratory-based neuropsychological assessment. npj Digital Medicine (2018)1:10 ; doi:10.1038/s41746-018-0018-4 INTRODUCTION By comparison to the functional metrics available in other disciplines, conventional measures of neuropsychiatric disorders have several challenges. First, they are obtrusive, requiring a subject to break from their normal routine, dedicating time and often travel. Second, they are not ecological and require subjects to perform a task outside of the context of everyday behavior. Third, they are episodic and provide sparse snapshots of a patient only at the time of the assessment. Lastly, they are poorly scalable, taxing limited resources including space and trained staff. In seeking objective and ecological measures of cognition, we attempted to develop a method to measure memory and executive function not in the laboratory but in the moment, day-to-day. We used human–computer interaction on smart- phones to identify digital biomarkers that were correlated with neuropsychological performance. RESULTS In 2014, 27 participants (ages 27.1 ± 4.4 years, education 14.1 ± 2.3 years, M:F 8:19) volunteered for neuropsychological assessment and a test of the smartphone app. Smartphone human–computer interaction data from the 7 days following the neuropsychological assessment showed a range of correla- tions with the cognitive scores. Table 1 shows the correlation between each neurocognitive test and the cross-validated predictions of the supervised kernel PCA constructed from the biomarkers for that test. Figure 1 shows each participant test score and the digital biomarker prediction for (a) digits backward, (b) symbol digit modality, (c) animal fluency, (d) Wechsler Memory Scale-3rd Edition (WMS-III) logical memory (delayed free recall), (e) brief visuospatial memory test (delayed free recall), and (f) Wechsler Adult Intelligence Scale- 4th Edition (WAIS-IV) block design. Construct validity of the predictions was determined using pattern matching that computed a correlation of 0.87 with p < 10−59 between the covariance matrix of the predictions and the covariance matrix of the tests. Table 1. Fourteen neurocognitive assessments covering five cognitive domains and dexterity were performed by a neuropsychologist. 
Shown are the group mean and standard deviation, range of scores, and the correlation between each test and the cross-validated prediction constructed from the digital biomarkers for that test.
Cognitive prediction | Mean (SD) | Range | R (predicted), p-value
Working memory: Digits forward | 10.9 (2.7) | 7-15 | 0.71 ± 0.10, 10−4
Working memory: Digits backward | 8.3 (2.7) | 4-14 | 0.75 ± 0.08, 10−5
Executive function: Trail A | 23.0 (7.6) | 12-39 | 0.70 ± 0.10, 10−4
Executive function: Trail B | 53.3 (13.1) | 37-88 | 0.82 ± 0.06, 10−6
Executive function: Symbol digit modality | 55.8 (7.7) | 43-67 | 0.70 ± 0.10, 10−4
Language: Animal fluency | 22.5 (3.8) | 15-30 | 0.67 ± 0.11, 10−4
Language: FAS phonemic fluency | 42 (7.1) | 27-52 | 0.63 ± 0.12, 10−3
Dexterity: Grooved pegboard test (dominant hand) | 62.7 (6.7) | 51-75 | 0.73 ± 0.09, 10−4
Memory: California verbal learning test (delayed free recall) | 14.1 (1.9) | 9-16 | 0.62 ± 0.12, 10−3
Memory: WMS-III logical memory (delayed free recall) | 29.4 (6.2) | 18-42 | 0.81 ± 0.07, 10−6
Memory: Brief visuospatial memory test (delayed free recall) | 10.2 (1.8) | 5-12 | 0.77 ± 0.08, 10−5
Intelligence scale: WAIS-IV block design | 46.1 (12.8) | 12-61 | 0.83 ± 0.06, 10−6
Intelligence scale: WAIS-IV matrix reasoning | 22.1 (3.3) | 12-26 | 0.80 ± 0.07, 10−6
Intelligence scale: WAIS-IV vocabulary | 40.6 (4.0) | 31-50 | 0.67 ± 0.11, 10−4
Received: 5 October 2017; Revised: 3 February 2018; Accepted: 7 February 2018. Mindstrong Health, 248 Homer Street, Palo Alto, CA 94301, USA. Correspondence: Paul Dagum (paul@mindstronghealth.com). www.nature.com/npjdigitalmed. Published in partnership with the Scripps Translational Science Institute.
• 45 smartphone usage-pattern features in total: typing, scrolling, and screen touches
• e.g., typing the next character after pressing the space bar
• pressing backspace right after a backspace
• how a user searches for a person in the contacts list
• Correlation between smartphone usage patterns and cognitive ability
• 27 participants in their twenties and thirties
• Working memory, language, dexterity, etc. (a simplified sketch of such a pipeline follows below)
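The paper built supervised kernel PCA predictors from the 45 interaction features and scored them with cross-validation. As a simplified stand-in (ordinary, unsupervised KernelPCA plus ridge regression on synthetic data, not the paper's method), the pipeline looks roughly like this:

```python
# Simplified sketch, not the paper's method: the study built supervised kernel
# PCA predictors from ~45 human-computer interaction features and evaluated
# them with cross-validation. Plain KernelPCA + ridge regression stands in
# for that pipeline here, on hypothetical data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 45))                  # 27 subjects x 45 usage features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=27)  # e.g. digits backward

model = make_pipeline(StandardScaler(),
                      KernelPCA(n_components=5, kernel="rbf"),
                      Ridge(alpha=1.0))
y_hat = cross_val_predict(model, X, y, cv=5)   # held-out predictions
r, p = pearsonr(y, y_hat)
print(f"cross-validated r = {r:.2f} (p = {p:.1g})")
```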
  • 124.
Digital biomarkers of cognitive function. Paul Dagum, npj Digital Medicine (2018) 1:10; doi:10.1038/s41746-018-0018-4
Fig. 1: A blue square represents a participant test Z-score normed to the 27 participant scores, and a red circle represents the digital biomarker prediction Z-score normed to the 27 predictions. Test scores and predictions shown are (a) digits backward, (b) symbol digit modality, (c) animal fluency, (d) Wechsler Memory Scale-3rd Edition (WMS-III) logical memory (delayed free recall), (e) brief visuospatial memory test (delayed free recall), and (f) Wechsler Adult Intelligence Scale-4th Edition (WAIS-IV) block design.
• Correlation between smartphone usage patterns and cognitive ability
• Blue: scores on standard cognitive tests
• Red: predictions from Mindstrong's smartphone usage patterns
  • 125.
  • 126.
  • 128.
  • 130.
  • 131.
  • 132.
[Diagram] Apple HealthKit as the hub: patient/user devices and apps (Dexcom CGM, Withings devices, Apple Watch) feed data into HealthKit, which connects to hospital EHR systems such as Epic (Epic EHR and the Epic MyChart patient portal).
  • 135.
  • 136.
[Diagram] Hospital A, Hospital B, Hospital C: interoperability
  • 137.
  • 138.
• At launch in January 2018, it connected to 12 hospitals, including Johns Hopkins and UC San Diego
• (As of February 2019) more than 200 hospitals connected within a year
• Integration with the VA was also announced (with 9 million veterans)
• By comparison, Google Health in 2008 managed to connect only 12 hospitals in three years
  • 139.
Two strategies for data-driven medicine
• top-down: first form a hypothesis, then collect the specific kinds of data needed to test it.
• bottom-up: first collect as much of 'all' the data as possible, and expect something big to emerge.
  • 140.
• top-down: first form a hypothesis, then collect the specific kinds of data needed to test it.
• bottom-up: first collect as much of 'all' the data as possible, and expect something big to emerge.
Two strategies for data-driven medicine
  • 141.
    ©2017NatureAmerica,Inc.,partofSpringerNature.Allrightsreserved. NATURE BIOTECHNOLOGY ADVANCEONLINE PUBLICATION 1 A RT I C L E S In order to understand the basis of wellness and disease, we and others have pursued a global and holistic approach termed ‘systems medicine’1. The defining feature of systems medicine is the collec- tion of diverse longitudinal data for each individual. These data sets can be used to unravel the complexity of human biology and dis- ease by assessing both genetic and environmental determinants of health and their interactions. We refer to such data as personal, dense, dynamic data clouds: personal, because each data cloud is unique to an individual; dense, because of the high number of measurements; and dynamic, because we monitor longitudinally. The convergence of advances in systems medicine, big data analysis, individual meas- urement devices, and consumer-activated social networks has led to a vision of healthcare that is predictive, preventive, personalized, and participatory (P4)2, also known as ‘precision medicine’. Personal, dense, dynamic data clouds are indispensable to realizing this vision3. The US healthcare system invests 97% of its resources on disease care4, with little attention to wellness and disease prevention. Here we investigate scientific wellness, which we define as a quantitative data-informed approach to maintaining and improving health and avoiding disease. Several recent studies have illustrated the utility of multi-omic lon- gitudinal data to look for signs of reversible early disease or disease risk factors in single individuals. The dynamics of human gut and sali- vary microbiota in response to travel abroad and enteric infection was characterized in two individuals using daily stool and saliva samples5. Daily multi-omic data collection from one individual over 14 months identified signatures of respiratory infection and the onset of type 2 diabetes6. Crohn’s disease progression was tracked over many years in one individual using regular blood and stool measurements7. Each of these studies yielded insights into system dynamics even though they had only one or two participants. We report the generation and analysis of personal, dense, dynamic data clouds for 108 individuals over the course of a 9-month study that we call the Pioneer 100 Wellness Project (P100). Our study included whole genome sequences; clinical tests, metabolomes, proteomes, and microbiomes at 3-month intervals; and frequent activity measure- ments (i.e., wearing a Fitbit). This study takes a different approach from previous studies, in that a broad set of assays were carried out less frequently in a (comparatively) large number of people. Furthermore, we identified ‘actionable possibilities’ for each individual to enhance her/his health. Risk factors that we observed in participants’ clinical markers and genetics were used as a starting point to identify action- able possibilities for behavioral coaching. We report the correlations among different data types and identify population-level changes in clinical markers. This project is the pilot for the 100,000 (100K) person wellness project that we proposed in 2014 (ref. 8). An increased scale of personal, dense, dynamic data clouds in future holds the potential to improve our under- standing of scientific wellness and delineate early warning signs for human diseases. RESULTS The P100 study had four objectives. 
First, establish cost-efficient procedures for generating, storing, and analyzing multiple sources A wellness study of 108 individuals using personal, dense, dynamic data clouds Nathan D Price1,2,6,7, Andrew T Magis2,6, John C Earls2,6, Gustavo Glusman1 , Roie Levy1, Christopher Lausted1, Daniel T McDonald1,5, Ulrike Kusebauch1, Christopher L Moss1, Yong Zhou1, Shizhen Qin1, Robert L Moritz1 , Kristin Brogaard2, Gilbert S Omenn1,3, Jennifer C Lovejoy1,2 & Leroy Hood1,4,7 Personal data for 108 individuals were collected during a 9-month period, including whole genome sequences; clinical tests, metabolomes, proteomes, and microbiomes at three time points; and daily activity tracking. Using all of these data, we generated a correlation network that revealed communities of related analytes associated with physiology and disease. Connectivity within analyte communities enabled the identification of known and candidate biomarkers (e.g., gamma-glutamyltyrosine was densely interconnected with clinical analytes for cardiometabolic disease). We calculated polygenic scores from genome-wide association studies (GWAS) for 127 traits and diseases, and used these to discover molecular correlates of polygenic risk (e.g., genetic risk for inflammatory bowel disease was negatively correlated with plasma cystine). Finally, behavioral coaching informed by personal data helped participants to improve clinical biomarkers. Our results show that measurement of personal data clouds over time can improve our understanding of health and disease, including early transitions to disease states. 1Institute for Systems Biology, Seattle, Washington, USA. 2Arivale, Seattle, Washington, USA. 3Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA. 4Providence St. Joseph Health, Seattle, Washington, USA. 5Present address: University of California, San Diego, San Diego, California, USA. 6These authors contributed equally to this work. 7These authors jointly supervised this work. Correspondence should be addressed to N.D.P. (nathan.price@systemsbiology.org) or L.H. (lhood@systemsbiology.org). Received 16 October 2016; accepted 11 April 2017; published online 17 July 2017; doi:10.1038/nbt.3870
  • 142.
[Study design diagram] Three rounds of coaching sessions across months 1-9, with the following data collected:
• Clinical labs, cardiovascular: HDL/LDL cholesterol, triglycerides, particle profiles, and other markers (blood sample)
• Metabolomics: xenobiotics and metabolism-related small molecules (blood sample)
• Clinical labs, diabetes risk: fasting glucose, HbA1c, insulin, and other markers (blood sample)
• Clinical labs, inflammation: IL-6, IL-8, and other markers (blood sample)
• Clinical labs, nutrition and toxins: ferritin, vitamin D, glutathione, mercury, lead, and other markers (blood sample)
• Genetics: whole genome sequence (blood sample)
• Proteomics: inflammation, cardiovascular, liver, brain, and heart-related proteins (blood sample)
• Gut microbiome: 16S rRNA sequencing (stool sample)
• Quantified self: daily activity (activity tracker) and stress (four-point cortisol, saliva)
Measure all available multi-dimensional data.
  • 143.
[Figure 2 shows an inter-omic correlation network linking proteomics, clinical labs, metabolomics, genetic traits, and the gut microbiome.]
Figure 2. Top 100 correlations per pair of data types. Subset of top statistically significant Spearman inter-omic cross-sectional correlations between all data sets collected in our cohort. Each line represents one correlation that was significant after adjustment for multiple hypothesis testing using the method of Benjamini and Hochberg at padj < 0.05. The mean of all three time points was used to compute the correlations between analytes. Up to 100 correlations per pair of data types are shown in this figure. See Supplementary Figure 1 and Supplementary Table 2 for the complete inter-omic cross-sectional network.
Nature Biotechnology 2017
Of all the data types measured, the 100 most highly correlated pairs per pair of data types were selected (as sketched below).
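The network in Figure 2 comes from pairwise Spearman correlations across data types with Benjamini-Hochberg adjustment. A sketch of that computation on hypothetical data (not the P100 measurements):

```python
# Sketch (hypothetical data, not the P100 dataset): pairwise Spearman
# correlations between analytes from two data types, with Benjamini-Hochberg
# adjustment, as described in the Figure 2 caption above.
from itertools import product
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
clinical = pd.DataFrame(rng.normal(size=(108, 5)),
                        columns=[f"clin_{i}" for i in range(5)])
metabolites = pd.DataFrame(rng.normal(size=(108, 8)),
                           columns=[f"met_{i}" for i in range(8)])

pairs, rhos, pvals = [], [], []
for a, b in product(clinical.columns, metabolites.columns):
    rho, p = spearmanr(clinical[a], metabolites[b])
    pairs.append((a, b)); rhos.append(rho); pvals.append(p)

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
edges = [(a, b, r) for (a, b), r, keep in zip(pairs, rhos, reject) if keep]
print(f"{len(edges)} significant inter-omic correlations at padj < 0.05")
```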
  • 144.
• Verily (Google)'s Project Baseline
• A project to redefine what health and disease mean
• Closely tracks the health status of 10,000 individuals over 4 years and accumulates the data
• Heart rate, sleep patterns, genetic information, emotional state, medical records, family history, urine/saliva/blood tests, and more
  • 145.
iCarbonX
• Founded by Jun Wang, former CEO of China's BGI
• Plans to 'measure all the data' and use it for precision medicine
• Invests in and acquires companies with data-measurement capabilities:
• SomaLogic, HealthTell, PatientsLikeMe
• Plans to collect data on 1 to 10 million people over the next 5 years
• The analysis of this data will be done with artificial intelligence
  • 146.
• Precision Medicine Initiative Cohort Program
• $215 million invested
• Will recruit at least one million American volunteers
• EMR, family history, genetic information, blood and urine test results,
• imaging data such as MRI, and data from wearable devices
  • 147.
The Future of Individualized Medicine, 2019 @San Diego
  • 148.
The Future of Individualized Medicine, 2019 @San Diego
  • 149.
Step 3. Insight from the Data
  • 151.
  • 152.
How to Analyze and Interpret the Big Data?
  • 153.
Two ways to get insights from the big data
  • 155.
No choice but to bring AI into medicine
  • 156.
Martin Duggan, "IBM Watson Health - Integrated Care & the Evolution to Cognitive Computing"
  • 157.
The three types of medical artificial intelligence
• Analysis of complex medical data and derivation of insights
• Analysis and interpretation of medical imaging and pathology data
• Continuous monitoring of data for prediction and prevention
  • 158.
The three types of medical artificial intelligence
• Analysis of complex medical data and derivation of insights
• Analysis and interpretation of medical imaging and pathology data
• Continuous monitoring of data for prediction and prevention
  • 159.
Jeopardy! In 2011, Watson competed against two human champions in a quiz showdown and won decisively.
  • 162.
    ARTICLE OPEN Scalable andaccurate deep learning with electronic health records Alvin Rajkomar 1,2 , Eyal Oren1 , Kai Chen1 , Andrew M. Dai1 , Nissan Hajaj1 , Michaela Hardt1 , Peter J. Liu1 , Xiaobing Liu1 , Jake Marcus1 , Mimi Sun1 , Patrik Sundberg1 , Hector Yee1 , Kun Zhang1 , Yi Zhang1 , Gerardo Flores1 , Gavin E. Duggan1 , Jamie Irvine1 , Quoc Le1 , Kurt Litsch1 , Alexander Mossin1 , Justin Tansuwan1 , De Wang1 , James Wexler1 , Jimbo Wilson1 , Dana Ludwig2 , Samuel L. Volchenboum3 , Katherine Chou1 , Michael Pearson1 , Srinivasan Madabushi1 , Nigam H. Shah4 , Atul J. Butte2 , Michael D. Howell1 , Claire Cui1 , Greg S. Corrado1 and Jeffrey Dean1 Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two US academic medical centers with 216,221 adult patients hospitalized for at least 24 h. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (area under the receiver operator curve [AUROC] across sites 0.93–0.94), 30-day unplanned readmission (AUROC 0.75–0.76), prolonged length of stay (AUROC 0.85–0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed traditional, clinically-used predictive models in all cases. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios. In a case study of a particular prediction, we demonstrate that neural networks can be used to identify relevant information from the patient’s chart. npj Digital Medicine (2018)1:18 ; doi:10.1038/s41746-018-0029-1 INTRODUCTION The promise of digital medicine stems in part from the hope that, by digitizing health data, we might more easily leverage computer information systems to understand and improve care. In fact, routinely collected patient healthcare data are now approaching the genomic scale in volume and complexity.1 Unfortunately, most of this information is not yet used in the sorts of predictive statistical models clinicians might use to improve care delivery. It is widely suspected that use of such efforts, if successful, could provide major benefits not only for patient safety and quality but also in reducing healthcare costs.2–6 In spite of the richness and potential of available data, scaling the development of predictive models is difficult because, for traditional predictive modeling techniques, each outcome to be predicted requires the creation of a custom dataset with specific variables.7 It is widely held that 80% of the effort in an analytic model is preprocessing, merging, customizing, and cleaning datasets,8,9 not analyzing them for insights. This profoundly limits the scalability of predictive models. 
Another challenge is that the number of potential predictor variables in the electronic health record (EHR) may easily number in the thousands, particularly if free-text notes from doctors, nurses, and other providers are included. Traditional modeling approaches have dealt with this complexity simply by choosing a very limited number of commonly collected variables to consider.7 This is problematic because the resulting models may produce imprecise predictions: false-positive predictions can overwhelm physicians, nurses, and other providers with false alarms and concomitant alert fatigue,10 which the Joint Commission identified as a national patient safety priority in 2014.11 False-negative predictions can miss significant numbers of clinically important events, leading to poor clinical outcomes.11,12 Incorporating the entire EHR, including clinicians’ free-text notes, offers some hope of overcoming these shortcomings but is unwieldy for most predictive modeling techniques. Recent developments in deep learning and artificial neural networks may allow us to address many of these challenges and unlock the information in the EHR. Deep learning emerged as the preferred machine learning approach in machine perception problems ranging from computer vision to speech recognition, but has more recently proven useful in natural language processing, sequence prediction, and mixed modality data settings.13–17 These systems are known for their ability to handle large volumes of relatively messy data, including errors in labels Received: 26 January 2018 Revised: 14 March 2018 Accepted: 26 March 2018 1 Google Inc, Mountain View, CA, USA; 2 University of California, San Francisco, San Francisco, CA, USA; 3 University of Chicago Medicine, Chicago, IL, USA and 4 Stanford University, Stanford, CA, USA Correspondence: Alvin Rajkomar (alvinrajkomar@google.com) These authors contributed equally: Alvin Rajkomar, Eyal Oren www.nature.com/npjdigitalmed Published in partnership with the Scripps Translational Science Institute •2018년 1월 구글이 전자의무기록(EMR)을 분석하여, 환자 치료 결과를 예측하는 인공지능 발표 •환자가 입원 중에 사망할 것인지 •장기간 입원할 것인지 •퇴원 후에 30일 내에 재입원할 것인지 •퇴원 시의 진단명
• Key feature of this study: scalability
• Unlike previous work, no subset of the EMR was hand-curated or pre-processed;
• the entire EMR was analyzed as a whole, at two sites: UCSF and UCM (University of Chicago Medicine)
• Notably, unstructured data, i.e. physicians' free-text notes, were also included (a simplified sketch follows below)
npj Digital Medicine 2018
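The study's models are deep sequence networks over FHIR-formatted EHR events, which is far beyond a slide-sized example. As a deliberately simplified stand-in, the sketch below treats a patient's EHR tokens (codes plus note words) as a bag of words and fits a logistic regression for in-hospital mortality; all records and labels are made up:

```python
# Much simpler stand-in for the deep sequence models described above: treat a
# patient's EHR (codes plus note tokens) as a bag of tokens and fit a logistic
# regression to predict in-hospital mortality. Data below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

records = [
    "dx:sepsis lab:lactate_high note:hypotensive icu vasopressor",
    "dx:pneumonia lab:wbc_high note:improving ward antibiotics",
    "dx:hip_fracture note:stable ward mobilizing",
    "dx:metastatic_cancer note:palliative icu intubated",
    "dx:cellulitis note:afebrile ward discharge_planned",
    "dx:gi_bleed lab:hgb_low note:transfused icu",
]
died_in_hospital = [1, 0, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    records, died_in_hospital, test_size=0.33, random_state=0,
    stratify=died_in_hospital)

vec = CountVectorizer(token_pattern=r"[^\s]+")   # keep "dx:..." tokens intact
model = LogisticRegression(max_iter=1000)
model.fit(vec.fit_transform(X_train), y_train)
probs = model.predict_proba(vec.transform(X_test))[:, 1]
print("AUROC:", roc_auc_score(y_test, probs))
```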
  • 163.
npj Digital Medicine 2018
... Because we were interested in understanding whether deep learning could scale to produce valid predictions across divergent healthcare domains, we used a single data structure to make predictions for an important clinical outcome (death), a standard measure of quality of care (readmissions), a measure of resource utilization (length of stay), and a measure of understanding of a patient's problems (diagnoses). Using the entirety of a patient's chart for a prediction does more than promote scalability; it exposes more data with which to make an accurate prediction. For predictions made at discharge, the deep learning models considered more than 46 billion pieces of EHR data and achieved more accurate predictions, earlier in the hospital stay, than did traditional models. To the best of the authors' knowledge, these models outperform EHR models in the medical literature for predicting mortality (0.92-0.94 vs 0.91), unexpected readmission (0.75-0.76 vs 0.69), and increased length of stay (0.85-0.86 vs 0.77); for inpatient mortality, the deep learning model would fire about half the number of alerts of a traditional predictive model, resulting in many fewer false positives.
[Figure caption, excerpt] Every token is considered a potential predictor by the deep learning model. The line within the boxplot represents the median, the box represents the interquartile range (IQR), and the whiskers are 1.5 times the IQR. The number of tokens increased steadily from admission to discharge. At discharge, the median number of tokens for Hospital A was 86,477 and for Hospital B was 122,961.
Table 2. Prediction accuracy of each task made at different time points (AUROC, 95% CI)
Task and time point | Hospital A | Hospital B
Inpatient mortality
  24 h before admission | 0.87 (0.85-0.89) | 0.81 (0.79-0.83)
  At admission | 0.90 (0.88-0.92) | 0.90 (0.86-0.91)
  24 h after admission | 0.95 (0.94-0.96) | 0.93 (0.92-0.94)
  Baseline (aEWS) at 24 h after admission | 0.85 (0.81-0.89) | 0.86 (0.83-0.88)
30-day readmission
  At admission | 0.73 (0.71-0.74) | 0.72 (0.71-0.73)
  At 24 h after admission | 0.74 (0.72-0.75) | 0.73 (0.72-0.74)
  At discharge | 0.77 (0.75-0.78) | 0.76 (0.75-0.77)
  Baseline (mHOSPITAL) at discharge | 0.70 (0.68-0.72) | 0.68 (0.67-0.69)
Length of stay at least 7 days
  At admission | 0.81 (0.80-0.82) | 0.80 (0.80-0.81)
  At 24 h after admission | 0.86 (0.86-0.87) | 0.85 (0.85-0.86)
  Baseline (Liu) at 24 h after admission | 0.76 (0.75-0.77) | 0.74 (0.73-0.75)
Discharge diagnoses (weighted AUROC)
  At admission | 0.87 | 0.86
  At 24 h after admission | 0.89 | 0.88
  At discharge | 0.90 | 0.90
Abbreviations: aEWS, augmented Early Warning System score; mHOSPITAL, modified HOSPITAL score for readmission; Liu, modified Liu score for long length of stay.
• In January 2018, Google published an AI that analyzes the electronic medical record (EMR) to predict patient outcomes:
• whether the patient will die during the admission, whether the stay will be prolonged,
• whether the patient will be readmitted within 30 days of discharge, and the discharge diagnoses.
• Key feature of the study: scalability. The entire EMR, including physicians' free-text notes, was analyzed without task-specific pre-processing, at UCSF and UCM (University of Chicago Medicine).
  • 164.
• Predicting "whether a first cardiovascular event will occur within the next 10 years"
• Prospective cohort study: 378,256 patients in the UK
• The first large-scale study to predict disease with machine learning from routine clinical data
• Compared the accuracy of the established ACC/AHA guideline model with four machine-learning algorithms:
• random forest, logistic regression, gradient boosting, and neural network
  • 165.
Can machine-learning improve cardiovascular risk prediction using routine clinical data? Stephen F. Weng et al., PLoS One 2017
"... in a sensitivity of 62.7% and PPV of 17.1%. The random forest algorithm resulted in a net increase of 191 CVD cases from the baseline model, increasing the sensitivity to 65.3% and PPV to 17.8%, while logistic regression resulted in a net increase of 324 CVD cases (sensitivity 67.1%; PPV 18.3%). Gradient boosting machines and neural networks performed best, resulting in a net increase of 354 (sensitivity 67.5%; PPV 18.4%) and 355 (sensitivity 67.5%; PPV 18.4%) correctly predicted CVD cases, respectively. The ACC/AHA baseline model correctly predicted 53,106 non-cases from 75,585 total non-cases, resulting in a specificity of 70.3% and NPV of 95.1%. ..."
Table 3. Top 10 risk factor variables for CVD algorithms, listed in descending order of coefficient effect size (ACC/AHA; logistic regression), weighting (neural networks), or selection frequency (random forest, gradient boosting machines). Algorithms were derived from a training cohort of 295,267 patients. (Italicized entries in the original denote protective factors.)
Rank | ACC/AHA (Men) | ACC/AHA (Women) | ML: Logistic Regression | ML: Random Forest | ML: Gradient Boosting Machines | ML: Neural Networks
1 | Age | Age | Ethnicity | Age | Age | Atrial Fibrillation
2 | Total Cholesterol | HDL Cholesterol | Age | Gender | Gender | Ethnicity
3 | HDL Cholesterol | Total Cholesterol | SES: Townsend Deprivation Index | Ethnicity | Ethnicity | Oral Corticosteroid Prescribed
4 | Smoking | Smoking | Gender | Smoking | Smoking | Age
5 | Age x Total Cholesterol | Age x HDL Cholesterol | Smoking | HDL Cholesterol | HDL Cholesterol | Severe Mental Illness
6 | Treated Systolic Blood Pressure | Age x Total Cholesterol | Atrial Fibrillation | HbA1c | Triglycerides | SES: Townsend Deprivation Index
7 | Age x Smoking | Treated Systolic Blood Pressure | Chronic Kidney Disease | Triglycerides | Total Cholesterol | Chronic Kidney Disease
8 | Age x HDL Cholesterol | Untreated Systolic Blood Pressure | Rheumatoid Arthritis | SES: Townsend Deprivation Index | HbA1c | BMI missing
9 | Untreated Systolic Blood Pressure | Age x Smoking | Family history of premature CHD | BMI | Systolic Blood Pressure | Smoking
10 | Diabetes | Diabetes | COPD | Total Cholesterol | SES: Townsend Deprivation Index | Gender
https://doi.org/10.1371/journal.pone.0174944.t003
• Only some of the risk factors from the existing ACC/AHA guideline were also selected by the machine-learning algorithms.
• Diabetes, however, was not included in any of the four machine-learning models.
• Factors not included in the existing risk prediction tools were newly selected, such as:
• COPD, severe mental illness, prescribing of oral corticosteroids,
• and biomarkers such as triglyceride level.
  • 166.
Stephen F. Weng et al., PLoS One 2017. Can machine-learning improve cardiovascular risk prediction using routine clinical data?
"The net increase in non-cases correctly predicted compared to the baseline ACC/AHA model ranged from 191 non-cases for the random forest algorithm to 355 non-cases for the neural networks. Full details on the classification analysis can be found in S2 Table."
Discussion (excerpt): Compared to an established ACC/AHA risk prediction algorithm, all machine-learning algorithms tested were better at identifying individuals who will develop CVD and those who will not. Unlike established approaches to risk prediction, the machine-learning methods used were not limited to a small set of risk factors, and incorporated more pre-existing ...
Table 4. Performance of the machine-learning (ML) algorithms predicting 10-year cardiovascular disease (CVD) risk, derived by applying the training algorithms to the validation cohort of 82,989 patients. Higher c-statistics indicate better discrimination. The baseline (BL) ACC/AHA 10-year risk prediction algorithm is provided for comparison. (Standard error estimated by a jack-knife procedure.)
Algorithm | AUC c-statistic | Standard error | 95% CI (LCL-UCL) | Absolute change from baseline
BL: ACC/AHA | 0.728 | 0.002 | 0.723-0.735 | -
ML: Random Forest | 0.745 | 0.003 | 0.739-0.750 | +1.7%
ML: Logistic Regression | 0.760 | 0.003 | 0.755-0.766 | +3.2%
ML: Gradient Boosting Machines | 0.761 | 0.002 | 0.755-0.766 | +3.3%
ML: Neural Networks | 0.764 | 0.002 | 0.759-0.769 | +3.6%
• All four machine-learning models were more accurate than the existing ACC/AHA guideline.
• The neural network was the most accurate, with an AUC of 0.764.
• "Had this model been used, an additional 355 cardiovascular events could have been predicted and potentially prevented."
• Accuracy could improve further with deep learning,
• and additional risk factors such as genetic information could be incorporated (see the comparison sketch below).
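A sketch of the comparison referenced above: the same four classifier families evaluated by cross-validated AUROC on a synthetic, imbalanced dataset standing in for the routine-care cohort (not the QResearch/CPRD data):

```python
# Sketch (synthetic data, not the study's cohort): comparing the four
# classifier families from the study by cross-validated AUROC, the same
# metric reported in Table 4.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=30, n_informative=8,
                           weights=[0.93, 0.07], random_state=0)  # ~7% events

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network": make_pipeline(StandardScaler(),
                                    MLPClassifier(max_iter=1000, random_state=0)),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>20}: AUROC = {auc:.3f}")
```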
  • 167.
    LETTERS https://doi.org/10.1038/s41591-018-0335-9 1 Guangzhou Women andChildren’s Medical Center, Guangzhou Medical University, Guangzhou, China. 2 Institute for Genomic Medicine, Institute of Engineering in Medicine, and Shiley Eye Institute, University of California, San Diego, La Jolla, CA, USA. 3 Hangzhou YITU Healthcare Technology Co. Ltd, Hangzhou, China. 4 Department of Thoracic Surgery/Oncology, First Affiliated Hospital of Guangzhou Medical University, China State Key Laboratory and National Clinical Research Center for Respiratory Disease, Guangzhou, China. 5 Guangzhou Kangrui Co. Ltd, Guangzhou, China. 6 Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou, China. 7 Veterans Administration Healthcare System, San Diego, CA, USA. 8 These authors contributed equally: Huiying Liang, Brian Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, Guangjian Liu. *e-mail: kang.zhang@gmail.com; xiahumin@hotmail.com Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains chal- lenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physi- cians and unearth associations that previous statistical meth- ods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing com- mon childhood diseases. Our study provides a proof of con- cept for implementing an AI-based system as a means to aid physiciansintacklinglargeamountsofdata,augmentingdiag- nostic evaluations, and to provide clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare provid- ers are in relative shortage, the benefits of such an AI system are likely to be universal. Medical information has become increasingly complex over time. The range of disease entities, diagnostic testing and biomark- ers, and treatment modalities has increased exponentially in recent years. Subsequently, clinical decision-making has also become more complex and demands the synthesis of decisions from assessment of large volumes of data representing clinical information. In the current digital age, the electronic health record (EHR) represents a massive repository of electronic data points representing a diverse array of clinical information1–3 . Artificial intelligence (AI) methods have emerged as potentially powerful tools to mine EHR data to aid in disease diagnosis and management, mimicking and perhaps even augmenting the clinical decision-making of human physicians1 . To formulate a diagnosis for any given patient, physicians fre- quently use hypotheticodeductive reasoning. Starting with the chief complaint, the physician then asks appropriately targeted questions relating to that complaint. 
From this initial small feature set, the physician forms a differential diagnosis and decides what features (historical questions, physical exam findings, laboratory testing, and/or imaging studies) to obtain next in order to rule in or rule out the diagnoses in the differential diagnosis set. The most use- ful features are identified, such that when the probability of one of the diagnoses reaches a predetermined level of acceptability, the process is stopped, and the diagnosis is accepted. It may be pos- sible to achieve an acceptable level of certainty of the diagnosis with only a few features without having to process the entire feature set. Therefore, the physician can be considered a classifier of sorts. In this study, we designed an AI-based system using machine learning to extract clinically relevant features from EHR notes to mimic the clinical reasoning of human physicians. In medicine, machine learning methods have already demonstrated strong per- formance in image-based diagnoses, notably in radiology2 , derma- tology4 , and ophthalmology5–8 , but analysis of EHR data presents a number of difficult challenges. These challenges include the vast quantity of data, high dimensionality, data sparsity, and deviations Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence Huiying Liang1,8 , Brian Y. Tsui 2,8 , Hao Ni3,8 , Carolina C. S. Valentim4,8 , Sally L. Baxter 2,8 , Guangjian Liu1,8 , Wenjia Cai 2 , Daniel S. Kermany1,2 , Xin Sun1 , Jiancong Chen2 , Liya He1 , Jie Zhu1 , Pin Tian2 , Hua Shao2 , Lianghong Zheng5,6 , Rui Hou5,6 , Sierra Hewett1,2 , Gen Li1,2 , Ping Liang3 , Xuan Zang3 , Zhiqi Zhang3 , Liyan Pan1 , Huimin Cai5,6 , Rujuan Ling1 , Shuhua Li1 , Yongwang Cui1 , Shusheng Tang1 , Hong Ye1 , Xiaoyan Huang1 , Waner He1 , Wenqing Liang1 , Qing Zhang1 , Jianmin Jiang1 , Wei Yu1 , Jianqun Gao1 , Wanxing Ou1 , Yingmin Deng1 , Qiaozhen Hou1 , Bei Wang1 , Cuichan Yao1 , Yan Liang1 , Shu Zhang1 , Yaou Duan2 , Runze Zhang2 , Sarah Gibson2 , Charlotte L. Zhang2 , Oulan Li2 , Edward D. Zhang2 , Gabriel Karin2 , Nathan Nguyen2 , Xiaokang Wu1,2 , Cindy Wen2 , Jie Xu2 , Wenqin Xu2 , Bochu Wang2 , Winston Wang2 , Jing Li1,2 , Bianca Pizzato2 , Caroline Bao2 , Daoman Xiang1 , Wanting He1,2 , Suiqin He2 , Yugui Zhou1,2 , Weldon Haw2,7 , Michael Goldbaum2 , Adriana Tremoulet2 , Chun-Nan Hsu 2 , Hannah Carter2 , Long Zhu3 , Kang Zhang 1,2,7 * and Huimin Xia 1 * NATURE MEDICINE | www.nature.com/naturemedicine Nat Med 2019 Feb •소아 환자 130만명의 EMR 데이터 101.6 million 개 분석 •딥러닝 기반의 자연어 처리 기술 •의사의 hypotetico-deductive reasoning 모방 •소아 환자의 common disease를 진단하는 인공지능
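The paper's system extracts structured clinical features from free-text notes with a deep NLP model and then assigns diagnoses hierarchically. As a far simpler illustration of note-based diagnosis (TF-IDF plus multinomial logistic regression on invented English snippets, nothing like the actual Chinese-language pipeline):

```python
# Far simpler stand-in for the deep NLP diagnostic framework described above:
# a TF-IDF bag-of-words over hypothetical note snippets with a multinomial
# logistic regression assigning a pediatric diagnosis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "3-year-old with fever cough wheeze and chest retractions",
    "toddler with barking cough stridor worse at night",
    "infant with vomiting watery diarrhea and mild dehydration",
    "school-age child with fever sore throat tonsillar exudate",
    "cough and wheeze responsive to bronchodilator, afebrile",
    "profuse watery stools after daycare outbreak, drinking well",
]
labels = ["bronchiolitis", "croup", "gastroenteritis",
          "pharyngitis", "bronchiolitis", "gastroenteritis"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(notes, labels)
print(clf.predict(["2-year-old with night-time barking cough and stridor"]))
# Expected to classify the new note as 'croup' given the toy training set.
```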
  • 168.
Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence (Liang et al., Nat Med 2019 Feb) — NLP extraction and diagnostic hierarchy
Across the classes of clinical data in the information model (history of present illness, physical examination, laboratory testing, and PACS picture archiving and communication system reports), the F1 scores of the information-extraction model exceeded 90% except in one instance, which was for categorical variables. The extracted features are then evaluated along a diagnostic decision tree, similar to how a human physician might evaluate a patient's features to achieve a diagnosis based on the same clinical data incorporated into the information model.
Fig. 2 | Hierarchy of the diagnostic framework in a large pediatric cohort. A hierarchical logistic regression classifier was used to establish a diagnostic system based on anatomic divisions. An organ-based approach was used, wherein diagnoses were first separated into broad organ systems, then subsequently divided into organ subsystems and/or into more specific diagnosis groups — for example, respiratory diseases split into upper respiratory diseases (acute upper respiratory infection, sinusitis, acute laryngitis, acute pharyngitis) and lower respiratory diseases (bronchitis, bronchiolitis, pneumonia, asthma, acute tracheitis), alongside systemic generalized diseases (varicella, influenza, infectious mononucleosis, sepsis, exanthema subitum), neuropsychiatric diseases (tic disorder, ADHD, bacterial meningitis, encephalitis, convulsions), genitourinary diseases, and gastrointestinal and mouth-related diseases.
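Figure 2 describes an organ-based cascade: a case is first routed to a broad organ system, then to a subsystem or specific diagnosis group, each step handled by a logistic regression classifier. The sketch below is a minimal two-level illustration of that idea with scikit-learn; the feature vectors, class tree, and training data are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class HierarchicalDiagnosis:
    """Two-level cascade: predict the organ system first, then a diagnosis within it."""

    def __init__(self, tree: dict):
        self.tree = tree  # e.g. {"respiratory": [...], "gastrointestinal": [...]}
        self.root = LogisticRegression(max_iter=1000)
        self.leaves = {system: LogisticRegression(max_iter=1000) for system in tree}

    def fit(self, X, y_system, y_diagnosis):
        self.root.fit(X, y_system)
        y_system, y_diagnosis = np.asarray(y_system), np.asarray(y_diagnosis)
        for system in self.tree:
            mask = y_system == system
            self.leaves[system].fit(X[mask], y_diagnosis[mask])
        return self

    def predict(self, X):
        systems = self.root.predict(X)
        return [(s, self.leaves[s].predict(x.reshape(1, -1))[0])
                for x, s in zip(X, systems)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))                      # placeholder EHR feature vectors
    y_sys = rng.choice(["respiratory", "gastrointestinal"], size=200)
    y_dx = np.where(y_sys == "respiratory",
                    rng.choice(["sinusitis", "pneumonia"], size=200),
                    rng.choice(["diarrhea", "stomatitis"], size=200))
    model = HierarchicalDiagnosis({"respiratory": ["sinusitis", "pneumonia"],
                                   "gastrointestinal": ["diarrhea", "stomatitis"]})
    print(model.fit(X, y_sys, y_dx).predict(X[:3]))
```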
  • 169.
Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence (Liang et al., Nat Med 2019 Feb) — diagnostic performance versus physicians
Performance of the system was especially strong for the common conditions of acute upper respiratory infection and sinusitis, both of which were diagnosed with an accuracy of 0.95 between the machine-predicted diagnosis and the human physician-generated diagnosis. Dangerous conditions tend to be less common, and the diagnostic hierarchy decision tree can be adjusted to what is most appropriate for the clinical situation. In terms of implementation, the authors foresee this type of AI-assisted diagnostic system being integrated into clinical practice in several ways; for example, it could assist with triage procedures.
Table 2 | Diagnostic performance (F1 score) of the AI model and five physician groups
Disease condition          | Model | Grp 1 | Grp 2 | Grp 3 | Grp 4 | Grp 5
Asthma                     | 0.920 | 0.801 | 0.837 | 0.904 | 0.890 | 0.935
Encephalitis               | 0.837 | 0.947 | 0.961 | 0.950 | 0.959 | 0.965
Gastrointestinal disease   | 0.865 | 0.818 | 0.872 | 0.854 | 0.896 | 0.893
Group: 'Acute laryngitis'  | 0.786 | 0.808 | 0.730 | 0.879 | 0.940 | 0.943
Group: 'Pneumonia'         | 0.888 | 0.829 | 0.767 | 0.946 | 0.952 | 0.972
Group: 'Sinusitis'         | 0.932 | 0.839 | 0.797 | 0.896 | 0.873 | 0.870
Lower respiratory          | 0.803 | 0.803 | 0.815 | 0.910 | 0.903 | 0.935
Mouth-related diseases     | 0.897 | 0.818 | 0.872 | 0.854 | 0.896 | 0.893
Neuropsychiatric disease   | 0.895 | 0.925 | 0.963 | 0.960 | 0.962 | 0.906
Respiratory                | 0.935 | 0.808 | 0.769 | 0.890 | 0.907 | 0.917
Systemic or generalized    | 0.925 | 0.879 | 0.907 | 0.952 | 0.907 | 0.944
Upper respiratory          | 0.929 | 0.817 | 0.754 | 0.884 | 0.916 | 0.916
Root                       | 0.889 | 0.843 | 0.863 | 0.908 | 0.903 | 0.912
Average F1 score           | 0.885 | 0.841 | 0.839 | 0.907 | 0.915 | 0.923
F1 scores were used to evaluate diagnostic performance of the model, two junior physician groups (groups 1 and 2), and three senior physician groups (groups 3, 4, and 5); 'Root' is the first level of the diagnosis classification. The model performed better than the junior physician groups but slightly worse than the three experienced physician groups.
• Across multiple organ systems,
• higher accuracy than the junior staff,
• but lower accuracy than the senior staff.
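Table 2 summarizes everything with per-category F1 scores and their unweighted average. As a reminder of how those numbers are produced, the snippet below computes per-class F1 and the macro average with scikit-learn; the labels are toy placeholders, not study data.

```python
from sklearn.metrics import f1_score

# Toy ground-truth and predicted diagnosis labels (placeholders, not study data).
y_true = ["asthma", "sinusitis", "pneumonia", "asthma", "sinusitis", "pneumonia", "asthma"]
y_pred = ["asthma", "sinusitis", "asthma",    "asthma", "pneumonia", "pneumonia", "asthma"]

labels = ["asthma", "sinusitis", "pneumonia"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)
macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean, like "Average F1 score"

for label, f1 in zip(labels, per_class):
    print(f"{label:10s} F1 = {f1:.3f}")
print(f"macro-average F1 = {macro:.3f}")
```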
  • 170.
Three types of medical AI:
• Analysis of complex medical data and deriving insights
• Analysis/interpretation of medical imaging and pathology data
• Monitoring of continuous data for prevention and prediction
  • 172.
REVIEW ARTICLE | FOCUS | https://doi.org/10.1038/s41591-018-0300-7
High-performance medicine: the convergence of human and artificial intelligence
Eric J. Topol, Department of Molecular Medicine, Scripps Research, La Jolla, CA, USA. Nature Medicine, Vol 25, January 2019, 44–56.
The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage, across all sectors. In medicine, this is beginning to have an impact at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health. The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these applications, will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely be actualized, but whether that will be used to improve the patient–doctor relationship or facilitate its erosion remains to be seen.
Medicine is at the crossroad of two major trends. The first is a failed business model, with increasing expenditures and jobs allocated to healthcare, but with deteriorating key outcomes, including reduced life expectancy and high infant, childhood, and maternal mortality in the United States1,2. This exemplifies a paradox that is not at all confined to American medicine: investment of more human capital with worse human health outcomes. The second is the generation of data in massive quantities, from sources such as high-resolution medical imaging, biosensors with continuous output of physiologic metrics, genome sequencing, and electronic medical records. The limits on analysis of such data by humans alone have clearly been exceeded, necessitating an increased reliance on machines. Accordingly, at the same time that there is more dependence than ever on humans to provide healthcare, algorithms are desperately needed to help. Yet the integration of human and artificial intelligence (AI) for medicine has barely begun.
Looking deeper, there are notable, longstanding deficiencies in healthcare that are responsible for its path of diminishing returns. These include a large number of serious diagnostic errors, mistakes in treatment, an enormous waste of resources, inefficiencies in workflow, inequities, and inadequate time between patients and clinicians3,4. Eager for improvement, leaders in healthcare and computer scientists have asserted that AI might have a role in addressing all of these problems. That might eventually be the case, but researchers are at the starting gate in the use of neural networks to ameliorate the ills of the practice of medicine. In this Review, I have gathered much of the existing base of evidence for the use of AI in medicine, laying out the opportunities and pitfalls.
Artificial intelligence for clinicians. Almost every type of clinician, ranging from specialty doctor to paramedic, will be using AI technology, and in particular deep learning, in the future. This largely involves pattern recognition using deep neural networks (DNNs) that can help interpret medical scans, pathology slides, skin lesions, retinal images, electrocardiograms, endoscopy, faces, and vital signs. The neural net interpretation is typically compared with physicians' assessments using a plot of true-positive versus false-positive rates, known as a receiver operating characteristic (ROC), for which the area under the curve (AUC) is used to express the level of accuracy.
Radiology. One field that has attracted particular attention for application of AI is radiology5. Chest X-rays are the most common type of medical scan, with more than 2 billion performed worldwide per year. In one study, the accuracy of an algorithm based on a 121-layer convolutional neural network in detecting pneumonia in over 112,000 labeled frontal chest X-ray images was compared with that of four radiologists, and the conclusion was that the algorithm outperformed the radiologists. However, the algorithm's AUC of 0.76, although somewhat better than that for two previously tested DNN algorithms for chest X-ray interpretation5, is far from optimal. In addition, the test used in this study is not necessarily comparable with the daily tasks of a radiologist, who will diagnose much more than pneumonia in any given scan, and to further validate the conclusions a comparison with results from more than four radiologists should be made. A team at Google used an algorithm that analyzed the same image set to make 14 different diagnoses, resulting in AUC scores that ranged from 0.63 for pneumonia to 0.87 for heart enlargement or a collapsed lung6. More recently, a DNN currently in use in hospitals in India for interpretation of four different chest X-ray key findings was shown to be at least as accurate as four radiologists7. For the narrower task of detecting cancerous pulmonary nodules on a chest X-ray, a DNN that retrospectively assessed scans from over 34,000 patients achieved a level of accuracy exceeding 17 of 18 radiologists8. It can be difficult for emergency room doctors to accurately diagnose wrist fractures, but a DNN led to marked improvement, increasing sensitivity from 81% to 92% and reducing misinterpretation by 47% (ref. 9). Similarly, DNNs have been applied across a wide variety of medical scans, including bone films for fractures and estimation of aging10–12, classification of tuberculosis13, and vertebral compression fractures14; computed tomography (CT) scans for lung nodules15, liver masses16, pancreatic cancer17, and coronary calcium score18; brain scans for evidence of hemorrhage19, head trauma20, and acute referrals21; magnetic resonance imaging22; echocardiograms23,24; and mammographies25,26. A unique imaging-recognition study focusing on the breadth of acute neurologic events, such as stroke or head trauma, was carried out on over 37,000 head CT 3-D scans, which the algorithm analyzed for 13 different anatomical findings versus gold-standard labels annotated by expert radiologists, achieving an AUC of 0.73 (ref. 27). A simulated prospective, double-blind, randomized control trial conducted with real cases from the dataset showed that the deep-learning algorithm could interpret scans 150 times faster than radiologists (1.2 versus 177 seconds), although its diagnostic accuracy in screening acute neurologic scans was poorer than human performance.
Table 1 | Peer-reviewed publications of AI algorithms compared with doctors (excerpt)
Radiology/neurology: CT head, acute neurological events (Titano et al.27); CT head for brain hemorrhage (Arbabshirani et al.19); CT head for trauma (Chilamkurthy et al.20); CXR for metastatic lung nodules (Nam et al.8); CXR for multiple findings (Singh et al.7); mammography for breast density (Lehman et al.26); wrist X-ray* (Lindsey et al.9)
Pathology: breast cancer (Ehteshami Bejnordi et al.41); lung cancer with driver mutation (Coudray et al.33); brain tumors with methylation (Capper et al.45); breast cancer metastases* (Steiner et al.35); breast cancer metastases (Liu et al.34)
Dermatology: skin cancers (Esteva et al.47); melanoma (Haenssle et al.48); skin lesions (Han et al.49)
Ophthalmology: diabetic retinopathy (Gulshan et al.51; Abramoff et al.31*; Kanagasingam et al.32*); congenital cataracts (Long et al.38); retinal diseases on OCT (De Fauw et al.56); macular degeneration (Burlina et al.52); retinopathy of prematurity (Brown et al.60); AMD and diabetic retinopathy (Kermany et al.53)
Gastroenterology: polyps at colonoscopy (Mori et al.36*; Wang et al.37)
Cardiology: echocardiography (Madani et al.23; Zhang et al.24)
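The review repeatedly summarizes algorithm-versus-clinician comparisons with ROC curves and AUC values. As a reference for how those numbers are produced, here is a minimal sketch that computes an ROC curve and its AUC from predicted scores with scikit-learn; the labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)              # 0 = no finding, 1 = finding
# Synthetic model scores: positives drawn from a slightly higher distribution.
scores = rng.normal(loc=1.2 * y_true, scale=1.0)

fpr, tpr, thresholds = roc_curve(y_true, scores)   # ROC = true-positive vs false-positive rate
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f} (from {len(thresholds)} threshold points)")
```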
  • 173.
ORIGINAL RESEARCH • THORACIC IMAGING
Development and Validation of Deep Learning–based Automatic Detection Algorithm for Malignant Pulmonary Nodules on Chest Radiographs
Ju Gang Nam, Sunggyun Park, Eui Jin Hwang, Jong Hyuk Lee, Kwang-Nam Jin, Kun Young Lim, Thienkai Huy Vu, Jae Ho Sohn, Sangheum Hwang, Jin Mo Goo, Chang Min Park. Radiology 2018; https://doi.org/10.1148/radiol.2018180237
Purpose: To develop and validate a deep learning–based automatic detection algorithm (DLAD) for malignant pulmonary nodules on chest radiographs and to compare its performance with physicians, including thoracic radiologists.
Materials and Methods: For this retrospective study, DLAD was developed as a convolutional neural network by using 43,292 chest radiographs (normal-to-nodule radiograph ratio, 34,067:9,225) in 34,676 patients (healthy-to-nodule ratio, 30,784:3,892; 19,230 men and 15,446 women; mean age about 52–53 years) obtained between 2010 and 2015, labeled and partially annotated by 13 board-certified radiologists. Radiograph classification and nodule detection performances of DLAD were validated by using one internal and four external data sets from three South Korean hospitals and one U.S. hospital, evaluated with the area under the receiver operating characteristic curve (AUROC) and the jackknife alternative free-response ROC (JAFROC) figure of merit (FOM), respectively. An observer performance test involving 18 physicians, including nine board-certified radiologists, was conducted by using one of the four external validation data sets; performances of DLAD, physicians, and physicians assisted with DLAD were evaluated and compared.
Results: Across one internal and four external validation data sets, radiograph classification and nodule detection performances of DLAD were in the range of 0.92–0.99 (AUROC) and 0.831–0.924 (JAFROC FOM), respectively. DLAD showed a higher AUROC and JAFROC FOM in the observer performance test than 17 of 18 and 15 of 18 physicians, respectively (P < .05), and all physicians showed improved nodule detection performance with DLAD (mean JAFROC FOM improvement, 0.043; range, 0.006–0.190; P < .05).
Conclusion: This deep learning–based automatic detection algorithm outperformed physicians in radiograph classification and nodule detection performance for malignant pulmonary nodules on chest radiographs, and it enhanced physicians' performances when used as a second reader.
• 43,292 chest PA radiographs (normal:nodule = 34,067:9,225)
• Labeled/annotated by 13 board-certified radiologists
• DLAD validated on 1 internal + 4 external datasets: Seoul National University Hospital / Boramae Medical Center / National Cancer Center / UCSF
• Classification / lesion localization
• AI vs. physicians vs. AI + physicians
• Compared against physicians at various levels of experience: non-radiology physicians / radiology residents / board-certified radiologists / thoracic radiologists
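The key result is the reader study: every physician's detection performance improved when DLAD was used as a second reader. The computation of the JAFROC figure of merit is specialized and is not reproduced here; the sketch below uses plain AUROC as a simplified proxy, with synthetic reader and model scores, to show how a per-reader "with versus without AI" improvement might be tabulated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n_cases, n_readers = 200, 18
y = rng.integers(0, 2, size=n_cases)                 # 1 = radiograph contains a malignant nodule

model_score = y + rng.normal(0, 0.8, n_cases)        # synthetic AI confidence per case
improvements = []
for _ in range(n_readers):
    reader_alone = y + rng.normal(0, 1.2, n_cases)               # synthetic unaided reader score
    reader_with_ai = 0.7 * reader_alone + 0.3 * model_score      # second-reader style combination
    improvements.append(roc_auc_score(y, reader_with_ai) - roc_auc_score(y, reader_alone))

print(f"mean AUROC improvement with AI assistance: {np.mean(improvements):+.3f}")
```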
  • 174.
Nam et al., Figure 1: Images in a 78-year-old female patient with a 1.9-cm part-solid nodule at the left upper lobe. (a) The nodule was faintly visible on the chest radiograph (arrowheads) and was detected by 11 of 18 observers. (b) At contrast-enhanced CT examination, biopsy confirmed lung adenocarcinoma (arrow). (c) DLAD reported the nodule with a confidence level of 2, resulting in its detection by an additional five radiologists and an elevation in its confidence by eight radiologists.
Figure 2: Images in a 64-year-old male patient with a 2.2-cm lung adenocarcinoma at the left upper lobe. (a) The nodule was faintly visible on the chest radiograph (arrowheads) and was detected by seven of 18 observers. (b) Biopsy confirmed lung adenocarcinoma in the left upper lobe on the contrast-enhanced CT image (arrow). (c) DLAD reported the nodule with a confidence level of 2, resulting in its detection by an additional two radiologists and an elevated confidence level of the nodule by two radiologists.
  • 175.
• An AI that reads hand X-ray images and estimates the patient's bone age
• Conventionally, physicians read the film by comparing the X-ray against standard reference images, for example with the Greulich-Pyle method
• The AI finds sex- and age-specific patterns in reference-standard images, expresses the similarity as probabilities, and retrieves the most similar standard images
• Can help physicians diagnose precocious puberty or growth delay
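The slide describes the output as similarity-to-reference probabilities plus retrieval of the closest atlas images. The commercial product's actual method is not reproduced here; the snippet below is only a hypothetical sketch of that interface, using cosine similarity between placeholder image embeddings and a softmax over reference (sex, bone-age) classes.

```python
import numpy as np

def bone_age_probabilities(query_emb, ref_embs, ref_labels, temperature=0.1, top_k=3):
    """Cosine similarity of a query-image embedding to reference-standard embeddings,
    converted to probabilities over (sex, bone-age) classes; returns the top-k matches."""
    q = query_emb / np.linalg.norm(query_emb)
    R = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = R @ q
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    top = np.argsort(sims)[::-1][:top_k]
    return {ref_labels[i]: float(probs[i]) for i in top}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref_embs = rng.normal(size=(20, 64))                 # placeholder CNN embeddings of atlas images
    ref_labels = [f"{sex}, {age} y" for sex in ("M", "F") for age in range(5, 15)]
    query = ref_embs[3] + rng.normal(0, 0.1, 64)         # a query close to reference #3
    print(bone_age_probabilities(query, ref_embs, ref_labels))
```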
  • 176.
Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs
Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C. Nelson, Jessica L. Mega, Dale R. Webster. JAMA, published online November 29, 2016; doi:10.1001/jama.2016.17216 (Google Inc., EyePACS, Aravind Eye Care System, Sankara Nethralaya, Verily Life Sciences, Brigham and Women's Hospital / Harvard Medical School).
IMPORTANCE Deep learning is a family of computational methods that allow an algorithm to program itself by learning from a large set of examples that demonstrate the desired behavior, removing the need to specify rules explicitly. Application of these methods to medical imaging requires further assessment and validation.
OBJECTIVE To apply deep learning to create an algorithm for automated detection of diabetic retinopathy and diabetic macular edema in retinal fundus photographs.
DESIGN AND SETTING A specific type of neural network optimized for image classification, called a deep convolutional neural network, was trained using a retrospective development data set of 128,175 retinal images, which were graded 3 to 7 times for diabetic retinopathy, diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists and ophthalmology senior residents between May and December 2015. The resultant algorithm was validated in January and February 2016 using 2 separate data sets, both graded by at least 7 US board-certified ophthalmologists with high intragrader consistency.
EXPOSURE Deep learning–trained algorithm.
MAIN OUTCOMES AND MEASURES The sensitivity and specificity of the algorithm for detecting referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy, referable diabetic macular edema, or both, were generated based on the reference standard of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2 operating points selected from the development set, one selected for high specificity and another for high sensitivity.
RESULTS The EyePACS-1 data set consisted of 9,963 images from 4,997 patients (mean age, 54.4 years; 62.2% women; prevalence of RDR, 683/8,878 fully gradable images [7.8%]); the Messidor-2 data set had 1,748 images from 874 patients (mean age, 57.6 years; 42.6% women; prevalence of RDR, 254/1,745 fully gradable images [14.6%]). For detecting RDR, the algorithm had an area under the receiver operating curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and 0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high specificity, for EyePACS-1 the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity was 98.1% (95% CI, 97.8%-98.5%); for Messidor-2 the sensitivity was 87.0% (95% CI, 81.1%-91.0%) and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and specificity was 93.4%, and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%.
CONCLUSIONS AND RELEVANCE In this evaluation of retinal fundus photographs from adults with diabetes, an algorithm based on deep machine learning had high sensitivity and specificity for detecting referable diabetic retinopathy. Further research is necessary to determine the feasibility of applying this algorithm in the clinical setting and to determine whether use of the algorithm could lead to improved care and outcomes compared with current ophthalmologic assessment.
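The algorithm is reported at two operating points chosen on the development set: one tuned for high specificity, one for high sensitivity. The sketch below shows one plausible way to pick such thresholds from an ROC curve; the sensitivity and specificity targets and the data are made up, not the paper's procedure.

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_operating_points(y_true, scores, min_sensitivity=0.975, min_specificity=0.98):
    """Choose two thresholds from an ROC curve: a high-sensitivity one (screening use)
    and a high-specificity one (referral confirmation)."""
    fpr, tpr, thr = roc_curve(y_true, scores)
    spec = 1 - fpr
    hs_idx = int(np.argmax(tpr >= min_sensitivity))        # first point meeting the sensitivity target
    hp_candidates = np.where(spec >= min_specificity)[0]   # points still meeting the specificity target
    hp_idx = int(hp_candidates[-1]) if len(hp_candidates) else 0
    return {"high_sensitivity": (thr[hs_idx], tpr[hs_idx], spec[hs_idx]),
            "high_specificity": (thr[hp_idx], tpr[hp_idx], spec[hp_idx])}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 2000)
    s = rng.normal(loc=2.0 * y, scale=1.0)                 # synthetic, reasonably separated scores
    for name, (t, sens, spec) in pick_operating_points(y, s).items():
        print(f"{name}: threshold={t:.2f}, sensitivity={sens:.3f}, specificity={spec:.3f}")
```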
  • 177.
Accuracy of the fundus-reading AI
• AUC = 0.991 on EyePACS-1 and 0.990 on Messidor-2
• Sensitivity and specificity on par with the panel of 7–8 ophthalmologists
• F-score: 0.95 (vs. 0.91 for the human ophthalmologists)
Figure 2. Validation Set Performance for Referable Diabetic Retinopathy. Performance of the algorithm (black curve) and ophthalmologists (colored circles) for the presence of referable diabetic retinopathy (moderate or worse diabetic retinopathy or referable diabetic macular edema) on (A) EyePACS-1 (8,788 fully gradable images; AUC 99.1%, 95% CI 98.8%-99.3%) and (B) Messidor-2 (1,745 fully gradable images; AUC 99.0%, 95% CI 98.6%-99.5%). The black diamonds correspond to the sensitivity and specificity of the algorithm at the high-sensitivity and high-specificity operating points. In A, for the high-sensitivity operating point, specificity was 93.4% (95% CI, 92.8%-94.0%) and sensitivity was 97.5% (95% CI, 95.8%-98.7%); for the high-specificity operating point, specificity was 98.1% (95% CI, 97.8%-98.5%) and sensitivity was 90.3% (95% CI, 87.5%-92.7%). In B, for the high-sensitivity operating point, specificity was 93.9% (95% CI, 92.4%-95.3%) and sensitivity was 96.1% (95% CI, 92.4%-98.3%); for the high-specificity operating point, specificity was 98.5% (95% CI, 97.7%-99.1%) and sensitivity was 87.0% (95% CI, 81.1%-91.0%). There were 8 ophthalmologists who graded EyePACS-1 and 7 who graded Messidor-2. AUC indicates area under the receiver operating characteristic curve.
  • 178.
LETTER doi:10.1038/nature21056
Dermatologist-level classification of skin cancer with deep neural networks
Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau & Sebastian Thrun. Nature, 2017. (Departments of Electrical Engineering, Dermatology, Pathology, and Computer Science, Stanford University; Dermatology Service, Veterans Affairs Palo Alto Health Care System.)
Skin cancer, the most common human malignancy1–3, is primarily diagnosed visually, beginning with an initial clinical screening and followed potentially by dermoscopic analysis, a biopsy and histopathological examination. Automated classification of skin lesions using images is a challenging task owing to the fine-grained variability in the appearance of skin lesions. Deep convolutional neural networks (CNNs)4,5 show potential for general and highly variable tasks across many fine-grained object categories6–11. Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images—two orders of magnitude larger than previous datasets12—consisting of 2,032 different diseases. We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images with two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi. The first case represents the identification of the most common cancers, the second the identification of the deadliest skin cancer. The CNN achieves performance on par with all tested experts across both tasks, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists. Outfitted with deep neural networks, mobile devices can potentially extend the reach of dermatologists outside of the clinic: it is projected that 6.3 billion smartphone subscriptions will exist by the year 2021 (ref. 13), potentially providing low-cost universal access to vital diagnostic care.
There are 5.4 million new cases of skin cancer in the United States2 every year. One in five Americans will be diagnosed with a cutaneous malignancy in their lifetime. Although melanomas represent fewer than 5% of all skin cancers in the United States, they account for approximately 75% of all skin-cancer-related deaths and are responsible for over 10,000 deaths annually in the United States alone. Early detection is critical, as the estimated 5-year survival rate for melanoma drops from over 99% if detected in its earliest stages to about 14% if detected in its latest stages. We developed a computational method which may allow medical practitioners and patients to proactively track skin lesions and detect cancer earlier. By creating a novel disease taxonomy, and a disease-partitioning algorithm that maps individual diseases into training classes, we are able to build a deep learning system for automated dermatology.
Previous work in dermatological computer-aided classification12,14,15 has lacked the generalization capability of medical practitioners owing to insufficient data and a focus on standardized tasks such as dermoscopy16–18 and histological image classification19–22. Dermoscopy images are acquired via a specialized instrument and histological images via invasive biopsy and microscopy, so both modalities yield highly standardized images. Photographic images (for example, smartphone images) exhibit variability in factors such as zoom, angle and lighting, making classification substantially more challenging23,24. We overcome this challenge by using a data-driven approach—1.41 million pre-training and training images make classification robust to photographic variability. Many previous techniques require extensive preprocessing, lesion segmentation and extraction of domain-specific visual features before classification. By contrast, our system requires no hand-crafted features; it is trained end-to-end directly from image labels and raw pixels, with a single network for both photographic and dermoscopic images. The existing body of work uses small datasets of typically less than a thousand images of skin lesions16,18,19, which, as a result, do not generalize well to new images. We demonstrate generalizable classification with a new dermatologist-labelled dataset of 129,450 clinical images, including 3,374 dermoscopy images.
In this paper we outline the development of a CNN that matches the performance of dermatologists at three key diagnostic tasks: melanoma classification, melanoma classification using dermoscopy, and carcinoma classification; the comparisons are restricted to image-based classification. We utilize a GoogleNet Inception v3 CNN architecture9 that was pre-trained on approximately 1.28 million images (1,000 object categories) from the 2014 ImageNet Large Scale Visual Recognition Challenge6, and train it on our dataset using transfer learning28. The CNN is trained using 757 disease classes. Our dataset is composed of dermatologist-labelled images organized in a tree-structured taxonomy of 2,032 diseases, in which the individual diseases form the leaf nodes; the images come from 18 different clinician-curated, open-access online repositories, as well as from clinical data from Stanford University Medical Center. We split our dataset into 127,463 training and validation images and 1,942 biopsy-labelled test images. To take advantage of fine-grained information contained within the taxonomy structure, we develop an algorithm to partition diseases into fine-grained training classes (for example, amelanotic melanoma and acrolentiginous melanoma). During inference, the CNN outputs a probability distribution over these fine classes; to recover the probabilities for coarser-level classes of interest (for example, melanoma), we sum the probabilities of their descendants. We validate the effectiveness of the algorithm in two ways, using nine-fold cross-validation: first, using a three-class disease partition—the first-level nodes of the taxonomy, which represent benign lesions, malignant lesions and non-neoplastic lesions.
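The paper states that an ImageNet-pretrained Inception v3 network was fine-tuned on the dermatology dataset (757 training classes) via transfer learning. The Keras sketch below illustrates that generic recipe only; the paper's preprocessing, hyperparameters, and class-partition algorithm are not reproduced, and the training datasets are placeholders.

```python
import tensorflow as tf

NUM_CLASSES = 757  # the paper trains over 757 fine-grained disease classes

# ImageNet-pretrained backbone, classification head removed.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = True  # fine-tune end-to-end; set False for feature extraction only

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # datasets not shown here
# Coarse-class probabilities (e.g. "melanoma") can then be obtained by summing the
# predicted probabilities of the corresponding fine-grained descendant classes.
```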
  • 179.
Skin cancer classification accuracy: deep learning vs. dermatologists
ROC panels (sensitivity vs. specificity) comparing the algorithm (curve) with individual dermatologists (points) and the average dermatologist. Panels: carcinoma (135 and 707 images), melanoma (130 and 225 images), melanoma on dermoscopy images (111 and 1,010 images); reported algorithm AUCs of 0.96 (vs. 25 dermatologists), 0.94 (vs. 22 dermatologists), and 0.91 (vs. 21 dermatologists).
• A substantial number of the dermatologists were less accurate than the AI.
• The dermatologists' average performance was also below that of the AI.
  • 180.
ARTICLES | https://doi.org/10.1038/s41591-018-0177-5
Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning
Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L. Moreira, Narges Razavian and Aristotelis Tsirigos. Nature Medicine, 2018.
Visual inspection of histopathology slides is one of the main methods used by pathologists to assess the stage, type and subtype of lung tumors. Adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) are the most prevalent subtypes of lung cancer, and their distinction requires visual inspection by an experienced pathologist. In this study, we trained a deep convolutional neural network (Inception v3) on whole-slide images obtained from The Cancer Genome Atlas to accurately and automatically classify them into LUAD, LUSC or normal lung tissue. The performance of our method is comparable to that of pathologists, with an average area under the curve (AUC) of 0.97. Our model was validated on independent datasets of frozen tissues, formalin-fixed paraffin-embedded (FFPE) tissues and biopsies. Furthermore, we trained the network to predict the ten most commonly mutated genes in LUAD. We found that six of them—STK11, EGFR, FAT1, SETBP1, KRAS and TP53—can be predicted from pathology images, with AUCs from 0.733 to 0.856 as measured on a held-out population. These findings suggest that deep-learning models can assist pathologists in the detection of cancer subtype or gene mutations. Our approach can be applied to any cancer type, and the code is available at https://github.com/ncoudray/DeepPATH.
Classification of lung cancer type is a key diagnostic step because the available treatment options, including conventional chemotherapy and, more recently, targeted therapies, differ for LUAD and LUSC. A LUAD diagnosis will also prompt the search for molecular biomarkers and sensitizing mutations: for example, EGFR mutations (present in about 20% of LUAD) and ALK rearrangements (present in under 5% of LUAD) currently have FDA-approved targeted therapies, whereas mutations in other genes such as KRAS and TP53 are very common (about 25% and 50%, respectively) but have proven to be particularly challenging drug targets so far. Lung biopsies are typically used to diagnose lung cancer type and stage; virtual microscopy of stained tissue images is typically acquired at 20x to 40x magnification, generating very large two-dimensional images (10,000 to more than 100,000 pixels in each dimension) that are often challenging to inspect exhaustively, and the distinction between LUAD and LUSC is not always clear, particularly in poorly differentiated tumors, for which ancillary studies are recommended. To assist experts, automatic analysis of lung cancer whole-slide images has recently been studied for survival prediction and classification: Yu et al. combined conventional thresholding and image processing with machine-learning methods (random forests, support vector machines, naive Bayes), achieving an AUC of about 0.85 in distinguishing normal from tumor slides and about 0.75 in distinguishing LUAD from LUSC; more recent deep-learning work achieved an AUC of 0.83 on TCGA lung tumor slides; plasma DNA analysis reaches an AUC of about 0.94, and immunochemical markers about 0.941, for distinguishing LUAD from LUSC. Here, we demonstrate how the field can further benefit from deep learning with a strategy based on convolutional neural networks (CNNs).
• Accurately distinguishes normal tissue, adenocarcinoma (LUAD), and squamous cell carcinoma (LUSC)
• AUC above 0.99 for tumor vs. normal and above 0.95 for LUAD vs. LUSC
• Distinguishing any one of normal/LUAD/LUSC from the other two also reaches AUC above 0.9 at both 5x and 20x
• This accuracy is on par with three pathologists
• Of the cases the deep learning model got wrong, 50% were also misclassified by at least one of the three pathologists
• Of the cases that at least one of the three pathologists got wrong, 83% were correctly classified by the deep learning model
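Because whole-slide images are far too large to feed to a CNN directly, this line of work classifies small tiles and then aggregates tile-level predictions into a slide-level call. The sketch below illustrates that generic tiling-and-pooling pattern; the tile size, the stand-in classifier, and mean pooling are assumptions for illustration, not the paper's exact aggregation rule.

```python
import numpy as np

def tile_slide(slide: np.ndarray, tile: int = 512):
    """Yield non-overlapping tiles from a whole-slide image array of shape (H, W, 3)."""
    h, w, _ = slide.shape
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            yield slide[y:y + tile, x:x + tile]

def classify_slide(slide, tile_classifier, classes=("Normal", "LUAD", "LUSC")):
    """Run a per-tile classifier and average the tile probabilities (one common pooling choice)."""
    probs = np.stack([tile_classifier(t) for t in tile_slide(slide)])
    slide_prob = probs.mean(axis=0)
    return dict(zip(classes, slide_prob)), probs   # slide-level call + per-tile heatmap source

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_slide = rng.integers(0, 255, size=(2048, 2048, 3), dtype=np.uint8)
    fake_cnn = lambda t: rng.dirichlet(np.ones(3))     # stand-in for a trained tile CNN
    slide_call, _ = classify_slide(fake_slide, fake_cnn)
    print(slide_call)
```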
  • 181.
• Furthermore, the AI developed on TCGA was applied to completely independent LUAD/LUSC data obtained in three different ways — fresh frozen, FFPE, and biopsy specimens —
• and still read most of them accurately, with AUCs mostly above 0.9.
Reported per-cohort AUCs (5x / 20x): frozen sections — LUAD 0.919 / 0.913, LUSC 0.977 / 0.941; FFPE sections — LUAD 0.861 / 0.833, LUSC 0.975 / 0.932; biopsies — LUAD 0.871 / 0.834, LUSC 0.928 / 0.861.
Fig. 2 | Classification of presence and type of tumor on alternative cohorts. a–c, ROC curves from tests on frozen sections (n = 98 biologically independent slides) (a), FFPE sections (n = 140) (b) and biopsies (n = 102) (c) from NYU Langone Medical Center; next to each plot are examples of raw images overlaid (light gray) with the mask generated by a pathologist and the corresponding heatmaps from the three-way classifier. Sections obtained from biopsies are usually much smaller, which reduces the number of tiles per slide, but performance remains consistent for the 102 samples tested, and the accuracy of the classification does not correlate with the sample size or the size of the area selected by the pathologist.
  • 182.
ARTICLES | https://doi.org/10.1038/s41551-018-0301-3
Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy
Pu Wang, Xiao Xiao, Jeremy R. Glissen Brown, Tyler M. Berzin, Mengtian Tu, Fei Xiong, Xiao Hu, Peixi Liu, Yan Song, Di Zhang, Xue Yang, Liangping Li, Jiong He, Xin Yi, Jingjia Liu and Xiaogang Liu. Nature Biomedical Engineering, 2018.
The detection and removal of precancerous polyps via colonoscopy is the gold standard for the prevention of colon cancer. However, the detection rate of adenomatous polyps can vary significantly among endoscopists. Here, we show that a machine-learning algorithm can detect polyps in clinical colonoscopies, in real time and with high sensitivity and specificity. We developed the deep-learning algorithm by using data from 1,290 patients, and validated it on 27,113 newly collected colonoscopy images from 1,138 patients with at least one detected polyp (per-image sensitivity, 94.38%; per-image specificity, 95.92%; area under the receiver operating characteristic curve, 0.984), on a public database of 612 polyp-containing images (per-image sensitivity, 88.24%), on 138 colonoscopy videos with histologically confirmed polyps (per-image sensitivity, 91.64%; per-polyp sensitivity, 100%), and on 54 unaltered full-range colonoscopy videos without polyps (per-image specificity, 95.40%). By using a multi-threaded processing system, the algorithm can process at least 25 frames per second with a latency of 76.80 ± 5.60 ms in real-time video analysis. The software may aid endoscopists while performing colonoscopies, and help assess differences in polyp and adenoma detection performance among endoscopists.
Evidence has shown that each 1.0% increase in adenoma detection rate leads to a 3.0% decrease in the risk of interval colorectal cancer. Although more than 14 million colonoscopies are performed in the United States annually, the adenoma miss rate is estimated to be 6–27%, and certain polyps (smaller polyps, flat polyps, polyps in the left colon) may be missed more frequently. A polyp may be missed either because it was never in the visual field or because it was in the visual field but not recognized; a real-time automatic polyp-detection system could serve as an effective second observer that draws the endoscopist's eye, in real time, to concerning lesions. The algorithm was developed using 5,545 colonoscopy images from 1,290 patients examined at the Endoscopy Center of Sichuan Provincial People's Hospital between January 2007 and December 2015 (3,634 images with polyps, 1,911 without), with the presence of each polyp annotated by experienced endoscopists, and was validated on four independent datasets: datasets A and B for image analysis, and datasets C and D for video analysis. Dataset A contained 27,113 colonoscopy images from 1,138 consecutive patients found to have at least one polyp during 2016 (5,541 images with polyps, 21,572 without), with all polyps confirmed histologically after biopsy; dataset B is a public database (CVC-ClinicDB).
• An AI that accurately detects polyps during clinical colonoscopy using deep learning
• Runs in real time: processes 25 frames per second
• Validated on both images and videos
• Sensitivity and specificity mostly above 90%
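The clinically important claim here is real-time operation: at least 25 frames per second with a per-frame latency of about 77 ms, achieved with multi-threaded processing. The detection network itself is not reproduced here; the sketch below only illustrates the general pattern of a threaded capture/inference pipeline and how per-frame latency and throughput could be measured, with a sleep standing in for model inference.

```python
import queue
import threading
import time

def fake_detector(frame):
    """Stand-in for the polyp-detection CNN; sleeps to simulate ~30 ms of inference."""
    time.sleep(0.03)
    return []  # list of detected polyp bounding boxes

def capture(frames_q, n_frames=100, fps=25):
    """Producer: push timestamped frames at roughly the video frame rate."""
    for i in range(n_frames):
        frames_q.put((time.perf_counter(), f"frame-{i}"))
        time.sleep(1 / fps)
    frames_q.put(None)  # sentinel: end of stream

def detect(frames_q, latencies):
    """Consumer: run detection and record capture-to-result latency in milliseconds."""
    while True:
        item = frames_q.get()
        if item is None:
            break
        t0, frame = item
        fake_detector(frame)
        latencies.append((time.perf_counter() - t0) * 1000)

if __name__ == "__main__":
    q_frames, latencies = queue.Queue(maxsize=8), []
    threads = [threading.Thread(target=capture, args=(q_frames,)),
               threading.Thread(target=detect, args=(q_frames, latencies))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"mean latency: {sum(latencies) / len(latencies):.1f} ms over {len(latencies)} frames")
```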
Examples of Polyp Detection for Datasets A and C
• Some polyps were detected with only partial appearance.
• Polyps were detected in both normal and insufficient light conditions.
• Polyps were detected under both qualified and suboptimal bowel preparations.
Fig. 3 | Examples of polyp detection for datasets A and C. Polyps of different morphology, including flat isochromatic polyps (left), dome-shaped polyps (second from left, middle), pedunculated polyps (second from right) and sessile serrated adenomatous polyps (right), were detected by the algorithm (as indicated by the green tags in the bottom set of images) in both normal and insufficient light conditions, under both qualified and suboptimal bowel preparations. Some polyps were detected with only partial appearance (middle, second from right). See Supplementary Figs 2–6 for additional examples.
From the discussion: "… we demonstrated high per-image sensitivity (94.38% and 91.64%) in both the image (dataset A) and video (dataset C) analyses. Datasets A and C included large variations of polyp morphology and image quality (Fig. 3, Supplementary Figs. 2–5 and Supplementary Videos 3 and 4). … [Existing] datasets are often small and do not represent the full range of colon conditions encountered in the clinical setting, and there are often discrepancies in the reporting of clinical metrics of success such as sensitivity and specificity. Compared with other metrics such as precision, we believe that sensitivity and specificity are the most appropriate metrics for the evaluation of algorithm performance because of their independence of the ratio of positive to negative …"
Three Types of Medical Artificial Intelligence
• Analysis of complex medical data to derive insights
• Analysis and interpretation of medical imaging and pathology data
• Monitoring of continuous data for prevention and prediction
Fig 1. What can consumer wearables do? Heart rate can be measured with an oximeter built into a ring [3], muscle activity with an electromyographic sensor embedded into clothing [4], stress with an electrodermal sensor incorporated into a wristband [5], and physical activity or sleep patterns via an accelerometer in a watch [6,7]. In addition, a female's most fertile period can be identified with detailed body temperature tracking [8], while levels of mental attention can be monitored with a small number of non-gelled electroencephalogram (EEG) electrodes [9]. Levels of social interaction … (PLOS Medicine, 2016)
SEPSIS
A targeted real-time early warning score (TREWScore) for septic shock
Katharine E. Henry, David N. Hager, Peter J. Pronovost, Suchi Saria (Science Translational Medicine)
Sepsis is a leading cause of death in the United States, with mortality highest among patients who develop septic shock. Early aggressive treatment decreases morbidity and mortality. Although automated screening tools can detect patients currently experiencing severe sepsis and septic shock, none predict those at greatest risk of developing shock. We analyzed routinely available physiological and laboratory data from intensive care unit patients and developed "TREWScore," a targeted real-time early warning score that predicts which patients will develop septic shock. TREWScore identified patients before the onset of septic shock with an area under the ROC (receiver operating characteristic) curve (AUC) of 0.83 [95% confidence interval (CI), 0.81 to 0.85]. At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85 and identified patients a median of 28.2 [interquartile range (IQR), 10.6 to 94.2] hours before onset. Of those identified, two-thirds were identified before any sepsis-related organ dysfunction. In comparison, the Modified Early Warning Score, which has been used clinically for septic shock prediction, achieved a lower AUC of 0.73 (95% CI, 0.71 to 0.76). A routine screening protocol based on the presence of two of the systemic inflammatory response syndrome criteria, suspicion of infection, and either hypotension or hyperlactatemia achieved a lower sensitivity of 0.74 at a comparable specificity of 0.64. Continuous sampling of data from the electronic health records and calculation of TREWScore may allow clinicians to identify patients at risk for septic shock and provide earlier interventions that would prevent or mitigate the associated morbidity and mortality.
INTRODUCTION
Seven hundred fifty thousand patients develop severe sepsis and septic shock in the United States each year. More than half of them are admitted to an intensive care unit (ICU), accounting for 10% of all ICU admissions, 20 to 30% of hospital deaths, and $15.4 billion in annual health care costs (1–3). Several studies have demonstrated that morbidity, mortality, and length of stay are decreased when severe sepsis and septic shock are identified and treated early (4–8). In particular, one study showed that mortality from septic shock increased by 7.6% with every hour that treatment was delayed after the onset of hypotension (9). More recent studies comparing protocolized care, usual care, and early goal-directed therapy (EGDT) for patients with septic shock suggest that usual care is as effective as EGDT (10–12). Some have interpreted this to mean that usual care has improved over time and reflects important aspects of EGDT, such as early antibiotics and early aggressive fluid resuscitation (13). It is likely that continued early identification and treatment will further improve outcomes. However, the … Acute Physiology Score (SAPS II), Sequential Organ Failure Assessment (SOFA) scores, Modified Early Warning Score (MEWS), and Simple Clinical Score (SCS) have been validated to assess illness severity and risk of death among septic patients (14–17). Although these scores are useful for predicting general deterioration or mortality, they typically cannot distinguish with high sensitivity and specificity which patients are at highest risk of developing a specific acute condition.
The increased use of electronic health records (EHRs), which can be queried in real time, has generated interest in automating tools that identify patients at risk for septic shock (18–20). A number of "early warning systems," "track and trigger" initiatives, "listening applications," and "sniffers" have been implemented to improve detection and timeliness of therapy for patients with severe sepsis and septic shock (18, 20–23). Although these tools have been successful at detecting patients currently experiencing severe sepsis or septic shock, none predict which patients are at highest risk of developing septic shock. The adoption of the Affordable Care Act has added to the growing excitement around predictive models derived from electronic health …
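TREWScore is recomputed continuously as new EHR observations arrive, and an alert fires once the score crosses a detection threshold. The sketch below shows that streaming pattern in miniature; the feature set, weights, logistic mapping and alert threshold are hypothetical stand-ins for illustration, not the published model (the paper fits a regularized time-to-event model over many more routinely collected features).

```python
import math

# Hypothetical feature weights for illustration only; not the published
# TREWScore coefficients.
WEIGHTS = {"heart_rate": 0.02, "resp_rate": 0.05, "lactate": 0.45, "sbp": -0.015}
INTERCEPT = -2.0
ALERT_THRESHOLD = 0.60  # hypothetical operating point


def risk_score(latest):
    """Map the most recent value of each feature to a risk in [0, 1]."""
    z = INTERCEPT + sum(WEIGHTS[k] * v for k, v in latest.items())
    return 1.0 / (1.0 + math.exp(-z))


def monitor(stream):
    """Recompute the score whenever a new measurement arrives and return
    the time of the first threshold crossing (the 'early warning')."""
    latest = {"heart_rate": 80.0, "resp_rate": 16.0, "lactate": 1.0, "sbp": 120.0}
    for timestamp, feature, value in stream:
        latest[feature] = value
        score = risk_score(latest)
        if score >= ALERT_THRESHOLD:
            return timestamp, score
    return None, None


if __name__ == "__main__":
    # A toy stream of (time, feature, value) EHR updates for one patient.
    stream = [
        (1, "heart_rate", 95.0),
        (2, "lactate", 2.8),
        (3, "sbp", 92.0),
        (4, "lactate", 4.1),
    ]
    print(monitor(stream))
```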
Fig. 2. ROC for detection of septic shock before onset in the validation set. The ROC curve for TREWScore is shown in blue, with the ROC curve for MEWS in red. The sensitivity and specificity performance of the routine screening criteria is indicated by the purple dot. Normal 95% CIs are shown for TREWScore and MEWS. TPR, true-positive rate; FPR, false-positive rate.
A targeted real-time early warning score (TREWScore) for septic shock: AUC = 0.83. At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85 and identified patients a median of 28.2 hours before onset. More than two-thirds (68.8%) of patients were identified before any sepsis-related organ dysfunction.
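The operating point quoted above (sensitivity 0.85 at specificity 0.67) is simply one point read off the ROC curve. The snippet below shows how such a point can be extracted from scored validation data; the scores and labels here are synthetic, and scikit-learn's roc_curve is used only as a convenience, so the printed numbers will not match the paper's.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic validation data: 1 = developed septic shock, 0 = did not.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=2000), 0, 1)

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Pick the threshold whose specificity (1 - FPR) is closest to 0.67,
# then report the sensitivity (TPR) achieved at that operating point.
target_specificity = 0.67
idx = np.argmin(np.abs((1 - fpr) - target_specificity))
print(f"AUC={auc:.2f}, threshold={thresholds[idx]:.3f}, "
      f"specificity={1 - fpr[idx]:.2f}, sensitivity={tpr[idx]:.2f}")
```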
• Launched in the United States as an iPhone app
• The key question is how burdensome it will be to use
• For how long does it need to be used to be effective: two weeks? for life?
• How will food logging and similar inputs be handled?
• The pricing model does not appear to have been disclosed yet
An Algorithm Based on Deep Learning for Predicting In-Hospital Cardiac Arrest
Joon-myoung Kwon, MD; Youngnam Lee, MS; Yeha Lee, PhD; Seungwoo Lee, BS; Jinsik Park, MD, PhD (J Am Heart Assoc. 2018;7:e008678. DOI: 10.1161/JAHA.118.008678)
Background — In-hospital cardiac arrest is a major burden to public health, which affects patient safety. Although traditional track-and-trigger systems are used to predict cardiac arrest early, they have limitations, with low sensitivity and high false-alarm rates. We propose a deep learning–based early warning system that shows higher performance than the existing track-and-trigger systems.
Methods and Results — This retrospective cohort study reviewed patients who were admitted to 2 hospitals from June 2010 to July 2017. A total of 52,131 patients were included. Specifically, a recurrent neural network was trained using data from June 2010 to January 2017. The result was tested using the data from February to July 2017. The primary outcome was cardiac arrest, and the secondary outcome was death without attempted resuscitation. As comparative measures, we used the area under the receiver operating characteristic curve (AUROC), the area under the precision–recall curve (AUPRC), and the net reclassification index. Furthermore, we evaluated sensitivity while varying the number of alarms. The deep learning–based early warning system (AUROC: 0.850; AUPRC: 0.044) significantly outperformed a modified early warning score (AUROC: 0.603; AUPRC: 0.003), a random forest algorithm (AUROC: 0.780; AUPRC: 0.014), and logistic regression (AUROC: 0.613; AUPRC: 0.007). Furthermore, the deep learning–based early warning system reduced the number of alarms by 82.2%, 13.5%, and 42.1% compared with the modified early warning system, random forest, and logistic regression, respectively, at the same sensitivity.
Conclusions — An algorithm based on deep learning had high sensitivity and a low false-alarm rate for detection of patients with cardiac arrest in the multicenter study.
Key Words: artificial intelligence • cardiac arrest • deep learning • machine learning • rapid response system • resuscitation
In-hospital cardiac arrest is a major burden to public health, which affects patient safety. More than half of cardiac arrests result from respiratory failure or hypovolemic shock, and 80% of patients with cardiac arrest show signs of deterioration in the 8 hours before cardiac arrest. However, 209,000 in-hospital cardiac arrests occur in the United States each year, and the survival discharge rate for patients with cardiac arrest is <20% worldwide. Rapid response systems (RRSs) have been introduced in many hospitals to detect cardiac arrest using the track-and-trigger system (TTS). Two types of TTS are used in RRSs. For the single-parameter TTS (SPTTS), cardiac arrest is predicted if any single vital sign (eg, heart rate [HR], blood pressure) is out of the normal range. The aggregated weighted TTS calculates a weighted score for each vital sign and then finds patients with cardiac arrest based on the sum of these scores. The modified early warning score (MEWS) is one of the most widely used approaches among all aggregated weighted TTSs (Table 1); however, traditional TTSs including MEWS have limitations, with low sensitivity or high false-alarm rates. Sensitivity and false-alarm rate interact: increased sensitivity creates higher false-alarm rates and vice versa. Current RRSs suffer from low sensitivity or a high false-alarm rate. An RRS was used for only 30% of patients before unplanned intensive care unit admission and was not used for 22.8% of patients, even if they met the criteria.
From the Departments of Emergency Medicine (J.-m.K.) and Cardiology (J.P.), Mediplex Sejong Hospital, Incheon, Korea; VUNO, Seoul, Korea (Youngnam L., Yeha L., S.L.). Correspondence: Joon-myoung Kwon, MD (kwonjm@sejongh.co.kr). © 2018 The Authors. Published on behalf of the American Heart Association, Inc., by Wiley, under the Creative Commons Attribution-NonCommercial License.
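The paper describes a recurrent neural network trained on sequences of vital signs. A minimal sketch of that kind of model is shown below in PyTorch; the GRU choice, layer sizes and 24-step hourly input window are assumptions made for illustration, not the published DEWS architecture.

```python
import torch
import torch.nn as nn


class EarlyWarningRNN(nn.Module):
    """Minimal recurrent early-warning model: a GRU over a sequence of
    vital-sign vectors followed by a sigmoid risk output."""

    def __init__(self, n_features=4, hidden_size=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, time_steps, n_features), e.g. hourly HR, RR, temp, SBP
        _, h_n = self.rnn(x)                       # (1, batch, hidden_size)
        return torch.sigmoid(self.head(h_n[-1]))   # risk in [0, 1]


if __name__ == "__main__":
    model = EarlyWarningRNN()
    vitals = torch.randn(8, 24, 4)   # 8 patients, 24 hourly observations each
    risk = model(vitals)             # (8, 1) predicted arrest risk
    print(risk.squeeze(1))
```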
Cardiac Arrest Prediction Accuracy
• Number of patients: 86,290
• Cardiac arrests: 633
• Input: heart rate, respiratory rate, body temperature, systolic blood pressure (source: VUNO)
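With 633 arrests among 86,290 patients the positive class is rare (under 1%), which is why the AUPRC values quoted earlier look small in absolute terms: for a random ranking, expected precision — and hence the AUPRC baseline — is roughly the prevalence. A quick check of that arithmetic, using the numbers on this slide and the AUPRCs from the paper:

```python
# Rough arithmetic on the class imbalance in the validation data
# (patient counts from the slide above; AUPRCs from the paper).
n_patients = 86_290
n_arrests = 633

prevalence = n_arrests / n_patients
print(f"prevalence = {prevalence:.4f} ({prevalence:.2%})")   # ~0.73%

# For a random ranking, expected AUPRC is approximately the prevalence,
# so an AUPRC of 0.044 is several times better than chance.
for name, auprc in [("DEWS", 0.044), ("random forest", 0.014), ("MEWS", 0.003)]:
    print(f"{name}: AUPRC {auprc} ≈ {auprc / prevalence:.1f}x the chance level")
```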
Fewer False Alarms
• At alarm volumes that a university hospital rapid response team can realistically handle (operating points A and B), the accuracy gap is even larger:
• A: DEWS 33.0% vs MEWS 0.3%
• B: DEWS 42.7% vs MEWS 4.0%
• APPH: alarms per patient per hour (source: VUNO)
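One way to make that comparison concrete is to fix an alarm budget (APPH) and ask what sensitivity each score achieves within it. The sketch below does this for two synthetic scores; the budget, monitoring hours and score distributions are all made up for illustration and are not the DEWS/MEWS data.

```python
import numpy as np


def sensitivity_at_alarm_budget(scores, labels, hours_monitored, apph_budget):
    """Raise alarms for the highest-scoring cases that fit within the alarm
    budget (alarms per patient per hour) and return the sensitivity."""
    n_alarms = int(apph_budget * len(scores) * hours_monitored)
    alarm_idx = np.argsort(scores)[::-1][:n_alarms]   # top-scoring cases
    return labels[alarm_idx].sum() / labels.sum()


rng = np.random.default_rng(1)
labels = (rng.random(10_000) < 0.01).astype(int)                  # ~1% events
strong = labels * rng.normal(0.7, 0.2, 10_000) + rng.normal(0.3, 0.2, 10_000)
weak = labels * rng.normal(0.1, 0.2, 10_000) + rng.normal(0.3, 0.2, 10_000)

for name, s in [("stronger score", strong), ("weaker score", weak)]:
    sens = sensitivity_at_alarm_budget(s, labels, hours_monitored=24,
                                       apph_budget=0.001)
    print(f"{name}: sensitivity {sens:.2f} at the same alarm budget")
```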
Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network
Awni Y. Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H. Tison, Codie Bourn, Mintu P. Turakhia and Andrew Y. Ng (Stanford University, iRhythm Technologies, University of California San Francisco, VA Palo Alto). Nature Medicine 25, 65–69 (2019), https://doi.org/10.1038/s41591-018-0268-3
Computerized electrocardiogram (ECG) interpretation plays a critical role in the clinical ECG workflow. Widely available digital ECG data and the algorithmic paradigm of deep learning present an opportunity to substantially improve the accuracy and scalability of automated ECG analysis. However, a comprehensive evaluation of an end-to-end deep learning approach for ECG analysis across a wide variety of diagnostic classes has not been previously reported. Here, we develop a deep neural network (DNN) to classify 12 rhythm classes using 91,232 single-lead ECGs from 53,549 patients who used a single-lead ambulatory ECG monitoring device. When validated against an independent test dataset annotated by a consensus committee of board-certified practicing cardiologists, the DNN achieved an average area under the receiver operating characteristic curve (ROC) of 0.97. The average F1 score, which is the harmonic mean of the positive predictive value and sensitivity, for the DNN (0.837) exceeded that of average cardiologists (0.780). With specificity fixed at the average specificity achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes. These findings demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach could reduce the rate of misdiagnosed computerized ECG interpretations and improve the efficiency of expert human ECG interpretation by accurately triaging or prioritizing the most urgent conditions.
The electrocardiogram is a fundamental tool in the everyday practice of clinical medicine, with more than 300 million ECGs obtained annually worldwide. The ECG is pivotal for diagnosing a wide spectrum of abnormalities from arrhythmias to acute coronary syndrome. Computer-aided interpretation has become increasingly important in the clinical ECG workflow since its introduction over 50 years ago, serving as a crucial adjunct to physician interpretation in many clinical settings. However, existing commercial ECG interpretation algorithms still show substantial rates of misdiagnosis. The combination of widespread digitization of ECG data and the development of algorithmic paradigms that can benefit from large-scale processing of raw data presents an opportunity to reexamine the standard approach to algorithmic ECG analysis and may provide substantial improvements to automated ECG interpretation.
Substantial algorithmic advances in the past five years have been driven largely by a specific class of models known as deep neural networks. DNNs are computational models consisting of multiple processing layers, with each layer being able to learn increasingly abstract, higher-level representations of the input data relevant to perform specific tasks. They have dramatically improved the state of the art in speech recognition, image recognition, strategy games such as Go, and in medical applications. The ability of DNNs to recognize patterns and learn useful features from raw input data without requiring extensive data preprocessing, feature engineering or handcrafted rules makes them particularly well suited to interpret ECG data. Furthermore, since DNN performance tends to increase as the amount of training data increases, this approach is well positioned to take advantage of the widespread digitization of ECG data.
A comprehensive evaluation of whether an end-to-end deep learning approach can be used to analyze raw ECG data to classify a broad range of diagnoses remains lacking. Much of the previous work to employ DNNs toward ECG interpretation has focused on single aspects of the ECG processing pipeline, such as noise reduction or feature extraction, or has approached limited diagnostic tasks, detecting only a handful of heartbeat types (normal, ventricular or supraventricular ectopic, fusion, and so on) or rhythm diagnoses (most commonly atrial fibrillation or ventricular tachycardia). Lack of appropriate data has limited many efforts beyond these applications. Most prior efforts used data from the MIT-BIH Arrhythmia database (PhysioNet), which is limited by the small number of patients and rhythm episodes present in the dataset.
In this study, we constructed a large, novel ECG dataset that underwent expert annotation for a broad range of ECG rhythm classes. We developed a DNN to detect 12 rhythm classes from raw single-lead ECG inputs using a training dataset consisting of 91,232 ECG records from 53,549 patients. The DNN was designed to classify 10 arrhythmias as well as sinus rhythm and noise for a total of 12 output rhythm classes (Extended Data Fig. 1). ECG data were recorded by the Zio monitor, which is a Food and Drug Administration (FDA)-cleared, single-lead, patch-based ambulatory ECG monitor that continuously records data from a single vector (modified Lead II) at 200 Hz. The mean and median wear time of the Zio monitor in our dataset was 10.6 and 13.0 days, respectively. Mean age was 69 ± 16 years and 43% were women. We validated the DNN on a test dataset that consisted of 328 ECG records collected from 328 unique patients, which was annotated by a consensus committee of expert cardiologists (see Methods). Mean age on the test dataset was 70 ± 17 years and 38% were women. The mean inter-annotator agreement on the test dataset was 72.8%.
Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks
• 91,232 single-lead ECG records collected from 53,549 patients
• ZIO patch (an FDA-cleared, single-lead, ambulatory ECG monitor)
• A DNN (34-layer network) was developed to classify the recordings into 12 rhythm classes
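The slide mentions a 34-layer network operating on raw single-lead ECG sampled at 200 Hz. A much smaller stand-in is sketched below to show the general shape of such a model (1-D convolutions over the raw signal, one score per rhythm class); the layer counts, kernel sizes and pooling here are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn


class TinyECGNet(nn.Module):
    """Toy 1-D CNN over raw single-lead ECG; the real model is a much
    deeper (34-layer) residual network."""

    def __init__(self, n_classes=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=16, stride=2, padding=7), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=7), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=16, stride=2, padding=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, 1, samples); e.g. 30 s at 200 Hz -> 6000 samples
        z = self.features(x).squeeze(-1)   # (batch, 64)
        return self.classifier(z)          # unnormalized class scores


if __name__ == "__main__":
    model = TinyECGNet()
    ecg = torch.randn(4, 1, 6000)          # 4 thirty-second records
    print(model(ecg).shape)                # torch.Size([4, 12])
```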
Validation
• Compared against the average performance of 6 independent cardiologists
• Comparison based on the F1 score (the harmonic mean of precision and recall)
Supplementary Table 1 shows the number of unique patients exhibiting each rhythm class. We first compared the performance of the DNN against the gold-standard cardiologist consensus committee diagnoses by calculating the AUC (Table 1a). Since the DNN algorithm was designed to make a rhythm class prediction approximately once per second (see Methods), we report performance both as assessed once every second, which we call "sequence-level" and consists of one rhythm class per interval, and once per record, which we call "set-level". … Scores on the 10% development dataset (n = 8,761) were materially unchanged from the test dataset results, although they were slightly higher (Supplementary Tables 3 and 4). In addition, we retrained the DNN holding out an additional 10% of the training dataset as a second held-out test dataset (n = 8,768); the AUC and F1 scores for all rhythms were materially unchanged (Supplementary Tables 5 and 6). We note that unlike the primary test dataset, which has gold-standard annotations from a committee of cardiologists, both sensitivity analysis datasets are annotated by certified ECG technicians.
Table 1 | Diagnostic performance of the DNN and averaged individual cardiologists compared to the cardiologist committee consensus (n = 328)
Rhythm class | AUC seq (95% CI) | AUC set (95% CI) | DNN F1 seq | DNN F1 set | Cardiologist F1 seq | Cardiologist F1 set
Atrial fibrillation and flutter | 0.973 (0.966–0.980) | 0.965 (0.932–0.998) | 0.801 | 0.831 | 0.677 | 0.686
AVB | 0.988 (0.983–0.993) | 0.981 (0.953–1.000) | 0.828 | 0.808 | 0.772 | 0.761
Bigeminy | 0.997 (0.991–1.000) | 0.996 (0.976–1.000) | 0.847 | 0.870 | 0.842 | 0.853
EAR | 0.913 (0.889–0.937) | 0.940 (0.870–1.000) | 0.541 | 0.596 | 0.482 | 0.536
IVR | 0.995 (0.989–1.000) | 0.987 (0.959–1.000) | 0.761 | 0.818 | 0.632 | 0.720
Junctional rhythm | 0.987 (0.980–0.993) | 0.979 (0.946–1.000) | 0.664 | 0.789 | 0.692 | 0.679
Noise | 0.981 (0.973–0.989) | 0.947 (0.898–0.996) | 0.844 | 0.761 | 0.768 | 0.685
Sinus rhythm | 0.975 (0.971–0.979) | 0.987 (0.976–0.998) | 0.887 | 0.933 | 0.852 | 0.910
SVT | 0.973 (0.960–0.985) | 0.953 (0.903–1.000) | 0.488 | 0.693 | 0.451 | 0.564
Trigeminy | 0.998 (0.995–1.000) | 0.997 (0.979–1.000) | 0.907 | 0.864 | 0.842 | 0.812
Ventricular tachycardia | 0.995 (0.980–1.000) | 0.980 (0.934–1.000) | 0.541 | 0.681 | 0.566 | 0.769
Wenckebach | 0.978 (0.967–0.989) | 0.977 (0.938–1.000) | 0.702 | 0.780 | 0.591 | 0.738
Frequency-weighted average | 0.978 | 0.977 | 0.807 | 0.837 | 0.753 | 0.780
Notes: AUC columns compare the DNN algorithm to the cardiologist committee consensus; F1 columns compare the DNN and averaged individual cardiologists to the committee consensus. Sequence-level describes algorithm predictions made once every 256 input samples (approximately every 1.3 s), compared against the gold-standard committee consensus at the same intervals. Set-level describes the unique set of algorithm predictions present in the 30-s record. Sequence AUC prediction, n = 7,544; set AUC prediction, n = 328.
• Set-level average F1 score: overall, the AI performed better — DNN (0.837) > cardiologists (0.780)
• The DNN and the cardiologists show a similar pattern of F1 scores across classes
• Both score low on classes such as VT and EAR
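Since the comparison metric is the F1 score, the harmonic mean of precision (PPV) and recall (sensitivity), the small helper below simply works the formula through on a made-up precision/recall pair; the per-class precision and recall values behind the table are not given on the slide.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision (PPV) and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example with made-up values: PPV 0.90 and sensitivity 0.75.
print(f"{f1_score(0.90, 0.75):.3f}")   # 0.818 -- pulled toward the lower value
```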
Fig. 1 | ROC and precision-recall curves. a, Examples of ROC curves calculated at the sequence level for atrial fibrillation (AF), trigeminy, and AVB. b, Examples of precision-recall curves calculated at the sequence level for atrial fibrillation, trigeminy, and AVB. Individual cardiologist performance is indicated by the red crosses and averaged cardiologist performance is indicated by the green dot. The line represents the ROC (a) or precision-recall curve (b) achieved by the DNN model. n = 7,544, where each of the 328 30-s ECGs received 23 sequence-level predictions.
• The DNN model met or exceeded the averaged cardiologist performance for all rhythm classes.
Table 2 | DNN algorithm and cardiologist sensitivity compared to the cardiologist committee consensus, with specificity fixed at the average specificity level achieved by cardiologists
Rhythm class | Specificity | Average cardiologist sensitivity | DNN algorithm sensitivity
Atrial fibrillation and flutter | 0.941 | 0.710 | 0.861
AVB | 0.981 | 0.731 | 0.858
Bigeminy | 0.996 | 0.829 | 0.921
EAR | 0.993 | 0.380 | 0.445
IVR | 0.991 | 0.611 | 0.867
Junctional rhythm | 0.984 | 0.634 | 0.729
Noise | 0.983 | 0.749 | 0.803
Sinus rhythm | 0.859 | 0.901 | 0.950
SVT | 0.983 | 0.408 | 0.487
Ventricular tachycardia | 0.996 | 0.652 | 0.702
Wenckebach | 0.986 | 0.541 | 0.651
• Comparison of sensitivity between the cardiologists and the DNN
• For the DNN, sensitivity was measured with specificity fixed at the cardiologists' average level
• The DNN showed higher sensitivity than the cardiologists across all 12 rhythm classes
Three Types of Medical Artificial Intelligence
• Analysis of complex medical data to derive insights
• Analysis and interpretation of medical imaging and pathology data
• Monitoring of continuous data for prevention and prediction
Three Steps to Implement Digital Medicine
• Step 1. Measure the Data
• Step 2. Collect the Data
• Step 3. Insight from the Data
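A toy end-to-end illustration of the three steps follows; the simulated heart-rate values, sample counts and outlier rule are entirely hypothetical and only serve to make the measure → collect → insight flow concrete.

```python
import random
import statistics


def measure_heart_rate():
    """Step 1. Measure: pretend a wearable samples one heart-rate value."""
    return random.gauss(72, 8)


def collect(n_samples=60):
    """Step 2. Collect: accumulate measurements into one dataset."""
    return [measure_heart_rate() for _ in range(n_samples)]


def insight(samples, threshold=2.0):
    """Step 3. Insight: flag readings far from this person's own baseline."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    outliers = [x for x in samples if abs(x - mean) > threshold * sd]
    return mean, sd, outliers


if __name__ == "__main__":
    data = collect()
    mean, sd, outliers = insight(data)
    print(f"baseline {mean:.1f} bpm (sd {sd:.1f}), {len(outliers)} unusual readings")
```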
    Feedback/Questions • E-mail: yoonsup.choi@gmail.com • Blog: http://www.yoonsupchoi.com • Facebook: 최윤섭 디지털 헬스케어 연구소