Data science from the trenches
Vsevolod Solovyov
CTO at Prophy Science
@murkt
Problem:
•There are 100+ million scientific publications

•The population is growing, and so is the number of scientists

•Every year scientists write more and more papers

•Plus patents, technical documentation, clinical trials...
Scientific publication
•Title, text

•Authors

•References to previous work

•Publication venue (journal, conference)
What do people need?
•Search

•Recommendations of new papers

•Expert search

•Getting an overview of an unfamiliar topic
Who are the clients?
•Grant agencies

•Scientific publishers

•Pharmaceutical companies

•Researchers

•Internal documentation hell
Plain text is inconvenient
Everyone loves to abbreviate
•We study the impact of a warm dark matter (WDM)
cosmology on dwarf galaxy formation...

•...in both CDM and WDM models. WDM halos ...

•...the most massive WDM galaxy (Halo m10k) collapses...

•...their CDM counterparts, as can be seen by comparing
the colored lines (WDM)...
Everyone loves to abbreviate
Warm dark matter 3
WDM 219
Ordinary stemming
•polar, polars, polarize, polarized, polarizations → polar

•GAN, GaN → gan

•AND → and

•anyone, anyon → anyon

•WDMS, WDM → wdm
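
The collisions above are easy to reproduce. The following is a minimal sketch using a deliberately crude suffix-stripping stemmer; it is a stand-in written for illustration (the suffix list and `naive_stem` are assumptions, not the stemmer used in the talk), but it shows the same failure modes: case is lost and unrelated words collapse to one stem.

```python
# Hypothetical suffix list, chosen only to reproduce the slide's examples.
SUFFIXES = ("izations", "ization", "ized", "ize", "s", "e")


def naive_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for group in (["polar", "polars", "polarize", "polarized", "polarizations"],
              ["GAN", "GaN"], ["anyone", "anyon"], ["WDMS", "WDM"]):
    print(group, "->", {naive_stem(w) for w in group})
```

Each group prints a single stem, which is exactly the problem: "GaN" (gallium nitride) and "GAN", or "anyon" and "anyone", become indistinguishable.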
Let's search the text
Alpha thalassemia 2236
Alpha thalassaemia 748
α thalassemia 1046
α thalassaemia 276
SUM(1 + 2 + 3 + 4) 4306
OR(1 + 2 + 3 + 4) 4074
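
SUM exceeds OR because a paper that uses two spellings is counted once per spelling in the sum but only once in the union. A minimal sketch of that difference, using a four-document toy corpus invented for illustration (not the speaker's data):

```python
import re

# Spelling variants of one concept, as listed on the slide.
VARIANTS = [
    "alpha thalassemia",
    "alpha thalassaemia",
    "α thalassemia",
    "α thalassaemia",
]

# Toy corpus (hypothetical): document id -> text.
DOCS = {
    1: "Alpha thalassemia screening in newborns",
    2: "α thalassaemia and alpha thalassemia compared",
    3: "Warm dark matter halos",
    4: "Management of α thalassemia",
}


def matching_ids(phrase, docs):
    """Return the set of document ids whose text contains the phrase."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return {doc_id for doc_id, text in docs.items() if pattern.search(text)}


per_variant = [matching_ids(v, DOCS) for v in VARIANTS]
sum_of_counts = sum(len(ids) for ids in per_variant)  # counts document 2 twice
or_count = len(set().union(*per_variant))             # counts each document once
print(sum_of_counts, or_count)
```

Searching for any single variant misses most of the relevant papers, which is why the variants need to be mapped to one concept.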
Solution
•Specialized stemmer

•Stop words (anyone)

•Ontology (term lists)

•Warm dark matter, WDM
•Wavelength division multiplexer, WDM

•Galaxy cluster, cluster of galaxies, GC, cluster
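
One way to picture such an ontology is as a lookup from surface forms to concepts, where an ambiguous abbreviation maps to several concepts. This is a minimal sketch under assumptions: the concept ids, `build_index`, and `lookup` are hypothetical names, and treating all-uppercase forms as case-sensitive abbreviations is an illustrative choice rather than the speaker's rule.

```python
# Hypothetical mini-ontology: concept id -> known surface forms.
ONTOLOGY = {
    "warm_dark_matter": ["warm dark matter", "WDM"],
    "wavelength_division_multiplexer": ["wavelength division multiplexer", "WDM"],
    "galaxy_cluster": ["galaxy cluster", "cluster of galaxies", "GC"],
}


def _key(form):
    # Abbreviations stay case-sensitive; ordinary phrases are lowercased.
    return form if form.isupper() else form.lower()


def build_index(ontology):
    """Invert the ontology: surface form -> set of candidate concepts."""
    index = {}
    for concept, forms in ontology.items():
        for form in forms:
            index.setdefault(_key(form), set()).add(concept)
    return index


def lookup(term, index):
    """Return every concept the term may refer to (possibly none)."""
    return index.get(_key(term), set())


INDEX = build_index(ONTOLOGY)
print(lookup("WDM", INDEX))               # two candidates: needs context
print(lookup("Warm Dark Matter", INDEX))  # one candidate
```

A term with several candidate concepts still has to be resolved from context, for example from the other terms found in the same paper.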
Extending the ontology
•Keyphrase extraction

•On the structure and oxygen transmission rate of biodegradable cellulose
nanobarriers

•Synonym detection

•octadecylphosphonic acid (OPA, ODPA)

•convolutional LSTM (ConvLSTM, convolutional long short-term memory)

•decoupled extended Kalman filter (DEKF, decoupled EKF)

•human mesenchymal stem cells (human bone marrow stem cells, hMSC, HBMSC)
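
A common starting point for synonym detection is the "long form (ABBR)" pattern. The sketch below is a simple first-letter heuristic, assumed for illustration only (`find_abbreviations` is a hypothetical helper, not the method from the talk): it takes the words preceding a parenthesized abbreviation and checks that their initials spell it.

```python
import re

# A parenthesized token of 2 to 10 letters, e.g. "(WDM)".
PAREN = re.compile(r"\(([A-Za-z]{2,10})\)")


def find_abbreviations(text):
    """Return (long_form, abbreviation) pairs found by initial matching."""
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        # Words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        candidate = words[-len(abbr):]
        if len(candidate) < len(abbr):
            continue
        initials = "".join(word[0] for word in candidate)
        if initials.lower() == abbr.lower():
            pairs.append((" ".join(candidate), abbr))
    return pairs


print(find_abbreviations("We study the impact of a warm dark matter (WDM) cosmology"))
print(find_abbreviations("a decoupled extended Kalman filter (DEKF) is used"))
```

Initials alone do not cover cases such as "octadecylphosphonic acid (OPA)" or "convolutional LSTM (ConvLSTM)", where the abbreviation draws letters from inside a word; those need a more permissive alignment.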
Article clustering
30 million articles: highload, bigdata :)
Before
•10 million articles ate 130 GB of RAM

•Computation took a day, saving took about five hours

•Deletion took up to forty minutes

•Tens of gigabytes in the Postgres WAL

•The streaming backup kept going down
What we did
•Wrote our own Louvain implementation in Cython

•Table clustering in Postgres

•import array

•Saving in parallel with the computation
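
The slide does not say how `array` was used. One plausible reading is that graph data moved from Python lists of objects into typed arrays, which store raw machine integers. The sketch below is an assumption along those lines: a compressed sparse row (CSR) adjacency structure built with the standard `array` module (`build_csr` is a hypothetical helper).

```python
from array import array


def build_csr(num_nodes, edges):
    """Build a CSR adjacency structure for an undirected graph.

    Returns (offsets, neighbors); the neighbors of node v are
    neighbors[offsets[v]:offsets[v + 1]].
    """
    # "q" is a signed 64-bit integer; no per-element Python object is kept.
    degree = array("q", [0]) * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    offsets = array("q", [0]) * (num_nodes + 1)
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + degree[node]

    neighbors = array("q", [0]) * offsets[num_nodes]
    cursor = offsets[:num_nodes]  # slicing copies: next free slot per node
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
        neighbors[cursor[v]] = u
        cursor[v] += 1
    return offsets, neighbors


offsets, neighbors = build_csr(3, [(0, 1), (1, 2)])
print(list(neighbors[offsets[1]:offsets[2]]))  # neighbors of node 1
```

Typed arrays like these can also be exposed to Cython code as memory views, which avoids per-element object overhead inside tight loops.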
Let's compare
•Speed: 10M articles in 30 hours → 31M articles in 15 hours

•Memory: 10M and 130 GB → 80-90 GB, peaking at 110 GB

•Deletion and table fragmentation: 40 minutes, then VACUUM → DROP TABLE
And who is the author?
J Smith
•John Smith

•Johannes Smith

•Jekyll E Smith

•Jekyll M Smith

•Jekyll EM Smith

•J David Smith

•Jekyll Smith Jr.

•Jekyll Smith III

•JekillE Smith

•JEKYLL. E. SMITH
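
A first filter for variants like these is a name-compatibility check: two strings can refer to the same person only if their surnames match and their given names do not contradict each other. The sketch below is a minimal illustration under assumptions: `parse_name` and `compatible` are hypothetical helpers, the suffix list is incomplete, and short all-uppercase tokens such as "EM" are treated as runs of initials.

```python
import re

# Generational suffixes to ignore (illustrative, not exhaustive).
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_name(raw):
    """Split 'Jekyll EM Smith' into (['jekyll', 'e', 'm'], 'smith').

    Assumes the surname comes last and at least one token remains.
    """
    tokens = [t for t in re.split(r"[\s.,]+", raw.strip()) if t]
    tokens = [t for t in tokens if t.lower() not in SUFFIXES]
    *given, surname = tokens
    parts = []
    for token in given:
        if token.isupper() and len(token) <= 3:
            parts.extend(token.lower())  # "EM" -> "e", "m"
        else:
            parts.append(token.lower())
    return parts, surname.lower()


def compatible(a, b):
    """True if the two name strings do not contradict each other."""
    given_a, surname_a = parse_name(a)
    given_b, surname_b = parse_name(b)
    if surname_a != surname_b:
        return False
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:  # an initial only has to match the first letter
                return False
        elif x != y:
            return False
    return True


print(compatible("J Smith", "Jekyll EM Smith"))     # True
print(compatible("John Smith", "Johannes Smith"))   # False
```

Compatibility is only a necessary condition: "J Smith" is compatible with every name in the list, and a typo such as "JekillE Smith" is rejected, so this check can neither confirm nor fully rule out a match on its own.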
JL Smith, MB Salamon
Which of these:

•James L. Smith, Myron B. Salamon

•Janet L. Smith, Theodora Hatziioannou

•Jerald L. Smith, Miriam Salamon
J Smith
•More than 35 thousand papers signed by some J Smith (in our database)

•That is more than a thousand real people

•Some have written hundreds of papers, others one or two
Then there are Chinese names
An especially hard case
Top names
Wei Wang 16700
Wei Zhang 14300
Wei Li 12500
Lei Zhang 10550
Jing Wang 10250
Yan Li 10000
Lei Wang 9900
More exceptional cases
•Thousands of authors on a single paper

•Two authors with the same name on one paper

•Author groups

•ATLAS collaboration

•Investigators of the European Huntington's Disease
Network
Identifiers to the rescue
•Dedicated identifiers for scientific authors

•ORCID – 0000-0001-8073-3068

•ScopusID – 13204492100

•ResearcherID – E-3698-2015

•Email
Dedicated identifiers
•Rarely given in papers

•Not human-readable

•They get mixed up in papers

•Identifier databases are full of garbage
The V Petrovs
stanislaV, paVel, Vladimir

Lachezar? Sergei?
Email
•One email definitely belongs to one person

•One person can have several

•Gmail

•Hotmail

•Different universities
Does an email belong to one person?
•isrn.molecular.biology@hindawi.com

•microbiologia.clinica@unt.edu.ar

•gyn-sekretariat@pius-hospital.de

•secretariat@grangettes.ch
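
Addresses like these can be flagged before email is used as evidence. The sketch below shows one possible heuristic, assumed for illustration rather than taken from the talk: flag an address if its local part contains a role word, or if it appears with too many distinct surnames. `ROLE_WORDS`, the threshold, the sample records, and `shared_mailboxes` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical role words; a real list would be longer and curated.
ROLE_WORDS = ("secretariat", "sekretariat")


def shared_mailboxes(records, max_surnames=2):
    """Return emails that look shared.

    records is an iterable of (email, surname) pairs taken from papers.
    """
    surnames_by_email = defaultdict(set)
    for email, surname in records:
        surnames_by_email[email.lower()].add(surname.lower())

    flagged = set()
    for email, surnames in surnames_by_email.items():
        local_part = email.split("@", 1)[0]
        role_like = any(word in local_part for word in ROLE_WORDS)
        if role_like or len(surnames) > max_surnames:
            flagged.add(email)
    return flagged


# Invented sample records; only the first address comes from the slide.
RECORDS = [
    ("secretariat@grangettes.ch", "Muller"),
    ("j.smith@uni.example", "Smith"),
    ("lab@uni.example", "Smith"),
    ("lab@uni.example", "Wang"),
    ("lab@uni.example", "Li"),
]
print(sorted(shared_mailboxes(RECORDS)))
```

The surname threshold trades false alarms (an author who changed surname) against missed shared mailboxes, so it would need tuning on real data.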
What do others do?
•Cited "V S Saxena" but the profile reads "Vikram S
Saxena". This made it a tedious process to manually
resolve out the conflicts.

•For each pair of publication records, we compute all
basic features.

•They almost always recompute everything from scratch
What information is useful?
•Identifiers

•References to papers

•Affiliation (university)

•Publication venue (journal)

•Co-authors

•Name

•Paper text
Clustering
•Spatial

•k-means

•DBSCAN

•...

•Graph-based

•Modularity

•Clique detection

•...
Spatial clustering + graph clustering
How to combine?
•Distance/similarity → graph edges

•Graphs → coordinates

•Graph embedding

•Metric learning
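
The first direction, turning similarity into graph edges, can be sketched as follows: connect two items when their similarity reaches a threshold, then read clusters off the graph. This is a minimal illustration under assumptions: connected components stand in for a real graph-clustering step such as modularity optimization, Jaccard similarity over co-author sets is an arbitrary choice, and `cluster_by_similarity` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(items, similarity, threshold):
    """Link items whose similarity >= threshold; return connected components."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: fine for a sketch, too slow for millions of items.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


# Co-author sets of three papers signed "JL Smith" (invented example).
papers = [{"salamon", "lee"}, {"salamon", "lee", "kim"}, {"hatziioannou"}]
clusters = cluster_by_similarity(papers, jaccard, threshold=0.5)
print(clusters)
```

At the scale mentioned in the talk, candidates would first be grouped by something cheap, such as surname plus first initial, so that only items within a block are compared.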
Reality
•Do not merge Jurg and Jurgen into one author

•But if the e-mail or ORCID matches, merge

•And if there are two different ORCIDs, do not merge

•But...

•And what if...

•And new data arrives every day
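
The rules above can be written down as an ordered decision function. This is a sketch under assumptions: the slide does not state which rule wins when they conflict, so letting differing ORCIDs veto a matching e-mail is a choice made here, and `Mention`, `should_merge`, the surname, and the identifier values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """One author mention on one paper."""
    name: str
    orcid: Optional[str] = None
    email: Optional[str] = None


def should_merge(a, b):
    """Apply the slide's rules in an assumed order of precedence."""
    # 1. Two known ORCIDs decide the case either way.
    if a.orcid and b.orcid:
        return a.orcid == b.orcid
    # 2. A shared e-mail merges mentions even when the names differ.
    if a.email and b.email and a.email.lower() == b.email.lower():
        return True
    # 3. Otherwise different names stay apart (Jurg is not Jurgen).
    return a.name.lower() == b.name.lower()


jurg = Mention("Jurg Keller", email="jk@uni.example")
jurgen = Mention("Jurgen Keller", email="jk@uni.example")
print(should_merge(Mention("Jurg Keller"), Mention("Jurgen Keller")))  # False
print(should_merge(jurg, jurgen))                                      # True
```

Every "But..." on the slide becomes another branch here, and rule 2 is only as good as the e-mail data behind it, which is where hard-coded rules start giving way to scoring.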
New data
•The result must be stable

•More than a million new names per day

•100+ million "paper-level" authors in total

•Quick-and-dirty scoring with heuristics
Conclusions
•Knowing your data is priceless

•Edge cases can break everything

•Sources cannot be trusted

•Assorted hacks to make it work
Face-on-the-table-driven development
Thank you for your attention
Questions?
Vsevolod Solovyov
CTO at Prophy Science
@murkt

vsevolod.solovyov@gmail.com

Vsevolod Solovyov "Data science from the trenches"