«Big Data Analysis and Data Preparation Before Applying Machine Learning Methods»
Olga Lavrentieva
Sofia Parfenovich (Data Scientist at Altoros)
1.
© ALTOROS Systems | CONFIDENTIAL
Sofia Parfenovich, Data Science specialist
2013, Minsk
2.
• Typical data analysis tasks
• The data pre-processing problem and k-means
• Pre-processing methods
• Feature selection, and why you shouldn't listen to the client
3.
Clustering
• Recommendation systems
• Groups in social networks
• Image processing
Classification
• Credit risk
• Image processing
• Trading systems
• Biometrics
Regression
• Trading systems
• Business tasks
• Medicine
4.
Prediction
• Time series prediction
• Trading
• Internal business tasks
Pattern recognition
• Image processing
• Semantics
• Time series analysis
Anomaly detection
• Trading
• Monitoring systems
5.
Recommendation system
Customers: same purchase history
• {item1, item2, item3}
• {item1, item2, ??}
Items: similar features
• (for books): {author, genre, country, year, ...}
6.
How to solve?
• Gather data
• Use k-means for clustering
• Divide the data into clusters
• Recommend items from the same cluster
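The last step, recommending items from the same cluster, might look roughly like the sketch below. It assumes the clustering has already been done; the names `recommend`, `purchases` and `cluster_of`, as well as the client and item labels, are hypothetical and not taken from the slides.

```python
def recommend(customer, purchases, cluster_of):
    """Suggest items bought by cluster peers that `customer` has not bought yet."""
    own = purchases[customer]
    # Peers are the other customers that landed in the same cluster.
    peers = [c for c in purchases
             if c != customer and cluster_of[c] == cluster_of[customer]]
    suggestions = set()
    for peer in peers:
        suggestions |= purchases[peer] - own
    return sorted(suggestions)


# Hypothetical data: two customers with nearly the same purchase history.
purchases = {
    "client_a": {"item1", "item2", "item3"},
    "client_b": {"item1", "item2"},
    "client_c": {"item7"},
}
# Hypothetical cluster labels, e.g. produced by k-means.
cluster_of = {"client_a": 0, "client_b": 0, "client_c": 1}

print(recommend("client_b", purchases, cluster_of))
```

Here `client_b` shares a cluster with `client_a`, so the one item that only `client_a` bought is suggested.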
7.
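Algorithm:
• Select initial centroids
• Calculate the distance between the centroids and the points
• Form clusters by assigning each point to its nearest centroid
• Re-calculate the centroids
• Repeat the previous three steps until the clusters stop changing
• Enjoy the result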
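The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the speaker's implementation: the slide does not say how the initial centroids are chosen, so taking the first k points is an assumption made here to keep the example deterministic, and the function name `kmeans` and the sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Select initial centroids (assumption: simply the first k points).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Calculate distances and assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Re-calculate each centroid as the mean of its cluster members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In practice the initial centroids are usually picked at random or with a seeding scheme, and the result can depend on that choice.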
8.
Purchase history
Point: one client (a row of the table below)
Euclidean distance (over the columns shown):
(Client1, Client2) = 1
(Client2, Client3) = √3
(Client1, Client3) = √2

client ID   Item 1   Item 2   …   Item N-1   Item N
1           0        1        0   0          1
2           1        1        0   0          1
3           0        0        0   1          1

Item features
Point: one item (a row of the table below)
Euclidean distance:
(Item1, Item2) = 76786788
(Item2, Item3) = 67
(Item1, Item3) = 6757665567566
??!!

Item ID   F1 (author)   F2 (genre)   …   FN (year)
1         34354         12           …   1990
2         23            7            …   1898
3         5676          67           …   2013
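To see why the two tables behave so differently, the distances can be recomputed over the columns that are actually shown. The sketch below is illustrative only: the elided "…" columns are left out of the item table, the helper name `pairwise` is hypothetical, and the item distances it prints will therefore not match the placeholder figures on the slide.

```python
import math

# Purchase history: binary vectors, all features on the same 0/1 scale.
clients = {
    1: (0, 1, 0, 0, 1),
    2: (1, 1, 0, 0, 1),
    3: (0, 0, 0, 1, 1),
}

# Item features as shown: F1 (author), F2 (genre), FN (year).
items = {
    1: (34354, 12, 1990),
    2: (23, 7, 1898),
    3: (5676, 67, 2013),
}


def pairwise(points):
    """Euclidean distance for every unordered pair of rows."""
    ids = sorted(points)
    return {
        (a, b): math.dist(points[a], points[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


for pair, d in pairwise(clients).items():
    print("clients", pair, round(d, 3))
for pair, d in pairwise(items).items():
    print("items", pair, round(d, 1))
```

The client distances stay small and comparable, while the item distances are dominated by the F1 column, whose large values look like arbitrary identifiers rather than quantities.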
9.
Recommendation system
• Same purchase history: success!
• Similar features: fail!
• Similar features after data pre-processing or algorithm modification: success!
10.
Problem => Solution
Raw data
• Non-numeric data => Encoding
• Numeric data => Normalization (standardization)
• Missing values => Interpolation
Internal problems
• Outliers => Detecting and smoothing
• Noisy data => De-noising
• Uniform distribution => Changing the dimension of the data space
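The slide names "detecting and smoothing" for outliers without specifying a method. One common option, shown here purely as an assumed stand-in, is to flag values outside the interquartile-range fences and clip them to the nearest fence. The function name `clip_outliers`, the factor 1.5 and the sample data are hypothetical.

```python
import statistics


def clip_outliers(values, k=1.5):
    """Flag values outside the IQR fences and clip them to the fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flags = [not (low <= v <= high) for v in values]
    clipped = [min(max(v, low), high) for v in values]
    return flags, clipped


# Hypothetical data with one obvious outlier at the end.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
flags, clipped = clip_outliers(data)
print(flags)
print(clipped)
```

Clipping keeps the row in the data set; dropping the flagged rows or replacing them with a local median are alternatives with different trade-offs.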
11.
Encoding
– {"small", "medium", "big"} => {0, 1, 2} => {-1, 0, 1}
– {"paris", "london", "milan"} => {{1,0,0}, {0,1,0}, {0,0,1}}
Normalization
– [-100; 300] => [-1; 1] or [0; 1]
Standardization
– {mean(data) = a, std(data) = b} => {mean(data) = 0, std(data) = 1}
Interpolation
– {0.3, 0.5, 0, NaN, -0.2} => {0.3, 0.5, 0, -0.1, -0.2}
Outliers
Noise
Uniform distribution
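The transformations listed on this slide can be reproduced with a few short functions. This is a sketch using only the standard library; the function names are hypothetical, the standardization uses the population standard deviation (an assumption, since the slide does not say which one), and the interpolation is linear between the nearest known neighbours.

```python
import math
import statistics


def ordinal_encode(values, order, centered=False):
    """Map ordered categories to 0..n-1, optionally shifted to be centred on 0."""
    shift = (len(order) - 1) / 2 if centered else 0
    return [order.index(v) - shift for v in values]


def one_hot(values, categories):
    """Map unordered categories to indicator vectors."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max(values, lo=0.0, hi=1.0):
    """Rescale linearly to [lo, hi]; assumes the values are not all equal."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def standardize(values):
    """Shift and scale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def interpolate(values):
    """Fill NaN gaps linearly; leading or trailing NaNs are left untouched."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            out[i] = out[left] + t * (out[right] - out[left])
    return out


sizes = ["small", "medium", "big"]
cities = ["paris", "london", "milan"]
print(ordinal_encode(sizes, sizes), ordinal_encode(sizes, sizes, centered=True))
print(one_hot(cities, cities))
print(min_max([-100, 100, 300]), min_max([-100, 100, 300], -1.0, 1.0))
print(interpolate([0.3, 0.5, 0, float("nan"), -0.2]))
```

Ordinal codes suit categories with a natural order such as sizes; for unordered categories such as cities, one-hot vectors avoid implying that one city is "between" two others.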
12.
[Diagram] Feature selection: customer, initial data, method restrictions, preliminary data analysis, changing of the dimension
13.
Gain information
• Try to gain as much information as possible
Get some expert knowledge
• Ask what kind of results are expected
• Understand the main principles and the nature of the data
Don't eliminate features
• That is, no a priori elimination without any research
• Don't listen to the client and experts on this
Use preliminary data analysis
• Check whether the data match the problem
• Pre-processing?
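A preliminary analysis can start with something as simple as a per-feature summary, which already hints at whether pre-processing is needed before any feature is dropped. The sketch below is one possible starting point rather than the method from the talk: the function name `summarize` is hypothetical, the rows reuse the item-feature values shown earlier in the deck, and each column is assumed to contain at least one non-missing number.

```python
import math
import statistics


def summarize(rows, names):
    """Per-feature missing count, range, mean and spread."""
    report = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        present = [v for v in column if not math.isnan(v)]
        report[name] = {
            "missing": len(column) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": statistics.fmean(present),
            "std": statistics.pstdev(present),
        }
    return report


names = ["F1 (author)", "F2 (genre)", "FN (year)"]
rows = [(34354, 12, 1990), (23, 7, 1898), (5676, 67, 2013)]
for name, stats in summarize(rows, names).items():
    print(name, stats)
```

Features whose ranges differ by orders of magnitude, as here, are a sign that normalization or a different encoding is worth trying before deciding that a feature is useless.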
14.
Questions?