SlideShare a Scribd company logo
1 of 14
© ALTOROS Systems | CONFIDENTIAL
Parfenovich Sofia
Data Science specialist
2013, Minsk
© ALTOROS Systems | CONFIDENTIAL 2
Typical Data analysis task
Data pre-processing problem and k-means
Pre-processing methods
Feature selection and why You shouldn’t listen to
the client
© ALTOROS Systems | CONFIDENTIAL 3
• Recommendation
system
• Groups in social
networks
• Image processing
Clustering
• Credit risk
• Image processing
• Trade systems
• Biometrics
Classification
• Trade systems
• Business tasks
• Medicine
Regression
© ALTOROS Systems | CONFIDENTIAL 4
• Time series
prediction
• Trading
• Business internal
tasks
Prediction
• Image processing
• Semantics
• Time series
analysis
Pattern
recognition
• Trading
• Monitoring
systems
Anomalies
detection
© ALTOROS Systems | CONFIDENTIAL 5
Recommendation
system
Customer
Same purchase history:
{item1, item2, item3}
{item1, item2,??}
Items
Similar features:
(for books):
{author, genre, country, year..}
© ALTOROS Systems | CONFIDENTIAL 6
How to solve?
 Gain Data
 Use k-means for clustering
 Divide data into clusters
 Recommend items from the
same cluster
© ALTOROS Systems | CONFIDENTIAL 7
Algorithm:
 Select initial centroids
 Calculate the distance between
centroids and points
 Make clusters
 Re-calculate centroids
 Enjoy the result
© ALTOROS Systems | CONFIDENTIAL 8
Purchase history
Point:
Euclidian distance:
(Client1, Client2) =
(Client2, Client3) =
(Client1, Client3) =
Item features
Point:
Euclidian distance:
(Item1, Item2) = 76786788
(Item2, Item3) = 67
(Item1, Item3) = 6757665567566
??!!
client
ID
Item
1
Item
2
… Item
N-1
Item
N
1 0 1 0 0 1
2 1 1 0 0 1
3 0 0 0 1 1
2
Item
ID
F1
(author)
F2
(genre)
… FN
(year)
1 34354 12 … 1990
2 23 7 … 1898
3 5676 67 … 2013
3
1
© ALTOROS Systems | CONFIDENTIAL 9
Recommendation
system
Same purchase
history
Success!!!
Similar features
Fail!!!!
Data pre-
processing or
algorithm
modification
Success!!!
© ALTOROS Systems | CONFIDENTIAL 10
Problem:
Raw Data
 Non-numeric data
 Numeric data
 Missed values
Internal Problems
 Outliers
 Noisy data
 Uniform distribution
Solution:
Raw Data
 Encoding
 Normalization (Standardization)
 Interpolation
Internal Problems
 Detecting and smoothing
 De-noise
 Change dimension of the data space
© ALTOROS Systems | CONFIDENTIAL 11
 Encoding
– {“small”, “medium”, “big”} => {0,1,2} => {-1, 0, 1}
– {“paris”, “london”, “milan”} => {{1,0,0},{0,1,0},{0,0,1}}
 Normalization
– [-100; 300] => [-1: 1] or [0; 1]
 Standardization
– {mean(data) = a, std(data) = b} => {mean(data) = 0, std(data) = 1}
 Interpolation
– {0.3, 0.5, 0, NaN, -0.2} => {0.3, 0.5, 0, -0.1, -0.2}
Outliers
Noise
Uniform distribution
© ALTOROS Systems | CONFIDENTIAL 12
Customer
Initial data
Method
restrictions
Feature selection
Preliminary
data analysis
Changing of
the dimension
© ALTOROS Systems | CONFIDENTIAL 13
Gain
information
• Try to gain as much
information as possible
Get some
expert
knowledge
• Ask, what kind of results are expected
• Understand main principles and nature
of data
Don’t
eliminate
features
• Mean a priori elimination without any
research
• Don’t listen to client and experts
Use
preliminary
Data analysis
• Check weather data
match the problem
• Pre-processing?
© ALTOROS Systems | CONFIDENTIAL 14
Questions?

More Related Content

Similar to «Анализ больших данных и их подготовка перед применением методов машинного обучения»

¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?Denodo
 
Explainable AI with H2O Driverless AI's MLI module
Explainable AI with H2O Driverless AI's MLI moduleExplainable AI with H2O Driverless AI's MLI module
Explainable AI with H2O Driverless AI's MLI moduleMartin Dvorak
 
MUORO Analytics- Data Science Tool
MUORO Analytics-  Data Science ToolMUORO Analytics-  Data Science Tool
MUORO Analytics- Data Science ToolVyom Bhardwaj
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationSunderland City Council
 
Monomi: Practical Analytical Query Processing over Encrypted Data
Monomi: Practical Analytical Query Processing over Encrypted DataMonomi: Practical Analytical Query Processing over Encrypted Data
Monomi: Practical Analytical Query Processing over Encrypted DataShurenBi1
 
Retail Design
Retail DesignRetail Design
Retail Designjagishar
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise deteo
 
SPSS Modeler 16 What's New!?
SPSS Modeler 16 What's New!?SPSS Modeler 16 What's New!?
SPSS Modeler 16 What's New!?Chris Sparshott
 
Splunk Business Analytics
Splunk Business AnalyticsSplunk Business Analytics
Splunk Business AnalyticsCleverDATA
 
Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...
Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...
Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...mattdenesuk
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AIGary Allemann
 
3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive AnalyticsNandita Nityanandam
 
WWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big dataWWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big datawebwinkelvakdag
 

Similar to «Анализ больших данных и их подготовка перед применением методов машинного обучения» (20)

¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
 
Explainable AI with H2O Driverless AI's MLI module
Explainable AI with H2O Driverless AI's MLI moduleExplainable AI with H2O Driverless AI's MLI module
Explainable AI with H2O Driverless AI's MLI module
 
MUORO Analytics- Data Science Tool
MUORO Analytics-  Data Science ToolMUORO Analytics-  Data Science Tool
MUORO Analytics- Data Science Tool
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data Visualisation
 
Monomi: Practical Analytical Query Processing over Encrypted Data
Monomi: Practical Analytical Query Processing over Encrypted DataMonomi: Practical Analytical Query Processing over Encrypted Data
Monomi: Practical Analytical Query Processing over Encrypted Data
 
Data Science for Retail Broking
Data Science for Retail BrokingData Science for Retail Broking
Data Science for Retail Broking
 
Data Science for Retail Broking
Data Science for Retail BrokingData Science for Retail Broking
Data Science for Retail Broking
 
Retail Design
Retail DesignRetail Design
Retail Design
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise
 
AI in the Enterprise at Scale
AI in the Enterprise at ScaleAI in the Enterprise at Scale
AI in the Enterprise at Scale
 
SPSS Modeler 16 What's New!?
SPSS Modeler 16 What's New!?SPSS Modeler 16 What's New!?
SPSS Modeler 16 What's New!?
 
Splunk Business Analytics
Splunk Business AnalyticsSplunk Business Analytics
Splunk Business Analytics
 
Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...
Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...
Big Data, Physics, and the Industrial Internet: How Modeling & Analytics are ...
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AI
 
Data mining and its applications!
Data mining and its applications!Data mining and its applications!
Data mining and its applications!
 
3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics
 
WWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big dataWWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big data
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Image Processing as a Part of Big Data Initiatives
Image Processing as a Part of Big Data InitiativesImage Processing as a Part of Big Data Initiatives
Image Processing as a Part of Big Data Initiatives
 

More from Olga Lavrentieva

15 10-22 altoros-fact_sheet_st_v4
15 10-22 altoros-fact_sheet_st_v415 10-22 altoros-fact_sheet_st_v4
15 10-22 altoros-fact_sheet_st_v4Olga Lavrentieva
 
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive PerformanceСергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive PerformanceOlga Lavrentieva
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraOlga Lavrentieva
 
Владимир Иванов (Oracle): Java: прошлое и будущее
Владимир Иванов (Oracle): Java: прошлое и будущееВладимир Иванов (Oracle): Java: прошлое и будущее
Владимир Иванов (Oracle): Java: прошлое и будущееOlga Lavrentieva
 
Brug - Web push notification
Brug  - Web push notificationBrug  - Web push notification
Brug - Web push notificationOlga Lavrentieva
 
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"Olga Lavrentieva
 
Максим Жилинский: "Контейнеры: под капотом"
Максим Жилинский: "Контейнеры: под капотом"Максим Жилинский: "Контейнеры: под капотом"
Максим Жилинский: "Контейнеры: под капотом"Olga Lavrentieva
 
Александр Протасеня: "PayPal. Различные способы интеграции"
Александр Протасеня: "PayPal. Различные способы интеграции"Александр Протасеня: "PayPal. Различные способы интеграции"
Александр Протасеня: "PayPal. Различные способы интеграции"Olga Lavrentieva
 
Сергей Черничков: "Интеграция платежных систем в .Net приложения"
Сергей Черничков: "Интеграция платежных систем в .Net приложения"Сергей Черничков: "Интеграция платежных систем в .Net приложения"
Сергей Черничков: "Интеграция платежных систем в .Net приложения"Olga Lavrentieva
 
Антон Шемерей «Single responsibility principle в руби или почему instanceclas...
Антон Шемерей «Single responsibility principle в руби или почему instanceclas...Антон Шемерей «Single responsibility principle в руби или почему instanceclas...
Антон Шемерей «Single responsibility principle в руби или почему instanceclas...Olga Lavrentieva
 
Егор Воробьёв: «Ruby internals»
Егор Воробьёв: «Ruby internals»Егор Воробьёв: «Ruby internals»
Егор Воробьёв: «Ruby internals»Olga Lavrentieva
 
Андрей Колешко «Что не так с Rails»
Андрей Колешко «Что не так с Rails»Андрей Колешко «Что не так с Rails»
Андрей Колешко «Что не так с Rails»Olga Lavrentieva
 
Дмитрий Савицкий «Ruby Anti Magic Shield»
Дмитрий Савицкий «Ruby Anti Magic Shield»Дмитрий Савицкий «Ruby Anti Magic Shield»
Дмитрий Савицкий «Ruby Anti Magic Shield»Olga Lavrentieva
 
Сергей Алексеев «Парное программирование. Удаленно»
Сергей Алексеев «Парное программирование. Удаленно»Сергей Алексеев «Парное программирование. Удаленно»
Сергей Алексеев «Парное программирование. Удаленно»Olga Lavrentieva
 
«Почему Spark отнюдь не так хорош»
«Почему Spark отнюдь не так хорош»«Почему Spark отнюдь не так хорош»
«Почему Spark отнюдь не так хорош»Olga Lavrentieva
 
«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»
«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»
«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»Olga Lavrentieva
 
«Практика построения высокодоступного решения на базе Cloud Foundry Paas»
«Практика построения высокодоступного решения на базе Cloud Foundry Paas»«Практика построения высокодоступного решения на базе Cloud Foundry Paas»
«Практика построения высокодоступного решения на базе Cloud Foundry Paas»Olga Lavrentieva
 
«Дизайн продвинутых нереляционных схем для Big Data»
«Дизайн продвинутых нереляционных схем для Big Data»«Дизайн продвинутых нереляционных схем для Big Data»
«Дизайн продвинутых нереляционных схем для Big Data»Olga Lavrentieva
 
«Обзор возможностей Open cv»
«Обзор возможностей Open cv»«Обзор возможностей Open cv»
«Обзор возможностей Open cv»Olga Lavrentieva
 
«Нужно больше шин! Eventbus based framework vertx.io»
«Нужно больше шин! Eventbus based framework vertx.io»«Нужно больше шин! Eventbus based framework vertx.io»
«Нужно больше шин! Eventbus based framework vertx.io»Olga Lavrentieva
 

More from Olga Lavrentieva (20)

15 10-22 altoros-fact_sheet_st_v4
15 10-22 altoros-fact_sheet_st_v415 10-22 altoros-fact_sheet_st_v4
15 10-22 altoros-fact_sheet_st_v4
 
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive PerformanceСергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
 
Владимир Иванов (Oracle): Java: прошлое и будущее
Владимир Иванов (Oracle): Java: прошлое и будущееВладимир Иванов (Oracle): Java: прошлое и будущее
Владимир Иванов (Oracle): Java: прошлое и будущее
 
Brug - Web push notification
Brug  - Web push notificationBrug  - Web push notification
Brug - Web push notification
 
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
 
Максим Жилинский: "Контейнеры: под капотом"
Максим Жилинский: "Контейнеры: под капотом"Максим Жилинский: "Контейнеры: под капотом"
Максим Жилинский: "Контейнеры: под капотом"
 
Александр Протасеня: "PayPal. Различные способы интеграции"
Александр Протасеня: "PayPal. Различные способы интеграции"Александр Протасеня: "PayPal. Различные способы интеграции"
Александр Протасеня: "PayPal. Различные способы интеграции"
 
Сергей Черничков: "Интеграция платежных систем в .Net приложения"
Сергей Черничков: "Интеграция платежных систем в .Net приложения"Сергей Черничков: "Интеграция платежных систем в .Net приложения"
Сергей Черничков: "Интеграция платежных систем в .Net приложения"
 
Антон Шемерей «Single responsibility principle в руби или почему instanceclas...
Антон Шемерей «Single responsibility principle в руби или почему instanceclas...Антон Шемерей «Single responsibility principle в руби или почему instanceclas...
Антон Шемерей «Single responsibility principle в руби или почему instanceclas...
 
Егор Воробьёв: «Ruby internals»
Егор Воробьёв: «Ruby internals»Егор Воробьёв: «Ruby internals»
Егор Воробьёв: «Ruby internals»
 
Андрей Колешко «Что не так с Rails»
Андрей Колешко «Что не так с Rails»Андрей Колешко «Что не так с Rails»
Андрей Колешко «Что не так с Rails»
 
Дмитрий Савицкий «Ruby Anti Magic Shield»
Дмитрий Савицкий «Ruby Anti Magic Shield»Дмитрий Савицкий «Ruby Anti Magic Shield»
Дмитрий Савицкий «Ruby Anti Magic Shield»
 
Сергей Алексеев «Парное программирование. Удаленно»
Сергей Алексеев «Парное программирование. Удаленно»Сергей Алексеев «Парное программирование. Удаленно»
Сергей Алексеев «Парное программирование. Удаленно»
 
«Почему Spark отнюдь не так хорош»
«Почему Spark отнюдь не так хорош»«Почему Spark отнюдь не так хорош»
«Почему Spark отнюдь не так хорош»
 
«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»
«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»
«Cassandra data modeling – моделирование данных для NoSQL СУБД Cassandra»
 
«Практика построения высокодоступного решения на базе Cloud Foundry Paas»
«Практика построения высокодоступного решения на базе Cloud Foundry Paas»«Практика построения высокодоступного решения на базе Cloud Foundry Paas»
«Практика построения высокодоступного решения на базе Cloud Foundry Paas»
 
«Дизайн продвинутых нереляционных схем для Big Data»
«Дизайн продвинутых нереляционных схем для Big Data»«Дизайн продвинутых нереляционных схем для Big Data»
«Дизайн продвинутых нереляционных схем для Big Data»
 
«Обзор возможностей Open cv»
«Обзор возможностей Open cv»«Обзор возможностей Open cv»
«Обзор возможностей Open cv»
 
«Нужно больше шин! Eventbus based framework vertx.io»
«Нужно больше шин! Eventbus based framework vertx.io»«Нужно больше шин! Eventbus based framework vertx.io»
«Нужно больше шин! Eventbus based framework vertx.io»
 

Recently uploaded

الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهMohamed Sweelam
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfalexjohnson7307
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTopCSSGallery
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfAnubhavMangla3
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxjbellis
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 

Recently uploaded (20)

الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 

«Анализ больших данных и их подготовка перед применением методов машинного обучения»

  • 1. © ALTOROS Systems | CONFIDENTIAL Parfenovich Sofia Data Science specialist 2013, Minsk
  • 2. © ALTOROS Systems | CONFIDENTIAL 2 Typical Data analysis task Data pre-processing problem and k-means Pre-processing methods Feature selection and why You shouldn’t listen to the client
  • 3. © ALTOROS Systems | CONFIDENTIAL 3 • Recommendation system • Groups in social networks • Image processing Clustering • Credit risk • Image processing • Trade systems • Biometrics Classification • Trade systems • Business tasks • Medicine Regression
  • 4. © ALTOROS Systems | CONFIDENTIAL 4 • Time series prediction • Trading • Business internal tasks Prediction • Image processing • Semantics • Time series analysis Pattern recognition • Trading • Monitoring systems Anomalies detection
  • 5. © ALTOROS Systems | CONFIDENTIAL 5 Recommendation system Customer Same purchase history: {item1, item2, item3} {item1, item2,??} Items Similar features: (for books): {author, genre, country, year..}
  • 6. © ALTOROS Systems | CONFIDENTIAL 6 How to solve?  Gain Data  Use k-means for clustering  Divide data into clusters  Recommend items from the same cluster
  • 7. © ALTOROS Systems | CONFIDENTIAL 7 Algorithm:  Select initial centroids  Calculate the distance between centroids and points  Make clusters  Re-calculate centroids  Enjoy the result
  • 8. © ALTOROS Systems | CONFIDENTIAL 8 Purchase history Point: Euclidian distance: (Client1, Client2) = (Client2, Client3) = (Client1, Client3) = Item features Point: Euclidian distance: (Item1, Item2) = 76786788 (Item2, Item3) = 67 (Item1, Item3) = 6757665567566 ??!! client ID Item 1 Item 2 … Item N-1 Item N 1 0 1 0 0 1 2 1 1 0 0 1 3 0 0 0 1 1 2 Item ID F1 (author) F2 (genre) … FN (year) 1 34354 12 … 1990 2 23 7 … 1898 3 5676 67 … 2013 3 1
  • 9. © ALTOROS Systems | CONFIDENTIAL 9 Recommendation system Same purchase history Success!!! Similar features Fail!!!! Data pre- processing or algorithm modification Success!!!
  • 10. © ALTOROS Systems | CONFIDENTIAL 10 Problem: Raw Data  Non-numeric data  Numeric data  Missed values Internal Problems  Outliers  Noisy data  Uniform distribution Solution: Raw Data  Encoding  Normalization (Standardization)  Interpolation Internal Problems  Detecting and smoothing  De-noise  Change dimension of the data space
  • 11. © ALTOROS Systems | CONFIDENTIAL 11  Encoding – {“small”, “medium”, “big”} => {0,1,2} => {-1, 0, 1} – {“paris”, “london”, “milan”} => {{1,0,0},{0,1,0},{0,0,1}}  Normalization – [-100; 300] => [-1: 1] or [0; 1]  Standardization – {mean(data) = a, std(data) = b} => {mean(data) = 0, std(data) = 1}  Interpolation – {0.3, 0.5, 0, NaN, -0.2} => {0.3, 0.5, 0, -0.1, -0.2} Outliers Noise Uniform distribution
  • 12. © ALTOROS Systems | CONFIDENTIAL 12 Customer Initial data Method restrictions Feature selection Preliminary data analysis Changing of the dimension
  • 13. © ALTOROS Systems | CONFIDENTIAL 13 Gain information • Try to gain as much information as possible Get some expert knowledge • Ask, what kind of results are expected • Understand main principles and nature of data Don’t eliminate features • Mean a priori elimination without any research • Don’t listen to client and experts Use preliminary Data analysis • Check weather data match the problem • Pre-processing?
  • 14. © ALTOROS Systems | CONFIDENTIAL 14 Questions?