2. Talk outline
Social networks
  community detection
Search advertising
  market segmentation
Internet radio
  recommender system
Mathematical model:
  Graph
  Clustering (graph algorithms)
3. Social networks
A social network is a social structure made up of a group of nodes, which are social objects (people or organizations), and the ties between them (social relationships). (Wikipedia)
Internet (2000 - ...)
  MySpace (300 million), Facebook (50 million), Friendster, ...
  Odnoklassniki (11 million), VKontakte (7 million), Moi Krug, ...
Mathematical representation: a graph G(V, E)
  Vertex set V: "people"
  Edge set E: "relationships"
  Directed / undirected
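The graph representation G(V, E) on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (`build_graph`, `ann`, `bob`, ...) and the toy edge list are hypothetical and do not come from the talk. An edge (u, v) is read as "u lists v as a contact"; the `directed` flag switches between the directed and undirected views.

```python
from collections import defaultdict


def build_graph(edges, directed=True):
    """Return a dict mapping each vertex to the set of its neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v]  # touch v so that vertices with no outgoing edge still appear
        if not directed:
            adj[v].add(u)
    return dict(adj)


# Hypothetical toy data: V = people, E = relationships.
edges = [("ann", "bob"), ("bob", "ann"), ("bob", "cat"), ("dan", "ann")]

directed_graph = build_graph(edges, directed=True)
undirected_graph = build_graph(edges, directed=False)

# In the directed view "cat" has no outgoing edges; in the undirected view
# the bob-cat relationship is visible from both ends.
num_vertices = len(directed_graph)
num_directed_edges = sum(len(nbrs) for nbrs in directed_graph.values())
```

An adjacency-set mapping is chosen here because membership tests and neighbour iteration are both cheap; an adjacency matrix would be an equally valid choice for small, dense graphs.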
4. Possible research topics
Structure analysis
  identifying user roles
  network development and growth
  community detection
Processes on the network
  information diffusion
  influence propagation
  network economics
Advertising and monetization
10. Flickr: statistics
number of nodes (users) = 584,207
number of edges (links) = 3,555,115
maximum node in-degree = 3,531
maximum node out-degree = 8,976
<node in-degree> = <node out-degree> = 6 (mean values)
graph diameter = 18
average path length = 5.3
number of strongly connected components = 152,324
largest strongly connected components = 274,649 : 374 : 186 : 155 : …
number of connected components = 43,189
largest connected components = 404,893 : 378 : 112 : 108 : …
maximum core (core number) = 249 (size 668)
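Several of the statistics on this slide can be reproduced on a toy directed graph with the standard library alone. The sketch below is not the code used for the Flickr numbers; the function names and the edge list are hypothetical. It also shows why the mean in-degree and mean out-degree coincide: every edge adds one to some out-degree and one to some in-degree, so both means equal |E| / |V| (roughly 6.1 for the figures above). Strongly connected components are found with Kosaraju's two-pass algorithm, written iteratively to avoid recursion limits; diameter, path length, and core number are left out.

```python
from collections import defaultdict, deque


def degree_stats(edges):
    """Node/edge counts, maximum in/out degree, and mean degree |E| / |V|."""
    nodes = {x for edge in edges for x in edge}
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "max_in": max(indeg.values()),
        "max_out": max(outdeg.values()),
        "mean_degree": len(edges) / len(nodes),
    }


def weak_component_sizes(edges):
    """Sizes of connected components when edge direction is ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            x = queue.popleft()
            size += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def strong_component_sizes(edges):
    """Sizes of strongly connected components (Kosaraju, iterative)."""
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))
    # Pass 1: record vertices in order of DFS finishing time on the graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            x, neighbours = stack[-1]
            for y in neighbours:
                if y not in seen:
                    seen.add(y)
                    stack.append((y, iter(fwd[y])))
                    break
            else:  # all neighbours explored: x is finished
                order.append(x)
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing finishing time.
    seen, sizes = set(), []
    for start in reversed(order):
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            x = stack.pop()
            size += 1
            for y in rev[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical toy graph: a 3-cycle with a tail, plus a separate pair.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)]
stats = degree_stats(toy_edges)
weak_sizes = weak_component_sizes(toy_edges)
strong_sizes = strong_component_sizes(toy_edges)
```

On the toy graph the weak components are larger than the strong ones, which mirrors the slide: the largest connected component (404,893) exceeds the largest strongly connected component (274,649).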
11. Scale-free (complex) networks
Power-law distribution of node degrees
Slowly growing average distance between nodes (small world)
High clustering coefficient
Presence of a giant connected component
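Two of the properties listed here, the clustering coefficient and the average distance between nodes, are easy to compute on a small undirected graph. The sketch below assumes the common local definition of clustering (the fraction of a node's neighbour pairs that are themselves linked, averaged over all nodes, with nodes of degree below 2 counted as 0); the slide does not say which definition the talk uses. The function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # fewer than two neighbours: coefficient taken as 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
clustering = average_clustering(toy)
path_length = average_path_length(toy)
```

Running one BFS per node costs O(|V| * (|V| + |E|)), which is fine for a toy example but would need sampling or a dedicated library on a graph of Flickr's size.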
12. Scale-free networks
[Figure: node degree distribution in two panels, the probability distribution function (PDF) and the cumulative distribution function (CDF); horizontal axis: node degree.]
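The two quantities plotted on this slide, the PDF and the CDF of node degree, can be estimated from a plain list of degrees. The sketch below is a minimal illustration; `degree_pdf_cdf` and the sample degrees are hypothetical, and the CDF is taken as P(degree <= k), which is an assumption since the slide does not state whether the plain or the complementary form is plotted.

```python
from collections import Counter


def degree_pdf_cdf(degrees):
    """Empirical PDF P(degree = k) and CDF P(degree <= k), keyed by k."""
    n = len(degrees)
    counts = Counter(degrees)
    pdf, cdf, running = {}, {}, 0
    for k in sorted(counts):
        pdf[k] = counts[k] / n
        running += counts[k]
        cdf[k] = running / n
    return pdf, cdf


# Hypothetical degree sequence: many low-degree nodes and one hub.
degrees = [1, 1, 1, 1, 2, 2, 3, 8]
pdf, cdf = degree_pdf_cdf(degrees)
complementary_cdf = {k: 1.0 - value for k, value in cdf.items()}
```

For a power-law degree distribution both the PDF and the complementary CDF appear as roughly straight lines when plotted on log-log axes, which is why such plots are a common first check for scale-free behaviour.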
13. Scale-free networks
[Figure: two panels plotting node degree against node number, with nodes sorted by in-degree in one panel and by out-degree in the other.]