This document discusses representing data as vectors. It explains that vectors are simply ordered lists of numbers, and gives examples of representing human body measurements and image pixels as vectors of various dimensions. Higher-dimensional vectors can encode complex data such as images, time series, and survey responses, although visualizing them in coordinate systems becomes more abstract as the dimension grows. The key point is that vectors provide a unified way to represent diverse types of data.
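As a minimal sketch of this idea (the specific measurements and image size below are hypothetical illustrations, not taken from the document), a few numbers describing one person, or the pixels of a small image, can each be stored as a single vector:

```python
import numpy as np

# Hypothetical example: height (cm), weight (kg), waist (cm) as one 3-D vector.
person = np.array([172.0, 65.5, 80.2])

# A grayscale image is just a longer vector: an 8x8 image becomes a 64-D vector.
image = np.random.rand(8, 8)             # toy 8x8 "image" with values in [0, 1]
image_vector = image.reshape(-1)         # flatten to a 64-dimensional vector
print(person.shape, image_vector.shape)  # (3,) (64,)
```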
The document discusses distances between data points and similarity measures in data analysis. It introduces distance as a quantitative measure of how different two data points are, with smaller distances indicating greater similarity. Distances are useful for tasks such as clustering, anomaly detection, recognition, and measuring approximation error. The most common measure, Euclidean distance, is explained for vectors of any dimension using the geometric concept of a norm. Caution is advised when computing distances between data whose dimensions have very different scales.
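A small sketch of the Euclidean distance as described; the per-dimension standardization shown at the end is only one common way to address the scaling caveat, not necessarily the method the document uses:

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance ||x - y|| between two vectors of any dimension."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

a = np.array([170.0, 60.0])
b = np.array([165.0, 72.0])
print(euclidean_distance(a, b))

# Scaling caveat: if dimensions have very different scales, standardize each
# dimension (zero mean, unit variance) before computing distances.
data = np.array([[170.0, 60.0], [165.0, 72.0], [180.0, 80.0]])
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(euclidean_distance(standardized[0], standardized[1]))
```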
The document discusses measures of central tendency, including the mean, median, and weighted mean. It explains how to calculate the arithmetic mean of a data set by summing all values and dividing by the number of values; for vector data, the mean is taken in each dimension separately. The mean, however, can be strongly influenced by outliers. The median, by contrast, is the middle value of the sorted data and is robust to outliers. The document provides examples to illustrate these concepts.
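A brief sketch of these three quantities (the numbers are made up to show the outlier effect, not examples from the document):

```python
import numpy as np

values = np.array([2.0, 3.0, 3.0, 4.0, 100.0])   # 100.0 is an outlier

print(values.mean())        # mean is pulled toward the outlier (22.4)
print(np.median(values))    # median stays at the middle sorted value (3.0)

# Weighted mean: each value contributes in proportion to its weight.
weights = np.array([1, 1, 1, 1, 0.1])
print(np.average(values, weights=weights))

# For vector data, the mean is taken per dimension (column-wise).
points = np.array([[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]])
print(points.mean(axis=0))  # [ 3. 14.]
```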
1. The document discusses principal component analysis (PCA) and explains how it can be used to determine the "true dimension" of vector data distributions.
2. PCA works by finding an orthogonal basis (the principal components) that best describes the variance in high-dimensional data, with the first principal component capturing as much of the variance as possible.
3. The length of each principal component vector indicates its importance: longer vectors correspond to more variance in the data. Analyzing the variances along the principal components gives insight into the shape of the distribution and its true dimension, as in the sketch below.
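The following is a minimal PCA sketch using an eigendecomposition of the covariance matrix; the toy data set (2-D points that are essentially 1-D) is an assumed illustration of how the component variances reveal the "true dimension":

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points that are essentially 1-D, embedded in 2-D with small noise.
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors = principal components
order = np.argsort(eigvals)[::-1]       # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The fraction of variance along each component hints at the "true dimension":
# here almost all of it lies along the first principal component.
print(eigvals / eigvals.sum())
```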
The document discusses pattern recognition and classification. It defines pattern recognition as determining what something is from data such as images, audio, or text, and gives common examples like image recognition and speech recognition. It notes that while pattern recognition comes easily to humans, it is difficult for computers, which lack humans' unconscious, high-speed, high-accuracy recognition abilities. The basic principle of computer-based pattern recognition is to classify an input into one of a set of predefined classes according to its similarity to training examples.
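Nearest-neighbor classification is one concrete instance of "classify by similarity to training examples"; the document does not prescribe this exact algorithm, so the sketch below is an assumed illustration:

```python
import numpy as np

def nearest_neighbor_classify(x, train_X, train_y):
    """Assign x the class of the closest training example (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x, axis=1)
    return train_y[np.argmin(dists)]

train_X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
train_y = np.array(["class_A", "class_A", "class_B", "class_B"])
print(nearest_neighbor_classify(np.array([4.5, 5.2]), train_X, train_y))  # class_B
```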
This document discusses using linear algebra concepts to analyze data. Vectors can represent data, with each component corresponding to a different attribute or variable, and the component value giving the amount of that attribute. A vector can be decomposed into a sum of basis vectors weighted by its components, and recomposed from those weights; the amount of each attribute can therefore be computed as the inner product of the vector with the corresponding basis vector. In this way, linear algebra provides a powerful framework for understanding and analyzing complex, multi-dimensional data.
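A minimal sketch of this decompose-and-recompose relationship, assuming an orthonormal basis (the standard basis of R^3 is used here purely for illustration):

```python
import numpy as np

# Orthonormal basis of R^3 (here simply the standard basis; any orthonormal
# basis works the same way).
e1, e2, e3 = np.eye(3)

v = np.array([2.0, -1.0, 4.0])

# The "amount" of each attribute is the inner product with its basis vector...
c1, c2, c3 = v @ e1, v @ e2, v @ e3

# ...and the vector is recomposed as the sum of components times basis vectors.
reconstructed = c1 * e1 + c2 * e2 + c3 * e3
print(np.allclose(v, reconstructed))   # True
```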
This document discusses non-structured data analysis, focusing on image data. It defines structured and non-structured data, giving images, text, and audio as examples of non-structured data. Images are described as high-dimensional vectors generated by analog-to-digital conversion via sampling and quantization. Various types of image data and analysis tasks are introduced, including image recognition, computer vision, feature extraction, and image compression. Image processing techniques such as filtering and binarization are also briefly covered.
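A short sketch of the two image processing operations mentioned, binarization and a simple smoothing filter, on a toy random "image" (the threshold and filter size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((6, 6))              # toy grayscale image, values in [0, 1]

# Binarization: pixels above a threshold become 1 (white), others 0 (black).
binary = (image > 0.5).astype(np.uint8)

# Simple 3x3 mean filter (smoothing) applied to the interior pixels.
filtered = image.copy()
for i in range(1, 5):
    for j in range(1, 5):
        filtered[i, j] = image[i - 1:i + 2, j - 1:j + 2].mean()

print(binary)
```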
This document introduces artificial intelligence (AI) and discusses examples of AI in everyday life. It defines AI as machines that mimic human intelligence and notes that most current AI is specialized, or "weak," AI that can only perform specific tasks rather than exhibit general human-level intelligence. Examples include voice recognition, chatbots, facial recognition, image recognition for medical diagnosis, recommendation systems, AI in games such as Go, and business applications such as sharing economies and customer monitoring.
This document provides an introduction to probability and probability distributions for data analysis. Like histograms, probability helps describe how data are distributed. A probability distribution describes how likely a random variable is to take each possible value, and can be discrete (for a countable set of possible values) or continuous (for a continuum of values). Key distributions such as the normal distribution are fundamental to many statistical analyses and machine learning techniques, and they allow a data distribution to be expressed by a mathematical formula parameterized by a few values.
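As a minimal illustration of "a few parameters describe the whole distribution", the sketch below writes out the normal density and estimates its two parameters from samples (the sample mean and standard deviation are used as simple estimates; this is an assumed example, not the document's own):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Two parameters (mean and standard deviation) summarize the whole distribution.
mu_hat, sigma_hat = samples.mean(), samples.std()
print(mu_hat, sigma_hat)                 # close to 5.0 and 2.0
print(normal_pdf(5.0, mu_hat, sigma_hat))
```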
The document discusses predictive modeling and regression analysis. Predictive modeling involves collecting data, fitting a model to that data, and then using the model to predict outcomes for new inputs. Regression analysis specifically models the relationship between input and output variables so that outputs can be predicted for new inputs. The document gives the example of using linear regression to predict exam scores from study hours, and explains that model fitting aims to minimize the sum of squared errors between predicted and actual output values.
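A minimal least-squares sketch of the study-hours example; the numbers below are made up for illustration, not taken from the document:

```python
import numpy as np

# Hypothetical data: study hours (input) and exam scores (output).
hours  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 58.0, 65.0, 71.0, 78.0])

# Fit score = a * hours + b by minimizing the sum of squared errors.
A = np.column_stack([hours, np.ones_like(hours)])
(a, b), *_ = np.linalg.lstsq(A, scores, rcond=None)

print(a, b)                                # fitted slope and intercept
print(a * 6.0 + b)                         # predicted score for a new input (6 hours)
print(np.sum((A @ [a, b] - scores) ** 2))  # sum of squared errors on the data
```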
This document discusses clustering and anomaly detection in data science. Clustering groups a set of data into clusters so that data within each cluster are more similar to each other than to data in other clusters. The k-means algorithm is described in detail: it iteratively assigns each data point to the closest cluster centroid and then updates the centroids. Other clustering algorithms, such as k-medoids and hierarchical clustering, are briefly mentioned. The document then explains how anomaly detection, which identifies outliers that differ from expected patterns, can be performed by measuring distances between data points, and gives example applications.
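A plain k-means sketch with a simple distance-based anomaly flag at the end; the synthetic two-blob data and the anomaly threshold are assumptions for illustration, and this simple version assumes no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)

# Distance-based anomaly detection: points far from their nearest centroid
# are flagged as anomalies.
d_to_centroid = np.linalg.norm(X - centroids[labels], axis=1)
print(np.where(d_to_centroid > 2.0)[0])   # indices of unusually distant points
```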