The introduction to my class on machine learning. The subjects covered in this class include:
1.- Linear Classifiers
2.- Non Linear Classifiers
3.- Graphical Models
4.- Clustering
5.- Etc
I am planning to upload the rest once I feel they are up to standard.
Here is the basic introduction to probability used in my Analysis of Algorithms course at Cinvestav Guadalajara. The slides go from the basic axioms to expected value and variance.
In most of the algorithms analyzed so far, we have been studying problems solvable in polynomial time. The class P consists of the problems that, on inputs of size n, can be solved in worst-case running time O(n^k) for some constant k. Thus, informally, the problems outside P are the ones that cannot be solved in O(n^k) for any constant k; note that NP actually stands for "nondeterministic polynomial time" rather than "non-polynomial".
Evaporation is the process by which a liquid is converted into vapor below the boiling point. There are four main factors that affect the rate of evaporation: temperature, humidity, wind speed, and surface area. Higher temperatures, lower humidity, stronger winds, and larger surface areas of liquid all increase the rate at which evaporation occurs.
This document provides an overview of tree data structures and binary trees. It begins by defining trees and their basic concepts such as subtrees, leaves, levels, and roots. It then defines binary trees and contrasts them with general trees. The document discusses calculating the height of full binary trees and using trees to represent arithmetic expressions. It also covers traversing trees and different ways of representing trees in memory.
Here are my slides on some basic algorithms in Computational Geometry:
1.- Line Intersection
2.- Sweeping Line
3.- Convex Hull
They are the classic ones, but there is still a lot to study for anybody wanting to get into computer graphics. I recommend:
Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. 2008. Computational Geometry: Algorithms and Applications (3rd ed.). TELOS, Santa Clara, CA, USA.
Here are my slides for my preparation class for prospective Master's students in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
This document discusses cluster validity, which is a method for quantitatively evaluating the results of a clustering algorithm. Cluster validity is important because clustering algorithms sometimes impose structure on data even when no natural clusters exist. The document outlines different techniques for cluster validity testing, including hypothesis testing, Monte Carlo techniques, and bootstrapping. It also discusses the concept of a power function, which compares the effectiveness of different statistical tests for validating clustering results. The overall goal of cluster validity is to determine the appropriate number of clusters and whether the data exhibits a genuine clustering structure.
This document summarizes properties of noble gases including their atomic radii, boiling points, melting points, electronegativities, ionization energies, common uses, and abundance in Earth's crust. It shows that noble gases are nonreactive, have complete valence shells, high ionization energies, and low electronegativities and boiling points, with all being gases at room temperature. Helium is used in balloons and deep sea diving, neon in liquid air, argon in lighting, krypton in lighting, xenon in powerful lamps and bubble chambers, and radon in cancer treatment.
The document provides information for an introductory chemistry unit titled "Matter and Measurement". It includes:
1) Learning objectives around systems being organised and developing methods for classification, measurement, and hypothesis testing.
2) Details of assessment tasks involving a unit test, science communication activities, and laboratory experiments.
3) An orientation to lab safety rules and equipment.
4) An assignment for students to create a science demonstration on water changes of state for younger students.
5) Guidance on the scientific method and variables to consider in experimentation.
This document discusses important issues in machine learning for data mining, including the bias-variance dilemma. It explains that the difference between the optimal regression and a learned model can be measured by looking at bias and variance. Bias measures the error between the expected output of the learned model and the optimal regression, while variance measures the error between the learned model's output and its expected output. There is a tradeoff between bias and variance: decreasing one typically increases the other. This is known as the bias-variance dilemma. Cross-validation and confusion matrices are also introduced as evaluation techniques.
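For reference, the decomposition behind this dilemma is the standard identity below, writing h(x) for the optimal regression and y(x; D) for the model learned from a dataset D (the notation is mine, not necessarily the slides'):

```latex
\mathbb{E}_{D}\!\left[(y(x;D)-h(x))^{2}\right]
= \underbrace{\left(\mathbb{E}_{D}[y(x;D)]-h(x)\right)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}_{D}\!\left[\left(y(x;D)-\mathbb{E}_{D}[y(x;D)]\right)^{2}\right]}_{\text{variance}}
```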
This document provides an introduction to the Expectation Maximization (EM) algorithm. EM is used to estimate parameters in statistical models when data is incomplete or has missing values. It is a two-step process: 1) Expectation step (E-step), where the expected value of the log likelihood is computed using the current estimate of parameters; 2) Maximization step (M-step), where the parameters are re-estimated to maximize the expected log likelihood found in the E-step. EM is commonly used for problems like clustering with mixture models and hidden Markov models. Applications of EM discussed include clustering data using mixtures of Gaussian distributions, and training hidden Markov models for natural language processing tasks. The derivation of the EM algorithm is also outlined.
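As a companion to that summary, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture, assuming NumPy; the function name and the initialization strategy are illustrative choices of mine, not taken from the document:

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: returns weights, means, variances."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                    # mixing weights
    mu = rng.choice(x, size=k, replace=False)  # initial means: random data points
    var = np.full(k, np.var(x))                # initial variances
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to maximize the expected log likelihood
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Toy data: two well-separated clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_gmm_1d(x))   # means near 0 and 5
```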
Here are my slides for my preparation class for prospective students of the Master's in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
This document discusses similarity and dissimilarity measures used for clustering data. It defines hard partitional clustering as seeking a K-partition of data points such that each cluster is non-empty, their union is the entire data set, and clusters are disjoint. Hierarchical clustering builds a nested structure of partitions. The document outlines different types of features, measures, and levels used to define similarity between data points for clustering algorithms.
This document provides an overview of machine learning decision trees. It discusses how decision trees work by applying a sequence of simple decision rules to divide data into progressively smaller and more homogeneous groups. Decision trees can be used for classification or regression problems. The document focuses on ordinary binary classification trees, which use binary questions of the form "is attribute x less than or equal to a?" to split data. It explains key decision tree concepts like nodes, branches, and leaves and discusses algorithms for training decision trees by selecting the optimal attribute to test at each node based on criteria like probabilistic impurity.
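To make the splitting criterion concrete, here is a small sketch (my own illustration, not code from the document) that scores binary questions of the form "is attribute x less than or equal to a?" by the weighted Gini impurity of the resulting children:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Scan thresholds 'is x <= a?' and return the one minimizing
    the weighted Gini impurity of the two children."""
    best_a, best_score = None, float("inf")
    for a in np.unique(x)[:-1]:   # last value would give an empty right child
        left, right = y[x <= a], y[x > a]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_a, best_score = a, score
    return best_a, best_score

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))   # splits cleanly at a = 3.0 with impurity 0
```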
Here, we look at the problem of going from a source s to possibly multiple destinations. Along the way, each of the Lemmas, Theorems, and Corollaries used to prove the properties of
1. Bellman-Ford
2. Dijkstra
are examined in detail.
Dealing with the NP Problems: Exponential Search and Approximation Algorithms, by Andres Mendez-Vazquez
This document discusses different approaches for dealing with NP-complete problems, including intelligent exhaustive search techniques like backtracking and branch-and-bound, as well as approximation algorithms. It provides an example of how backtracking can prune portions of the search space when solving Boolean satisfiability problems. The key decisions in backtracking are choosing which subproblem to expand next and which variable to branch on. The backtracking test checks subproblems for failure, success or uncertainty.
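A minimal sketch of that backtracking scheme for Boolean satisfiability, assuming clauses given as lists of signed integers (e.g. -2 means "not x2"); the representation and the naive branching rule are simplifications of mine:

```python
def check(clauses, assign):
    """Backtracking test: 'unsat' if some clause is already false,
    'sat' if every clause is already true, else 'unknown'."""
    status = "sat"
    for clause in clauses:
        true = any(assign.get(abs(l)) == (l > 0) for l in clause)
        undecided = any(abs(l) not in assign for l in clause)
        if not true and not undecided:
            return "unsat"        # failure: prune this whole subtree
        if not true:
            status = "unknown"
    return status

def backtrack(clauses, variables, assign=None):
    assign = assign or {}
    s = check(clauses, assign)
    if s != "unknown":
        return assign if s == "sat" else None
    v = next(x for x in variables if x not in assign)   # variable to branch on
    for value in (True, False):
        result = backtrack(clauses, variables, {**assign, v: value})
        if result is not None:
            return result
    return None

# (x1 or not x2) and (not x1 or x2) and (x1 or x2)
clauses = [[1, -2], [-1, 2], [1, 2]]
print(backtrack(clauses, [1, 2]))   # e.g. {1: True, 2: True}
```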
Combining Models - In these slides, we look at a way to combine the answers from various weak classifiers to build a robust classifier. The slides cover the following subjects:
1.- Model Combination Vs Bayesian Model
2.- Bootstrap Data Sets
And the cherry on top: AdaBoost (a rough sketch follows below).
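As an illustration of the boosting idea, here is a small AdaBoost sketch with 1-D decision stumps; it is my own toy example under those assumptions, not the slides' derivation:

```python
import numpy as np

def stump_predict(x, threshold, polarity):
    return polarity * np.where(x <= threshold, 1, -1)

def fit_stump(x, y, w):
    """Pick the decision stump minimizing the weighted error."""
    best = (None, 1, np.inf)   # (threshold, polarity, error)
    for t in np.unique(x):
        for p in (1, -1):
            err = np.sum(w[stump_predict(x, t, p) != y])
            if err < best[2]:
                best = (t, p, err)
    return best

def adaboost(x, y, rounds=10):
    n = len(x)
    w = np.full(n, 1.0 / n)          # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        t, p, err = fit_stump(x, y, w)
        err = max(err, 1e-12)        # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        pred = stump_predict(x, t, p)
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, t, p))
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return np.sign(score)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, 1, -1, -1, -1])
model = adaboost(x, y)
print(predict(model, x))   # recovers the labels
```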
This document discusses dimensionality reduction techniques for machine learning. It introduces Fisher Linear Discriminant analysis, which seeks projection directions that maximize separation between classes while minimizing within-class variance. It describes using the means and scatter measures of each class to define a cost function that is maximized to find the optimal projection direction. Principal Component Analysis is also briefly mentioned as another technique for dimensionality reduction.
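For reference, the two-class Fisher criterion mentioned above is usually written as follows, with S_B and S_W the between-class and within-class scatter matrices and m_1, m_2 the class means:

```latex
J(\mathbf{w})=\frac{\mathbf{w}^{T}S_{B}\,\mathbf{w}}{\mathbf{w}^{T}S_{W}\,\mathbf{w}},
\qquad S_{B}=(\mathbf{m}_{2}-\mathbf{m}_{1})(\mathbf{m}_{2}-\mathbf{m}_{1})^{T},
\qquad \mathbf{w}^{*}\propto S_{W}^{-1}(\mathbf{m}_{2}-\mathbf{m}_{1})
```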
This document contains information about waves and sound from a physics textbook. It includes chapter summaries and sections on the properties of waves, including wavelength, frequency, amplitude, speed, and types of interactions such as reflection, refraction, diffraction and absorption. It also describes transverse and longitudinal waves, and how constructive and destructive interference can occur when waves meet.
Supervised Hidden Markov Chains.
Here, we used the paper by Rabiner as the basis for the presentation. Following it, we have the three classic problems:
1.- How to efficiently compute the probability of an observation sequence given a model.
2.- Given an observation sequence, how to decide to which class it belongs.
3.- How to find the model parameters given training data.
The first two follow Rabiner's explanation, but for the third one I used Lagrange multiplier optimization, because Rabiner lacks a clear explanation of how to solve the issue. The forward-algorithm sketch below illustrates the first problem.
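A minimal NumPy sketch of the standard forward algorithm for problem 1; the toy parameters are mine:

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence | model) in O(T * N^2).

    pi: (N,) initial state distribution
    A:  (N, N) transition matrix, A[i, j] = P(state j | state i)
    B:  (N, M) emission matrix,   B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]             # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction step
    return alpha.sum()                    # termination

# Toy 2-state, 2-symbol model
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward([0, 1, 0], pi, A, B))
```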
The document discusses several major rivers in India, including the Himalayan rivers of the Indus, Ganges, and Brahmaputra as well as some peninsular rivers. It provides details on the sources and courses of these rivers, describes some key geographical features like drainage basins and patterns, and mentions some cultural and mythological aspects associated with the rivers.
Work energy power 2 reading assignment - revision 2, by sashrilisdi
This document discusses basic energy concepts including work, kinetic energy, potential energy, and the law of conservation of energy. It provides equations to calculate work (W = FΔx), kinetic energy (KE = 1/2 mv^2), and gravitational potential energy (GPE = mgh). Examples are given to demonstrate calculating energy transformations during events like a swing or skydiving fall. The key points are that energy cannot be created or destroyed, only transformed between kinetic and potential forms, and this transformation can be represented using energy bar charts. Power is also defined as the rate of energy transfer or transformation (P = ΔE/Δt or P = W/Δt).
This document discusses Maximum A Posteriori (MAP) estimation for machine learning and data mining. It begins by introducing the Bayesian rule and defining the MAP estimate as the value of θ that maximizes the posterior p(θ|X). It then shows how to develop the MAP solution by taking the logarithm of the posterior and finding the value of θ that maximizes it. The MAP allows prior beliefs about parameter values to be incorporated into the estimation. An example application to binary classification with a Bernoulli model is provided. It derives the maximum likelihood solution and then extends it to the MAP by specifying a Beta prior distribution over the parameter.
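For the Beta-Bernoulli case described, the worked result is standard: with h successes in n trials and a Beta(α, β) prior,

```latex
\hat{\theta}_{\text{MAP}}
= \arg\max_{\theta}\left[\log p(X\mid\theta)+\log p(\theta)\right]
= \frac{h+\alpha-1}{n+\alpha+\beta-2},
\qquad\text{compared with}\qquad
\hat{\theta}_{\text{ML}}=\frac{h}{n}
```

so the prior effectively acts as α − 1 and β − 1 pseudo-counts added to the data.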
This document provides an agenda and information for a webinar on stormwater management. The webinar will cover regulations for stormwater management, hydrodynamic separation products including Vortechs and VortSentry, filtration with StormFilter, and infiltration using ChamberMaxx chambers. It includes details on the host and qualifications for continuing education credits for attendees.
Here are my slides for my preparation class for prospective Master's students in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
On Inherent Complexity of Computation, by Attila Szegedi (ZeroTurnaround)
The system you just recently deployed is likely an application processing some data, likely relying on some configuration, maybe using some plugins, certainly relying on some libraries, using services of an operating system running on some physical hardware. The previous sentence names 7 categories into which we compartmentalise various parts of a computation process that's in the end going on in a physical world. Where do you draw the line of functionality between categories? From what vantage points do these distinctions become blurry? Finally, how does it all interact with the actual physical world in which the computation takes place? (What is the necessary physical minimum required to perform a computation, anyway?) Let's make a journey from your AOP-assembled, plugin-injected, YAML-configured, JIT compiled, Hotspot-executed, Linux-on-x86 hosted Java application server talking JSON-over-HTTP-over-TCP-over-IP-over-Ethernet all the way down to electrons. And then back. Recorded at GeekOut 2013.
This document summarizes a talk titled "A Wager for 2016: How Software Will Beat Hardware in Biological Data Analysis". The talk discusses how software approaches can outpace hardware for analyzing large biological datasets. It notes that current variant calling approaches have limitations due to being I/O intensive and requiring multiple passes over data. The talk introduces approaches using lossy compression and streaming algorithms that can perform analysis more efficiently using less memory and in a single pass. This could enable analyzing a human genome on a desktop computer by 2016 as wagered. The talk argues that with better algorithmic tools, biological data analysis need not require large computers and can scale with the information content of data rather than just data size.
This document outlines a data science competition to build a spam detector using email data. Participants will be provided with training data containing 600 emails and their corresponding labels (spam or not spam). They will use this data to build a model to classify new emails as spam or not spam. The goal is to correctly classify as many new test emails as possible. Visualization and interpretation of results will be important for evaluating model performance and identifying ways to improve the spam detection.
Talk at a Data Journalism BootCamp organised by ICFJ, World Bank Group and African Media Initiative in New Delhi to a group of 60 journalists, coders and social sector folks. Other amazing sessions included those from Govind Ethiraj of IndiaSpend, Andrew from BBC, Parul from Google, Nasr from HacksHacker, Thej from DataMeet and David from Code for Africa. http://delhi.dbootcamp.org/
The document discusses the history and principles of data visualization, beginning with pioneers in the field like Charles Minard in the 1800s. It covers the evolution of data visualization techniques over time from Minard's infographics to modern tools like Google Analytics. Key concepts discussed include the visual analytics maturity framework, best practices for visualization design, and how visualization can enhance cognition and amplify understanding when applied effectively. Real-world examples of both effective and misleading visualizations are provided.
- In the last decade, artificial intelligence (AI) and machine learning have gained significant attention in computer science and across different domains like finance, medicine, law, and administration.
- For AI systems to function like humans, they must be imparted with knowledge. Two important aspects of AI systems are knowledge acquisition and knowledge representation. Early systems acquired knowledge from human experts, but experts are not always available or consistent.
- Modern AI systems acquire knowledge from data using machine learning techniques. With the abundance of electronic data from various sources, knowledge can be extracted from data and represented in ways that enable computers to interpret it without human intervention.
Enterprise Search Europe 2015: Fishing the big data streams - the future of ..., by Charlie Hull
The document discusses the future of search and analytics using streams of data from sources like the Internet of Things. It describes how search technologies can be used to process real-time streams of data by indexing the streams and querying them similar to how searches are currently done on stored data. Examples of searching streams are given, such as searching incoming news stories against stored search profiles to identify matches.
This document discusses the challenges and opportunities presented by the increasing volume and complexity of biological data. It outlines four main areas: 1) Developing methods to efficiently store, access, and analyze large datasets; 2) Broadening our understanding of gene function beyond a small number of well-studied genes; 3) Accelerating research through improved sharing of data, results, and methods; and 4) Leveraging exploratory analysis of integrated datasets to generate new insights. The author advocates for lossy data compression, streaming analysis, preprint sharing, improved metadata collection, and incentivizing open data practices.
This document provides an agenda and overview for a data science presentation. It begins with introductions and then discusses what data science is, how it draws from various influences like math, engineering, and business. It explores the skills and background of data scientists. The document discusses how data science applies the scientific method and gives examples of how data science is used in news stories, business applications, and emerging technologies. It addresses practicing data science professionally and maturing the field as a profession through personal development, integrating it into business, and nurturing the analytics community.
This document discusses getting to know data using R. It begins by outlining the typical steps in a data analysis, including defining the question, obtaining and cleaning the data, performing exploratory analysis, modeling, interpreting results, and creating reproducible code. It then describes different types of data science questions from descriptive to mechanistic. The remainder of the document provides more details on descriptive, exploratory, inferential, predictive, causal, and mechanistic analysis. It also discusses R, including its design, packages, data types like vectors, matrices, factors, lists, and data frames.
Library Users of the Future... Or, projecting outward from that fringe of res..., by James Baker
Deck for a talk I gave at the Anybook Oxford Libraries Conference, Oxford University, 24 June 2015.
Notes at https://gist.github.com/drjwbaker/6c5011d595cabfa70e97
A talk delivered by James Baker at the Anybook Oxford Libraries Conference 2015 - Adapting for the Future: Developing Our Professions and Services, 21st July 2015.
In 1971, David Parnas wrote the great paper, "On the Criteria To Be Used in Decomposing Systems into Modules," and yet the problem of breaking down big projects into small parts that work well together remains a struggle in the industry. The ability to decompose a problem space and, in turn, compose a solution is essential to our work.
Things have gotten worse since 1971. With microservices, big data, and streaming systems, we're all going to be distributed systems engineers sooner or later. In distributed systems, effective decomposition has an even greater impact on the reliability, performance, and availability of our systems, as it determines the frequency and weight of communication in the system.
This talk speaks to the essential considerations for defining and evaluating boundaries and behaviors in large-scale distributed systems. It will touch on topics such as bulkhead design and architectural evolution.
High-Performance Networking Use Cases in Life Sciences, by Ari Berman
Big data has arrived in the life science research domain and has driven the need for optimized high-performance networks in these research environments. Many petabytes of data transfer, storage and analytics are now a reality due to the fact that data is being produced cheaply and rapidly at unprecedented rates in academic, commercial and clinical laboratories. These data flows are complicated by the combination of high-frequency mouse flows as well as high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that utilize both common and disparate compute resources, the need for high-performance data encryption in-flight to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development from an IO standpoint throughout the industry. This presentation will cover representative advanced networking use cases from life sciences research, the challenges that they present in networking environments, some solutions that are being deployed with in both small and large institutions, and an overview of a few of the unresolved problems to date.
by Samantha Adams, Met Office.
Originally purely academic research fields, Machine Learning and AI are now definitely mainstream and frequently mentioned in the Tech media (and regular media too).
We've also got the explosion of Data Science which encompasses these fields and more. There's a lot of interesting things going on and a lot of positive as well as negative hype. The terms ML and AI are often used interchangeably and techniques are also often described as being inspired by the brain.
In this talk I will explore the history and evolution of these fields, current progress and the challenges in making artificial brains
From the FreshTech 2017 conference by TechExeter
www.techexeter.uk
The document discusses the changing nature of information and literacy in the 21st century. It argues that literacy now involves finding information from diverse online sources, decoding and evaluating it, and organizing it. It also involves expressing ideas compellingly through various media like images, sound, and video. Finally, it discusses the importance of ethics like truth, minimizing harm, and accountability in the digital age.
Presentation from October 4, 2015: Arts Midwest Orchestras 20/20: Context, Connection, Collaboration. An attempt to lay out the context of audience, competition, technology and strategy - then a set of practical steps to get things done.
This document outlines an introduction to Bayesian estimation. It discusses key concepts like the likelihood principle, sufficiency, and Bayesian inference. The likelihood principle states that all experimental information about an unknown parameter is contained within the likelihood function. An example is provided testing the fairness of a coin using different data collection scenarios to illustrate how the likelihood function remains the same. The document also discusses the history of the likelihood principle and provides an outline of topics to be covered.
The document discusses linear transformations and their applications in mathematics for artificial intelligence. It begins by introducing linear transformations and how matrices can be used to define functions. It describes how a matrix A can define a linear transformation f_A that maps vectors in R^n to vectors in R^m. It also defines key concepts for linear transformations like the kernel, range, row space, and column space. The document will continue exploring topics like the derivative of transformations, linear regression, principal component analysis, and singular value decomposition.
This document provides an introduction and outline for a discussion of orthonormal bases and eigenvectors. It begins with an overview of orthonormal bases, including definitions of the dot product, norm, orthogonal vectors and subspaces, and orthogonal complements. It also discusses the relationship between the null space and row space of a matrix. The document then provides an introduction to eigenvectors and outlines topics that will be covered, including what eigenvectors are useful for and how to find and use them.
The document discusses square matrices and determinants. It begins by noting that square matrices are the only matrices that can have inverses. It then presents an algorithm for calculating the inverse of a square matrix A by forming the partitioned matrix (A|I) and applying Gauss-Jordan reduction. The document also discusses determinants, defining them recursively as the sum of products of diagonal entries with signs depending on row/column position, for matrices larger than 1x1. Complexity increases exponentially with matrix size.
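A direct transcription of that recursive definition as code (my own sketch; note the O(n!) running time of cofactor expansion, which is why Gauss-Jordan reduction is preferred in practice):

```python
def det(A):
    """Determinant by Laplace (cofactor) expansion along the first row.
    Matches the recursive definition, but runs in O(n!) time."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]  # delete row 0, column j
        total += (-1) ** j * A[0][j] * det(minor)
    return total

print(det([[1, 2], [3, 4]]))                    # -2
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))   # 24
```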
This document provides an introduction to systems of linear equations and matrix operations. It defines key concepts such as matrices, matrix addition and multiplication, and transitions between different bases. It presents an example of multiplying two matrices using NumPy. The document outlines how systems of linear equations can be represented using matrices and discusses solving systems using techniques like Gauss-Jordan elimination and elementary row operations. It also introduces the concepts of homogeneous and inhomogeneous systems.
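The document's own NumPy example is not reproduced here; a minimal equivalent might look like this:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A @ B)          # matrix product: [[19 22] [43 50]]
print(np.dot(A, B))   # equivalent spelling
```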
This document provides an outline and introduction to a course on mathematics for artificial intelligence, with a focus on vector spaces and linear algebra. It discusses:
1. A brief history of linear algebra, from ancient Babylonians solving systems of equations to modern definitions of matrices.
2. The definition of a vector space as a set that can be added and multiplied by elements of a field, with properties like closure under addition and scalar multiplication.
3. Examples of using matrices and vectors to model systems of linear equations and probabilities of transitions between web pages.
4. The importance of linear algebra concepts like bases, dimensions, and eigenvectors/eigenvalues for machine learning applications involving feature vectors and least squares error.
This document outlines and discusses backpropagation and automatic differentiation. It begins with an introduction to backpropagation, describing how it works in two phases: feed-forward to calculate outputs, and backpropagation to calculate gradients using the chain rule. It then discusses automatic differentiation, noting that it provides advantages over symbolic differentiation. The document explores the forward and reverse modes of automatic differentiation and examines their implementation and complexity. In summary, it covers the fundamental algorithms and methods for calculating gradients in neural networks.
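A tiny sketch of the forward mode using dual numbers (my own illustration of the idea; the reverse mode used by backpropagation is more involved):

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0: carrying the derivative
    through arithmetic gives forward-mode automatic differentiation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)  # product rule
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2

x = Dual(2.0, 1.0)                 # seed dx/dx = 1
y = f(x)
print(y.val, y.dot)                # 17.0 14.0
```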
This document summarizes the services of a company that provides data analysis and machine learning solutions. They have an interdisciplinary team with over 15 years of experience in areas like machine learning, artificial intelligence, big data, and data engineering. Their expertise includes developing data models, analysis products, and systems to help companies with forecasting, decision making, and improving data operations efficiency. They can help clients across various industries like telecom, finance, retail, and more.
My first set of slides (The NN and DL class I am preparing for the fall)... I included the problem of Vanishing Gradient and the need to have ReLu (Mentioning btw the saturation problem inherited from Hebbian Learning)
Reinforcement learning is a method for learning behaviors through trial-and-error interactions with an environment. The goal is to maximize a numerical reward signal by discovering the actions that yield the most reward. The learner is not told which actions to take directly, but must instead determine which actions are best by trying them out. This document outlines reinforcement learning concepts like exploration versus exploitation, where exploration involves trying non-optimal actions to gain more information, while exploitation uses current knowledge to choose optimal actions. It also discusses formalisms like Markov decision processes and the tradeoff between maximizing short-term versus long-term rewards in reinforcement learning problems.
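A minimal sketch of the exploration-exploitation tradeoff on a multi-armed bandit, using the classic epsilon-greedy rule (my own toy example, not code from the document):

```python
import random

def epsilon_greedy_bandit(true_means, steps=10000, eps=0.1, seed=1):
    """With probability eps explore a random arm, otherwise exploit the
    arm with the best estimated value."""
    random.seed(seed)
    k = len(true_means)
    q = [0.0] * k      # estimated value per arm
    n = [0] * k        # pull counts
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: q[i])    # exploit
        reward = random.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]   # incremental mean update
        total += reward
    return q, total / steps

q, avg = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print(q, avg)   # estimates approach the true means; average reward near 0.9
```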
This document provides an overview of a 65-hour course on neural networks and deep learning taught by Andres Mendez Vazquez at Cinvestav Guadalajara. The course objectives are to introduce students to concepts of neural networks, with a focus on various neural network architectures and their applications. Topics covered include traditional neural networks, deep learning, optimization techniques for training deep models, and specific deep learning architectures like convolutional and recurrent neural networks. The course grades are based on midterms, homework assignments, and a final project.
This document provides a syllabus for an introduction to artificial intelligence course. It outlines 14 topics that will be covered in the class, including what AI is, the mathematics behind it like probability and linear algebra, search techniques, constraint satisfaction problems, probabilistic reasoning, Bayesian networks, graphical models, neural networks, machine learning, planning, knowledge representation, reinforcement learning, logic in AI, and genetic algorithms. It also lists the course requirements, which include exams, homework, and a group project to simulate predators and prey.
The document outlines a proposed 8 semester curriculum for a Bachelor's degree in Machine Learning and Data Science. The curriculum covers fundamental topics in mathematics, computer science, statistics, physics and artificial intelligence in the first 4 semesters. Later semesters focus on more advanced topics in artificial intelligence, machine learning, neural networks, databases, and parallel programming. The final semester emphasizes practical applications of machine learning and data science through courses on large-scale systems and non-traditional databases.
This document outlines the syllabus for a course on analysis of algorithms and complexity. The course will cover foundational topics like asymptotic analysis and randomized algorithms, as well as specific algorithms like sorting, searching trees, and graph algorithms. It will also cover advanced techniques like dynamic programming, greedy algorithms, and amortized analysis. Later topics will include NP-completeness, multi-threaded algorithms, and approaches for dealing with NP-complete problems. The requirements include exams, homework assignments, and a project, and the course will be taught in English.
A review of one of the most popular methods of clustering, part of what is known as unsupervised learning: K-Means. Here, we go from the basic heuristic used to solve the NP-hard problem to the K-Centers approximation algorithm. Additionally, we look at variations coming from fuzzy set ideas. In the future, we will add more about on-line algorithms in the line of stochastic gradient ideas...
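For concreteness, a minimal sketch of the basic heuristic (Lloyd's algorithm), assuming NumPy; the initialization and stopping rule are simplifications of mine:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's heuristic for the (NP-hard) K-Means objective:
    alternate assignment and centroid-update steps until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center for every point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its cluster
        # (a production version would guard against empty clusters)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(kmeans(X, 2)[0])   # two centers near (0, 0) and (5, 5)
```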
Here is a review of the combination of machine learning models, from Bayesian averaging and committees to boosting... Specifically, a statistical analysis of boosting is done.
This document provides an introduction to machine learning concepts including loss functions, empirical risk, and two basic methods of learning: least squared error and nearest neighbor. It describes how machine learning aims to find an optimal function that minimizes empirical risk under a given loss function. Least squared error learning is discussed as minimizing the squared differences between predictions and labels. Nearest neighbor is also introduced as an alternative method. The document serves as a high-level overview of fundamental machine learning principles.
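A short sketch of the least squared error method under these definitions, assuming NumPy; the data is a toy example of mine:

```python
import numpy as np

# Empirical risk under squared loss: R(w) = (1/n) * sum_i (x_i . w - y_i)^2.
# Minimizing it gives the normal equations  X^T X w = X^T y.
X = np.column_stack([np.ones(5), np.arange(5.0)])   # bias column + feature
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])             # y = 1 + 2x exactly

w, *_ = np.linalg.lstsq(X, y, rcond=None)           # least-squares solution
print(w)                                            # approximately [1.0, 2.0]
```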
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte..., by University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS, IJNSA Journal
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threat and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system. By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
We have compiled the most important slides from each speaker's presentation. This year's compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
A review on techniques and modelling methodologies used for checking electrom..., by nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI, and sensors give misleading values which can prove fatal in the case of automotives. In this paper, the authors have non-exhaustively tried to review research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p..., IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system, including the required equations to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setups to advance their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical work and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novel approach for a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Advanced control scheme of doubly fed induction generator for wind turbine us..., IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines, by Christina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we'll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Low power architecture of logic gates using adiabatic techniques, by nooriasukmaningtyas
The growing significance of portable systems to limit power consumption in ultra-large-scale-integration chips of very high density has recently led to rapid and inventive progress in low-power design. The most effective technique is adiabatic logic circuit design in energy-efficient hardware. This paper presents two adiabatic approaches for the design of low power circuits: modified positive feedback adiabatic logic (modified PFAL), and direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design. By improving the performance of basic gates, one can improve the whole system performance. In this paper, proposed circuit designs of the low-power architecture of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches, and their results are analyzed for power dissipation, delay, power-delay product, and rise time, and compared with the other adiabatic techniques along with the conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs with the DC-DB PFAL technique outperform, with percentage improvements of 65% for the NOR gate, 7% for the NAND gate, and 34% for the XNOR gate over the modified PFAL techniques at 10 MHz, respectively.
1. Machine Learning for Data Mining
Introduction
Andres Mendez-Vazquez
May 13, 2015
2. Outline
1 Why are we interested in Analyzing Data?
Intuitive Definition: The 3Vs
Complexity
Data Everywhere
2 Machine Learning
Machine Learning Process
Features
Classification
Clustering Analysis
3 Data Mining
Definition
Applications
Example: Frequent Itemsets
4 Hardware Support
ASICs
GPUs
5 Projects
What projects can you do?
4. Intuitive Definition: Volume
When looking at the Volumes of Information, we have:
Volumes of it: Terabytes (10^12), Petabytes (10^15) and UP!!!
Examples of these Volumes are
1 Records
2 Transactions
3 Web Searches
4 etc
9. However
Something Notable
What constitutes truly "high" volume varies by industry and even geography!!!
Simply look at the DNA data for a cellular cycle.
Example
11. Intuitive Definition: Variety
When looking at the Structure of the Information, we have:
Variety like there is no tomorrow:
It is structured, semi-structured, unstructured
So
Do you have some examples of structures in Information?
14. Intuitive Definition: Velocity
When Looking at the Velocity of this Information?
Data in Motion!!!
Velocity:
Dynamic Generation
Real Time Generation
Problems with that: Latency
Lag time between capture or generation and when it is available!!!
19. For example
Imagine that I have a stream of m = 10^25 integers, each drawn from a set {a_1, . . . , a_n} with n = 10,000,000.
Now, somebody asks you to find the most frequent item!!!
A naive algorithm (a sketch follows below)
1 Take a hash table with a counter.
2 Then, put the numbers in the hash table.
Problems
Which problems do we have?
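As a reference point, here is a minimal sketch of the naive approach in Python (my illustration, not from the slides). It keeps exact counts in a hash table, so memory grows with the number of distinct items, which is exactly what breaks at stream scale.

```python
from collections import Counter

def most_frequent_exact(stream):
    """Naive approach: exact counts in a hash table.

    Memory grows with the number of DISTINCT items (up to n),
    which is infeasible when n is in the millions and the
    stream itself cannot be stored.
    """
    counts = Counter()
    for item in stream:
        counts[item] += 1
    # most_common(1) returns [(item, count)]
    return counts.most_common(1)[0]

# Tiny usage example
print(most_frequent_exact([3, 1, 3, 2, 3, 1]))  # -> (3, 3)
```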
22. However
There is the
Count-Min Sketch Algorithm
Invented by
Charikar, Chen and Farach-Colton in 2004
With Properties
Space Used: O((1/ε) · log(1/δ) · (log m + log n))
Error: at most εm
Probability of Error: δ
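To make the trade-off concrete, here is a minimal Count-Min Sketch in Python (an illustrative sketch, not the authors' code). The width and depth follow the standard choices w ≈ e/ε and d ≈ ln(1/δ).

```python
import hashlib
import math

class CountMinSketch:
    """Minimal Count-Min Sketch: approximate counts in sublinear space.

    estimate(x) >= true_count(x), and with probability >= 1 - delta
    the overestimate is at most eps * (total items seen).
    """
    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)         # columns per row
        self.d = math.ceil(math.log(1 / delta))  # independent hash rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, x, row):
        h = hashlib.md5(f"{row}:{x}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, x):
        for row in range(self.d):
            self.table[row][self._hash(x, row)] += 1

    def estimate(self, x):
        # The minimum over rows gives the tightest overestimate
        return min(self.table[row][self._hash(x, row)] for row in range(self.d))

cms = CountMinSketch(eps=0.01, delta=0.01)
for item in [3, 1, 3, 2, 3, 1]:
    cms.add(item)
print(cms.estimate(3))  # close to the true count, 3
```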
26. Complexity
Given all these things
It is necessary to correlate and share data across entities.
It is necessary to link, match and transform data across business entities and systems.
With this...
Complexity goes through the roof!!!
29. And it is through the roof!!! The Linking Open Data community project
30. Cautionary Tale
Something Notable
In 1880 the USA carried out a Census of the Population covering different aspects:
Population
Mortality
Agriculture
Manufacturing
However
Once the data was collected, it took 7 years to say something!!!
38. Hollerith Tabulating Machine
It was basically a sorter and counter
Using punched cards as memory.
And mercury sensors.
Example
40. It was FAST!!!
It took only!!!
2 years!!!
Nevertheless, in 1837
Babbage's Analytical Engine was
The First General Computer!!!
Turing-complete!!!
Way more complex than the tabulator!!! 53 years earlier!!!
46. The Problem
Actually, it never reached completion because
Babbage was a yucky project manager!!!
49. Data is Everywhere!
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Many places
54. The Staggering Numbers
An Ocean of Data
How much data is there in the world?
800 Terabytes, 2000
160 Exabytes, 2006
500 Exabytes (Internet), 2009
2.7 Zettabytes, 2012
35 Zettabytes by 2020
Generation
How much data is generated in ONE day?
7 TB, Twitter
10 TB, Facebook
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, 2011
63. Type of Data
Thus
Relational Data (Tables/Transactions/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
And more...
Graph Data
Social Networks, Semantic Web (RDF), . . .
Streaming Data
You can only scan the data once
71. Machine Learning
Definition
Algorithms or techniques that enable a computer (machine) to "learn" from data. Related to many areas such as data mining, statistics, information theory, etc.
Algorithm Types:
Unsupervised Learning
Supervised Learning
Reinforcement Learning
Examples
Artificial Neural Networks (ANN)
Support Vector Machines (SVM)
Expectation-Maximization (EM)
Deterministic Annealing (DA)
80. Machine Learning Process
Process (a code sketch follows below)
1 Feature Extraction/Feature Generation
2 Clustering → Class Identification → Unsupervised Learning
3 Classification → Supervised Learning
Then...
We start thinking: We need to process a lot of data...
Or...
LARGE-SCALE MACHINE LEARNING
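As an illustration of this process (my sketch, not part of the original slides, assuming scikit-learn is available), the following snippet chains feature generation (PCA), clustering (k-means), and classification (logistic regression) on toy data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: 3 latent groups in 10 dimensions
X, y = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# 1. Feature generation / dimensionality reduction
Z = PCA(n_components=2).fit_transform(X)

# 2. Clustering -> class identification (unsupervised)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

# 3. Classification (supervised), here trained on the true labels
clf = LogisticRegression().fit(Z, y)
print("train accuracy:", clf.score(Z, y))
```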
84. Feature Generation/Dimensionality Reduction
Feature Generation
Given a set of measurements, the goal is to discover compact and informative representations of the obtained data.
Examples (see the sketch after this list)
1 The Karhunen-Loève transform → Principal Component Analysis
Popular for feature generation and Dimensionality Reduction
2 The Singular Value Decomposition
Used for Dimensionality Reduction
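To ground the PCA/SVD connection, here is a minimal NumPy sketch (my own illustration): centering the data and taking the top right singular vectors gives the principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

Xc = X - X.mean(axis=0)                # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                  # keep 2 components
components = Vt[:k]                    # principal directions
Z = Xc @ components.T                  # low-dimensional representation

# Variance captured by each kept component
explained = (S**2) / (len(X) - 1)
print(Z.shape, explained[:k])
```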
88. Dimension Reduction/Feature Extraction
Definition
The process of transforming high-dimensional data into a low-dimensional representation to improve accuracy, aid understanding, or remove noise.
Why?
Curse of dimensionality: the volume of the space grows exponentially as extra dimensions are added (a small numeric illustration follows below).
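A quick numeric illustration of the curse (my example, not from the slides): as the dimension grows, almost all of the volume of the unit cube lies near its boundary, so uniformly sampled data becomes sparse everywhere else.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [1, 2, 10, 50, 100]:
    X = rng.uniform(size=(100_000, d))
    # Fraction of points within 0.05 of some face of the unit cube
    near_boundary = np.any((X < 0.05) | (X > 0.95), axis=1).mean()
    print(f"d={d:3d}: {near_boundary:.3f} of points near the boundary")
```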
90. Feature Selection
Feature Selection
Which features should be used for the classifier?
Why? The Curse of Dimensionality!!!
Hypothesis Testing to discriminate good features
92. What can be done?
Measures for Class Separability
Example: Between-class scatter matrix:
S_b = Σ_{i=1}^{M} P_i (µ_i − µ_0)(µ_i − µ_0)^T   (1)
Where:
µ_0 is the global mean vector, µ_0 = Σ_{i=1}^{M} P_i µ_i.
µ_i is the mean of class ω_i.
P_i ≈ n_i / N.
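A small NumPy sketch (my illustration) of equation (1), estimating the prior P_i by the class frequency n_i/N:

```python
import numpy as np

def between_class_scatter(X, y):
    """S_b = sum_i P_i (mu_i - mu_0)(mu_i - mu_0)^T, with P_i = n_i / N."""
    N, d = X.shape
    mu0 = X.mean(axis=0)                      # global mean vector
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        Pi = len(Xc) / N                      # class prior estimate
        diff = (Xc.mean(axis=0) - mu0).reshape(-1, 1)
        Sb += Pi * (diff @ diff.T)            # rank-1 contribution per class
    return Sb

X = np.array([[0., 0.], [1., 1.], [4., 4.], [5., 5.]])
y = np.array([0, 0, 1, 1])
print(between_class_scatter(X, y))
```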
96. What can be done?
Feature Subset Selection
Examples:
Filter Approach
All combinations of features are evaluated with a separability measure.
Wrapper Approach:
Use the chosen classifier itself to find the best feature subset.
102. Classification
Definition
A procedure that assigns data to a given set of categories, based on a training set, in a supervised way.
What do we want from classification?
Generalization vs. Specification
Hard to achieve both
Avoid overfitting/overtraining:
Early stopping
Holdout validation
K-fold cross-validation (see the sketch after this list)
Leave-one-out cross-validation
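As a sketch of one of these validation strategies (my illustration, assuming scikit-learn), k-fold cross-validation looks like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

# The mean held-out accuracy estimates generalization,
# guarding against overfitting to a single train/test split.
print(np.mean(scores))
```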
111. Examples of Classification Algorithms
Many Possible Algorithms
Linear Classifiers: Perceptron
Probabilistic Classifiers: Naive Bayes
Kernel Method Classifiers: Support Vector Machines
Non-Linear Classifiers: Artificial Neural Networks
Graphical Model Classifiers:
. . .
118. Clustering Analysis
Definition
Grouping unlabeled data into clusters, for the purpose of inferring hidden structure or information.
Using, for example
Dissimilarity measures (see the sketch after this list):
Angle: inner product, . . .
Non-metric: rank, intensity, . . .
Distance: Euclidean (l2), Manhattan (l1), . . .
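Three of these dissimilarity measures in NumPy (a minimal illustration of the list above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])

euclidean = np.linalg.norm(x - y)        # l2 distance
manhattan = np.abs(x - y).sum()          # l1 distance
# Angle-based similarity via the inner product
cosine_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, cosine_sim)
```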
124. Examples of Clustering Algorithms
Clustering (a K-means sketch follows below)
1 Basic Clustering Algorithms
1 K-means
2 Clustering Based on Cost Functions
1 Fuzzy C-means
2 Possibilistic
3 Hierarchical Clustering
1 Entropy-based
4 Clustering Based on Graph Theory
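To make the most basic entry on this list concrete, here is a compact K-means sketch in NumPy (my illustration, using random initialization for simplicity):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assignment step: nearest center under squared Euclidean distance
        labels = np.argmin(
            ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1
        )
        # Update step: recompute each center as the mean of its cluster
        new_centers = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
             for j in range(k)]
        )
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 5.0)])
labels, centers = kmeans(X, k=2)
print(centers)
```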
133. What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting information or patterns from data in large databases.
Alternative names and their "inside stories":
Knowledge discovery (mining) in databases (KDD)
Knowledge extraction
Data/pattern analysis
Data archeology
Business intelligence
etc.
139. Examples: What is (not) Data Mining?
What is not Data Mining?
1 Look up a phone number in a phone directory
2 Query a Web search engine for information about "Amazon"
What is Data Mining?
1 Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly . . . in the Boston area)
2 Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
144. Data Mining Applications
Applications
Mining the Web for Structured Data
Near-Neighbor Search in High-Dimensional Data
Frequent Itemsets and Association Rules
Structure of the Webgraph
PageRank
Link Analysis
Proximity on Graphs
Mining Data Streams
Large-Scale Supervised Machine Learning Techniques
154. Example: Frequent Itemsets
Based on the Market-Basket Model
1 On the one hand, we have items.
2 On the other, we have baskets, sometimes called "transactions."
1 Each basket consists of a set of items (an itemset)
2 They are small.
Examples
1 {Cat, and, dog, bites}
2 {Yahoo, news, claims, cat, dog, and, produced, viable, offspring}
3 {Cat, killer, likely, is, a, big, dog}
4 {Professional, free, advice, on, dog, training, puppy}
162. Example: Frequent Itemsets
Then, we do the following

Transaction ID | Cat | Dog | and | a | mated
1              |  1  |  1  |  1  | 0 |  0
2              |  1  |  1  |  1  | 1 |  1
3              |  1  |  1  |  0  | 1 |  0
4              |  0  |  1  |  0  | 0 |  0
163. Combinatorial Problem
Problem
How many subsets do we have?
But we can do the following (see the sketch below)
Given an itemset x in a database D with transactions {t_i}_{i∈I}:
supp(x, D) = |{t_i ∈ D | x ⊆ t_i}|   (2)
Then, setting a threshold ε
How many frequent (supp(x, D) > ε) itemsets are there?
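A minimal support-counting sketch in Python (my illustration): computing supp(x, D) from equation (2) and filtering itemsets by a threshold, the first step of an Apriori-style search.

```python
from itertools import combinations

D = [  # baskets as sets of items
    {"cat", "dog", "and"},
    {"cat", "dog", "and", "a", "mated"},
    {"cat", "dog", "a"},
    {"dog"},
]

def supp(x, D):
    """supp(x, D) = number of transactions containing itemset x."""
    return sum(1 for t in D if x <= t)

threshold = 2
items = set().union(*D)

# Frequent 1-itemsets; Apriori extends these level by level, since
# every superset of an infrequent itemset is also infrequent.
frequent1 = {i for i in items if supp({i}, D) > threshold}
frequent2 = {frozenset(p) for p in combinations(frequent1, 2)
             if supp(set(p), D) > threshold}
print(frequent1, frequent2)
```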
167. Hardware Solutions: ASICs
Application-Specific Integrated Circuit (ASIC)
An ASIC is an integrated circuit customized for a particular use, rather than intended for general-purpose use.
It allows for
1 Lower Power Consumption.
2 Better Cooling Approaches.
Example: From Microsoft Research
171. Hardware Solutions: GPUs
IDEAS
Based on the CUDA parallel computing architecture from Nvidia
Emphasis on executing many concurrent LIGHT threads instead of one HEAVY thread as in CPUs
Hardware for the 8800
172. Advantages
Massively parallel
Hundreds of cores, millions of threads
High throughput
Limitations
May not be applicable to all tasks
Generic hardware (CPUs) is closing the gap
174. Projects
Possible topics are:
Oil exploration detection.
Association Rule Preprocessing Project.
Neural Network-Based Financial Market Forecasting Project.
Page Ranking - Improving over the Google Matrix.
Influence Maximization in Social Networks.
Web Word Relevance Measures.
Recommendation Systems.
There are more possibilities at https://www.kaggle.com/competitions