The introduction to my class on machine learning. The subjects covered in this class include:
1.- Linear Classifiers
2.- Non Linear Classifiers
3.- Graphical Models
4.- Clustering
5.- Etc
I am planning to upload the rest once I feel they are at the right level.
Here is the basic introduction to probability used in my Analysis of Algorithms course at Cinvestav Guadalajara. The slides go from the basic axioms to expected value and variance.
In most of the algorithms analyzed so far, we have been studying problems solvable in polynomial time. The class P consists of algorithms that, on inputs of size n, have a worst-case running time of O(n^k) for some constant k. Thus, informally, we can say that the non-polynomial (NP) time problems are the ones that cannot be solved in O(n^k) for any constant k.
Here are my slides for my preparation class for prospective students of the Master in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
Here are my slides on some basic algorithms in Computational Geometry:
1.- Line Intersection
2.- Sweeping Line
3.- Convex Hull
They are the classic ones, but there is still a lot to study for anybody wanting to get into computer graphics. I recommend:
Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. 2008. Computational Geometry: Algorithms and Applications (3rd ed.). TELOS, Santa Clara, CA, USA.
Here are my slides for my preparation class for prospective Master students in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
Here are my slides for my preparation class for prospective students in the Master in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
Here, we look at the problem of going from a source s to possibly multiple destinations. The lemmas, theorems, and corollaries used to prove the properties of
1. Bellman-Ford
2. Dijkstra
are examined in detail.
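As a companion to these slides, here is a minimal sketch of Dijkstra's algorithm with a binary heap, assuming non-negative edge weights and a graph given as an adjacency list; the graph format and function name are illustrative choices, not taken from the slides:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights.

    graph: dict mapping each node to a list of (neighbor, weight) pairs.
    Returns a dict of shortest distances from `source`.
    """
    dist = {source: 0}
    heap = [(0, source)]                     # (distance, node) priority queue
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):    # stale entry, skip it
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd                 # relax edge (u, v)
                heapq.heappush(heap, (nd, v))
    return dist

# Example: shortest distances from node "s"
g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)], "b": [("t", 3)], "t": []}
print(dijkstra(g, "s"))   # {'s': 0, 'a': 1, 'b': 3, 't': 6}
```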
Combining Models - In these slides, we look at ways to combine the answers from various weak classifiers to build a robust classifier. The slides cover the following subjects:
1.- Model Combination vs. Bayesian Model Averaging
2.- Bootstrap Data Sets
And, as the cherry on top, AdaBoost.
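For reference, here is a minimal sketch of binary AdaBoost with decision stumps, in the spirit of the slides; using scikit-learn's DecisionTreeClassifier as the weak learner and the function names are my own illustrative choices, not taken from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Binary AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # initial uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)          # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)         # weight of this weak learner
        w *= np.exp(-alpha * y * pred)                # re-weight: boost misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```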
Supervised Hidden Markov Chains.
Here, we used the paper by Rabiner as the base for the presentation. Thus, we have the following three problems:
1.- How to efficiently compute the probability of an observation sequence given a model.
2.- Given an observation, to which class does it belong?
3.- How to find the model parameters given training data.
The first two follow Rabiner's explanation, but for the third I used Lagrange multiplier optimization, because Rabiner lacks a clear explanation of how to solve that issue.
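A minimal sketch of the forward algorithm for the first problem (computing the probability of an observation sequence given an HMM); the variable names are my own, with pi, A, and B as the usual initial, transition, and emission parameters:

```python
import numpy as np

def forward_probability(obs, pi, A, B):
    """P(observation sequence | model) via the forward algorithm.

    obs: sequence of observation symbol indices, length T
    pi:  (N,)   initial state distribution
    A:   (N, N) state transition matrix, A[i, j] = P(state j at t+1 | state i at t)
    B:   (N, M) emission matrix, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination: sum over final states

# Tiny example with 2 states and 2 symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_probability([0, 1, 0], pi, A, B))
```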
This informative tutorial by industry-leading Armtec will provide an introduction to stormwater treatment and the large number of regulations present in Canada. The tutorial will present a range of products to meet various stormwater treatment regulations and will provide guidance on proper selection and sizing of the products, which is a key question in the development of many projects. Learn how to achieve greener projects through effective particle removal, water retention and infiltration. Stormwater detention will also be addressed during the tutorial.
A range of products and application cases will be covered including the VortSentry, Vortechs and StormFilter units and ChamberMaxx stormwater detention/infiltration products.
Here are my slides for my preparation class for prospective Master students in Electrical Engineering and Computer Science (Specialization in Computer Science), for the entrance examination here at Cinvestav GDL.
On Inherent Complexity of Computation, by Attila Szegedi (ZeroTurnaround)
The system you just recently deployed is likely an application processing some data, likely relying on some configuration, maybe using some plugins, certainly relying on some libraries, using services of an operating system running on some physical hardware. The previous sentence names 7 categories into which we compartmentalise various parts of a computation process that's in the end going on in a physical world. Where do you draw the line of functionality between categories? From what vantage points do these distinctions become blurry? Finally, how does it all interact with the actual physical world in which the computation takes place? (What is the necessary physical minimum required to perform a computation, anyway?) Let's make a journey from your AOP-assembled, plugin-injected, YAML-configured, JIT compiled, Hotspot-executed, Linux-on-x86 hosted Java application server talking JSON-over-HTTP-over-TCP-over-IP-over-Ethernet all the way down to electrons. And then back. Recorded at GeekOut 2013.
Talk at a Data Journalism BootCamp organised by ICFJ, World Bank Group and African Media Initiative in New Delhi to a group of 60 journalists, coders and social sector folks. Other amazing sessions included those from Govind Ethiraj of IndiaSpend, Andrew from BBC, Parul from Google, Nasr from HacksHacker, Thej from DataMeet and David from Code for Africa. http://delhi.dbootcamp.org/
Library Users of the Future... Or, projecting outward from that fringe of res..., by James Baker
Deck for a talk I gave at the Anybook Oxford Libraries Conference, Oxford University, 24 June 2015.
Notes at https://gist.github.com/drjwbaker/6c5011d595cabfa70e97
A talk delivered by James Baker at the Anybook Oxford Libraries Conference 2015 - Adapting for the Future: Developing Our Professions and Services, 21st July 2015.
In 1971, David Parnas wrote the great paper "On the Criteria To Be Used in Decomposing Systems into Modules," and yet the problem of breaking down big projects into small parts that work well together remains a struggle in the industry. The ability to decompose a problem space and, in turn, compose a solution is essential to our work.
Things have gotten worse since 1971. With microservices, big data, and streaming systems, we're all going to be distributed systems engineers sooner or later. In distributed systems, effective decomposition has an even greater impact on the reliability, performance, and availability of our systems, as it determines the frequency and weight of communication in the system.
This talk speaks to the essential considerations for defining and evaluating boundaries and behaviors in large-scale distributed systems. It will touch on topics such as bulkhead design and architectural evolution.
High-Performance Networking Use Cases in Life Sciences, by Ari Berman
Big data has arrived in the life science research domain and has driven the need for optimized high-performance networks in these research environments. Many petabytes of data transfer, storage and analytics are now a reality, because data is being produced cheaply and rapidly at unprecedented rates in academic, commercial and clinical laboratories. These data flows are complicated by the combination of high-frequency mouse flows as well as high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that utilize both common and disparate compute resources, the need for high-performance data encryption in-flight to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development from an IO standpoint throughout the industry. This presentation will cover representative advanced networking use cases from life sciences research, the challenges that they present in networking environments, some solutions that are being deployed within both small and large institutions, and an overview of a few of the unresolved problems to date.
by Samantha Adams, Met Office.
Originally purely academic research fields, Machine Learning and AI are now definitely mainstream and frequently mentioned in the Tech media (and regular media too).
We've also got the explosion of Data Science, which encompasses these fields and more. There are a lot of interesting things going on, and a lot of positive as well as negative hype. The terms ML and AI are often used interchangeably, and techniques are also often described as being inspired by the brain.
In this talk I will explore the history and evolution of these fields, current progress, and the challenges in making artificial brains.
From the FreshTech 2017 conference by TechExeter
www.techexeter.uk
Presentation from October 4, 2015: Arts Midwest Orchestras 20/20: Context, Connection, Collaboration. An attempt to lay out the context of audience, competition, technology and strategy - then a set of practical steps to get things done.
Senior data scientist and founder of the company Intelligentia Data I+D SA de CV. We offer consultancy services and the development of projects and products in Machine Learning, Big Data, Data Science and Artificial Intelligence.
My first set of slides (the NN and DL class I am preparing for the fall)... I included the problem of the vanishing gradient and the need for ReLU (mentioning, by the way, the saturation problem inherited from Hebbian learning).
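As a small illustration of why saturating activations cause vanishing gradients (my own toy example, not from the slides): the derivative of the sigmoid is at most 0.25, so gradients shrink geometrically through many layers, while ReLU passes a gradient of 1 on its active side.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25 at x = 0

def relu_grad(x):
    return float(x > 0)           # 1 on the active side, 0 otherwise

# Gradient magnitude after back-propagating through `depth` layers
# (toy setting: pre-activations fixed at x = 1, unit weights).
depth = 20
x = 1.0
print("sigmoid:", sigmoid_grad(x) ** depth)   # essentially zero: vanishes
print("relu:   ", relu_grad(x) ** depth)      # 1.0: gradient survives
```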
It has been almost 62 years since the term Artificial Intelligence was coined by John McCarthy, Marvin Minsky, and colleagues at the Dartmouth workshop in 1956 ("Dartmouth Summer Research Project on Artificial Intelligence"), where this new area of Computer Science was born. However, the history of Artificial Intelligence goes back to previous millennia, when the Greeks in their myths spoke about the golden robots of Hephaestus and the Galatea of Pygmalion. These were the first automatons known at the dawn of history, and although these first attempts were only myths, automatons were invented and built across multiple civilizations in history. Nevertheless, these automatons resembled their final objectives in quite a limited way, representing animals and humans. In spite of that, the greatest illusion of an automaton, the Turk by Wolfgang von Kempelen, inspired many people through its exhibitions, such as Alexander Graham Bell and Charles Babbage, to develop inventions that would change human history forever. Hence the importance of the concept "Artificial Intelligence" as a driver of our technological dreams. And although Artificial Intelligence has never been defined in a precise, practical way, the amount of research and the number of methods that have been developed to tackle some of its basic tasks have been, and remain, quite humongous. Thus the importance of having an introduction to the concepts of Artificial Intelligence, so the dream can continue.
A review of one of the most popular methods of clustering, part of what is known as unsupervised learning: K-Means. Here, we go from the basic heuristic used to solve the NP-hard problem to an approximation algorithm, K-Centers. Additionally, we look at variations coming from Fuzzy Set ideas. In the future, we will add more about on-line algorithms in the line of stochastic gradient ideas...
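A minimal sketch of the basic K-Means heuristic (Lloyd's algorithm) mentioned above; the function name and the random initialization scheme are my own illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's heuristic for K-Means. X is an (n, d) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
```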
Here is a review of the combination of machine learning models, from Bayesian averaging and committees to boosting... Specifically, a statistical analysis of boosting is carried out.
Final project report on grocery store management system.pdf, by Kamal Acharya
In today's fast-changing business environment, it is extremely important to be able to respond to client needs in the most effective and timely manner, especially if your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website which retails various grocery products. The project allows users to view the various products available, enables registered users to purchase desired products instantly using the Paytm and UPI payment processors (Instant Pay), and also lets them place orders using the Cash on Delivery (Pay Later) option. It provides easy access for Administrators and Managers to view orders placed using the Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of technologies must be studied and understood. These include multi-tiered architecture, server- and client-side scripting techniques, implementation technologies, programming languages (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. The objective of this project is to develop a basic website where a consumer is provided with a shopping cart, and to learn about the technologies used to develop such a website.
This document will discuss each of the underlying technologies used to create and implement an e-commerce website.
6th International Conference on Machine Learning & Applications (CMLA 2024), by ClaraZara1
The 6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning.
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS..., by ssuser7dcef0
Power plants release a large amount of water vapor into the atmosphere through the stack. The flue gas can be a potential source for obtaining much-needed cooling water for a power plant. If a power plant could recover and reuse a portion of this moisture, it could reduce its total cooling water intake requirement. One of the most practical ways to recover water from flue gas is to use a condensing heat exchanger. The power plant could also recover latent heat due to condensation, as well as sensible heat due to lowering the flue gas exit temperature. Additionally, harmful acids released from the stack can be reduced in a condensing heat exchanger by acid condensation.
Condensation of vapors in flue gas is a complicated phenomenon, since heat and mass transfer of water vapor and various acids occur simultaneously in the presence of noncondensable gases such as nitrogen and oxygen. The design of a condenser depends on knowledge and understanding of the heat and mass transfer processes. A computer program for numerical simulations of water (H2O) and sulfuric acid (H2SO4) condensation in a flue gas condensing heat exchanger was developed using MATLAB. Governing equations based on mass and energy balances for the system were derived to predict variables such as flue gas exit temperature, cooling water outlet temperature, mole fractions, and condensation rates of water and sulfuric acid vapors. The equations were solved using an iterative solution technique with calculations of heat and mass transfer coefficients and physical properties.
Sachpazis: Terzaghi Bearing Capacity Estimation in simple terms with Calculati..., by Dr. Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The calculation HTML code is included.
HEAP SORT ILLUSTRATED WITH HEAPIFY AND BUILD-HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on the binary heap data structure. It is similar to selection sort, where we first find the minimum element and place it at the beginning, and then repeat the same process for the remaining elements.
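A minimal in-place heap sort sketch (max-heap variant) to accompany that description; the function names are my own:

```python
def heapify(a, n, i):
    """Sift a[i] down so that the subtree rooted at i (within a[:n]) is a max-heap."""
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, n, largest)

def heap_sort(a):
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # build the max-heap bottom-up
        heapify(a, n, i)
    for end in range(n - 1, 0, -1):       # repeatedly move the max to the end
        a[0], a[end] = a[end], a[0]
        heapify(a, end, 0)
    return a

print(heap_sort([5, 1, 4, 2, 8]))   # [1, 2, 4, 5, 8]
```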
Hierarchical Digital Twin of a Naval Power System, by Kerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Cosmetic shop management system project report.pdf, by Kamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it is tough to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. The system includes various function programs to perform the tasks mentioned above.
Data file handling has been used effectively in the program.
The automated cosmetic shop management system should deal with the automation of the general workflow and administration process of the shop. The main processes of the system focus on customer requests, where the system is able to search for the most appropriate products and deliver them to the customers. It should help employees quickly identify the list of cosmetic products that have reached the minimum quantity and also keep track of the expiry date of each cosmetic product. It should help employees find the rack number in which a product is placed. It is also a faster and more efficient way of working.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
01 Machine Learning Introduction
1. Machine Learning for Data Mining
Introduction
Andres Mendez-Vazquez
May 13, 2015
2. Outline
1 Why are we interested in Analyzing Data?
Intuitive Definition: The 3Vs
Complexity
Data Everywhere
2 Machine Learning
Machine Learning Process
Features
Classification
Clustering Analysis
3 Data Mining
Definition
Applications
Example: Frequent Itemsets
4 Hardware Support
ASICS
GPUs
5 Projects
What projects can you do?
4. Intuitive Definition: Volume
When looking at the Volumes of Information, we have:
Volumes of it: Terabyte (10^12), Petabyte (10^15) and UP!!!
Examples of these Volumes are
1 Records
2 Transactions
3 Web Searches
4 etc
9. However
Something Notable
What constitutes truly "high" volume varies by industry and even geography!!!
Simply look at the DNA data for a cellular cycle.
Example
11. Intuitive Definition: Variety
When looking at the Structure of the Information, we have:
Variety like there is no tomorrow:
It is structured, semi-structured, unstructured
So
Do you have some examples of structures in Information?
14. Intuitive Definition: Velocity
When looking at the Velocity of this Information?
Data in Motion!!!
Velocity:
Dynamic Generation
Real Time Generation
Problems with that: Latency
Lag time between capture or generation and when it is available!!!
19. For example
Imagine that I have a stream of m = 10^25 integers with values in the range [a_1, ..., a_n] with n = 10,000,000.
Now, somebody asks you to find the most frequent item!!!
A naive algorithm
1 Take a hash table with a counter.
2 Then, put the numbers in the hash table.
Problems
Which problems do we have?
22. However
There is the
Count-Min Sketch Algorithm
Invented by
Charikar, Chen and Farach-Colton in 2004
With Properties
Space Used: O((1/ε) log(1/δ) · (log m + log n))    Error: ε    Error Probability: δ
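To make the idea concrete, here is a minimal sketch of a count-min-style frequency estimator that uses O(width × depth) counters instead of one counter per distinct item; the constants, class name, and hash construction are my own illustrative choices, not the parameters from the slide:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts for a stream using a small 2-D counter array."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        # True count <= estimate; collisions can only inflate it.
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))

# Example: estimate counts over a small stream
cms = CountMinSketch()
for x in [1, 2, 2, 3, 2, 1]:
    cms.add(x)
print(cms.estimate(2))   # 3 (possibly larger if collisions occur)
```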
26. Complexity
Given all these things
It is necessary to correlate and share data across entities.
It is necessary to link, match and transform data across business
entities and systems.
With this...
Complexity goes through the roof!!!
29. And it is through the roof!!! Linking open-data community project
30. Cautionary Tale
Something Notable
In 1880 the USA made a Census of the Population in different aspects:
Population
Mortality
Agriculture
Manufacturing
However
Once data was collected it took 7 years to say something!!!
38. Hollerith Tabulating Machine
It was basically a sorter and counter
Using punched cards as memory.
And mercury sensors.
Example
40. It was FAST!!!
It took only!!!
2 years!!!
Nevertheless, in 1837
Babbage's Analytical Engine was
The First General Computer!!!
Turing-complete!!!
Way more complex than the tabulator!!! 53 years earlier!!!
46. The Problem
Actually, it never reached completion because
Babbage was actually a yucky project manager!!!
49. Data is Everywhere!
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/ grocery stores
Bank/Credit Card transactions
Social Network
Many Places
54. The Staggering Numbers
An Ocean of Data
How much data is there in the world?
800 Terabytes, 2000
160 Exabytes, 2006
500 Exabytes (Internet), 2009
2.7 Zettabytes, 2012
35 Zettabytes by 2020
Generation
How much data is generated in ONE day?
7 TB, Twitter
10 TB, Facebook
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute 2011
63. Type of Data
Thus
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
And more...
Graph Data
Social Network, Semantic Web (RDF), . . .
Streaming Data
You can only scan the data once
71. Machine Learning
Definition
Algorithms or techniques that enable a computer (machine) to "learn" from data. Related to many areas such as data mining, statistics, information theory, etc.
Algorithm Types:
Unsupervised Learning
Supervised Learning
Reinforcement Learning
Examples
Artificial Neural Network (ANN)
Support Vector Machine (SVM)
Expectation-Maximization (EM)
Deterministic Annealing (DA)
80. Machine Learning Process
Process
1 Feature Extraction/Feature Generation
2 Clustering → Class Identification → Unsupervised Learning
3 Classification → Supervised Learning
Then...
We start thinking: We need to process a lot of data...
Or...
LARGE SCALE MACHINE LEARNING
84. Feature Generation/Dimensionality Reduction
Feature Generation
Given a set of measurements, the goal is to discover compact and informative representations of the obtained data.
Examples
1 The Karhunen–Loève transform → Principal Component Analysis
1 Popular for feature generation and Dimensionality Reduction
2 The Singular Value Decomposition
1 Used for Dimensionality Reduction
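A minimal sketch of dimensionality reduction with PCA computed through the SVD, as mentioned on the slide above; the function name and the centering convention are my own illustrative choices:

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X (n samples x d features) onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                    # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # principal directions (rows)
    return Xc @ components.T                   # reduced-dimension representation

# Example: reduce 5-dimensional data to 2 dimensions
X = np.random.randn(100, 5)
Z = pca_svd(X, n_components=2)
print(Z.shape)   # (100, 2)
```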
88. Dimension Reduction/Feature Extraction
Definition
The process of transforming high-dimensional data into low-dimensional data for improving accuracy or understanding, or for removing noise.
Why?
Curse of dimensionality: Complexity grows exponentially in volume when adding extra dimensions.
90. Feature Selection
Feature Selection
Which features should be used for the classifier?
Why? The Curse of Dimensionality!!!
Hypothesis Testing to discriminate good features
92. What can be done?
Measures for Class Separability
Example: Between-class scatter matrix:
S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T   (1)
Where:
\mu_0 is the global mean vector, \mu_0 = \sum_{i=1}^{M} P_i \mu_i.
\mu_i is the mean of class \omega_i.
P_i \cong n_i / N.
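A small sketch of how the between-class scatter matrix in Eq. (1) can be computed with NumPy; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def between_class_scatter(X, y):
    """S_b = sum_i P_i (mu_i - mu_0)(mu_i - mu_0)^T over the classes in y."""
    classes = np.unique(y)
    mu0 = X.mean(axis=0)                       # global mean vector mu_0
    d = X.shape[1]
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        Pi = len(Xc) / len(X)                  # class prior P_i ~ n_i / N
        diff = (Xc.mean(axis=0) - mu0).reshape(-1, 1)
        Sb += Pi * diff @ diff.T
    return Sb

# Example with two classes in 3 dimensions
X = np.vstack([np.random.randn(30, 3), np.random.randn(20, 3) + 2])
y = np.array([0] * 30 + [1] * 20)
print(between_class_scatter(X, y))
```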
96. What can be done?
Feature Subset Selection
Examples:
Filter Approach
All combinations of features are used together with a separability measure.
Wrapper Approach:
Use the decided classifier itself to find the best set.
102. Classification
Definition
A procedure dividing data into the given set of categories based on the training set in a supervised way.
What do we want from classification?
Generalization vs. Specification
Hard to achieve both
Avoid overfitting/overtraining
Early stopping
Holdout validation
K-fold cross validation
Leave-one-out cross-validation
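A minimal sketch of k-fold cross validation, one of the overfitting-avoidance tools listed on the slide above; the helper names and the simple index-splitting scheme are my own illustrative choices:

```python
import numpy as np

def k_fold_cross_validation(X, y, train_and_score, k=5, seed=0):
    """Average validation score over k folds.

    train_and_score(X_train, y_train, X_val, y_val) must return a scalar score.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)              # k roughly equal folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Example scorer: a 1-nearest-neighbor classifier's accuracy on the validation fold
def nn_score(X_tr, y_tr, X_va, y_va):
    d = np.linalg.norm(X_va[:, None, :] - X_tr[None, :, :], axis=2)
    pred = y_tr[d.argmin(axis=1)]
    return np.mean(pred == y_va)
```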
111. Examples of Classification Algorithms
Many Possible Algorithms
Linear Classifiers: Perceptron
Probability Classifiers: Naive Bayes
Kernel Method Classifiers: Support Vector Machines
Non-Linear Classifiers: Artificial Neural Networks
Graph Model Classifiers:
. . .
118. Clustering Analysis
Definition
Grouping unlabeled data into clusters, for the purpose of inference of hidden structures or information.
Using, for example
Dissimilarity measurements
Angle: Inner product, . . .
Non-metric: Rank, Intensity, . . .
Distance: Euclidean (l2), Manhattan (l1), . . .
124. Examples of Clustering Algorithms
Clustering
1 Basic Clustering Algorithms
1 K-means
2 Clustering Based on Cost Functions
1 Fuzzy C-means
2 Possibilistic
3 Hierarchical Clustering
1 Entropy based
4 Clustering Based on Graph Theory
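For the first family in this taxonomy, here is a minimal K-means sketch in Python/NumPy, offered only as an illustration under toy assumptions (random initialization from the data points, Euclidean distance, a fixed iteration cap); it alternates the assignment and centroid-update steps until the centroids stop moving.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize the centroids with k distinct points drawn from X
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: each point goes to its nearest centroid
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage: two well-separated groups of points
    X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
    labels, centroids = kmeans(X, k=2)
    print(labels, centroids)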
132. Outline
1 Why are we interested in Analyzing Data?
   Intuitive Definition: The 3Vs
   Complexity
   Data Everywhere
2 Machine Learning
   Machine Learning Process
   Features
   Classification
   Clustering Analysis
3 Data Mining
   Definition
   Applications
   Example: Frequent Itemsets
4 Hardware Support
   ASICs
   GPUs
5 Projects
   What projects can you do?
133. What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting information or patterns from data in large databases.
Alternative names and their "inside stories":
Knowledge discovery (mining) in databases (KDD)
Knowledge extraction
Data/pattern analysis
Data archeology
Business intelligence
etc.
139. Examples: What is (not) Data Mining?
What is not Data Mining?
1 Looking up a phone number in a phone directory
2 Querying a Web search engine for information about "Amazon"
What is Data Mining?
1 Noticing that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly, . . . in the Boston area)
2 Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
143. Outline
1 Why are we interested in Analyzing Data?
   Intuitive Definition: The 3Vs
   Complexity
   Data Everywhere
2 Machine Learning
   Machine Learning Process
   Features
   Classification
   Clustering Analysis
3 Data Mining
   Definition
   Applications
   Example: Frequent Itemsets
4 Hardware Support
   ASICs
   GPUs
5 Projects
   What projects can you do?
144. Data Mining Applications
Applications
Mining the Web for structured data
Near-neighbor search in high-dimensional data
Frequent itemsets and association rules
Structure of the web graph
PageRank
Link analysis
Proximity on graphs
Mining data streams
Large-scale supervised machine learning techniques
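Several of these applications (PageRank, link analysis, proximity on graphs) reduce to iterating over a graph's adjacency structure. As a hedged illustration only, here is a tiny power-iteration PageRank sketch in Python/NumPy on a made-up four-page link matrix; the damping factor 0.85 is the usual convention, and nothing here is taken from the slides.

    import numpy as np

    # Column-stochastic link matrix of a made-up 4-page web graph:
    # M[i, j] is the probability of following a link from page j to page i.
    M = np.array([[0.0, 0.5, 0.0, 0.0],
                  [1/3, 0.0, 0.0, 1.0],
                  [1/3, 0.0, 0.0, 0.0],
                  [1/3, 0.5, 1.0, 0.0]])

    def pagerank(M, d=0.85, tol=1e-10):
        n = M.shape[0]
        r = np.full(n, 1.0 / n)              # start from the uniform distribution
        while True:
            r_next = d * M @ r + (1 - d) / n  # power iteration with damping
            if np.linalg.norm(r_next - r, 1) < tol:
                return r_next
            r = r_next

    print(pagerank(M))   # importance scores that sum (approximately) to 1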
153. Outline
1 Why are we interested in Analyzing Data?
   Intuitive Definition: The 3Vs
   Complexity
   Data Everywhere
2 Machine Learning
   Machine Learning Process
   Features
   Classification
   Clustering Analysis
3 Data Mining
   Definition
   Applications
   Example: Frequent Itemsets
4 Hardware Support
   ASICs
   GPUs
5 Projects
   What projects can you do?
154. Example: Frequent Itemsets
Based on the Market-Basket Model
1 On the one hand, we have items.
2 On the other, we have baskets, sometimes called "transactions."
   1 Each basket consists of a set of items (an itemset).
   2 Baskets are small.
Examples
1 {Cat, and, dog, bites}
2 {Yahoo, news, claims, cat, dog, and, produced, viable, offspring}
3 {Cat, killer, likely, is, a, big, dog}
4 {Professional, free, advice, on, dog, training, puppy}
162. Example: Frequent Itemsets
Then, we do the following

Transaction ID | Cat | Dog | and | a | mated
             1 |  1  |  1  |  1  | 0 |   0
             2 |  1  |  1  |  1  | 1 |   1
             3 |  1  |  1  |  0  | 1 |   0
             4 |  0  |  1  |  0  | 0 |   0
163. Combinatorial Problem
Problem
How many subsets do we have? With n distinct items there are 2^n - 1 possible non-empty itemsets, far too many to enumerate for large n.
But we can do the following
Given an itemset x, a database D, and its set of transactions {t_i}_{i ∈ I}:

supp(x, D) = |{t_i ∈ D : x ⊆ t_i}|   (2)

Then, setting a threshold σ:
How many frequent itemsets, i.e. itemsets with supp(x, D) > σ, are there?
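To ground the support definition above, the following brute-force sketch in Python (illustrative only; the toy baskets echo the earlier example slide and the threshold value is an arbitrary assumption) computes supp(x, D) for every candidate itemset of size one or two and keeps the frequent ones.

    from itertools import combinations

    # Toy database: each transaction is a set of (lower-cased) items
    D = [
        {"cat", "and", "dog", "bites"},
        {"yahoo", "news", "claims", "cat", "dog", "and", "produced", "viable", "offspring"},
        {"cat", "killer", "likely", "is", "a", "big", "dog"},
        {"professional", "free", "advice", "on", "dog", "training", "puppy"},
    ]

    def supp(x, D):
        # supp(x, D): number of transactions containing every item of x
        return sum(1 for t in D if x <= t)

    sigma = 2          # arbitrary support threshold for the toy example
    items = sorted(set().union(*D))
    frequent = {}
    for k in (1, 2):   # brute force over itemsets of size 1 and 2 only
        for combo in combinations(items, k):
            x = frozenset(combo)
            s = supp(x, D)
            if s > sigma:
                frequent[x] = s

    print(frequent)    # e.g. {'dog'}: 4, {'cat'}: 3, {'cat', 'dog'}: 3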
166. Outline
1 Why are we interested in Analyzing Data?
   Intuitive Definition: The 3Vs
   Complexity
   Data Everywhere
2 Machine Learning
   Machine Learning Process
   Features
   Classification
   Clustering Analysis
3 Data Mining
   Definition
   Applications
   Example: Frequent Itemsets
4 Hardware Support
   ASICs
   GPUs
5 Projects
   What projects can you do?
167. Hardware Solutions: ASICs
Application-Specific Integrated Circuit (ASIC)
An ASIC is an integrated circuit customized for a particular use, rather than intended for general-purpose use.
It allows for
1 Lower power consumption.
2 Better cooling approaches.
Example: From Microsoft Research
170. Outline
1 Why are we interested in Analyzing Data?
   Intuitive Definition: The 3Vs
   Complexity
   Data Everywhere
2 Machine Learning
   Machine Learning Process
   Features
   Classification
   Clustering Analysis
3 Data Mining
   Definition
   Applications
   Example: Frequent Itemsets
4 Hardware Support
   ASICs
   GPUs
5 Projects
   What projects can you do?
171. Hardware Solutions: GPUs
Ideas
Based on the CUDA parallel computing architecture from Nvidia.
Emphasis on executing many concurrent LIGHT threads instead of one HEAVY thread, as in CPUs.
Hardware for the 8800 (figure)
172. Advantages
Massively parallel: hundreds of cores, millions of threads.
High throughput.
Limitations
May not be applicable to all tasks.
Generic hardware (CPUs) is closing the gap.
173. Outline
1 Why are we interested in Analyzing Data?
   Intuitive Definition: The 3Vs
   Complexity
   Data Everywhere
2 Machine Learning
   Machine Learning Process
   Features
   Classification
   Clustering Analysis
3 Data Mining
   Definition
   Applications
   Example: Frequent Itemsets
4 Hardware Support
   ASICs
   GPUs
5 Projects
   What projects can you do?
174. Projects
Possible topics are:
Oil exploration detection.
Association rule preprocessing.
Neural-network-based financial market forecasting.
Page ranking: improving over the Google matrix.
Influence maximization in social networks.
Web word relevance measures.
Recommendation systems.
There are more possibilities at https://www.kaggle.com/competitions