In design acknowledgment, the k-nearest neighbor calculation is a non-parametric strategy proposed by Thomas Cover utilized for characterization and relapse. In the two cases, the info comprises of the k nearest preparing models in the element space.
In design acknowledgment, the k-nearest neighbor calculation is a non-parametric strategy proposed by Thomas Cover utilized for characterization and relapse. In the two cases, the info comprises of the k nearest preparing models in the element space.
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
20180526@Taiwan AI Academy, Professional Managers Class.
Covering important concepts of classical machine learning, in preparation for deep learning topics to follow. Topics include regression (linear, polynomial, gaussian and sigmoid basis functions), dimension reduction (PCA, LDA, ISOMAP), clustering (K-means, GMM, Mean-Shift, DBSCAN, Spectral Clustering), classification (Naive Bayes, Logistic Regression, SVM, kNN, Decision Tree, Classifier Ensembles, Bagging, Boosting, Adaboost) and Semi-Supervised learning techniques. Emphasis on sampling, probability, curse of dimensionality, decision theory and classifier generalizability.
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...Sean Moran
In this paper we focus on improving the effectiveness of hashing-based approximate nearest neighbour search. Generating similarity preserving hashcodes for images has been shown to be an effective and efficient method for searching through large datasets. Hashcode generation generally involves two steps: bucketing the input feature space with a set of hyperplanes, followed by quantising the projection of the data-points onto the normal vectors to those hyperplanes. This procedure results in the makeup of the hashcodes depending on the positions of the data-points with respect to the hyperplanes in the feature space, allowing a degree of locality to be encoded into the hashcodes. In this paper we study the effect of learning both the hyperplanes and the thresholds as part of the same model. Most previous research either learn the hyperplanes assuming a fixed set of thresholds, or vice-versa. In our experiments over two standard image datasets we find statistically significant increases in retrieval effectiveness versus a host of state-of-the-art data-dependent and independent hashing models.
발표자: 송환준(KAIST 박사과정)
발표일: 2018.8.
(Parallel Clustering Algorithm Optimization for Large-Scale Data Analytics)
Clustering은 데이터 분석에 가장 널리 쓰이는 방법 중 하나로 주어진 데이터를 유사성에 기초하여 여러 개의 그룹으로 나누는 작업이다. 하지만 Clustering 방법의 높은 계산 복잡도 때문에 대용량 데이터 분석에는 잘 사용되지 못하고 있다. 최근 이 높은 복잡도 문제를 해결하기 위해 많은 연구가 Hadoop, Spark와 같은 분산 컴퓨팅 방식을 적용하고 있지만 기존 Clustering 알고리즘을 분산 환경에 최적화시키는 것은 쉽지 않다. 특히, 효율성을 높이기 위해 정확성을 손실하는 문제 그리고 여러 작업자들 간의 부하 불균형 문제는 알고리즘을 분산처리 할 때 발생하는 대표적인 문제이다. 본 세미나에서는 대표적 Clustering 알고리즘인 DBSCAN을 분산처리 할 때 발생하는 여러 도전 과제에 초점을 맞추고 이를 해결 할 수 있는 새로운 해결책을 제시한다. 실제로 이 방법은 최신 연구의 방법과 비교하여 정확도 손실 없이 최대 180배까지 알고리즘의 성능을 향상시켰다.
본 세미나는 SIGMOD 2018에서 발표한 다음 논문에 대한 내용이다.
Song, H. and Lee, J., "RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning," In Proc. 2018 ACM Int'l Conf. on Management of Data (SIGMOD), Houston, Texas, pp. 1173 ~ 1187, June 2018
1. Background
- Concept of Clustering
- Concept of Distributed Processing (MapReduce)
- Clustering Algorithms (Focus on DBSCAN)
2. Challenges of Parallel Clustering
- Parallelization of Clustering Algorithm (Focus on DBSCAN)
- Existing Work
- Challenges
3. Our Approach
- Key Idea and Key Contribution
- Overview of Random Partitioning-DBSCAN
4. Experimental Results
5. Conclusions
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
20180526@Taiwan AI Academy, Professional Managers Class.
Covering important concepts of classical machine learning, in preparation for deep learning topics to follow. Topics include regression (linear, polynomial, gaussian and sigmoid basis functions), dimension reduction (PCA, LDA, ISOMAP), clustering (K-means, GMM, Mean-Shift, DBSCAN, Spectral Clustering), classification (Naive Bayes, Logistic Regression, SVM, kNN, Decision Tree, Classifier Ensembles, Bagging, Boosting, Adaboost) and Semi-Supervised learning techniques. Emphasis on sampling, probability, curse of dimensionality, decision theory and classifier generalizability.
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...Sean Moran
In this paper we focus on improving the effectiveness of hashing-based approximate nearest neighbour search. Generating similarity preserving hashcodes for images has been shown to be an effective and efficient method for searching through large datasets. Hashcode generation generally involves two steps: bucketing the input feature space with a set of hyperplanes, followed by quantising the projection of the data-points onto the normal vectors to those hyperplanes. This procedure results in the makeup of the hashcodes depending on the positions of the data-points with respect to the hyperplanes in the feature space, allowing a degree of locality to be encoded into the hashcodes. In this paper we study the effect of learning both the hyperplanes and the thresholds as part of the same model. Most previous research either learn the hyperplanes assuming a fixed set of thresholds, or vice-versa. In our experiments over two standard image datasets we find statistically significant increases in retrieval effectiveness versus a host of state-of-the-art data-dependent and independent hashing models.
발표자: 송환준(KAIST 박사과정)
발표일: 2018.8.
(Parallel Clustering Algorithm Optimization for Large-Scale Data Analytics)
Clustering은 데이터 분석에 가장 널리 쓰이는 방법 중 하나로 주어진 데이터를 유사성에 기초하여 여러 개의 그룹으로 나누는 작업이다. 하지만 Clustering 방법의 높은 계산 복잡도 때문에 대용량 데이터 분석에는 잘 사용되지 못하고 있다. 최근 이 높은 복잡도 문제를 해결하기 위해 많은 연구가 Hadoop, Spark와 같은 분산 컴퓨팅 방식을 적용하고 있지만 기존 Clustering 알고리즘을 분산 환경에 최적화시키는 것은 쉽지 않다. 특히, 효율성을 높이기 위해 정확성을 손실하는 문제 그리고 여러 작업자들 간의 부하 불균형 문제는 알고리즘을 분산처리 할 때 발생하는 대표적인 문제이다. 본 세미나에서는 대표적 Clustering 알고리즘인 DBSCAN을 분산처리 할 때 발생하는 여러 도전 과제에 초점을 맞추고 이를 해결 할 수 있는 새로운 해결책을 제시한다. 실제로 이 방법은 최신 연구의 방법과 비교하여 정확도 손실 없이 최대 180배까지 알고리즘의 성능을 향상시켰다.
본 세미나는 SIGMOD 2018에서 발표한 다음 논문에 대한 내용이다.
Song, H. and Lee, J., "RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning," In Proc. 2018 ACM Int'l Conf. on Management of Data (SIGMOD), Houston, Texas, pp. 1173 ~ 1187, June 2018
1. Background
- Concept of Clustering
- Concept of Distributed Processing (MapReduce)
- Clustering Algorithms (Focus on DBSCAN)
2. Challenges of Parallel Clustering
- Parallelization of Clustering Algorithm (Focus on DBSCAN)
- Existing Work
- Challenges
3. Our Approach
- Key Idea and Key Contribution
- Overview of Random Partitioning-DBSCAN
4. Experimental Results
5. Conclusions
"Number Crunching in Python": slides presented at EuroPython 2012, Florence, Italy
Slides have been authored by me and by Dr. Enrico Franchi.
Scientific and Engineering Computing, Numpy NDArray implementation and some working case studies are reported.
Lab course presentation to detect the parking space for the car given 2D image from the google maps and 3D point Cloud data of the current enivornment.
Github of the project can be find here.
https://github.com/amanullahtariq/ParkingSpaceDetection
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
1. George Valkanas1
, Apostolos N. Papadopoulos2
, Dimitrios Gunopulos1
Skyline Ranking à la IR
1
University of Athens, Greece
2
Aristotle University of Thessaloniki, Greece
1st
ExploreDB Workshop
Athens, Greece
28th
March, 2014
2. Skyline Problem Introduction
• Dataset D = (p1, p2, …, pn) in d-dimensional space
• Preferences for each dimension: min, max
• p dominates q iff pi ≤ qi i = 1,..,d && j: pj < qj
3. Usefulness of Skyline
• Multi-Objective optimization
• Exploratory Search
• Improve Recommendations
• Data summarization technique
• Building block for defining competitiveness
11. Bounding the Score
• Q1: What is the score for B ?
• A1: Depends on the assignment of the
remaining edges
12. Bounding the Score
• Q1: What is the score for B ?
• A1: Depends on the assignment of the
remaining edges
• Q2: What is the maximum score for B ?
13. Bounding the Score
• Q1: What is the score for B ?
• A1: Depends on the assignment of the
remaining edges
• Q2: What is the maximum score for B ?
• A2: Assign appropriately the remaining
edges
14. Bounding the Score
• Q1: What is the score for B ?
• A1: Depends on the assignment of the
remaining edges
• Q2: What is the maximum score for B ?
• A2: Assign appropriately the remaining
edges
• Q3: What is the appropriate way?
15. Bounding the Score
• Q1: What is the score for B ?
• A1: Depends on the assignment of the
remaining edges
• Q2: What is the maximum score for B ?
• A2: Assign appropriately the remaining
edges
• Q3: What is the appropriate way?
• A3:
– Same layer → Higher score (dp)
– Minimum overlap → Higher score (idp)
• No overlap → Loose bounds