Modern machine learning is immensely powerful but also has significant limitations that don't always get the attention they deserve. In this talk, I tried to contrast machine learning with AI and the original goals of that field, give some context, and discuss a potential path forward.
t-SNE is a modern visualization algorithm that embeds high-dimensional data in 2 or 3 dimensions while approximately preserving pairwise distances. If you have some data and can measure pairwise differences between items, a t-SNE visualization can help you identify clusters.
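A minimal sketch of such a t-SNE embedding with scikit-learn, using synthetic clusters and illustrative parameter values (not from the talk):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated 10-D clusters; t-SNE should keep them apart in 2-D.
cluster_a = rng.normal(0.0, 0.1, size=(20, 10))
cluster_b = rng.normal(5.0, 0.1, size=(20, 10))
X = np.vstack([cluster_a, cluster_b])

# Perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=10, init="random", random_state=0).fit_transform(X)
print(emb.shape)  # (40, 2)
```

Plotting `emb` with a scatter plot would show the two clusters as separate point clouds.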
High Dimensional Data Visualization using t-SNE (Kai-Wen Zhao)
Review of the t-SNE algorithm, which helps visualize high-dimensional data on a manifold by projecting it onto 2D or 3D space while preserving the metric.
My talk at the Stockholm Natural Language Processing Meetup. I explained how word2vec is implemented and how to use it in Python with gensim. When words are represented as points in space, the spatial distance between words describes their similarity. In this talk, I also explored how to use this in practice and how to visualize the results (using t-SNE).
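As a toy illustration of the idea (hand-made 3-D vectors standing in for trained word2vec embeddings; the words and values are invented for the example), cosine similarity measures how closely two word vectors point in the same direction:

```python
import numpy as np

# Toy "word vectors" (real word2vec vectors are learned, typically 100-300 dims).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "king" is closer to "queen" than to "apple" in this toy space.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))  # True
```

With gensim, the same comparison is done on learned vectors via the model's similarity methods.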
Visualizing and Communicating High-dimensional Data (Stefan Kühn)
Slides from my talk at Data Natives. It starts with the different modes of perception, the components of visualization and graphics, and how to transport information efficiently, then gives examples of how modern approximation techniques - manifold learning, principal curves - and visualization techniques - pair plots, correlation plots, parallel coordinates, grand tour - can be used to approach complex multi-dimensional data.
Valencian Summer School 2015
Day 2
Lecture 15
Machine Learning - Black Art
Charles Parker (Alston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Introduction to Data Mining - A Beginner's Guide (gokulprasath06)
We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need. Data mining can meet this demand by providing tools to discover knowledge from data.
Sean Kandel - Data profiling: Assessing the overall content and quality of a ... (huguk)
The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
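One of the sampling methods mentioned can be sketched with standard reservoir sampling, which draws a uniform random sample from a stream in a single pass without knowing its length up front (a generic illustration, not Trifacta's implementation):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from a stream of unknown length in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i is kept with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))  # 10
```

Descriptive statistics computed on such a sample approximate those of the full data set at a fraction of the cost.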
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
Goal: Provide an overview of data mining
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
A talk given by Eugene Dubossarsky on predictive analytics at the Big Data Analytics meetup in Sydney this month. The talk is available at http://www.youtube.com/watch?v=aG16YSFgtLY
A primer in Data Analysis. To substantiate the concepts, I presented Python code in the form of an ipython notebook (not included - get in touch for these, email and twitter are on the last slide).
The talk starts by describing general data analysis (and the skills required). I then speak about computing descriptive statistics and explain the details of two types of predictive models (simple linear regression and naive Bayes classifiers). We build examples with both predictive models in Python (pandas and Matplotlib).
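A minimal sketch of the simple linear regression part on synthetic data (the talk's own notebooks use pandas and Matplotlib; this uses plain NumPy):

```python
import numpy as np

# Synthetic data: y is roughly 2x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)
print(f"slope is about {slope:.2f}, intercept about {intercept:.2f}")
```

The fitted coefficients recover the generating slope and intercept up to the noise level.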
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ... (tboubez)
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
Data visualization in data science: exploratory (EDA) and explanatory visualization, Anscombe's quartet, design principles, visual encoding, design engineering and journalism, choosing the right graph, narrative structures, technology and tools.
Simple math for anomaly detection - Toufic Boubez - Metafor Software - Monito... (tboubez)
This is my presentation at Monitorama PDX in Portland on May 5, 2014
Simple math to get some signal out of your noisy sea of data
You’ve instrumented your system and application to the hilt. You can now “measure all the things”. Your team has set up thousands of metrics collecting millions of data points a day. Now what?
Most IT ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this mountain of data and extracting signal from the noise is not easy. The choice of what analytic method to use ranges from simple statistical analysis to sophisticated machine learning techniques. And one algorithm doesn’t fit all data.
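As a sketch of why one algorithm doesn't fit all data (illustrative numbers, not from the talk): a Gaussian 3-sigma rule can miss a large spike because the spike itself inflates the standard deviation, while a robust non-parametric rule based on the median absolute deviation catches it:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    # Gaussian assumption: flag points more than `threshold` std devs from the mean.
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def mad_outliers(x, threshold=3.5):
    # Non-parametric: the median absolute deviation is robust to the outliers themselves.
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

data = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 50.0])
print(zscore_outliers(data).any())   # False: the spike inflates the std and masks itself
print(mad_outliers(data))            # flags only the 50.0 spike
```

On this data the z-score of the spike stays below 3, so the Gaussian rule is silent, while the MAD rule flags exactly the anomalous point.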
The use of data and its modelling in science provides meaningful interpretations of real-world problems. This presentation provides an easy-to-understand overview of data visualization and analytics, and snippets of data science applications using R programming.
Everyone is talking about Data Mesh architectures already - assuming that there is already a full-fledged self-service data platform in place. A reality check reveals that most (data) platforms are not really working that well and fail to deliver value at scale. And in contrast to the business notion of a platform - where network effects make a platform more valuable the more users and products it has - this does not seem to hold for data platforms in particular (at least I haven't seen a proof so far).
So where to start, when data-transforming an organization? One approach, inspired by the Lean framework, is outlined in this talk. It all starts with what is actually working - identify some (data) products that drive value already. These are the ones you can build a platform for. It's a myth that you just need to build a solid platform, and then everyone will come and build amazing data products. They will never come. But starting with what already works is a reasonable first step. Step two is about creating flow, supporting the value stream end-to-end. Co-creation is your main tool here, fostering collaboration and ownership. Then you can think of platformizing what is really, really needed, avoiding the "waste" that modern data systems / platforms / architectures tend to pile up. In the end, the "right" architecture for your organization will emerge, you cannot simply copy-paste "solutions" that are not addressing your specific challenges.
Long story short, there is a path to success, but it's not easy, it's not copying others, it's finding your own way. And as in all good strategies, you can specify the "qualities" you'd like to see in the end. The concrete solutions need to emerge from the hard work of the motivated people who are already driving value for your organization now.
Data teams are contributing to a variety of value streams, as they are delivering value to a variety of stakeholders. The value streams are often not well-supported, and the involved teams face constant challenges like Data Quality and Data Ownership. Also, data products often rely on the same data points for building the product and for measuring its success - so a lack of data quality leads to poor product quality and weak measurability at the same time. These challenges become exponentially harder the larger the organization has grown. We propose a way of conceptualizing and visualizing the process of building data products, using the concept of the data value chain. Applying the Five Principles of Lean, especially Defining Value and Mapping out Data Value Streams, to the way we build data products and operate data systems at scale, we create a framework that allows us to focus on value delivery, avoids "waste" and supports ownership.
Talk at MCubed London about Manifold Learning and Applications (Stefan Kühn)
How to make use of Manifold Learning methods for Dimensionality Reduction, Data Visualization and Automated Feature Engineering, this time also with UMAP - most of the cool stuff is in the Jupyter notebooks.
Talk at PyData Berlin about Manifold Learning and Applications (Stefan Kühn)
These are the slides from my talk at PyData Berlin about how to use Manifold Learning in the context of Data Visualization and Feature Engineering. There are several Jupyter notebooks exploring this; you can find them on GitHub under https://github.com/cc-skuehn/Manifold_Learning
My slides from the Minds Mastering Machines conference 2018 in Cologne about Deep Learning and Mathematical Optimization: the methods that are used for training neural nets and how they perform with respect to training and especially learning, i.e. how well the trained predictors generalize.
Manifold Learning and Data Visualization (Stefan Kühn)
Talk at PyData Hamburg 2018-03-01 about Manifold Learning and Data Visualization with Python and Scikit-learn plus Random Projections and PCA, includes links to all resources and the github repository with worked examples in form of jupyter notebooks - we recommend using jupyter lab
Talk at the Data Science Meetup Hamburg about Deep Learning, the most important Optimization methods in this field and the relationship between training and learning
Data quality - The True Big Data Challenge (Stefan Kühn)
Data Quality is one of the most overlooked key aspects of any Big Data project or approach. This talk addresses the problem from various perspectives, discusses the main challenges and identifies possible solutions.
In this talk we discuss the connections between (Supervised) Learning and Mathematical Optimization. Topics include iterative algorithms, search directions and stepsizes. The talk was held at the Computer Science, Machine Learning and Statistics Meetup Hamburg.
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
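A small self-contained example of the filtering and aggregation basics described above, using Python's built-in sqlite3 with an invented sales table:

```python
import sqlite3

# In-memory database with a tiny (made-up) sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 250.0), ("south", 80.0), ("south", 120.0)],
)

# Aggregation: total sales per region, largest first.
rows = con.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('north', 350.0), ('south', 200.0)]
```

The same GROUP BY / ORDER BY pattern scales from toy tables to production warehouses.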
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of many small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
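A minimal sketch of the standard (monolithic) PageRank the report compares against, as power iteration on a small dense graph, with one common dead-end fix (spreading a dead end's rank uniformly); the graph and parameters are illustrative:

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Standard (monolithic) PageRank by power iteration on a small dense adjacency matrix."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    rank = np.full(n, 1.0 / n)
    while True:
        new = np.full(n, (1.0 - damping) / n)
        for u in range(n):
            if out_deg[u] == 0:
                # Dead end: distribute its rank uniformly over all vertices.
                new += damping * rank[u] / n
            else:
                new[adj[u] > 0] += damping * rank[u] / out_deg[u]
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

# 3-node cycle: by symmetry every vertex gets rank 1/3.
adj = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
print(pagerank(adj).round(3))  # [0.333 0.333 0.333]
```

Production implementations use sparse (e.g. CSR) representations rather than dense matrices.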
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Compressed Sparse Row (CSR) is an adjacency-list based graph representation used by graph algorithms like PageRank.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Adjusting OpenMP PageRank : SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
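A toy sketch of the retrieval step in such a RAG setup (invented documents and hand-made embedding vectors; a real copilot would use an embedding model and a vector database):

```python
import numpy as np

# Toy "embeddings": in a real RAG system these come from an embedding model.
docs = {
    "revenue report Q1": np.array([0.9, 0.1, 0.0]),
    "holiday calendar":  np.array([0.0, 0.9, 0.1]),
    "sales dashboard":   np.array([0.8, 0.2, 0.1]),
}

def retrieve(query_vec, k=2):
    # Rank documents by cosine similarity and keep the top k as LLM context.
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    ranked = sorted(docs, key=lambda d: cos(query_vec, docs[d]), reverse=True)
    return ranked[:k]

# A query vector pointing toward the "finance" direction of the toy space.
context = retrieve(np.array([1.0, 0.0, 0.0]))
prompt = "Answer using only this context:\n" + "\n".join(context) + "\nQ: ..."
print(context)  # ['revenue report Q1', 'sales dashboard']
```

The retrieved snippets are then prepended to the user's question, which is the prompt-augmentation step the talk describes.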
2. Overview
• Data Visualization as you might know it
• Main Properties of Graphics (and Humans)
• A short story about Charts
• Pair Plots and Correlations
• Data Visualization as you might not know it
• Fundamental Problems
• SVD, t-SNE and other approximations
• Principal Components and Principal Curves
• Parallel Coordinates
• The Grand Tour
3. Takeaways - hopefully ;-)
• Data Visualization is complicated
• There is always an approximation
• There is always a bias
• There is always a misinterpretation
• Data Visualization is simple
• Lots of packages available
• Lots of studies + literature
• Lots of examples to learn from
5. The Modes of Perception
[Figure: two panels of scattered X and O symbols, labeled "Fast" and "Slow" - task: find the outlier.]
6. The Modes of Perception
• Pre-attentive
• fast
• parallel processing
• effortless
• Pattern recognition
• semi-fast
• governed by the laws of Gestalt
• Attentive
• slow
• sequential
• high effort (attention is a very limited resource)
7. Main Properties of Graphics
Category - Example
• Position
• Shape
• Size
• Color
• Orientation (Line)
• Length (Line)
• Type and Size (Line)
• Brightness
[The Example column showed a graphical sample for each category.]
8. Main Properties of Graphics and Humans
Category - Amount of pre-attentive information
• Position: very high
• Shape: ———
• Size: approx. 4
• Color: approx. 8
• Orientation (Line): approx. 4
• Length (Line): ———
• Type and Size (Line): ———
• Brightness: approx. 8
9. Pre-attentive perception
• Position
• fast
• effective
• high number of different positions
• Color
• use with care
• Shape
• Orientation
Pre-attentive perception is effortless.
Exploit this as much as you can.
10. Pattern detection
„It is interesting to note that our brain […] subconsciously always prefers meaningful situations and objects.“
• Emergence
• Reification
• Multi-stability
• Invariance
Pattern detection can be trained.
Exploit this for frequent visualizations.
14. Laws of Gestalt
„It is interesting to note that our brain, in accordance with the laws of Gestalt, subconsciously always prefers meaningful situations and objects.“
15. Accuracy of Graphics
Square Pie vs Stacked Bar vs Pie vs Donut
What do you think?
https://eagereyes.org/blog/2016/a-reanalysis-of-a-study-about-square-pie-charts-from-2009
18. Fundamental Problems
• No accurate method in higher dimensions
• Approximation methods
• „Simulated“ dimensions (color, size, shape)
• Animations?
• No notion of quality or accuracy for Visualizations
• Information Theory?
• „Stability“?
All Visualizations are wrong, but some are useful.
19. Approximation methods
• Pair Plots
• Axis-aligned projections
• Interpretable in terms of original variables
• Singular Value Decomposition
• Optimal with respect to 2-norm (Euclidean norm) and supremum norm
• Comes with an error estimate
• Other methods
• t-distributed Stochastic Neighbor Embedding (t-SNE)
• „Manifold Learning“
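The SVD's error estimate can be checked numerically: by the Eckart-Young theorem, the 2-norm error of a rank-k truncation equals the first discarded singular value. A small sketch with random data (an illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))

# Rank-2 truncated SVD: the best rank-2 approximation in the 2-norm.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] * s[:2] @ Vt[:2]

# Eckart-Young: the 2-norm error equals the first discarded singular value.
err = np.linalg.norm(A - A2, 2)
print(np.isclose(err, s[2]))  # True
```

This is what "comes with an error estimate" means in practice: the singular values tell you exactly how much a low-dimensional view loses.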
23. Principal Components and Curves
• Principal Component Analysis
• orthogonal decomposition based on SVD
• linear in all variables
• tries to preserve variance
• Principal Curves
• minimize the Sum of Squared Errors with respect to all variables (as PCA, preserve variance)
• nonlinear
• smooth
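A small sketch of PCA via SVD on random data (an assumed setup, not from the slides), showing that the component scores carry the variance in decreasing order:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)  # PCA requires centered data

# PCA via SVD: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # projections onto the principal components

# Variance explained per component, in decreasing order.
explained = s**2 / (len(X) - 1)
print(np.all(np.diff(explained) <= 0))  # True
```

Keeping only the first two columns of `scores` gives the 2-D view that preserves the most variance among all linear projections.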
27. The Grand Tour
• Animated sequence of 2-D projections
• https://en.wikipedia.org/wiki/Grand_Tour_(data_visualisation)
• Asimov (1985): The grand tour: a tool for viewing multidimensional data.
• Underlying idea
• Randomly generate 2-D projections (random walk)
• Over time generate a dense subset of all possible 2-D projections
• Optional: Follow a given path / guided tour
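One common way to generate a single tour frame (a sketch of the random-projection idea, not Asimov's exact algorithm): orthonormalize a random Gaussian matrix to obtain a random 2-D projection plane:

```python
import numpy as np

def random_projection_2d(dim, rng):
    """Draw a random 2-D orthonormal projection of dim-dimensional data (one tour frame)."""
    # QR of a Gaussian matrix gives an orthonormal basis for a random 2-D subspace.
    Q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    return Q  # shape (dim, 2), columns orthonormal

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# A "tour" is a sequence of such frames; animating interpolates between them.
for _ in range(3):
    P = random_projection_2d(6, rng)
    frame = X @ P  # one 2-D view of the data
print(frame.shape)  # (100, 2)
```

A guided tour replaces the random draws with frames chosen along a given path.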