What makes a model simple? Do we know what is likely before we see data? Can we use this to make better models? The talk covers existing and new approaches for bringing more knowledge into machine learning problems.
The document outlines the design of a Brain Simulator. It will include multiple layers that simulate different aspects and functions of the brain at various levels of processing and time scales. The development of the simulator will be an ongoing, incremental process, with each increment adding new functionality and being validated before integration. Requirements may include simulating conditions like Alzheimer's disease or sleep and validating models through data like images or EEG readings. The simulator will use a sparse matrix to optimize memory usage and allow for an overlay of systems like the limbic system.
This document discusses machine learning. It defines machine learning as designing algorithms that extract valuable information from data. It states that machine learning is useful for problems where existing solutions require fine-tuning or many rules, complex problems traditional methods can't solve, and environments that are changing. Examples of machine learning tasks are given, such as image classification using convolutional neural networks, detecting tumors using convolutional neural networks, automatically classifying news articles using natural language processing, and creating chatbots or personal assistants using natural language processing and question-answering modules.
This document discusses machine learning techniques for classification and clustering. It introduces case-based reasoning and k-nearest neighbors classification. It discusses how kernels can allow non-linear classification by mapping data to higher dimensions. It describes k-means clustering, which groups data by minimizing distances to cluster means. It also introduces agglomerative clustering, which successively merges the closest clusters.
The Role of Ontology in Modern Expert Systems (Dallas 2008), by Jason Morris
The document discusses the role of ontologies in modern expert system development. It provides background on expert systems and ontologies, explaining that ontologies define domains of knowledge and are used to encapsulate domain knowledge for use in expert systems. The document outlines the process of developing ontologies, including identifying concepts and relationships in a domain. It also provides an example of an expert system called SINFERS that uses ontologies to select soil property prediction models.
Lessons Learned from Building Practical Deep Learning Systems, by Xavier Amatriain
1. There are many lessons to be learned from building practical deep learning systems, including choosing the right evaluation metrics, being thoughtful about your data and potential biases, and understanding dependencies between data, models, and systems.
2. It is important to optimize only what matters and beware of biases in your data. Simple models are often better than complex ones, and feature engineering is crucial.
3. Both supervised and unsupervised learning are important, and ensembles often perform best. Your AI infrastructure needs to support both experimentation and production.
This document discusses decision support systems (DSS). It describes DSS as using combinations of analytical tools like databases, spreadsheets, expert systems and neural networks to assist with decision making. Key features of DSS include handling large amounts of data, flexibility in reporting analysis, performing "what if" simulations and complex data analysis. DSS can be applied to structured, semi-structured or unstructured situations. Examples of DSS tools discussed include spreadsheets, expert systems and artificial neural networks. The document also covers topics like fuzzy logic, social/ethical issues, and suggests practical activities for students.
The document discusses different types of knowledge that may need to be represented in AI systems, including objects, events, performance, and meta-knowledge. It also discusses representing knowledge at two levels: the knowledge level containing facts, and the symbol level containing representations of objects defined in terms of symbols. Common ways of representing knowledge mentioned include using English, logic, relations, semantic networks, frames, and rules. The document also discusses using knowledge for applications like learning, reasoning, and different approaches to machine learning such as skill refinement, knowledge acquisition, taking advice, problem solving, induction, discovery, and analogy.
[PR12] Understanding Deep Learning Requires Rethinking Generalization, by JaeJun Yoo
The document discusses a paper that argues traditional theories of generalization may not fully explain why large neural networks generalize well in practice. It summarizes the paper's key points:
1) The paper shows neural networks can easily fit random labels, calling into question traditional measures of complexity.
2) Regularization helps but is not the fundamental reason for generalization. Neural networks have sufficient capacity to memorize data.
3) Implicit biases in algorithms like SGD may better explain generalization by driving solutions toward minimum norm.
4) The paper suggests rethinking generalization as the effective capacity of neural networks may differ from theoretical measures. Understanding finite sample expressivity is important.
Semi-Supervised Learning (Machine Learning Made Simple), by Devansh16
Video: https://youtu.be/65RV3O4UR3w
Semi-supervised learning is a technique that combines the benefits of supervised learning (performance, intuitiveness) with the ability to use cheap unlabeled data (unsupervised learning). With all the cheap data available, semi-supervised learning will get bigger in the coming months. This episode of Machine Learning Made Simple goes into SSL, how it works, transduction vs. induction, the assumptions SSL algorithms make, and how SSL compares to human learning.
About Machine Learning Made Simple:
Machine Learning Made Simple is a playlist that aims to break down complex Machine Learning and AI topics into digestible videos. With this playlist, you can dive head first into the world of ML implementation and/or research. Feel free to drop any feedback you might have down below.
Understanding Deep Learning Requires Rethinking Generalization, by Ahmet Kuzubaşlı
My presentation of one of the ICLR 2017 best papers, by Google Brain (arxiv.org/abs/1611.03530). I believe that generalization deserves more attention as we go deeper into the over-parameterization regime.
The document discusses research on learning to improve the efficiency of machine learning algorithms through speedup learning. It provides three key points:
1) Early work on explanation-based learning for speedup had limited success, but techniques like memoization and clause learning led to major improvements in SAT solvers.
2) More recent approaches use predictive models trained on dynamic features to learn optimal policies for controlling search algorithms, like setting noise levels or restart policies.
3) Open problems remain in developing optimal predictive policies with partial information and approximations, to continue improving search and reasoning performance.
The document discusses research on learning to improve the efficiency of machine learning algorithms through speedup learning. It provides three key points:
1) Early work on explanation-based learning for speedup had limited success, but techniques like memoization and clause learning led to major improvements in SAT solvers.
2) More recent approaches use machine learning to build predictive models of problem instances and solver behavior, in order to inform strategies like automatic noise setting and randomized restart policies.
3) Case studies demonstrate these learning-based approaches can outperform traditional techniques and fixed policies by customizing resource allocation and reformulation based on problem structure and solver progress.
Keynote at the European Semantic Web Conference (ESWC 2006). The talk tries to figure out what the main scientific challenges are in Semantic Web research.
This talk was also recorded on video, and is available on-line at http://videolectures.net/eswc06_harmelen_wswnj/
1) Machine learning involves developing algorithms that learn from data without being explicitly programmed. It is a multidisciplinary field that includes statistics, mathematics, artificial intelligence, and more.
2) There are three main areas of machine learning: supervised learning which uses labeled training data, unsupervised learning which finds patterns in unlabeled data, and reinforcement learning which learns from rewards and punishments.
3) Supervised learning is well-studied and includes techniques like support vector machines, neural networks, decision trees, and Bayesian algorithms which are used for problems like pattern recognition, regression, and time series analysis.
Machine learning involves developing systems that can learn from data and experience. The document discusses several machine learning techniques including decision tree learning, rule induction, case-based reasoning, supervised and unsupervised learning. It also covers representations, learners, critics and applications of machine learning such as improving search engines and developing intelligent tutoring systems.
In this talk we cover
1. Why NLP and DL
2. Practical Challenges
3. Some Popular Deep Learning models for NLP
Today you can take any webpage in any language and translate it automatically into a language you know. You can also cut and paste an article or other document into NLP systems and immediately get a list of the companies and people it talks about, the topics that are relevant, and the sentiment of the document. When you talk to the Google or Amazon assistant, you are using NLP systems. NLP is not perfect, but given the advances of the last two years, and continuing, it is a growing field. Let's see how it actually works, specifically using deep learning.
About Shishir
Shishir is a Senior Data Scientist at Thomson Reuters working on deep learning and NLP to solve real customer pain points, even ones customers have become used to.
The document discusses two NSF-funded research projects on intelligence and security informatics:
1. A project to filter and monitor message streams to detect "new events" and changes in topics or activity levels. It describes the technical challenges and components of automatic message processing.
2. A project called HITIQA to develop high-quality interactive question answering. It describes the team members and key research issues like question semantics, human-computer dialogue, and information quality metrics.
This document provides an overview of the semantic web including its conceptual model and major aspects. The semantic web allows for describing things in a graph model using URIs, RDF, and SPARQL. It is good for describing unforeseen classes but not as good for tabular data or frequent updates. The conceptual model uses subject-predicate-object triples to describe resources and their relationships where the description and thing described are separate. Major aspects include the distributed graph model, RDF for descriptions, and SPARQL for retrieval.
This document discusses semi-supervised learning techniques which make use of both labeled and unlabeled data. It describes commonly used assumptions in semi-supervised learning models including that data points from the same class will cluster together. Several specific semi-supervised learning algorithms are also summarized, including self-training, generative models, graph-based models, and transductive support vector machines (TSVMs).
Presentation on Machine Learning and Data Mining, by butest
The document discusses the differences between automatic learning/machine learning and data mining. It provides definitions for supervised vs unsupervised learning, what automated induction is, and the base components of data mining. Additionally, it outlines differences in the scientific approach between automatic learning and data mining, as well as differences from an industry perspective, including common data mining techniques used and tips for successful data mining projects.
Data Mining: Practical Machine Learning Tools and Techniques ..., by butest
The document discusses different ways of representing structural patterns learned from data mining. It provides examples of classification rules learned from weather data, association rules indicating item co-occurrences, and decision trees that classify instances by recursively partitioning them based on attribute tests. Representation determines how inferences can be made from the learned patterns.
Winning Kaggle competitions involves getting a good score as fast as possible using versatile machine learning libraries and models like Scikit-learn, XGBoost, and Keras. It also involves model ensembling techniques like voting, averaging, bagging and boosting to improve scores. The document provides tips for approaches like feature engineering, algorithm selection, and stacked generalization/stacking to develop strong ensemble models for competitions.
How data science works and how can customers help, by Danko Nikolic
The document discusses how CSC creates specialized models for customers through data science. It explains that textbooks oversimplify real-world data modeling, and that data scientists create customized models rather than just applying existing ones. Specialized model architectures require less data and training than general ones. The customer can help data scientists develop specialized architectures by understanding their business needs, explaining the data generation process, formulating hypotheses, and providing domain experts for consultation. CSC provides data science expertise to develop specialized models that can achieve excellent results for customers.
This presentation describes an investigation of user-centred design methodologies intended to apply to metadata or information architecture evaluation and deployment. The primary focus of this work is investigation of user conceptual models and comparison with formally architected models.
06-20-2024 AI Camp Meetup: Unstructured Data and Vector Databases, by Timothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases, and we will see how they differ from traditional databases, in which cases you need one, and in which you probably don't. I will also go over similarity search, where you get vectors from, and an example of a vector database architecture, wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
06-18-2024 Princeton Meetup: Introduction to Milvus, by Timothy Spann
Expand LLMs' knowledge by incorporating external data sources into your LLMs and AI applications.
1. PRIOR ON MODEL SPACE
What makes a model simple?
MEIR MAOR
Chief Architect @ SparkBeyond
2. Outline
Why are simple models desirable
Traditional approaches for simplicity
Alternative approaches
3. Why Simple Models?
PAC model (Leslie Valiant, 1984)
No Free Lunch
Better generalization
In reality:
Transfer learning and non-stationary distributions
Robustness against correlated samples, etc.
4. Why Simple Models? Cont.
Understandable
Trustworthy
Explainable (also for Regulatory reasons)
Understandable models are ultimately more accurate
5. Traditional complexity control
Bias / Variance tradeoff → We must limit our search space
Shrink the hypothesis space:
limit boosting iterations
tree size
min samples per leaf
number of hidden nodes
impose sparsity constraint
...
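As a hedged illustration of these knobs (scikit-learn names and a synthetic dataset; none of this comes from the talk itself):

```python
# Shrinking the hypothesis space with explicit capacity limits (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Few boosting iterations, shallow trees, and a minimum leaf size all
# constrain how complex the learned function can become.
small_space = GradientBoostingClassifier(
    n_estimators=50,       # limit boosting iterations
    max_depth=2,           # limit tree size
    min_samples_leaf=20,   # minimum samples per leaf
    random_state=0,
)
print(cross_val_score(small_space, X, y, cv=5).mean())
```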
6. Traditional complexity control cont.
Penalize “less favorable” models:
Lasso / ridge regularization
Bagging / bootstrap sampling
Dropout
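A minimal sketch of the penalty route, again assuming scikit-learn; the dataset and alpha values are invented for illustration:

```python
# Penalizing "less favorable" models: L1 (lasso) and L2 (ridge) penalties
# pull coefficients toward zero instead of hard-limiting the search space.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)  # one informative feature

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives most coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients smoothly

print("non-zero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("largest ridge coefficient:", float(np.abs(ridge.coef_).max()))
```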
7. Which is more likely?
Coefficients from two feed-forward ReLU networks (single output, single hidden layer), shown side by side as Network A and Network B.
8. Which feature is more likely?
Both have R² = 0.1
Math.ulp(x): the positive distance between this floating-point value and the double value next larger in magnitude
vs.
Math.log(x): natural logarithm
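To make the contrast concrete, here is a hedged sketch using Python's math.ulp and math.log (Python 3.9+), which mirror the Java methods named on the slide; the inputs are made up:

```python
# Two candidate features with the same training fit can deserve very different
# priors: ulp(x) is a floating-point representation artifact, log(x) is a
# natural transform that recurs across domains.
import math

for x in [1.5, 10.0, 250.0, 1e6]:
    print(f"x={x:>9}: ulp(x)={math.ulp(x):.3e}  log(x)={math.log(x):.3f}")
```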
9. Which is a more likely feature? #2
The distance to the nearest railway station?
vs.
arctan(latitude * longitude)
10. Everyone is a domain expert!
We are experts in the world we live in.
Currently, humans have a much better prior than machines.
Many ideas repeat themselves across domains.
For example, you don't have to be a rocket scientist to be familiar with second derivatives.
11. Transfer Learning to the rescue
We can and must learn from previous problems
How can a child learn to identify a Ring Tailed Lemur from a single photo?
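A minimal fine-tuning sketch in this spirit, assuming PyTorch with a recent torchvision and an illustrative 5-class target task (not part of the talk):

```python
# Transfer learning: reuse an ImageNet-pretrained backbone as a prior and
# train only a small task-specific head on the new, small dataset.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained prior
for param in backbone.parameters():
    param.requires_grad = False                      # freeze the reused features

backbone.fc = nn.Linear(backbone.fc.in_features, 5)  # new head for 5 classes
# Only backbone.fc is trainable now; everything else is knowledge carried over.
```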
12. Becoming common
Pre-Trained Neural networks
Pre-Trained embeddings
A lot of work on Images and Text
Much less research on other data:
TimeNet: an RNN for embedding time-series data
Most real-life problems tend to have a more complicated shape
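As a hedged example of reusing pre-trained embeddings (assuming the sentence-transformers library and its public all-MiniLM-L6-v2 checkpoint; the phrases are invented):

```python
# Pre-trained embeddings as a prior over representations: downstream tasks
# start from features learned on large general corpora.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained text encoder
docs = ["shops near recreational parks", "stores close to public green spaces"]
emb = model.encode(docs)

# Cosine similarity between the two phrases under the reused representation.
sim = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"similarity: {sim:.2f}")
```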
13. A different approach
Use already codified human knowledge
Explicitly look for patterns similar to things you have seen before
Extraordinary claims require extraordinary evidence
14. At SparkBeyond
Find the best hypotheses, using simple compositions of tried-and-true building blocks
A building block may require a lot of code to implement, yet it will be useful across domains
Incorporate pre-trained embeddings
Use external knowledge
Prioritize simple hypotheses
Always meta-learn how to learn
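A purely illustrative sketch of this idea (not SparkBeyond's actual engine): candidate features are compositions of reusable building blocks, ranked by fit minus a crude description-length penalty. All names and numbers here are invented.

```python
# Illustrative only: search over compositions of reusable building blocks,
# preferring hypotheses that are both predictive and short to describe.
from itertools import permutations

import numpy as np
from sklearn.linear_model import LinearRegression

BLOCKS = {  # tried-and-true, domain-agnostic transforms
    "log": lambda x: np.log1p(np.abs(x)),
    "square": lambda x: x ** 2,
    "gradient": lambda x: np.gradient(x),
}

def candidate_features(x):
    """Yield (description, feature) pairs from 1- and 2-block compositions."""
    for name, f in BLOCKS.items():
        yield name, f(x)
    for (n1, f1), (n2, f2) in permutations(BLOCKS.items(), 2):
        yield f"{n2}({n1})", f2(f1(x))

def score(feature, y, description):
    """Fit quality minus a simplicity penalty (number of composed blocks)."""
    F = feature.reshape(-1, 1)
    r2 = LinearRegression().fit(F, y).score(F, y)
    return r2 - 0.05 * description.count("(")  # longer descriptions cost more

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=300)
y = np.log1p(x) + rng.normal(scale=0.1, size=300)

best_desc, _ = max(candidate_features(x), key=lambda df: score(df[1], y, df[0]))
print("best hypothesis:", best_desc)
```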
15. Shops near recreational parks are more successful
● A domain expert can review such a finding
● True phenomenon
● Insightful and actionable
18. Simple = compressible = common
A simple model or feature is one which can be expressed briefly.
MDL (minimum description length) equates the best model with the optimal compression of the data.
Better compression leads to better model performance.
But should we be using a vanilla Turing machine for MDL?
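The two-part MDL criterion behind this slide can be written as follows (notation mine, not from the slides):

```latex
% Two-part MDL: choose the model M that minimizes the bits needed to describe
% the model plus the bits needed to describe the data given the model.
M^{*} = \arg\min_{M \in \mathcal{M}} \big[ L(M) + L(D \mid M) \big]
```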
19. Benefits of a better prior
Learn with less data
More robust to change
More robust to data issues
Understandable and explainable
Actionability without a complete model
20. Open questions and challenges
What is simple?
What makes an insight insightful?
What makes a feature likely to generalize?
Efficient search over insightful hypothesis space