Code: https://github.com/polymorpher/bittiger
Course (My Lectures + Tutorials): https://www.bittiger.io/livecourses/YQCMuXwL7fhHuQT5K
This is an introductory-level course on the theory, implementation, and applications of topic modelling (and NLP). It also includes pointers to advanced topics and state-of-the-art research papers.
Topic Modelling: for news recommendation, user behaviour modelling, and many more
1. Topic Modelling
for news recommendation, user behaviour modelling, and many more
Aaron Li
aaron@potatos.io
2. About me
• Working on a stealth startup
• Former lead inference engineer at Scaled Inference
• Did AI / machine learning at Google Research, NICTA, CMU, ANU, etc.
• https://www.linkedin.com/in/aaronqli/
3. Overview
• Theory (2 classes, 2h each)
  • work out the problem & solutions & why they work
  • discuss the math & models & NLP fundamentals
  • industry use cases & systems & applications
• Practice (2 classes, 2h each)
  • live demo + coding + debugging
  • data sets, open source tools, Q & A
4. Overview
• Background knowledge
  • Linear algebra
  • Probability theory
  • Calculus
  • Scala / Go / Node / C++ (please vote)
5. Schedule
Theory 1
• What is news recommendation?
• What is topic modelling? Why?
• Basic architecture
• NLP fundamentals
• Basic model: LDA
Practice 1
• LDA live demo
• NLP tools introduction
• Preprocessed datasets
• Code LDA + experiments
• Open source tools for industry
Theory 2
• LDA inference
• Gibbs sampling
• SparseLDA, AliasLDA, LightLDA
• Applications & industrial use cases
Practice 2
• Set up NLP pipeline
• SparseLDA, AliasLDA, LightLDA
• Train & use the model
• News recommendation demo
7. News Recommendation
• A lot of people read news every day
  • Flipboard, CNN, Facebook, WeChat …
• How do we make people more engaged?
• Personalisation & recommendation
  • learn preferences and show relevant content
  • recommend articles based on the current one
8. News Recommendation
• Top websites / apps are already doing this
10. News Recommendation
Yahoo! News (now “Oath” News)
11. News Recommendation
• Many websites don’t do it (e.g. CNN)
• Why not? It’s not an easy problem
• Challenges
  • News article vocabulary is large (100k ~ 1M)
  • Documents are represented by high-dimensional vectors of vocabulary counts
  • Traditional similarity measures don’t work on such vectors
12. Example
In 1996 Linus Torvalds, the Finnish creator of the Open Source operating system Linux, visited the National Zoo and Aquarium with members of the Canberra Linux Users Group, and was captivated by one of the Zoo's little Penguins. Legend has it that Linus was infected with a mythical disease called Penguinitis. Penguinitis makes you stay awake at night thinking about Penguins and feeling great love towards them.
Not long after this event the Open Source Software community decided they needed a logo for Linux. They were looking for something fun and after Linus mentioned his fondness of penguins, a slightly overweighted penguin sitting down after having a great meal seemed to fit the bill perfectly. Hence, Tux the penguin was created and now when people think of Linux they think of Tux.
13. Example
• Word count = 132, unique words = 91
• Very hard to measure its distance to other articles in our database talking about Linux, Linus Torvalds, and the creation of Tux
• Distance measures meant for low-dimensional spaces aren’t effective here
  • e.g. cosine similarity on raw counts won’t make sense
• Need to represent things in low-dimensional vectors
• Capture semantics / topics efficiently
14. Solutions
Step 1. Get text data: news articles, emails, legal docs, resumes, … (i.e. documents)
Step 2. ??? (machines can’t read raw text)
Step 3. Model & train
Step 4. Deploy & predict
15. Solutions
Step 2: NLP preprocessing. The common pipeline:
• Sentence splitting
• Tokenisation
• Stop words removal
• Stemming (optional)
• POS tagging
• Lemmatisation
• Form bag of words
There are a lot more steps, used in advanced NLP tasks: chunking, named entity recognition, sentiment analysis, syntactic analysis, dependency parsing, coreference resolution, entity relationship extraction, semantic analysis, …
16. NLP Preprocessing: Sentence splitting
• Mostly rules (by regex or FST; see the sketch below)
• Look for sentence delimiters
  • For English: . ! ? etc.
• Check out the Wikipedia article
• Open source code is good
• Also check out this article
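To make the rule-based idea concrete, here is a minimal regex sketch (my illustration, not code from the deck); real splitters also need rules for abbreviations like “Dr.” and “e.g.”, quotes, and numbers:

```python
import re

# Minimal rule-based splitter: break on . ! ? followed by whitespace
# and an uppercase letter. Abbreviations and quotes are not handled.
SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def split_sentences(text):
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]

print(split_sentences("Linus visited the zoo. He saw a penguin! Amazing?  Yes."))
# ['Linus visited the zoo.', 'He saw a penguin!', 'Amazing?', 'Yes.']
```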
17. NLP Preprocessing: Tokenisation
• Find boundaries for words
• Easy for English (look for spaces; see the toy sketch below)
• Hard for Chinese etc.
  • Solution: FST, CRF, etc.
• Difficulties: see the Wikipedia article
• Try making one yourself using an FST! (CMU 11-711 homework)
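A toy whitespace-and-punctuation tokeniser for English (illustrative only; as the slide notes, unsegmented languages like Chinese need FST/CRF models instead):

```python
import re

# Naive English tokeniser: words (with optional apostrophe part) or
# single punctuation marks. Lowercasing is a common normalisation step.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenise(sentence):
    return TOKEN.findall(sentence.lower())

print(tokenise("Tux, the penguin, wasn't created until 1996."))
# ['tux', ',', 'the', 'penguin', ',', "wasn't", 'created', 'until', '1996', '.']
```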
18. NLP Preprocessing: Stop words removal
• Stop words:
  • occur frequently
  • are semantically not meaningful
  • e.g. am, is, who, what, etc.
• Small set of words
• Easy to implement
  • e.g. an in-memory hashset (see the sketch below)
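A sketch of the hashset approach (the stop-word list here is a tiny illustrative sample, not a real list):

```python
# Stop-word removal with an in-memory hash set, as suggested above.
STOP_WORDS = {"am", "is", "are", "who", "what", "a", "an", "the", "of", "and"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["tux", "is", "the", "penguin", "of", "linux"]))
# ['tux', 'penguin', 'linux']
```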
19. NLP Preprocessing: Stemming (optional)
• Reduce a word to its root (stem)
• Usually used in IR systems
• The root can be a non-word, e.g.
  • fishing, fished, fisher => fish
  • cats, catty => cat
  • argument, arguing => argu
• Rule-based implementation
  • e.g. Porter’s Snowball stemmer (see the sketch below)
• Also see the Wikipedia article
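A sketch using NLTK's Snowball (Porter2) stemmer, assuming `pip install nltk`:

```python
# Rule-based stemming with NLTK's Snowball stemmer (a sketch).
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["fishing", "fished", "cats", "arguing"]:
    print(word, "=>", stemmer.stem(word))
# fishing => fish, fished => fish, cats => cat, arguing => argu
```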
20. NLP Preprocessing: POS tagging
• POS = part of speech
• Find the grammatical role of each word:
  I ate a fish
  PRP VBD DT NN
• Disambiguate the same word used in different contexts, e.g:
  • “train” as in “train a model”
  • “train” as in “catch a train”
• Techniques: HMM, CRF, etc. (see the tagging example below)
• See this article for more details
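The slide's example, tagged with NLTK's perceptron tagger (assumes nltk plus its tokenizer and tagger data; the data package names vary slightly across NLTK versions):

```python
# POS tagging the example sentence with NLTK (a sketch).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

print(nltk.pos_tag(nltk.word_tokenize("I ate a fish")))
# [('I', 'PRP'), ('ate', 'VBD'), ('a', 'DT'), ('fish', 'NN')]
```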
21. NLP Preprocessing: Lemmatisation
• Find the base form of a word
• More complex than stemming
  • uses POS tag information
  • different rules for different POS
• Base form is a valid word, e.g. (see the example below)
  • walks, walking, walked => walk
  • am, are, is => be
  • argument (NN) => argument
  • arguing (VBG) => argue
• See the Wikipedia article for details
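A sketch with NLTK's WordNet lemmatiser, showing how the POS tag changes the result, as in the argument (NN) vs arguing (VBG) example above (assumes nltk and its 'wordnet' data):

```python
# Lemmatisation with NLTK's WordNet lemmatiser (a sketch).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatiser = WordNetLemmatizer()

print(lemmatiser.lemmatize("walking", pos="v"))   # walk
print(lemmatiser.lemmatize("is", pos="v"))        # be
print(lemmatiser.lemmatize("arguing", pos="v"))   # argue
print(lemmatiser.lemmatize("argument", pos="n"))  # argument
```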
22. NLP Preprocessing: Form bag of words
• Index the pre-processed documents and words with ids and frequencies, e.g. (see the sketch below):
  • id:1 word:(train, VBG) freq:5
  • id:2 word:(model, NN) freq:2
  • id:3 word:(train, NN) freq:3
  • …
• See the UCI Bag of Words dataset
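A minimal sketch of the indexing step, in the spirit of the id/word/freq example above (pure Python; the id scheme is my choice for illustration, not exactly the UCI format):

```python
# Build a bag of words: each (word, POS) pair gets an id and a
# per-document frequency.
from collections import Counter

def bag_of_words(tagged_tokens, vocab):
    """tagged_tokens: list of (word, pos) pairs for one document.
    vocab: dict mapping (word, pos) -> id, grown as new pairs appear."""
    counts = Counter(tagged_tokens)
    bag = {}
    for pair, freq in counts.items():
        word_id = vocab.setdefault(pair, len(vocab) + 1)
        bag[word_id] = freq
    return bag

vocab = {}
doc = [("train", "VBG"), ("model", "NN"), ("train", "VBG"), ("train", "NN")]
print(bag_of_words(doc, vocab))  # {1: 2, 2: 1, 3: 1}
print(vocab)
```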
23. Solutions
• Modelling & training
  • Naive Bayes
  • Latent Semantic Analysis
  • word2vec, doc2vec, …
  • Topic modelling
24. Solutions
• Naive Bayes (a very old technique)
  • Uses only key words to get a probability for each of K labels
  • Good for spam detection (see the toy sketch below)
  • Poor performance for news recommendation
  • Does not capture semantics / topics
  • https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
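A toy Naive Bayes spam detector along these lines, sketched with scikit-learn (assumes `pip install scikit-learn`; the data is made up):

```python
# Multinomial Naive Bayes over word counts (a sketch with toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "cheap pills win", "meeting at noon", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(docs)        # docs x words count matrix
model = MultinomialNB().fit(X, labels)

print(model.predict(vectoriser.transform(["win cheap money"])))  # ['spam']
```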
25. Solutions
• Latent Semantic Analysis (~1990 - 2000)
  • SVD on a TF-IDF frequency matrix with documents as columns and words as rows (see the sketch below)
  • Gives a low-rank approximation of the matrix and represents documents as low-dimensional vectors
  • Problem: the vectors / documents are hard to interpret, and the implied probability distribution is wrong (Gaussian)
  • https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html
  • Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Kathryn B. Laskey and Henri Prade, editors, UAI 1999
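A sketch of the LSA recipe (SVD on TF-IDF) with scikit-learn; note sklearn builds a documents-by-words matrix, the transpose of the orientation on the slide, which does not change the idea:

```python
# LSA: TF-IDF followed by truncated SVD (a sketch; toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["linux penguin tux", "penguin zoo visit",
        "stock market crash", "market trading stock"]

tfidf = TfidfVectorizer().fit_transform(docs)   # docs x words
lsa = TruncatedSVD(n_components=2)              # low-rank approximation
doc_vectors = lsa.fit_transform(tfidf)          # docs x 2
print(doc_vectors.shape)  # (4, 2)
```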
26. Solutions
• word2vec, doc2vec (2013~)
  • Convert words to dense, low-dimensional, compositional vectors (e.g. king - man + woman = queen)
  • Good for classification problems (see the sketch below)
  • Slow to train, hard to interpret (because of the neural network), yet to be tested in industrial use cases
  • Mikolov, Tomas; et al. “Efficient Estimation of Word Representations in Vector Space”. ICLR 2013.
  • Getting started with word2vec
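A minimal gensim word2vec sketch (assumes `pip install gensim`, 4.x API; the corpus is far too small for meaningful vectors and only shows the calls):

```python
# Training word vectors with gensim's word2vec (a sketch).
from gensim.models import Word2Vec

sentences = [["linus", "created", "linux"],
             ["tux", "is", "the", "linux", "penguin"],
             ["penguins", "live", "at", "the", "zoo"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["linux"].shape)                 # (50,)
print(model.wv.most_similar("linux", topn=2))  # nearest neighbours
```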
27. Solutions
• Topic models (LDA etc., 2003~)
  • Define a generative structure involving latent variables (e.g. topics) using well-structured distributions, and infer the parameters
  • Represent documents / words using low-dimensional, highly interpretable distributions
  • Extensively used in industry; many open source tools (see the sketch below)
  • Extensive research on speeding up / scaling up
  • D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003
  • Tutorial: Parameter estimation for text analysis, Gregor Heinrich 2008
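As a preview of the practice sessions, a sketch of fitting LDA with gensim, one possible open-source tool (assumes `pip install gensim`; toy corpus):

```python
# Fitting LDA with gensim (a sketch).
from gensim import corpora
from gensim.models import LdaModel

docs = [["linux", "penguin", "tux", "linux"],
        ["zoo", "penguin", "visit"],
        ["stock", "market", "trading", "stock"]]

dictionary = corpora.Dictionary(docs)                 # word <-> id
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

print(lda.print_topics())                  # top words per topic
print(lda.get_document_topics(corpus[0]))  # topic vector for doc 0
```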
29. Topic Models
Latent Dirichlet Allocation (LDA)
Image from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
30. Topic Models
• LDA (Latent Dirichlet Allocation)
  • Arguably the most popular topic model since 2003
  • Created by David Blei, Andrew Ng, and Michael Jordan
  • To be practical, we use this topic model in class
31. LDA
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
32. LDA
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
33. LDA
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
35. Example
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
36. Example
Extracted from [BleiNgJordan2003, Latent Dirichlet Allocation]
37. LDA
• Task: infer parameters
  • each document’s representation as a topic vector
    • with this we can compute document similarity! (see the illustration below)
  • each topic’s representation as word counts
    • with this we can look at each topic manually and interpret its meaning!
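To illustrate the similarity claim with made-up numbers: once each document is a K-dimensional topic distribution, cosine similarity becomes meaningful again.

```python
# Cosine similarity over topic vectors (a sketch; vectors are made up).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

doc_a = [0.7, 0.2, 0.1]    # mostly topic 0 (say, "linux")
doc_b = [0.6, 0.1, 0.3]    # also mostly topic 0
doc_c = [0.05, 0.05, 0.9]  # mostly topic 2 (say, "finance")

print(cosine(doc_a, doc_b))  # high -> recommend doc_b to readers of doc_a
print(cosine(doc_a, doc_c))  # low
```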
38. Theory 1
End of class
Questions?
39. Industrial Applications & Use Cases
• Yi Wang et al. Peacock: Learning Long-Tail Topic Features for Industrial Applications (TIST 2014)
  • advertising system in production
• Aaron Li et al. High Performance Latent Variable Models (arXiv, 2014)
  • user preference learning from search data
• Arnab Bhadury, Clustering Similar Stories Using LDA
  • news recommendation
• And many more… search “AliasLDA” or “LightLDA” on Google
40. LDA Inference
Bayes rule:
p(z | x) = p(x | z) p(z) / p(x)
where z denotes all latent variables and x the observed data.
41. LDA Inference
In LDA, the topic assignment z_i for each word is latent.
Intractable: the denominator p(x) sums over every possible topic assignment (K^N terms for K topics and N words).
42. LDA Inference
What can we do to address intractability?
• Gibbs sampling
• Variational inference (not discussed in class)
43. LDA Inference
Estimate p(z | x) by sampling.
Gibbs sampling: sample each z_i from p(z_i | z_-i, x); this is easy if z_-i (all the other assignments) is known.
44. LDA Inference
We can compute p(z_i | z_-i, x) using Bayes rule.
This quantity is called the “predictive probability”. It can be applied to the latent variable which assigns a topic to each word, i.e. to compute the probability that a word is assigned a particular topic, given the other topic assignments and the data (docs, words).
45. LDA Inference
Derive the predictive probability
49. LDA Inference
The terms on the right-hand side are known at all times!
We can compute the predictive probability (the left-hand term) by normalising over all k’s (see the formula below).
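The derivation slides (46-48) were images that did not survive the transcript. For reference, the predictive probability in its standard collapsed-Gibbs form, following the Heinrich tutorial cited on slide 27 (a reconstruction, not the deck's own equation):

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\; \left(n_{d,k}^{-i} + \alpha\right)
  \cdot \frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

Here n_{d,k}^{-i} is the number of words in document d assigned to topic k, n_{k,w}^{-i} the number of times word w is assigned to topic k, n_k^{-i} the total number of words assigned to topic k (all excluding the current word i), V the vocabulary size, and α, β the Dirichlet hyperparameters. Every term on the right is a simple count, which is what "known all the time" means on slide 49.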
50. LDA Inference
• Algorithm (Gibbs sampling; a runnable sketch follows below):
  • Randomly assign a topic to each word in each doc
  • For T iterations (a large number, to ensure convergence):
    • For each doc
      • For each word
        • For each topic, compute the predictive probability
        • Sample a topic by normalising over all predictive probabilities
  • Repeat for T’ further iterations (a small number) and compute topic counts per word and per doc; use them to estimate θ (per-document topic proportions) and φ (per-topic word distributions)
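A minimal collapsed Gibbs sampler implementing this loop (my sketch in Python/NumPy, not the course's reference code: it uses a toy corpus, estimates θ and φ from the final state instead of averaging over T' samples, and makes no attempt at the speed-ups discussed next; the inner loop is O(K) per token, which is exactly what SparseLDA/AliasLDA/LightLDA attack):

```python
# Collapsed Gibbs sampling for LDA (a minimal sketch).
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, rng=None):
    """docs: list of lists of word ids in [0, V). Returns (theta, phi)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total counts per topic
    z = []                            # topic assignment per word
    for d, doc in enumerate(docs):    # random initialisation
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # predictive probability for every topic, then sample
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi

docs = [[0, 1, 2, 0], [1, 2, 2], [3, 4, 3, 4]]  # toy corpus, V = 5
theta, phi = lda_gibbs(docs, K=2, V=5)
print(theta.round(2))  # per-document topic proportions
```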
51. Speed up LDA
(Switch to my KDD 2014 slides)
https://www.slideshare.net/AaronLi11/kdd-2014-presentation-best-research-paper-award-alias-topic-modelling-reducing-the-sampling-complexity-of-topic-models
52. Theory 2
End of class
Questions?