This talk demonstrates how to use word2vec models in a Postgres database to facilitate semantic search of job posts. Attendees will learn to structure models for usage in a relational database.
Digital governance is the use of information and communication technologies to improve information and service delivery, encouraging citizen participation in the decision-making process and making government more accountable, transparent, and effective.
How Azure helps to build better business processes and customer experiences w...Maxim Salnikov
Artificial Intelligence is not the future, it is NOW. Cloud technology empowers developers and technology leaders to benefit from AI effectively and responsibly with the models and tools they need. In this session, we go through the portfolio of Azure AI services and run some demos to showcase how AI can improve daily life, safety, productivity, accessibility, and business outcomes.
A Practical Enterprise Feature Store on Delta LakeDatabricks
The feature store is a data architecture concept used to accelerate data science experimentation and harden production ML deployments. Nate Buesgens and Bryan Christian describe a practical approach to building a feature store on Delta Lake at a large financial organization. This implementation has reduced feature engineering “wrangling” time by 75% and has increased the rate of production model delivery by 15x. The approach described focuses on practicality. It is informed by innovative approaches such as Feast, but our primary goal is evolutionary extensions of existing patterns that can be applied to any Delta Lake architecture.
Key Takeaways:
– Understand the key use cases that motivate the feature store from both a data science and engineering perspective.
– Consider edge cases where there may be opportunities for simplification such as “online” predictions.
– Review a typical logical data model for a feature store and how that can be applied to your business domain.
– Consider options for physical storage of the feature store in the Delta Lake.
– Understand common access patterns including metadata-based feature discovery.
AWS user group Serverless in September - Chris Johnson Bidler "Go Serverless ...AWS Chicago
September 19th joint meetup with Serverless Chicago user group at RedShelf - Serverless in AWS.
"Go Serverless from your iPad: Building a Data-driven REST API with AWS CodeStar, Lambda, and Cognitect’s Datomic and Vase" - Chris Johnson Bidler, CTO at Centriq Technology, Inc
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)Pat Patterson
Why would anyone but the most pedestrian enterprise developer be interested in a data access protocol originally designed by Microsoft, implemented in XML and handed to OASIS for standardization? The Open Data Protocol, or OData for short, has evolved into a clean, RESTful interface for CRUD operations against data services. Alongside the usual enterprise suspects such as Microsoft, Salesforce and IBM, OData has been adopted by government and non-profit agencies to open up their data and make it accessible to the public. For developers wanting to consume data, or create their own OData services, there's no shortage of open source options, from Apache Olingo in Java to node-odata and ODataCpp. Whether you're accessing customer orders in SAP or the Whitehouse visitor book, you're going to need some OData smarts.
CCI2018 - Automatizzare la creazione di risorse con ARM template e PowerShellwalk2talk srl
On Azure, resources can be created quickly and in a standardized way using JSON templates that describe the resources to create on the platform. Together we look at what these templates can do and how they can be extended with the Custom Script Extension and PowerShell Desired State Configuration.
By Marco Obinu
SQL for Web APIs - Simplifying Data Access for API ConsumersJerod Johnson
From Nordic APIs Platform Summit 2019 - Stockholm, Sweden
As the data world evolves, businesses are moving more of their data out of databases and into SaaS applications. Despite the migration, SQL remains a ubiquitous language for data access, so much so that many SaaS applications and non-relational cloud data stores support SQL endpoints in their APIs. While these endpoints allow users to leverage SQL queries to easily request data, there are still costly challenges to overcome when it comes to processing and managing the returned data.
In this presentation, we'll showcase popular APIs that offer SQL endpoints, explore the benefits of providing customers SQL access, and cover how standards-based drivers enable SaaS integration and self-service data access through SQL.
Agenda:
MongoDB Overview/History
Workshop
1. How to perform operations to MongoDB – Workshop
2. Using MongoDB in your Java application
Advanced usage of MongoDB
1. Performance measurement comparison – real life use cases
3. Doing Cluster setup
4. Cons of MongoDB with other document oriented DB
5. Map-reduce/ Aggregation overview
Workshop prerequisite
1. All participants must bring their laptops.
2. https://github.com/geek007/mongdb-examples
3. Software prerequisite
a. Java version 1.6+
b. Your favorite IDE, Preferred http://www.jetbrains.com/idea/download/
c. MongoDB server version – 2.6.3 (http://www.mongodb.org/downloads - 64 bit version)
d. Participants can install MongoDB client – http://robomongo.org/
About Speaker:
Akbar Gadhiya is working with Ishi Systems as Programmer Analyst. Previously he worked with PMC, Baroda and HCL Technologies.
Quick start guide to java script frameworks for sharepoint apps spsbe-2015Sonja Madsen
Learn about JavaScript frameworks and new developer practices in SharePoint and on Office 365. JavaScript frameworks are there for you to help you develop faster and easier. You don't need to do your apps from scratch.
Apps and the cloud app model have brought not only new ways to interact, send, write, and receive data from SharePoint. Apps have also brought JavaScript frameworks into SharePoint development. JavaScript frameworks are right there as part of the app template when you start a SharePoint hosted or a Cloud app. In this session, I'll show what you can do with JavaScript frameworks that are part of the app template. I'll show jQuery, Bootstrap, and modernizr.
In recent years, cheap drones and satellites have made it easier to monitor the earth. This data is used as the image layers of mapping websites for many scientific missions. This talk will explore how cloud-native, pure-JavaScript apps can assemble and browse large quantities of map data using the JAMstack and Gatsby.
This talk discusses machine learning by presenting a case study of a project that uses Apache Zeppelin to build a dataset of images of appliances, then fine-tune a pre-existing model using mxnet.
Abstract:
Many machine learning algorithms can be implemented to run parallel operations on graphics cards. Deeplearning4j is a Java-based machine learning library that includes implementations of many popular neural-network algorithms. Deeplearning4j uses a library called Nd4j to run matrix algebra operations on either CPUs or GPUs with NVIDIA’s CUDA API.
In this talk, I will show how to get a simple machine learning algorithm running on the GPU. I will also cover how to get started with CUDA development: how to get your code to run on the GPU, how to monitor the device, and how to write code to make effective use of parallelization.
Bio: Gary Sieling is a Lead Software Engineer at IQVIA, in Blue Bell, PA, with interests in database technologies, machine learning, and software engineering practices. He has been involved in curating talks for a company lunch and learn program and the organizing committee for a tech conference. Building on these experiences, he built a search engine called FindLectures.com to help find great talks and speakers.
PHASE (Philly Area Scala Enthusiasts) - Word2vec in Scala. Talk explains concrete examples of how Word2vec works, built around a demo of constructing email alerts using concept search.
Lucene/Solr Revolution 2017: Indexing Videos in SolrGary Sieling
FindLectures.com is a discovery engine for tech talks, historic speeches, and academic lectures. The site rates audio and video content for quality, showing different recommended talks each day on a variety of topics.
FindLectures.com crawls conference sites to get talk metadata, such as speaker names and bios, descriptions, and the date a video was recorded. Often these attributes are sparsely populated, or available across multiple websites. Additional attributes are inferred from audio and video content, but require more sophisticated data extraction to be useful in a full-text search engine.
This talk will discuss interesting lessons learned from crawling historical videos and demonstrate information extraction with machine learning.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
5. Why Work in a relational database?
- Bring algorithms to the data
- E.g. machine learning in data warehouses
- Access to other DB features:
- Geospatial search
- Full text search
- Other DB types: trees, ranges, json
- Work with existing tuning options:
- materialized tables
- control where data is stored
6. Architectural Alternatives
- One or more large databases
- Could test/run locally in containers
- Sharded systems
- DB architected around many separate parts
- E.g. AWS S3 + Athena + Glue + Lambdas
8. Vectors
- Need to store in a DB
- Word2vec - Google News: 300-dimensional vectors for 3 million words and phrases.
- Image vectors (e.g. SIFT)
- Audio
- One large matrix
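A rough back-of-the-envelope check of what "one large matrix" means for storage, assuming each value is a 4-byte (32-bit) float. The helper name `dense_matrix_bytes` is hypothetical, and the estimate ignores the vocabulary strings and any file header, so it only approximates the model size quoted on the disk-space slide.

```python
def dense_matrix_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Size of a dense matrix of fixed-width values.

    Assumes 4-byte floats by default; vocabulary text and file
    headers are not counted.
    """
    return num_vectors * dimensions * bytes_per_value


# Google News word2vec: 3 million words and phrases, 300 dimensions each.
google_news = dense_matrix_bytes(3_000_000, 300)
print(f"{google_news:,} bytes")  # 3,600,000,000 bytes
```

At roughly 3.6 GB for the raw numbers alone, how the matrix is laid out in the database matters as much as the query that reads it.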
15. Why Word Vectors?
- Give access to “meaning”, rather than tokens:
- Search on similar words, concepts
- Or dis-similar words, concepts
- Averaging terms in documents allows you to compare meaning
- E.g. re-ranking top search results for “aboutness” or meaning diversity
16. Design
Take my resume, tokenize it
Average term vectors
Find a large list of related terms (e.g., javascript -> js, node, css)
Find matching postings
Take each posting, tokenize it
Average terms in the entire posting
Compute the cosine distance between the resume and posting
Sort
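The steps above can be sketched outside the database in a few lines of Python. This is a minimal illustration, not the talk's implementation: the three-dimensional `VECTORS` table is made up, the `tokenize` here is a simple regex stand-in for whatever tokenizer the talk uses, and the related-terms expansion step is left out.

```python
import math
import re

# Toy 3-dimensional "word vectors" with made-up values. A real model such
# as the Google News word2vec set has 300 dimensions.
VECTORS = {
    "javascript": [0.9, 0.1, 0.0],
    "node":       [0.8, 0.2, 0.1],
    "css":        [0.7, 0.3, 0.0],
    "nurse":      [0.0, 0.1, 0.9],
    "hospital":   [0.1, 0.0, 0.8],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def average_vector(text):
    """Average the vectors of all known tokens; None if no token is known."""
    found = [VECTORS[t] for t in tokenize(text) if t in VECTORS]
    if not found:
        return None
    return [sum(column) / len(found) for column in zip(*found)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a zero vector as unrelated
    return 1.0 - dot / norm


def rank_postings(resume, postings):
    """Return posting URLs sorted from closest to farthest from the resume."""
    resume_vec = average_vector(resume)
    if resume_vec is None:
        return []
    scored = []
    for url, text in postings.items():
        vec = average_vector(text)
        if vec is not None:
            scored.append((cosine_distance(resume_vec, vec), url))
    return [url for _, url in sorted(scored)]
```

For example, `rank_postings("JavaScript, Node", {"a": "Node and CSS developer", "b": "Hospital nurse"})` puts posting `a` ahead of `b`, because the averaged resume vector points in nearly the same direction as the averaged vector for `a`.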
17. Average terms in a resume
CREATE TABLE resumes_average AS
SELECT
resumes.person_name,
tokenize(resume) AS word_averages
FROM resumes
18. Average terms in a job
CREATE TABLE job_averages AS
SELECT
url,
tokenize(terms) AS word_averages
FROM jobs
23. “Inverted File System with Asymmetric Distance Calculation”
- Locality sensitive hashing
- Store near-ish vectors together
- Distance can be between hashes, between hash and vector
- Choosing search performance vs. accuracy
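As an illustration of "store near-ish vectors together", here is a minimal sketch of one common locality sensitive hashing family, random hyperplanes, which suits cosine similarity. It is an assumption for illustration and not necessarily the scheme used in the talk; all function names are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dims, seed=0):
    """Draw one random hyperplane (normal vector) per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_bits)]


def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming_distance(h1, h2):
    """Number of differing bits: a cheap hash-to-hash distance."""
    return bin(h1 ^ h2).count("1")


def build_buckets(vectors, hyperplanes):
    """Group keys by hash so vectors with similar directions share a bucket."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(lsh_hash(vec, hyperplanes), []).append(key)
    return buckets
```

More bits give smaller, purer buckets but make it more likely that a true neighbor lands in a different bucket, which is the performance versus accuracy trade-off named in the last bullet.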
25. Issues
- Doesn’t consider term or concept frequency
- Doesn’t show us jobs that are the next steps
- Doesn’t consider how old a job in the resume was
30. Improvements
● Tune TF*IDF implementation (not currently in Postgres)
● Search / cap results repeatedly. Can handle:
○ Aboutness / not-aboutness
○ Result diversity
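One possible reading of the TF*IDF item is to weight each term vector by its inverse document frequency before averaging, so that rare terms count for more than terms found in every posting. The sketch below assumes that reading; the function names are hypothetical, and term frequency enters only because each occurrence of a token contributes once.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each term to log(N / number of documents that contain it)."""
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def weighted_average(tokens, vectors, idf):
    """Average term vectors, weighting each occurrence by the term's IDF.

    Tokens without a vector, or with zero IDF, are skipped.
    Returns None when nothing contributes.
    """
    summed = None
    total_weight = 0.0
    for token in tokens:
        weight = idf.get(token, 0.0)
        if token not in vectors or weight == 0.0:
            continue
        if summed is None:
            summed = [0.0] * len(vectors[token])
        summed = [s + weight * v for s, v in zip(summed, vectors[token])]
        total_weight += weight
    if summed is None:
        return None
    return [s / total_weight for s in summed]
```

With two documents `["java", "developer"]` and `["nurse", "developer"]`, the term `developer` appears in both and gets an IDF of zero, so it drops out of the average entirely.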
31. Variations of Word2vec
FastText - incorporates character n-grams (subword information) along with the words
StarSpace - Semantically similar sentences, categorization
36. FAISS
- Library by Facebook for fast vector search
- Vectors stored in Voronoi cells
- Can be quantized
- Can use GPUs
- Offers Clustering, PCA
- E.g. nearest vectors:
D, I = index.search(xq, 5)
print(I[:5])
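In the snippet above, `D` holds distances and `I` holds the row indices of the nearest stored vectors, one row per query. The brute-force sketch below mimics that return shape in plain Python without FAISS, to show what an index speeds up. `brute_force_search` is a hypothetical name, and squared L2 distance is assumed here, as with a flat L2 index.

```python
def brute_force_search(database, queries, k):
    """Return (D, I): for each query, the k smallest squared L2 distances
    and the indices of the matching database rows, nearest first."""
    all_distances, all_indices = [], []
    for q in queries:
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, row)), i)
            for i, row in enumerate(database)
        )[:k]
        all_distances.append([d for d, _ in scored])
        all_indices.append([i for _, i in scored])
    return all_distances, all_indices


xb = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]  # stored vectors
xq = [[0.9, 0.1]]                                      # one query vector
D, I = brute_force_search(xb, xq, 2)
print(I)  # [[1, 0]]
```

This scan touches every stored vector for every query. Partitioning into cells and quantizing, as listed above, trades some accuracy to avoid that.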
39. Google for
Concatenated orientation histograms
Why did they use Euclidean distance?
Restricted Boltzmann machine
Spectral hashing
Euclidean Locality-Sensitive Hashing
inverted file system with asymmetric distance calculation
40. How much disk space do these take?
Word2vec model: 3,644,258,522
Google_vecs.txt: 10,766,478,818
Quantized index:
IVFADC: