To practice unsupervised machine learning techniques, I scraped data from Denver's "For Free" Craigslist category to categorize and better understand the products available.
Regular expressions (regex) form a very powerful search-pattern language used to filter data from vast databases. Here is an in-depth tutorial on regular expressions and their use in Search Engine Optimization (SEO) and for managing large amounts of data in popular tools like Notepad++ and Excel.
This document summarizes a text mining project analyzing Stack Overflow posts tagged with R, statistics, machine learning, and other tags. It describes the dataset, data cleaning process, feature engineering including creating word frequency matrices, and unsupervised feature selection to reduce the feature space. Power features beyond word frequencies were also extracted, such as counts of code blocks, LaTeX blocks, and words in titles and bodies. The goal was to classify posts as related to R or not related to R for supervised learning.
This document discusses dimensional modeling (DM) as a way to simplify entity-relationship (ER) data models that are used for data warehousing and online analytical processing (OLAP). DM results in a star schema with one central fact table linked to multiple dimension tables. This structure is simpler for users to understand and for query tools to navigate compared to complex ER models. While DM uses more storage space by duplicating dimensional data, it improves query performance through fewer joins. The document provides an example comparing the storage requirements of a phone call fact table under a star schema versus a snowflake schema.
This document describes a method for ranking entity types using contextual information from text. It presents several approaches for ranking types, including entity-centric, hierarchy-based, and context-aware methods. It also describes how the different ranking approaches are evaluated using crowdsourcing to collect relevance judgments on entity types within given contexts. The approaches are implemented in a system called TRank that uses inverted indices and MapReduce for scalability.
Incremental View Maintenance for openCypher Queries (Gábor Szárnyas)
Presented at the Fourth openCypher Implementers Meeting
Numerous graph use cases require continuous evaluation of queries over a constantly changing data set, e.g. fraud detection in financial systems, recommendations, and checking integrity constraints. For relational systems, incremental view maintenance has been researched for three decades, resulting in a wide body of literature. The property graph data model and the openCypher language, however, are recent developments, and therefore lack established techniques to perform efficient view maintenance. In this talk, we give an overview of the view maintenance problem for property graphs, discuss why it is particularly difficult and present an approach that tackles a meaningful subset of the language.
In this webinar Thomas Cook, Sales Director, AnzoGraph DB, provides a history lesson on the origins of SPARQL, including its roots in the Semantic Web, and how linked open data is used to create Knowledge Graphs. Then, he dives into "What is RDF?", "What is a URI?" and "What is SPARQL?", wrapping up with a real-world demonstration via a Zeppelin notebook.
This document provides an introduction to genetic algorithms. It explains that genetic algorithms are inspired by Darwinian evolution and use processes like selection, crossover and mutation to iteratively improve a population of potential solutions. It discusses how genetic algorithms can be used for optimization problems and classification in data mining. Examples of genetic algorithm applications like the traveling salesman problem are also presented to illustrate genetic algorithm concepts and processes.
Semantic Search and Result Presentation with Entity Cards (Faegheh Hasibi)
The document describes semantic search and entity summarization techniques. It discusses generating query-dependent entity summaries using a knowledge graph. Facts about an entity are ranked based on importance and relevance to the query. The top facts are grouped by predicate and generated into a summary respecting length and width constraints. User studies found utility-based summaries outperformed relevance-only summaries. The techniques can power applications like knowledge-enabled search engines and personalized summaries.
How Are Graph Databases Used in Police Departments? (Samet KILICTAS)
This presentation delivers the basics of graph concepts and graph databases to the audience. It explains how graph databases are used, with sample use cases from industry, and how they can be applied in police departments. Questions like "When should I use a graph DB?" and "Should I solve this problem with a graph DB?" are answered.
Every year the financial industry loses billions to fraud, while fraudsters keep coming up with more and more sophisticated patterns.
Financial institutions have to find the balance between fraud protection and negative customer experience. Fraudsters bury their patterns in lots of data, but the traditional technologies are not designed to detect fraud in real-time or to see patterns beyond the individual account.
Analyzing relations with graph databases helps uncover these larger complex patterns and speeds up suspicious behavior identification.
Furthermore, graph databases enable fast and effective real-time link queries and passing context to machine learning models.
The earlier a fraud pattern or network is identified, the faster the activity can be blocked. As a result, losses and fines are minimized.
Keyword clustering is important to understand search behavior, identify profitable keywords, and group keywords into logical clusters. K-means clustering is an unsupervised learning technique that partitions keywords into 'k' clusters based on distance from cluster centers. Text is converted to numeric data using document term matrices before clustering. Keyword augmentation standardizes terms, removes stopwords and punctuation. Choosing the number 'k' of clusters can be done manually by guessing or using the elbow method to find a flattening point in the cost plot. Clusters are then named and tagged based on their theme.
There are many examples of text-based documents (all in 'electronic' format): e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more. There is not enough time or patience to read them all, and some (e.g. DNA sequences) are hard to comprehend. Can we extract the most vital kernels of information? We wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining it fully first.
Elasticsearch is a distributed, scalable, and highly available search and analytics engine. The key points discussed in the document include proper cluster configuration, data mappings, and monitoring. The document outlines Elasticsearch's architecture including shards, replicas, and cluster topology. It also discusses modeling data through mappings, analysis, and handling relationships. Monitoring is important and can be done through Elasticsearch plugins, JVM tools, and the stats API.
SQL For Programmers -- Boston Big Data Techcon April 27th (Dave Stokes)
SQL For Programmers is an introduction to SQL concepts, a discussion of when SQL is a better choice, and a look at the future of databases. Presented April 27th, 2015, at Big Data Techcon Boston.
Slides for an Arithmer Seminar given by Dr. Daisuke Sato (Arithmer) at Arithmer Inc.
The topic is "explainable AI".
The "Arithmer Seminar" is held weekly; professionals from within and outside our company give lectures on their respective areas of expertise.
The slides were made by a lecturer from outside our company and are shared here with his/her permission.
Arithmer Inc. is a mathematics company that originated in the University of Tokyo Graduate School of Mathematical Sciences. We apply modern mathematics to introduce new, advanced AI systems into solutions across a wide range of fields. Our job is to think about how to use AI skillfully to make work more efficient and to produce results that are useful to people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems provides solutions to tough, complex issues. At Arithmer, we believe it is our job to realize the potential of AI by improving work efficiency and producing more useful results for society.
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine (lucenerevolution)
This session will present a detailed tear-down and walk-through of a working soup-to-nuts recommendation engine that uses observations of multiple kinds of behavior to do combined recommendation and cross recommendation. The system is built using Mahout to do off-line analysis and Solr to provide real-time recommendations. The presentation will also include enough theory to provide useful working intuitions for those desiring to adapt this design.
The entire system including a data generator, off-line analysis scripts, Solr configurations and sample web pages will be made available on github for attendees to modify as they like.
This document discusses various techniques for data preprocessing, including data cleaning, integration and transformation, reduction, and discretization. It provides details on techniques for handling missing data, noisy data, and data integration issues. It also describes methods for data transformation such as normalization, aggregation, and attribute construction. Finally, it outlines various data reduction techniques including cube aggregation, attribute selection, dimensionality reduction, and numerosity reduction.
Traditional approaches in anti-money laundering involve simple matching algorithms and a lot of human review. However, in recent years this approach has proven to not scale well with the ever increasingly strict regulatory environment. We at Bayard Rock have had much success at applying fancier approaches, including some machine learning, to this problem. In this talk I walk you through the general problem domain and talk about some of the algorithms we use. I’ll also dip into why and how we leverage typed functional programming for rapid iteration with a small team in order to out-innovate our competitors.
Bayard Rock, LLC, is a private research and software development company headquartered in the Empire State Building. It is a leader in the field of research and development of tools for improving the state of the art in anti-money laundering and fraud detection. As you might imagine, these tools rely heavily on mathematics and graph algorithms. In this talk, Richard Minerich will discuss the research activities of Bayard Rock and its approaches to building tools to find the "bad guys". Richard Minerich is Bayard Rock's Director of Research and Development. Rick has expertise in F#, C#, C, C++, C++/CLI, .NET (1.1, 2.0, 3.0, 3.5, 4.0, and 4.5), Object Oriented Design, Functional Design, Entity Resolution, Machine Learning, Concurrency, and Image Processing. He is interested in working on algorithmically and mathematically complex projects and remains open to exploring new ideas.
Rick holds 2 patents. The first, co-invented with a colleague, is titled "Method of Image Analysis Using Sparse Hough Transform." The other, held independently, is titled "Method for Document to Template Alignment."
Combining machine learning and search through learning to rank (Jettro Coenradie)
With advanced tools available for search like Solr and Elasticsearch, companies are embedding search in almost all their products and websites. Search is becoming mainstream. Therefore we can focus on teaching the search engine tricks to return more relevant results. One new trick is called "learning to rank". During the presentation, you'll learn what Learning To Rank is, when to apply it and of course, you'll get an example to show how it works using Elasticsearch and a learning to rank plugin. After this presentation, you have learned to combine machine learning models and search.
This document discusses principles for designing systems and databases for high performance. It emphasizes that:
1. Performance requirements should be specified upfront and the architecture should support meeting those requirements.
2. The database design impacts performance, so logical and physical designs like normalization, indexing, partitioning, and caching should be considered.
3. Application architecture also affects performance through techniques like minimizing network traffic and SQL parsing overhead.
ChatGPT
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves applying various techniques and methods to extract insights from data sets, often with the goal of uncovering patterns, trends, relationships, or making predictions.
Here's an overview of the key steps and techniques involved in data analysis:
Data Collection: The first step in data analysis is gathering relevant data from various sources. This can include structured data from databases, spreadsheets, or surveys, as well as unstructured data such as text documents, social media posts, or sensor readings.
Data Cleaning and Preprocessing: Once the data is collected, it often needs to be cleaned and preprocessed to ensure its quality and suitability for analysis. This involves handling missing values, removing duplicates, addressing inconsistencies, and transforming data into a suitable format for analysis.
Exploratory Data Analysis (EDA): EDA involves examining and understanding the data through summary statistics, visualizations, and statistical techniques. It helps identify patterns, distributions, outliers, and potential relationships between variables. EDA also helps in formulating hypotheses and guiding further analysis.
Data Modeling and Statistical Analysis: In this step, various statistical techniques and models are applied to the data to gain deeper insights. This can include descriptive statistics, inferential statistics, hypothesis testing, regression analysis, time series analysis, clustering, classification, and more. The choice of techniques depends on the nature of the data and the research questions being addressed.
Data Visualization: Data visualization plays a crucial role in data analysis. It involves creating meaningful and visually appealing representations of data through charts, graphs, plots, and interactive dashboards. Visualizations help in communicating insights effectively and spotting trends or patterns that may be difficult to identify in raw data.
Interpretation and Conclusion: Once the analysis is performed, the findings need to be interpreted in the context of the problem or research objectives. Conclusions are drawn based on the results, and recommendations or insights are provided to stakeholders or decision-makers.
Reporting and Communication: The final step is to present the results and findings of the data analysis in a clear and concise manner. This can be in the form of reports, presentations, or interactive visualizations. Effective communication of the analysis results is crucial for stakeholders to understand and make informed decisions based on the insights gained.
Data analysis is widely used in various fields, including business, finance, marketing, healthcare, social sciences, and more. It plays a crucial role in extracting value from data, supporting evidence-based decision-making, and driving actionable insights.
From usability to performance, and analytics to architecture: as report developers, the user experience (UX) design of your data model is quickly becoming more important than the pretty pictures that sit on top of it. This session concentrates on the design decisions needed to increase the usage of your reports.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Learn SQL from Basic Queries to Advanced Queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
State of Artificial Intelligence Report 2023 (kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
3. INTRODUCTION & PROBLEM STATEMENT
Craigslist was founded in 1995 and has become one of the most powerful sources for classified advertisements found on the web today.
Source: https://denver.craigslist.org/
THE PROBLEM
Oftentimes products are misclassified, or the category itself is too broad to be useful.
PRIMARY OBJECTIVE
The goal is to explore the "For Sale/Free" category and leverage unsupervised learning to cluster and categorize the products listed.
4. DATA ACQUISITION & WRANGLING
Data was scraped from the Denver, Boulder, Colorado Springs, and Fort Collins (Greater Denver Metro) Craigslist sites via the Python package "Scrapy".
SCRAPING STEPS
1) Initiate a "Scrapy" spider/crawler to acquire data based on HTML tags.
2) Grab the title for each listing on the page.
3) Parse the pages until all listings have been scraped.
FINAL DATASET: 3,927 listings
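A minimal sketch of such a spider. The start URL and CSS selectors are illustrative assumptions (Craigslist's markup changes over time), not necessarily the exact ones used in the project:

```python
import scrapy

class FreeStuffSpider(scrapy.Spider):
    """Crawl a Craigslist 'free stuff' search and yield listing titles."""
    name = "free_stuff"
    # Illustrative start URL for the Denver free section.
    start_urls = ["https://denver.craigslist.org/search/zip"]

    def parse(self, response):
        # Grab the title for each listing on the page
        # (the CSS class is an assumption about the page markup).
        for title in response.css("a.result-title::text").getall():
            yield {"title": title}
        # Follow pagination until all listings have been scraped.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```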
DATA WRANGLING
• Duplicate Records – 133 records (3% of the acquired records) were dropped because they contained a duplicate title AND timestamp. New timestamps are created when listings are "reposted".
• Non-letter Removal – numbers, punctuation, emojis, etc. were removed in order to keep only letters.
• Lowercase Conversion – all letters were converted to lowercase for comparison.
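These wrangling steps can be sketched with pandas; a minimal illustration, assuming the scraped listings sit in a DataFrame with hypothetical "title" and "timestamp" columns:

```python
import pandas as pd

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps: dedupe, remove non-letters, lowercase."""
    # Duplicate Records: drop rows sharing both title AND timestamp.
    df = df.drop_duplicates(subset=["title", "timestamp"])
    # Non-letter Removal: strip numbers, punctuation, emojis, etc.
    df["title"] = df["title"].str.replace(r"[^a-zA-Z\s]", " ", regex=True)
    # Lowercase Conversion for consistent comparison.
    df["title"] = df["title"].str.lower().str.strip()
    return df
```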
5. EXPLORATORY DATA ANALYSIS
The dataset acquired was highly skewed towards infrequently found words.
DATASET CHALLENGES
• 2,585 unique words were found in 3,927 unique listings.
• The average word is found 4.51 times, with a min count of 1 and a max count of 274.
• 57% of the words are found only one time throughout the many listings.
• 82% of the words are found fewer than 5 times.
FREQUENCY VS UNIQUE WORD COUNT
[Charts: frequency of all words (left) alongside the frequency of words found fewer than 5 times (right).]
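A small sketch of how these frequency statistics can be computed, assuming `titles` is the hypothetical list of cleaned listing titles from the wrangling step:

```python
from collections import Counter

# Count occurrences of every word across all listing titles.
counts = Counter(word for title in titles for word in title.split())

unique = len(counts)                     # e.g. 2,585 unique words
average = sum(counts.values()) / unique  # e.g. 4.51 occurrences per word
once = sum(1 for c in counts.values() if c == 1) / unique
rare = sum(1 for c in counts.values() if c < 5) / unique
print(f"{unique} unique words, average count {average:.2f}")
print(f"found once: {once:.0%}, found fewer than 5 times: {rare:.0%}")
```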
6. FURTHER ANALYSIS
Analyzing the most (and least) frequent words begins to show some of the potential categories.
MOST FREQUENT VS INFREQUENT WORDS
COUNT   TOP 10 WORDS     COUNT   BOTTOM 10 WORDS
274     wood             1       ability
267     couch            1       newspaper
160     tv               1       newer
143     boxes            1       neutral
132     chair            1       network
125     dirt             1       net
121     pallets          1       nepro
121     curb             1       needing
115     moving           1       necklaces
68      firewood         1       mtb
Charts above: the most frequent word found in the listings was "wood" at 274 (left chart). There are many words with a count of 1 (right chart).
We can already start to see some of the potential categories based on word frequency alone (e.g. "wood" and "firewood" are amongst the top ten words found).
7. MACHINE LEARNING
The primary goal of the machine learning (ML) section is to identify the best algorithm to properly classify the "free stuff" listings on Craigslist.
MACHINE LEARNING APPROACH
1) Text analysis was performed with the term frequency-inverse document frequency (tf-idf) algorithm, and the vectorizers were then transformed to the tf-idf matrix.
2) Parameter tuning included setting min/max term frequency, "N-grams", stop words, and logarithmic data transformations.
3) Perform clustering with the "K-means" algorithm.
4) Understand model results and identify the best model to name the clusters.
FOCUS ON UNSUPERVISED LEARNING
This is an unsupervised ML exercise, focused on testing and understanding key parameters as they relate to the exploratory data analysis findings.
8. MACHINE LEARNING – TEXT ANALYSIS
Text analysis was performed with the term frequency-inverse document frequency (tf-idf) algorithm, and the vectorizers were then transformed to the tf-idf matrix.
STEPS TO CREATE THE TF-IDF MATRIX
1. Count word (term) occurrences by document.
2. Transform into a document-term matrix.
3. Apply the term frequency-inverse document frequency weighting.
4. Perform sensitivity analysis on the frequency weighting as well as some of the other key parameters.
[Visual example: a tf-idf matrix.]
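A minimal scikit-learn sketch of steps 1-3, again assuming the hypothetical `titles` list; CountVectorizer handles the counting and the document-term matrix, and TfidfTransformer applies the weighting:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Steps 1-2: count term occurrences per listing and arrange them in a
# document-term matrix (rows = listings, columns = terms).
doc_term_matrix = CountVectorizer().fit_transform(titles)

# Step 3: apply tf-idf weighting to the document-term matrix.
tfidf_matrix = TfidfTransformer().fit_transform(doc_term_matrix)
print(tfidf_matrix.shape)  # (number of listings, number of terms)
```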
9. MACHINE LEARNING – PARAMETER TUNING
Sensitivity analysis was performed on key model parameters to determine their impact.
KEY MODEL PARAMETERS
• Min/Max Frequency – the range of listing frequencies a given term may have to be included in the tf-idf matrix. E.g. if a term appears in greater than X% of the listings, it probably carries little meaning.
• N-grams – accounts for term relativity. As an example, setting this value to "2" means considering both unigrams and bigrams, e.g. "leather" and "leather couch", both of which are important here.
• Stop Words – some words are too common (e.g. "the", "and") to be useful for analysis and need to be removed. The only customization made for this model was to add the word "free" to the stop-words list, as it was found in nearly all listings.
• Logarithmic Transformation – setting this value to "true" applies a log10 transformation of the data, which in this case was sorely needed given that the dataset was skewed towards infrequent words.
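A sketch of how these parameters map onto scikit-learn's TfidfVectorizer. The min/max frequency thresholds are illustrative guesses, and sublinear_tf applies a natural-log (1 + ln tf) scaling rather than the log10 described above, so this approximates rather than reproduces the tuning:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Customize the stop-word list: "free" appears in nearly all listings.
stop_words = list(ENGLISH_STOP_WORDS) + ["free"]

vectorizer = TfidfVectorizer(
    min_df=2,            # illustrative: ignore terms in fewer than 2 listings
    max_df=0.5,          # illustrative: ignore terms in over 50% of listings
    ngram_range=(1, 2),  # consider unigrams and bigrams ("leather couch")
    stop_words=stop_words,
    sublinear_tf=True,   # logarithmic term-frequency scaling
)
tfidf_matrix = vectorizer.fit_transform(titles)
```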
10. MACHINE LEARNING – K-MEANS CLUSTERING
The clustering algorithm of choice for this model was "K-means", which required a pre-determined number of clusters for initialization.
K Value   Sum-of-Squares Score   Silhouette Score
2         -2,273                 0.134
3         -2,201                 0.145
4         -2,128                 0.163
5         -2,081                 0.181
6         -1,981                 0.202
7         -1,994                 0.191
8         -1,893                 0.223
9         -1,861                 0.233
COMPARING VISUAL RESULTS OF K CLUSTERS
[Charts: cluster scatter plots at K = 6 (left) and K = 3 (right).] At k=6 (left), it is interesting to see the supposed cluster on the right-hand side of the chart with multiple colors of dots next to each other. At k=3 (right), the clusters visually appear more relevant; however, yellow and green dots are still very close together on the right-hand side of the page. Looking back at the scores for k=3, the SS is 0.145, which indicates a weak structure.
ITERATING OVER POTENTIAL K VALUES
The silhouette score (SS) ranges from -1 (a poor clustering) to +1 (a very dense clustering), with 0 denoting the situation where clusters overlap. SS scores ranged from 0.134 to 0.233 depending on the K value.
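A minimal sketch of the K sweep, assuming the `tfidf_matrix` from the sketches above; KMeans.score() returns the negated within-cluster sum of squares, matching the negative values in the table, and silhouette_score summarizes cluster separation:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(tfidf_matrix)
    # Negated within-cluster sum of squares (closer to 0 means tighter clusters).
    sum_of_squares = km.score(tfidf_matrix)
    silhouette = silhouette_score(tfidf_matrix, km.labels_)
    print(f"k={k}  sum-of-squares={sum_of_squares:,.0f}  silhouette={silhouette:.3f}")
```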
11. MACHINE LEARNING – MODEL RESULTS
Performing sensitivity analysis on key model parameters produced a model with k=3 clusters and the following results.
Cluster 1         Cluster 2           Cluster 3
"Moving Items"    "Scrap Materials"   "Furniture"
wood              firewood            couch
tv                wood                leather
pallets           pallets             leather couch
chair             today               loveseat
boxes             come                recliner
desk              scrap               chair
dirt              yard                reclining
stuff             small               sofa
table             mulch               brown
moving            pallet              bed
NAMING THE CLUSTERS
A few key observations:
• As expected, some of the words are present across clusters. This was to be expected given the low silhouette scores and the visual of the many different-colored dots close together.
• Only one bigram is present, meaning that the non-unigram words were not as important as initially thought.
• An attempt at naming the clusters might be:
  • Cluster 1 would be named "Moving Items".
  • Cluster 2 would be named "Scrap Materials".
  • Cluster 3 would be named "Furniture".
[Chart above: top 10 terms in each cluster of the model results at k=3.]
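A sketch of how the per-cluster top terms can be read off a fitted model, assuming the `vectorizer` and a k=3 KMeans fit `km` from the sketches above; the largest weights in each centroid mark that cluster's most characteristic terms:

```python
terms = vectorizer.get_feature_names_out()
# Sort each centroid's term weights in descending order; the heaviest
# terms are the most characteristic of that cluster.
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(km.n_clusters):
    top_ten = [terms[j] for j in order[i, :10]]
    print(f"Cluster {i + 1}: {', '.join(top_ten)}")
```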
12. RESULTS SUMMARY
KEY FINDINGS
• The data set acquired for the Greater Denver Metro "Free Stuff" analysis was heavily skewed by infrequent terms. Even a log10 transformation was not able to properly redistribute or normalize the data.
• This does not mean it is "bad data", only that the majority of the postings in the "free stuff" section represent very few terms.
• With a silhouette score of 0.14 on the training set and 0.15 on the test set, the K-means clustering model provided weak (at best) structure for the designated clusters.
• In this analysis of the "Greater Denver Metro" Craigslist sites, the abundance of items in the Craigslist "free stuff" section was classified into three categories:
  • Category 1 is "Moving Items".
  • Category 2 is "Scrap Materials".
  • Category 3 is "Furniture".
13. RECOMMENDATIONS & FUTURE IMPROVEMENTS
Key results show that the Craigslist team could implement a clustering model to help users filter and find products.
RECOMMENDATIONS & FUTURE IMPROVEMENTS
• Craigslist site administrators should create and enable "tags" or "categories" to assist in filtering the "free stuff" section.
• Perform unsupervised learning on the broader, national data set of all Craigslist sites for a better understanding of what is offered for free across the country.
• A similar analysis could be done on miscategorized or mislabeled listings, where the goal would be to find the outlying data points and understand their value.