QBIC is an experimental project that aims to speed up database queries using machine learning. It predicts data dependencies and pre-generates optimized query results ("pregens") so that data can be returned without sending the full query to the database. QBIC represents queries and their relationships as a lattice and applies algorithms such as the 0-1 knapsack problem to choose the optimal set of pregens to maintain, based on factors like space usage and query popularity. It also aims to predict metrics and cardinalities for queries, enabling techniques like "sharpening" that return approximated results without querying the database at all.
2. QBIC
• Cubic, a.k.a. Query Boost Intelligent Compiler
• Experimental project in Labs
• Recall math
• Find data dependencies
• Use machine learning
• ...to speed up queries to the database
3. Rationale
• Most BI tools have nothing to do with the letter "I"
• UI -> buttons -> query -> response -> chart
• A query could run forever
• There are ways to speed things up (partitions,
indexes, cache, cubes)
• ...
• Is there a way to get data without sending a query?
6. Set
• Not related to java.util.Set
• A set of elements, for example {apple, tuesday, 42}
• No duplicates: {1, 1, 2} is the same as {1, 2}
• No order: {1, 2} is the same as {2, 1}
• A set can be an element of another set: {1, 2, {3, 4}}
7. Set of all subsets
• Powerset
• A is a subset of B (A ⊆ B)
• Every element of A is in B
• {1, 2} is a subset of {1, 2, 3}
• Powerset of {1, 2, 3}:
{{}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
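The powerset of a small set can be generated in a few lines; here is a minimal Python sketch using the standard itertools recipe (an illustration, not code from the deck):

```python
from itertools import chain, combinations

def powerset(elements):
    """Return all subsets of the given collection, from {} up to the full set."""
    items = list(elements)
    return [set(combo) for combo in
            chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))]

# The powerset of a 3-element set has 2^3 = 8 subsets.
subsets = powerset({1, 2, 3})
```

A set of n elements always yields 2^n subsets, which is exactly why labeling every lattice node gets expensive for wide tables.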
8. Ordered set
• A set with a defined order relation (≤)
• ≤ is a binary relation (a ≤ b)
• The set {1, 9, 42} is ordered, with order relation
"less"
• A set of emoji is ordered, with order "more
funny emoji"
• ≤ holds for every pair of elements!
10. Partially Ordered Set
• Something is ordered, something is not
• Everything that can be ordered is ordered
• Some pairs of elements may be incomparable
• {1, 4, 7, Tuesday, 42}
11. Lattice
• A partially ordered set in which every pair of
elements has a least upper bound (supremum)
and a greatest lower bound (infimum)
• So even if elements A and B are not comparable,
there is an element C which is "greater" (or
"less") than both A and B
• Hard?
12. Values of a Senior Engineer
(sarcasm)
• House, car
• Build all possible combinations of those values
(powerset):
{{no house, no car},
{house, no car},
{no house, car},
{house, car}}
20. Typical Query
select city, gender, count(*) -- groups and metrics
from table -- table
where income > 1000 and state = 'P' -- filters
group by city, gender -- groups again
having count(*) > 5 -- more filters
order by count(*) desc -- sorting
limit 10 offset 0 -- limits
21. Typical Query: simplify [1]
select city, gender, count(*)
from table
group by city, gender

[1] Keep the attribute groups and only count(*) from the metrics; no filters, no sorting, no limits.
Later we will try to add them back.
22. Build lattice by groups
• Take all attribute columns from the table:
city, gender, state
• Build all possible group combinations (powerset):
{city, gender, state}
{city, gender}, {city, state}, {gender, state}
{city}, {gender}, {state}
{}
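The lattice above can be built mechanically: the nodes are the powerset of the attribute columns, ordered by the subset relation. A minimal sketch (illustrative, not QBIC's code):

```python
from itertools import chain, combinations

def group_lattice(columns):
    """Build all group-by combinations (the powerset of the attribute columns)
    together with the 'is a subset of' relation that orders the lattice."""
    cols = list(columns)
    nodes = [frozenset(c) for c in
             chain.from_iterable(combinations(cols, r) for r in range(len(cols) + 1))]
    # a <= b in the lattice when a's columns are a strict subset of b's:
    # a pregen for b can answer any query grouped by a.
    edges = [(a, b) for a in nodes for b in nodes if a < b]
    return nodes, edges

nodes, edges = group_lattice(["city", "gender", "state"])
```

The top node is the full column set, the bottom node is the empty group-by (the grand total).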
25. Pregen
• A pregen is a completed query result from the database,
saved somewhere (a.k.a. cache or view)
• Pros:
(1) getting data from the pregen is very fast
(2) getting data for all pregens below is fast too
• Cons:
(1) calculating (and maintaining) the pregen takes time
(2) storing the pregen takes disk space
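The "pregens below" point can be seen in a small sketch (hypothetical data, not from the deck): a pregen grouped by {city, gender} can answer a coarser {city} query by re-aggregating its own rows, without touching the database:

```python
from collections import defaultdict

# A pregen for "select city, gender, count(*) ... group by city, gender",
# stored as {(city, gender): count}.
pregen = {
    ("Oslo", "F"): 120, ("Oslo", "M"): 80,
    ("Bergen", "F"): 30, ("Bergen", "M"): 45,
}

def rollup(pregen, keep):
    """Answer a coarser group-by from a finer pregen by summing counts
    over the key positions we keep."""
    out = defaultdict(int)
    for key, count in pregen.items():
        out[tuple(key[i] for i in keep)] += count
    return dict(out)

# The coarser query "group by city" answered from the pregen: keep column 0.
by_city = rollup(pregen, keep=[0])
```

This works for count(*) and other additive metrics; non-additive metrics (e.g. distinct counts) would need more care.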
29. Light formalization
• There is a lattice
• If no pregen exists, user query time equals the
database query time, and no space for
pregens is needed
• If a pregen exists, user query time is very fast,
but the pregen occupies some space,
and one database query is needed to prepare
the pregen
Note: you can also calculate all pregens below
30. Pregen optimization flow
• Scan the table
- Read metadata
- Detect which columns can be groups
• Build the lattice by groups
• Fill every lattice node with props:
- pregen size (rows or MB)
- time to retrieve the data from the db
• Apply the optimization algorithm
33. Knapsack Problem
• 0-1 Knapsack Problem
• Decide which items to put into the
knapsack to maximize its
value without exceeding the
maximum available weight
• NP-hard
• Exact and approximate
solutions APD1
APD1
Appendix 1 - Review of algorithms for the Knapsack Problem
35. (Practical) Knapsack #1
• Dynamic programming with rounded values
• Power-of-10 (log) scale
• Examples:
0.2 is rounded to 1 (10^0)
1.2 is rounded to 1 (10^0)
8.6 is rounded to 10 (10^1)
123.6 is rounded to 100 (10^2)
...and so on
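The rounding above can be sketched as follows. The exact rounding policy is an assumption: rounding the base-10 logarithm to the nearest integer, clamped at 10^0 so that sub-1 weights stay usable as small DP integers (this matches the 0.2 → 1 example):

```java
// Round a weight to the nearest power of 10 on a log scale,
// clamping at 10^0 = 1 so DP can work with small integer weights.
public class Power10Rounding {
    public static double round(double value) {
        long exp = Math.round(Math.log10(value));
        if (exp < 0) exp = 0;   // clamp: everything below 1 becomes 1 (assumption)
        return Math.pow(10, exp);
    }
}
```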
36. (Practical) Knapsack #2
• Genetic algorithm with a greedy initial population
• The genetic result is never worse than the greedy one
• You can control the running time of the algorithm
• Tradeoff (time/accuracy)
• Open for modifications
38. Lattice: Weight
• Space occupied by the pregen (rows, or better,
megabytes)
• Time to prepare the pregen (time to query the db)
• Time to support the pregen (depends on update
frequency)
39. Lattice: Value
• If the pregen is taken, its value is 1, otherwise 0
• How many other pregens can be calculated from it
• Query popularity
40. Pregen Optimization
• At this point we can get the list of pregens
that are optimal to maintain
• If new properties appear in the system
(popularity, data updates) we can recalculate
a new optimal lattice
• Future: filters
41. Problem
• For large tables, labeling lattice nodes with
actual values is a very expensive process
• Basically, we would need to calculate all possible
group bys
• Since this is only an optimization, we don't need
exact values
• ...
• What if we can predict them?
45. Lattice Puzzle
• select city, count(*) -- 50
• select gender, count(*) -- 2
• ...
• How many rows will this
query return?
(magic: without sending the
query)
select city, gender, count(*)
48. Lattice Puzzle: Linear
Scaling
• From a sample you can get the group counts
CITY(30), GENDER(2), CITY/GENDER(56)
• Build the interval as
[max(city,gender)..max(city,gender)*min(city,gender)]
Interval = [30..60], L=30, R=60, V=56
• Calculate the ratio = (V - L) / (R - L) = 26 / 30 ≈ 0.87
• Apply the ratio to the real-data interval [50..100] to get the
estimate: 50 + 0.87 * (100 - 50) ≈ 93
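The scaling steps above can be sketched like this (class and parameter names are illustrative):

```java
// Linear-scaling estimate: map the combined group count's position
// inside the sample interval [L..R] onto the real-data interval.
public class LinearScaling {
    public static double estimate(long sampleA, long sampleB, long sampleCombined,
                                  long realA, long realB) {
        double l = Math.max(sampleA, sampleB);              // L
        double r = l * Math.min(sampleA, sampleB);          // R
        double ratio = (sampleCombined - l) / (r - l);      // position of V in [0..1]
        double realL = Math.max(realA, realB);
        double realR = realL * Math.min(realA, realB);
        return realL + ratio * (realR - realL);
    }
}
```

With the slide's numbers (sample: CITY=30, GENDER=2, combined=56; real: CITY=50, GENDER=2) this produces an estimate of about 93 rows.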
49. Lattice Puzzle: Linear
Scaling
• OK for small-cardinality use cases
• Bad for high-cardinality use cases, because the
estimate diverges very quickly
• Bad for small samples
• Bad for non-uniformly distributed samples
• Bad for non-linear functions
• Bad for other metrics
55. Anscombe Quartet
• Four different datasets
• Their simple statistical properties are nearly equal
(mean, variance, correlation, etc.)
• The linear regression line is the same
56. you can always trick an algorithm
if you know how it behaves
Be optimistic
58. Curve Fitting Flow
• Break the sample into input points (x, y) =
(SAMPLE_SIZE, GROUP_COUNT)
• Use the last point as the evaluation of your model
• Apply different curves
• The model that comes closest to the evaluation
point will be your model
• Use this model to predict the lattice values
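The flow above can be sketched with two candidate curve shapes, linear and logarithmic. The least-squares fit and the candidate set are illustrative choices, not QBIC's actual code:

```java
import java.util.function.DoubleUnaryOperator;

// Curve-fitting flow: fit each candidate shape f(x) on all points but
// the last, then pick the model closest to the held-out last point.
public class CurveFitting {
    // Fit y ≈ a * f(x) by least squares on all points except the last,
    // then return |prediction - actual| on the evaluation (last) point.
    static double holdoutError(double[] xs, double[] ys, DoubleUnaryOperator f) {
        double num = 0, den = 0;
        for (int i = 0; i < xs.length - 1; i++) {
            double fx = f.applyAsDouble(xs[i]);
            num += fx * ys[i];
            den += fx * fx;
        }
        double a = num / den;
        int last = xs.length - 1;
        return Math.abs(a * f.applyAsDouble(xs[last]) - ys[last]);
    }

    // Returns "linear" or "log", whichever fits the evaluation point better.
    public static String bestModel(double[] xs, double[] ys) {
        double linErr = holdoutError(xs, ys, x -> x);
        double logErr = holdoutError(xs, ys, Math::log);
        return linErr <= logErr ? "linear" : "log";
    }
}
```

The winning model is then evaluated at the full table size to predict the lattice value.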
59. Cardinality Levels
• SMALL (0..10)
Gender, Income Group, Status
• MEDIUM (10..5000)
City, County, Category
• HIGH (5000..)
UserID, SKU
60. Cardinality Levels
• SMALL x SMALL (log works well)
• SMALL x MEDIUM (log works well)
• MEDIUM x MEDIUM (log works well)
• Anything x HIGH (hard to predict)
• ...
• Maybe drop high-cardinality use cases?
68. Metric: COUNT(*)
• Additive
• Can easily be predicted from a sample because it
looks LINEAR
• Could be improved by better sampling techniques
• Could be improved by microqueries
• ...
• What about other metrics?
69. Metric: SUM
• Additive
• Linear as well
• Can be predicted the same way as count(*)
by using linear scale
• Accuracy can be improved by using numeric field
ranges
70. Metric: AVG
• Not additive
• But since avg = sum/count, it can be predicted
by combining the sum and count estimates from
linear scaling (not 100% accurate)
71. Metric: MIN/MAX
• Additive
• But not linear
• Do not apply linear scaling
• Min/max from the sample is not the worst idea, either
• If the first lattice level is calculated from the db,
the second can be predicted with a few tricks
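One such trick relies on MAX being additive: a coarser lattice node can be rolled up exactly from any finer node below it. A sketch under illustrative names (not QBIC code):

```java
import java.util.*;

// MAX is additive, so a coarser lattice node can be rolled up exactly
// from a finer one: max per city = max over the (city, gender) maxes.
public class MaxRollup {
    public static Map<String, Double> rollup(Map<List<String>, Double> fine) {
        Map<String, Double> coarse = new HashMap<>();
        for (Map.Entry<List<String>, Double> e : fine.entrySet()) {
            String city = e.getKey().get(0);          // drop the gender dimension
            coarse.merge(city, e.getValue(), Math::max);
        }
        return coarse;
    }
}
```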
74. The rest of the data can be filled in
through
• An extra query
• Additional data carried with the lattice
(along with the max value, save the whole row)
• Even without them, you have the structure and the values,
so you can start drawing the results while the query
is running
• Another form of sharpening
75. Metric: DISTINCT_COUNT
• Not additive
• Not linear
• We can get a rough estimate by using the min/max
scale,
or by applying curve fitting to a sample,
similar to what we did when predicting the
lattice
77. If we can predict the data
we can use it in
Sharpening™
78. (Better) Sharpening
• Instant
• No queries to the DB
• Not restricted to time fields only
• No requirement for index/partition support
• Can still be combined with microquerying!
79. Well, if you can return data
without sending the queries...
80. Offline Mode
• Explore the data without generating DB load
• Find the correct questions before asking the DB
• No need for a separate "RUN" button
• No need for cancellation support when you send the
wrong query
• You can (and should) always fall back to the
database
82. Data Dependencies
• Data is not just rows and columns
• There are hidden connections between cells,
rows, and columns
• We can identify them to make the software easier to use
• They are not 100% accurate, since there are
always exceptions, but with a clever approach...
83. DD1: High Cardinality
• High-cardinality fields have a huge number of distinct
values
• Define what "high" means for your application
• If the distinct count on a sample equals the sample size,
it is probably a high-cardinality field (some sort of id)
• High-cardinality fields have lower importance in group
bys and defaults than regular fields; you can also
drop them from the pregen and the lattice, since they
hold too much data.
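The sample-based heuristic above can be sketched as follows; the 0.95 threshold is an illustrative choice of what "almost equal to the sample size" means:

```java
import java.util.*;

// DD1 heuristic: if the distinct count on a sample is (almost) equal
// to the sample size, flag the field as high-cardinality.
public class HighCardinalityCheck {
    public static boolean isHighCardinality(List<String> sampleValues) {
        long distinct = new HashSet<>(sampleValues).size();
        // 0.95 threshold is an illustrative assumption, not from the talk
        return distinct >= 0.95 * sampleValues.size();
    }
}
```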
84. DD2: One-to-one fields
• One-to-one fields are pairs with related data
• STORE_ID and STORE_NAME
• Having one field, you can infer the other, and vice versa
• If two fields have the same distinct count on a sample,
they are probably one-to-one fields
• NAME has more importance than its ID counterpart for the UI
• You can also drop the ID part from the lattice and maintain
only NAME with a mapping.
85. DD3: Child-Parent fields
• Parent-child fields are similar to one-to-one fields,
but you can only go from child to parent, not vice versa
• CITY and COUNTRY (child-parent)
• CITY is more important than COUNTRY (having CITY you
can infer the COUNTRY)
• To find child-parent fields, explore the values in a
sample
• More optimizations for the lattice and for UI
recommendations
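The DD2 and DD3 checks can both be sketched as a functional-dependency test on a sample (illustrative code, not the QBIC implementation): field A determines field B if every A value maps to exactly one B value. Child-parent holds one way (CITY → COUNTRY); one-to-one fields determine each other in both directions.

```java
import java.util.*;

// Field A functionally determines field B on a sample if every
// A value maps to exactly one B value.
public class DependencyCheck {
    public static boolean determines(List<String> a, List<String> b) {
        Map<String, String> mapping = new HashMap<>();
        for (int i = 0; i < a.size(); i++) {
            // putIfAbsent returns the previously stored value, or null
            String prev = mapping.putIfAbsent(a.get(i), b.get(i));
            if (prev != null && !prev.equals(b.get(i))) return false;
        }
        return true;
    }
}
```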
86. DD4: Empty Columns
• Sounds easy, but currently empty columns are
treated the same way as the others
• We can easily drop them from the lattice to
reduce its size
• We can remove them from the UI
87. DD?: Define your own Data
Dependency
• Explain the dependency
• Give a rule to identify the data dependency
• Say where the knowledge can be applied
94. Bruteforce
• KSBruteforceSolver.java
• Label each object either 1 (take) or 0 (do not take)
• Build all possible configurations
• Select the best one
• Complexity O(2^n), where k is the number of columns
and n = 2^k is the lattice size
• For example, for k columns there are 2^(2^k)
possible combinations
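A sketch of the idea behind KSBruteforceSolver.java (not the actual source):

```java
// Try every 0/1 labeling of the items and keep the best feasible one.
// O(2^n) configurations for n items.
public class BruteforceKnapsack {
    public static int best(int[] weights, int[] values, int capacity) {
        int n = weights.length, best = 0;
        for (int mask = 0; mask < (1 << n); mask++) {
            int w = 0, v = 0;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) { w += weights[i]; v += values[i]; }
            }
            if (w <= capacity) best = Math.max(best, v);
        }
        return best;
    }
}
```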
95. Branch & Bound
• KSBoundBranchSolver.java
• Explore the search space in a depth-first
manner, computing an optimistic estimate for each
node to be explored; this way you can prune
nodes whose optimistic estimate is worse than the best
solution found so far
• In most cases faster than bruteforce, but the
worst-case complexity is still O(2^n)
96. Dynamic Programming
• KSDynamicProgrammingSolver.java
• Exploits the classic DP idea: solve the smaller
problem first, then reuse the results in the bigger
problem.
• For some cases extremely fast (pseudopolynomial
O(n*W) complexity)
• A lot of constraints (weights must be integers,
weight limits should be relatively small)
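A sketch of the idea behind KSDynamicProgrammingSolver.java (not the actual source):

```java
// dp[c] holds the best value achievable with capacity c.
// Pseudopolynomial O(n * W); requires integer weights.
public class DpKnapsack {
    public static int best(int[] weights, int[] values, int capacity) {
        int[] dp = new int[capacity + 1];
        for (int i = 0; i < weights.length; i++) {
            // iterate capacities in reverse so each item is used at most once (0-1)
            for (int c = capacity; c >= weights[i]; c--) {
                dp[c] = Math.max(dp[c], dp[c - weights[i]] + values[i]);
            }
        }
        return dp[capacity];
    }
}
```

The reverse inner loop is what makes this the 0-1 variant; a forward loop would allow taking each pregen more than once.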
98. Greedy Algorithm
• KSGreedySolver.java
• Order all items by "value" (not to be confused with the
knapsack value) and take them until you reach the weight
constraint
• Strategies: VALUABLE, LIGHTER, DENSITY (value/weight)
• Complexity O(n log n); could be even faster
• A practical approach when the constraints are large values
(!)
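A sketch of the DENSITY strategy behind KSGreedySolver.java (not the actual source):

```java
import java.util.*;

// Greedy DENSITY strategy: sort items by value/weight ratio and take
// them while they fit. O(n log n); no optimality guarantee.
public class GreedyKnapsack {
    public static int best(int[] weights, int[] values, int capacity) {
        Integer[] idx = new Integer[weights.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort indices by descending value/weight density
        Arrays.sort(idx, (a, b) -> Double.compare(
                (double) values[b] / weights[b], (double) values[a] / weights[a]));
        int total = 0, value = 0;
        for (int i : idx) {
            if (total + weights[i] <= capacity) {
                total += weights[i];
                value += values[i];
            }
        }
        return value;
    }
}
```

Swapping the comparator gives the VALUABLE (by value) or LIGHTER (by weight) strategies.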
99. Genetic Algorithm
• KSGeneticSolver.java
• A general algorithm that can be applied to almost any problem:
Init: generate a random population (a set of solutions)
Selection: choose the best solutions
Crossover: produce new solutions by breeding the previous
best solutions
Mutation (optional): randomly change the new-born
solutions
Generation step: if the new population is better than the
previous one, replace it and go back to selection