Brazilian Summer School in Machine Learning 2016
Day 2 - Lecture 4: Advanced Workflows: Feature Selection, Boosting, Gradient Descent, and Stacking
Lecturer: Dr. José Antonio Ortega - jao (BigML)
BSSML16 L8. REST API, Bindings, and Basic Workflows - BigML, Inc
Brazilian Summer School in Machine Learning 2016
Day 2 - Lecture 3: REST API, Bindings, and Basic Workflows
Lecturer: Dr. José Antonio Ortega - jao (BigML)
Logistic Regression is one of the most popular Machine Learning methods for solving classification problems. With Logistic Regressions in your Dashboard and in the BigML API, you will be able to easily create and download models to your environment for fast local predictions.
VSSML16 L7. REST API, Bindings, and Basic Workflows - BigML, Inc
Valencian Summer School in Machine Learning 2016
Day 2 VSSML16
Lecture 7
REST API, Bindings, and Basic Workflows
jao -- Jose A. Ortega (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
A developer's overview of the world of predictive APIs - Louis Dorard
Predictive APIs are making it easier to integrate Machine Learning in your apps and to add predictive features to them. Starting with some basics we'll see what the different types of APIs are and we'll give some examples of proprietary predictive APIs. We'll go over some ways of exposing your own predictive models as APIs served by 3rd party platforms, and open source frameworks for creating and serving your own APIs on your infrastructure of choice. We'll give some remarks on recent (and missing) tools to make it easier to use and compare all these APIs. Finally, we'll give some pointers to a Virtual Machine to help you get started with these technologies...
Slides from my talk at the Valencian Summer School on Machine Learning (#VSSML15)
Learn all you need to know about BigML's implementation of Latent Dirichlet Allocation (LDA), one of the most popular probabilistic methods for topic modeling. Topic Models, BigML's latest resource, helps you find relevant terms thematically related in your unstructured text data. With the BigML Topic Models in your Dashboard and in the BigML API, you will be able to discover the hidden topics in your text fields and use them as final output for information retrieval tasks, collaborative filtering, or for assessing document similarity, among others. You can also use the topics discovered as input features to train other models.
VSSML17 L5. Basic Data Transformations and Feature Engineering - BigML, Inc
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 5: Basic Data Transformations and Feature Engineering. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture Review: Summary Day 2 Sessions. By Mercè Martín Prats (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
Our Summer 2017 release presents Deepnets, a highly effective supervised learning method that solves classification and regression problems in a way that can match or exceed human performance, especially in domains where effective feature engineering is difficult. BigML Deepnets bring two unique parameter optimization options: Automatic Network Search and Structure Suggestion. These options avoid the difficult and time-consuming work of hand-tuning the algorithm and ensure the best network among all possible networks to solve your problem. This new resource is available from the BigML Dashboard, API, as well as from WhizzML for its automation. Deepnets are state-of-the-art in many important supervised learning applications.
Automating your own Machine Learning Projects - Workshop: Working with the Masters.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. Querying a Faunus graph is possible with a MapReduce-variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce-steps that are sequence optimized and then executed by Hadoop. Results are stored as transformations to the input graph (graph derivations) or computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats are supported which enable Faunus to load/store graphs in the distributed graph database Titan, various graph formats stored in HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it.
VSSML16 L5. Basic Data Transformations
Valencian Summer School in Machine Learning 2016
Day 2 VSSML16
Lecture 5
Basic Data Transformations
Poul Petersen (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
VSSML16 LR1. Summary Day 1
Valencian Summer School in Machine Learning 2016
Day 1
Summary Day 1
Mercè Martin (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
VSSML16 LR2. Summary Day 2
Valencian Summer School in Machine Learning 2016
Day 2 VSSML16
Summary Day 2
Mercè Martin (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
Machine Learning for Big Data Analytics: Scaling In with Containers while Sc... - Ian Lumb
Watch On Demand Anytime via http://www.univa.com/resources/webinar-machine-learning.php
Armed with nothing more than an Apache Spark toting laptop, you have all the trappings required to prototype the application of Machine Learning against your data-science needs. From programmability in Scala, Java or Python, to built-in support for Machine Learning via MLlib, Spark is an exceedingly effective enabler that allows you to rapidly produce results.
Of course, as soon as your prototyping proves successful, you'll want to scale out to embrace the volume, variety and velocity that characterizes today's Big Data demands... in production. Because Spark is as comfortable on an isolated laptop as it is in a distributed-computing environment, addressing Big Data requirements in production boils down to effectively and efficiently embracing containers and clusters for Big Data Analytics.
And this is where offerings from Univa shine - i.e., in making the transition from prototype to production completely seamless. For some use cases, it makes sense to scale-in Spark based applications within Docker containers via Univa Grid Engine Container Edition or Navops by Univa; whereas in others, Spark is interfaced (as a Mesos-compliant framework) with Univa Universal Resource Broker, to permit scaling out on a cluster. In both scenarios, your production Spark applications are scheduled alongside other classes of workload - without a need for dedicated resources.
Agenda:
• Overview of Apache Spark as a platform for Deep Learning - from Python-based Jupyter Notebooks to Spark's Machine Learning library MLlib
• Overview of prototyping Machine Learning via Apache Spark on a laptop - without and within Docker containers
• Introductions to Univa Grid Engine Container Edition and Univa Universal Resource Broker plus Navops by Univa
• Overview of production Big Data Analytics platforms for Machine Learning
• Docker-containerized Apache Spark and Univa Grid Engine Container Edition
• Docker-containerized Apache Spark and Navops by Univa
• Apache Spark plus Univa Universal Resource Broker
• Introducing support for GPUs without and within Docker containers
• Use case example - using Machine Learning to classify data from Twitter without and within Docker containers
• Summary and next steps
WhizzML is a domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. WhizzML offers out-of-the-box scalability, abstracts away the complexity of the underlying infrastructure, and helps analysts, developers, and scientists reduce the burden of repetitive and time-consuming analytics tasks.
VSSML17 L7. REST API, Bindings, and Basic Workflows - BigML, Inc
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 7: REST API, Bindings, and Basic Workflows. By jao - Jose A. Ortega - (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
▪ Developed a recursive-descent parser to generate an intermediate representation for subsequent optimizations in Java
▪ Implemented common subexpression elimination and copy propagation on the control flow graph
▪ Deployed a code generator for the source language that yields optimized native programs
An introduction to Augustus, an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). Augustus is able to produce and consume models with 10,000s of segments. Developed by Open Data Group, written in Python, PMML 4.0 compliant and freely available.
Software engineering research often requires analyzing multiple revisions of several software projects, be it to make and test predictions or to observe and identify patterns in how software evolves. However, code analysis tools are almost exclusively designed for the analysis of one specific version of the code, and the time and resource requirements grow linearly with each additional revision to be analyzed. Thus, code studies often observe a relatively small number of revisions and projects. Furthermore, each programming ecosystem provides dedicated tools, hence researchers typically only analyze code of one language, even when researching topics that should generalize to other ecosystems. To alleviate these issues, frameworks and models have been developed to combine analysis tools or automate the analysis of multiple revisions, but little research has gone into actually removing redundancies in multi-revision, multi-language code analysis. We present a novel end-to-end approach that systematically avoids redundancies every step of the way: when reading sources from version control, during parsing, in the internal code representation, and during the actual analysis. We evaluate our open-source implementation, LISA, on the full history of 300 projects, written in 3 different programming languages, computing basic code metrics for over 1.1 million program revisions. When analyzing many revisions, LISA requires less than a second on average to compute basic code metrics for all files in a single revision, even for projects consisting of millions of lines of code.
This tutorial introduces the basic idea of machine learning with a very simple example. Machine learning teaches machines (and me too) to carry out tasks and learn concepts by themselves. It is that simple, so here is an overview:
http://www.softwareschule.ch/examples/machinelearning.jpg
Headache from using mathematical software - PVS-Studio
It so happened that over a certain period of time I was involved in several seemingly unrelated discussions on the Internet: free alternatives to Matlab for universities and students, and finding errors in algorithms with the help of static code analysis. What brought all these discussions together was the terrible quality of the code of modern programs, in particular software written for mathematicians and scientists. This immediately raises the question of how far we can trust the calculations and studies performed with such programs. We will try to reflect on this topic and look for the errors.
Quite often, software developers have absolutely no idea of the cost of an error. It is very important that errors be found at the earliest possible stage.
Machine Learning ("statistical learning" or "predictive analytics") is moving out of research labs and specialist circles and is increasingly being used inside companies, and not only startups. Witness the rise of the open-source toolkit Scikit-learn, which quickly spread internationally as one of the new standards for this new way of building software, as well as the availability since July 2014 of Azure ML, the Machine Learning service of Microsoft Azure. In this session we offer an overview of developing statistical learning software in Python with Scikit-learn. We invite one of the main contributors to this toolkit, Olivier Grisel, research engineer in the Inria PARIETAL team at Saclay, to present an overview in an interactive session based on numerous examples and demos. To learn more: http://scikit-learn.org https://team.inria.fr/parietal/ https://twitter.com/ogrisel
Similar to BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent, and Stacking
Digital Transformation and Process Optimization in Manufacturing - BigML, Inc
Keyanoush Razavidinani, Digital Services Consultant at A1 Digital, a BigML Partner, highlights why it is important to identify and reduce human bottlenecks in order to optimize processes and let you focus on important activities. Additionally, Guillem Vidal, Machine Learning Engineer at BigML, completes the session by showcasing how Machine Learning is put to use in the manufacturing industry, with a use case on detecting factory failures.
The Road to Production: Automating your Anomaly Detectors - by jao (Jose A. Ortega), Co-Founder and Chief Technology Officer at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - ML for AML Compliance - BigML, Inc
Machine Learning for Anti Money Laundering Compliance, by Kevin Nagel, Consultant and Data Scientist at INFORM.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Multi Perspective Anomalies - BigML, Inc
Multi Perspective Anomalies, by Jan W Veldsink, Master in the art of AI at Nyenrode, Rabobank, and Grio.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - My First Anomaly Detector - BigML, Inc
My First Anomaly Detector: Practical Workshop, by Mercè Martín, VP of Bindings and Applications at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - History and Developments in ML - BigML, Inc
History and Present Developments in Machine Learning, by Tom Dietterich, Emeritus Professor of computer science at Oregon State University and Chief Scientist at BigML.
*Machine Learning School in The Netherlands 2022.
Introduction to End-to-End Machine Learning: Classification and Regression - Mercè Martín, VP of Bindings and Applications at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - A Data-Driven Company - BigML, Inc
A Data-Driven Company: 21 Lessons for Large Organizations to Create Value from AI, by Richard Benjamins, Chief AI and Data Strategist at Telefónica.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - ML in the Legal Sector - BigML, Inc
How Machine Learning Transforms and Automates Legal Services, by Arnoud Engelfriet, Co-Founder at Lynn Legal.
*Machine Learning School in The Netherlands 2022.
Machine Learning for Public Safety: Reducing Violence and Discrimination in Stadiums.
Speakers: Ramon van Ingen, Co-Founder at Siip, Entrepreneur, Researcher, and Pablo González, Machine Learning Engineer at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants - BigML, Inc
Process Optimization in Manufacturing Plants, by Keyanoush Razavidinani, Digital Business Consultant at A1 Digital.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Anomaly Detection at Scale - BigML, Inc
Lessons Learned Applying Anomaly Detection at Scale, by Álvaro Clemente, Machine Learning Engineer at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Citizen Development in AI - BigML, Inc
Citizen Development in AI, by Jan W Veldsink, Master in the art of AI at Nyenrode, Rabobank, and Grio.
*Machine Learning School in The Netherlands 2022.
This new feature is a continuation of and improvement on our previous Image Processing release. Now, Object Detection lets you go a step further with your image data and allows you to locate objects and annotate regions in your images. Once your image regions are defined, you can train and evaluate Object Detection models, make predictions with them, and automate end-to-end Machine Learning workflows on a single platform. To make that possible, BigML enables Object Detection by introducing the regions optype.
As with any other BigML feature, Object Detection is available from the BigML Dashboard, API, and WhizzML for automation. Object Detection is extremely helpful to tackle a wide range of computer vision use cases such as medical image analysis, quality control in manufacturing, license plate recognition in transportation, people detection in security surveillance, among many others.
This new release brings Image Processing to the BigML platform, a feature that enhances our offering to solve image data-driven business problems with remarkable ease of use. Because BigML treats images as any other data type, this unique implementation allows you to easily use image data alongside text, categorical, numeric, date-time, and items data types as input to create any Machine Learning model available in our platform, both supervised and unsupervised.
Now, it is easier than ever to solve a wide variety of computer vision and image classification use cases in a single platform: label your image data, train and evaluate your models, make predictions, and automate your end-to-end Machine Learning workflows. As with any other BigML feature, Image Processing is available from the BigML Dashboard, API, and WhizzML, and it can be applied to solve use cases such as medical image analysis, visual product search, security surveillance, and vehicle damage detection, among others.
Machine Learning in Retail: Know Your Customers' Customer. See Your Future - BigML, Inc
This session presents a quite common situation for those working in food and beverage retail (FnB) and highlights interesting insights for reducing waste.
Speaker: Stephen Kinns, CEO and Co-Founder at catsAi.
*ML in Retail 2021: Webinar.
Machine Learning in Retail: ML in the Retail Sector - BigML, Inc
This is an introductory session about the role that Machine Learning is playing in the retail sector and how it is being deployed across the different areas of this industry.
Speaker: Atakan Cetinsoy, VP of Predictive Applications at BigML.
*ML in Retail 2021: Webinar.
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot - BigML, Inc
This presentation analyzes the role that Machine Learning plays in legal automation with a real-world Machine Learning application.
Speaker: Arnoud Engelfriet, Co-Founder at Lynn Legal.
*ML in GRC 2021: Virtual Conference.
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac... - BigML, Inc
This is a real-life Machine Learning use case about integrated risk.
Speakers: Thomas Rengersen, Product Owner of the Governance Risk and Compliance Tool for Rabobank, and Thomas Alderse Baas, Co-Founder and Director of The Bowmen Group.
*ML in GRC 2021: Virtual Conference.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Analysis insight about a Flyball dog competition team's performance - roli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, the number of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
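To make the first technique above concrete (skipping computation on vertices that have already converged), here is a minimal Python sketch. It is not the STICD algorithm itself, just a plain power-iteration PageRank, assuming a graph with no dangling nodes, where a vertex is dropped from further updates once its rank change falls below a tolerance:

```python
def pagerank_skip(out_edges, d=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank that stops updating vertices whose rank
    has already converged (assumes every vertex has at least one out-edge)."""
    nodes = list(out_edges)
    n = len(nodes)
    in_edges = {v: [] for v in nodes}          # reverse adjacency
    for u, outs in out_edges.items():
        for v in outs:
            in_edges[v].append(u)
    rank = {v: 1.0 / n for v in nodes}
    converged = set()
    for _ in range(max_iter):
        changed = False
        new_rank = dict(rank)
        for v in nodes:
            if v in converged:
                continue                        # skip stabilized vertices
            r = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
            new_rank[v] = (1 - d) / n + d * r
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
            else:
                changed = True
        rank = new_rank
        if not changed:
            break
    return rank

# A 3-cycle is perfectly symmetric, so every vertex ends up with rank 1/3.
ranks = pagerank_skip({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Note the trade-off the abstract alludes to: a skipped vertex keeps its frozen rank even if its in-neighbors later drift slightly, so this saves iteration time at a small accuracy cost.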
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list based graph representation commonly used by graph algorithms like PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
2. Outline
1 Server-side workflows: WhizzML
2 Basic Workflow: Model or ensemble?
3 Case study: Using Flatline in Whizzml
4 Advanced Workflows
5 Case Study: Stacked Generalization in WhizzML
#BSSML16 Automating Machine Learning December 2016 2 / 46
4. Client-side Machine Learning Automation
Problems of client-side solutions
Complexity: lots of details outside the problem domain
Reuse: no inter-language compatibility
Scalability: client-side workflows are hard to optimize
Extensibility: BigMLer hides complexity at the cost of flexibility
Not enough abstraction
5. Machine Learning Automation for real
Solution (complexity, reuse): Domain-specific languages
6. Machine Learning Automation for real
Solution (scalability, reuse): Back to the server
8. WhizzML in a Nutshell
• Domain-specific language for ML workflow automation
High-level problem and solution specification
• Framework for scalable, remote execution of ML workflows
Sophisticated server-side optimization
Out-of-the-box scalability
Client-server brittleness removed
Infrastructure for creating and sharing ML scripts and libraries
9. WhizzML REST Resources
Library: a reusable building block, a collection of WhizzML definitions that can be imported by other libraries or scripts.
Script: executable code that describes an actual workflow.
• Imports: list of libraries whose code is used by the script.
• Inputs: list of input values that parameterize the workflow.
• Outputs: list of values computed by the script and returned to the user.
Execution: given a script and a complete set of inputs, the workflow can be executed and its outputs generated.
10. Different ways to create WhizzML Scripts/Libraries
GitHub
Script editor
Gallery
Other scripts
Scriptify
17. Outline
1 Server-side workflows: WhizzML
2 Basic Workflow: Model or ensemble?
3 Case study: Using Flatline in Whizzml
4 Advanced Workflows
5 Case Study: Stacked Generalization in WhizzML
18. Model or Ensemble?
• Split a dataset into training and test parts
• Create a model and an ensemble with the training dataset
• Evaluate both with the test dataset
• Choose the one with the better evaluation (F-measure)
https://github.com/whizzml/examples/tree/master/model-or-ensemble
19. Model or Ensemble?
;; Functions for creating the two dataset parts
;; Sample a dataset taking a fraction of its rows (rate) and
;; keeping either that fraction (out-of-bag? false) or its
;; complement (out-of-bag? true)
(define (sample-dataset origin-id rate out-of-bag?)
(create-dataset {"origin_dataset" origin-id
"sample_rate" rate
"out_of_bag" out-of-bag?
"seed" "example-seed-0001"}))
;; Create in parallel the two complementary samples of a dataset
;; using the sample function twice. Return a list of the two
;; new dataset ids.
(define (split-dataset origin-id rate)
(list (sample-dataset origin-id rate false)
(sample-dataset origin-id rate true)))
20. Model or Ensemble?
;; Functions to create an ensemble and extract the f-measure from
;; evaluation, given its id.
(define (make-ensemble ds-id size)
(create-ensemble ds-id {"number_of_models" size}))
(define (f-measure ev-id)
(let (ev-id (wait ev-id) ;; because fetch doesn't wait
evaluation (fetch ev-id))
(evaluation ["result" "model" "average_f_measure"])))
21. Model or Ensemble?
;; Function encapsulating the full workflow
(define (model-or-ensemble src-id)
(let (ds-id (create-dataset {"source" src-id})
[train-id test-id] (split-dataset ds-id 0.8)
m-id (create-model train-id)
e-id (make-ensemble train-id 15)
m-f (f-measure (create-evaluation m-id test-id))
e-f (f-measure (create-evaluation e-id test-id)))
(log-info "model f " m-f " / ensemble f " e-f)
(if (> m-f e-f) m-id e-id)))
;; Compute the result of the script execution
;; - Inputs: [{"name": "input-source-id", "type": "source-id"}]
;; - Outputs: [{"name": "result", "type": "resource-id"}]
(define result (model-or-ensemble input-source-id))
22. Outline
1 Server-side workflows: WhizzML
2 Basic Workflow: Model or ensemble?
3 Case study: Using Flatline in Whizzml
4 Advanced Workflows
5 Case Study: Stacked Generalization in WhizzML
23. Transforming item counts to features
basket            milk  eggs  flour  salt  chocolate  caviar
milk,eggs         Y     Y     N      N     N          N
milk,flour        Y     N     Y      N     N          N
milk,flour,eggs   Y     Y     Y      N     N          N
chocolate         N     N     N      N     Y          N
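The slides that follow generate this transformation with Flatline; as a purely illustrative, language-agnostic sketch (the function and variable names here are ours, not part of BigML's API), the same expansion of an items field into Y/N columns looks like this in Python:

```python
def expand_items(rows, items):
    """For each row (a comma-separated basket string), emit a dict
    mapping each known item to "Y" if present, else "N"."""
    out = []
    for row in rows:
        basket = set(row.split(","))
        out.append({item: ("Y" if item in basket else "N") for item in items})
    return out

items = ["milk", "eggs", "flour", "salt", "chocolate", "caviar"]
baskets = ["milk,eggs", "milk,flour", "milk,flour,eggs", "chocolate"]
features = expand_items(baskets, items)
# features[0] == {"milk": "Y", "eggs": "Y", "flour": "N", "salt": "N",
#                 "chocolate": "N", "caviar": "N"}
```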
28. Flatline code generation with WhizzML
(define (field-flatline field item yes no)
(flatline "(if (contains-items? {{field}} {{item}})"
"{{yes}}"
"{{no}})"))
(define (item-fields field items yes no)
(for (item items)
{"field" (field-flatline field item yes no)}))
(define (dataset-item-fields ds-id field)
(let (ds (fetch ds-id)
item-dist (ds ["fields" field "summary" "items"])
items (map head item-dist))
(item-fields field items "Y" "N")))
29. Flatline code generation with WhizzML
(define output-dataset
(let (fs {"new_fields" (dataset-item-fields input-dataset
field)})
(create-dataset input-dataset fs)))
{"inputs": [{"name": "input-dataset",
"type": "dataset-id",
"description": "The input dataset"},
{"name": "field",
"type": "string",
"description": "Id of the items field"}],
"outputs": [{"name": "output-dataset",
"type": "dataset-id",
"description": "The id of the generated dataset"}]}
30. Outline
1 Server-side workflows: WhizzML
2 Basic Workflow: Model or ensemble?
3 Case study: Using Flatline in Whizzml
4 Advanced Workflows
5 Case Study: Stacked Generalization in WhizzML
31. What Do We Know About WhizzML?
• It’s a complete programming language
• Machine learning “operations” are first-class
• Those operations are performed in BigML’s backend
One line of code to perform API requests
We get scale “for free”
• Everything is Composable
Functions
Libraries
The Web Interface
32. What Can We Do With It?
• Non-trivial Model Selection
n-fold cross validation
Comparison of model types (tree, ensemble, logistic)
• Automation of Drudgery
One-click retraining/validation
Standardized dataset transformations / cleaning
• Sure, but what else?
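For a sense of what n-fold cross-validation involves as a workflow, here is a minimal sketch in plain Python of the index bookkeeping (WhizzML runs the equivalent dataset splits server-side; the helper name is ours):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds; each fold in turn
    serves as the test set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

# 5-fold CV over 10 rows: 5 (train, test) pairs, each test fold disjoint
splits = k_fold_indices(10, 5)
```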
33. Algorithms as Workflows
• Many ML algorithms can be thought of as workflows
• In these algorithms, machine learning operations are the
primitives
Make a model
Make a prediction
Evaluate a model
• Many such algorithms can be implemented in WhizzML
Reap the advantages of BigML’s infrastructure
Once implemented, it is language-agnostic
37. Examples: SMACdown
Objective: Find the best set of parameters even more quickly!
• Do:
Generate several random sets of parameters for an ML algorithm
Do 10-fold cross-validation with those parameters
Learn a predictive model to predict performance from parameter
values
Use the model to help you select the next set of parameters to
evaluate
• Until you get a set of parameters that performs “well” or you get
bored
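As a toy illustration of the loop described above (not BigML's implementation: the surrogate here is a crude 1-nearest-neighbour predictor over a single numeric parameter, whereas SMAC-style methods fit a learned regression model over full parameter sets):

```python
import random

def smac_sketch(evaluate, sample_config, rounds=10, candidates=20, seed=0):
    """Model-based search: keep a history of (config, score) pairs; each
    round, score random candidates with a nearest-neighbour surrogate and
    evaluate only the most promising one for real."""
    rng = random.Random(seed)
    history = [(c, evaluate(c)) for c in (sample_config(rng) for _ in range(3))]

    def surrogate(c):
        # Predict c's score as the score of the nearest evaluated config.
        return min(history, key=lambda h: abs(h[0] - c))[1]

    for _ in range(rounds):
        cand = [sample_config(rng) for _ in range(candidates)]
        best = max(cand, key=surrogate)  # maximize predicted performance
        history.append((best, evaluate(best)))
    return max(history, key=lambda h: h[1])

# Toy objective: find x in [0, 4] maximizing -(x - 2)^2
best_x, best_score = smac_sketch(lambda x: -(x - 2) ** 2,
                                 lambda rng: rng.uniform(0, 4))
```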
38. Examples: Boosting
• General idea: Iteratively model the dataset
Each iteration is trained on the mistakes of previous iterations
Said another way, the objective changes each iteration
The final model is a summation of all iterations
• Lots of variations on this theme
Adaboost
Logitboost
Martingale Boosting
Gradient Boosting
• Let’s take a look at a WhizzML implementation of the latter
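The gradient-boosting loop sketched above can be illustrated in plain Python for squared loss, where the residuals are the negative gradient (decision stumps stand in for BigML models; all names here are ours):

```python
def stump_fit(xs, ys):
    """Fit a depth-1 regression tree (a stump) minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all xs identical: predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=20, lr=0.5):
    """Each stump is trained on the current residuals (the mistakes of the
    previous iterations); the final model is the sum of all iterations."""
    f0 = sum(ys) / len(ys)
    stumps = []
    pred = [f0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        s = stump_fit(xs, residuals)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 1, 3, 3]
model = gradient_boost(xs, ys)
```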
40. Outline
1 Server-side workflows: WhizzML
2 Basic Workflow: Model or ensemble?
3 Case study: Using Flatline in Whizzml
4 Advanced Workflows
5 Case Study: Stacked Generalization in WhizzML
42. Stacked generalization
Objective: Improve predictions by modeling the output scores of
multiple trained models.
• Create a training and a holdout set
• Create n different models on the training set (with some difference
among them; e.g., single-tree vs. ensemble vs. logistic regression)
• Make predictions from those models on the holdout set
• Train a model to predict the class based on the other models’
predictions
43. A Stacked generalization library: creating the stack
;; Splits the given dataset, using half of it to create
;; a heterogeneous collection of models and the other
;; half to train a tree that predicts based on those other
;; models' predictions. Returns a map with the collection
;; of models (under the key "models") and the metamodel
;; as the value of the key "metamodel". The key "result"
;; holds a boolean flag indicating whether the
;; process was successful.
(define (make-stack dataset-id)
(let ([train-id hold-id] (create-random-dataset-split dataset-id 0.5)
models (create-stack-models train-id)
id (create-stack-predictions models hold-id)
orig-fields (model-inputs (head models))
obj-id (dataset-get-objective-id train-id)
meta-id (create-model {"dataset" id
"excluded_fields" orig-fields
"objective_field" obj-id})
success? (resource-done? (fetch (wait meta-id))))
{"models" models "metamodel" meta-id "result" success?}))
44. A Stacked generalization library: using the stack
;; Use the models and metamodel computed by make-stack
;; to make a prediction on the input-data map. Returns
;; the identifier of the prediction object.
(define (make-stack-prediction models meta-model input-data)
(let (preds (map (lambda (m) (create-prediction {"model" m
"input_data" input-data}))
models)
preds (map (lambda (p)
(head (values ((fetch p) "prediction"))))
preds)
meta-input (make-map (model-inputs meta-model) preds))
(create-prediction {"model" meta-model "input_data" meta-input})))
45. A Stacked generalization library: auxiliary functions
;; Extract from a batchprediction its associated dataset of results
(define (batch-dataset id)
(wait ((fetch id) "output_dataset_resource")))
;; Create a batchprediction for the given model and dataset,
;; with a map of additional options and using defaults appropriate
;; for model stacking
(define (make-batch ds-id mod-id)
(let (name (resource-type mod-id))
(create-batchprediction ds-id mod-id {"all_fields" true
"output_dataset" true
"prediction_name" name})))
;; Auxiliary function extracting the input_fields of a model
(define (model-inputs mod-id)
((fetch mod-id) "input_fields"))
46. A Stacked generalization library: auxiliary functions
;; Auxiliary function to create the set of stack models
(define (create-stack-models train-id)
[(create-model {"dataset" train-id})
(create-ensemble {"dataset" train-id
"number_of_models" 20
"randomize" false})
(create-ensemble {"dataset" train-id
"number_of_models" 20
"randomize" true})
(create-logisticregression {"dataset" train-id})])
;; Auxiliary function to successively create batchpredictions using the
;; given models over the initial dataset ds-id. Returns the final
;; dataset id.
(define (create-stack-predictions models ds-id)
(reduce (lambda (did mid)
(batch-dataset (make-batch did mid)))
ds-id
models))