This presentation covers all aspects of modeling, from preparing data to training models and evaluating the results. It describes the mainline ML methods, including neural nets, SVMs, boosting, bagging, trees, forests, and deep learning. The common problems of overfitting and dimensionality are covered, along with a discussion of modeling best practices. Other topics include field standardization, encoding categorical variables, and feature creation and selection. It is a soup-to-nuts overview of all the procedures necessary for building state-of-the-art predictive models.
Valencian Summer School 2015
Day 1
Lecture 3
Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 1
Lecture 3
Ensembles of Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
L9 Using DataWarrior for Scientific Data Visualization - Seppo Karrila
A tutorial for beginning graduate students on data visualization, with hands-on training in DataWarrior. These are handout notes so the students can try things out on their own laptops with the free software instead of scribbling notes themselves. The instructor demonstrates the options and functions listed in the handout notes.
At the end of this lesson (Part 1) the students should know the following:
Introduction
Data Entry
Variable and Value Label
Entering Data
File management
Descriptive statistics
Editing and modifying the data
Slides explaining the distinction between bagging and boosting in terms of the bias-variance trade-off, followed by some lesser-known aspects of supervised learning: the effect of the tree split metric on feature importance, the effect of the decision threshold on classification accuracy, and how to adjust a model's classification threshold in supervised learning.
Note: The limitations of the accuracy metric (baseline accuracy) and alternative metrics, with their use cases, advantages, and limitations, are briefly discussed.
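As a minimal sketch of the threshold and baseline-accuracy points above (the scores, labels, and threshold values are invented for illustration; no particular library is assumed):

```python
# Adjusting the classification threshold and comparing against baseline accuracy.
# Scores and labels are made up for illustration.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def predict(scores, threshold):
    # Turn continuous scores into class labels at a chosen threshold.
    return [1 if s >= threshold else 0 for s in scores]

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # imbalanced: 70% negatives
scores = [0.1, 0.2, 0.15, 0.3, 0.45, 0.35, 0.6, 0.55, 0.7, 0.9]

# Baseline accuracy: always predict the majority class.
baseline = max(y_true.count(0), y_true.count(1)) / len(y_true)

for threshold in (0.3, 0.5, 0.7):
    acc = accuracy(y_true, predict(scores, threshold))
    print(f"threshold={threshold:.1f} accuracy={acc:.2f} (baseline={baseline:.2f})")
```

Moving the threshold changes accuracy even though the model's scores are fixed, and on imbalanced data a model can beat 0.5-threshold accuracy yet barely beat the majority-class baseline — which is exactly why alternative metrics come up.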
Using CART for Beginners with a Telco Example Dataset - Salford Systems
Familiarize yourself with CART Decision Tree technology in this beginner's tutorial using a telecommunications example dataset from the 1990s. By the end of this tutorial you should feel comfortable using CART on your own with sample or real-world data.
Fairly Measuring Fairness in Machine Learning - HJ van Veen
We look at a case and two research papers on measuring discrimination in machine learning models for extending credit. Presentation given as part of the Sao Paulo Machine Learning Meetup, theme "Ethics in Data Science".
Automated AI: The Next Frontier in Analytics - StampedeCon AI Summit 2017 - StampedeCon
This talk walks through the important building blocks of automated AI. Rajiv will highlight the current gaps in analytics organizations and how to close those gaps using automated AI. Issues discussed include the accuracy of models, the trade-offs around control when using automation, the interpretability of models, and integration with other tools, highlighted with examples of automated analytics in different industries. The talk ends with examples of how automated AI in the hands of data scientists and business analysts is transforming analytics teams and organizations.
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017 - StampedeCon
Artificial Intelligence has entered a renaissance thanks to rapid progress in domains as diverse as self-driving cars, intelligent assistants, and game play. Underlying this progress is Deep Learning – driven by significant improvements in Graphic Processing Units and computational models inspired by the human brain that excel at capturing structures hidden in massive complex datasets. These techniques have been pioneered at research universities and digital giants but mainstream enterprises are starting to apply them as open source tools and improved hardware become available. Learn how AI is impacting analytics today and in the future.
Learn how AI is affecting the enterprise, including applications like fraud detection, mobile personalization, predicting failures for IoT, and text analysis to improve call center interactions. We look at practical examples of assessing the opportunity for AI, phased adoption, and lessons learned going from research, to prototype, to scaled production deployment.
Driven by the rapid progress in Artificial Intelligence (AI) research, intelligent machines are gaining the ability to learn, improve and make calculated decisions in ways that will enable them to perform tasks previously thought to rely solely on human experience, creativity, and ingenuity. As a result, we will in the near future see large parts of our lives influenced by AI.
AI innovation will also be central to the achievement of the United Nations' Sustainable Development Goals (SDGs) and will help solving humanity's grand challenges by capitalizing on the unprecedented quantities of data now being generated on sentiment behavior, human health, commerce, communications, migration and more.
With large parts of our lives being influenced by AI, it is critical that government, industry, academia and civil society work together to evaluate the opportunities presented by AI, ensuring that AI benefits all of humanity. Responding to this critical issue, ITU and the XPRIZE Foundation organized AI for Good Global Summit in Geneva, 7-9 June, 2017 in partnership with a number of UN sister agencies. The Summit aimed to accelerate and advance the development and democratization of AI solutions that can address specific global challenges related to poverty, hunger, health, education, the environment, and others.
The Summit provided a neutral platform for government officials, UN agencies, NGOs, industry leaders, and AI experts to discuss the ethical, technical, societal and policy issues related to AI, offer recommendations and guidance, and promote international dialogue and cooperation in support of AI innovation.
Please visit the AI for Good Global Summit page for more resources: https://www.itu.int/en/ITU-T/AI/Pages/201706-default.aspx
If you would like to speak, partner or sponsor the 2018 edition of the summit, please contact: ai@itu.int
Top 5 Deep Learning and AI Stories - October 6, 2017 - NVIDIA
Read this week's top 5 news updates in deep learning and AI: Gartner predicts top 10 strategic technology trends for 2018; Oracle adds GPU Accelerated Computing to Oracle Cloud Infrastructure; chemistry and physics Nobel Prizes are awarded to teams supported by GPUs; MIT uses deep learning to help guide decisions in ICU; and portfolio management firms are using AI to seek alpha.
DutchMLSchool. ML: A Technical Perspective - BigML, Inc
DutchMLSchool. Machine Learning: A Technical Perspective
TITLE AS IN SCHEDULE - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ... - Maninda Edirisooriya
Decision Trees and Ensemble Methods form a distinct class of machine learning algorithms. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
By popular demand, here is a case study of my first Kaggle competition from about a year ago. Hope you find it useful. Thank you again to my fantastic team.
Machine Learning and Linear Regression Programming - Soumya Mukherjee
Overview of AI and ML
Terminology awareness
Applications in real world
Use cases within Nokia
Types of Learning
Regression
Classification
Clustering
Linear Regression Single Variable with Python
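The single-variable linear regression item above can be sketched in plain Python with the closed-form least-squares solution (the data points here are invented for illustration):

```python
# Single-variable linear regression via ordinary least squares (closed form):
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data, roughly following y = 2x + 1 with a little noise.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
m, b = fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")   # prints: y = 1.95x + 1.15
```

The same fit is what a library call (e.g. scikit-learn's LinearRegression or numpy.polyfit) would return; doing it by hand once makes the "covariance over variance" formula concrete.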
The Power of Auto ML and How It Works - Ivo Andreev
Automated ML is an approach that minimizes the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics, or programming. The mechanism works by letting end-users simply provide data; the system automatically does the rest by determining the approach to a particular ML task. At first this may sound discouraging to those aiming at the “sexiest job of the 21st century”, the data scientists. However, Auto ML should be considered a democratization of ML rather than automatic data science.
In this session we will talk about how Auto ML works, how is it implemented by Microsoft and how it could improve the productivity of even professional data scientists.
Why Should We Trust You? Interpretability of Deep Neural Networks - StampedeCo... - StampedeCon
Despite widespread adoption and success most machine learning models remain black boxes. Many times users and practitioners are asked to implicitly trust the results. However understanding the reasons behind predictions is critical in assessing trust, which is fundamental if one is asked to take action based on such models, or even to compare two similar models. In this talk I will (1.) formulate the notion of interpretability of models, (2.) provide a review of various attempts and research initiatives to solve this very important problem and (3.) demonstrate real industry use-cases and results focusing primarily on Deep Neural Networks.
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017 - StampedeCon
Words are no longer sufficient in delivering the search results users are looking for, particularly in relation to image search. Text and languages pose many challenges in describing visual details and providing the necessary context for optimal results. Machine Learning technology opens a new world of search innovation that has yet to be applied by businesses.
In this session, Mike Ranzinger of Shutterstock will share a technical presentation detailing his research on composition aware search. He will also demonstrate how the research led to the launch of AI technology allowing users to more precisely find the image they need within Shutterstock’s collection of more than 150 million images. While the company released a number of AI search enabled tools in 2016, this new technology allows users to search for items in an image and specify where they should be located within the image. The research identifies the networks that localize and describe regions of an image as well as the relationships between things. The goal of this research was to improve the future of search using visual data, contextual search functions, and AI. A combination of multiple machine learning technologies led to this breakthrough.
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017 - StampedeCon
In many modern applications data are collected in unusual forms. Connectome or brain imaging data are graphs. Wearable devices measuring activity produce functions over time. In many cases these objects are collected for each individual or transaction, leaving the statistician with the challenge of analyzing populations of data that are not in classical numeric and categorical formats in big spreadsheets. In this talk I introduce object-oriented data analysis with an application we recently developed for regression analysis. The talk is aimed at the general data scientist, with emphasis on concepts rather than mathematical detail. The take-home message is how we can use covariates (i.e., metadata) to predict what the structure of a brain image graph will be.
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam... - StampedeCon
This talk dives into the technical details of machine learning model development and implementation, and the value it brings to the Monsanto breeding pipeline. We genotype over 100 million seeds a year in order to save field resources and product development cycle time. Automation and high-throughput production from the lab are key to R&D success. In-house predictive model development incorporated a random forest ensemble approach with additional features derived from a Gaussian mixture model. The results show over 95% accuracy with less than 1% false positives/negatives. The model is highly generalizable, with over 10 million data points trained and tested on. It also offers a probabilistic approach to present genotypes in a more meaningful way and to enhance downstream genomics analyses. The talk targets an audience in breeding, genetics, and molecular biology, as well as data scientists interested in practical applications.
How to Talk about AI to Non-analysts - StampedeCon AI Summit 2017 - StampedeCon
While artificial intelligence for self-driving cars and virtual assistants gets a lot of attention, communicating the needs, effectiveness and measurements of analytics is complicated when speaking “geek”. The work of an analyst, however, does not just involve conducting data analysis, but also communicating, championing and speaking simply when talking to the organization, clients and management.
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017 - StampedeCon
This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google’s scalable, distributed, GPU-powered compute graph engine that machine learning practitioners use for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem... - StampedeCon
In this session, we’ll discuss approaches for applying convolutional neural networks to novel computer vision problems, even without having millions of images of your own. Pretrained models and generic image data sets from Google, Kaggle, universities, and other places can be leveraged and adapted to solve industry and business specific problems. We’ll discuss the approaches of transfer learning and fine tuning to help anyone get started on using deep learning to get cutting edge results on their computer vision problems.
Bringing the Whole Elephant Into View: Can Cognitive Systems Bring Real Soluti... - StampedeCon
Like the story of the six blind men trying to explain the nature of an elephant, current research in cognitive computational systems attempts to identify the nature of an illness, human behavior, or socio-economical phenomenon, from their own perspective.
At present, there is no agreed upon definition for cognitive systems. One large communication corporation defines cognitive systems as a category of technology that uses artificial intelligence, machine learning and reasoning, to enable people and machines to interact more naturally. It also extends and magnifies human expertise and cognition to enable accurate decisions on time. Two of the most famous risk and financial advisory firms agree with that interpretation. A different large corporation, however, considers “cognitive systems” as merely marketing jargon.
If cognitive systems are going to help us solve challenging problems in medicine, economics, or other fields, three aspects must be considered in order to reveal the “true nature of the elephant”.
§ All facets of the problem must be addressed, like the main parts of the elephant had to be touched by the men.
§ These facets must be properly assembled, like the men needed to join hands around the elephant in order to understand what it was.
§ This assembly must be completed within sufficient time to anticipate future decisions. Just like the men needed to know what an elephant is before the next one charges them.
This talk will explain how agnostic (unsupervised, blinded) machine learning findings can be assembled by multiobjective and multimodal optimization techniques to uncover a multifaceted view of the “elephant”, in this case the human being (e.g., genomic variants, personality traits, brain images). It will also give real-world examples of how this knowledge will “extend the human capabilities” by achieving an integrative assessment of the whole person in relation to their risk, allowing professionals to generate accurate person-centered policies: from personalized diagnoses to business opportunities to the prevention of outbreaks.
A Different Data Science Approach - StampedeCon AI Summit 2017 - StampedeCon
This session will focus on how to execute Data Science caliber efforts by creating teams with the attributes of Data Science to deliver meaningful results. As Data Scientists are harder to find and keep, this session should appeal to anyone who is either seeking an alternative approach to executing Data Science delivery or augmenting their current Data Science model with additional options.
Graph in Customer 360 - StampedeCon Big Data Conference 2017 - StampedeCon
Enterprises typically have many data silos of partial customer data, and a common theme in big data projects is to use big data tools and pipelines to unify all siloed customer data into a single, queryable platform for improving all future customer interactions. This data often comes from billing, website traffic, logistics, and marketing, all in different formats with different properties. Graph provides a way to unify all of the data in a single place for tracking the flow of a user through the various silos. Graph can also be used for visualizations and analytics that are difficult in other systems.
In this talk we will explore the ways in which graph can be leveraged in a customer 360 use case: what it can add to a more conventional system, and what the approach to developing a graph-based customer 360 system should be.
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017 - StampedeCon
This talk will go over how to build an end-to-end data processing system in Python, from data ingest, to data analytics, to machine learning, to user presentation. Developments in old and new tools have made this particularly possible today. In particular, the talk will cover Airflow for process workflows, PySpark for data processing, Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python.
System architects, software engineers, data scientists, and business leaders can all benefit from attending the talk. They should learn how to build more agile data processing systems and take away some ideas on how their data systems could be simpler and more powerful.
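The ingest-to-presentation flow described above can be sketched as a toy pipeline in plain Python (the stage names, sample rows, and "average spend" metric are all invented for illustration; in the talk's stack these stages would map to Airflow tasks, PySpark jobs, and a microservice):

```python
# A toy end-to-end pipeline: ingest -> clean -> analyze -> present.

def ingest():
    # Pretend these rows came from a CSV file or a message queue.
    return [{"user": "a", "spend": 10.0},
            {"user": "b", "spend": None},   # a bad record to be filtered out
            {"user": "c", "spend": 30.0}]

def clean(rows):
    # Drop records with missing values.
    return [r for r in rows if r["spend"] is not None]

def analyze(rows):
    # Stand-in "analytics": average spend across users.
    return sum(r["spend"] for r in rows) / len(rows)

def present(result):
    # Format the result for user presentation.
    return f"average spend: {result:.2f}"

print(present(analyze(clean(ingest()))))
```

Each stage is a pure function over plain data, which is what makes such pipelines easy to test in isolation and then lift into a workflow engine.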
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017 - StampedeCon
Big Data doesn’t have to just mean Hadoop any more. Big Data can be done in the cloud, using tools developed by the Cloud providers. This session will cover using Amazon AWS services to implement a Big Data application. We will compare and contrast different services from Amazon with the Hadoop equivalents.
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu... - StampedeCon
Using big data isn’t about doing the same things we’ve always done just with different technologies. The technology advances that we’ve chosen to label as big data create the opportunity for wholly new kinds of solutions. Two of the key advances that are enabling new business capabilities are cloud-based data management platforms and streaming data processing and analytics.
In this session, Paul Boal will drill into the cloud-based streaming data architecture that has made possible EVŌ, a new breakthrough health and wellness platform. EVŌ uses a game-changing approach that leverages over 60 billion data points and a predictive analytics engine to intervene BEFORE someone becomes critically ill. All of this is possible by leveraging data from smartphones and wearable fitness devices along with advanced analytics which then help users develop and sustain positive behaviors. Attendees will learn how to create a cloud- based architecture that can receive data, apply multiple layers of dynamic business rules, and drive alerts and decisions through real-time stream processing using technologies including web services, Amazon DynamoDB and Kinesis, Drools, and Apache Spark.
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz... - StampedeCon
The collection and use of Big Data has become an important part of modern business practice. The Internet of Things (IoT) movement promises to provide new opportunities for businesses interested in the intersection of people and technology. It is also wrought with pitfalls for practitioners and researchers who struggle to make sense of an increasing cacophony of signals. How should they poll and collect data from millions of signals in a way that is manageable, scalable, and statistically valid? How should they analyze and predict using these data? This presentation will discuss these challenges with applied examples from monitoring and managing one of the world’s largest computers.
Innovation in the Data Warehouse - StampedeCon 2016 - StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with big data and outline three common big data architectures (batch, lambda, and kappa). Then, we'll dive into the decision points necessary for your own cluster, for example: cloud vs. on premises, physical vs. virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we'll share some lessons learned about which pieces of our architecture worked well and rant about those which didn't. No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
Creating a Data Driven Organization - StampedeCon 2016 - StampedeCon
Companies today are all focused on finding new consumption models to better utilize the data they produce. This presentation will provide insights and best practices for creating the organization and sponsorship necessary to set the foundation for success.
For this session, Dan will provide an overview of the process and methodologies he employs to establish and sustain a Data Driven Culture. Key topics will include:
Data Driven Culture
Executive Sponsorship
Organizational Structure – Collaboration Hubs and Bi-Modal Analytics
Role of Hadoop and Big Data as Part of Data Driven Culture
Using The Internet of Things for Population Health Management - StampedeCon 2016 - StampedeCon
The Internet of (Human) Things is just beginning to take shape. The human body is an inexhaustible source of data about personal health, and the healthcare industry is just beginning to scratch the surface of the potential insights and value that will come from that data. While much of healthcare traditionally focuses on the episodic delivery of services, the Affordable Care Act is pushing healthcare providers, payers, and self-funded employer groups to look at ways to proactively encourage healthy behaviors. Providing personal health devices as a way to promote individual health is one way that healthcare is beginning to take advantage of IoT technologies. This session provides insight into how IoT is being leveraged in population health management through a solution jointly delivered by Amitech Solutions and Big Cloud Analytics. Attendees will learn how Hadoop is being used to gather personal device data from various vendors, integrate and analyze that information, differentiate trends across regional and cultural diversity, and provide personal recommendations and insights into health risks. This session presents one important way the healthcare industry is leveraging IoT.
Turn Data Into Actionable Insights - StampedeCon 2016 - StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development; provisioning data through APIs and streams; deploying models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale; and integrating analytics with our core product platforms to turn data into actionable insights.
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016 - StampedeCon
Hadoop adoption is a journey. Depending on the business the process can take weeks, months, or even years. Hadoop is a transformative technology so the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. There are challenges for companies who have lived with an application driven business for the last two decades to suddenly become data driven. Companies need to begin thinking less in terms of single, silo’d servers and more about “the cluster”.
The concept of the cluster becomes the center of data gravity drawing all the applications to it. Companies, especially the IT organizations, embark on a process of understanding how to maintain and operationalize this environment and provide the data lake as a service to the businesses. They must empower the business by providing the resources for the use cases which drive both renovation and innovation. IT needs to adopt new technologies and new methodologies which enable the solutions. This is not technology for technology sake. Hadoop is a data platform servicing and enabling all facets of an organization. Building out and expanding this platform is the ongoing journey as word gets out to businesses that they can have any data they want and any time. Success is what drives the journey.
The length of the journey varies from company to company. Sometimes the challenges are based on the size of the company, but many times they are based on the difficulty of unseating established IT processes companies have adopted without forethought over the past two decades. Companies must navigate through the noise; sifting through it to find the solutions which bring real value takes time. As the platform matures and becomes mainstream, more and more companies are finding it easier to adopt Hadoop. Hundreds of companies have already taken many steps; hundreds more have already taken the first step. As the wave of successful Hadoop adoption continues, more and more companies will see the value in starting the journey and paving the way for others.
Visualizing Big Data – The Fundamentals - StampedeCon
This session will touch upon two visual languages, one to describe the context around what is being asked from the data, and the other, to describe what is quantifiable. From these two visual constructs we will go specifically into the following topics: Grids, Balance, Proximity, Contextual Kernels and Hierarchy.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on representations such as Compressed Sparse Row (CSR), an adjacency-list based graph format. These notes benchmark the following primitives:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
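The reduce pattern benchmarked above can be sketched on the CPU in plain Python (the chunked version stands in for the per-thread partial sums an OpenMP or CUDA reduction would compute; the chunk count is an arbitrary stand-in for a launch config):

```python
# Vector element sum (reduce) in two modes: sequential, and chunked as a
# stand-in for partial sums computed by parallel threads or CUDA blocks.

def sum_sequential(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_chunked(xs, chunks=4):
    # Each "thread" sums one slice; the partial results are then reduced.
    n = len(xs)
    step = (n + chunks - 1) // chunks
    partials = [sum(xs[i:i + step]) for i in range(0, n, step)]
    return sum(partials)

xs = [float(i) for i in range(1, 1001)]   # 1 + 2 + ... + 1000 = 500500
print(sum_sequential(xs), sum_chunked(xs))
```

Both modes compute the same sum here; in the actual benchmarks the interesting differences are throughput (sequential vs. OpenMP vs. CUDA launch configs) and rounding behavior when a narrower storage type such as bfloat16 is used for the elements.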
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Foundations of Machine Learning - StampedeCon AI Summit 2017
1. FOUNDATIONS OF MODELING
AND MACHINE LEARNING
Stephen Coggeshall
Retired Chief Analytics and Science Officer, ID Analytics, Lifelock
Professor USC, UCSD
17 October, 2017
10/24/2017 1
2. What is a Model?
•A model is a representation
of reality
• Could be a statistical model, “learned” from data examples
• Could be first principles equations
• Could be simulations, rule systems
3. What is a Statistical Model?
• A statistical model is a functional relationship, y = f(x), between a bunch of inputs and an
output, where this relationship is approximated from discrete data point examples.
• Examples:
• Will this person pay back a loan? (credit score)
• Is this healthcare claim a fraud?
• Will John buy this product if I offer it to him?
• Where will the stock market be in a month?
• A statistical model is built (trained) by “showing” it a data set of examples.
4. There are Many Methods for Statistical Models
• Statistics community:
• Linear, logistic regressions
• PCR, PLS
• Factor analysis
• ARIMA
• …
• Generally show all the data at once.
• Generally linear.
• Machine learning community:
• Decision trees
• Neural nets/MLP
• SVM
• Nearest neighbor
• Clustering
• Boosted trees
• Random forests
• Bayesian networks
• Deep learning
• Convolutional neural nets
• …
• Generally learn incrementally, point by point.
• Generally nonlinear.
5. What Does Data Look Like?
• Think this way for data preparation, statistical analysis, normalization, standardization…: a flat table with one row per record (many millions of rows) and columns x1 x2 x3 … xn y (many hundreds of columns). Look at means, sd's, distributions, normalizations.
• Think this way for the model building process, algorithm design and selection: points in the input space (x1, x2, …), each with an output value y.
Note – We first need to build these variables (x's) from the raw data.
6. How to Build Variables
• Need to convert the raw data into normalized numeric variables
• Numeric fields: scale carefully, log transforms…
• Categorical fields: Generally need to convert to a number
• Build combinations of fields: sums, ratios, min/max’s…
• Build expert variables
• Talk to domain experts, understand the phenomenon as well as possible
• Build special, explicit variables that best relate the raw fields to the output
• Examples:
• Credit risk scores: open-to-buy, revolve/transact (pay/bal), max # delinquencies in past 12 months…
• General fraud: velocity (# events/time window), # events grouped by address…
• Tax preparer fraud: % returns with foster children, % returns with schedule C…
• Marketing: Income/age, % purchases in a category (e.g. travel, shopping, online…)
7. How to Scale Variables
• Most ML algorithms are sensitive to the differences in scale of the input values:
• Counts are typically a few, maybe less than ~10
• $ values can be thousands, millions…
• Generally need to put all variables on the same footing or scale; Critical for clustering.
• Could divide by the range: x' = (x - μ) / (x_max - x_min), but this is very sensitive to outliers.
• Better is z scaling: x' = (x - μ) / σ, but be careful when calculating μ and σ.
• Another very good method is binning:
[Figure: a skewed data distribution (# records with each value of xi) cut into bins of equal population; example shows 20 bins with 5% of the records each.]
• Bin the data into bins of equal population
• Replace the value by the bin number
• Good to use a large number of bins, like 1000
• Particularly good when combining multiple model outputs/scores
• Verdicts: range scaling is bad; z scaling and binning are good
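The z scaling and equal-population binning described above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the function names are mine, and `pstdev` is the population standard deviation):

```python
import statistics

def z_scale(values):
    # z scaling: x' = (x - mean) / std, putting all variables on the same footing
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def equal_population_bins(values, n_bins):
    # Replace each value by the index of its equal-population bin
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins
```

With 1000 bins, as the slide suggests, the bin number itself becomes a robust, outlier-resistant rescaling of the variable.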
8. How to Handle Categorical Variables
• Ordinal coding: Assign an integer to each category:
• [A, B, C…] -> [1, 2, 3…]
• Problem: many modeling algorithms assume there is a metric (1 is closer to 2 than 3)
• Dummy, one hot: Expand the category list into new variables:
• [A, B, C…] -> x1, x2, x3… with binary values. So a record with value “B” -> 0,1,0,0…
• Problem: explodes dimensionality, which is the opposite of what we want in modeling
• There are many dimensionality expanding methods (contrast, Helmert, GLM, comparison…);
All have this problem
• Risk tables (for supervised models only): For each possible category assign a specific value. Replace
the field by its assigned value.
• No dimensionality expansion. Each categorical variable becomes one continuous variable
• Assigned value: average of the dependent variable for all records in that category
• Good – direct encoding to what you’re trying to predict
• Bad - loss of interaction information
Setup: you have a field x that can take values A, B, C… How do you encode these categories A, B, C… into numbers to go into a modeling algorithm? (Verdicts: ordinal coding and one hot are bad; risk tables are good.)
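The risk-table encoding described above can be sketched as follows (a minimal illustration, not from the slides; the data and names are invented):

```python
from collections import defaultdict

def risk_table(categories, targets):
    # For each category, the average of the dependent variable over all
    # records in that category (supervised models only)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

cats = ["A", "B", "A", "C", "B", "A"]
ys = [1, 0, 0, 1, 0, 1]
table = risk_table(cats, ys)        # A -> 2/3, B -> 0, C -> 1
encoded = [table[c] for c in cats]  # one continuous variable, no new dimensions
```

Note the trade-off from the slide: the encoding is direct to the target, but interaction information between categories is lost.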
9. How to Handle Missing Values
• First, don’t forget that “Missing” can be a relevant value for a categorical variable (particularly for fraud)
• Ignore the records that have a missing field? – not a good approach. Could be many records
• Replace the missing field value with something reasonable:
• Build a model to predict the missing value, given the other values. Usually too much work; do this only if you think it's very important.
• Fill in the missing value with the average or most common value of that field over all records, or
• The average or most common value of that field over a relevant subset of records:
• Select one or more other fields that you think are important in determining the missing field
• Bin or group the selected field(s) into categories
• Replace the missing field with the average or most common value for its binned or other appropriate
group
Given data (about buildings):

Record  # stories  Zip    sq feet  value
1       2          22041  1032     294
2       1          22043  539      495
3       3          22042  NaN      3847
4       3          22041  4837     9278
5       2          22052  462      129
6       NaN        22041  583      948
7       4          22043  2094     847
8       3          22052  7632     1029
9       2          22051  489      947

Fill in the missing # stories (record 6) with
• Average # stories across all records, or
• Average # stories for records in that zip
• …
Fill in the missing sq feet (record 3) with
• Average value across all records, or
• Average value for records in zip 22042, or
• Average value for all records with # stories = 3, or
• Average value for all records in zip 22042 and # stories = 3
• …
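The group-based fill-in strategy above can be sketched like this (a minimal illustration, not from the slides; missing values are represented as `None` and the record layout is invented):

```python
def fill_missing(records, field, group_field):
    # Replace a missing (None) field with the average of that field over
    # records sharing the same group_field value; fall back to the
    # overall average when the group has no known values
    groups = {}
    for r in records:
        if r[field] is not None:
            groups.setdefault(r[group_field], []).append(r[field])
    overall = [v for vals in groups.values() for v in vals]
    for r in records:
        if r[field] is None:
            vals = groups.get(r[group_field]) or overall
            r[field] = sum(vals) / len(vals)
    return records

buildings = [{"stories": 2, "zip": 22041}, {"stories": 3, "zip": 22041},
             {"stories": None, "zip": 22041}, {"stories": 1, "zip": 22043}]
fill_missing(buildings, "stories", "zip")  # record 3 gets the zip-22041 average
```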
10. The Dark Side of Modeling: Overfitting
• What if you want to fit a model to this 1-d data: [Figure: scattered (x, y) points fit by curves of increasing complexity]
• With a sufficiently complex model you can always get a perfect fit
• But when new data comes in that your model hasn't seen before this fit can be very bad
• A simpler model likely fits new data better
Overly-complex models don't generalize well: Overfitting
Must balance model complexity with generalization
11. How to Avoid Overfitting: Training/Testing Data
• Randomly separate the data into two sets: one for training and one for testing
• Build the model on the training data, then evaluate it on the testing data
[Figure: measure of model fitness (for example, model error) vs. model complexity (# free parameters, # training iterations…). Error on the training data keeps falling as complexity grows; error on the testing data falls and then rises again. Stop where the testing error bottoms out: the testing error is the REAL model performance.]
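The random train/test separation can be sketched in a few lines (a minimal illustration, not from the slides; a fixed seed is used so the split is repeatable):

```python
import random

def train_test_split(records, test_frac=0.3, seed=0):
    # Randomly separate the data into two sets:
    # one for training and one for testing
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(10))
```

The model is then fit on `train` only, and its error on `test` is the honest performance estimate.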
12. Feature Selection Methods
• Filter – independent of any modeling method
• PCA, Pearson correlations, mutual information, univariate KS…
• Wrapper – uses a model “wrapped” around the feature selection process
• stepwise selection
• Embedded – does feature selection as the model is built
• Trees, use of a regularization/loss function (e.g., Lasso)
• Methods can be linear or nonlinear
• Some supervised, some unsupervised
10/24/201712
Always best to work in as low dimensionality as possible
Try to eliminate less important features (aka dimensions, inputs, variables…)
13. Model Measures of Goodness
• Continuous target ($ loss, profit…). Use sum of errors
• Each record has an error. Just add up the errors, or the square of the
errors (MSE)
• Marketing models typically use lift
• Lift of 3 means response is 3 times random
• Binary target (good/bad). Use FDR, KS, ROC
• KS is a robust measure of how well two distributions
are separated (goods vs bads)
[Figure: histograms of goods and bads by score bin (# in each bin); KS measures the separation between the two distributions.]
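The KS measure mentioned above is the maximum gap between the two empirical score distributions; a minimal sketch (not from the slides):

```python
def ks_statistic(goods, bads):
    # KS: the maximum vertical gap between the empirical CDFs of the
    # goods' scores and the bads' scores
    ks = 0.0
    for x in sorted(set(goods) | set(bads)):
        cdf_g = sum(g <= x for g in goods) / len(goods)
        cdf_b = sum(b <= x for b in bads) / len(bads)
        ks = max(ks, abs(cdf_g - cdf_b))
    return ks
```

KS = 1 means the two distributions are perfectly separated; KS = 0 means they are indistinguishable.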
14. Decision Trees, Boosted Trees
• A decision tree cuts up input space into boxes and assigns a value to each box
• A boosted tree model is a linear combination of many weak models
[Figure: a decision tree model cuts the (x1, x2) plane into boxes, one of which is assigned y = 23; a boosted decision tree model sums many weak trees: y = 1.00 + .99 + .97 + .96 + .94 …]
15. Boosting
• A way of training a series of weak learners so that together they form a strong learner. Any learner can be used, but trees are the most common.
• The first boosting algorithms (1989) built the next weak learner using the misclassified records.
• Adaptive Boosting (AdaBoost, 1995) assigns a new weight to each data record for training the next weak learner in the series (it uses all records, but with weights).
• AdaBoost increases the weights on misclassified records so the next iteration pays more attention to them. Popular AdaBoost variants: GentleBoost, LogitBoost, BrownBoost, DiscreteAB, KLBoost, RealAB…
• It's often called gradient boosting because one descends the model error in a greedy (local gradient) way.
• Stochastic gradient boosting uses a random subset of the data at each iteration.
• Gradient boosting is currently one of the most popular ML methods.
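The AdaBoost reweighting scheme described above can be sketched with one-split "stumps" on 1-d data (a minimal illustration, not from the slides; the threshold search and the small epsilon guard are my choices):

```python
import math

def best_stump(xs, ys, w):
    # Weak learner: predict +sign if x <= t else -sign; pick the (t, sign)
    # pair with the smallest weighted error
    best = (float("inf"), xs[0], 1)
    for t in xs:
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (sign if x <= t else -sign) != y)
            if err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=5):
    # After each round, raise the weights of misclassified records so the
    # next weak learner pays more attention to them
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, t, sign = best_stump(xs, ys, w)
        alpha = 0.5 * math.log((1 - err + 1e-10) / (err + 1e-10))
        model.append((alpha, t, sign))
        w = [wi * math.exp(-alpha * y * (sign if x <= t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    # The final model is a linear combination of the weak learners
    score = sum(a * (s if x <= t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1
```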
16. Bootstrap and Bagging
• Bootstrapping – Build many models, each using only a sample of the records, drawn with replacement, so there is overlap between the many data samples.
• Bagging – Bootstrap Aggregation. The final model is a simple combination of all the sampled models, typically the average over the sample models.
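A minimal bagging sketch (not from the slides; the toy constant-mean learner is invented purely to show the bootstrap-then-average mechanics):

```python
import random

def bootstrap_sample(records, rng):
    # Sample n records with replacement from n records:
    # some repeat, some are left out
    return [rng.choice(records) for _ in records]

def bagged_model(records, train, n_models=25, seed=0):
    # Bagging: train one model per bootstrap sample,
    # then average their predictions
    rng = random.Random(seed)
    models = [train(bootstrap_sample(records, rng)) for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / n_models

# Toy learner: a constant model predicting the mean of its sample
train_mean = lambda sample: (lambda x: sum(sample) / len(sample))
predictor = bagged_model([1, 2, 3, 4, 5], train_mean)
```

In practice the base learner would be a tree; a random forest additionally restricts each tree to a random subset of features.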
17. Random Forests
• Build many trees, each uses only a randomly-chosen subset of features
• Different from bootstrap, which uses a random subset of records
• Combine all the results by averaging or voting
18. Neural Net
[Figure: a network with an input layer, hidden nodes, and an output layer.]
• A typical neural net consists of an input layer, some number of hidden
layers and an output layer.
• All the independent variables (x’s) are the input layer.
• The dependent variable y is the output layer.
• Each node in the hidden layer receives weighted signals from all the
nodes in the previous layer and does a nonlinear transform on this
linear combination of signals.
• The nonlinear transform/activation function can be a sigmoid-type function (hyperbolic tangent, logistic), or something else.
• The weights are trained by backpropagating the error, record by record.
• Data is shown to the neural net, record by record, and the weights are
slightly adjusted for each record.
• The entire data set is passed through many times, each complete pass
is called a training epoch.
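The forward pass described above, where each hidden node nonlinearly transforms a weighted sum of its inputs, can be sketched as (a minimal illustration, not from the slides; weights are given rather than trained, and backpropagation is omitted):

```python
import math

def forward(x, w_hidden, w_out):
    # Each hidden node takes a weighted sum of all inputs and applies a
    # nonlinear transform (here tanh); the output is a weighted sum of
    # the hidden activations
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
              for w in w_hidden]
    return sum(wo * h for wo, h in zip(w_out, hidden))
```

Training would backpropagate the output error to adjust `w_hidden` and `w_out` slightly for each record, over many epochs.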
19. Deep Learning Net
Deep Learning is a neural net architecture with more than one hidden layer (hence "deep"). It's not really new, but it wasn't very practical until hardware and algorithms improved sufficiently.
Convolutional Neural Nets (CNN) are deep NNs with loosely (locally) connected first layers followed by fully connected layers. They are designed to work well with images.
20. Support Vector Machine (SVM)
• Use a linear classifier separator, but with the data
projected to a higher dimension.
• Since all we care about is distance we can use a “kernel
trick” to construct a distance measure in an abstract higher
dimension.
• Also use the concept of “margin” to find a more robust
linear separator location.
• The separator is completely defined by the location of the
data points on the boundary, which are called the “support
vectors.”
[Figures: data that is not separable in the original space becomes separable in a higher dimension; margin optimization gives better separation.]
21. Each ML method Has An Objective Function and
Learning Rules
Objective function is typically a Loss (error) plus Regularization: E = Loss + Regularization
The loss/error term can be
• MSE: sum_i (y_i - f(x_i; a))^2
• Logistic error
The regularization term minimizes model complexity by keeping the adjustable parameters a small. Can use a variety of regularization forms:
• L1 norm (Lasso): lambda * sum_j |a_j|
• L2 norm: lambda * sum_j a_j^2
Each modeling method also has a set of training rules to adjust the parameters a, typically gradient descent with a learning rate eta: a_j <- a_j - eta * dE/da_j
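A minimal sketch of this loss-plus-regularization training rule for a straight-line model y = a0 + a1*x (not from the slides; the learning rate and regularization strength are arbitrary):

```python
def gradient_step(a, xs, ys, lr=0.1, lam=0.0):
    # One greedy (local gradient) step on E = MSE + L2 regularization
    # for the model y = a0 + a1*x
    a0, a1 = a
    n = len(xs)
    g0 = sum(2 * (a0 + a1 * x - y) for x, y in zip(xs, ys)) / n
    g1 = sum(2 * (a0 + a1 * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * a1
    return (a0 - lr * g0, a1 - lr * g1)

# Fit y ~ 2x by repeated gradient steps
a = (0.0, 0.0)
for _ in range(500):
    a = gradient_step(a, [0, 1, 2], [0, 2, 4])
```

With lam > 0 the fitted slope is shrunk slightly toward zero, trading a little training error for lower model complexity.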
22. How To Choose Which Method to Use
• Do you have labeled data? supervised vs. unsupervised
• Nature of output to be predicted – categorical, binary, continuous
• Dimensionality, noise, nature of raw data/features/variables
• Try several methods
• Start simple (linear/logistic regression), increase complexity
• Always strive to minimize dimensionality. In high dimensions:
• All points become outliers (points become closer to outside boundaries than center)
• Everything looks linear (# data points required to see nonlinearity increases exponentially)
23. Summary
• A statistical model is a functional relationship y=f(x) found using data
• Modeling process:
• Prepare data: build expert variables, encode categoricals,
scale, missing values, feature selection
• Build models: training/testing to avoid overfitting,
measures of goodness
• There are a lot of nonlinear modeling methods. Understand them before you use them.
• Always seek simplicity
• Minimize dimensionality via feature selection. Data is always sparse in high dimensions
• Minimize complexity. Start simple
…
25. What To Do When You Don’t Have Tags?
[Figure: scatter of # address links vs. # apps in last 3 months; anomalies/outliers (people/events with unusual behavior) flagged as possible frauds.]
• How can you build a model to predict something when you don’t have examples of it? In this
situation you can build an unsupervised model
• Unsupervised modeling examines how the data points are distributed in the input space only
(there is no output/label)
• You simply look at how the data points scatter. Can you find clusters? Can you see outliers?
• Clusters: groups of people or events that are similar in nature
• Outliers: unusual, anomalous events – potential frauds
• Can work well for fraud models where you’re looking for anomalies
• Hardest part is figuring out what variables (features, dimensions) to examine
26. Principal Component Analysis and Regression (PCA, PCR)
• PCA finds the dominant directions in the data
• The PC’s are orthogonal, and ordered by the
variance (spread) in their direction
• The magnitude of each PC is its eigenvalue
• The eigenvalues are proportional to the variance in that PC's direction
One can look at the decay of the eigenvalues and select a subset of the PCs as a subspace to work in.
One can then regress in this PC subspace. That’s PCR.
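For 2-d data the eigenvalues of the covariance matrix can be computed by hand, showing how PCA orders directions by variance (a minimal sketch, not from the slides; restricted to two dimensions so no linear-algebra library is needed):

```python
import math

def pca_2d(points):
    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]];
    # each eigenvalue is the variance along its principal direction
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return tr / 2 + disc, tr / 2 - disc  # (largest, smallest)
```

For points lying exactly on a line, the second eigenvalue is zero: all the variance lives in the first PC, so one dimension can be dropped.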
27. Modeling Method – k Nearest Neighbor (kNN)
• Put all data into a selected subspace
(variable creation and selection)
• For any new (test) record, see where it falls in this space; grow a sphere around it until it contains the "k nearest neighbors"
• Return the average value for these k
records as the predicted value
• Selecting the right features is important
• Requires no training, but can be slow for
test/go-forward use
[Figure: the predicted value for the test point is the average over its k nearest neighbors; the example shown uses ozone levels.]
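The kNN prediction rule above fits in a few lines (a minimal sketch, not from the slides; the toy points are invented, and sorting all records each query is what makes kNN slow at test time):

```python
def knn_predict(train, query, k=3):
    # Sort training records by squared distance to the query point and
    # return the average target value of the k nearest
    nearest = sorted(train,
                     key=lambda r: sum((a - b) ** 2
                                       for a, b in zip(r[0], query)))
    return sum(y for _, y in nearest[:k]) / k

# Records as ((features...), target) pairs
points = [((0, 0), 1), ((1, 0), 1), ((0, 1), 1), ((5, 5), 0), ((6, 5), 0)]
```

There is no training step at all; all the work happens at prediction time, as the slide notes.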
28. Modeling Method – K-Means, Fuzzy C-Means, SOM
Unsupervised modeling method – finds natural groupings of data points in
independent variable (x) space
29. Mahalanobis Distance
[Figure: scatter with axes x1 = distance from home, x2 = time since last event, showing a red and a green point.]
Which point is more anomalous, the red or the green?
The Mahalanobis distance takes into account the different scales and correlations. It draws equal contours by scaling based on the correlations and the different standard deviations. This is also called "whitening" the data (without rotation).
The Mahalanobis distance from the center is the same as first transforming to the Principal Component space and then z scaling (dividing each PC coordinate by the square root of its eigenvalue).
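For two dimensions the Mahalanobis distance can be written out explicitly with the inverse covariance matrix (a minimal sketch, not from the slides; restricted to 2-d so the matrix inverse is a closed form):

```python
import math

def mahalanobis_2d(point, mean, cov):
    # d^2 = (x - mu)^T C^{-1} (x - mu) for a symmetric 2x2 covariance
    # matrix C = [[a, b], [b, d]]; inverse is [[d, -b], [-b, a]] / det
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    d2 = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(d2)
```

With the identity covariance this reduces to ordinary Euclidean distance; with a large variance along one axis, distance along that axis counts for less, which is exactly the "whitening" effect described above.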
30. Another Unsupervised Model Method
• Use an Autoencoder:
• This is nonlinear and thus more general than the previous
methods (PCA, Mahalanobis), which are linear
[Figure: an autoencoder maps original data records through the network and back to reproduced data records.]
• Train the model to reproduce the original data
• The reproduction error is a measure of the
record’s unusualness, and is thus a fraud
score
31. Some Fuzzy Matching Algorithms for Individual Field Values
• Levenshtein edit distance: minimum # single character edits (insertion, deletion,
substitutions)
• Damerau-Levenshtein: Levenshtein plus transpositions
• Jaro-Winkler: single adjacent character transpositions
• Hamming: % characters that are different (substitutions only)
• N grams: a sliding window of n characters. Can count how many are the same between two
fields.
• Soundex: a phonetic (”sounds like”) comparison
• Stemming, root match: determine the root of the word and compare by roots
• Field specific, ad hoc
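The Levenshtein edit distance at the top of this list is the classic dynamic-programming exercise; a minimal sketch (not from the slides):

```python
def levenshtein(s, t):
    # Minimum number of single-character edits (insertions, deletions,
    # substitutions) turning s into t, using a rolling DP row
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # delete from s
                           cur[j - 1] + 1,       # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute
        prev = cur
    return prev[-1]
```

Damerau-Levenshtein would add a fourth case for adjacent transpositions.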
32. Some Fuzzy Matching Algorithms for Columns
• Linear correlation
• Divergence
• Kullback-Leibler distance (KL)
• Kolmogorov-Smirnov distance (KS)
33. Some Multidimensional Distance Metrics (Rows)
• Euclidean (L-2) (only useful after proper scaling)
• Mahalanobis: z scaling then Euclidean
• Minkowski (L-n)
• Manhattan: distance (L-1) traversed along axes
• Chebyshev (L-infinity): max along any axis
34. Glossary 1
Model – A representation of something. Models can be physical (e.g., a model airplane, a person wearing new clothes) or
mathematical. They can be dynamic (models of processes) or static (no time evolution). Mathematical models could be based
on first principles (e.g., solving differential equations), expert knowledge (e.g., a priori rule systems), or based on data (statistical
models). Statistical models are the ones built in the discipline called data science, machine learning, etc.
Input, x, variable, dimension, feature, independent variables – The numbers that are the predictors or inputs to the functional
form of the model y = f(x).
Output, tag, label, y, target, dependent variable – The quantity or category that you’re trying to predict with a model. It could be a
continuous value (regression) or a categorical value (classification). Many times for classification the target/output is binary:
yes/no, 0/1, good/bad.
Supervised modeling – Frequently in modeling one has an identified output that one is trying to fit/predict for each record. The
output or dependent variable is the target of the model, and we try to learn the functional relationship between the inputs and
this output. This is called supervised learning since the learning algorithm is supervised during training by constantly looking at
the error between the predicted and actual outputs.
Unsupervised modeling – Here we don’t have an output or dependent variable, all we are given is a set of independent
variables. There are several things we can do with independent variables only, for example we can
• Look for interrelationships between the variables, either linear (e.g. PCA) or nonlinear
• Look for macro structure in the given data in input/independent variable space alone. Methods include PCA and
clustering.
• Look for outliers or anomalies in the records, looking only at the inputs/independent variables. This is frequently a task
in building fraud models in situations when you have no labeled data, that is, no records that have already been
determined to be fraud. Generally you look for what is normal in the data and then look for outliers to this typical data.
35. Glossary 2
Data science, machine learning, statistical modeling, data mining, predictive analytics, data analytics – Roughly the same
meaning but many people ascribe some subtle differences between these terms. They all relate to the process of building
statistical models from sets of data.
Variable creation, feature engineering, variable encoding – the process of constructing thoughtful inputs to a model. Typically
one examines the fields in the raw data, does analysis, cleaning, standardization, and then transformations and combinations
of these fields to create special variables that are candidate inputs to models. Examples of this process are encoding of
categorical variables, risk tables, z-scaling, other normalizations and outlier suppression, nonlinear transformations such as
taking the log or binning, construction of ratios or products of fields.
Feature selection, variable reduction, dimensionality reduction – The process of reducing the number of inputs to a model by
considering which inputs are the most important to the model. Most modeling methods degrade when presented with more
inputs than is needed for a robust, stable model.
Model validation – Typically one separates the data into multiple sets to ensure that the model is robust. In good modeling
practices one reserves a set of data that is never used during training and testing that is used for evaluation of the model on
data that it has never seen before. Frequently we reserve this holdout sample from a time that was not used during training
and testing, and that is called out of time validation.
Boosting – an iterative procedure to create a series of weak models, where the final model is then a linear combination of this
series of weak models. Generally the next model in the series is trained on a weighted data set where the records with the
largest error so far are more heavily weighted.
Bagging – the term comes from “bootstrap aggregation.” Technique to improve model stability and accuracy. It
combines/aggregates the outcomes of many models, each having been built via bootstrap sampling.
36. Glossary 3
Cross validation – The process of splitting data into many separate bunches. We train on all but one of the bunches, test on the
remaining bunch, and then continue by designating a different bunch as the testing data set and training on all the others. We get an
ensemble of model performance measures which we can average. Cross validation uses all records once per iteration whereas
bootstrap will use records a random # of times.
Bootstrap – A method of selecting training data sets that are a subset of the data through random sampling of the records many times,
always with replacement. Provides estimates for the variance of standard statistical measures that would otherwise be point estimates
without bootstrap.
SOM and K-Means – The difference is that SOM takes into account the distance of each input from all of the reference vectors, not just the closest one, weighted by the neighborhood kernel h. Thus the SOM functions as a conventional clustering algorithm when the width of the neighborhood kernel is zero.
Random Forests – Using an ensemble of trees and averaging the result across all the trees. Each tree is built from a bootstrap (bagged) sample of the records and considers a random subset of features from the entire feature set.
Reject inference – The process of inferring the labels on records that were rejected at some point in the process, where the actual
outcome isn’t known. This is done so that the model is hardened, or made more robust, to this larger population that it will see when
implemented (the entire TTD, Through The Door, population).
Goods, Bads – When we are doing a 2-class classification problem (as opposed to a regression where the output is continuous) we
frequently refer to one class as Goods and the other as Bads. For fraud problems the Bads are the records labeled as fraud and the
Goods are the records labeled as non fraud.
False Positives – (FP) Goods that have been incorrectly classified as Bads.
True Positives – (TP) Bads correctly classified as Bad.
True Negatives – (TN) Goods correctly classified as Good.
False Negative – (FN) Bads that have been incorrectly classified as Goods.
False Positive Rate is FPR = #FP / #examined = #FP / (#FP + #TP).
False Positive Ratio = #FP / #TP.