Robert Grossman and Collin Bennett of the Open Data Group discuss building and deploying big data analytic models. They describe the life cycle of a predictive model from exploratory data analysis to deployment and refinement. Key aspects include generating meaningful features from data, building and evaluating multiple models, and comparing models through techniques like confusion matrices and ROC curves to select the best performing model.
6. Life Cycle of Predictive Model
• Exploratory data analysis (R); process the data (Hadoop)
• Build model in dev/modeling environment
• Deploy model in operational systems with a scoring application (Hadoop, streams & Augustus)
• Refine model
• Retire model and deploy improved model
[Diagram: the analytic modeling and operations environments, with PMML models and log files exchanged between them.]
8. Key Modeling Activities
• Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
• Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
• Building models – goal is to derive statistically valid models from the learning set
• Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another
9. Top 10 Algorithms in Data Mining
?    Naïve Bayes
?    K nearest neighbor
1957 K-means
1977 EM Algorithm
1984 CART
1993 C4.5
1994 A Priori
1995 SVM
1997 AdaBoost
1998 PageRank

Beware of any vendor whose unique value proposition includes some "secret analytic sauce."

Source: X. Wu et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, Volume 14, 2008, pages 1–37.
10. Pessimistic View: We Get a Significantly New Algorithm Every Decade or So
• 1970s: neural networks
• 1980s: classification and regression trees
• 1990s: support vector machines
• 2000s: graph algorithms
11. In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
12. Questions About Datasets
• Number of records
• Number of data fields
• Size
– Does it fit into memory?
– Does it fit on a single disk or disk array?
• How many missing values?
• How many duplicates?
• Are there labels (for the dependent variable)?
• Are there keys?
13. Questions About Data Fields
• Data fields
  – What is the mean, variance, …?
  – What is the distribution? Plot the histograms.
  – What are extreme values, missing values, unusual values?
• Pairs of data fields
  – What is the correlation? Plot the scatter plots.
• Visualize the data
  – Histograms, scatter plots, etc.
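The per-field questions above can be answered with a few lines of code. This is a minimal pure-Python sketch on a hypothetical toy field; in practice you would run the same checks in R or pandas on the real dataset.

```python
# Answering the slide's questions for one toy field (hypothetical data).
from statistics import mean, pvariance
from collections import Counter

amounts = [10.0, 12.5, 11.0, 250.0, 9.5, 12.5]

print("mean:", mean(amounts))
print("variance:", pvariance(amounts))

# Extreme values: inspect the tails
print("min/max:", min(amounts), max(amounts))

# Duplicates
dupes = [v for v, c in Counter(amounts).items() if c > 1]
print("duplicated values:", dupes)

# Crude text histogram (5 equal-width bins)
lo, hi, bins = min(amounts), max(amounts), 5
width = (hi - lo) / bins
hist = Counter(min(int((v - lo) / width), bins - 1) for v in amounts)
for b in range(bins):
    print(f"[{lo + b*width:7.1f}, {lo + (b+1)*width:7.1f}) {'#' * hist[b]}")
```

The histogram immediately exposes the extreme value (250.0) that the mean alone would hide.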
17. Computing Count Distributions

def valueDistrib(b, field, select=None):
    """Given a table b (a sequence of records) and an attribute field,
    return the count distribution of that field's values.
    If select is given, only those row indices are counted."""
    ccdict = {}
    rows = select if select is not None else range(len(b))
    for i in rows:
        ccdict[b[i][field]] = ccdict.get(b[i][field], 0) + 1
    return ccdict
18. Features
• Normalize – to [0, 1] or [-1, 1]
• Aggregate
• Discretize or bin
• Transform continuous to continuous or discrete to discrete
• Compute ratios
• Use percentiles
• Use indicator variables
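Several of the transforms listed above can be sketched in a few lines. The field names and values here are hypothetical, chosen only to make each transform visible.

```python
# A sketch of some feature transforms from the slide (toy data).
def minmax(xs):
    """Normalize a numeric field to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def to_bins(xs, edges):
    """Discretize: count how many edges each value meets or exceeds."""
    return [sum(x >= e for e in edges) for x in xs]

def indicator(xs, value):
    """Indicator variable: 1 if the field equals `value`, else 0."""
    return [1 if x == value else 0 for x in xs]

amounts = [5.0, 10.0, 20.0, 40.0]
states = ["IL", "NY", "IL", "CA"]
counts = [1, 2, 4, 4]

print(minmax(amounts))             # -> [0.0, 0.142857..., 0.428571..., 1.0]
print(to_bins(amounts, [10, 30]))  # -> [0, 1, 1, 2]
print(indicator(states, "IL"))     # -> [1, 0, 1, 0]
# Ratios: e.g. amount per transaction count
print([a / c for a, c in zip(amounts, counts)])
```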
19. Predicting Office Building Renewal Using Text-based Features

Classification   Avg R      Classification   Avg R
imports          5.17       technology       2.03
clothing         4.17       apparel          2.03
van              3.18       law              2.00
fashion          2.50       casualty         1.91
investment       2.42       bank             1.88
marketing        2.38       technologies     1.84
oil              2.11       trading          1.82
air              2.09       associates       1.81
system           2.06       staffing         1.77
foundation       2.04       securities       1.77
20. Twitter Intention & Prediction
• Use machine learning classification techniques:
  – Support Vector Machines
  – Trees
  – Naïve Bayes
• Models require clean data and lots of hand-scored examples
21. SVM vs. Naïve Bayes vs. Trees
• Namibia floods classified using the R package e1071.
• Results are often implementation dependent.
22. If You Can Ask For Something
• Ask for more data
• Ask for “orthogonal data”
24. Trees
For CART trees: L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
25. Trees Partition Feature Space
• Trees partition the feature space into regions by asking whether an attribute is less than a threshold.
[Diagram: splits on M < 2.45, M < 4.95, and R < 1.75 partition the (M, R) plane into regions labeled A, B, and C.]
26. Split Using Entropy
[Diagram: a set of objects with attributes and information = 0.64, divided by four candidate splits; one split gives an increase in information of 0.32, the other an increase of 0.]
27. Growing the Tree
Step 1. Class proportions. Node u contains n objects: n1 of class A (red), n2 of class B (blue), etc.
Step 2. Entropy: I(u) = - Σj (nj / n) log(nj / n)
Step 3. Split proportions: m1 objects are sent to child 1 (node u1), m2 to child 2 (node u2), etc.
Step 4. Choose the attribute that maximizes Δ = I(u) - Σj (mj / n) I(uj)
[Diagram: a node u of red and blue objects split into children u1 and u2.]
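The entropy and gain formulas in steps 2 and 4 translate directly into code. A minimal sketch on a hypothetical two-class node (using log base 2, one common convention):

```python
# The entropy I(u) and information gain from the slide, as runnable Python.
from math import log2
from collections import Counter

def entropy(labels):
    """I(u) = -sum_j (n_j / n) * log2(n_j / n)"""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(parent, children):
    """Delta = I(u) - sum_j (m_j / n) * I(u_j)"""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

node = ["red"] * 2 + ["blue"] * 3
print(entropy(node))
# A split that separates the classes perfectly gains the full entropy:
print(gain(node, [["red", "red"], ["blue", "blue", "blue"]]))
```

A pure child node has entropy 0, so a perfect split recovers all of I(u); a split that leaves the class mix unchanged gains nothing.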
31. Ensembles
Ensembles are unreasonably effective at building models. Average the component models for regression trees; use majority voting for classification.
f(x) = ( f1(x) + f2(x) + … + f64(x) ) / 64
33. Ensemble Models
1. Partition data and scatter
2. Build models (e.g. tree-based models), one per partition
3. Gather models
4. Form the collection of models into an ensemble (e.g. majority vote for classification, averaging for regression)
34. Example: Fraud
• Ensembles of trees have proved remarkably effective in detecting fraud
• Trees can be very large
• Very fast to score
• Lots of variants
  – Random forests
  – Random trees
• Sometimes need to reverse engineer reason codes
36. CUSUM
Assume that we have a mean and standard deviation for both the baseline and the observed distributions.
  f0(x) = baseline density
  f1(x) = observed density
  g(x) = log( f1(x) / f0(x) )
  Z0 = 0
  Zn+1 = max{ Zn + g(xn+1), 0 }
  Alert if Zn+1 > threshold
[Diagram: baseline density f0(x) and observed density f1(x).]
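The CUSUM recursion above is a few lines of code once f0 and f1 are fixed. This sketch assumes Gaussian densities with hypothetical means and a hypothetical threshold of 5.0.

```python
# The CUSUM recursion from the slide, with Gaussian baseline/observed densities.
from math import log, exp, sqrt, pi

def gaussian(mu, sigma):
    return lambda x: exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

f0 = gaussian(0.0, 1.0)  # baseline density
f1 = gaussian(2.0, 1.0)  # observed density (shifted mean)

def cusum(xs, threshold=5.0):
    """Z_{n+1} = max(Z_n + log(f1(x)/f0(x)), 0); alert when Z exceeds threshold."""
    z, alerts = 0.0, []
    for n, x in enumerate(xs):
        z = max(z + log(f1(x) / f0(x)), 0.0)
        if z > threshold:
            alerts.append(n)
    return alerts

stream = [0.1, -0.2, 0.0, 0.3] + [2.1, 1.9, 2.2, 2.0, 1.8]  # shift at index 4
print(cusum(stream))  # -> [6, 7, 8]
```

Note how the max{…, 0} clamp keeps pre-shift noise from accumulating, so the statistic only rises once the observed mean actually shifts.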
37. Cubes of Models (Segmented Models)
• Build a separate segmented model for every entity of interest
• Divide & conquer the data (segment) using multidimensional data cubes, e.g. entity (bank, etc.) × geospatial region × type of transaction
• For each distinct cube, establish separate baselines for each quantity of interest
• Detect changes from baselines
38. Example – Change Detection Using Cubes of Models
• Highway traffic data
  – each day (7) × each hour (24) × each sensor (hundreds) × each weather condition (5) × each special event (dozens)
  – 50,000 baseline models used in the current testbed
39. Multiple Models in Hadoop: Trees in the Mapper
• Build lots of small trees (over historical data in Hadoop)
• Load all of them into the mapper
• Mapper
  – For each event, score against each tree
  – Map emits key = event id, value = (tree id, score)
• Reducer
  – Aggregate scores for each event
  – Select a "winner" for each event
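The mapper/reducer pattern above can be simulated outside Hadoop. This sketch uses hypothetical threshold "stumps" in place of real trees, and takes the mean score as the per-event "winner"; in production this would run as Hadoop streaming jobs.

```python
# Simulating the slide's map/reduce scoring pattern in plain Python.
from collections import defaultdict

# Hypothetical stand-ins for small trees: amount-threshold stumps.
trees = [lambda e, t=t: 1.0 if e["amount"] > t else 0.0 for t in (10, 50, 100)]

def mapper(events):
    # For each event, score against each tree; emit (event_id, (tree_id, score)).
    for e in events:
        for tree_id, tree in enumerate(trees):
            yield e["id"], (tree_id, tree(e))

def reducer(pairs):
    # Aggregate scores per event; here the "winner" is the mean score.
    scores = defaultdict(list)
    for event_id, (_, score) in pairs:
        scores[event_id].append(score)
    return {eid: sum(s) / len(s) for eid, s in scores.items()}

events = [{"id": "e1", "amount": 75}, {"id": "e2", "amount": 5}]
result = reducer(mapper(events))
print(result)  # e1 passes 2 of 3 stumps, e2 passes none
```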
41. The Take Home Message
• Given two predictive models, develop a methodology to determine which is better.
• Get a predictive model up in development quickly, no matter how bad. Call it the Champion.
• Two methodologies
  – Champion–Challenger
  – Automated testing and deployment environment
42. Champion–Challenger Methodology
• Develop a new model each week. Call it the Challenger.
• Meet each week to discuss the performance of the models. Compare the two models and keep the better one as the new Champion.
43. Automated Testing & Deployment Environment
• Develop a custom environment for the large-scale testing and deployment of many (tens, hundreds, thousands) of different models.
• Develop an appropriate experimental design.
• Test on a small scale.
• Deploy those that test well on a large(r) scale.
• Continuously improve the design and automation of the process.
• Think of it as an automated scale-up of A/B testing.
44. Have Weekly Meetings
• Have the modeler, the business owner, and an engineer from the operations side meet once a week.
• Discuss misclassifications.
• Discuss the business value from the alerts.
• Avoid jargon (detection rate, false positives, etc.) whenever possible.
• Instead, always use visualizations.
45. 2-Class Confusion Matrix
There are two rates:
  true positive rate  = TPrate = #TP / #P
  false positive rate = FPrate = #FP / #N

True class        Predicted positive   Predicted negative
positive (#P)     #TP                  #P - #TP
negative (#N)     #FP                  #N - #FP
46. Confusion Matrix
• Recall the TP and FP rates are defined by:
  – TPrate = TP / P = recall
  – FPrate = FP / N
• Precision and accuracy are defined by:
  – Precision = TP / (TP + FP), where TP + FP is the number of cases the classifier assigns to the positive class
  – Accuracy = (TP + TN) / (P + N)
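All four measures follow directly from the confusion matrix counts. A minimal sketch, using a hypothetical matrix (TP = 80, FP = 30, P = 100, N = 200):

```python
# The four confusion-matrix measures from the slides.
def measures(tp, fp, p, n):
    tn = n - fp  # negatives correctly rejected
    return {
        "TPrate (recall)": tp / p,
        "FPrate": fp / n,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
    }

m = measures(tp=80, fp=30, p=100, n=200)
print(m)
```

With these counts, recall is 0.8 while precision is only 80/110 ≈ 0.73, a reminder that the two rates tell different stories about the same classifier.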
47. Precision and Accuracy
• TPrate = TP / P = recall; FPrate = FP / N
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (P + N)
• Graph TPrate against FPrate to produce the ROC curve.
48. Why Visualize Performance?
• A single number tells only part of the story.
  – There are fundamentally two numbers of interest (FP and TP), which cannot be summarized by a single number.
  – How are errors distributed across the classes?
• The shapes of the curves are much more informative.
50. ROC Plot for the 3 Classifiers
[Plot: true positive rate vs. false positive rate; the perfect classifier sits at the top-left corner, the random classifier along the diagonal.]
51. ROC Curve
• Recall the TP and FP rates are defined by:
– TPrate = TP / P
– FPrate = FP / N
• The ROC Curve is defined by:
– x = FPrate
– y = TPrate
52. Creating an ROC Curve
• A classifier produces a single ROC point.
• If the classifier has a threshold or sensitivity parameter, varying it produces a series of ROC points (confusion matrices).
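The threshold sweep described above can be sketched directly: score the examples, then lower the threshold step by step, recording (FPrate, TPrate) at each setting. The scores and labels here are hypothetical.

```python
# Tracing out ROC points by sweeping a score threshold (toy data).
def roc_points(scores, labels):
    p = sum(labels)          # number of positives
    n = len(labels) - p      # number of negatives
    pts = []
    for thresh in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thresh and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thresh and y == 0)
        pts.append((fp / n, tp / p))  # (x, y) = (FPrate, TPrate)
    return pts

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
roc = roc_points(scores, labels)
print(roc)
```

Each threshold corresponds to one confusion matrix, so the curve is literally a family of confusion matrices plotted as points.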
54. Performance of a Random Classifier
• When there are an equal number of positives (P) and negatives (N), a random classifier is accurate about 50% of the time.
• The lift measures the relative performance of a classifier compared to a random classifier.
• Lift is especially important with imbalanced classes.
55. Lift Curve
• The Lift Curve is defined by:
– x = (TP + FP) / (P + N)
– y = TP
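The lift-curve coordinates defined above can be computed with the same threshold sweep used for the ROC curve. The scores and labels are again hypothetical.

```python
# Lift-curve points per the slide: x = (TP + FP) / (P + N), y = TP.
def lift_points(scores, labels):
    total = len(labels)
    pts = []
    for thresh in sorted(set(scores), reverse=True):
        flagged = [(s, y) for s, y in zip(scores, labels) if s >= thresh]
        tp = sum(y for _, y in flagged)
        pts.append((len(flagged) / total, tp))  # (fraction flagged, TP count)
    return pts

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
lift = lift_points(scores, labels)
print(lift)
```

Reading the curve: flagging the top 40% of scores already captures 2 of the 3 positives, versus the roughly 1.2 a random classifier would catch at that depth.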
56. Questions?
For the most current version of these slides,
please see tutorials.opendatagroup.com
57. About Open Data
• Open Data began operations in 2001 and has built predictive models for companies for over ten years.
• Open Data provides management consulting, outsourced analytic services, & analytic staffing.
• For more information:
  – www.opendatagroup.com
  – info@opendatagroup.com