Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 5: Basic Data Transformations and Feature Engineering. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
2. BigML, Inc 2
Basic Transformations
Making Data Machine Learning Ready
Poul Petersen
CIO, BigML, Inc
3. In a Perfect World…
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…
5. The Reality
[Diagram: CRM, Web Accounts, Transactions. ML Ready?]
6. Obstacles
• Data Structure (addressed by Data Transformation)
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value (addressed by Feature Engineering)
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance (addressed by Feature Selection)
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
7. The Process
• Define a clear idea of the goal.
  • Sometimes this comes later…
• Understand what ML tasks will achieve the goal.
• Transform the data
  • where is it, how is it stored?
  • what are the features?
  • can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: Try it on a small scale
  • Accept that you might have to start over…
  • But when it works, automate it!
9. BigML Tasks
Goal | ML Task
Will this customer default on a loan? | Classification
How many customers will apply for a loan next month? | Regression
Is the consumption of this product unusual? | Anomaly Detection
Is the behavior of the customers similar? | Cluster Analysis
Are these products purchased together? | Association Discovery
15. ML Ready Data
Tabular Data (rows and columns): rows are Instances, columns are Fields (Features).
• Each row
  • is one instance.
  • contains all the information about that one instance.
• Each column
  • is a field that describes a property of the instance.
16. Data Labeling
Unsupervised Learning
• Anomaly Detection
• Clustering
• Association Discovery
Supervised Learning
• Classification
• Regression
The only difference, in terms of ML-Ready structure, is the presence of a "label".
17. Data Labelling
Data is often not labeled. Create labels with a transformation. This can be done at the Feature Engineering step as well.

Unlabelled data:
Name | Month - 3 | Month - 2 | Month - 1
Joe Schmo | 123.23 | 0 | 0
Jane Plain | 0 | 0 | 0
Mary Happy | 0 | 55.22 | 243.33
Tom Thumb | 12.34 | 8.34 | 14.56

Labelled data:
Name | Month - 3 | Month - 2 | Month - 1 | Default
Joe Schmo | 123.23 | 0 | 0 | FALSE
Jane Plain | 0 | 0 | 0 | TRUE
Mary Happy | 0 | 55.22 | 243.33 | FALSE
Tom Thumb | 12.34 | 8.34 | 14.56 | FALSE
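
The slide does not state the rule that produces the Default column. As a minimal sketch, assume a customer is labelled as a default when no payment was recorded in any of the three months; this assumed rule happens to reproduce the labelled table above, but it is only an illustration. The function name `add_default_label` is hypothetical.

```python
def add_default_label(rows):
    """Return copies of the rows with a boolean 'Default' field appended."""
    labelled = []
    for row in rows:
        payments = (row["Month - 3"], row["Month - 2"], row["Month - 1"])
        # Assumed rule (not given in the slides): no payment at all
        # in the three-month window means the customer defaulted.
        labelled.append({**row, "Default": all(p == 0 for p in payments)})
    return labelled


# Unlabelled data copied from the slide.
customers = [
    {"Name": "Joe Schmo", "Month - 3": 123.23, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Jane Plain", "Month - 3": 0, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Mary Happy", "Month - 3": 0, "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb", "Month - 3": 12.34, "Month - 2": 8.34, "Month - 1": 14.56},
]

labelled_customers = add_default_label(customers)
for row in labelled_customers:
    print(row["Name"], row["Default"])
```

In practice the labelling rule comes from the business definition of the target, so the condition inside the loop is the part to replace.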
18. SF Restaurants Example
https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
create database sf_restaurants;
use sf_restaurants;
-- One table per source CSV file; the first line of each file is a header and is skipped.
create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table violations (business_id int, vdate varchar(8), description varchar(1000));
load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
20. Data Cleaning
Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.

Original data:
Name | Date | Duration (s) | Genre | Plays
Highway star | 1984-05-24 | - | Rock | 139
Blues alive | 1990/03/01 | 281 | Blues | 239
Lonely planet | 2002-11-19 | 5:32s | Techno | 42
Dance, dance | 02/23/1983 | 312 | Disco | N/A
The wall | 1943-01-20 | 218 | Reagge | 83
Offside down | 1965-02-19 | 4 minutes | Techno | 895
The alchemist | 2001-11-21 | 418 | Bluesss | 178
Bring me down | 18-10-98 | 328 | Classic | 21
The scarecrow | 1994-10-12 | 269 | Rock | 734

Cleaned data:
Name | Date | Duration (s) | Genre | Plays
Highway star | 1984-05-24 | | Rock | 139
Blues alive | 1990-03-01 | 281 | Blues | 239
Lonely planet | 2002-11-19 | 332 | Techno | 42
Dance, dance | 1983-02-23 | 312 | Disco |
The wall | 1943-01-20 | 218 | Reagge | 83
Offside down | 1965-02-19 | 240 | Techno | 895
The alchemist | 2001-11-21 | 418 | Blues | 178
Bring me down | 1998-10-18 | 328 | Classic | 21
The scarecrow | 1994-10-12 | 269 | Rock | 734

update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;
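
The cleaning shown in the song table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling used in the lecture: the list of accepted date formats, the set of missing-value markers, and the function names `clean_date`, `clean_duration`, and `clean_plays` are all hypothetical and cover only the variants that appear in the table.

```python
import re
from datetime import datetime

# Assumed input variants, taken from the "Original data" table only.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
MISSING_MARKERS = {"", "-", "n/a"}


def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(raw):
    """Return the duration in whole seconds, or None when it is missing."""
    text = raw.strip().lower()
    if text in MISSING_MARKERS:
        return None
    match = re.fullmatch(r"(\d+):(\d{2})s?", text)  # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+)\s*minutes?", text)  # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return int(text) if text.isdigit() else None


def clean_plays(raw):
    """Return the play count as an integer, or None when it is missing."""
    text = raw.strip().lower()
    return None if text in MISSING_MARKERS else int(text)


print(clean_date("18-10-98"), clean_duration("5:32s"), clean_plays("N/A"))
```

Returning None for every missing marker is what homogenizes "-" and "N/A" into a single representation; fixing misspelled categories such as "Bluesss" would need a separate lookup and is left out of this sketch.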
22. Define a Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
  • This is a classification problem
• Based on business profile:
  • Description: kitchen, cafe, etc.
  • Location: zip, latitude, longitude
23. Denormalizing
Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized in a single dataset.
[Diagram: business, inspections, and violations are joined into scores. Labels: Instances, Features (millions).]
ML-Ready: each row contains all the information about that one instance.

Denormalize:
create table scores select * from businesses left join inspections using (business_id);
create table scores_last select a.* from scores as a JOIN (select business_id,max(idate) as idate from scores group by business_id) as b where a.business_id=b.business_id and a.idate=b.idate;

Add Label:
create table scores_last_label select scores_last.*, Description as score_label from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;
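
The three statements above (join, keep the latest inspection, attach the label) can be tried on a toy dataset. The following sketch uses Python's built-in sqlite3 module rather than MySQL, so the syntax differs slightly (for example `CREATE TABLE ... AS SELECT`). The sample rows and the legend ranges are invented for illustration and are not taken from the real SF data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE businesses (business_id INTEGER, name TEXT);
CREATE TABLE inspections (business_id INTEGER, score INTEGER, idate TEXT);
CREATE TABLE legend (Minimum_Score INTEGER, Maximum_Score INTEGER, Description TEXT);

-- Hypothetical sample rows, for illustration only.
INSERT INTO businesses VALUES (1, 'Cafe A'), (2, 'Kitchen B');
INSERT INTO inspections VALUES
    (1, 72, '20130105'), (1, 94, '20130820'), (2, 65, '20130311');
INSERT INTO legend VALUES
    (0, 70, 'Poor'), (71, 85, 'Needs Improvement'),
    (86, 90, 'Adequate'), (91, 100, 'Good');

-- Denormalize: one row per (business, inspection) pair.
CREATE TABLE scores AS
    SELECT * FROM businesses LEFT JOIN inspections USING (business_id);

-- Keep only the most recent inspection of each business.
CREATE TABLE scores_last AS
    SELECT a.* FROM scores AS a
    JOIN (SELECT business_id, MAX(idate) AS idate
          FROM scores GROUP BY business_id) AS b
      ON a.business_id = b.business_id AND a.idate = b.idate;

-- Add the label by matching the score against the legend ranges.
CREATE TABLE scores_last_label AS
    SELECT scores_last.*, Description AS score_label
    FROM scores_last
    JOIN legend ON score <= Maximum_Score AND score >= Minimum_Score;
""")

labelled_rows = conn.execute(
    "SELECT name, score, score_label FROM scores_last_label ORDER BY business_id"
).fetchall()
print(labelled_rows)
```

Because the dates are stored as YYYYMMDD strings, `MAX(idate)` picks the latest inspection by plain string comparison, which is the same property the MySQL version relies on with its varchar(8) column.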
25. Structuring Output
After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
• A CSV file uses plain text to store tabular data.
• In a CSV file, each row of the file is an instance.
• Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
• Each row must contain the same number of fields
  • but they can be null
• Fields can be quoted using double quotes (").
• Fields that contain commas or line separators must be quoted.
• Quotes (") in fields must be doubled ("").
• The character encoding must be UTF-8
• Optionally, a CSV file can use the first line as a header to provide the names of each field.
select * from scores_last_label into outfile "./scores_last_label.csv";
select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, score_label from scores_last_label into outfile "./scores_last_label_headers.csv" ;
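
As an illustration of the quoting rules listed above, the sketch below writes a header and two rows with Python's standard csv module, which quotes fields containing separators and doubles embedded quotes by default. The helper name `rows_to_csv` and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Every row must contain the same number of fields as the header.
        if len(row) != len(header):
            raise ValueError("row has a different number of fields than the header")
        writer.writerow(row)  # None is written as an empty (null) field
    return buffer.getvalue()


csv_text = rows_to_csv(
    ["name", "address", "score_label"],
    [
        ['Cafe "Uno"', "1 Main St, Suite 2", "Good"],  # quotes and a comma
        ["Plain Diner", None, "Poor"],                 # missing address
    ],
)
print(csv_text)
# To store the result on disk as UTF-8:
#   open(path, "w", encoding="utf-8", newline="").write(csv_text)
```

The first data row comes out as `"Cafe ""Uno""","1 Main St, Suite 2",Good`: only the fields that need quoting are quoted, and the inner quotes are doubled.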
27. Define a Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
  • This is a classification problem
• Based on business profile:
  • Description: kitchen, restaurant, etc.
  • Location: zip code, latitude, longitude
  • Number of violations, text of violations
28. Basic Transformations
Aggregating
When the entity to model is different from the provided data, an aggregation to get the entity might be needed.
Original data (list of playbacks)
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002  Smartphone
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone
Aggregated data (list of users)
User     Num.Playbacks  Total Time  Pref.Device
User001  3              830         Tablet
User002  1              218         Smartphone
User003  3              1019        TV
User005  2              521         Tablet
# Shell: count playbacks per user (column 5), skipping the header line.
# Note: cut and awk split on every comma, so this assumes no field contains a quoted comma.
tail -n+2 playlist.csv | cut -d',' -f5 | sort | uniq -c
# Shell: total duration (column 3) per user.
tail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
-- MySQL: raise the group_concat limit (1024 by default) before aggregating, so long violation texts are not truncated.
SET @@group_concat_max_len = 15000;
create table violations_aggregated select business_id, count(*) as violation_num, group_concat(description) as violation_txt from violations group by business_id;
create table scores_last_label_violations select * from scores_last_label left join violations_aggregated USING (business_id);
select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'violation_num', 'violation_txt', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, violation_num, violation_txt, score_label from scores_last_label_violations into outfile "./scores_last_label_violations_headers.csv";
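As a rough sketch, the playback-to-user aggregation from this slide can also be written in plain Python. The rows are copied from the playbacks table; the function name aggregate_by_user and the output keys are hypothetical.

```python
from collections import Counter, defaultdict

# (content, genre, duration, play_time, user, device), one row per playback.
playbacks = [
    ("Highway star", "Rock", 190, "2015-05-12 16:29:33", "User001", "TV"),
    ("Blues alive", "Blues", 281, "2015-05-13 12:31:21", "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "2015-05-13 14:26:04", "User003", "TV"),
    ("Dance, dance", "Disco", 312, "2015-05-13 18:12:45", "User001", "Tablet"),
    ("The wall", "Reggae", 218, "2015-05-14 09:02:55", "User002", "Smartphone"),
    ("Offside down", "Techno", 240, "2015-05-14 11:26:32", "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "2015-05-14 21:44:15", "User003", "TV"),
    ("Bring me down", "Classic", 328, "2015-05-15 06:59:56", "User001", "Tablet"),
    ("The scarecrow", "Rock", 269, "2015-05-15 12:37:05", "User003", "Smartphone"),
]


def aggregate_by_user(rows):
    """Collapse one-row-per-playback into one-row-per-user."""
    grouped = defaultdict(list)
    for _content, _genre, duration, _play_time, user, device in rows:
        grouped[user].append((duration, device))
    result = {}
    for user, plays in grouped.items():
        devices = Counter(device for _, device in plays)
        result[user] = {
            "num_playbacks": len(plays),
            "total_time": sum(duration for duration, _ in plays),
            # Most frequently used device for this user.
            "pref_device": devices.most_common(1)[0][0],
        }
    return result


users = aggregate_by_user(playbacks)
for user in sorted(users):
    print(user, users[user])
```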
30. Basic Transformations
Pivoting
Different values of a feature are pivoted to new columns in the result dataset.
Original data
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002  Smartphone
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone
Aggregated data with pivoted columns
User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
User001  3              830         Tablet       1      2          0              190    640        0
User002  1              218         Smartphone   0      0          1              0      0          218
User003  3              1019        TV           2      0          1              750    0          269
User005  2              521         Tablet       0      2          0              0      521        0
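A minimal sketch of the pivot, assuming the NP_ columns count playbacks per device and the TT_ columns sum duration per device (which matches the numbers in the table). The function name pivot_by_device is hypothetical.

```python
from collections import defaultdict

# (user, device, duration) triples taken from the original playbacks table.
playbacks = [
    ("User001", "TV", 190), ("User005", "Tablet", 281), ("User003", "TV", 332),
    ("User001", "Tablet", 312), ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240), ("User003", "TV", 418),
    ("User001", "Tablet", 328), ("User003", "Smartphone", 269),
]
DEVICES = ["TV", "Tablet", "Smartphone"]


def pivot_by_device(rows):
    """Turn each device value into its own NP_ (count) and TT_ (total time) column."""
    table = defaultdict(
        lambda: {f"{prefix}_{device}": 0 for prefix in ("NP", "TT") for device in DEVICES}
    )
    for user, device, duration in rows:
        table[user][f"NP_{device}"] += 1
        table[user][f"TT_{device}"] += duration
    return dict(table)


pivoted = pivot_by_device(playbacks)
for user in sorted(pivoted):
    print(user, pivoted[user])
```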
31. Basic Transformations
Time Windows
Create new features using values over different periods of time
[Figure: instances (millions) by features (thousands) tables, stacked along a time axis at t=1, t=2, t=3]
create table scores_2013 select a.business_id, a.score as score_2013, a.idate as idate_2013 from inspections as a JOIN (select business_id, max(idate) as idate from inspections where substr(idate,1,4) = "2013" group by business_id) as b on a.business_id = b.business_id and a.idate = b.idate;
-- scores_2014 is assumed to be built the same way, filtering on "2014".
create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);
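The same time-window idea, the last inspection score per business per year, might look like this in Python. The inspection rows and the YYYYMMDD date format are assumptions for illustration; the function names are hypothetical.

```python
# Hypothetical inspections: (business_id, idate as a YYYYMMDD string, score).
inspections = [
    (10, "20130412", 92),
    (10, "20131105", 88),
    (10, "20140620", 95),
    (24, "20140301", 70),
]


def last_score_per_year(rows, year):
    """Return {business_id: (score, idate)} for the latest inspection in `year`."""
    latest = {}
    for business_id, idate, score in rows:
        if idate[:4] != year:
            continue
        if business_id not in latest or idate > latest[business_id][1]:
            latest[business_id] = (score, idate)
    return latest


def scores_over_time(rows, years):
    """One row per business with a score_<year> feature per time window."""
    businesses = sorted({business_id for business_id, _, _ in rows})
    per_year = {year: last_score_per_year(rows, year) for year in years}
    return {
        business_id: {
            # None plays the role of the NULL produced by a left join.
            f"score_{year}": per_year[year].get(business_id, (None, None))[0]
            for year in years
        }
        for business_id in businesses
    }


features = scores_over_time(inspections, ["2013", "2014"])
print(features)
```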
33. Basic Transformations
Updates
Need a current view of the data, but new data only comes in batches of changes
[Figure: an instances by features table, updated by batches of changes on day 1, day 2, day 3]
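One way to keep a current view from batches of changes is a keyed upsert. The sketch below assumes each change is a (key, fields) pair and that fields=None marks a deletion; both conventions, and the loan records themselves, are hypothetical.

```python
def apply_batch(current, batch):
    """Apply one batch of changes to the current view, in place.

    Each change is (key, fields); fields=None means the record was deleted.
    """
    for key, fields in batch:
        if fields is None:
            current.pop(key, None)
        else:
            current.setdefault(key, {}).update(fields)
    return current


# Hypothetical loan records keyed by id, with one batch of changes per day.
view = {}
day1 = [("L1", {"status": "Current", "rate": 0.12}),
        ("L2", {"status": "Current", "rate": 0.09})]
day2 = [("L1", {"status": "Late"})]
day3 = [("L2", None), ("L3", {"status": "Current", "rate": 0.15})]

for batch in (day1, day2, day3):
    apply_batch(view, batch)

print(view)
```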
34. Basic Transformations
Streaming
Data only comes in single changes
[Figure: a data stream (Kafka, etc.) of single changes, turned from stream into batch to update the instances by features table]
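A stream of single changes can be grouped into batches before it is applied. This is a generic sketch that uses a plain Python iterable in place of a real stream consumer such as Kafka; the event shape and the function name batches are made up.

```python
from itertools import islice


def batches(stream, size):
    """Group an iterable of single changes into lists of at most `size` items."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stream of single change events.
events = ({"id": i, "value": i * i} for i in range(7))
grouped = list(batches(events, 3))
print([len(batch) for batch in grouped])
```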
35. Basic Transformations
Prosper Loan Life Cycle
[Diagram: Listings: Submit, Bids, then Cancelled / Withdraw / Expired or Funded. Loans: Current, Late, then Defaulted or Paid.]
Goal                                      ML Task
Q: Which new listings make it to funded?  Classification
Q: Which funded loans make it to paid?    Classification
Q: If funded, what will be the rate?      Regression
36. Basic Transformations
Prosper Example
[Diagram labels: Data Provided in XML; updates!!; fetch.sh ("curl", daily); import.py (XML); export.sh; bigml.sh; Model; Predict; Share in gallery; Status; LoanStatus; BorrowerRate; Denormalization with join]
37. Basic Transformations
Prosper Example
Tidbits and Lessons Learned…
• XML… yuck!
  • MongoDB has CSV export and is record-based, so it is easy to handle a changing data structure.
• Feature Engineering
  • There are 5 different classes of "bad" loans
  • Date cleanup
  • Type casting: floats and ints
• Would be better to track over time
  • number of late payments
  • compare predictions and actuals
• XML… yuck!
40. Basic Transformations
Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example
41. Basic Transformations
Summary
• Data is awful
  • Requires clean-up
  • Transformations
  • Consumes an enormous part of the effort in applying ML
• Techniques:
  • Denormalizing
  • Aggregating / Pivoting
  • Time windows / Streaming
• What a real workflow looks like and the tools required
42. Feature Engineering
Creating Features that Make Machine Learning Work
Poul Petersen
CIO, BigML, Inc
43. Feature Engineering
What is Feature Engineering?
Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
• This is really, really important - more than algorithm selection!
  • In fact, so important that BigML often does it automatically
• ML algorithms have no deeper understanding of data
  • Numerical: have a natural order, can be scaled, etc.
  • Categorical: have discrete values, etc.
  • The "magic" is the ability to find patterns quickly and efficiently
• ML algorithms only know what you tell/show them with data
  • Medical: kg and m, but BMI = kg/m² is better
  • Lending: Debt and Income, but DTI (debt-to-income ratio) is better
• Intuition can be risky: remember to prove it with an evaluation!
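The BMI case can be sketched as a derived feature in a few lines. The patient values and key names are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical patients with the two raw features.
patients = [{"kg": 70.0, "m": 1.75}, {"kg": 95.0, "m": 1.80}]
for patient in patients:
    # The derived feature encodes the domain knowledge in a single number.
    patient["bmi"] = round(bmi(patient["kg"], patient["m"]), 1)

print(patients)
```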
44. Feature Engineering
Built-in Transformations
Date-Time Fields
DATE-TIME: 2013-09-25 10:02
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to discover time-based patterns.
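BigML does this split automatically; purely as an illustration of what the split produces, here is a standard-library sketch. The function name split_datetime and the fixed English month list are assumptions.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def split_datetime(value, fmt="%Y-%m-%d %H:%M"):
    """Unpack a date-time string into separate components."""
    moment = datetime.strptime(value, fmt)
    return {
        "year": moment.year,
        # Month kept as a label, so it can be treated as a categorical feature.
        "month": MONTHS[moment.month - 1],
        "day": moment.day,
        "hour": moment.hour,
        "minute": moment.minute,
    }


parts = split_datetime("2013-09-25 10:02")
print(parts)
```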
45. Feature Engineering
Built-in Transformations
Categorical Fields for Clustering/LR
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
   CAT
is transformed to:
…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
   NUM       NUM     NUM
• Clustering and Logistic Regression require numeric fields for inputs
• Categorical values are transformed to numeric vectors automatically*
• *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
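The table above corresponds to a plain one-hot encoding, sketched below. This is not BigML's implementation (which, per the note, uses k-prototypes for clustering and a configurable encoding for LR); the function name one_hot is hypothetical.

```python
def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


columns, vectors = one_hot(["business", "recreation", "health"])
print(columns)
for vector in vectors:
    print(vector)
```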
46. Feature Engineering
Built-in Transformations
Text Fields
TEXT: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
   NUM    NUM     NUM   NUM
• Unstructured text contains a lot of potentially interesting patterns
• Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
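A toy bag-of-words count that reproduces the numbers in the table. The stop-word list and the crude "strip -ness" stemming are assumptions chosen for this one sentence; real text analysis uses proper tokenization and stemming.

```python
import re
from collections import Counter

# Tiny, hand-picked stop-word list (illustration only).
STOP_WORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}


def crude_stem(token):
    """Very rough stemming: strip a trailing 'ness' so 'greatness' counts as 'great'."""
    return token[:-4] if token.endswith("ness") and len(token) > 4 else token


def bag_of_words(text):
    """Count the remaining tokens after lowercasing, stop-word removal and stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens if token not in STOP_WORDS)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)
print(counts)
```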
47. Feature Engineering
Help ML to Work Better
When text is not actually unstructured
TEXT:
{
"url":"cbsnews",
"title":"Breaking News Headlines Business Entertainment World News ",
"body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
is split into:
title           body
Breaking News…  news covering…
…               …
TEXT            TEXT
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML algorithm to work better
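Extracting the key/value structure could be sketched with the standard json module. The function name extract_fields and the choice to return empty strings for unparseable text are assumptions.

```python
import json

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News ", '
       '"body": " news covering all the latest breaking national and world news '
       'headlines, including politics, sports, entertainment, business and more."}')


def extract_fields(text, keys=("title", "body")):
    """Parse a JSON text field and return the chosen keys as separate text features."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        record = {}
    if not isinstance(record, dict):
        record = {}
    return {key: str(record.get(key, "")).strip() for key in keys}


features = extract_fields(raw)
print(features["title"])
print(features["body"])
```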
49. Feature Engineering
Help ML to Work at all
When the pattern does not exist
Goal: Predict principal direction from highway number
Highway Number  Direction    Is Long
2               East-West    FALSE
4               East-West    FALSE
5               North-South  TRUE
8               East-West    FALSE
10              East-West    TRUE
…               …            …
(= (mod (field "Highway Number") 2) 0)
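The Flatline expression above adds an "is even" feature. A Python equivalent over the sample rows shows why it helps: in this sample, parity separates the two directions completely. The row layout is hypothetical.

```python
# (highway number, direction) pairs from the sample table.
highways = [(2, "East-West"), (4, "East-West"), (5, "North-South"),
            (8, "East-West"), (10, "East-West")]


def is_even(number):
    """Python equivalent of (= (mod (field "Highway Number") 2) 0)."""
    return number % 2 == 0


rows = [{"number": number, "direction": direction, "is_even": is_even(number)}
        for number, direction in highways]
for row in rows:
    print(row)
```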
51. Feature Engineering
Feature Engineering
Discretization
Total Spend (NUM)  Spend Category (CAT)
7,342.99           Top 33%
304.12             Bottom 33%
4.56               Bottom 33%
345.87             Middle 33%
8,546.32           Top 33%
With the NUM target: "Predict will spend $3,521 with error $1,232"
With the CAT target: "Predict customer will be Top 33% in spending"
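A rank-based tercile discretization, as a sketch. The cut rule (mid-rank percentile against 1/3 and 2/3) is an assumption; the slide's labels presumably come from a larger dataset, though this rule happens to reproduce them on the five sample values.

```python
def discretize_terciles(values):
    """Label each value Bottom/Middle/Top 33% by its rank among all values."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    labels = [None] * len(values)
    count = len(values)
    for rank, index in enumerate(order):
        # Mid-rank percentile in (0, 1); an assumed convention.
        percentile = (rank + 0.5) / count
        if percentile < 1 / 3:
            labels[index] = "Bottom 33%"
        elif percentile < 2 / 3:
            labels[index] = "Middle 33%"
        else:
            labels[index] = "Top 33%"
    return labels


total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
categories = discretize_terciles(total_spend)
print(list(zip(total_spend, categories)))
```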
53. Feature Engineering
Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc
• Normalize: Adjust a numeric value to a specific range of values while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
  • Types: recomputes field types. Ex: #classes > 1000
  • Preferred: recomputes preferred status
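Two of these built-ins, replacing missing values with the median and normalizing to a range, can be illustrated generically. This is not BigML's implementation; min-max scaling is one assumed reading of "normalize", and the function names are hypothetical.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the known values."""
    known = [value for value in values if value is not None]
    fill = median(known)
    return [fill if value is None else value for value in values]


def min_max_normalize(values, low=0.0, high=1.0):
    """Linearly rescale values to [low, high]; the shape of the distribution is kept."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]
    return [low + (value - smallest) / (largest - smallest) * (high - low)
            for value in values]


raw = [10, None, 30, 20]
filled = fill_missing_with_median(raw)
normalized = min_max_normalize(filled)
print(filled, normalized)
```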
54. Feature Engineering
Flatline Add Fields
Computing with Existing Features
(/ (field "Debt") (field "Income"))
Debt    Income   Debt to Income Ratio
10,134  100,000  0.10
85,234  134,000  0.64
8,112   21,500   0.38
0       45,900   0
17,534  52,000   0.34
NUM     NUM      NUM
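For comparison, the same ratio computed in Python over the table's rows. Returning None for a zero income is an assumption; the slide does not say how that case is handled.

```python
# (debt, income) pairs from the table.
loans = [(10134, 100000), (85234, 134000), (8112, 21500), (0, 45900), (17534, 52000)]


def debt_to_income(debt, income):
    """Python equivalent of (/ (field "Debt") (field "Income"))."""
    if income == 0:
        return None  # assumed handling of an undefined ratio
    return debt / income


ratios = [round(debt_to_income(debt, income), 2) for debt, income in loans]
print(ratios)
```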
56. Feature Engineering
What is Flatline?
Flatline: a domain-specific language for feature engineering and programmatic filtering
• DSL:
  • Invented by BigML - Programmatic / Optimized for speed
  • Transforms datasets into new datasets
    • Adding new fields / Filtering
  • Transformations are written in lisp-style syntax
• Feature Engineering
  • Computing new fields: (/ (field "Debt") (field "Income"))
• Programmatic Filtering:
  • Filtering datasets according to functions that evaluate to true/false using the row of data as an input.
57. Feature Engineering
Flatline
• Lisp style syntax: Operators come first
  • Correct: (+ 1 2) => NOT Correct: (1 + 2)
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Limited programming language structures
  • let, cond, if, map, list operators, */+-, etc.
• Built-in transformations
  • statistics, strings, timestamps, windows
58. Feature Engineering
Flatline s-expressions
Adding Simple Labels to Data
Define "default" as missing three payments in a row
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
Un-labelled data
Name        Month - 3  Month - 2  Month - 1
Joe Schmo   123.23     0          0
Jane Plain  0          0          0
Mary Happy  0          55.22      243.33
Tom Thumb   12.34      8.34       14.56
Labelled data
Name        Month - 3  Month - 2  Month - 1  Default
Joe Schmo   123.23     0          0          FALSE
Jane Plain  0          0          0          TRUE
Mary Happy  0          55.22      243.33     FALSE
Tom Thumb   12.34      8.34       14.56      FALSE
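The labelling rule reads: the sum of the absolute payments over the three months is zero. A Python rendering over the same four rows, with a hypothetical function name:

```python
# Payments for Month - 3, Month - 2, Month - 1, from the un-labelled table.
customers = {
    "Joe Schmo": (123.23, 0, 0),
    "Jane Plain": (0, 0, 0),
    "Mary Happy": (0, 55.22, 243.33),
    "Tom Thumb": (12.34, 8.34, 14.56),
}


def is_default(payments):
    """True when every payment is zero, i.e. the sum of absolute values is zero."""
    return sum(abs(payment) for payment in payments) == 0


labels = {name: is_default(payments) for name, payments in customers.items()}
print(labels)
```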
65. Feature Engineering
Feature Engineering
Fix Missing Values in a "Meaningful" Way
[Workflow diagram. Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin. Datasets: Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset.]
(if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
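The final "select" step can be sketched as follows. The rows and predicted values are hypothetical, and the sketch assumes the "predicted insulin" field has already been filled in by a model trained on the rows with non-zero insulin.

```python
def select_insulin(row):
    """Python equivalent of the Flatline if-expression: use the prediction when insulin is 0."""
    return row["predicted insulin"] if row["insulin"] == 0 else row["insulin"]


# Hypothetical rows after the prediction step.
rows = [
    {"insulin": 94, "predicted insulin": 101.5},
    {"insulin": 0, "predicted insulin": 120.3},
]
fixed = [select_insulin(row) for row in rows]
print(fixed)
```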
69. Feature Engineering
Feature Selection
Leakage
• Sales pipeline where step n-1 has no other outcome than step n.
• Stock close predicts stock open
• Churn retention: the worst rep is actually the best (correlation != causation)
• Cancer prediction where one input is a doctor-ordered test for the condition
• Account ID predicts fraud (because only new accounts are fraudsters)
71. Feature Engineering
Evaluate & Automate
• Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate - You don't want to hand code that every time, right?
  • Consider tools that are easy to automate
    • Scripting interface
    • APIs
  • Ability to maintain is important
72. Feature Engineering
The Process
[Flowchart. Nodes: Data Transform, Define Goal, Model & Evaluate, Goal Met? (yes / no), Automate, Feature Engineer & Selection (Better Features), Tune Algorithm, Better Data, Not Possible]
73. Feature Engineering
Summary
• Feature Engineering: what is it / why it is important
• Automatic transformations: date-time, text, etc
• Built-in functions: filtering and feature engineering
  • Discretization / Normalization / etc.
• Flatline: programmatic feature engineering / filtering
  • Structure
  • Examples: Adding fields / filtering
• When building features it is important to watch for leakage
• The critical importance of automating