This document discusses vectorization: the process of converting raw data, such as text, into numerical feature vectors that can be fed to machine learning algorithms. It covers the vector space model for text vectorization, in which each unique word is mapped to an index in a vector and the value at that index is the word count. Common text vectorization strategies such as bag-of-words, TF-IDF, and kernel hashing are explained. General vectorization techniques for the different attribute types (nominal, ordinal, interval, and ratio) are also surveyed, along with feature engineering methods and the Canova tool.
2. Presenter: Josh Patterson
• Email:
– josh@pattersonconsultingtn.com
• Twitter:
– @jpatanooga
• Github:
– https://github.com/jpatanooga
Past:
• Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
• Grad work in meta-heuristics and ant algorithms
• Tennessee Valley Authority (TVA): Hadoop and the Smartgrid
• Cloudera: Principal Solution Architect
Today: Patterson Consulting
3. Topic Index
• Why Vectorization?
• Vector Space Model
• Text Vectorization
• General Vectorization
4. WHY VECTORIZATION?
“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”
--- Peter Norvig, “Artificial Intelligence: A Modern Approach”
6. What Needs to Happen?
• Need each tweet as some structure that can be fed to a learning algorithm
– to represent the knowledge of a “negative” vs. a “positive” tweet
• How does that happen?
– We need to take the raw text and convert it into what is called a “vector”
• “Vector” relates to the fundamentals of linear algebra
– “solving sets of linear equations”
7. Wait. What’s a Vector Again?
• An array of floating point numbers
• Represents data
– Text
– Audio
– Image
• Example:
– [ 1.0, 0.0, 1.0, 0.5 ]
8. VECTOR SPACE MODEL
“I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”
--- HAL 9000, 2001: A Space Odyssey
9. Vector Space Model
• Common way of vectorizing text
– every possible word is mapped to a specific integer index
• If we have a large enough array then every word fits into a unique slot in the array
– the value at that index is the number of times the word occurs
• Most often our array size is smaller than our corpus vocabulary
– so we need a “vectorization strategy” to account for this
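A minimal sketch of this scheme in plain Java (not the DL4J/Canova API; the class and variable names are illustrative), assuming the vocabulary is known up front and fits in the array:

```java
import java.util.HashMap;
import java.util.Map;

public class VectorSpaceModelSketch {

    public static void main(String[] args) {
        // Every possible word is mapped to a specific integer index.
        String[] vocabulary = { "good", "bad", "movie", "great" };
        Map<String, Integer> wordToIndex = new HashMap<>();
        for (int i = 0; i < vocabulary.length; i++) {
            wordToIndex.put(vocabulary[i], i);
        }

        // The value at each index is the number of times the word occurs.
        double[] vector = new double[vocabulary.length];
        for (String token : "great movie great".split(" ")) {
            Integer index = wordToIndex.get(token);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        System.out.println(java.util.Arrays.toString(vector)); // [0.0, 0.0, 1.0, 2.0]
    }
}
```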
10. Text Vectorization Can Include Several Stages
• Sentence segmentation
– can skip straight to tokenization depending on the use case
• Tokenization
– find individual words
• Lemmatization
– finding the base or stem of words
• Removing stop words
– “the”, “and”, etc.
• Vectorization
– we take the output of the process and make an array of floating point values
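A sketch of the middle stages in plain Java; a real pipeline would use an NLP library for sentence segmentation and lemmatization, so only tokenization and stop-word removal are shown here, and the stop-word list is illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TextPipelineSketch {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "a", "is"));

    public static void main(String[] args) {
        String raw = "The plot is great and the acting is great";

        // Tokenization: find individual words.
        String[] tokens = raw.toLowerCase().split("\\s+");

        // Stop-word removal: drop words like "the", "and", etc.
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token); // a lemmatizer would also stem each token here
            }
        }

        System.out.println(kept); // [plot, great, acting, great]
        // The final vectorization stage turns these tokens into an array of
        // floating point values, as in the vector space model sketch above.
    }
}
```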
11. TEXT VECTORIZATION STRATEGIES
“A man who carries a cat by the tail learns something he can learn in no other way.”
--- Mark Twain
12. Bag of Words
• A group of words or a document is represented as a bag
– or “multi-set” of its words
• Bag of words is a list of words and their word counts
– the simplest vector model
– but can end up using a lot of columns due to the number of words involved
• Grammar and word ordering are ignored
– but we still track how many times each word occurs in the document
• Used most frequently in the document classification and information retrieval domains
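To make the order-invariance point concrete, here is a small self-contained Java example (the bagOfWords helper is hypothetical, not a library call) showing that two documents with the same words in different orders produce identical bags:

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWordsSketch {

    // Count each token; a TreeMap keeps the printed output deterministic.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Word order differs, but the bags are equal.
        System.out.println(bagOfWords("the cat sat on the mat"));
        System.out.println(bagOfWords("on the mat the cat sat"));
        System.out.println(bagOfWords("the cat sat on the mat")
                .equals(bagOfWords("on the mat the cat sat"))); // true
    }
}
```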
13. Term Frequency–Inverse Document Frequency (TF-IDF)
• Fixes some issues with “bag of words”
• Allows us to leverage information about how often a word occurs in a document (TF)
– while considering the frequency of the word in the corpus to control for the fact that some words will be more common than others (IDF)
• More accurate than the basic bag of words model
– but computationally more expensive
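A minimal sketch of one common TF-IDF weighting, tf(t, d) × log(N / df(t)); real implementations (including Canova's TF-IDF vectorizer) differ in smoothing and normalization details, so treat the exact formula here as an assumption:

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {

    // Term frequency: relative frequency of the term in one document.
    static double tf(String term, List<String> doc) {
        double count = 0;
        for (String token : doc) {
            if (token.equals(term)) count++;
        }
        return count / doc.size();
    }

    // Inverse document frequency: discounts words common across the corpus.
    // Assumes the term occurs in at least one document.
    static double idf(String term, List<List<String>> corpus) {
        double docsWithTerm = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) docsWithTerm++;
        }
        return Math.log(corpus.size() / docsWithTerm);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("great", "movie"),
                Arrays.asList("bad", "movie"),
                Arrays.asList("great", "movie", "acting"));

        // "movie" occurs in every document, so its IDF (and TF-IDF) is zero;
        // "great" is rarer in the corpus, so it keeps a positive weight.
        System.out.println(tf("great", corpus.get(0)) * idf("great", corpus)); // ~0.20
        System.out.println(tf("movie", corpus.get(0)) * idf("movie", corpus)); // 0.0
    }
}
```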
14. Kernel Hashing
• Used when we want to vectorize the data in a single pass
– making it a “just in time” vectorizer
• Can be used when we want to vectorize text right before we feed it to our learning algorithm
• We come up with a fixed-size vector that is typically smaller than the total possible words we could index or vectorize
– then we use a hash function to create an index into the vector
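A sketch of the hashing idea in plain Java, assuming a simple token-count scheme and Java's built-in String.hashCode() standing in for a stronger hash such as MurmurHash:

```java
public class HashingVectorizerSketch {

    static double[] vectorize(String text, int dimensions) {
        double[] vector = new double[dimensions];
        for (String token : text.toLowerCase().split("\\s+")) {
            // Mask the sign bit so the index is non-negative
            // (Math.abs would overflow on Integer.MIN_VALUE).
            int index = (token.hashCode() & 0x7fffffff) % dimensions;
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // No precursor pass over the corpus: vectorize in a single pass.
        double[] v = vectorize("great movie great acting", 8);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```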
16. Four Major Attribute Types
• Nominal
– Ex: “sunny”, “overcast”, and “rainy”
• Ordinal
– Like nominal but with order
• Interval
– Ex: “year”, expressed in fixed and equal units
• Ratio
– the scheme defines a zero point and then a distance from this fixed zero point
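Illustrative encodings for these four types; one-hot for nominal and integer ranks for ordinal are common conventions, not the only options:

```java
import java.util.Arrays;
import java.util.List;

public class AttributeEncodingSketch {

    public static void main(String[] args) {
        // Nominal: no order, so one-hot encode ("overcast" -> [0, 1, 0]).
        List<String> outlook = Arrays.asList("sunny", "overcast", "rainy");
        double[] oneHot = new double[outlook.size()];
        oneHot[outlook.indexOf("overcast")] = 1.0;
        System.out.println(Arrays.toString(oneHot));

        // Ordinal: order matters, so map to integer ranks ("medium" -> 1.0).
        List<String> size = Arrays.asList("small", "medium", "large");
        System.out.println((double) size.indexOf("medium"));

        // Interval: fixed, equal units but no true zero (e.g., year).
        double year = 1984.0;

        // Ratio: a fixed zero point, so ratios are meaningful
        // (age 30 is twice age 15 in a way year 1984 is not twice year 992).
        double age = 30.0;
        System.out.println(year + " " + age);
    }
}
```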
17. Techniques of Feature Engineering
• Taking the values directly from the attribute unchanged
– If the value is something we can use out of the box
• Feature scaling
– standardization
– or normalizing an attribute
• Binarization of features
– 0 or 1
• Dimensionality reduction
– Use only the most interesting features
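A short sketch of two of these techniques, min-max normalization (one form of feature scaling) and binarization; the threshold and data are made up for illustration:

```java
import java.util.Arrays;

public class FeatureScalingSketch {

    // Rescale values into [0, 1]; assumes max > min.
    static double[] minMaxNormalize(double[] values) {
        double min = Arrays.stream(values).min().orElse(0);
        double max = Arrays.stream(values).max().orElse(1);
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = (values[i] - min) / (max - min);
        }
        return scaled;
    }

    // Collapse each value to 0 or 1 against a threshold.
    static double[] binarize(double[] values, double threshold) {
        double[] bits = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            bits[i] = values[i] > threshold ? 1.0 : 0.0;
        }
        return bits;
    }

    public static void main(String[] args) {
        double[] ages = { 15, 30, 60 };
        System.out.println(Arrays.toString(minMaxNormalize(ages))); // [0.0, 0.33..., 1.0]
        System.out.println(Arrays.toString(binarize(ages, 18)));    // [0.0, 1.0, 1.0]
    }
}
```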
18. Canova
• Command line based
– We don’t want to write custom code for every dataset
• Examples of usage
– Convert the MNIST dataset from raw binary files to the svmLight text format
– Convert raw text into TF-IDF based vectors in a text vector format {svmLight, arff}
• Scales out on multiple runtimes
– Local, Hadoop
• Open source, ASF 2.0 licensed
– https://github.com/deeplearning4j/Canova
Editor's Notes
The advantage of kernel hashing is that we don’t need the precursor pass over the corpus that TF-IDF requires, but we run the risk of collisions between words. In reality these collisions occur very infrequently and don’t have a noticeable impact on learning performance.
Feature scaling (or “feature normalization”) can improve the convergence speed of certain algorithms (for example, stochastic gradient descent).
When we “standardize” a vector we subtract a measure of location (most commonly the mean; alternatives include the minimum or median) and then divide by a measure of scale (most commonly the standard deviation; alternatives include the variance or range).
Another method of feature normalization is “pre-whitening”. Pre-whitening is a decorrelation transformation that makes the components of the input uncorrelated by transforming the input with the inverse square root of its covariance matrix. The transformation is called “whitening” because it turns the input vector into a white-noise-like vector.
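A worked form of the two notes above, assuming the most common choices (mean μ as the location measure, standard deviation σ as the scale measure, Σ as the input covariance matrix):

```latex
% Standardization of a single feature x:
\[ x' = \frac{x - \mu}{\sigma} \]

% Pre-whitening of an input vector x with mean \mu and covariance \Sigma;
% the transformed vector has identity covariance:
\[ \tilde{x} = \Sigma^{-1/2}\,(x - \mu), \qquad \operatorname{Cov}(\tilde{x}) = I \]
```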