Scikit-learn is a machine learning library in Python, that has become a valuable tool for many data science practitioners.
This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning.
Apart from metrics for model evaluation, we will cover how to evaluate model complexity, and how to tune parameters with grid search, randomized parameter search, and what their trade-offs are. We will also cover out of core text feature processing via feature hashing.
---------------------------------------------------------
Andreas is an Assistant Research Scientist at the NYU Center for Data Science, building a group to work on open source software for data science. Previously he worked as a Machine Learning Scientist at Amazon, working on computer vision and forecasting problems. He is one of the core developers of the scikit-learn machine learning library, and maintained it for several years.
Material will be posted here:
https://github.com/amueller/pydata-nyc-advanced-sklearn
Blog:
peekaboo-vision.blogspot.com
Twitter:
https://twitter.com/t3kcit
7. 7
Training Data
Test Data
Training Labels
Model
Prediction
Supervised Machine Learning
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
8. 8
clf.score(X_test, y_test)
Training Data
Test Data
Training Labels
Model
Prediction
Test Labels Evaluation
Supervised Machine Learning
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
34. 34
Randomized Parameter Search
● Always use distributions for continuous
variables.
● Don't use for low dimensional spaces.
● Future: Bayesian optimization based search.
53. 53
Think twice!
● Old laptop: 4GB Ram
● 1073741824 float32
● Or 1mio data points with 1000 features
● EC2 : 256 GB Ram
● 68719476736 float32
● Or 68mio data points with 1000 features
55. 55
Out of Core Learning
sgd = SGDClassifier()
for i in range(9):
X_batch, y_batch = cPickle.load(open("batch_%02d" % i))
sgd.partial_fit(X_batch, y_batch, classes=range(10))
Possibly go over the data multiple times.
62. 62
Out of Core Text Classification
sgd = SGDClassifier()
hashing_vectorizer = HashingVectorizer()
for i in range(9):
text_batch, y_batch = cPickle.load(open("text_%02d" % I))
X_batch = hashing_vectorizer.transform(text_batch)
sgd.partial_fit(X_batch, y_batch, classes=range(10))