Machine Learning Intro Session
23. Identifying use cases – without Google
• Anything repetitive (classifying digits/gestures/road conditions/e-mail contents)
• Capturing best practices
25. The Nikon D700 has a 1,005-pixel RGB (red, green, blue) sensor that
measures the intensity of the light and the color of a scene. The
camera then compares the information to information from 30,000
images stored in its database. The D700 determines the exposure
settings based on the findings from the comparison. Simplified, it
works like this: You're photographing a portrait outdoors, and the
sensor detects that the light in the center of the frame is much
dimmer than the edges. The camera takes this information along with
the focus distance and compares it to the ones in the database. The
images in the database with similar light and color patterns and
subject distance tell the camera that this must be a close-up portrait
with flesh tones in the center and sky in the background. From this
information, the camera decides to expose primarily for the center of
the frame although the background may be over or underexposed.
Source: http://my.safaribooksonline.com/book/photography/9780470413203/nikon-d700-essentials/metering_modes
• Note the effort on Data collection.
• Need for synthetic data.
26. Confusion Matrix
• Precision is the fraction of retrieved instances that are relevant.
• Recall is the fraction of relevant instances that are retrieved.
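As a rough illustration of the two definitions above, the sketch below counts the confusion-matrix cells for a binary problem and derives precision and recall from them. The labels and predictions are made-up toy values, and `precision_recall` is a hypothetical helper name, not something from the session.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix cells for the positive class.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Precision: of everything retrieved (predicted positive), how much is relevant?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything relevant (truly positive), how much was retrieved?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumed for illustration): 4 true positives, 6 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here 3 of the 5 retrieved instances are relevant (precision 0.60), and 3 of the 4 relevant instances are retrieved (recall 0.75).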
28. Sample code
• Demo 1 – Simple fit() & predict()
• Demo 2 – With Cross Validation
• Demo 3 – Use a pickled Classifier
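The demo code itself is not part of this transcript. The sketch below only illustrates the three ideas named above (fit/predict, cross-validation, reusing a pickled classifier) with the standard library. The nearest-centroid classifier, the function names, and the toy data are all assumptions made for illustration; the actual demos may well have used a library such as scikit-learn.

```python
import pickle
from collections import defaultdict


def fit(X, y):
    """Learn one centroid (mean vector) per class label; the model is a plain dict."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(model, X):
    """Assign each row to the class whose centroid is nearest (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(model, key=lambda label: dist2(row, model[label])) for row in X]


def cross_val_accuracy(X, y, k=3):
    """k-fold cross-validation: each fold holds out every k-th sample once."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        scores.append(sum(p == y[i] for p, i in zip(preds, test)) / len(test))
    return scores


# Toy, made-up data: two well-separated classes.
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]
y = ["a", "a", "a", "b", "b", "b"]
new_points = [[1.0, 1.0], [5.0, 5.0]]

model = fit(X, y)
demo1 = predict(model, new_points)             # simple fit() and predict()
cv_scores = cross_val_accuracy(X, y, k=3)      # accuracy per held-out fold
restored = pickle.loads(pickle.dumps(model))   # round-trip through pickle
demo3 = predict(restored, new_points)          # predict with the unpickled model
```

Pickling matters because training can be slow while prediction is cheap: the trained model is stored once and reloaded wherever predictions are needed.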
31. PCA – Hum Dus, Humara Ek! ("We are ten, ours is one!")
• Why? Computationally efficient. A pre-processing step when the number of features is large.
• Up to 10X reduction in the number of features, with little loss of information.
• Demo 4 – Original number of features: 1850. Features used: 150.
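To make the idea of feature reduction concrete, here is a minimal sketch of PCA on a toy data set, finding only the first principal component by power iteration. It is not Demo 4: the data (3 perfectly correlated features reduced to 1), the function names, and the iteration count are assumptions chosen to keep the example dependency-free.

```python
import math


def top_component(X, iters=200):
    """Return (mean, unit vector) for the first principal component of X."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in X]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalise.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(X, mean, v):
    """Reduce each row to a single coordinate along direction v."""
    return [sum((row[j] - mean[j]) * v[j] for j in range(len(v))) for row in X]


# Toy data: three features that all depend on one hidden value t.
X = [[t, 2 * t, -t] for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
mean, v = top_component(X)
Z = project(X, mean, v)  # one number per row instead of three

# Rebuild the rows from the single retained coordinate.
X_rebuilt = [[mean[j] + z * v[j] for j in range(len(v))] for z in Z]
```

Because the toy features are exactly redundant, one component reconstructs the data perfectly; on real data the discarded components carry some variance, so the reconstruction is only approximate.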
33. Recommendation Engines
• Where the money is – 75% sales
• Don’t make money on hardware – Amazon
• User based – based on user similarity – Collaborative Filtering
• Item based – “Users who bought X also bought Y”
• Demo 5
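The item-based idea ("users who bought X also bought Y") can be sketched with simple co-occurrence counts. This is not Demo 5; the baskets, item names, and helper functions below are made up for illustration, and a real engine would normally normalise the counts (for example with cosine similarity) rather than rank by raw co-occurrence.

```python
from collections import Counter
from itertools import permutations


def also_bought(baskets):
    """For every item, count how often each other item shares a basket with it."""
    co = {}
    for basket in baskets:
        for x, y in permutations(set(basket), 2):
            co.setdefault(x, Counter())[y] += 1
    return co


def recommend(co, item, n=2):
    """Return up to n items most frequently bought together with `item`."""
    return [other for other, _ in co.get(item, Counter()).most_common(n)]


# Hypothetical purchase histories, one list per user.
baskets = [
    ["camera", "sd_card", "tripod"],
    ["camera", "sd_card"],
    ["camera", "sd_card", "bag"],
    ["tripod", "bag"],
]

co = also_bought(baskets)
top = recommend(co, "camera", n=1)
print(top)
```

In this toy data the SD card appears in all three camera baskets, so it is the top suggestion for a camera buyer.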
34. Anomaly Detection
Anomaly detection doesn't know what anomalies look like, but it knows what
they don't look like!
Very small number of positive (anomalous) examples
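One common way to act on this idea is to model only the normal examples and flag anything the model finds very unlikely. The sketch below fits a single Gaussian to made-up CPU-load readings and uses a hypothetical density threshold `epsilon`; the data, the threshold, and the function name are assumptions for illustration only.

```python
from statistics import NormalDist, mean, stdev

# Made-up readings from servers known to be healthy (no anomalies needed to train).
normal_cpu_load = [0.42, 0.48, 0.50, 0.51, 0.47, 0.53, 0.49, 0.52, 0.46, 0.50]

# Fit a Gaussian to the normal data only.
model = NormalDist(mean(normal_cpu_load), stdev(normal_cpu_load))

# Hypothetical threshold; in practice it would be tuned on the few labelled anomalies.
epsilon = 0.01


def is_anomaly(x):
    """Flag x when its density under the 'normal' model falls below epsilon."""
    return model.pdf(x) < epsilon


typical = is_anomaly(0.50)   # close to the training data
unusual = is_anomaly(0.95)   # far outside anything seen as normal
```

The small number of positive examples is then used for choosing `epsilon` and evaluating the detector, not for training it.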
35. Error Analysis: Example – data center monitoring.
Features
x1 = memory use
x2 = number of disk accesses/sec
x3 = CPU load
x4 = network traffic
We suspect CPU load and network traffic grow linearly with one another.
If a server is serving many users, both CPU load and network traffic are high.
The failure case is an infinite loop: CPU load grows but network traffic stays low.
New feature: CPU load / network traffic
The multivariate Gaussian algorithm is aware of “covariance”, so it can capture this relationship without a hand-built feature.
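The effect of the new feature can be sketched with per-feature Gaussians. In the made-up readings below, a stuck server has a CPU load and a network traffic value that each look ordinary on their own, while their ratio is far outside the normal range. All numbers, names, and the z-score cut-offs are assumptions for illustration.

```python
from statistics import NormalDist

# Made-up (cpu_load, network_traffic) pairs from healthy servers:
# the two values rise and fall together.
normal = [(0.2, 0.21), (0.4, 0.38), (0.6, 0.63),
          (0.8, 0.79), (0.5, 0.52), (0.3, 0.29)]

# Hypothetical failure: infinite loop, so CPU is high but traffic is low.
stuck_cpu, stuck_net = 0.8, 0.2

# One independent Gaussian per original feature, plus one for the ratio feature.
cpu_model = NormalDist.from_samples([c for c, _ in normal])
net_model = NormalDist.from_samples([n for _, n in normal])
ratio_model = NormalDist.from_samples([c / n for c, n in normal])


def z_score(model, x):
    """Distance of x from the model mean, in standard deviations."""
    return abs(x - model.mean) / model.stdev


cpu_z = z_score(cpu_model, stuck_cpu)                  # unremarkable on its own
net_z = z_score(net_model, stuck_net)                  # unremarkable on its own
ratio_z = z_score(ratio_model, stuck_cpu / stuck_net)  # far from normal
print(f"cpu z={cpu_z:.1f}  net z={net_z:.1f}  ratio z={ratio_z:.1f}")
```

A multivariate Gaussian fitted to the pair (CPU load, network traffic) would flag the same server through its covariance term, at the cost of estimating a full covariance matrix.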