This document summarizes the internship of a student at Culture Machine, a data science company. The internship objectives were to learn about working in a professional environment and assess career fit. Key projects included developing an algorithm to search for memes by image matching and building a prediction model using SVM to classify memes and non-memes. The internship helped enhance communication skills and technical knowledge in areas like machine learning, image processing, and Python libraries. Overall, the experience provided valuable insight into a career in data science.
CONTENTS
1. Acknowledgement
2. (a) Introduction
(b) Internship Objectives
(c) Personal Development Targets
3. Project Details
(a) Searching Algorithm for Memes
(b) Prediction Model using SVM Algorithm
4. Other Tasks and Activities
5. Reflection
6. Conclusion
ACKNOWLEDGEMENT
The internship opportunity I had with Culture Machine was a
great chance for learning and professional development.
I consider myself very fortunate to have been given the opportunity
to be a part of it. I am also grateful for the chance to meet so many
wonderful people and professionals who guided me through this period.
For this opportunity, I would like to thank:
Amit Garde, Head of Engineering at Culture Machine, Pune, and my
intern mentor. He always emphasised going deep into the field and
guided me through my internship with advice and feedback despite his
busy schedule.
Meghana Negi, research engineer and my internship coach. I want to
thank her for giving me the opportunity to do my internship at the
company. She helped me during my internship with feedback and tips on
how to handle and approach situations. She always had time to answer
all my questions concerning my internship.
Furthermore, I would like to thank Arvind Hulgeri, Abhishek Kolipey
and Jaju who were really helpful and created a good environment to
work in.
Besides my internship, I really enjoyed my stay in Pune. It was a
great experience and I want to thank everybody for it.
RECOGNISING VISUAL MEMES
Introduction
This report is a short description of my one-and-a-half-month internship
carried out in the data science team at Culture Machine.
Internship Objectives:
● To see what it is like to work in a professional environment
● To see if this kind of work is a possibility for my future career
Personal Development Targets:
I set personal development targets to practise, improve and
develop during my internship.
● To enhance my communication skills
● To improve my knowledge in the field of Data Science.
This report describes the activities that contributed to achieving a
number of my stated goals, along with descriptions of the projects
I undertook. Finally, I conclude on the internship experience with
respect to my learning goals.
A. Searching Algorithm for Memes
1. Project Description:
This project’s aim is to search for the memes of a particular
instance in a dataset of memes and show all the images corresponding
to it. It has the potential to determine and classify incoming streams
of images from Facebook, YouTube, etc. into various categories of
memes, and is hence an asset to Culture Machine as a data-oriented
media company.
As a data science intern, my role was to design a searching algorithm
for finding the memes of a particular instance, given a dataset of
memes belonging to different categories/instances.
Subsequently I came up with solutions that address the problems
assigned.
2. Steps
● Downloaded a dataset containing meme IDs corresponding to their
URLs, along with their names, ratings, dates and data IDs.
● Defined a function that converts a URL to its corresponding image
using urllib.
● Defined a matching function which takes the name of a meme as
input and returns the set of matching memes in the following way:
1. Converting all the memes to greyscale and resizing them to the
same size, i.e. 100 x 110, keeping the average aspect ratio in
mind to avoid distortion.
2. Detecting keypoints (features) and their descriptors using
ORB, a binary descriptor which uses the Oriented FAST
algorithm to detect keypoints and Rotated BRIEF to build
their descriptors.
3. Taking a meme of a particular instance and comparing its
descriptors with the descriptors of the other memes in the
dataset using brute-force matching, which finds the Hamming
distance between each pair of matched features.
4. Setting a threshold in order to classify a meme as a good
match to the input meme (e.g. the Hamming distance between
two matched features should be less than 5 units).
Set of 5 input memes -
When the matching function is called with ‘Redditors Wife’ (the meme
name) as a parameter, it takes a single meme image (the primary image)
of this instance from the primary dataset, knowing that this image
belongs to Redditors Wife, and compares it with all the images in the
input dataset.
Primary Image -
3. Final Outcome:
● Total number of memes of that particular instance present in the
dataset.
● Slideshow of memes of that instance present in the dataset.
The matching function returns the following memes as the output :
Matching of keypoints is done by calculating the Hamming distance
between the descriptors (features). The smaller the distance, the more
similar the features; and the more matched features two images share,
the more similar the images are.
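As a toy illustration of how this distance behaves (not code from the project), the Hamming distance between two ORB-style binary descriptors can be computed by XOR-ing them and counting the differing bits:

```python
import numpy as np

def hamming_distance(d1, d2):
    """Count the bits that differ between two binary descriptors."""
    # XOR leaves a 1 exactly where the descriptors disagree;
    # unpackbits expands each byte into its 8 bits so we can count them.
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

a = np.array([0b10110000, 0b00001111], dtype=np.uint8)
b = np.array([0b10110001, 0b00001111], dtype=np.uint8)
print(hamming_distance(a, b))  # → 1: the descriptors differ in one bit
```

Identical descriptors give a distance of 0, which is why a small distance signals a good match.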
B. Prediction Model using SVM Algorithm
1. Project Description:
The project’s aim is to predict the names of the memes in a
dataset of images containing memes as well as non-memes. It has the
capability to differentiate between memes and non-memes, and
furthermore to classify each meme into its category.
As a data science intern, my role was to design a prediction model
which takes as input a set of images containing memes as well as
non-memes, applies an algorithm and returns as output the images with
their categories.
Subsequently I came up with solutions that address the problems
assigned.
2. Steps:
● Taking a dataset of images having ImageIDs which lead to their
URLs and their names.
● Assigning indexes to the different categories of memes and
non-memes. In my case, 0 is given to all non-memes, while, starting
from 1, a particular index is assigned to each category of meme.
● The input dataset is divided into train and test datasets using the
ratio that gives the best result.
● For the train dataset, extracting descriptors (features) of each
image using the ORB descriptor.
Now, a basic question arises in our mind: how exactly is feature
extraction done for images?
Following is my understanding of this question:
(a) ORB is basically a fusion of the FAST keypoint detector and the
BRIEF descriptor, with many modifications to enhance performance.
First it uses FAST to find keypoints, then applies the Harris corner
measure to find the top N points among them.
(b) FAST algorithm - We identify interest points by looking for pixels
which have a significant intensity variation with respect to their
neighbouring pixels.
1. Select a pixel in the image which is to be classified as an interest
point or not. Let its intensity be Ip.
2. Select an appropriate threshold, t.
3. Consider a circle of 16 pixels around the pixel under test. (See the
image below.)
4. Now the pixel is a corner if there exists a set of N contiguous
pixels in the circle (of 16 pixels) which are all brighter than Ip + t,
or all darker than Ip - t. (Shown as white dashed lines in the above
image.) N was chosen to be 12.
5. A high-speed test was proposed to exclude a large number of
non-corners. This test examines only the four pixels at positions 1,
9, 5 and 13 (first, 1 and 9 are tested to see whether they are too
bright or too dark; if so, 5 and 13 are checked). If the pixel is a
corner, then at least three of these four must all be brighter than
Ip + t or darker than Ip - t.
ORB also improves the rotation invariance of the keypoints computed by
the FAST algorithm.
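The segment test described above can be sketched directly on the 16-pixel circle. This is a simplified, slow version for illustration only; in practice OpenCV's FAST implementation (used inside ORB) does the same test far more efficiently:

```python
import numpy as np

# Offsets of the 16-pixel Bresenham circle of radius 3 used by FAST,
# starting directly above the centre pixel and going clockwise.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2),
          (1, 3), (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1),
          (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, n=12):
    """Simplified FAST segment test: the pixel is a corner if n contiguous
    circle pixels are all brighter than Ip + t or all darker than Ip - t."""
    ip = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    for sign in (1, -1):  # sign=1 looks for brighter runs, -1 for darker
        run = 0
        for v in ring + ring:  # walk the ring twice so runs can wrap around
            run = run + 1 if sign * (v - ip) > t else 0
            if run >= n:
                return True
    return False

# Toy usage: a small bright blob on a dark background.
img = np.zeros((20, 20), dtype=np.uint8)
img[9:12, 9:12] = 200
print(is_fast_corner(img, 10, 10))   # blob centre: all 16 ring pixels darker
flat = np.full((20, 20), 200, dtype=np.uint8)
print(is_fast_corner(flat, 10, 10))  # uniform region: no intensity variation
```

The high-speed pre-test from step 5 would simply check ring positions 0, 8, 4 and 12 (the pixels numbered 1, 9, 5, 13) before running this full test.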
(c) BRIEF Descriptors:
1. BRIEF is an example of a binary descriptor. Binary descriptors are
preferred over SIFT or SURF as they do not involve computing the
gradients of each pixel in a patch, and are hence comparatively
faster.
2. In general, binary descriptors are composed of three parts: a
sampling pattern, orientation compensation and sampling pairs.
3. Consider a small patch centered around a keypoint. We’d like to
describe it as a binary string. First, take a sampling pattern
around the keypoint, for example points spread on a set of
concentric circles.
4. Next, choose 256 pairs of points on this sampling pattern.
Now go over all the pairs and compare the intensity value of the first
point in each pair with the intensity value of the second point: if
the first value is larger than the second, write ‘1’ in the string,
otherwise write ‘0’. That’s it. After going over all 256 pairs, we’ll
have a 256-character string of ‘1’s and ‘0’s that encodes the local
information around the keypoint. (OpenCV represents it in bytes, so it
becomes a 32-byte string.)
In the case of ORB, there is no elaborate sampling pattern: it uses
moments for orientation calculation, and learned pairs are taken as
the sampling pairs.
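A toy version of this pairwise intensity test illustrates how the 256-bit string is built. Note the hedges: random sampling pairs stand in for BRIEF's fixed pattern (or ORB's learned pairs), and no orientation compensation is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH = 31  # side length of the patch around the keypoint

# 256 random (x1, y1, x2, y2) sampling pairs inside the patch;
# real BRIEF uses a fixed pattern and ORB uses learned pairs instead.
pairs = rng.integers(0, PATCH, size=(256, 4))

def brief_descriptor(patch):
    """Build the 256-bit descriptor: bit i is 1 iff the first point of
    pair i is brighter than the second."""
    bits = [1 if patch[y1, x1] > patch[y2, x2] else 0
            for x1, y1, x2, y2 in pairs]
    # OpenCV stores the same 256 bits packed into 32 bytes.
    return np.packbits(bits)

patch = rng.integers(0, 256, size=(PATCH, PATCH)).astype(np.uint8)
desc = brief_descriptor(patch)
print(desc.shape)  # (32,): a 32-byte binary descriptor
```

Two such descriptors are then compared with the Hamming distance, exactly as in the matching project above.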
● Extracting the features of each image in the form of a matrix of
shape N x 32 (N is the number of keypoints detected).
● Vector-quantising these features into histograms (a bag of
visual words) in order to train an SVM classifier on them.
1. Conversion of each image into a vector in n-dimensional
space.
2. Creating a bag of visual words/features using the k-means
clustering algorithm, which determines the center of each cluster.
(The number of clusters is taken as the square root of M/2, where M
is the total number of features across all the images.)
3. Using the approximate nearest neighbour algorithm to
construct a feature histogram for each image. The function
increments histogram bins based on the proximity of each descriptor
to a particular cluster center.
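The bag-of-visual-words step can be sketched as below. This is a simplified illustration: random stand-in descriptors replace real ORB output, and exact nearest-centre assignment (`kmeans.predict`) is used in place of the approximate nearest-neighbour search mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in descriptors for 3 images (real ORB output is N x 32 per image).
image_descriptors = [rng.integers(0, 256, size=(40, 32)).astype(np.float32)
                     for _ in range(3)]

# Build the vocabulary: k = sqrt(M / 2) clusters over all M descriptors.
all_desc = np.vstack(image_descriptors)
k = max(1, int(np.sqrt(len(all_desc) / 2)))
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

def bovw_histogram(descriptors):
    """Histogram of visual words: one bin per cluster, incremented once
    for the nearest cluster centre of each descriptor."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(np.float64)
    return hist / hist.sum()  # normalise so images of any size compare

hists = np.array([bovw_histogram(d) for d in image_descriptors])
print(hists.shape)  # one fixed-length feature vector per image
```

The point of the quantisation is that every image, whatever its keypoint count N, ends up as a fixed-length vector the SVM can consume.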
● Training the SVM classifier using the features of the train dataset
and their corresponding indexes. (A linear SVM kernel is used, which
takes a “one vs rest” approach for multi-class classification.)
● Creating a bag of features for the images in the test dataset as
well.
● Predicting the index of each image in the test dataset and
calculating the accuracy of the model.
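The train/test split and one-vs-rest linear SVM can be sketched with scikit-learn as follows. The histograms here are synthetic (three well-separated toy classes) and the 80/20 split is an illustrative ratio, not the one chosen in the project.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
k = 7  # vocabulary size (number of histogram bins)

# Synthetic bag-of-words histograms: class 0 = non-meme, classes 1 and 2
# = meme categories, each drawn around a different "typical" histogram.
centers = rng.random((3, k))
X = np.vstack([c + 0.05 * rng.standard_normal((30, k)) for c in centers])
y = np.repeat([0, 1, 2], 30)

# Shuffle, then split into train and test sets (an 80/20 ratio here).
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
split = int(0.8 * len(y))

# LinearSVC trains one-vs-rest linear classifiers by default.
clf = LinearSVC(random_state=0).fit(X[:split], y[:split])
accuracy = (clf.predict(X[split:]) == y[split:]).mean()
print(accuracy)  # near 1.0 on this well-separated toy data
```

In the real pipeline, X would be the bag-of-features histograms of the images and y the category indexes assigned earlier (0 for non-memes, 1 upwards for meme categories).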
3. Output:
Saving the output dataset as a CSV file containing the predicted
names of the memes (e.g. “Redditors Wife”) and non-memes (the title
“Non-memes” is given).
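The final export step might look like the pandas snippet below. The column names, the second meme name and the ImageIDs are hypothetical placeholders, not the report's actual schema.

```python
import os
import tempfile
import pandas as pd

# Hypothetical index-to-name mapping, following the convention above:
# index 0 means non-meme, higher indexes are meme categories.
index_to_name = {0: "Non-memes", 1: "Redditors Wife", 2: "Success Kid"}
predicted = [1, 0, 2, 1]  # model output for four test images

df = pd.DataFrame({
    "ImageID": [101, 102, 103, 104],          # placeholder IDs
    "PredictedName": [index_to_name[i] for i in predicted],
})
out_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
df.to_csv(out_path, index=False)  # write the output dataset as CSV
print(df["PredictedName"].tolist())
```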
OTHER TASKS AND ACTIVITIES
● Learning the use of shell scripts for installing applications,
reading files and controlling their parameters.
● Learning the mechanics and complexity of algorithms through the
book “Algorithms” by Sanjoy Dasgupta.
● Learning the basics of Spark and how it is used when the size of
the data is big.
REFLECTION
The internship has been a fulfilling experience. I have been able to
accomplish all my stated learning goals, and my expectations have been
exceeded.
The time spent with Culture Machine has given me great insight
into the startup world. I was genuinely impressed with their work
culture, which is both flexible and open.
I found that the professionals at Culture Machine are all highly
qualified and hardworking individuals, and it was an honour to work
under their guidance. My mentors and the data science team as a whole
were very warm and supportive throughout, and I am glad to have built
a strong bond with them.
CONCLUSION
I was always curious about how data science and machine learning are
carried out in a professional environment, and the internship with
Culture Machine helped me experience it.
I hope to pursue this as a career later in life, and therefore the
experience and skill set gained here are invaluable. Some of the skills I
gained are listed below :
● Image processing
● Feature detection and extraction of images
● Using different Machine Learning algorithms
● Knowledge of Python libraries such as pandas, numpy, cv2 and sklearn
● Tools such as shell scripts and Jupyter Notebook for Python
It was a wonderful experience. Thank you Culture Machine for this
opportunity. :)