1. Summer Internship Report Page 1
Summer Internship Report
8th July, 2016
Submitted By
Durga Kant Gupta
(Roll No. 13267)
Undergraduate student at IIT Kanpur
Department: Biological Sciences and Bio Engineering
Under the Guidance of
IndiaMART Guide: Mr. Somesh Kumar, VP Business Analytics, IndiaMART InterMESH Ltd.
IndiaMART Co-Guide: Mr. Anirudh Singh, Asst. VP Business Analytics, IndiaMART InterMESH Ltd.
CONTENTS
ACKNOWLEDGEMENT
ABOUT THE COMPANY
CORE VALUES
PRODUCTS
LISTING SERVICES
BUY LEADS
ACCESS TO SERVICE
SOFTWARE OR LANGUAGES USED
DESCRIPTION OF PROJECTS / ACTIVITIES
PROJECT#1 (AIM, PROCEDURE, COMPARISON WITH THE CURRENT SEARCH ALGORITHM, RESULT)
PROJECT#2 (AIM, PROCEDURE)
PROJECT#3 (AIM, DATA DESCRIPTION, PROCEDURE)
PROJECT#4 (AIM, PROCEDURE)
PROJECT#5 (AIM, PROCEDURE)
BIBLIOGRAPHY
APPENDIX
CERTIFICATE OF INTERNSHIP COMPLETION
This is to certify that Mr. Durga Kant Gupta, a 3rd year undergraduate student of the Biological Sciences and Bio Engineering department at Indian Institute of Technology, Kanpur, has successfully completed his summer internship from 9th May, 2016 to 8th July, 2016.
During this period his performance was excellent and we found him dedicated, hardworking and
sincere. We have derived immense benefit from the project and his contribution to our
organization is highly appreciated.
I hereby convey my best wishes to him for all his future endeavors.
Somesh Kumar | VP - Business Analytics
Mobile: +91-9717776552
Email: somesh.kumar@indiamart.com
IndiaMART InterMESH Ltd.
"Kaam Yahi Banta Hai"
7th Floor, Advant-Navis Business Park,
Plot No -7 Sector-142, Noida - 201305
Ph: +91-(0120)-6777 777 Extn : 7787
ACKNOWLEDGEMENT
I take this opportunity to extend my sincere thanks to IndiaMART for offering me a unique
platform to gain exposure and garner knowledge in the field of Business Analytics.
I would like to extend my heartfelt gratitude to my Internship guide Mr. Somesh Kumar and
co-guide Mr. Anirudh Singh for having made my summer training a great learning experience
by their constant guidance, encouragement and support.
Last but not least, I would like to express my profound gratitude to each and every employee of the Business Analytics Division, IndiaMART InterMESH Limited, who contributed in their own ways to the successful completion of my internship.
Durga Kant Gupta
ABOUT THE COMPANY
IndiaMART is India's largest B2B online marketplace, connecting buyers with suppliers. The online channel focuses on providing a platform for buyers, who can be SMEs, large enterprises or individuals. Buyers gain access to a wider marketplace and diverse portfolios of quality products, and tap a one-stop shop that caters to all their specific requirements, helping the discerning buyer make well-informed choices.
IndiaMART offers a platform and tools to over 2.6 crore buyers to search from over 3.3 crore products and get connected with over 22 lakh reliable and competitive suppliers. Founded in 1996, the company's mission is 'to make doing business easy'.
CORE VALUES:
IndiaMART has four core values, known in short as TRIP.
Team Work: “Together we can achieve the impossible” is our belief. Our success is a
result of our team work. We have experts from the field of management, marketing, IT,
arts, content & various other disciplines who work cordially as a team on every project,
every endeavor. Dedication and passion are the true means to our mission fulfillment.
Responsible: We take responsibility not just for quality work but for continuous self-development, for our decisions and for our actions. This helps us think rationally and instils a sense of accountability to ourselves, our customers and our colleagues.
Integrity: We realize the importance of the job & information we handle. We understand
the responsibility that each member of our team has to shoulder and we do that with
highest levels of trust, honesty and integrity – of purpose and action.
Passion: Work at IndiaMART involves constant innovation and creativity. It involves a
continuous thought process to get tangible benefits for our customers, taking into account
the uniqueness of their purpose. Passionate people with a determination to make the
difference are the ones who make this possible.
Customers are of two types:
Buyers: Users who use the service with an intention to buy something.
Suppliers: Users who use the service with an intention to sell something.
A customer can be both Buyer and Supplier.
Suppliers are of two types:
Free Listed Customers: Use the basic service, which is available at zero cost.
Paid Customers: Pay for products and listing services.
IndiaMART works on a freemium model. It earns revenue from products, listing services and Buy Leads packages.
PRODUCTS:
IndiaMART offers the following products:
MDC (Mini Dynamic Catalogue): IndiaMART develops a compact 4-page home page showcasing the customer's key strengths, a Zoom-up Window for detailed product views and the Preferred Number Service. The website is hosted on a sub-domain. Also includes 10 Buy Leads/Tenders worth Rs. 2000 free every month under the IndiaMART Advantage Program.
Maximiser: Website hosted on a personalized domain. 360-degree visibility through PDF/Mobile Video (30 sec). 10 Buy Leads/Tenders worth Rs. 2000 free every week under the IndiaMART Advantage Program. Add up to 400 products. Preferred Number Service.
LISTING SERVICES:
The various listing services are as follows:
TrustSEAL: Third-party verified TrustSEAL report. An edge over non-certified competitors online; certified members attract genuine buyers and more business enquiries.
Star Supplier: Priority listing among other catalog clients. Corporate video of the supplier's company. Preferred Number Service. 15 Buy Leads/Tenders worth Rs. 3000 free every week under the IndiaMART Advantage Program.
Leading Supplier: Priority listing among all clients. Corporate video of the supplier's company. Preferred Number Service. 20 Buy Leads/Tenders worth Rs. 4000 free every week under the IndiaMART Advantage Program.
Keyword Premium Listing: Listing of clients according to the keywords (for products to be bought) typed by buyers. One keyword can be bought by a single supplier only.
Featured Premium Listing: Listing of clients according to their preferred city for business.
Industry Leader: Top-priority listing service; the supplier is always listed on top whenever a product related to that industry is searched. Only one supplier can be the Industry Leader of any particular industry.
These listing services help in search engine optimization, which improves the visibility of suppliers on the platform.
BUY LEADS:
Buy Leads provide instant access to buyers and their requirements.
Buy Leads are generated in three ways:
Free buy requirement: A buy requirement posted by buyers to IndiaMART.
Direct buy requirement: A buy requirement made by buyers directly to the suppliers.
Intent: Our system analyzes users' activities on the website and application and figures out their intent to buy a product, if any. It then creates Buy Leads and posts them to the supplier's account after verification.
These leads are posted to the supplier's account, and suppliers can buy the leads as per their requirements. A Buy Leads package thus provides a pre-paid balance of leads that can be consumed at any point of time.
ACCESS TO SERVICE:
Customers access the service from both the website and the mobile application.
IndiaMART Website: Suppliers can purchase any of the products or listing services for three different tenures, i.e. monthly, annual or 3-year (multi-year).
SOFTWARE OR LANGUAGES USED:
R
R is a free software environment and programming language for statistical analysis and graphics. The R language is widely used by statisticians and data miners for statistical analysis and statistical software development. R runs on a wide variety of UNIX platforms, Windows and macOS. I used R to perform various statistical analyses and text mining. Some examples of the libraries used are stringr, stringdist and plyr.
Version used: R 3.2.1
SQL
I used SQL to extract the data required for the analyses from the online database system. SQL (Structured Query Language) is a standard interactive and programming language for getting information from a database. Queries take the form of a command language that lets you select, insert, update and locate data, and there is also a programming interface.
DESCRIPTION OF PROJECTS / ACTIVITIES
PROJECT#1
AIM:
Analysis and implementation of a product that maps each product to its most relevant Mcat by considering the maximum string match and the maximum number of leads for that Mcat over the previous three months.
DATA DESCRIPTION:
Product Name –
Contains the list of all the product names to which we have to assign the most relevant Mcat.
Columns: PC_ITEM_GLUSR_ID, PC_ITEM_ID, PC_ITEM_NAME, MCAT_ID
Lead Name –
Contains the LEAD_OFR_TITLE against which each product name is matched; the corresponding MCAT_ID is stored in Match_Results.
Columns: ETO_OFR_TITLE, MCAT_ID, ETO_OFR_GLCAT_MCAT_NAME
Match Results –
The output, containing 7 columns with the information related to the best matches between OFR_TITLE and the product names.
Match Results Final –
Here we also considered the number of leads corresponding to the Mcat IDs selected by string matching. The two files are merged by GL_MCAT_ID and the result is sorted by the number of leads.
PROCEDURE:
1. First I removed the commas and parentheses from PC_ITEM_NAME, and then the extra spaces produced by their removal. This was done using regular expressions in R.
2. Then I removed the commas and parentheses from ETO_OFR_TITLE, and the extra spaces likewise, again using regular expressions in R.
3. Since I had to match PC_ITEM_NAME against LEAD_OFR_TITLE and quantify how much they match, I had to break PC_ITEM_NAME into smaller fragments.
4. So I split PC_ITEM_NAME into single words using the strsplit() function (base R) and put the output in a list. Splitted_Row is the list of split PC_ITEM_NAMEs.
5. Similarly, I split LEAD_OFR_TITLE and stored the output in a list. Splitted_OFR_ID is the list of split LEAD_OFR_TITLEs.
6. Next I created vectors from the columns of the Product_Name matrix and put them in a list, so that accessing the elements becomes easy.
7. Similarly, I created vectors from the columns of the Lead_Name matrix and put them in a list.
8. Then I combined these vectors using the cbind() function in R and formed two datasets, product_list1 and Lead_list1, each of which is a list of lists.
9. After all this data preparation I started with the loop. Before that, I created a null dataframe named Match_Results with the columns OP_USR_ID, OP_ITEM_ID, OP_ITEM_NAME, OP_MCAT_ID, OP_LN_OFR_TITLE, OP_LN_MCAT_NAME and OP_LN_MCAT_ID, all initialized to zero.
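The cleaning and splitting steps above can be sketched as follows; the object names follow the report, but the sample product names are hypothetical:

```r
# Minimal sketch of steps 1-5, using base R only.
product_names <- c("Stainless Steel Pipe (Round)", "Copper Wire, Insulated")

# Steps 1-2: remove commas and parentheses, then collapse the extra spaces
clean <- gsub("[,()]", " ", product_names)
clean <- trimws(gsub(" +", " ", clean))

# Step 4: split each cleaned name into single words (strsplit is base R)
Splitted_Row <- strsplit(clean, " ")
Splitted_Row[[1]]  # "Stainless" "Steel" "Pipe" "Round"
```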
Loop:
1. For a particular row of product_list1, which contains the split PC_ITEM_NAME, access its elements one by one and check whether each matches any of the elements of Lead_list1, which contains the split LEAD_OFR_TITLE.
2. Each time a match is found, increase the score s by one and check the next split word of the same PC_ITEM_NAME.
3. If the last word of the split PC_ITEM_NAME matches any fragment of the split LEAD_OFR_TITLE, increase t by 1.
4. Similarly, if the second-last word of the split PC_ITEM_NAME matches any fragment of the split LEAD_OFR_TITLE, increase v by 1.
5. After checking all the fragments of the split PC_ITEM_NAME against the split LEAD_OFR_TITLE, check the values of s, t and v.
6. To consider a LEAD_OFR_TITLE a match, it has to satisfy certain criteria: t and v must both be non-zero, meaning the last and second-last words must compulsorily match.
7. The next condition for a match is that the LEAD_OFR_TITLE must exceed a certain threshold of percentage match with PC_ITEM_NAME, which varies with the length of the PC_ITEM_NAME.
8. After deciding whether this is a match or not, go to the next LEAD_OFR_TITLE and repeat. This has to be done for every LEAD_OFR_TITLE.
9. If a particular LEAD_OFR_TITLE is considered a match, put it in OP_LN_OFR_TITLE, which starts as a null vector. Similarly, put the corresponding MCAT_ID and MCAT_NAME in the OP_LN_MCAT_ID and OP_LN_MCAT_NAME vectors respectively.
10. Likewise, when a match is found for a particular PC_ITEM_NAME, put the corresponding MCAT_ID, PC_ITEM_NAME, GLUSR_ID and PC_ITEM_ID in the null vectors OP_MCAT_ID, OP_ITEM_NAME, OP_USR_ID and OP_ITEM_ID respectively, and use the rbind() function to repeat the observations while the loop iterates over LEAD_OFR_TITLE.
11. Now use the cbind() function to combine all of the above-mentioned vectors, and name the result Match_Results_K. This is for one PC_ITEM_NAME only.
12. So, repeat this for every PC_ITEM_NAME and use the rbind() function to collect the final result in a dataframe named Match_Results.
13. Before moving to the next PC_ITEM_NAME, empty all the vectors so that they can store the values for the next PC_ITEM_NAME.
14. Finally, remove the first row of zeroes from Match_Results and merge it with the Lead_Count data by MCAT_ID; the final result, Match_Results_Final, has all the Match_Results data along with the number of leads corresponding to every PC_ITEM_NAME.
15. Then I exported this data in CSV format using the write.csv() command in R, sorted the output in Excel by PC_ITEM_NAME, and added a level for the number of leads.
16. This gave the final output, containing each PC_ITEM_NAME and all the LEAD_OFR_TITLEs considered matches, sorted by the number of leads; the LEAD_OFR_TITLE with the maximum number of leads comes on top for a particular PC_ITEM_NAME.
Output columns: OP_LN_MCAT_ID, OP_USR_ID, OP_ITEM_ID, OP_ITEM_NAME, OP_MCAT_ID, OP_LN_OFR_TITLE, OP_LN_MCAT_NAME, NO._OF_LEADS
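The scoring loop described above can be sketched as a small function; here s counts total word matches, t flags a match on the last word and v on the second-last word. The 50% threshold below is an illustrative assumption, since the report says it varies with the length of the product name:

```r
# Hedged sketch of the match criteria in steps 1-7 above.
match_score <- function(prod_words, lead_words) {
  s <- sum(prod_words %in% lead_words)                 # total word matches
  n <- length(prod_words)
  t <- as.integer(prod_words[n] %in% lead_words)       # last word matched?
  v <- if (n > 1) as.integer(prod_words[n - 1] %in% lead_words) else 1L
  # criteria: last and second-last words must match, plus the % threshold
  list(score = s, is_match = t > 0 && v > 0 && s / n >= 0.5)
}

res <- match_score(c("Steel", "Pipe"), c("MS", "Steel", "Pipe", "Fitting"))
res$is_match  # TRUE
```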
COMPARISON WITH THE CURRENT SEARCH ALGORITHM:
When our algorithm, which maps a product to the Mcat that gets the maximum number of leads, was ready, Mr. Samarendra Pratap (AVP, Product Management, IndiaMART) gave us a list of 9000 products.
We had to run our algorithm on these products and find the difference in the number of leads between the Mcat assigned to a product by our algorithm and the one assigned by the current search algorithm, which is live in the system.
DATA DESCRIPTION:
1. Samar_Products
Table containing all the products with more than one Mcat assigned; elements in the first column repeat themselves (instead of being left blank) so that merging is possible when required.
2. Lead_Count
The master Mcat table, containing the number of leads corresponding to every Mcat.
3. paid_supplier_new_products_mcat1
All the products with more than one Mcat; items in the left column repeat themselves.
4. Samar_Products_Max_Leads
Data containing each product and the corresponding Mcat that comes on top in search results when searched on the IndiaMART portal.
5. Somesh_Final_Results
Data of all the products and all the corresponding Mcats with the number of leads.
6. Somesh_Final_Max_Results
A subset of Somesh_Final_Results, containing only the Mcats with maximum leads.
7. Samar_Max_Results
Data containing the number of leads corresponding only to the Mcat that comes on top in search results.
8. paid_supplier_new_products_mcat
The original dataset of 9k products, with blank rows removed.
9. Samar_Final_Max_Results
Data containing each product, the corresponding top-of-search Mcat and its lead count.
10. paid_supplier_new_products_mcat2
The topmost Mcat corresponding to each product.
LOOP:
1. First I removed the products that have only one Mcat, because in that case no comparison can be made.
2. To facilitate merging, I had to repeat the product in the 1st column for each of its corresponding Mcats. For this, I checked whether the first column was blank while the Mcat column had a value; if so, I copied the product value from the previous row into the current cell.
3. I then merged this with the Lead_Count table, which contains the number of leads for each Mcat.
4. The result is a table containing all the products with their corresponding Mcats and the number of leads.
5. Finally, I kept only the Mcat with the maximum leads for each product.
6. Also create a table containing, for each product, only the single Mcat that comes on top when searching on the IndiaMART portal.
7. Merge this table with the Lead_Count data by Mcat ID. Now we have each product, the Mcat that comes on top when searched on the IndiaMART portal, and the corresponding number of leads.
8. After all this, we can compare the number of leads between the Mcat assigned by our algorithm and the one that comes on top in search.
RESULT:
After analyzing and comparing the output files, it was found that on average 142 leads come to the Mcat that appears on top in search.
If instead we apply our algorithm and assign to the same product a different Mcat with maximum leads, the average number of leads would be 321.
The gain is therefore 179 leads per product, which should make suppliers much happier and more engaged on the IndiaMART portal.
PROJECT#2
AIM:
Analysis of the auto-rejection of intent-generated leads by matching their secondary Mcats against deleted leads; also, to find the potential loss that would occur if the auto-rejection system were implemented.
DATA DESCRIPTION:
Deleted_total.csv
The list of all the deleted leads with codes 1 and 45, which imply manual deletion and auto deletion respectively. The corresponding deletion date is also mentioned (1st-7th May).
Columns: FK_GLUSR_USR_ID, ETO_OFR_FENQ_DATE, FK_GLCAT_MCAT_ID
Approved_total.csv
This is the data of leads which were approved in the same period. From this we can infer the
potential loss by matching the secondary Mcats.
Secondary_Mcat.csv
The data of all the leads that were live during 1st-7th May.
From this data we can check whether there are live leads with the same secondary Mcats as the deleted leads; that many leads could have been auto-rejected.
We can also check for the potential loss by finding live leads with the same secondary Mcats as the approved leads; that many approved leads would have been auto-rejected if we implemented the auto-rejection system.
mydata
The output file, which has the following columns: USR_ID, OFR_ID, MCAT_ID
PROCEDURE:
To find the leads which could have been auto-rejected:
1. To find all the leads that could be auto-rejected, I first checked for the cases in the deleted leads that had the same secondary Mcats.
2. I then checked another condition: leads with a secondary Mcat match must also have the same GLUSR_ID.
3. The next condition to check was that the lead must have been offered before the deletion.
4. Since the Secondary_Mcat table also contains some primary Mcats, the final condition was to ignore any case where the match was on a primary Mcat.
To check for potential loss:
1. Here I followed the same steps as mentioned above; the only change is to use the Approved_table instead of the Deleted_table.
2. The approved table contains all the approved leads. I applied the same conditions as above and found the leads with matching secondary Mcats.
3. These matches imply that this many approved leads would be auto-rejected if we implemented the auto-rejection system; in other words, they quantify the potential loss.
4. For both of the above activities, the output was in the format described for mydata above.
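The matching conditions above can be sketched as one function; all table and column names here (GLUSR_ID, SEC_MCAT_ID, OFR_DATE, DEL_DATE) are assumptions standing in for the real schema:

```r
# A live lead qualifies for auto-rejection if it shares a secondary Mcat and
# GLUSR_ID with a deleted lead, was offered before the deletion, and the
# match is not on a primary Mcat.
could_auto_reject <- function(live, deleted, primary_mcats) {
  hits <- merge(live, deleted, by = c("GLUSR_ID", "SEC_MCAT_ID"))
  hits <- hits[hits$OFR_DATE < hits$DEL_DATE, , drop = FALSE]   # offered before deletion
  hits[!(hits$SEC_MCAT_ID %in% primary_mcats), , drop = FALSE]  # ignore primary matches
}

live <- data.frame(GLUSR_ID = c(1, 1, 2), SEC_MCAT_ID = c(10, 11, 10),
                   OFR_DATE = as.Date(c("2016-05-02", "2016-05-02", "2016-05-06")))
deleted <- data.frame(GLUSR_ID = c(1, 2), SEC_MCAT_ID = c(10, 10),
                      DEL_DATE = as.Date(c("2016-05-05", "2016-05-05")))
could_auto_reject(live, deleted, primary_mcats = c(11))
# only user 1's Mcat-10 lead qualifies
```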
RESULT:
1. The exact number of deleted leads that could have been auto-rejected was found to be 2690.
2. The exact number of approved leads that would have been auto-rejected by implementing the auto-rejection system was found to be 8440.
3. Therefore, the loss from rejecting good leads is greater than the cost saved by rejecting false-positive leads, so it was decided not to implement the auto-rejection system.
PROJECT#3
AIM:
For a particular ticket raised by a customer, find which of the standard issues are present in the description of the ticket by string matching. The expected output is a matrix in which each row corresponds to a Ticket_ID and each column to a standard issue that could be the reason for ticket generation; a column contains 1 if that issue is present, else 0.
DATA DESCRIPTION:
Somesh_Ticket:
Data related to the tickets raised by suppliers. It contains their user IDs, customer ticket IDs, the date of the ticket issue, the ticket detail, the appendum and the history of the ticket.
Columns: GLID, TICKET_ID, ISSUE_DATE, TICKET_DETAIL, APPENDUM, TICKET_HISTORY
Result:
Output table containing all the above-mentioned columns plus new columns holding 1s and 0s. These new columns were created to capture particular issues, if present in the text provided by a particular customer:
STOP, NO_BENEFIT, NO_MATURITY, NO_TIME_TO_USE, FAKE_BUYERS, IRRELEVANT_ENQUIRIES, FAKE_BL, NOTICE_PERIOD, BUSINESS_CLOSED, HYPER_LOCAL_BUYER, MISCOMMITMENT, PHYSICAL_VISIT, NOT_TECH_SAVY, BUYER_WANT_LOW_PRICE, LANGUAGE_BARRIER, CHANGES_REQUIRED, WRONG_PRODUCT, WRONG_IMAGE, WRONG_CATALOG, CHANGE_NUMBER, CHANGE_EMAIL, CLOSE_ACCOUNT, NOT_INTERESTED, REMOVE_PRICE, DIFFERENT_CATALOG, CHANGE_AFTER_APPROVAL
PROCEDURE:
1. First I created a null dataframe with the above-mentioned column names, so that they represent the different standard issues, and initially assigned them the value 0.
2. The list of standard issues is as follows:
1. "Stop the service / Deactivate the service"
2. "Did not get benefit / No benefit / Did not get business"
3. "Did not get maturity / No maturity / Maturity issues / Deal not maturing"
4. "No time to use the service"
5. "Buyers not responding / Fake buyers / Fake leads"
6. "Irrelevant enquiries / Less enquiries / Bulk enquiries / Low enquiry / Retail enquiry / Wrong enquiry" (enquiry = inquiry = query)
7. "Wrong buy leads / Fake buy lead / Wrong BL" (Buy Lead = BL)
8. "Notice period"
9. "Business closed / Out of India / Partnership issue / Personal reason / Changed my business"
10. "Need hyper-local buyer / Hyper-local enquiries"
11. "Wrong commitment from sales / Miscommitment / Mis-commitment"
12. "Physical visit"
13. "Client is not tech savvy / computer savvy"
14. "Buyer quoted a very low price / Buyer asking low price / Buyer wants low price"
15. "Language barrier / Tamil"
16. "Changes required"
17. "Wrong product"
18. "Wrong image"
19. "Wrong catalog"
20. "Change number"
21. "Change email"
22. "Close the account"
23. "Not interested"
24. "Remove price"
25. "Not the same catalog that I approved"
26. "Change after approval / Change after hosting approval"
3. In every ticket description I checked for these strings using the grepl() function in R. Each string has a corresponding column in the output dataframe.
4. If a string was found to be present, I put 1 in that column for that particular row; else I put zero.
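Steps 3-4 can be sketched as follows; the patterns and sample ticket texts below are abbreviated, hypothetical examples rather than the full production pattern list:

```r
# Flag each standard issue with grepl() and write 1/0 into its column.
issues <- c(
  STOP        = "stop the service|deactivate",
  NO_BENEFIT  = "no benefit|did not get business",
  FAKE_BUYERS = "fake buyer|fake lead|buyers not responding"
)

tickets <- data.frame(
  TICKET_ID = c(1, 2),
  TICKET_DETAIL = c("Please stop the service, no benefit at all",
                    "Buyers not responding to my calls"),
  stringsAsFactors = FALSE
)

for (col in names(issues)) {
  tickets[[col]] <- as.integer(grepl(issues[col], tolower(tickets$TICKET_DETAIL)))
}
tickets[, c("TICKET_ID", "STOP", "NO_BENEFIT", "FAKE_BUYERS")]
```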
PROJECT#4
AIM:
To predict whether a customer is going to renew his subscription or not.
DATA DESCRIPTION:
Complete file.csv
Data containing information about the customers, e.g. their turnover value, the state from which they operate, how many employees they have, and their business type, i.e. manufacturer, wholesale trader, service provider etc.
PROCEDURE:
Import the data file in R.
Consider the input columns as the deciding variables.
Create a model to predict whether an existing customer is going to renew his subscription at the end of his subscription cycle or not.
Use the C5.0 decision tree in R to create a large set of rules, which are used for the final predictions as stated above.
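A hedged sketch of this procedure using the C50 package is below; the synthetic data and column names (Turnover, Employees, BusinessType, Renewed) are assumptions standing in for the real columns of Complete file.csv:

```r
# C5.0 renewal model: rules = TRUE extracts a rule set from the tree,
# as described in the procedure above.
library(C50)

set.seed(1)
customers <- data.frame(
  Turnover     = runif(100, 1, 100),
  Employees    = sample(1:50, 100, replace = TRUE),
  BusinessType = factor(sample(c("manufacturer", "wholesale trader",
                                 "service provider"), 100, replace = TRUE))
)
customers$Renewed <- factor(ifelse(customers$Turnover > 50, "yes", "no"))

model <- C5.0(Renewed ~ ., data = customers, rules = TRUE)
summary(model)                    # inspect the extracted rules
predict(model, customers[1:5, ])  # predicted renew / not-renew
```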
PROJECT#5
AIM:
To find the most recurring brands for a particular Mcat so that they can form a separate category; also to find the most-asked-for specifications of products, so that only these specifications are made compulsory for the agent to enquire about, dropping the less important ones to reduce the calling cost.
DATA DESCRIPTION:
Direct_Text:
Table containing only 2 columns, i.e. the lead description and the Mcat of direct leads. The description is free text and contains all the brands and specifications that we have to extract.
Brand_Data:
The table containing the lead description, the brand name and the Mcat name (3 columns). The brand names (2nd column) have been extracted from the 1st column of the table above.
Brand_Data_Result:
The output table which contains, along with the 3 columns of Brand_Data mentioned above, all the specifications corresponding to that lead, e.g. size, quantity, budget etc.
PROCEDURE:
To find the Brand Names:
1. First I was instructed to look for brand names in the lead description text.
2. I considered only two columns, namely ETO_OFR_DESC and MCAT.
3. In the description column of this new sheet I searched for the word "Brand", to find the leads in which some brands were indeed mentioned.
4. We then considered the text after the word "Brand" up to the start of the next line; this captures multiple brand names separated by "and", commas and "or".
5. Splitting the text on these separators isolates the individual brands.
6. Put each of these brands and its corresponding Mcat in a separate row.
7. Remove the rows containing wrongly captured brand names, e.g. 'any', 'other' and 'all'.
8. Remove the brands that start with a number, except "3m", because that's a brand.
9. At this point we have all the genuine brands and the corresponding Mcats. The output is in the following format:
Description | Brand_Name | Mcat_Name
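The brand-extraction steps can be sketched as follows; the sample description text is hypothetical:

```r
# Steps 3-8: find "Brand", take the text up to the line break, split on
# "and"/","/"or", then drop junk values and number-leading non-brands.
desc <- "Need steel pipes\nBrand: Tata and Jindal, any\nQuantity: 500"

after_brand <- sub(".*Brand[:]?\\s*", "", desc)   # text after the word "Brand"
after_brand <- sub("\n.*", "", after_brand)       # stop at the new line
brands <- unlist(strsplit(after_brand, "\\s*(,|\\band\\b|\\bor\\b)\\s*"))
brands <- brands[!tolower(brands) %in% c("any", "other", "all", "")]
brands <- brands[!grepl("^[0-9]", brands) | tolower(brands) == "3m"]
brands  # "Tata" "Jindal"
```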
To find the specifications:
Here a quite similar procedure was followed. In this case I first split the text by ":" using the strsplit() function in R, because almost all the specifications have this in common, e.g. Budget: 50000 INR, Model: 5690 etc.
1. The split text is now in list format. Check whether the length of this list is less than 2. If yes, keep only the first three columns in the final output; otherwise, split the list by the newline character.
2. After doing this, access the elements of the list one by one and attach the last word of each string to the first word of the next consecutive string. This attachment can be done with the paste() function in R.
3. Put the result of the attachment in a different column of the result. It was decided that the number of columns can go up to a maximum of 10, so that almost all the specifications can be captured.
4. In a particular lead the word "Brand:" can appear anywhere: before, after or in between the specifications. To keep the specifications separate, I looked for the word "Brand:" in every column of a particular row and, wherever I found it, swapped that value with the value in the 4th column, so that all the specifications end up in the 5th to 10th columns and not before.
5. The header of the output is in the following format. The 2nd and 4th columns have the same value; it is just that the brand name in the 4th column comes after "Brand:".
Description | Brand | Mcat | Brand: | Spec1 | Spec2 | Spec3 | Spec4 | Spec5 | Spec6
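The ":" split at the heart of this procedure can be sketched as follows; the sample description is hypothetical:

```r
# Break each line of the description on ":" and keep the lower-cased
# specification name, dropping its value.
desc <- "Budget: 50000 INR\nModel: 5690\nBrand: Tata"

lines <- unlist(strsplit(desc, "\n"))
parts <- strsplit(lines, ":")                         # c(name, value) per line
spec_names <- tolower(trimws(sapply(parts, `[`, 1)))  # keep only the names
spec_names  # "budget" "model" "brand"
```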
The summary of the results:
1. Using a pivot table, we generated a summary with the counts of all the Mcats mapped to a particular brand.
2. Similarly, we created another summary sheet containing the counts of the brands mapped to a particular Mcat.
To find the count of Brands and Specifications for a particular Mcat:
In this part the required output is in the following format:
Mcat_Name | Count_Of_All_Brands | Brand_Name | Individual_Brand_Count | Specifications | Specification_Freq
1. The output of the previous activity has been used as input for this one, with certain
modifications.
2. The first modification required was to remove the values attached to the specifications; e.g.
for "Budget: 50000 INR" I had to remove the value after ":" so that only the specification name
remains, which can then be counted easily.
3. For this I split the specification text on ":" using the strsplit() function in R, which returns a
list. In this list the first element is the specification name, e.g. "Budget", and the second
element is the value of that specification, i.e. "50000 INR". The first element is what we needed to keep
instead of the whole text.
4. The specification names also needed to be converted to lower case so that "Budget" and
"budget" are not treated as different from each other and, when converted into factors, give a
cumulative count.
5. Then I removed those specifications which were purely numeric in nature and occur due to
errors.
6. This was all the data preparation needed for this activity. I then created an empty
dataframe called Result which had the 6 columns mentioned above, all of them
initialized to zero.
7. The specification file has the following format.
Mcat Brand: Spec1 Spec2 Spec3 Spec4 Spec5 Spec6
8. To find the total count of brands, I checked how many times a particular Mcat in the first
column repeated itself.
9. Now put all the specifications in a vector b. It contains repeated specifications, which means
that all the specifications of every brand for a particular Mcat are in this vector.
10. Now put all the brands in a vector a. It contains repeated brands, which means that all the
brands for a particular Mcat are in this vector.
11. After all this we have to find the individual brand count and the individual specification
count for a particular Mcat.
12. For this I used the count() function from the plyr library in R. This function returns a
dataframe with the frequencies of the different values.
13. I converted the elements of vectors a and b to factors so that the count() function can
return the dataframes containing the frequencies of the values. These dataframes are
named summary_a and summary_b.
14. Now unfactor these dataframes using the unfactor() function from the varhandle library of R, so that
the elements in these dataframes can be accessed and put into the final result sheet.
15. To know how many rows are required in the final result for a particular Mcat, find the
maximum among length(summary_a[[1]]), length(summary_b[[1]]) and count.
16. Finally, put the values from summary_a and summary_b into the final result dataframe along
with their total and individual counts corresponding to the particular Mcat.
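Steps 9-16 rely on plyr::count() and varhandle::unfactor(); the same frequency dataframes can be sketched with base R alone. The vectors a and b below hold made-up values, not the project's data:

```r
# Made-up brand vector (a) and specification vector (b) for one Mcat.
a <- c("Samsung", "LG", "Samsung", "Nokia")
b <- c("budget", "model", "budget", "color", "budget")

# Base-R equivalent of plyr::count() followed by varhandle::unfactor():
# a dataframe of values and their frequencies, with plain character columns.
summary_a <- as.data.frame(table(a), stringsAsFactors = FALSE)
summary_b <- as.data.frame(table(b), stringsAsFactors = FALSE)

# Step 15: rows needed in the final result for this Mcat.
count   <- length(a)  # total brand occurrences for the Mcat
max_len <- max(nrow(summary_a), nrow(summary_b), count)
```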
BIBLIOGRAPHY:
I referred to some books which provided much of the guidance for the project. Apart
from domain knowledge, these books gave deep insights into the subject.
BOOKS:
R for programmers by Norman Matloff
Introducing Python by Bill Lubanovic
APPENDIX:
#Project 1: Part 1 -
library(stringr) # str_replace_all() comes from the stringr package
# To remove the , and ( ) from the product names
Product_Name[,3] <- str_replace_all(Product_Name[,3], "[^[:alnum:]]", " ")
#After that remove extra spaces produced due to removing , and ( )
Product_Name[ ,3] <- gsub(pattern = "\\s+", replacement = " ", Product_Name[ ,3])
# We have to remove the , and ( ) from the Lead names
Lead_Name[,1] <- str_replace_all(Lead_Name[,1], "[^[:alnum:]]", " ")
#After that remove extra spaces produced due to removing , and ( )
Lead_Name[ ,1] <- gsub(pattern = "\\s+", replacement = " ", Lead_Name[ ,1])
# Now do the splitting of product names
#To produce the split text of Product_Name in list format
test = 0
for (i in 1:nrow(Product_Name)) {
#for (i in 1:20) {
print(i)
test1 = (strsplit(Product_Name[i,3], " "))
test = rbind(test, test1)
}
Splitted_Row = test[-1] #because first row is 0
#Splitted_OFR_ID is a list of split offer title names
test2 =0
for (i in 1:nrow(Lead_Name)) {
print(i)
test1 = (strsplit(Lead_Name[i,1], " "))
test2 = rbind(test2, test1)
}
Splitted_OFR_ID = test2[-1] #because first row is 0
#Creating vectors of columns of the Product_Name matrix and putting them in a list so that
# access in a list becomes easy
PN_USR_ID = list(Product_Name[ ,1])
PN_ITEM_ID = list(Product_Name[ ,2])
PN_ITEM_NAME = list(Product_Name[ ,3])
PN_MCAT_ID = list(Product_Name[ ,4])
#Creating vectors of columns of the Lead_Name matrix and putting them in a list so that
# accessing the elements becomes easy
LN_OFR_TITLE = list(Lead_Name[ ,1])
LN_MCAT_ID = list(Lead_Name[ ,2])
LN_MCAT_NAME = list(Lead_Name[ ,3])
#combining the data
product_list1 = list( PN_USR_ID = PN_USR_ID, PN_ITEM_ID = PN_ITEM_ID,
PN_ITEM_NAME = PN_ITEM_NAME,PN_MCAT_ID= PN_MCAT_ID,
pn_splitted = Splitted_Row )
Lead_list1 = list( LN_OFR_TITLE= LN_OFR_TITLE, LN_MCAT_ID = LN_MCAT_ID,
LN_MCAT_NAME = LN_MCAT_NAME,
ln_id_splitted = Splitted_OFR_ID )
# Loop to search matches b/w Splitted_Row and Splitted_OFR_ID
# Initialization values
s = 0
t=0
v=0
J=0
flag = 0
count = 0
count1 = 0
Match_Results = cbind(OP_USR_ID=0, OP_ITEM_ID=0,OP_ITEM_NAME=0,OP_MCAT_ID=0,
OP_LN_OFR_TITLE=0,OP_LN_MCAT_NAME=0,OP_LN_MCAT_ID=0)
for( i in 1:300)
#for(i in 1:length(product_list1$pn_splitted))
{ print(i)
count1 = 0
#print(length(product_list1$pn_splitted[[i]]))
J = length(product_list1$pn_splitted[[i]])
for (k in 1:length(Lead_list1$ln_id_splitted))
{
s = 0
t = 0
for ( j in 1:length(product_list1$pn_splitted[[i]]))
{
for (l in 1:length(Lead_list1$ln_id_splitted[[k]]))
{
# compulsory match for last word
if(identical( product_list1$pn_splitted[[i]][J],
Lead_list1$ln_id_splitted[[k]][l] ) == TRUE)
{ t = t+1 }
# compulsory match for second last word
if(identical( product_list1$pn_splitted[[i]][J-1],
Lead_list1$ln_id_splitted[[k]][l] ) == TRUE)
{v = v+1}
if( identical( product_list1$pn_splitted[[i]][j],
Lead_list1$ln_id_splitted[[k]][l] ) == TRUE )
{
s = s + 1
break
#print(s)
} } }
if(J==1)
{ r = 1 }
if(J == 2) {
r = 1
}
if(J == 3) {
r = .65
}
if(J == 4) {
r = .7
}
if(J == 5) {
r = .8
}
if(J == 6) {
r = .6
}
if( (s/J >= r | s >= 4) & t!=0 & v!=0 )
{
print(k)
count1 = count1 + 1
if(flag == 0)
{
OP_MCAT_ID = product_list1$PN_MCAT_ID[[1]][i]
# Returning item name of that item
OP_ITEM_NAME = product_list1$PN_ITEM_NAME[[1]][i]
# Returning user id of that item
OP_USR_ID = product_list1$PN_USR_ID[[1]][i]
# Returning item id of that item
op_ITEM_ID = product_list1$PN_ITEM_ID[[1]][i]
# Returning OFFER TITLE
OP_LN_OFR_TITLE = Lead_list1$LN_OFR_TITLE[[1]][k]
# Returning Lead Mcat Id
OP_LN_MCAT_ID = Lead_list1$LN_MCAT_ID[[1]][k]
# Returning Lead Mcat Name
OP_LN_MCAT_NAME = Lead_list1$LN_MCAT_NAME[[1]][k]
flag = 1
}
else {
# Returning mcat Id of that item
OP_MCAT_ID1 = product_list1$PN_MCAT_ID[[1]][i]
OP_MCAT_ID = rbind(OP_MCAT_ID,OP_MCAT_ID1)
OP_ITEM_NAME1 = product_list1$PN_ITEM_NAME[[1]][i]
OP_ITEM_NAME = rbind(OP_ITEM_NAME, OP_ITEM_NAME1)
# Returning user id of that item
OP_USR_ID1 = product_list1$PN_USR_ID[[1]][i]
OP_USR_ID = rbind(OP_USR_ID, OP_USR_ID1)
# Returning item id of that item
op_ITEM_ID1 = product_list1$PN_ITEM_ID[[1]][i]
op_ITEM_ID = rbind(op_ITEM_ID, op_ITEM_ID1)
# Returning OFFER TITLE
OP_LN_OFR_TITLE1 = Lead_list1$LN_OFR_TITLE[[1]][k]
OP_LN_OFR_TITLE = rbind(OP_LN_OFR_TITLE, OP_LN_OFR_TITLE1)
MCAT_NUM = as.numeric(Lead_list1$LN_MCAT_ID[[1]][k])
OP_LN_MCAT_ID1 = MCAT_NUM
OP_LN_MCAT_ID = rbind(OP_LN_MCAT_ID, OP_LN_MCAT_ID1)
# Returning Lead Mcat Name
OP_LN_MCAT_NAME1 = Lead_list1$LN_MCAT_NAME[[1]][k]
OP_LN_MCAT_NAME = rbind(OP_LN_MCAT_NAME, OP_LN_MCAT_NAME1)
}}}
Match_Results_K = cbind( OP_USR_ID, op_ITEM_ID,
OP_ITEM_NAME, OP_MCAT_ID, OP_LN_OFR_TITLE,
OP_LN_MCAT_NAME,OP_LN_MCAT_ID)
Match_Results_K <- subset(Match_Results_K, !duplicated(Match_Results_K[,7]))
Match_Results = rbind(Match_Results,Match_Results_K)
OP_LN_OFR_TITLE = NULL
OP_LN_MCAT_ID = NULL
OP_LN_MCAT_NAME = NULL
OP_USR_ID = NULL
OP_MCAT_ID = NULL
op_ITEM_ID = NULL
OP_ITEM_NAME = NULL
# Counting the products for which there is no match
if(count1 == 0)
{ count = count + 1
} }
# To remove the first row of zeroes from the result
Match_Results = Match_Results[ -1, ]
Match_Results_final = merge(Match_Results, Lead_Count,
by.x="OP_LN_MCAT_ID", by.y = "GLCAT_MCAT_ID")
# Finally putting the no. of leads corresponding to different Mcat
#IDs in the Match_Results
# Using merge function
#write.csv(Match_Results_final ,"Match_Results_83.csv")
Part 2 : Comparison
# To remove the products with only one Mcat
for (i in 1:length(paid_supplier_new_products_mcat[[1]]))
#for(i in 29464:29469)
{ print(i)
if(paid_supplier_new_products_mcat[i,1] != "" &
paid_supplier_new_products_mcat[i+1,1] != "" )
{
paid_supplier_new_products_mcat[i,1] = ""
paid_supplier_new_products_mcat[i,2] = ""
} }
# Remove the blank rows in Excel
write.csv(paid_supplier_new_products_mcat, "Samar_Products.csv")
# To facilitate merging Product should repeat in 1st column for the corresponding Mcats
for (i in 1:length(paid_supplier_new_products_mcat1[[1]]))
{
print(i)
if(paid_supplier_new_products_mcat1[i,1] == "")
{
paid_supplier_new_products_mcat1[i,1] = paid_supplier_new_products_mcat1[i-1,1]
} }
Somesh_Results1 = merge(paid_supplier_new_products_mcat1, Lead_Count,
by.x="Mcat", by.y = "GLCAT_MCAT_NAME")
write.csv(Samar_Final_Max_Results, "Samar_Final_Max_Results.csv")
# To interchange the 1st and 2nd columns
for (i in 1:length(Results[[1]]))
{
print(i)
temp = Results[i,1]
Results[i,1] = Results[i,2]
Results[i,2] = temp
}
colnames(Results) = c( "Product","Mcat" , "GLCAT_MCAT_ID", "JFM.Approved")
write.csv(Results, "Somesh_Final_Results.csv")
write.csv(Samar_Final_Max_Results, "Samar_Final_Max_Results.csv")
# To consider only the maximum leads Mcats
for (i in 1:length(paid_supplier_new_products_mcat1[[1]]))
{ print(i)
if(paid_supplier_new_products_mcat1[i,1] == "")
{
paid_supplier_new_products_mcat1[i,2] = ""
} }
write.csv(paid_supplier_new_products_mcat1, "test.csv")
write.csv(Somesh_Results1, "Samar_Final_Max_Results.csv")
# To consider only the maximum leads Mcats
for (i in 1:length(paid_supplier_new_products_mcat2[[1]]))
{
print(i)
if(paid_supplier_new_products_mcat2[i,1] == "")
{
paid_supplier_new_products_mcat2[i,2] = ""
} }
write.csv(paid_supplier_new_products_mcat2, "paid_supplier_new_products_mcat2.csv")
Samar_Final_Max_Results = merge(paid_supplier_new_products_mcat2, Lead_Count,
by.x="Mcat", by.y = "GLCAT_MCAT_NAME")
Somesh_Results1 = merge(paid_supplier_new_products_mcat1, Lead_Count,
by.x="Mcat", by.y = "GLCAT_MCAT_NAME")
Somesh_Final_Results = Somesh_Results1
write.csv(Somesh_Final_Max_Results, "Somesh_Final_Max_Results.csv")
Results = merge(Samar_Products, Lead_Count, by.x="Mcat", by.y = "GLCAT_MCAT_NAME" )
write.csv(Results, "Somesh_Results.csv")
# Code for trimming
for (i in 1:length(paid_supplier_new_products_mcat1[[1]])) {
print(i)
paid_supplier_new_products_mcat1[i,2] = trimws(paid_supplier_new_products_mcat1[i,2])
}
for (i in 1:length(Samar_Products[[1]])) {
print(i)
Samar_Products[i,2] = trimws(Samar_Products[i,2])
}
Project #2:
Loop to search in deleted leads data -
for(i in 1:length(deleted_total[ ,1]))
{
for (j in 1:length(secondary_mcat[ ,1]) )
{
if(deleted_total[i,1] == secondary_mcat[j,1]
&&
deleted_total[i,3]== secondary_mcat[j,3] && secondary_mcat[j,4] <= deleted_total[i,2]
&& secondary_mcat[ j,3]!= secondary_mcat[ j,5])
{
print(i)
USR_ID = secondary_mcat[j,1]
OFR_ID = secondary_mcat[j,2]
}
write.csv(Brand_Data,"Brand_Data.csv")
To find the specifications:
#Declaring 2 lists
list1 = list()
list2 = list()
# Declaring a null dataframe
Brand_Data_Result = data.frame(cbind( A=NULL,B=NULL, C=NULL, D=NULL, E=NULL,
F=NULL, G=NULL, H=NULL, I=NULL))
for (i in 1:length(Brand_Data[[1]])) {
print(i)
a = as.character(Brand_Data[i,1])
#Split the text based on the ":"
if(grepl("Brand:", a))
{ b = strsplit(a, ":") }
#Split the text based on the ":-"
else if(grepl("Brand:-",a))
{ b = strsplit(a, ":-") }
#Split the text based on the "-"
else if(grepl("Brand-",a))
b = strsplit(a, "-")
#Split the text based on the " -"
else if(grepl("Brand -",a))
b = strsplit(a, " -")
#Split the text based on the " :-"
else if(grepl("Brand :-",a))
b = strsplit(a, " :-")
#Split the text based on the " :"
else if(grepl("Brand :",a))
b = strsplit(a, " :")
else b = a
if( length(b[[1]])<2 )
{
# when the length of the split text is less than 2, just consider the first 3 columns of the input,
# in which the 2nd one already contains the brand name
Brand_Data_Result[i,1] = Brand_Data[i,1]
Brand_Data_Result[i,2] = Brand_Data[i,2]
Brand_Data_Result[i,3] = Brand_Data[i,3]
next
# Go for the next iteration ( next i )
}
for (j in 1:length(b[[1]]))
{
# split the text based on "\n"
c = strsplit(b[[1]][j], "\n")
list1[j]= c
}
# Attach the last word of one string to the first word of the next string
for (k in 1:(j-1)) {
d = paste(list1[[k]][length(list1[[k]])], list1[[k+1]][1], sep = ":")
list2[k]= d
}
Brand_Data_Result[i,1] = Brand_Data[i,1]
Brand_Data_Result[i,2] = Brand_Data[i,2]
Brand_Data_Result[i,3] = Brand_Data[i,3]
for(l in 1:length(list2))
{
Brand_Data_Result[i,l+3]=list2[l]
}
list1 = NULL
list2 = NULL}
Brand_Data_Result = Brand_Data_Result[ ,1:10]
# Code to rearrange the rows of the brand data result so that the 4th column contains only the
# brand name, nothing else
for (p in 1:length(Brand_Data_Result[[1]])){
print(p)
if(grepl("Brand", Brand_Data_Result[p,4]))
{
next
}
for (q in 5:10) {
if(grepl("Brand", Brand_Data_Result[p,q]))
{
temp = Brand_Data_Result[p,q]
Brand_Data_Result[p,q]= Brand_Data_Result[p,4]
Brand_Data_Result[p,4] = temp }}}
write.csv(Brand_Data_Result, "Brand_Data_Result.csv")
To find the count of Brands and Specifications for a particular Mcat :
Brand_Data1 = Brand_Specifications_Result
Brand_Data1[is.na(Brand_Data1)] = ""
# Code to rearrange the rows of the brand data result so that the 4th column contains only the
# brand name, nothing else
library(varhandle) # unfactor() comes from the varhandle package
a = unfactor(a)
for (p in 1:length(a[[1]]))
{
if(p%%100 ==0)
{print(p)}
if(grepl("Brand", a[p,3]))
{
next
}
for (q in 3:8)
{
if(grepl("Brand", a[p,q]))
{
temp = a[p,q]
a[p,q]= a[p,3]
a[p,3] = temp }}}
library(stringr)
list_splitted = list()
list3 = list()
test = list()
Result = data.frame()
Brand_Name = 0
Mcat_Name=0
Spec1 = 0
Spec2 = 0
Spec3 = 0
Spec4 = 0
Spec5 = 0
Spec6 = 0
Spec7 = 0
# Spec# are the specification columns
for (p in 1:length(Brand_Data1[[1]]) ){
if(p%%100 ==0)
{print(p)}
if(Brand_Data1[p,2]==""){next}
# First split the text on the basis of " and " and assign that list to test
test = strsplit(Brand_Data1[p,2]," and ")
if(test[[1]][1] ==""){next}
#print(test)
#print(length(test[[1]]))
# Now split the elements of the test on the basis of ","
for(q in 1:length(test[[1]]))
{
#print(q)
#Put all the splitted elements in list3
list3[q] = strsplit(test[[1]][q],",") }
#print(list3)
for (r in 1:length(list3))
{
# check if Brand name is pure no.- then don't consider that
for(s in 1:length(list3[[r]]))
{
#if( is.na(as.numeric(list3[[r]][s])))
test1 = list3[[r]][s]
test1 = str_trim(test1)
Brand_Name = rbind(Brand_Name,test1)
test2 = Brand_Data1[p,3]
test2 = str_trim(test2)
Mcat_Name = rbind(Mcat_Name,test2)
test2 = Brand_Data1[p,4]
Spec1 = rbind(Spec1,test2)
test2 = Brand_Data1[p,5]
Spec2 = rbind(Spec2,test2)
list3 = NULL
}
# To remove the first row of zeroes
Result = Result[-1, ]
#colnames(Result) = c("Brand_Name","Mcat_Name")
Result_final_1 = Result
# To remove wrongly chosen brands
for (t in 1:length(Result[[1]]))
{
if(t%%100 ==0)
{print(t)}
if(grepl("any|other|all ",Result[t,1]) )
{
Result = Result[-t, ]
#print("Good")
}
}
# To remove "." and ":" from Brand names in the 1st column of Result, so that when grepl is
# used, some observations are not missed due to an extra "."
#Result[Result.na] = 0
Result[ ,1] = sub("[.,:'()+-]", "", Result[ ,1])
# To remove the rows which contain only "all" in brand column
for (t in 1:length(Result[[1]]))
{
if(t%%100 ==0)
{print(t)}
if(grepl("all ", Result[t,1]) | is.na(Result[t,1]))
{
Result = Result[-t, ]
#print("Good")
}}
# To do the trimming of extra spaces created due to removal of ":"
# # Code for trimming
for (i in 1:length(Result[[1]])) {
if(i%%100 ==0)
{print(i)}
Result[i,1] = trimws(Result[i,1])
}
#write.csv(Result, "Result.csv")
# To get rid of factors first save it and then import it
#write.csv(Result, "Result.csv")
Result2 = Result
#rm(Result)
library(varhandle)
Result = unfactor(Result)
#Result = read.csv(file.choose(), header = TRUE, sep = ",", stringsAsFactors = FALSE)
#Result[Result == ""] = 0
# Result1 = Result2
# Result2 = Result
# Result = Result2
# To split the Brands in Result based on " or "
# First assign Result to a different dataframe, as was done earlier for the split based on
# " and "; as such, remove the first column of Result
Result_copy1 = Result
Result = NULL
library(stringr)
list_splitted = list()
list3 = list()
test = list()
Result = data.frame()
Brand_Name = 0
Mcat_Name=0
Spec1 = 0
Spec2 = 0
Spec3 = 0
Spec4 = 0
Spec5 = 0
Spec6 = 0
Spec7 = 0
# Now make the Result null because the output will be stored in it
#colnames(Result_copy1) = c("P","M")
for (p in 1:length(Result_copy1[[1]]) )
# for ( p in 3:4)
{
if(p%%100 ==0)
{print(p)}
# First split the text on the basis of " and " and assign that list to test
test = strsplit(Result_copy1[p,1]," or ")
if(test[[1]][1]==""){next}
#print(test)
#print(length(test[[1]]))
# Now split the elements of the test on the basis of ","
for(q in 1:length(test[[1]]))
{
#print(q)
#Put all the splitted elements in list3
list3[q] = strsplit(test[[1]][q],",")
#print(list3[q])
}
#print(list3)
for (r in 1:length(list3))
{
# check if Brand name is pure no.- then don't consider that
# To remove the first row of zeroes
Result = Result[-1, ]
Result[ ,1] = sub("[.,:'()+-]", "", Result[ ,1])
# To remove wrongly chosen brands
for (t in 1:length(Result[[1]]))
{
print(t)
if(grepl("any|other|all ",Result[t,1]) )
{
Result = Result[-t, ]
#print("Good")
}}
for (t in 1:length(Result[[1]])){
if(t%%100 ==0)
{print(t)}
if(grepl("all ", Result[t,1]) | is.na(Result[t,1]))
{
Result = Result[-t, ]
#print("Good")
}}
# To do the trimming of extra spaces created due to removal of ":"
# # Code for trimming
for (i in 1:length(Result[[1]])) {
if(i%%100 ==0)
{print(i)}
Result[i,1] = trimws(Result[i,1]) }
Result4 = Result
# To remove the Brands which starts with a number
for (i in 1:length(Result[[1]]))
# for(i in 1:10 )
{
if(i%%100 ==0)
{print(i)}
if(substr(Result[i, 1], 1, 2)== "3m" |
is.na(as.numeric(substr(Result[i,1],1,1))))
{ next }
if(!is.na(as.numeric(substr(Result[i,1],1,1))))
{
Result[i, ] = "" } }
Result_Final = Result
d = Result_Final
#write.csv(Result_Final, "Result_Final.csv")
# Remove the first column in Excel. Again import that data as the final input for the count of
# specifications
# read.csv(file.choose(), header = TRUE, sep = ",", stringsAsFactors = FALSE)
# Now the final code to get the result in a given format
# Format:-
# For a particular Mcat, get all the brands and their individual counts; also get the count of all the
# specifications for the same Mcat
# Input:- Brand_Specifications_Final_Result
Specifications = Brand_Specifications_Final_Result
# First take the specifications and split on ":" to consider only Specification not the value
# Now convert these specifications to factors so that count becomes easy
for (i in 1:length(Specifications[[1]]))
{
print(i)
for (j in 3:8)
{
if(Specifications[i,j]!="")
{
a = strsplit(Specifications[i,j],":")
Specifications[i,j] = a[[1]][1]
}
a = NULL }}
# To remove the brands with " etc" string
for (i in 1:length(Specifications[[1]]))
{
print(i)
if(grepl(" etc| china",Specifications[i,2] ))
{
a = strsplit(Specifications[i,2]," etc")
Specifications[i,2] = a[[1]][1] }
a = NULL
}
# To remove the brands which has only "etc" string
for (i in 1:length(Specifications[[1]]) )
{
print(i)
if(Specifications[i,2] !="etc" & Specifications[i,2] !="china"
& Specifications[i,2] !="chinese")
{
next }
else { Specifications = Specifications[-i, ]}
}
# To make specifications lower case so that we don't get different and less counts for the
# same specification and also remove the numbers from the specifications
for (i in 1:length(Specifications[[1]]))
{
print(i)
for (j in 3:8)
{
if(is.na(as.numeric(Specifications[i,j])))
{
Specifications[i,j] = tolower(Specifications[i,j])
}
else { Specifications[i,j] = "" }
if(!grepl("price|budget", Specifications[i,j]))
{
next
}
else { Specifications[i,j] = "" } }}
write.csv(Specifications,"Specs_Final_Result.csv")
# Final Loop
library(plyr) # For count function
library(varhandle) # For unfactor function
Result = data.frame()
Mcat_Name = 0
Total_Brand_Count = 0
Brand_Name = 0
Brand_count = 0
Specs = 0
Specs_Count = 0
count = 1
test1 = 0
test2 = 0
a = NULL
b = NULL
c = NULL
d = NULL
for (i in 1:length(Specifications[[1]]))
#for(i in 1:30)
{
print(i)
temp = Specifications[i,2]
a = c(a,temp )
#print(temp)
# a is the vector of brand names
#print(a)
if(Specifications[i,1] == Specifications[i+1,1])
{
count = count + 1
#print(count)
}
for (j in 3:7)
{
#print("yes")
if(Specifications[i,j]!="")
{
temp = Specifications[i,j]
b = c(b,temp)
#print(b)
} }
# When Mcat Changes
if(Specifications[i,1]!=Specifications[i+1,1])
{
Mcat = Specifications[i,1]
if(length(a)!=0)
{
a = factor(a)
summary_a = count(a)
summary_a = unfactor(summary_a)
}
if(length(b)!=0)
{
b = factor(b)
summary_b = count(b)
summary_b = unfactor(summary_b)
}
#print(summary_b)
# To know how many rows are required for a particular Mcat
max_len = max(length(summary_a[[1]]), length(summary_b[[1]]), count)
for (k in 1:max_len)
{ Mcat_Name = rbind(Mcat_Name, Mcat )
# To get the Brand Names and their count
if(k<= length(summary_a[[1]]))