The document summarizes a term project analyzing product data scraped from The Body Shop website. It describes scraping 326 products, collecting 12 data points per product, and analyzing the dataset using Pandas. It identifies popular products and ingredients, visualizes the dataset to show category and sub-category trends, and explores correlations between variables like reviews and ratings. The project encountered challenges with website changes and complex HTML, but was able to mine useful insights from the scraped Body Shop data.
3. Introduction
– The Body Shop International Limited, trading as The Body
Shop, is a cosmetics, skin care and perfume company which
is a subsidiary of Brazilian company Natura & Co.
– In this presentation, we’ll show how we performed web
scraping using Python 3 and the BeautifulSoup library on the
Body Shop website.
– We’ll scrape all product information from
https://www.thebodyshop.com/en-us/ and then
analyze it using the Pandas library.
5. Popular products
Body butters (including Moringa, Satsuma, Strawberry, Olive, Shea, Mango and Coconut)
Body products such as body scrub, body butter and bath lilies
Cosmetics (including mascara, lipstick, lip gloss, eye shadow and cotton rounds)
Full skin care ranges (including Tea tree, Vitamin C, Vitamin E, Aloe vera and Seaweed)
Men's skin care (including maca root and white musk)
Hair care (including their famous Banana shampoo and Banana conditioner)
Fragrances (Women's and Men's)
Bath products including shower gels and solid soaps
6. Web Scraping
We parsed all main categories and scraped their product information.
We were able to fetch about 326 products from the website.
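The scraping step can be sketched roughly as below. This is a minimal illustration, not the project's actual scraper: the HTML snippet, the tag names, the class names, and the helper `parse_products` are all assumptions made up for the example, since the slides do not show the site's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one category page.
# The real site's markup is not shown in the slides and will differ.
SAMPLE_HTML = """
<ul class="products">
  <li class="product" data-category="Body">
    <span class="name">Shea Body Butter</span>
    <span class="price">$24.00</span>
  </li>
  <li class="product" data-category="Body">
    <span class="name">Mango Body Scrub</span>
    <span class="price">$26.00</span>
  </li>
</ul>
"""

def parse_products(html):
    """Return one dict per product tile found in the page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tile in soup.find_all("li", class_="product"):
        name = tile.find("span", class_="name")
        price = tile.find("span", class_="price")
        rows.append({
            "Main Category": tile.get("data-category"),
            # Guard against missing tags so one broken tile does not stop the run.
            "Product Name": name.get_text(strip=True) if name else None,
            "Prices": price.get_text(strip=True) if price else None,
        })
    return rows

products = parse_products(SAMPLE_HTML)
print(products)
```

Returning `None` for a missing tag, rather than raising, keeps a run alive when the page layout changes between visits.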
8. Columns in our dataset
Item Number
Main Category
Sub-Category
Product Name
Main Ingredient
Reviews Count
Ratings out of 5
Sizes Available
Sizes
Prices
Short Description
Long Description
328 Rows, 12 Columns
10. Challenges
External sites can change without warning.
During the holiday season, the website changed frequently, which
often broke the scrapers
HTML tags were confusing and difficult to dig through
‘Reviews Count’ and ‘Ratings’ were difficult to scrape
‘Main Ingredient’ sometimes failed to scrape, which
resulted in a large number of NAs
Nested sizes and prices columns were difficult to mine or explode
12. What are the most popular Main Ingredients?
There are 31 unique Main Ingredients
This bar plot shows the top 5 most popular Main Ingredients:
Aloe Vera
Shea
Marula
Organic Alcohol
Honey
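A ranking like this one can be computed in Pandas as sketched below. The rows are toy values invented for the example, not the scraped data; only the column name `Main Ingredient` comes from the dataset described in the slides.

```python
import pandas as pd

# Toy rows; values are illustrative, not taken from the scraped dataset.
df = pd.DataFrame({
    "Main Ingredient": [
        "Aloe Vera", "Shea", "Aloe Vera", "Honey", "Shea", "Aloe Vera", None,
    ],
})

# nunique() ignores missing values by default.
unique_count = df["Main Ingredient"].nunique()

# value_counts() sorts by frequency, so head(5) gives the top 5.
top = df["Main Ingredient"].value_counts().head(5)

print(unique_count)
print(top)
```

The resulting Series can be passed to `top.plot.bar()` to draw a bar plot, assuming matplotlib is installed.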
13. Total products in each Main Category?
The BODY category has the most products, with 36.89% of all
products.
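A percentage share per category can be derived as in this small sketch. The category labels other than Body and Makeup, and all the counts, are assumptions chosen for the example, so the numbers will not match the 36.89% reported on the slide.

```python
import pandas as pd

# Toy data; labels and counts are illustrative only.
df = pd.DataFrame({
    "Main Category": ["Body", "Body", "Face", "Makeup", "Body", "Hair", "Face", "Body"],
})

# normalize=True returns fractions; multiply by 100 for percentages.
share = (df["Main Category"].value_counts(normalize=True) * 100).round(2)

print(share)
```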
14. What are the most popular Sub-Categories?
This horizontal bar plot shows the top 5 sub-categories:
Lotions & Creams
Hand Creams
Body Butter
Body Wash
Lip Balm
15. What are the top 10 most-reviewed products?
What are the top 25 highest-rated products?
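Both rankings can be obtained with a sort on the relevant column. The sketch below uses four made-up products and takes the top 2 instead of the top 10 or 25; breaking rating ties by review count is an assumption of this example, not something the slides state.

```python
import pandas as pd

# Toy rows; product names and numbers are invented for the example.
df = pd.DataFrame({
    "Product Name": ["A", "B", "C", "D"],
    "Reviews Count": [120, 15, 300, 42],
    "Ratings out of 5": [4.5, 4.9, 4.2, 4.9],
})

# Most-reviewed products: largest values of "Reviews Count".
top_reviewed = df.nlargest(2, "Reviews Count")

# Highest-rated products: sort by rating, then by review count to break ties
# (the tie-break rule is an assumption).
top_rated = df.sort_values(
    ["Ratings out of 5", "Reviews Count"], ascending=False
).head(2)

print(top_reviewed)
print(top_rated)
```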
16. Data Mining
How many products
are Vegan or Vegetarian?
Created a function to mine the
“Long Description” text
Created a new column called
“Cruelty-Free” to show whether the
product is Vegan or Vegetarian
Pie chart showing the percentage of
products in each cruelty-free category.
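One way such a mining function could look is sketched below. The keyword matching, the label "Not stated", and the function name `classify_description` are assumptions for illustration; the slides do not show how the original function decided between the categories.

```python
import pandas as pd

def classify_description(text):
    """Label a description as 'Vegan', 'Vegetarian' or 'Not stated' (hypothetical rule)."""
    if not isinstance(text, str):
        return "Not stated"  # missing descriptions carry no information
    lowered = text.lower()
    if "vegan" in lowered:
        return "Vegan"
    if "vegetarian" in lowered:
        return "Vegetarian"
    return "Not stated"

# Toy descriptions; the text is invented for the example.
df = pd.DataFrame({
    "Long Description": [
        "A rich body butter. 100% vegan.",
        "Suitable for vegetarians; contains honey.",
        "A refreshing shower gel.",
        None,
    ],
})

df["Cruelty-Free"] = df["Long Description"].apply(classify_description)

# Percentage of products per label, the kind of input a pie chart needs.
shares = df["Cruelty-Free"].value_counts(normalize=True) * 100

print(df["Cruelty-Free"].tolist())
print(shares)
```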
18. Pairplot or Correlogram
Using Seaborn
Shows a scatter plot for
each pair of numerical variables
No linear trend is visible here
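A visual impression like "no linear trend" can be cross-checked numerically. The sketch below swaps the plot for a Pearson correlation matrix computed with Pandas; it is a companion check, not the method used in the slides, and the numbers are toy values. On the real data, `seaborn.pairplot(df)` would draw the scatter grid itself.

```python
import pandas as pd

# Toy numeric columns; values are invented for the example.
df = pd.DataFrame({
    "Reviews Count": [10, 200, 35, 80, 5],
    "Ratings out of 5": [4.8, 4.4, 4.9, 4.1, 4.6],
})

# Pearson correlation between every pair of numeric columns.
# Values near 0 suggest no linear relationship; values near +1 or -1 suggest a strong one.
corr = df.corr(method="pearson")

print(corr)
```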
19. Data Slicing and Dicing
Exploding nested columns
The Sizes column was a string containing all sizes
Created a function to clean it and convert it to a list
Exploded the new list column into separate rows
Grouped the rows for each Main Category by three sizes: Large, Medium and
Small
Visualized the grouped data
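The clean, explode, and group steps can be sketched as follows. The raw string format of `Sizes`, the helper names `sizes_to_list` and `size_bucket`, and the millilitre thresholds for Small, Medium and Large are all assumptions; the slides do not state how the sizes were stored or bucketed.

```python
import pandas as pd

def sizes_to_list(raw):
    """Turn a string like "['50 ml', '200 ml']" into ['50 ml', '200 ml'] (assumed format)."""
    if not isinstance(raw, str):
        return []
    parts = raw.strip("[] ").split(",")
    cleaned = [part.strip(" '\"") for part in parts]
    return [part for part in cleaned if part]

def size_bucket(size):
    """Map a size such as '200 ml' to a group; the thresholds are hypothetical."""
    amount = float(size.split()[0])
    if amount < 100:
        return "Small"
    if amount < 300:
        return "Medium"
    return "Large"

# Toy rows; values are invented for the example.
df = pd.DataFrame({
    "Main Category": ["Body", "Makeup", "Body"],
    "Sizes": ["['50 ml', '200 ml']", "['10 ml']", "['400 ml']"],
})

df["Sizes"] = df["Sizes"].apply(sizes_to_list)

# explode() gives each list element its own row; empty lists become NaN and are dropped.
exploded = df.explode("Sizes").dropna(subset=["Sizes"]).reset_index(drop=True)
exploded["Size Group"] = exploded["Sizes"].apply(size_bucket)

# Count products per category and size group.
counts = exploded.groupby(["Main Category", "Size Group"]).size()

print(counts)
```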
21. Visualizing grouped sizes
Two example categories:
Body and Makeup
Small-sized products are popular in Makeup,
while Body sells more medium-sized and some large-sized
products