A data analytics portfolio demonstrates applied technical skills, a thorough grasp of statistical analysis, and strong data visualization abilities.
Statistics can be used to describe patterns but need context to avoid being misleading. While averages, measures of spread, and probabilities help summarize data, graphs are better to show trends over time. Pie charts, bar graphs, and maps can effectively visualize data geographically or by category when formatted properly. Advanced statistical software and websites provide cutting-edge tools for analysis and interactive graphics but improper use can result in poor statistical reasoning.
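As a minimal sketch of how an average alone can mislead without context, here is a stdlib-only Python example with invented salary figures: one outlier pulls the mean far above what a typical value looks like, while the median resists it.

```python
import statistics

# Hypothetical salaries: one executive outlier pulls the mean far above
# what a "typical" employee earns, while the median resists it.
salaries = [42_000, 45_000, 47_000, 48_000, 50_000, 52_000, 400_000]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
spread = statistics.stdev(salaries)  # sample standard deviation

print(f"mean={mean:,.0f}  median={median:,.0f}  stdev={spread:,.0f}")
```

Reporting only the mean here would suggest a typical salary near double what most people in the list actually earn, which is exactly the kind of context-free summary the paragraph above warns about.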
BIG DATA | How to explain it & how to use it for your career? (Tuan Yang)
If you ask people what BIG DATA is, they often say it is about a lot of data. But the world has ALWAYS had a lot of data. It is about datafication – a word so new that even spellcheckers don't recognize it as a real word!
Learn more about:
» How BIG DATA changes the career paths of even the most unsuspecting
» How BIG DATA changes the way business decisions are made
» How BIG DATA changes who makes those decisions & the reshuffle of the balance of power it causes
» What BIG DATA skills you can bring to the office tomorrow to increase your value to the firm
A presentation on the uses & misuses of data, with illustrations & examples, given at the Numis Securities Media Conference in London in April 2011
The Art of Storytelling Using Data Science (Gramener)
Gramener's VP - Sales, APAC Region, Vijayam Sirikonda interacted with the students of IIM Raipur and talked about the importance of data storytelling for business users.
This document outlines an analysis of health insurance rate data from Healthcare.gov to identify key factors that influence individual rates. The analysis included downloading nationwide data from Healthcare.gov, selecting Delaware data, cleaning the data, and performing various analyses including decision trees, partial least squares, and neural networks. The analysis found that age, insurance plan version number (whether a plan was marked up or down), and insurance issuer were the most significant factors in determining individual health insurance rates in Delaware.
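The study's decision-tree step can be illustrated in miniature. The sketch below is not the study's actual pipeline: it is a single-split regression "stump" on invented toy records (the column names and values are placeholders) that picks whichever feature best explains the rate, the same criterion a full tree applies recursively.

```python
# Illustrative decision-tree stump: pick the feature whose best single
# split most reduces squared error in the target. Toy data only.

def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, target, feature):
    """Lowest post-split SSE over all thresholds of `feature`."""
    best = float("inf")
    values = sorted({r[feature] for r in rows})
    for t in values[:-1]:
        left = [r[target] for r in rows if r[feature] <= t]
        right = [r[target] for r in rows if r[feature] > t]
        best = min(best, sse(left) + sse(right))
    return best

# Invented records: the rate rises mainly with age, weakly with plan version.
rows = [
    {"age": 25, "plan_version": 1, "rate": 210},
    {"age": 35, "plan_version": 2, "rate": 260},
    {"age": 45, "plan_version": 1, "rate": 330},
    {"age": 55, "plan_version": 2, "rate": 420},
    {"age": 64, "plan_version": 1, "rate": 510},
]

scores = {f: best_split(rows, "rate", f) for f in ("age", "plan_version")}
most_significant = min(scores, key=scores.get)
print(most_significant)  # a real tree would repeat this split recursively
```

On this toy data the stump selects age, mirroring the study's finding that age dominated the rate; a production analysis would of course use a full tree library and the real Delaware data.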
Hedge Fund case study solution - Credit default swaps execution system and Gr... (Naveen Kumar)
I designed the entire end-to-end trading architecture of a hedge fund.
This included the execution system integrating the fund with credit default swap capabilities, and a solution to the hedge fund's liquidity constraint in moving funds across countries.
Graphic Representation Grading Guide COMTM541 Version 22.docx (whittemorelucilla)
The document provides a grading guide for an assignment that requires students to create two graphs or tables from the same data set that tell different stories. The guide outlines the criteria that will be used to evaluate the students' work, including choosing an appropriate data source, creating graphs that illustrate opposing stories, comparing the results, and addressing ethical implications. It also provides guidelines for formatting, citations, grammar, and other writing conventions. The overall purpose is for students to think critically about how data can be represented in a way that tells different or misleading stories, and to analyze graphs by considering the motivation and perspective of their creators.
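The assignment's core idea, that the same numbers can tell opposing stories, can be shown numerically. In this hedged sketch (invented figures), the only thing that changes between the "two graphs" is where the y-axis starts:

```python
# Same two data points; only the y-axis baseline differs between "charts".
old, new = 100.0, 105.0

def visual_ratio(a, b, axis_min):
    """Apparent bar-height ratio when the y-axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)

honest = visual_ratio(old, new, axis_min=0)      # bars look nearly equal
truncated = visual_ratio(old, new, axis_min=95)  # same 5% gain looks doubled
print(f"full axis: {honest:.2f}x  truncated axis: {truncated:.2f}x")
```

A 5% increase drawn on a full axis looks like a 1.05x bar; drawn on an axis starting at 95 it looks like a 2x bar, which is one concrete way a chart's creator's motivation shows up in its design.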
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one. (Francesco D'Orazio)
"Big data" has been around for a few years now but for every hundred people talking about it there’s probably only one actually doing it. As a result Big Data has become the preferred vehicle for inflated expectations and misguided strategy.
As always, language holds the key and the seed of the issue is reflected in the expression itself. "Big Data" is not so much about a quality of the data or the tools to mine it, it’s about a new approach to product, policy or business strategy design. And that’s way harder and trickier to implement than any new technology stack.
In this talk I look at where Big Data is going, what the real opportunities, limitations and dangers are, and what we can do to stop talking about it and start doing it today.
Statistics is a mathematical science comprising methods of collecting, organizing, and analyzing data in such a way that meaningful conclusions can be drawn from them. In general, its investigations and analyses fall into two broad categories: descriptive and inferential statistics.
AMES 2016 - The Human Side of Analytics (Stephen Tracy)
The document provides 10 tips for analytics success. It discusses the importance of asking good questions to gain insights, thinking long-term about building an analytics program, starting with investing in people over technology, seeking truth over validating preconceptions from data, understanding data limitations, ensuring ownership of the analytics function, investing in storytellers to communicate insights, finding meaningful ways to visualize data, and transforming data into actionable insights.
This document discusses various visualizations and analyses created in TIBCO Spotfire using different public datasets. Some key points:
- Maps showing park bench locations in Rostock, Germany and combining layers from different WMS resources.
- Unemployment trends in Germany by gender with Holt-Winters forecasting.
- How Germans voted in the 2014 European elections.
- K-means clustering distinguishing east and west German states using census housing data.
- Hierarchical clustering of BI product usage patterns from a survey.
- Scraping privacy data from a website directly into Spotfire using import.io.
- Double WMS layer map showing German population density overlaid on rivers.
- Text
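The K-means step in the list above can be sketched with the standard library alone. This 1-D toy with invented values is only a stand-in for the multivariate census housing data the Spotfire analysis actually clustered:

```python
# Minimal 1-D k-means (k = 2) with deterministic initialization.
def kmeans_1d(points, iters=20):
    centers = [min(points), max(points)]  # the two extremes as starting centers
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two synthetic groups, loosely standing in for "east" and "west" states.
values = [0.31, 0.35, 0.38, 0.40, 0.71, 0.74, 0.78, 0.82]
centers, clusters = kmeans_1d(values)
print(sorted(round(c, 2) for c in centers))
```

The algorithm alternately assigns each point to its nearest center and moves each center to its cluster's mean, which is exactly what a tool like Spotfire does under the hood, just in more dimensions.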
Cutting Edge Predictive Analytics with Eric Siegel (Databricks)
Apache Spark empowers predictive analytics and machine learning by increasing the reach and potential. But, before jumping to new deployments, it’s critical we 1) get the analytics right and 2) not overlook less conspicuous business opportunities. In this keynote, Predictive Analytics World founder and “Predictive Analytics” author Eric Siegel ramps you up on a dangerous pitfall and a critical value proposition:
– PITFALL: Avoiding BS predictive insights, i.e., “bad science,” spurious discoveries
– OPPORTUNITY: Optimizing marketing persuasion by predicting the *influence* of marketing treatments, i.e., uplift modeling
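Uplift modeling, the opportunity named above, is often introduced via the "two-model" approach: score each segment by the difference between treated and control response rates, then target only segments the treatment actually persuades. A hedged sketch with invented counts:

```python
# Two-model uplift sketch: estimated lift = treated rate - control rate.
segments = {
    # segment: (treated_buyers, treated_total, control_buyers, control_total)
    "persuadables": (120, 1000, 40, 1000),
    "sure_things": (300, 1000, 295, 1000),
    "do_not_disturb": (50, 1000, 90, 1000),
}

def uplift(tb, tn, cb, cn):
    """Estimated lift in response rate caused by the treatment."""
    return tb / tn - cb / cn

scores = {seg: uplift(*counts) for seg, counts in segments.items()}
targets = [seg for seg, s in scores.items() if s > 0]
print(scores, targets)
```

The point of uplift modeling is visible even in this toy: a raw response model would happily target the "sure things" (30% buy rate), but they would have bought anyway, while the "do not disturb" segment responds *worse* when treated.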
Big data provides an unprecedented opportunity to predict consumer behavior through the longitudinal and cross-sectional analysis of vast time series data. However, the inherent randomness of human behavior poses a limiting factor, and while marginal gains can be made through big data, breakthroughs may remain elusive as long as human behavior stays inconsistent, impulsive, and dynamic. The biggest impact of big data will be creating new areas like personalized medicine, improved customer service, and powering artificial intelligence through vast data analysis to understand and anticipate human behavior.
This document provides an overview of course materials for QNT/275 Statistics for Decision Making. It includes prompts for weekly assignments that involve defining statistics, distinguishing quantitative and qualitative data, describing data measurement levels and the role of statistics in business decision making. It also includes a business scenario and data set for analysis, as well as sample quiz questions covering topics like data types, measures of central tendency, probability, and random variables.
Data surrounds us. In business and personal life. At work and on the go. But how do we make sense of it, or more specifically, how do we allow others to make sense of it? Learn how to deliver data ... reports.
Storyfying your Data: How to go from Data to Insights to Stories (Gramener)
Gramener's Director - Client success, Shravan Kumar A, delivered an online session to the students of Praxis Business School.
In his session he talked about how converting data into stories can benefit businesses and enable quick decision making. Furthermore, he shared approaches to create data stories along with some use cases and case studies we solved at Gramener to benefit our clients.
Check out our initiative to teach data storytelling to data scientists and analysts so that they can think out of the box and create wonderful data stories for their stakeholders: https://gramener.com/data-storytelling-workshop
This document discusses the role of statistics in business decision making. It describes descriptive statistics, which presents data in a way that is easier to understand through charts and graphs. Descriptive statistics measures central tendency and the spread of data using metrics like mean, median, mode, range, and standard deviation. The document also covers inferential statistics, which analyzes data samples to estimate parameters and test hypotheses. Examples are given of how statistics are used in various business contexts like Wall Street analysis and clothing design to draw conclusions from raw data and inform future decisions.
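The descriptive/inferential split described above can be captured in one short stdlib sketch: first summarize the sample we have, then infer a rough 95% confidence interval for the unseen population mean (normal approximation, invented sample values).

```python
import statistics

# Illustrative sample, e.g. measurements of some business metric.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]

# Descriptive: summarize the data we have.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Inferential: estimate the unseen population mean from the sample.
n = len(sample)
margin = 1.96 * sd / n ** 0.5  # z-based margin; a t-multiplier is stricter
ci = (mean - margin, mean + margin)
print(f"mean={mean:.2f}, 95% CI ~ ({ci[0]:.2f}, {ci[1]:.2f})")
```

The first two lines of computation are descriptive statistics; the interval is the inferential step, generalizing from the sample to a parameter we cannot observe directly.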
WUD2008 - The Numbers Revolution and its Effect on the Web (Rich Miller)
The document discusses how the "numbers revolution" is affecting the web and user experience design through increased data collection and analysis. It covers how more data availability and analysis tools are enabling new types of applications for decision support, personalization, prediction and visualization. This is changing how people access and think about information by augmenting human cognition with computer analysis. The document provides many examples of current and emerging applications that utilize these approaches in areas like business, health, sports and media.
This document discusses the relevance and implications of forecasting retail deposits. Forecasting retail deposits involves analyzing macroeconomic data to build models that can accurately predict future deposit levels given economic conditions. Accurately forecasting deposits is important for banks to inform strategic planning and decisions around operations, technology, and infrastructure needs. The implications of deposit forecasting are discussed from social and philosophical perspectives, including how forecasting stems from humans' innate desire to understand and prepare for an uncertain future.
The most profitable insurance organizations will outperform competitors in key areas such as personalized customer service, claims processing, subrogation recovery, fraud detection and product innovation. This requires thinking beyond the traditional data warehouse to the data fabric, an emerging data management architecture.
In this webinar Andy Sohn, Senior Advisor at NewVantage Partners, and Bob Parker, Senior Director for Insurance at Cambridge Semantics, explore the role of the data discovery and integration layer in an enterprise data fabric for the Insurance industry. These are their slides.
Page 579 Assess the Constituent Data. What is included Omi.docx (bunyansaturnina)
Page 579
Assess the Constituent Data. What is included? Omitted? What are the data based on?
What assumptions are being made? Different retirement calculators give widely different
estimates of how much savings is needed for retirement because of factors they include or omit
(such as entertainment) and assumptions they make (such as inflation rate or healthiness of
annuities and mutual funds) in the calculations.
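As a quick numeric illustration of how much those assumptions matter, here is a hedged sketch (all figures invented) of the same savings plan projected with and without an inflation assumption:

```python
# Why calculators disagree: identical plan, different assumption sets.
def projected_balance(annual_saving, years, nominal_return, inflation=0.0):
    """Future balance in today's dollars, compounding yearly contributions."""
    real_rate = (1 + nominal_return) / (1 + inflation) - 1
    balance = 0.0
    for _ in range(years):
        balance = (balance + annual_saving) * (1 + real_rate)
    return balance

optimistic = projected_balance(10_000, 30, nominal_return=0.07)  # no inflation
conservative = projected_balance(10_000, 30, nominal_return=0.07, inflation=0.03)
print(f"{optimistic:,.0f} vs {conservative:,.0f}")
```

Changing a single assumption, 3% inflation versus none, cuts the projected purchasing power of the same plan by well over a third, which is why two reputable calculators can give such different answers.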
Two reputable sources can give different figures because they take their data from different
places. Suppose you wanted to know employment figures. The Labor Department’s monthly
estimate of nonfarm payroll jobs is the most popular, but some economists like Automatic Data
Processing’s monthly estimate, which is based on the roughly 20 million paychecks it processes
for clients. Both survey approximately 400,000 workplaces, but the Labor Department selects
employers to mirror the U.S. economy, while ADP’s sample is skewed, with too many
construction firms and too few of the largest employers. On the other hand, the government has
trouble counting jobs at businesses that are opening and closing, and some employers do not
return the survey. (Both organizations do attempt to adjust their numbers to compensate
accurately.)7
Check the Currency of the Data. Population figures should be from the 2010 census, not
the 2000 one. Technology figures in particular need to be current. Do remember, however, that
some large data sets are one to two years behind in being analyzed. Such is the case for some
government figures, also. If you are doing a report in 2014 that requires national education data
from the Department of Education, for instance, 2013 data may not even be fully collected. And
even the 2012 data may not be fully analyzed, so indeed the 2011 data may be the most current
available.
Hard to Quantify Sports Participation
How many people participate in sports, and which sports do they choose?
Governments and equipment makers want to know, but the data are fuzzy. Multiple
questions contribute to the lack of clarity.
What is a sport? One survey includes bird-watching.
Who should be counted? Do young children count?
How often do you have to participate in a sport to be counted? Is once a year enough?
How was the count made? Because younger and more active people tend to have only cell phones, a survey
made through landlines probably won’t be accurate.
In case you are curious, the National Sporting Goods Association survey says hiking is the most popular
participation sport in the United States, with over 40 million people.
Adapted from Carl Bialik, “Sports Results that Leave Final Score Unclear,” Wall Street Journal, June 9, 2012, A2.
Choosing the Best Data
Sometimes even good sources and authorities can differ on the numbers they offer, or on the
About Your Signature Assignment This signature assignment is de.docx (ransayo)
About Your Signature Assignment
This signature assignment is designed to align with specific program student learning outcome(s) in your program. Program Student Learning Outcomes are broad statements that describe what students should know and be able to do upon completion of their degree. The signature assignments might be graded with an automated rubric that allows the University to collect data that can be aggregated across a location or college/school and used for program improvements.
Purpose of Assignment
The purpose of this assignment is for students to synthesize the concepts learned throughout the course. This assignment will provide students an opportunity to build critical thinking skills, develop businesses and organizations, and solve problems requiring data by compiling all pertinent information into one report.
Assignment Steps
Resources: Microsoft Excel®, Signature Assignment Databases, Signature Assignment Options, Part 3: Inferential Statistics
Scenario: Upon successful completion of the MBA program, say you work in the analytics department for a consulting company. Your assignment is to analyze one of the following databases:
· Manufacturing
· Hospital
· Consumer Food
· Financial
Select one of the databases based on the information in the Signature Assignment Options.
Provide a statistical report of 12 paragraphs, each of 150 words, including the following:
· Explain the context of the case
· Provide a research foundation for the topic
· Present graphs
· Explain outliers
· Prepare calculations
· Conduct hypothesis tests
· Discuss inferences you have made from the results
This assignment is broken down into four parts:
· Part 1 - Preliminary Analysis
· Part 2 - Examination of Descriptive Statistics
· Part 3 - Examination of Inferential Statistics
· Part 4 - Conclusion/Recommendations
Part 1 - Preliminary Analysis (3-4 paragraphs)
Generally, as a statistics consultant, you will be given a problem and data. At times, you may have to gather additional data. For this assignment, assume all the data is already gathered for you.
State the objective:
· What are the questions you are trying to address?
Describe the population in the study clearly and in sufficient detail:
· What is the sample?
Discuss the types of data and variables:
· Are the data quantitative or qualitative?
· What are levels of measurement for the data?
Part 2 - Descriptive Statistics (3-4 paragraphs)
Examine the given data.
Present the descriptive statistics (mean, median, mode, range, standard deviation, variance, CV, and five-number summary).
Identify any outliers in the data.
Present any graphs or charts you think are appropriate for the data.
Note: Ideally, we want to assess the conditions of normality too. However, for the purpose of this exercise, assume data is drawn from normal populations.
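The Part 2 deliverables can all be produced with Python's standard library. The sketch below uses placeholder data, not one of the assignment databases, and includes the common 1.5*IQR convention for flagging outliers:

```python
import statistics

# Placeholder data standing in for one variable from the chosen database.
data = [23, 25, 25, 28, 30, 31, 34, 36, 40, 70]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
rng = max(data) - min(data)
stdev = statistics.stdev(data)        # sample standard deviation
variance = statistics.variance(data)  # sample variance
cv = stdev / mean                     # coefficient of variation
q1, q2, q3 = statistics.quantiles(data, n=4)  # 'exclusive' method by default
five_number = (min(data), q1, q2, q3, max(data))

# 1.5*IQR fences, one common convention for identifying outliers.
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(five_number, outliers)
```

Note that `statistics.quantiles` supports both the "exclusive" and "inclusive" quartile conventions; textbooks differ, so state which one the report uses.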
Part 3 - Inferential Statistics (2-3 paragraphs)
Use the Part 3: Inferential Statistics document.
· Create (formulate) hypotheses
· Run formal hypothesis tests
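A formal two-sample test of the kind Part 3 asks for can be built from the standard library. This sketch computes Welch's t statistic on invented placeholder samples; compare |t| to the critical value from a t table to accept or reject the null hypothesis:

```python
import statistics

# Invented samples, e.g. a metric measured under two conditions.
group_a = [82, 85, 88, 90, 91, 86, 84, 89]
group_b = [75, 78, 80, 74, 79, 77, 81, 76]

def welch_t(a, b):
    """t statistic for H0: the two population means are equal."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5  # standard error of the difference
    return (ma - mb) / se

t = welch_t(group_a, group_b)
# With roughly 13 degrees of freedom, the two-sided 5% critical value is
# about 2.16, so a |t| this large rejects H0.
print(f"t = {t:.2f}")
```

In practice a library routine (e.g. a Welch-corrected t-test) would also report the p-value; the hand computation above just makes the formula visible.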
Data Storytelling - Game changer for Analytics (Gramener)
This document discusses the importance of data storytelling and provides recommendations for data leaders. It argues that data stories are more memorable and impactful than raw data or facts. The document outlines four patterns for telling data stories and provides examples. It recommends that organizations embed design skills, automate storytelling, and embrace storytelling as part of the data insights process. Telling data stories can help people really understand data intuitively and aid decision making.
Introduction to Descriptive & Predictive Analytics (Dilum Bandara)
This document provides an introduction to descriptive and predictive analytics. It discusses key concepts including descriptive analytics which uses data aggregation and mining to provide insights into past data, predictive analytics which uses statistical models and forecasts to understand the future, and prescriptive analytics which uses optimization and simulation to advise on possible outcomes. The document also reviews basic statistical concepts such as measures of location, dispersion, shape, and association that are important for data analytics. These concepts include mean, median, standard deviation, skewness, kurtosis, and correlation.
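The shape and association measures the slides list can be computed directly from their moment definitions. This stdlib-only sketch (invented data values) shows sample skewness, excess kurtosis, and Pearson correlation:

```python
import statistics

def moments(xs):
    """Skewness and excess kurtosis from the standardized moments."""
    n = len(xs)
    m = statistics.mean(xs)
    sd = statistics.pstdev(xs)  # population sd, as the moment formulas use
    skew = sum((x - m) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - m) ** 4 for x in xs) / (n * sd ** 4) - 3  # excess kurtosis
    return skew, kurt

def pearson(xs, ys):
    """Pearson's r: covariance scaled by both standard deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs) ** 0.5
                  * sum((y - my) ** 2 for y in ys) ** 0.5)

skew, kurt = moments([1, 2, 2, 3, 3, 3, 4, 10])  # long right tail -> skew > 0
r = pearson([2, 4, 5, 6, 8], [1, 3, 4, 5, 7])    # ys moves with xs -> r near 1
print(f"skew={skew:.2f}  excess_kurtosis={kurt:.2f}  r={r:.2f}")
```

Positive skew flags the long right tail, positive excess kurtosis flags the heavy tail relative to a normal distribution, and r near 1 flags a strong linear association, the three "shape and association" checks the deck describes.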
This document discusses the four main types of analytics: descriptive, diagnostic, predictive, and prescriptive. Descriptive analytics answers the question "What happened?" by summarizing past data. Diagnostic analytics answers "Why did this happen?" by analyzing data to determine causes of trends. Predictive analytics answers "What might happen in the future?" by using statistics and modeling to predict outcomes. Prescriptive analytics answers "What should we do next?" by recommending actions based on predictive analytics. The document provides examples of each type.
Early Lessons Learned in Applying Big Data To TV Advertising (Jeff Storan)
This document discusses how Simulmedia is applying big data techniques to television advertising. It summarizes that Simulmedia has assembled a large set of television viewing data through partnerships. It uses this data and data science techniques to sell targeted television ads, gaining insights into audience fragmentation and how to better reach audiences. It also discusses some challenges in working with television data and lessons learned around quality control, the value of more data, and showing addressable TV ads can be effective.
Early Lessons Learned in Applying Big Data To TV Advertising (Jeffrey Storan)
- Simulmedia is a startup that uses big data to target TV advertising. They have assembled the world's largest set of actionable television data through partnerships with major data providers.
- Their data set includes over 200 terabytes of information on 113 million daily events and 400,000 weekly ads, which they use advanced statistical techniques and machine learning to analyze.
- Their analysis shows that while audiences are fragmenting across many channels, simple algorithms applied at large scale using their extensive data can better predict audience movement and interests than other existing tools.
Open Source Contributions to Postgres: The Basics POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Page 579Assess the Constituent Data. What is included Omi.docxbunyansaturnina
Page 579
Assess the Constituent Data. What is included? Omitted? What are the data based on?
What assumptions are being made? Different retirement calculators give widely different
estimates of how much savings is needed for retirement because of factors they include or omit
(such as entertainment) and assumptions they make (such as inflation rate or healthiness of
annuities and mutual funds) in the calculations.
Two reputable sources can give different figures because they take their data from different
places. Suppose you wanted to know employment figures. The Labor Department’s monthly
estimate of nonfarm payroll jobs is the most popular, but some economists like Automatic Data
Processing’s monthly estimate, which is based on the roughly 20 million paychecks it processes
for clients. Both survey approximately 400,000 workplaces, but the Labor Department selects
employers to mirror the U.S. economy, while ADP’s sample is skewed, with too many
construction firms and too few of the largest employers. On the other hand, the government has
trouble counting jobs at businesses that are opening and closing, and some employers do not
return the survey. (Both organizations do attempt to adjust their numbers to compensate
accurately.)7
Check the Currency of the Data. Population figures should be from the 2010 census, not
the 2000 one. Technology figures in particular need to be current. Do remember, however, that
some large data sets are one to two years behind in being analyzed. Such is the case for some
government figures, also. If you are doing a report in 2014 that requires national education data
from the Department of Education, for instance, 2013 data may not even be fully collected. And
even the 2012 data may not be fully analyzed, so indeed the 2011 data may be the most current
available.
Hard to Quantify Sports Participation
How many people participate in sports, and which sports do they choose?
Governments and equipment makers want to know, but the data are fuzzy. Multiple
questions contribute to the lack of clarity.
What is a sport? One survey includes bird-watching.
Who should be counted? Do young children count?
How often do you have to participate in a sport to be counted? Is once a year enough?
How was the count made? Because younger and more active people tend to have only cell phones, a survey
made through landlines probably won’t be accurate.
In case you are curious, the National Sporting Goods Association survey says hiking is the most popular
participation sport in the United States, with over 40 million people.
Adapted from Carl Bialik, “Sports Results that Leave Final Score Unclear,” Wall Street Journal, June 9, 2012, A2.
Choosing the Best Data
Sometimes even good sources and authorities can differ on the numbers they offer, or on the
About Your Signature Assignment (ransayo)
About Your Signature Assignment
This signature assignment is designed to align with specific program student learning outcome(s) in your program. Program Student Learning Outcomes are broad statements that describe what students should know and be able to do upon completion of their degree. The signature assignments might be graded with an automated rubric that allows the University to collect data that can be aggregated across a location or college/school and used for program improvements.
Purpose of Assignment
The purpose of this assignment is for students to synthesize the concepts learned throughout the course. This assignment will provide students an opportunity to build critical thinking skills, develop businesses and organizations, and solve problems requiring data by compiling all pertinent information into one report.
Assignment Steps
Resources: Microsoft Excel®, Signature Assignment Databases, Signature Assignment Options, Part 3: Inferential Statistics
Scenario: Upon successful completion of the MBA program, say you work in the analytics department for a consulting company. Your assignment is to analyze one of the following databases:
· Manufacturing
· Hospital
· Consumer Food
· Financial
Select one of the databases based on the information in the Signature Assignment Options.
Provide a statistical report of 12 paragraphs, each of about 150 words, including the following:
· Explain the context of the case
· Provide a research foundation for the topic
· Present graphs
· Explain outliers
· Prepare calculations
· Conduct hypotheses tests
· Discuss inferences you have made from the results
This assignment is broken down into four parts:
· Part 1 - Preliminary Analysis
· Part 2 - Examination of Descriptive Statistics
· Part 3 - Examination of Inferential Statistics
· Part 4 - Conclusion/Recommendations
Part 1 - Preliminary Analysis (3-4 paragraphs)
Generally, as a statistics consultant, you will be given a problem and data. At times, you may have to gather additional data. For this assignment, assume all the data is already gathered for you.
State the objective:
· What are the questions you are trying to address?
Describe the population in the study clearly and in sufficient detail:
· What is the sample?
Discuss the types of data and variables:
· Are the data quantitative or qualitative?
· What are levels of measurement for the data?
Part 2 - Descriptive Statistics (3-4 paragraphs)
Examine the given data.
Present the descriptive statistics (mean, median, mode, range, standard deviation, variance, CV, and five-number summary).
Identify any outliers in the data.
Present any graphs or charts you think are appropriate for the data.
Note: Ideally, we want to assess the conditions of normality too. However, for the purpose of this exercise, assume data is drawn from normal populations.
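As a sketch of the Part 2 computations, the descriptive statistics listed above can be produced with Python's standard library alone. The sample values here are made up for illustration, not taken from the Signature Assignment databases:

```python
import statistics

# Hypothetical sample of ten observations (e.g. monthly spend values)
data = [12, 15, 15, 18, 21, 24, 30, 35, 41, 55]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
data_range = max(data) - min(data)
stdev = statistics.stdev(data)        # sample standard deviation
variance = statistics.variance(data)  # sample variance
cv = stdev / mean                     # coefficient of variation

# Five-number summary: min, Q1, median, Q3, max
q1, q2, q3 = statistics.quantiles(data, n=4)
five_number = (min(data), q1, q2, q3, max(data))
```

Outliers can then be flagged with the usual 1.5 × IQR fence around Q1 and Q3.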
Part 3 - Inferential Statistics (2-3 paragraphs)
Use the Part 3: Inferential Statistics document.
· Create (formulate) hypotheses
· Run formal hyp.
Data Storytelling - Game changer for Analytics Gramener
This document discusses the importance of data storytelling and provides recommendations for data leaders. It argues that data stories are more memorable and impactful than raw data or facts. The document outlines four patterns for telling data stories and provides examples. It recommends that organizations embed design skills, automate storytelling, and embrace storytelling as part of the data insights process. Telling data stories can help people really understand data intuitively and aid decision making.
Introduction to Descriptive & Predictive Analytics (Dilum Bandara)
This document provides an introduction to descriptive and predictive analytics. It discusses key concepts including descriptive analytics which uses data aggregation and mining to provide insights into past data, predictive analytics which uses statistical models and forecasts to understand the future, and prescriptive analytics which uses optimization and simulation to advise on possible outcomes. The document also reviews basic statistical concepts such as measures of location, dispersion, shape, and association that are important for data analytics. These concepts include mean, median, standard deviation, skewness, kurtosis, and correlation.
This document discusses the four main types of analytics: descriptive, diagnostic, predictive, and prescriptive. Descriptive analytics answers the question "What happened?" by summarizing past data. Diagnostic analytics answers "Why did this happen?" by analyzing data to determine causes of trends. Predictive analytics answers "What might happen in the future?" by using statistics and modeling to predict outcomes. Prescriptive analytics answers "What should we do next?" by recommending actions based on predictive analytics. The document provides examples of each type.
Early Lessons Learned in Applying Big Data To TV Advertising (Jeff Storan)
This document discusses how Simulmedia is applying big data techniques to television advertising. It summarizes that Simulmedia has assembled a large set of television viewing data through partnerships. It uses this data and data science techniques to sell targeted television ads, gaining insights into audience fragmentation and how to better reach audiences. It also discusses some challenges in working with television data and lessons learned around quality control, the value of more data, and showing addressable TV ads can be effective.
Early Lessons Learned in Applying Big Data To TV Advertising (Jeffrey Storan)
- Simulmedia is a startup that uses big data to target TV advertising. They have assembled the world's largest set of actionable television data through partnerships with major data providers.
- Their data set includes over 200 terabytes of information on 113 million daily events and 400,000 weekly ads, which they use advanced statistical techniques and machine learning to analyze.
- Their analysis shows that while audiences are fragmenting across many channels, simple algorithms applied at large scale using their extensive data can better predict audience movement and interests than other existing tools.
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science of the 2010's, Rebecca will share lessons learned the hard way, often from watching data science projects go sideways and learning to fix broken things. Through the lens of these canon events, she'll identify some of the anti-patterns and red flags she's learned to steer around.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
06-20-2024 AI Camp Meetup: Unstructured Data and Vector Databases (Timothy Spann)
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases, and we will see how they differ from traditional databases: in which cases you need one, and in which you probably don't. I will also go over similarity search and where you get vectors from, show an example of a vector database architecture, and wrap up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI and Vector Database demo you needed for now. If not, there’s a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve, and what I should show next. Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization Hyperparameters, Regression Trees, Linear Support vector machines.
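The attribute selection measures named above, Gini impurity and entropy, can be sketched in a few lines of Python; the labels and the candidate split below are toy values, not from any course dataset:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum(p * log2 p) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy parent node of 10 samples split into two children
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

# CART greedily picks the split minimizing the weighted child impurity
weighted = (len(left) / len(parent)) * gini(left) \
         + (len(right) / len(parent)) * gini(right)
```

Here the split lowers the Gini impurity from 0.5 at the parent to 0.32 weighted across the children, which is why CART would prefer it over no split.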
2. Hello!
My name is Iryna Smologonova
I am a data analyst with a background in product
development, audit and customer service.
With my curiosity and tenacity, I make connections
between different data sets, translate data into
actionable insights and communicate the ideas to
stakeholders.
I am excited to expand my data analysis and
critical-thinking skills, and to uncover solutions
that not only solve problems from the customer's
perspective but also drive business growth.
2
Technical skills
Excel, Tableau, SQL, Python, Project management, Big data processing
Soft skills
Problem-solving, Collaboration, Leadership, Business acumen, Storytelling, Curiosity
3. Projects
3
Rockbuster Stealth
Analyzing online movie rental transactions to answer business questions
Flu Season
Analyzing regional and seasonal trends of influenza in the US
GameCo
Analyzing global video game sales
Instacart
Analyzing historical grocery order data to generate insights for Marketing
strategy
4. 4
Rockbuster Stealth – a global movie rental company
Objective
To assist with the launching strategy for
the new online video service
Perform an analysis of historical
data to identify sales trends,
customer behavior, rental duration
Develop insights and
recommendations for Rockbuster
(fictitious company)
Tools
PostgreSQL
Power Point
Tableau
Data
Rockbuster dataset
Source: PostgreSQL Tutorial
Skills
Creating data dictionary
Database querying
Joining tables
Subqueries
Common table expressions
Workflow: ERD visualization → Database querying → Summarizing & cleaning data in SQL → Filtering & grouping → Answering business questions → Data visualization → Recommending strategies
5. 5
Rockbuster Stealth SQL functions:
Aggregating, ranking, joining & grouping
SQL queries available here
Business questions
Which movies contributed the most to
revenue gain?
Which genres are the most popular?
Answers
The top 20 movies account for 6% of total revenue and 2% of the number of movies.
The top 3 genres by revenue are Sports, Sci-Fi and Animation.
Comedy, New and Sports produce more revenue per film.
6. 6
Rockbuster Stealth Business questions:
Do sales figures vary between geographic regions?
Which ratings yield the most revenue?
The top 3 countries (India, China and the United States) account for
25% of the total number of customers and of company revenue.
Ratings PG-13 & NC-17 generate the most revenue
Data visualizations created in Tableau, available here
7. 7
Rockbuster Stealth Project deliverables:
Project report
GitHub Repo
Data Dictionary
Key learning experience:
Common Table Expressions (CTEs) are more readable than
subqueries and can be reused. However, subqueries
and CTEs each have pros & cons, and the choice between them
should be made on a case-by-case basis.
SQL ranking functions allowed me to identify the top 20
movies with the highest revenue in a simple way and
made my query more readable. The RANK()/DENSE_RANK()
functions are great for sequencing and comparing data
across various factors.
A bubble chart is a solution for visualizing three metrics:
number of transactions, revenue and average revenue
per transaction. It allowed the addition of a third
dimension as bubble size/color to
emphasize the most popular genres.
Recommendations:
Focus on:
Adding to the inventory the movies that generate the most
revenue: ratings PG-13 & NC-17 and genres Sports,
Sci-Fi and Animation.
Comedy, New and Sports, as higher-generating
genres that produce more revenue per film, could be
beneficial for a pilot project.
The top 3 countries (India, China and the United States)
account for 25% of the total number of customers and of
company revenue. Therefore, I would recommend
starting the streaming service by piloting in these
countries.
Click links to check the project
8. Flu season
8
Workflow: Sourcing the proper data → Data profiling & integrity → Data quality measures → Data transformation & integration → Conducting statistical analysis → Consolidating analytical insights → Statistical hypothesis testing
Objective
To assist in the preparation of a
staffing plan in the United
States for the upcoming influenza
season:
Analyze death trends
Prioritize states with
vulnerable populations
Tools
Excel
Tableau
Data
• Influenza deaths by geography, time,
age, and gender
Source: CDC
Population data by geography
Source: US Census Bureau
Influenza lab test results by state
Source: CDC (Fluview)
Skills
Translating business requirements
Data cleaning
Data integration & transformation
Statistical hypothesis testing
Visual analysis
Forecasting
Storytelling in Tableau
Presenting results
Scenario: To help a medical staffing agency prepare for the upcoming flu season by examining trends in
influenza and how they can be used to proactively plan for clinic and hospital staffing needs across the country
Data viz & storytelling
9. 9
Flu Season
Data transformation by using pivot tables and VLOOKUP functions
Combining different data sets by utilizing common state and year/month variables
Normalizing the flu deaths data according to state populations by deriving new variables representing flu
deaths as a percentage of state population
Examining the data variability by calculating the variance and the standard deviation
The correlation coefficient between the death rate of the 65+ population and the death rate of those
below 65 was 0.79, which quantified a strong relationship: the higher the share of the vulnerable 65+
population in a state, the higher the death rate.
Null hypothesis: the flu death rate of the population 65+ years old is less than or equal to that of
people under 65 years old.
Alternative hypothesis: the flu death rate of the population 65+ years old is higher than that of
people under 65 years old.
The p-value is much less than the significance level of 0.05, so the null hypothesis is rejected.
With 95% confidence (alpha 0.05) we can say that there is a significant difference between the flu death
rate of the 65+ group and other groups.
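The correlation and one-tailed test described above can be sketched as follows; the per-state rates are invented stand-ins, not the real CDC/Census figures:

```python
import math
import statistics

# Hypothetical flu death rates per 100K residents for six states
rate_65_plus = [50.2, 61.5, 48.0, 72.3, 55.1, 66.8]
rate_under_65 = [4.1, 5.0, 3.8, 6.2, 4.5, 5.6]

# Pearson correlation between the two rates
m1 = statistics.mean(rate_65_plus)
m2 = statistics.mean(rate_under_65)
cov = sum((a - m1) * (b - m2) for a, b in zip(rate_65_plus, rate_under_65))
corr = cov / math.sqrt(
    sum((a - m1) ** 2 for a in rate_65_plus)
    * sum((b - m2) ** 2 for b in rate_under_65)
)

# Welch's t statistic for H1: mean rate of 65+ exceeds mean rate of under-65
v1 = statistics.variance(rate_65_plus)
v2 = statistics.variance(rate_under_65)
n1, n2 = len(rate_65_plus), len(rate_under_65)
t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# One-tailed critical value at alpha = 0.05 is about 2.015 for ~5 degrees
# of freedom; in practice scipy.stats would give an exact p-value.
reject_h0 = t > 2.015
```

Rejecting the null here mirrors the slide's conclusion: the 65+ rate is significantly higher at the 0.05 level.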
Transformation & integration
Statistical Analysis
Project management plan
Interim report
Click links to check the project
10. 10
Flu Season
Charts: Influenza death rate among population 65+ years old; Number of influenza deaths by state
One-year forecast of influenza deaths by state
The flu season starts in December and ends in March, with a peak in January.
The death rate forecast for the upcoming flu season is roughly the same as
the historical data.
Less populous states have the highest death rates among elderly populations
per 100K: Alaska, Hawaii, Wyoming, District of Columbia, South and North
Dakota, Vermont.
More populous states have the highest numbers of deaths: California,
New York, Texas, Pennsylvania and Florida.
Key questions:
When does the flu season start and end?
Where to focus during the flu season?
11. 11
Flu Season Project deliverables: Storytelling in Tableau – an interactive slide deck
GitHub Repo
Key learning experience:
It is important to consider data limitations and assess their
impact on the analysis and the interpretation of results.
As the analysis progresses, data limitations may become
apparent and should be added to the analysis plan.
Data mapping helps to match variables between
different data sets. Death data and population data by
state were mapped using an age variable in 10-year
ranges starting from age 5.
Normalization is a part of data preparation and allows
data in different units to be compared using the same
units. The flu deaths data was normalized according to
state populations by deriving new variables representing
flu deaths as a percentage of state population.
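The normalization step can be illustrated with a couple of hypothetical state figures (the real CDC and Census values are not reproduced here):

```python
# Invented stand-in figures for two states of very different size
flu_deaths = {"Alaska": 120, "California": 6100}
population = {"Alaska": 731_000, "California": 39_500_000}

# Normalize to deaths per 100,000 residents so states of different
# sizes can be compared in the same units
deaths_per_100k = {
    state: flu_deaths[state] / population[state] * 100_000
    for state in flu_deaths
}
```

With these made-up numbers the small state ends up with the higher rate despite far fewer raw deaths, which is exactly the pattern the slide reports for the less populous states.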
Recommendations:
Focus on:
6 top states with the highest death rate:
Alaska, Hawaii, Wyoming, District of Columbia,
South and North Dakota, Vermont
5 top states with the highest elderly
populations:
California, Texas, Florida, New York and
Pennsylvania
the influenza season months:
Dec-Mar with a peak in Jan
Click links to check the project
12. GameCo – an online rental video game company
12
Workflow: Data exploration → Data cleaning → Grouping data → Descriptive analytics → Developing insights → Proposal report → Viz data insights
Objective
Perform a descriptive analysis of
an online video game sales data
set to foster a better
understanding of how GameCo’s
(fictitious company) new games
might fare in the market.
Compare historical regional sales
assumptions with the reality of
current market conditions.
Tools
Excel
Power Point
Data
VGSales
Skills
Grouping data
Summarizing data
Descriptive analysis
Visualizing results in Excel
Presenting results
13. GameCo
13
Line graph: Regional Sales as a Percentage of Global Sales by year (NA sales, EU sales, JP sales)
Key question: How have their sales figures varied between geographic regions over time?
Column chart: Sales by region, 2008 vs 2016. North America: 52% → 32%; Europe: 27% → 38%; Japan: 9% → 19%.
Assumption:
The initial understanding of the business was that
video game sales had remained consistent across
regions over time.
Testing:
• Plotting each region as a percentage of global
sales in a line graph reveals that sales in Europe
overtook North America in 2016.
• The column chart compares regional sales
in 2008 vs 2016: the share of North
America's sales dropped by 20%, while in Europe and Japan
it increased by 11% and 10% respectively.
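The share-of-global-sales calculation behind these charts can be sketched in a few lines; the regional figures below are the slide's own percentages standing in for the underlying VGSales data:

```python
# Yearly regional sales (here in $ millions, chosen so each year totals 100)
sales = {
    2008: {"NA": 52.0, "EU": 27.0, "JP": 9.0, "Other": 12.0},
    2016: {"NA": 32.0, "EU": 38.0, "JP": 19.0, "Other": 11.0},
}

def shares(year):
    """Each region's sales as a percentage of that year's global sales."""
    total = sum(sales[year].values())
    return {region: 100 * v / total for region, v in sales[year].items()}

# Change in each region's share of global sales, 2008 -> 2016
change = {r: shares(2016)[r] - shares(2008)[r] for r in sales[2008]}
```

The computed changes reproduce the slide's comparison: North America down 20 points, Europe up 11, Japan up 10.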
14. 14
Pie chart: New games by Genre 2015-2016. Action 54%, Role-Playing 17%, Sports 14%, Shooter 10%, Fighting 5%.
GameCo Key question: Are certain types of games more popular than others?
Clustered bar chart: Regional Sales by Genre 2015-2016, Sales in Millions ($), for the genres Shooter, Action, Sports, Role-Playing, Fighting, Misc and Racing (NA, EU and JP sales).
The clustered bar chart was created in Excel to
represent the regional sales across different video game
genres.
The pie chart shows the percentage breakdown of new
game sales for the last two years.
Insights:
• Shooter dominates the North American market
• Action is popular in all regional markets and has
the highest number of new games
• Role-Playing is the second leader in Japan
15. 15
GameCo Project deliverables:
Market: North American. Goal: refocus the budget allocation to stabilize sales, preventing further decline. Target audience: current and former customers, using the large historical customer base. Actions: launch direct marketing campaigns promoting new games in the Shooter, Action and Sports genres.
Market: European. Goal: support the sales with slightly increased budget resources to keep the growing trend over time. Target audience: loyal customers, plus acquiring new customers. Actions: promote new games intensively via marketing campaigns such as BOGO (buy one game in the Shooter/Sports genre and get one in the Action/Role-Playing genre at a certain discount).
Market: Japanese. Goal: allocate additional resources to promotion and attracting new customers, to continue the growing trend started last year. Target audience: current customers, plus attracting new ones. Actions: advance promotions using last year's approach, keeping the main accent on new Action and Role-Playing games.
Proposal report
GitHub Repo
Key learning experience:
It is important to test
the assumptions that go into the
analysis. This allows one to determine whether
conclusions are correctly drawn
from the results of the analysis.
The goal of the project, the
regional customers, sales over
time and the best-selling genres were
taken into consideration to
develop a regional approach to
setting goals, focusing on a target
audience and developing
recommendations for marketing
activities.
Recommendations:
Click links to check the project
16. 16
Instacart – an online grocery store that operates via an app
Objective
To assist with identifying sales
patterns for better segmentation:
Explore historical data to
define buying trends and
customer behavior
Select sub-groups of
customers and analyze their
ordering habits
Tools
Python
Jupyter Notebook
Pandas & NumPy libraries
Matplotlib, seaborn & pyplot
Excel
Data
Skills
Data cleaning &
wrangling
Data merging
Deriving new variables
Aggregating
Population flows
Workflow: Data wrangling & subsetting → Data consistency check → Combining & exporting data → Deriving new variables → Grouping & aggregating → Excel report → Data viz with Python
Datasets: Orders, Products, Departments, Customers
17. 17
Instacart Population flows
Merging the datasets
The project started with cleaning,
organising and merging the data before
conducting the analysis.
Data wrangling procedure:
- dropping and renaming columns in the “orders” dataset
- renaming columns and changing data types in the “products” and “customers” datasets
- transposing the “departments” dataset
- merging the data together
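A minimal sketch of the merge-and-check step, assuming pandas and invented column names rather than the real Instacart schema:

```python
import pandas as pd

# Tiny stand-ins for the "orders" and "products" files
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_id": [10, 10, 20],
    "order_day_of_week": [0, 3, 6],
})
products = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Bananas", "Milk"],
    "prices": [0.5, 3.2],
})

# indicator=True adds a _merge column, a quick consistency check
# that every order row found a matching product
merged = orders.merge(products, on="product_id", how="left", indicator=True)
full_match = (merged["_merge"] == "both").all()
```

Any rows flagged "left_only" would indicate orphaned product IDs to investigate before the analysis proceeds.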
18. 18
Instacart Deriving variables & crosstabs
Flag creation:
A spending flag was defined by spending amount: less than $8 vs. $8 or more.
The product price range was divided into 4 categories: High-range products over $15, Mid-
high range between $10 & $15, Mid-low range between $5 & $10, and Low-range
products equal to or less than $5.
Crosstab calculation:
To display the number of orders made for
every day of the week, a flag ‘busiest days’ and
a variable ‘order_day_of_week’ were used.
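The flag and crosstab logic above can be sketched with pandas; the prices are made-up examples and the labels follow the four bands described:

```python
import pandas as pd

prices = pd.Series([2.0, 7.5, 12.0, 19.0])

# Price-range flag matching the four bands: <=5, 5-10, 10-15, >15
price_range = pd.cut(
    prices,
    bins=[0, 5, 10, 15, float("inf")],
    labels=["Low-range", "Mid-low range", "Mid-high range", "High-range"],
)

# Spending flag: under $8 vs $8 and over
spending_flag = prices.apply(
    lambda p: "Low spender" if p < 8 else "High spender"
)

# Crosstab counts orders in each price-range / spending-flag combination
counts = pd.crosstab(price_range, spending_flag)
```

The same pattern with ‘busiest days’ against ‘order_day_of_week’ yields the day-of-week order counts described on the slide.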
19. 19
Instacart Visualization:
Matplotlib
Seaborn
Business question
What differences can you find in
ordering habits of different
customer profiles?
Answer
High-income customers
prefer products from the alcohol
and pets departments.
Affluent customers prefer
products from the meat & seafood
and canned goods departments.
There is no clear preference for
middle-income customers.
Low-income customers prefer
products from the snacks,
beverages and breakfast
departments.
Business question
What different classifications does the
demographic information suggest?
Answer
Middle-income and affluent customers across
all age groups make the majority
of orders.
Young and middle-aged customers
with a middle income level are the
core of the customer base.
20. 20
Instacart
Key learning experience:
Running out of RAM during code execution can be
substantially mitigated by converting data
types, sampling the data or restarting kernels.
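The data-type conversion mentioned above can be sketched with pandas (the column names are illustrative, not the real Instacart schema):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1000),
    "reordered": [0, 1] * 500,
})

before = df.memory_usage(deep=True).sum()

# Downcast: small-valued integer columns don't need 64 bits
df["order_id"] = pd.to_numeric(df["order_id"], downcast="unsigned")
df["reordered"] = df["reordered"].astype("int8")

after = df.memory_usage(deep=True).sum()
```

On the full Instacart files, with tens of millions of rows, this kind of downcasting cuts memory use severalfold and is often the difference between a crashing kernel and a completed run.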
A vast collection of libraries helps with exploring,
cleaning large data sets and creating
visualizations in a simple way
Deriving new variables and creating crosstabs
open up many opportunities to find insights and
communicate them via visualizations, using
different features to create informative,
customized and appealing plots that present data
in the most simple and effective way.
Recommendations:
The least busy days are the middle of the week: Tuesday
and Wednesday. Ads should run Monday-
Wednesday with the target of increasing the number of orders on
Tuesday and Wednesday.
To boost sales, focus on middle-income customers with the profiles
‘young single’ and ‘single parent’. Target younger customers
with a low income level with lower-priced products from the snacks,
beverages and breakfast departments.
Promote to high-income and affluent customers high-range
products from the meat & seafood, canned goods, alcohol and pets
departments, since they have more potential capability to buy a
group of products.
Project deliverables:
GitHub Repo
Excel report
Click links to check the project