Exploratory Data Analysis of Worldwide Startup Companies using Python

Exploratory Data Analysis for Machine Learning on: Worldwide Startup Companies

Fitrie Ratnasari
28th November, 2020

Dataset Description & Initial Planning

The dataset comes from Crunchbase in CSV format and covers startup companies worldwide,
recorded from 1902 until 2014. It consists of 54,294 entries (rows) and 38 attributes whose
data types are either object or float, as the picture below shows:
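
As a minimal sketch of this first inspection, the snippet below loads a tiny stand-in for the Crunchbase CSV and prints its shape and data types. The two sample rows and the in-memory `raw_csv` buffer are invented for illustration; the padded header names mirror the ones mentioned later in this report.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the real file (54,294 rows, 38 columns).
raw_csv = io.StringIO(
    "name, market , funding_total_usd ,status,country_code,seed\n"
    'Acme, Software ," 1,750,000 ",operating,USA,0.0\n'
    'Bolt, Games ," - ",closed,GBR,50000.0\n'
)

df = pd.read_csv(raw_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # text columns versus float64 columns
df.info()         # non-null counts hint at missing values
```

With the real file, `raw_csv` would simply be replaced by the path to the downloaded CSV.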

The objective of this project is to uncover the insights hidden in the dataset for the years
1902 to 2014: when each company was founded, its status, corporate actions such as Merger &
Acquisition (M&A) or going public, and its funding history, that is, whether it received seed,
angel, or venture capital investment.
For further analysis, we would like to estimate a company's probability of success from its
status, market, and funding received, by examining the correlation among attributes.
Various studies define a successful startup through a two-way strategy that earns a large
amount of money for its founders, investors, and first employees. A company can either have an
IPO (Initial Public Offering) by going to a public stock market (e.g. Facebook going public,
allowing everyone to invest in the company by buying shares sold by its insiders on the U.S.
stock market) or be acquired by or merged (M&A) with another company (e.g. Microsoft acquiring
LinkedIn for $26B), where those who previously invested receive immediate cash in return for
their shares. This process is often called an exit strategy (Guo, Lou, & Pérez-Castrillo,
2015). This project therefore treats both an IPO and an M&A as the critical events that
classify a startup as successful.
The initial plan, before any further exploration, is to review each attribute's data type and
completeness to check whether they are appropriate; from that we can decide what kind of data
cleansing is needed.

Data Cleansing & Feature Engineering
After acquiring the dataset, we found that numerous data cleansing tasks were needed before
any further analysis: the dataset has messy formatting and header labels, many missing values,
and values dispersed widely enough to produce outliers.
The data wrangling steps taken in this project are:
1. Fixing the spacing in header names such as ‘ market ’ and ‘ funding_total_usd ’.
2. Removing 4,855 duplicate rows.
3. Tackling uncommon formats.
The attribute ‘funding_total_usd’ is stored as a string with commas separating the
digits, so we remove the commas and convert the data type to numeric.
4. Handling missing values.
Missing values (NaN), such as those in ‘funding_total_usd’, are replaced with 0.
5. Detecting and handling outliers.
When plotting distributions, outliers make the visualization uninterpretable, so we
remove them using the interquartile range. Note that this step is used for Exploratory
Data Analysis only, not for Machine Learning (in ML we will instead transform the data,
whether using regression, polynomial regression, or a log transform).
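
The five cleansing steps above can be sketched in pandas as follows. The six sample rows are invented, and the 1.5 × IQR fences are an assumption: the report only says that the interquartile range was used, not which multiplier.

```python
import pandas as pd

# Hypothetical miniature of the raw data, including the padded header names.
df = pd.DataFrame({
    " market ": [" Software ", " Games ", " Games ", " Mobile ",
                 " Software ", " Biotechnology "],
    " funding_total_usd ": ["1,750,000", "40,000", "40,000", None,
                            "2,000,000", "900,000,000"],
})

# 1. Strip stray spaces from the header names.
df.columns = df.columns.str.strip()

# 2. Drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 3. Remove the comma separators, then convert to numeric.
#    errors="coerce" turns anything unparseable into NaN.
df["funding_total_usd"] = pd.to_numeric(
    df["funding_total_usd"].str.replace(",", "", regex=False),
    errors="coerce",
)

# 4. Replace missing funding with 0, as the report does.
df["funding_total_usd"] = df["funding_total_usd"].fillna(0)

# 5. Keep only rows inside the IQR fences (for plotting only).
#    The 1.5 multiplier is an assumed, conventional choice.
q1, q3 = df["funding_total_usd"].quantile([0.25, 0.75])
iqr = q3 - q1
inside = df["funding_total_usd"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_plot = df[inside]

print(len(df), len(df_plot))
```

Keeping the filtered rows in a separate `df_plot` leaves the full data available for later modelling, in line with the note that outlier removal is for EDA only.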

Feature engineering also brings advantages: it converts object-type data into numeric form
through one-hot encoding, and it can transform attributes that contain outliers (removing
them altogether could reduce training accuracy later in the Machine Learning process). In
this dataset we therefore:
1. Apply one-hot encoding to the attribute ‘status’.
2. Create the new variables ‘get_seed_funding’, ‘get_angel_fund’, and ‘get_venture’, and
most importantly ‘successfull_code’, each coded 1 for ‘Yes’ and 0 for ‘No’.
3. Replace the attribute ‘founded_at’ with ‘founded_year’. Because the year in ‘founded_at’
is inconsistent with the year in ‘founded_month’, we extract the year from
‘founded_month’ into the new attribute ‘founded_year’ and subsequently drop the
‘founded_at’ column.
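
A minimal sketch of these three feature engineering steps is given below. The sample rows, the source column names `seed` and `angel`, the 'YYYY-MM' format of ‘founded_month’, and the rule that only an ‘acquired’ status counts as success in this toy data are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed to match the dataset.
df = pd.DataFrame({
    "status": ["operating", "acquired", "closed"],
    "seed": [50000.0, 0.0, 0.0],
    "angel": [0.0, 0.0, 25000.0],
    "venture": [0.0, 3000000.0, 0.0],
    "founded_at": ["2012-06-01", "2007-01-01", "2010-03-15"],
    "founded_month": ["2012-06", "2007-01", "2010-03"],
})

# 1. One-hot encode 'status' into one 0/1 column per category.
dummies = pd.get_dummies(df["status"], prefix="status", dtype=int)
df = pd.concat([df, dummies], axis=1)

# 2. Binary flags: 1 if any money was received from that source, else 0.
df["get_seed_funding"] = (df["seed"] > 0).astype(int)
df["get_angel_fund"] = (df["angel"] > 0).astype(int)
df["get_venture"] = (df["venture"] > 0).astype(int)
# Assumption: in this toy data only 'acquired' marks a successful exit.
df["successfull_code"] = (df["status"] == "acquired").astype(int)

# 3. Take the year from 'founded_month' (assumed 'YYYY-MM'),
#    then drop the inconsistent 'founded_at' column.
df["founded_year"] = df["founded_month"].str[:4].astype(int)
df = df.drop(columns=["founded_at"])

print(df.filter(like="get_").join(df[["successfull_code", "founded_year"]]))
```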

Key Finding and Insights
Startups are known for innovating in the gap between a problem and its solution, and for
being growth-seeking businesses. The nature of such a business requires heavy funding, so it
is common to look for capital from a variety of sources such as angel investors and venture
capital.
This section has three sub-sections: startup, market, and funding.

A. START UP

1. Top 5 Countries by Number of Startups

The USA dominates in startup count, with more than 50% of all startups worldwide. This is
unsurprising, since the US has an immense support ecosystem that helps startups grow from
ideation to scale-up. It is followed by England with 2,642 startup companies, Canada with
1,405, China with 1,239, and Germany with 968, as of 2014.
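
A ranking like this one can be obtained with a simple frequency count. The sketch below assumes a `country_code` column holding three-letter codes; the eight sample values are invented.

```python
import pandas as pd

# Hypothetical country codes; None stands for a missing value.
country = pd.Series(
    ["USA", "USA", "USA", "GBR", "CAN", "USA", "GBR", None],
    name="country_code",
)

counts = country.value_counts()               # missing values are excluded
top5 = counts.head(5)                         # already sorted, largest first
share = country.value_counts(normalize=True)  # fraction of non-missing rows

print(top5)
print(share.round(3))
```

Note that `normalize=True` computes shares over rows with a known country only, so the percentage depends on how missing country codes are treated.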

2. Start Up Status

Of the 49,437 startup companies recorded from 1902 until 2014, 5.4% are closed, 86.9% are
operating, and 7.7% were acquired. Being acquired counts as one form of success for a
startup, because it is an exit strategy.

3. Start Up Founded Year Distribution

In the figure above, the mid-1990s mark the start of worldwide startup growth: around 437
companies were recorded in 1995, and the count had almost doubled to 731 startup companies
in 2001.

This period is known as the dot-com bubble, when technology companies drew the market into
overvaluation. In 1999, at the height of the dot-com craze, there were 457 IPOs, most of
them Internet and technology stocks. Of those, 117 doubled in price on the first day of
trading. Tech and dot-com IPOs were minting new millionaires every day, at both the
management level and the retail investor level. But then the sell-off started on March 11,
2000. Investors suddenly realized that a tech or Internet company with a billion-dollar
valuation but no revenue or earnings is saddled with debt and has no future.

4. Those Who Survived the Dot-com Bubble and Became Tech Titans

After the dot-com bubble, companies with strong business revenues survived, namely Amazon,
Netflix, eBay, Google, and Alibaba. Some of them are still leading tech companies today.
The FANG+ companies (Facebook, Amazon, Netflix, Google, Alibaba) are the tech titans that
have outperformed the wider market since the coronavirus (COVID-19) pandemic spurred record
sell-offs in March 2020. Unlike other stocks, which hit their lows during this period,
FANG+ stocks rallied by as much as 80% from their early-March lows, owing to their
performance and forward-looking valuations.

B. MARKET

1. Top 15 Startup Markets Worldwide

Unsurprisingly, the growing number of startups touches almost every sector in which people
are the market. The most common category among startups worldwide is Software, followed by
Biotechnology, Mobile, E-Commerce, Curated Web, Enterprise Software, Health Care, Clean
Technology, Games, and embedded hardware and software systems.
The figure below, however, shows that E-Commerce is the most popular category among startup
companies in China, Indonesia, and India, which differs slightly from the United States.

2. Most Popular Categories of Startup Products

For almost three decades up to 2014, Social Media, Curated Web, and Mobile were the most
popular startup product categories worldwide. The less popular a category is, the smaller
its word is drawn in the picture.

C. FUNDING

1. Total Funding Distribution

Total funding is defined as the sum of all funding obtained, from seed, grant, angel
investors, and venture capital across all rounds. The picture above shows that the
dispersion of total funding across startups is very high, so we remove the outliers to see
the distribution better, as shown below. The data show that most total funding amounts are
below USD 2.5 million, and that the top 10% of startup companies received 78% of all
funding across the globe.
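
A concentration figure such as "the top 10% received 78%" can be computed as the funding held by the largest decile divided by the total. The sketch below uses ten invented funding values, so its result differs from the report's figure; how the report itself defined the top 10% is not stated.

```python
import pandas as pd

# Hypothetical total funding per startup, in USD.
funding = pd.Series(
    [100, 200, 300, 400, 500, 600, 700, 800, 900, 5500], dtype=float
)

# Number of companies in the top 10% (at least one).
n_top = max(1, round(0.10 * len(funding)))

# Share of all funding held by those companies.
top_share = funding.nlargest(n_top).sum() / funding.sum()

print(f"Top 10% of startups hold {top_share:.0%} of total funding")
```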

2. Total Funding of Various Unicorns in 2014
There were fewer unicorns in 2014 than there are today, but a few of them are still tech
titans. Facebook, one of the 2014 unicorns, obtained the highest funding, almost USD 2.5
billion, ahead of Alibaba, Twitter, Cloudera, and Uber.

3. Seed Funding vs. Angel Funding vs. Venture Investment
Seed money, sometimes called seed funding or seed capital, is a form of securities offering
in which an investor invests capital in a startup company in exchange for an equity stake
or convertible note stake in the company. The term seed suggests that this is a very early
investment, meant to support the business until it can generate cash of its own, or until
it is ready for further investments. Seed money options include friends and family funding,
seed venture capital funds, angel funding, and crowdfunding.
In this dataset, seed funding and angel investment differ in their source: seed funding
comes from institutional seed venture capital funds, whereas angel investment comes from
informal or private investors, called angel investors, who invest based on their personal
preference.
Venture capital is a form of private equity and a type of financing that investors provide
to startup companies and small businesses that are believed to have long-term growth
potential. Venture capital generally comes from well-off investors, investment banks, and
other financial institutions. However, it does not always take a monetary form; it can also
be provided in the form of technical or managerial expertise. In the dataset, the column
‘venture’ is the total investment amount across round A, round B, round C, round D, and so
on up to round H.
The question is: how many startups receive seed, angel, and venture investment? The three
pie charts below show the difference.
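
The percentages behind such pie charts can be sketched as the share of rows with a positive amount in each funding column. The four sample rows and the column names `seed` and `angel` are assumptions; ‘venture’ is the column named in this report.

```python
import pandas as pd

# Hypothetical funding amounts per startup, in USD (0 means none received).
df = pd.DataFrame({
    "seed": [50000.0, 0.0, 0.0, 20000.0],
    "angel": [0.0, 0.0, 10000.0, 0.0],
    "venture": [0.0, 3000000.0, 1000000.0, 0.0],
})

# A startup "got" a funding type if the amount is positive;
# the mean of a boolean column is the fraction of True values.
pct_funded = (df[["seed", "angel", "venture"]] > 0).mean() * 100

print(pct_funded)
```

Each value, together with its complement to 100, gives the two slices of one pie chart.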

The hardest source of funding for a startup to obtain is angel investment: angel
investments are few, and strong networking is required to get access to them. A business
incubator can act as the hub between angel investors and startup companies.
Meanwhile, around 28% of startups receive seed funding. From the investor's point of view,
seed funding has both advantages and disadvantages. The drawback is that the risk is higher
than for investors who put money in during a VC round, since the real market and revenue
numbers are not there yet. On the other hand, investors who choose the right startup at the
seed stage earn a higher return, because they need less money to join the shareholder list
than VC rounds require. High risk, high return.
Last but not least, out of every 100 startup companies, statistics show that 47 are backed
by venture capital institutions. Although this may make VC investment seem easy to get, the
startup must be ready
for the full due diligence process. If a VC is interested in the proposal, the firm or the
investor must then perform due diligence, which includes a thorough investigation of the
company's business model, products, management, and operating history, among other
things.
Once due diligence has been completed, the firm or the investor will pledge an
investment of capital in exchange for equity in the company. These funds may be
provided all at once, but more typically the capital is provided in rounds. The firm or
investor then takes an active role in the funded company, advising and monitoring its
progress before releasing additional funds.
The investor exits the company after a period of time, typically four to six years after
the initial investment, by initiating a merger, acquisition, or initial public offering
(IPO). Companies that reach such an exit are the ones we label as successful startups later
in this project.

4. Do Successful Unicorns Require Seed Funding?
Interestingly, the FANG+ companies (Facebook, Amazon, Netflix, Google, Alibaba), the tech
titans that now account for a large share of market capitalization on Wall Street, did not
take seed funding right after founding their products; they bootstrapped to fund
themselves. By contrast, Dropbox and Uber (both also unicorns today) received seed funding
of USD 200,000 or less.

5. How much do the amounts differ between seed funding, angel investment, and each VC
round?

From the dataset, seed funding has the lowest average amount among all funding rounds and
sources, around USD 776,350, while the average angel investment is around USD 1 million.
The averages for the venture capital rounds are:
Round A: USD 6.9 million
Round B: USD 13.5 million
Round C: USD 21 million
Round D: USD 28 million
Round E: USD 32 million
Round F: USD 48 million
Round G: USD 83 million
Round H: USD 175 million
In 2014, only 4 companies received round H investment. Three of them are in E-Commerce:
Flipkart (India), Deem (USA), and Locondo (Japan). The fourth, Gumi, is a games company
headquartered in Singapore.
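
Per-round averages like these can be sketched as column means. The snippet below assumes that the averages are taken only over companies that actually raised a given round (zeros excluded); the report does not state this, and the column names `round_A` and `round_B` and the sample values are likewise assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts per startup, in USD (0 means the round was not raised).
df = pd.DataFrame({
    "seed": [50000.0, 0.0, 150000.0],
    "round_A": [0.0, 4000000.0, 6000000.0],
    "round_B": [0.0, 0.0, 12000000.0],
})

# Treat 0 as "no funding in this round" so it does not drag the mean down;
# mean() skips NaN by default.
avg_raised = df.replace(0, np.nan).mean()

print(avg_raised)
```

Including the zeros instead (a plain `df.mean()`) would give much lower averages, so the choice should be stated alongside any reported figure.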

Hypothesis & Statistical Significance Testing
After examining the data, the following hypotheses arise:
1. Being a startup in the USA has a strong linear relationship with being a successful
startup, since the ecosystem there is well established.
2. Startups that receive venture investment are expected to be successful startups.
3. The founding year is one of the predictors of a startup's success.
Statistical Significance Testing

The hypotheses are tested for statistical significance by calculating the Pearson
Correlation and p-value between attributes.

Correlation is a measure of the extent of interdependence between variables.
Causation is the relationship between cause and effect between two variables.
It is important to know the difference between these two and that correlation does not imply
causation. Determining correlation is much simpler than determining causation as causation may
require independent experimentation.

Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X & Y. The
resulting coefficient is a value between -1 and 1 inclusive, where:
● 1: Total positive linear correlation.
● 0: No linear correlation; the two variables have no linear relationship, although a
nonlinear one may still exist.
● -1: Total negative linear correlation.

P-value
The p-value is the probability of observing a correlation at least as strong as the one
measured if the two variables were in fact uncorrelated. Normally, we choose a significance
level of 0.05, which means that we accept a 5% chance of declaring a correlation
significant when none exists. By convention, when:
p-value is < 0.001: we say there is strong evidence that the correlation is significant.
p-value is < 0.05: there is moderate evidence that the correlation is significant.
p-value is < 0.1: there is weak evidence that the correlation is significant.
p-value is > 0.1: there is no evidence that the correlation is significant.
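
As a hedged illustration of how a coefficient and its p-value are obtained together, the sketch below applies `scipy.stats.pearsonr` to two binary variables. The 2,000 rows are built from made-up counts, not from the Crunchbase data, and the names `is_usa` and `target` are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up counts for a 2x2 table of (is_usa, target):
#   USA & successful: 120      USA & not successful: 880
#   other & successful: 70     other & not successful: 930
group_sizes = [120, 880, 70, 930]
is_usa = np.repeat([1, 1, 0, 0], group_sizes)
target = np.repeat([1, 0, 1, 0], group_sizes)

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = stats.pearsonr(is_usa, target)

print(f"r = {r:.3f}, p-value = {p_value:.2e}")
```

With a sample this large, even a coefficient below 0.1 yields a very small p-value, which is why statistical significance and strength of relationship have to be read separately.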

A new ‘target’ variable is created to mark whether a startup is successful. As mentioned in
the first section, successful startups are those that have a Merger & Acquisition record or
have become public companies.
The result below shows that the correlation between being a USA startup and being a
successful startup is statistically significant, since the p-value is < 0.001, but the
linear relationship is extremely weak (~0.09). In other words, building a startup in the
USA does not by itself mean that it will be successful later on.

The same holds for the relationship between receiving venture investment and becoming a
successful startup: since the p-value is < 0.001, the correlation is statistically
significant, but the linear relationship is weak (~0.12), meaning that venture investment
is only slightly associated with becoming a successful startup.

Hence we conclude that Hypotheses 1 and 2 are not supported: both correlations are
statistically significant but too weak to indicate a strong relationship.

Suggestion & Summary
Several suggestions could take the analysis further:
1. Use the most recent dataset, as of Q3 2020, to see how startup dynamics changed during
the coronavirus pandemic; this could give governments a basis for mitigating the failure of
promising startup companies.
2. Predict which startup companies have a high probability of success (defined by Merger &
Acquisition or IPO) using a Machine Learning classification model.

Summary
The dataset covers startups, their corporate actions, and the investment they obtained. It
is sourced from Crunchbase in CSV format and covers startup companies worldwide recorded
from 1902 until 2014. It consists of 54,294 entries (rows) and 38 attributes of object and
float data types, with quite messy formatting. Because of this limitation, several data
cleansing steps are required before conducting further analysis.
For further research, it would be valuable to include variables that are likely to
correlate with the target, such as the founders' years of experience, the number of
patents, copyrights, or other goodwill of the company, the founders' grit score, customer
count, and so on. This would show which of them could boost the number of successful
startups, and could in turn improve the training data for Machine Learning predictions.

Source Code Archive
https://github.com/fitrieratna/notebook/blob/master/EDA_START_UP_v_finale.ipynb
