UNIVERSITÀ DEGLI STUDI DI CAGLIARI
FACOLTÀ DI SCIENZE
Corso di laurea in Informatica
DATA EXTRACTION FROM APPANNIE'S WEBSITE USING
WEB SCRAPING
Supervisor:
Prof. Riccardo Scateni
Candidate:
Mattia Palla
(matr. 48971)
Academic year 2015-2016
Abstract
A web scraper tool is a piece of software used to extract target data from a web
application, usually a website. Normally it is the last resort for obtaining data
from a source which provides no other convenient way, such as downloadable
structured data files or an API to communicate with the database containing the
target data.
Nowadays, data and information about the mobile application market and its
trends are very important for developers and software houses that want to launch
a new application or improve existing ones.
The role of a scraper tool is to extract and transform unstructured data
into semi-structured or structured data; the role of data mining methods is
to discover patterns in large data sets and present them in a readable and
understandable shape.
The purpose of this project is the development of an effective and efficient
tool which, using scraping and data mining methods, extracts useful data from a
source providing data about the mobile app market, and shows it in an
understandable shape, highlighting links and correlations among the extracted
data.
CONTENTS

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Organization
2 State of the art
  2.1 App Annie: a mobile analytics provider
    2.1.1 App Annie's APIs
  2.2 Data scraping
3 Discussion
  3.1 The starting point
  3.2 The approach
  3.3 App Annie's structure
  3.4 Selenium API
  3.5 Regular expressions
  3.6 Browsers
  3.7 The tool's development
    3.7.1 The project
    3.7.2 Opening the browser and logging in
    3.7.3 Scraping the app's page
    3.7.4 How to save data to a CSV file
    3.7.5 Web scraping on the chart webpages
      3.7.5.1 The selection of the date
    3.7.6 Time and optimisations
      3.7.6.1 Multithreading
      3.7.6.2 Measuring the time
    3.7.7 The backup of the state
  3.8 Good shape of information and data mining
4 Conclusion
  4.1 Results
  4.2 Future work
5 Acknowledgement
6 Attachments
  6.1 The development environment
    6.1.1 Hardware
    6.1.2 Software
    6.1.3 Repository
    6.1.4 Thesis
Chapter 1. Introduction
CHAPTER 1
INTRODUCTION
1.1 Motivation
There are sources of data where it is possible just to consult data through a website
and it is impossible to save a le with some important and selected informations. Web
scraping and data mining methods can be used to prepare a structured or semi-structured
le with ltered and useful data, trasforming a hardly understandble set of data contained
in a website into an organised and readable report of mobile app market stored in a le
[5] [2].
1.2 Purpose
The result of this project aims to help developers and software houses make decisions
about the development, presentation, and advertising of a new mobile app, and
also to adjust apps that have already been released.
1.3 Organization
This thesis is the result of an Erasmus traineeship experience at Bournemouth University.
During the training period, there was a first phase of study to understand how it was
possible to achieve the required results, followed by a second phase of implementation of
the idea.
In the first stage of this thesis' work, most of the code consisted of a series of attempts
to try different approaches to the problem. The focus was on understanding how it
was possible to develop a tool to extract the needed data. The code of a simple web scraping
tool written in Python was used to understand how such a piece of software could work; after
that, the focus shifted to an API useful for the project's purpose, learning how to
use the Selenium API.
The second step was to code and test every piece of the code repeatedly, finding
new issues every time and fixing them. The development proceeded through repeated little
goals, increasing, step by step, the features and the quality of the software.
The last stage was to find correlations among the extracted data and show them in an
agreeable shape to raise the interest of developers.
Chapter 2. State of the art
CHAPTER 2
STATE OF THE ART
This section explains the main problem, and the methods and tools used as a term
of comparison and as a reference for this work.
2.1 App Annie: a mobile analytics provider
With the mobile app market growing at such a rate, numerous app marketing
agencies and mobile analytics providers have emerged, offering support
in app promotion, analysing the market and elaborating statistics. One of them is App
Annie.
App Annie is a business intelligence company and analyst firm headquartered in San
Francisco, California. It produces business intelligence tools and market reports for the
apps and digital goods industry. App Annie creates business intelligence
solutions for the global app economy and provides the most successful companies with
these data products for free. Business media outlets use App Annie as a complete solution
for tracking downloads, revenues, rankings and reviews. The App Annie suite consists
of three main products: Store Stats, Analytics, and Intelligence, which provides the most
accurate market estimates available.
2.1.1 App Annie's APIs
To obtain data from App Annie's website, APIs are available to communicate
directly with the database. The problem is the limitation on using them for free (1000
calls per day). App Annie also allows downloading a CSV file with some data, but this
feature is not free and only a small part of the needed data would be available.
2.2 Data scraping
The only way to obtain all the needed data is to use web scraping methods. Other
web scraping tools exist, but a general-purpose web scraper cannot produce the same result as a
tailored tool. Tools such as UiPath or OutWit Hub have free basic versions that are not
powerful enough to carry out all the needed tasks.
Chapter 3. Discussion
CHAPTER 3
DISCUSSION
3.1 The starting point
Nowadays we have a lot of data about every subject. This can be a precious resource
if we use it in the correct way. The problem faced in this thesis is how to extract data
automatically from the App Annie website in order to produce an organized set of data that
is easily understandable and readable and that can help app developers make better
decisions. From this data set it will be easier to understand some market trends
and the reasons that increase or reduce the popularity of each app. The work focuses on the
USA market and the Chinese market.
3.2 The approach
The software has been realised using an approach based on iterative and incremental
development, inspired by agile programming practices [1]. The development was conducted
through weekly deliveries of working code, increasing the features and the properties
of the software every week. The testing phase continued for the whole period of
development, focusing each week on the newly coded parts and newly implemented features.
The most important data for this work are the name, the description and the versions
of each app shown in the top 100 charts of the App Annie website for the last year.
To avoid too many repetitions of data and to reduce the time to compute them, it
was decided to extract the required data every two weeks, going as far back as one
year into the past. In this way the size of the sample is big enough and the day of the
week does not affect the measurement.
3.3 App Annie's structure
The website pages of interest are two:
1. Top 100 apps charts:
Figure 3.1: www.appannie.com, Top 100 apps charts
Figure 3.2: www.appannie.com, the page dedicated to each app
The most popular 100 apps, by downloads, are shown on this page. There are 5 types
of charts: free, paid, grossing, new free and new paid. The existence of these lists
depends on the considered country's market. Of the markets considered in this
work, the USA one has the free, paid and grossing charts, while in the Chinese one
all five types are present.
2. The page dedicated to each app:
It contains details related to each app, such as name, software house, description,
version history, images, review distribution and other information not useful
for this work.
3.4 Selenium API
Selenium is a set of different software tools for browser automation and web scraping.
It is one of the most famous and widely used APIs for these aims, and it can be used with
different languages and browsers. It is a very flexible, versatile and easy-to-use set of
tools. Using it, it is possible to automate the browser to surf the website, to look for
specific elements in the HTML code and to extract the pieces or patterns found [3].
3.5 Regular expressions
Regular expressions are used to look for everything in web scraping [4]. They are very
useful to extract writing from elements that contain a lot of text, or dierent types of
text (versions element for example) in order to better classify extracted data.
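As a concrete illustration (not the thesis' actual code), the hypothetical class below shows how a regular expression could pull individual versions out of the single text blob found in a versions element; the class name and the exact pattern are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split a versions blob like the one in the app page's
// versions element into one entry per version.
public class VersionSplitter {
    // Matches a version number followed by a date, e.g. "3.9 May 27, 2016".
    private static final Pattern VERSION = Pattern.compile(
            "(\\d[\\d.]*)\\s+([A-Z][a-z]{2} \\d{1,2}, \\d{4})");

    public static List<String> split(String blob) {
        List<String> versions = new ArrayList<>();
        Matcher m = VERSION.matcher(blob);
        while (m.find()) {
            versions.add(m.group(1) + " | " + m.group(2)); // version | date
        }
        return versions;
    }

    public static void main(String[] args) {
        System.out.println(split("3.9 May 27, 2016 Current release 3.8 May 21, 2016"));
    }
}
```

Each match becomes one field of the CSV row, which is exactly the kind of classification described above.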
3.6 Browsers
The choice of Mozilla Firefox was made for the following reasons:
1. It is one of the most used and reliable browsers
2. It is free and open source
3. The availability of a wide selection of extensions and add-ons to customise the
browser's behaviour
3.7 The tool's development
3.7.1 The project
After a phase of feasibility study, some goals were fixed and it was decided how
and what the tool has to do. It extracts from the top 100 charts of the USA and Chinese
markets the lists of the apps on the charts, grouped by type of chart and country, so every list
is stored in a separate CSV file. This process is iterated in order to extract the charts every
2 weeks, beginning from the current day (or a previously chosen date) and going backwards as
far as one year into the past or to a specific date. After that, the tool reads the recently
created CSV files and, for each URL, it opens the browser on that URL, extracts all the
HTML code including pictures, and saves the description and version history in a new
structured CSV file. At that point all the required data are available and the tool can
process them to produce a new file where all the obtained information is readable and
understandable.
Figure 3.3: The tool's flow
3.7.2 Opening the browser and logging in
The first step of the development was to automate opening the browser and showing
the page of an app. The first problem was how to automate the authentication on the
website; in fact the website shows data only to registered users. To solve this, the Selenium
functions findElement, to find the elements containing the username and password fields on the
welcome page, and sendKeys, to fill them with the username and password, have been used.
this.findElement(By.id("email")).sendKeys(new String[] { user });    // Enter username
this.findElement(By.id("password")).sendKeys(new String[] { pass }); // Enter password
this.findElement(By.id("submit")).click();                           // Click on 'Sign In' button
Listing 3.1: Login's automation
3.7.3 Scraping the app's page
In the first weeks of the traineeship the attention was focused on coding the core
of the tool, i.e. on writing the methods which extract the target data from the selected
webpages. For this stage, a random app page was used to test every little improvement,
showing the extracted data on the IDE's terminal. The Selenium API provides a
driver class which has been extended with new methods in order to keep all the features
of the original class and to customise them to better manage the surfing and the
scraping. Filters of the findElement method that look for the name of the class
inside the HTML code have been used. In this way it is possible to select just the
useful elements in the page, in this case the description and the version history.
List<WebElement> elements; // list to store each found WebElement (paragraph)
WebElement elem;
By bySearch = By.className("app_content_section");
elements = this.findElements(bySearch);
Listing 3.2: Excerpt from the method that extracts the description of each app
However, on most app pages it was necessary to look for and click on the 'more' button
to show the element's complete text instead of only a short first part of it.
The problem was that not every page has a 'more' button, because some have a short
description that does not need to be expanded to be read. Generalising, this is a frequent
problem in web scraping tools because often, even with the same general HTML page
structure, the target element can be hidden in some pages and visible in others. To
solve this, 2 different solutions with the Selenium API can be used:
1. findElement with try/catch:
This solution uses the exceptions raised by the Selenium API in the findElement
method. findElement can find an element, but if the element is not found
an exception is raised. So it is possible to handle the exception and continue
running the code, as the miss is not a problem.
2. findElements with a check on the resulting list:
A different solution is to use the findElements method. findElements can find an
undefined number of matches, but if it finds no element, no exception is raised
and the resulting list will be empty; so it is easy to check the size of the list to
understand whether at least one element was found.
There are pros and cons for each solution. The first one has the advantage of being
faster if the probability of finding the target element is high; in fact when it finds the
element it stops and the next line of code is processed. The second one is better when we
have to look for an element but the first one found might not be the searched one, so we
need to continue the search. The first solution has been used in order to get faster operations.
3.7.4 How to save data to a CSV file
To save data to a CSV file, the standard Java libraries for writing files have been used.
The structure of the resulting file is made up of three columns: URL, description and
versions. For the versions field, each version has been separated in order to have only one
version per field in the CSV file. To create this split, a regular expression was needed
to match all the different versions of each app (in the HTML code all versions are
together in the same element) and to put each one in a different field of the CSV file.
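A sketch of this kind of CSV writing with the standard Java libraries might look as follows; the class name and helper methods are hypothetical (not the thesis' code), and fields are quoted so that commas inside descriptions do not break the format:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative sketch: write one app per CSV row, quoting every field.
public class CsvWriterSketch {
    // RFC-4180-style quoting: wrap in quotes and double any embedded quotes.
    public static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Build a row "URL,description,version1,version2,..." from the app data.
    public static String toRow(String url, String description, List<String> versions) {
        StringBuilder row = new StringBuilder(quote(url)).append(',').append(quote(description));
        for (String v : versions) {
            row.append(',').append(quote(v));
        }
        return row.toString();
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("apps", ".csv");
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            w.println("URL,DESCRIPTION,VERSIONS");
            w.println(toRow("https://www.appannie.com/apps/...",
                            "Welcome to your app",
                            List.of("3.9 May 27, 2016", "3.8 May 21, 2016")));
        }
        System.out.println(Files.readString(out));
    }
}
```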
3.7.5 Web scraping on the chart webpages
This step represented the most difficult part of the work [5]. The HTML structure
of this page is complex and different for each country, so the code has to generalise
the problems and find consistent solutions. Also, this page uses AJAX to manage the date
drop-down menu, and this is a typical problem when a scraper tool is designed [5].
WebElement elementHead = this.findElementByCssSelector(
    "div.region-main-inner:nth-child(3) > table:nth-child(1) > thead:nth-child(1)");
this.typesList = elementHead.getText().split("\n");
this.numList = typesList.length;
WebElement elementStoreTable = this.findElementById("storestats-top-table");
List<WebElement> elementsRows = elementStoreTable.findElements(By.cssSelector("tr"));
Listing 3.3: Extraction of the list from the chart
URL DESCRIPTION VERSIONS
https://www.appannie.com/apps/. . . Description Welcome to your . . . 3.9 May 27, 2016 Current release 3.8 May 21, 2016
Table 3.1: File structure required
The purpose of this step was to build a second CSV file where the tool stores the charts of
the top 100 apps with their URLs. Each page has at least 3 and not more than 5 different
types of charts; this structure depends on the selected country. On the same page it is
possible to choose the date of the chart. The tool has to take all these
features into consideration. To make the process and the reading of the files easier,
creating a file for each type of chart for each country has been preferred.
To make this function suitable for the different structures of this page, it was chosen
to detect the number and the types of lists on the page before beginning the data extraction,
in order to organise the structure of the CSV file to hold the data in a clear way. The
findElement function has been used repeatedly to select the correct element, reducing the
time spent.
3.7.5.1 The selection of the date
So far the date factor had been ignored during the development. This was a precise
choice, made in order to make the development easier, following an incremental approach. Our
goal is to read the charts every two weeks; to achieve it, the tool has to operate the
date drop-down menu to select the required date. This is a real and complex automation
of the browser, which has to click on the date menu, read the current date and select the
correct one. The Calendar class of java.util has been used to count days in order to select
dates spaced exactly 14 days apart.
if (this.monthList.indexOf(this.calMonthSelectedNow) == calExpToStart.get(Calendar.MONTH)
        && (calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow)))
{
    String monthWithZero;
    // Month and year matched. Now call the selectDate function with the date
    // to select, and set the dateNotFound flag to false.
    int month = this.monthList.indexOf(this.calMonthSelectedNow) + 1;
    if (month < 10)
        monthWithZero = "0" + month;
    else
        monthWithZero = month + "";
    stringFPG.monthYearForNameList = this.calYearSelectedNow + "_" + monthWithZero + "_";
    this.selectDate(String.valueOf(calExpToStart.get(Calendar.DAY_OF_MONTH)));
    dateNotFound = false;
}
// If the currently selected month and year are earlier than the expected
// start month and year, go inside this condition.
else if (this.monthList.indexOf(this.calMonthSelectedNow) < calExpToStart.get(Calendar.MONTH)
        && (calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow))
        || calExpToStart.get(Calendar.YEAR) > Integer.parseInt(this.calYearSelectedNow))
    // Click on the next button of the date picker.
    this.findElement(By.className("ui-datepicker-next")).click();
// If the currently selected month and year are later than the expected
// start month and year, go inside this condition.
else if (this.monthList.indexOf(this.calMonthSelectedNow) > calExpToStart.get(Calendar.MONTH)
        && (calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow))
        || calExpToStart.get(Calendar.YEAR) < Integer.parseInt(this.calYearSelectedNow))
    // Click on the prev button of the date picker.
    this.findElement(By.className("ui-datepicker-prev")).click();
Listing 3.4: A piece of the code that selects the date from the App Annie website
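Outside the browser automation, the 14-day stepping itself can be sketched with java.util.Calendar alone; the class and method names below are hypothetical, not taken from the tool:

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.List;

// Sketch: generate chart dates spaced exactly 14 days apart, going backwards
// from a start date until one year in the past.
public class DateStepper {
    public static List<Calendar> datesEveryTwoWeeks(Calendar start) {
        Calendar oneYearAgo = (Calendar) start.clone();
        oneYearAgo.add(Calendar.YEAR, -1);

        List<Calendar> dates = new ArrayList<>();
        Calendar current = (Calendar) start.clone();
        while (!current.before(oneYearAgo)) {
            dates.add((Calendar) current.clone());
            current.add(Calendar.DAY_OF_MONTH, -14); // step two weeks back
        }
        return dates;
    }

    public static void main(String[] args) {
        Calendar start = new GregorianCalendar(2016, Calendar.MAY, 25);
        // Number of two-weekly sample dates in the year before 2016-05-25.
        System.out.println(datesEveryTwoWeeks(start).size());
    }
}
```

Calendar.add handles month and year rollover, so each step really lands exactly 14 days earlier.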
3.7.6 Time and optimisations
The total time spent by the program to complete the working flow of one year of app
charts is long: approximately, the tool spends 20 minutes to extract the app lists and
30 seconds to analyse each application present in the charts. The 30 seconds are the result
of work aimed at reducing the time as much as possible, trying different combinations
of Selenium API functions to look for the required element in the HTML code. Also,
this rate does not worry the website's server, because the behaviour of the automated
browser is similar to a human's, so the operations of the tool are not blocked.
We have 26 chart pages to analyse. In the USA chart pages there are 3 types of lists,
in the Chinese chart pages there are 5. Every type of list is made up of 100
names of apps (sometimes the new free and new paid charts have fewer than 100 apps,
but obviously never more than 100). So the number of apps is slightly less than 20800. If the
tool analysed every app's page, the time spent would be 10400 minutes, that is 173.3
hours, so more than 7 days of uninterrupted running. Fortunately it is possible to optimise
this enormously. In fact the major part of the 20800 apps are repeats. Since the
longest part of the tool's process is the scraping of each app's page, this part
has been optimised by skipping apps already analysed. The stored lists still contain
all the apps, in order to have a clear picture of every chart, but when the tool reads
these lists, it skips the apps already seen. In this way the number of app pages analysed,
and the time spent by the tool, is considerably decreased. The number of analysed apps
becomes about 2100, reducing the time spent by 90%, i.e. to 16.6 hours.
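The skip-already-seen optimisation can be sketched with a HashSet of URLs; the class below is a hypothetical illustration, not the tool's actual code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: skip app pages already analysed by remembering their URLs.
public class SeenFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns only the URLs not processed before, and marks them as seen.
    public List<String> onlyNew(List<String> urls) {
        List<String> fresh = new ArrayList<>();
        for (String url : urls) {
            if (seen.add(url)) { // add() returns false if already present
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        SeenFilter filter = new SeenFilter();
        System.out.println(filter.onlyNew(List.of("app1", "app2", "app1")));
    }
}
```

The full chart lists are still written to the CSV files; the filter is only applied when deciding which app pages to open in the browser.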
3.7.6.1 Multithreading
The tool is partially multithreaded, and the idea is to transform every part to use
multithreading. Currently, the first process of the tool, the extraction of the lists from the top
100 chart pages, is coded with 2 threads, one for each country. In this way the time to
extract the lists has been reduced by 33% (from 15 minutes per country, i.e. 30 minutes
in total, to 20 minutes using the multithreaded algorithm, on the testing hardware). The
second part of the tool could also improve its required time with a multithreading approach,
and this will be part of future progress.
ThreadDemo T1 = new ThreadDemo("USA",
    "https://www.appannie.com/apps/google-play/top-chart/united-states/overall/?date=2016-05-25");
T1.start();
ThreadDemo T2 = new ThreadDemo("CHI",
    "https://www.appannie.com/apps/google-play/top-chart/china/overall/");
T2.start();
Listing 3.5: Multithreading
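A minimal, self-contained sketch of the same two-thread pattern (with a placeholder scrape method standing in for the real chart extraction, and all names hypothetical) could look like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: run one extraction task per country in parallel and wait for both.
public class ParallelCountries {
    public static final Map<String, Integer> results = new ConcurrentHashMap<>();

    // Placeholder for the real chart extraction of one country.
    static void scrape(String country) {
        results.put(country, 100); // pretend 100 apps were extracted
    }

    public static void runBoth() {
        Thread usa = new Thread(() -> scrape("USA"));
        Thread chi = new Thread(() -> scrape("CHI"));
        usa.start();
        chi.start();
        try {
            usa.join(); // wait for both threads before reading results
            chi.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        runBoth();
        System.out.println(results.keySet());
    }
}
```

The join() calls are what make it safe to read the shared results map afterwards.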
3.7.6.2 Measuring the time
Opening webpage: OK Total time: 3149.649 sec. Task time: 9.821045 sec.
checking if app exists yet... OK Total time: 3160.453 sec. Task time: 10.803955 sec.
finding images with relative url... OK Total time: 3160.863 sec. Task time: 0.41015625 sec.
Downloading webpage on data.html... OK Total time: 3161.082 sec. Task time: 0.21899414 sec.
Fetching description's product... OK Total time: 3162.048 sec. Task time: 0.96606445 sec.
Fetching versions's product... OK Total time: 3166.224 sec. Task time: 4.1760254 sec.
Fetching versions's product... OK Total time: 3166.738 sec. Task time: 0.513916 sec.
Fetching versions's product... OK Total time: 3166.739 sec. Task time: 9.765625E-4 sec.
Opening new tab browser to visit the next url... OK Total time: 3173.452 sec. Task time: 6.7128906 sec.
App time: 33.624023 sec.
Table 3.2: Measuring the time
The measurement of the required time has been managed using a stopwatch class, writing
the time spent in each step of the tool to a text file. Controlling the timings is very
important in a web scraping tool, because the high amount of data requires a long time.
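A stopwatch class of the kind described might be sketched as follows (a hypothetical reconstruction, not the thesis' code):

```java
// Sketch of a stopwatch that logs each step's "total time" and "task time",
// in the style of the log excerpt in Table 3.2.
public class Stopwatch {
    private final long startNanos = System.nanoTime();
    private long lastNanos = startNanos;

    // Seconds elapsed since the stopwatch was created.
    public double totalSeconds() {
        return (System.nanoTime() - startNanos) / 1e9;
    }

    // Seconds elapsed since the previous lap() call (the "task time").
    public double lap() {
        long now = System.nanoTime();
        double task = (now - lastNanos) / 1e9;
        lastNanos = now;
        return task;
    }

    public static void main(String[] args) {
        Stopwatch sw = new Stopwatch();
        System.out.printf("Opening webpage: OK Total time: %.3f sec. Task time: %.3f sec.%n",
                sw.totalSeconds(), sw.lap());
    }
}
```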
3.7.7 The backup of the state
Providing a backup system was needed. In fact, over such a long running period, it is
impossible to exclude the possibility of a blackout, a server or client crash or a
connection problem. To solve this, the tool saves its state before processing every app, in
order not to lose any data or time. If the program is stopped, when the user starts it
again it restarts from the last app analysed (if the previous run did not finish
processing the apps).
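A minimal sketch of such a state backup, assuming the state is just the index of the last processed app (the class name and file layout are hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: persist the index of the last processed app so that a crashed run
// can resume where it stopped instead of starting over.
public class StateBackup {
    private final Path stateFile;

    public StateBackup(Path stateFile) {
        this.stateFile = stateFile;
    }

    // Called before processing each app.
    public void save(int lastProcessedIndex) {
        try {
            Files.writeString(stateFile, Integer.toString(lastProcessedIndex));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the index to resume from, or 0 on a fresh start.
    public int restore() {
        try {
            if (!Files.exists(stateFile)) {
                return 0;
            }
            return Integer.parseInt(Files.readString(stateFile).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Removes the state file, e.g. after a run completes successfully.
    public void clear() {
        try {
            Files.deleteIfExists(stateFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing the state before each app means that, at worst, one app is re-processed after a crash.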
3.8 Good shape of information and data mining
The last step of the tool is to create a file where all the extracted data are included and
shown in an understandable and readable shape, in order to raise the interest of developers,
workers in the app advertising field, managers of work groups related to the mobile app
market and other professionals in the same field. This new file is structured like the figures
in the conclusion section, and highlights how often the apps are present in the charts
and the evolution of their chart positions during the evaluated period. From this file we
can elaborate the data and, through graphics and diagrams, show some important information
very useful to the stakeholders.
Chapter 4. Conclusion
CHAPTER 4
CONCLUSION
4.1 Results
The results of this work are three types of CSV files where you can find all the extracted
data in a semi-structured shape. The choice of the CSV format has been made because it is
easy to use and easy to manage: it is supported by almost all spreadsheets and database
management systems, and many programming languages have libraries that support
CSV files.
The first type shows the chart on a specific date with the list of URLs of the dedicated
pages on App Annie. This type of file is only a means to the following steps. The second
type is different: here each row is dedicated to an app, and the data of each app are shown
in different fields of the same row. So, in the same place there are all the extracted data
of every app present in the top 100 charts of the evaluated period: URL, description and
version history, with one version per field of the row. This file can be considered a
semi-structured data file, and the contained data are organised in a ready-to-analyse shape.
The third type of file is more structured than the other ones, but it is still considered semi-
structured because there is no relational database containing the data. However, this is a
shape that highlights a lot of important information that was hidden inside the mass of
data. We can see a new chart where the apps are ordered by their presences in the two-weekly
charts, with the positions taken and the related dates. The average of the positions
during the evaluated period is also shown.
URL DESCRIPTION VERSIONS
https://www.appannie.com/apps/google-play/app/com.mojang.minecraftpe/details/ Description Our latest free . . . Varies with device Jan 17, 2012 Current release
https://www.appannie.com/apps/google-play/app/com.sikebox.retrorika.material.icons/details/ Description Welcome to your . . . 3.9 May 27, 2016 Current release 3.8 May 21, 2016
https://www.appannie.com/apps/google-play/app/com.ninjakiwi.bloonstd5/details/ Description Five-star tower defense . . . 3.2 May 17, 2016 Current release 3.1 Mar 10, 2016
https://www.appannie.com/apps/google-play/app/com.robtopx.geometryjump/details/ Description Jump and y your way . . . 2.011 Sep 29, 2015 Current release 2.01 Sep 28, 2015
Table 4.1: Extracted from the second le
The third type of file is the first step of a proper data mining approach. From it
we can elaborate a lot of data, extract information easily and figure out distributions and
statistics about the apps.
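For instance, the per-app statistics of the third file (weeks on chart, average, worst and best position) could be computed from the list of chart positions roughly like this (a hypothetical sketch, not the thesis' code):

```java
import java.util.Collections;
import java.util.List;

// Sketch: the per-app statistics shown in the third file, computed from the
// list of positions the app took in the two-weekly charts.
public class ChartStats {
    public static int weeks(List<Integer> positions) {
        return positions.size();
    }

    public static double average(List<Integer> positions) {
        return positions.stream().mapToInt(Integer::intValue).average().orElse(0);
    }

    public static int worst(List<Integer> positions) {
        return Collections.max(positions); // highest number = worst rank
    }

    public static int best(List<Integer> positions) {
        return Collections.min(positions); // lowest number = best rank
    }

    public static void main(String[] args) {
        List<Integer> positions = List.of(1, 1, 2, 1, 3);
        System.out.printf("Weeks: %d  Avg: %.2f  Worst: %d  Best: %d%n",
                weeks(positions), average(positions), worst(positions), best(positions));
    }
}
```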
Type and dates of chart Position App's name Weeks Pos average Worst pos Best pos
Messenger 28 1.1428571429 3 1
2016_05_11_USA_Free 1
2016_04_27_USA_Free 1
2016_04_13_USA_Free 2
2016_03_30_USA_Free 1
2016_03_16_USA_Free 1
2016_03_02_USA_Free 1
2016_02_17_USA_Free 1
2016_02_03_USA_Free 1
2016_01_20_USA_Free 1
2016_01_06_USA_Free 1
2015_12_23_USA_Free 1
2015_12_09_USA_Free 1
2015_11_25_USA_Free 1
2015_11_11_USA_Free 1
2015_10_28_USA_Free 1
2015_10_14_USA_Free 1
2015_09_30_USA_Free 1
2015_09_16_USA_Free 1
2015_09_02_USA_Free 1
2015_08_19_USA_Free 1
2015_08_05_USA_Free 3
2015_07_22_USA_Free 2
2015_07_08_USA_Free 1
2015_06_24_USA_Free 1
2015_06_10_USA_Free 1
2015_05_27_USA_Free 1
2015_05_13_USA_Free 1
2015_04_29_USA_Free 1
Table 4.2: The third le
4.2 Future work
The HTML code files of every analysed app have been stored in order to allow future
elaborations and research, even if the website should no longer be accessible.
The next step is to examine in depth the data of the files, and especially of the third
file, in order to improve the shape and to find other connections among the available set
of data.
Extracting more data from the site is possible and, using the developed tool as a starting
point, it is also possible to improve it by adding more characteristics and modifying which
data have to be extracted. This is possible thanks to an accurate modular development,
with various classes that handle parts of the code independent from each other.
The tool does not have a graphical interface, because it was not a priority for this project,
but it could be an essential component to make the tool more usable and to enable non-IT
professionals to use it.
Chapter 5. Acknowledgement
CHAPTER 5
ACKNOWLEDGEMENT
I wish to express my sincere thanks to Professor Riccardo Scateni (University of Cagliari)
for his guidance and his precious advice, to Professor Gianni Fenu for his great availability,
and to Professor Xiaosong Yang and Professor Zhidong Xiao for their continuous
support and teaching during the whole training period at Bournemouth University.
I am also thankful to Nicola, Edoardo and Simone for this traineeship experience, to
the whole Chwazi team for its support over these last three years, and to my friends, especially
Valentina, Riccardo, Mario, Mattia and Luigi, for their encouragement, constant help and
important advice.
I take this opportunity to express gratitude to all of the Department's faculty members
for their help and support. I also greatly thank my family for their support and attention.
Chapter 6. Attachments
CHAPTER 6
ATTACHMENTS
6.1 The development environment
6.1.1 Hardware
CPU : Intel Core 2 Duo SU9400, 1.4 GHz
GPU : Intel GMA X4500
RAM : 4.0GB
6.1.2 Software
OS : Windows 10 Home x86 and Ubuntu 15.10 x64
IDE : Eclipse Mars 5.2
Editor : GitHub Atom 1.8.0
6.1.3 Repository
GitHub : Mattia Palla's Bitbucket
6.1.4 Thesis
LaTeX : Texmaker 4.5
LISTINGS

3.1 Login's automation
3.2 Excerpt from the method that extracts the description of each app
3.3 Extraction of the list from the chart
3.4 A piece of the code that selects the date from the App Annie website
3.5 Multithreading
LIST OF FIGURES

3.1 www.appannie.com, Top 100 apps charts
3.2 www.appannie.com, the page dedicated to each app
3.3 The tool's flow
Bibliography
BIBLIOGRAPHY
[1] Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham,
Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, et al.
Manifesto for agile software development. 2001.
[2] Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller.
Automation and customization of rendered web pages. In Proceedings of the 18th
annual ACM symposium on User interface software and technology, pages 163-172.
ACM, 2005.
[3] http://www.seleniumhq.org/docs/index.jsp. Selenium documentation.
[4] Simon Munzert, Peter Rubba, Christian, and Dominic Nyhuis. Automated data col-
lection with R: A practical guide to web scraping and text mining. John Wiley Sons,
2014.
[5] Michael Schrenk. Webbots, spiders, and screen scrapers: A guide to developing Internet
agents with PHP/CURL. No Starch Press, 2012.

  • 4. CONTENTS
    1 Introduction
      1.1 Motivation
      1.2 Purpose
      1.3 Organization
    2 State of the art
      2.1 App Annie: a mobile analytics provider company
        2.1.1 App Annie's APIs
      2.2 Data scraping
    3 Discussion
      3.1 The starting point
      3.2 The approach
      3.3 App Annie's structure
      3.4 Selenium API
      3.5 Regular expressions
      3.6 Browsers
      3.7 The tool's development
        3.7.1 The project
        3.7.2 Opening the browser and logging in
        3.7.3 The scraping from the app's page
        3.7.4 How to save data to a CSV file
        3.7.5 Web scraping on the chart webpages
          3.7.5.1 The selection of the date
        3.7.6 Time and optimisations
          3.7.6.1 Multithreading
          3.7.6.2 Measuring the time
        3.7.7 The backup of the state
      3.8 Good shape of information and data mining
  • 5. CONTENTS
    4 Conclusion
      4.1 Results
      4.2 Future work
    5 Acknowledgement
    6 Attachments
      6.1 The development environment
        6.1.1 Hardware
        6.1.2 Software
        6.1.3 Repository
        6.1.4 Thesis
  • 6. CHAPTER 1
    INTRODUCTION
    1.1 Motivation
    There are sources of data that can only be consulted through a website, with no way to save a file containing selected, important information. Web scraping and data mining methods can be used to prepare a structured or semi-structured file with filtered, useful data, transforming a barely understandable set of data contained in a website into an organised, readable report on the mobile app market stored in a file [5] [2].
    1.2 Purpose
    The result of this project aims to help developers and software houses make decisions about the development, presentation and advertising of a new mobile app, but it also helps to adjust apps that are already on the market.
    1.3 Organization
    This thesis is the result of an Erasmus traineeship at Bournemouth University. The training period began with a study phase to understand how the required results could be achieved, followed by a second phase implementing the idea. In the first stage of this work, most of the code consisted of a series of attempts to try different approaches to the problem. All the focus went on understanding how a tool could be developed to extract the needed data. The code of a simple web scraping tool written in Python was used to understand how such software could work; after that, the focus shifted to an API useful for the project's purpose, learning how to use the Selenium API.
  • 7. The second step was to code and test every piece of the code repeatedly, finding new issues each time and fixing them. The development proceeded through repeated small goals, increasing the features and the quality of the software step by step. The last stage was to find correlations among the extracted data and show them in an appealing shape to raise the interest of developers.
  • 8. CHAPTER 2
    STATE OF THE ART
    This section explains the main problem and the methods and tools used as terms of comparison and reference for this work.
    2.1 App Annie: a mobile analytics provider company
    With the mobile app market growing at such a rate, numerous app marketing agencies and mobile analytics providers have emerged, offering support in app promotion, analysing the market and elaborating statistics. One of them is App Annie, a business intelligence company and analyst firm headquartered in San Francisco, California. It produces business intelligence tools and market reports for the apps and digital goods industry, and provides the most successful companies with some of these data products for free. Business media outlets use App Annie as a complete solution for tracking downloads, revenues, rankings and reviews. The App Annie suite consists of three main products: Store Stats, Analytics and Intelligence, which provides the most accurate market estimates available.
    2.1.1 App Annie's APIs
    To obtain data from App Annie's website, APIs are available to communicate directly with the database. The problem is the limitation on using them for free (1,000 calls per day). App Annie also allows users to download a CSV file with some data, but this feature is not free, and only a small part of the needed data would be available.
    2.2 Data scraping
    The only way to obtain all the needed data is to use web scraping methods. Other web scraping tools exist, but a general-purpose web scraper cannot produce the same result as a
  • 9. tailored tool. Tools such as UiPath or OutWit Hub have free basic versions that are not powerful enough to carry out all the needed tasks.
  • 10. CHAPTER 3
    DISCUSSION
    3.1 The starting point
    Nowadays we have a lot of data about every subject, and this can be a precious resource if used in the correct way. The problem faced in this thesis is how to extract data automatically from the App Annie website in order to build an organised set of data, easily understandable and readable, that can help app developers make better decisions. From this data set it will be easier to understand some market trends and the reasons that increase or reduce the popularity of each app. This work focuses on the USA and Chinese markets.
    3.2 The approach
    The software has been realised using an iterative and incremental development approach, inspired by agile programming practices [1]. The development was conducted through weekly deliveries of working code, increasing the features and properties of the software every week. The testing phase continued for the whole development period, focusing each week on the newly coded parts and freshly implemented features. The most important data for this work are the name, the description and the versions of each app shown in the top 100 charts of the App Annie website over the last year. To avoid too many repetitions of data and to reduce the time needed to compute them, it was decided to extract the required data every two weeks, going as far back as one year into the past. In this way the sample size is big enough and the day of the week does not affect the measurement.
    3.3 App Annie's structure
    The website pages of interest are two:
    1. Top 100 apps charts:
  • 11. Figure 3.1: www.appannie.com, Top 100 apps charts
    Figure 3.2: www.appannie.com, the page dedicated to each app
  • 12. The 100 most popular apps, ranked by downloads, are shown in this page. There are 5 types of charts: free, paid, grossing, new free and new paid. The existence of these lists depends on the considered country's market: of the markets considered in this work, the USA charts include free, paid and grossing, while the Chinese charts include all five types.
    2. The page dedicated to each app:
    This page contains details about each app, such as its name, software house, description, version history, images and reviews' distribution, plus other information not useful for this work.
    3.4 Selenium API
    Selenium is a set of different software tools for browser automation and web scraping. It is one of the most famous and widely used APIs for these aims, and it can be used with different languages and browsers. It is a very flexible, versatile and easy-to-use set of tools. Using it, it is possible to automate the browser to surf a website, look for specific elements in the HTML code, and extract the pieces or patterns found [3].
    3.5 Regular expressions
    Regular expressions are used to search for everything in web scraping [4]. They are very useful for extracting text from elements that contain a lot of text, or different types of text (the versions element, for example), in order to better classify the extracted data.
    3.6 Browsers
    Mozilla Firefox was chosen for the following reasons:
    1. It is one of the most used and reliable browsers.
    2. It is free and its code is open source.
    3. A wide selection of extensions and add-ons is available to customise the browser's behaviour.
    3.7 The tool's development
    3.7.1 The project
    After a feasibility study, some goals were fixed and it was decided how and what the tool has to do. It extracts from the top 100 charts of the USA and Chinese markets the lists of apps on the charts, grouped by type of chart and country, so every list is stored in a separate CSV file. This process is iterated in order to extract charts every 2 weeks, beginning from the current day (or a previously chosen date) and going backwards as
  • 13. Figure 3.3: The tool's flow
  • 14. far as one year into the past, or to a specific date. After that, the tool reads the freshly created CSV files and, for each URL, opens the browser on that URL, extracts all the HTML code (including pictures) and saves the description and version history in a new structured CSV file. At that point all the required data are available, and the tool can process them to produce a new file where all the obtained information is readable and understandable.
    3.7.2 Opening the browser and logging in
    The first step of the development was to automate the opening of the browser and the display of an app's page. The first problem was how to automate authentication on the website: in fact, the website shows data only to registered users. To solve this, the Selenium function findElement has been used to locate the elements containing the username and password fields in the welcome page, and sendKeys to fill them in.

      this.findElement(By.id("email")).sendKeys(user);    // enter username
      this.findElement(By.id("password")).sendKeys(pass); // enter password
      this.findElement(By.id("submit")).click();          // click on the 'Sign In' button

    Listing 3.1: Login automation
    3.7.3 The scraping from the app's page
    During the first weeks of the traineeship the attention was focused on coding the core of the tool, i.e. on writing the methods that extract the target data from the selected webpages. At this stage, a random app page was used to test every little improvement, showing the extracted data on the IDE's terminal. The Selenium API provides a driver class, which has been extended with new methods in order to keep all the features of the original class while customising them to better manage the surfing and the scraping. Filters of the findElement method that search by class name inside the HTML code have been used. In this way it is possible to select just the useful elements in the page, in this case the description and the version history.

      List<WebElement> elements; // list of WebElements to store each paragraph
      WebElement elem;
      By bySearch = By.className("app_content_section");
      elements = this.findElements(bySearch);

    Listing 3.2: Extract from the method that extracts the description of each app
    However, in most app pages it was also necessary to look for and click on a "more" button to show the element's complete text instead of just a short first part of it. The problem was that not every page has a "more" button, because some apps have a short description that does not need to be expanded to be read. Generalising, this is a frequent problem in web scraping tools, because often, even with the same general HTML
  • 15. page structure, the target element can be hidden in some pages and visible in others. With the Selenium API, two different solutions can be used:
    1. findElement with try/catch: this solution relies on the exceptions thrown by the Selenium findElement method. findElement can find an element, but if the element is not found an exception is thrown, so it is possible to handle the exception and keep the code running when the miss is not a problem.
    2. findElements with a check on the resulting list: a different solution is the findElements method. findElements can find any number of matches, but if it finds no element no exception is thrown and the resulting list is simply empty, so it is easy to check the size of the list to understand whether at least one element was found.
    There are pros and cons to each solution. The first one is faster when the probability of finding the target element is high: as soon as the element is found, the search stops and the next line of code is processed. The second one is better when the first match might not be the searched element, so the search needs to continue. The first solution has been used in order to get faster operations.
    3.7.4 How to save data to a CSV file
    To save data to a CSV file, the standard Java libraries for writing files have been used. The resulting file is made up of three columns: URL, description and versions. For the versions field, each version has been separated so that there is only one version per field in the CSV file. To create this split, a regular expression was needed to match all the different versions of each app (in the HTML code all versions sit together in the same element) and to put each one in a different field of the CSV file.
    3.7.5 Web scraping on the chart webpages
    This step represented the most difficult part of the work [5]. The HTML structure of this page is complex and different for each country, so the code has to generalise the problems and find consistent solutions. Also, this page uses AJAX to manage the date drop-down menu, which is a typical problem when a scraper tool is designed [5].
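The version splitting described in section 3.7.4 can be illustrated with the standard `java.util.regex` package. This is only a sketch: the pattern below is an assumed shape for App Annie's concatenated version entries (a version token followed by a date and an optional "Current release" tag), not the exact expression used in the tool, and the class name `VersionSplitter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionSplitter {

    // Hypothetical pattern: a version token (digits/dots, or "Varies with device")
    // followed by a date like "May 27, 2016" and an optional "Current release" tag.
    private static final Pattern VERSION = Pattern.compile(
            "(\\d[\\d.]*|Varies with device)\\s+" +
            "([A-Z][a-z]{2} \\d{1,2}, \\d{4})(\\s+Current release)?");

    // Splits a raw versions blob into one entry per version,
    // ready to be written into separate CSV fields.
    public static List<String> split(String blob) {
        List<String> entries = new ArrayList<>();
        Matcher m = VERSION.matcher(blob);
        while (m.find()) {
            entries.add(m.group(1) + " | " + m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        String blob = "3.9 May 27, 2016 Current release 3.8 May 21, 2016";
        for (String entry : split(blob)) {
            System.out.println(entry);
        }
    }
}
```

Running the sketch on the sample blob yields one line per version ("3.9 | May 27, 2016" and "3.8 | May 21, 2016"), each of which would become one CSV field.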
  • 16.
      WebElement elementHead = this.findElementByCssSelector(
          "div.region-main-inner:nth-child(3) table:nth-child(1) thead:nth-child(1)");
      this.typesList = elementHead.getText().split("\n");
      this.numList = typesList.length;
      WebElement elementStoreTable = this.findElementById("storestats-top-table");
      List<WebElement> elementsRows = elementStoreTable.findElements(By.cssSelector("tr"));

    Listing 3.3: Extraction of the list from the chart

    URL | DESCRIPTION | VERSIONS
    https://www.appannie.com/apps/. . . | Welcome to your . . . | 3.9, May 27, 2016 (current release); 3.8, May 21, 2016

    Table 3.1: Required file structure

    The purpose of this step was to build a second CSV file where the tool stores the top 100 charts with the apps' URLs. Each page has at least 3 and not more than 5 different chart types, depending on the selected country. On the same page it is also possible to choose the date of the chart. The tool has to take all these features into account. To make the process and the reading of the files easier, it was preferred to create one file per chart type per country. To make this function suitable for the different structures of this page, it was chosen to detect the number and types of lists in the page before beginning the data extraction, in order to organise the structure of the CSV file so that it holds data in a clear way. The findElement function has been used repeatedly to select the correct element, reducing the time spent.
    3.7.5.1 The selection of the date
    So far the date factor had been ignored during the development. This was a precise choice to make the development easier, following an incremental approach. The goal is to read the charts every two weeks, and to achieve it the tool has to operate the date drop-down menu to select the required date. This is a real and complex browser automation: the tool has to click on the date menu, detect the currently selected date and select the correct one. The Calendar class of java.util has been used to count days in order to select dates spaced exactly 14 days apart.
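The two-weekly stepping described above can be sketched with java.util.Calendar. The class and method names below are illustrative, not the tool's actual code; the yyyy_MM_dd output format mirrors the file-name prefixes used elsewhere in the thesis (e.g. 2016_05_11_USA_Free).

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.List;

public class BiweeklyDates {

    // Produces `samples` dates, starting from `start` and stepping
    // exactly 14 days back each time, formatted like the tool's file names.
    public static List<String> datesBack(Calendar start, int samples) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy_MM_dd");
        Calendar cal = (Calendar) start.clone();
        List<String> dates = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            dates.add(fmt.format(cal.getTime()));
            cal.add(Calendar.DAY_OF_MONTH, -14); // two weeks earlier
        }
        return dates;
    }

    public static void main(String[] args) {
        Calendar start = new GregorianCalendar(2016, Calendar.MAY, 25);
        // 27 samples cover one year: 26 steps of 14 days = 364 days back.
        for (String d : datesBack(start, 27)) {
            System.out.println(d);
        }
    }
}
```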
  • 17.
      if (this.monthList.indexOf(this.calMonthSelectedNow) == calExpToStart.get(Calendar.MONTH)
              && calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow)) {
          String monthWithZero;
          // Month and year matched: call selectDate with the date to select
          // and set the dateNotFound flag to false.
          int month = this.monthList.indexOf(this.calMonthSelectedNow) + 1;
          if (month < 10)
              monthWithZero = "0" + month;
          else
              monthWithZero = month + "";
          stringFPG.monthYearForNameList = this.calYearSelectedNow + "_" + monthWithZero + "_";
          this.selectDate(String.valueOf(calExpToStart.get(Calendar.DAY_OF_MONTH)));
          dateNotFound = false;
      }
      // If the currently selected month and year are earlier than the expected
      // start month and year, click on the 'next' button of the date picker.
      else if ((this.monthList.indexOf(this.calMonthSelectedNow) < calExpToStart.get(Calendar.MONTH)
              && calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow))
              || calExpToStart.get(Calendar.YEAR) > Integer.parseInt(this.calYearSelectedNow))
          this.findElement(By.className("ui-datepicker-next")).click();
      // If the currently selected month and year are later than the expected
      // start month and year, click on the 'prev' button of the date picker.
      else if ((this.monthList.indexOf(this.calMonthSelectedNow) > calExpToStart.get(Calendar.MONTH)
              && calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow))
              || calExpToStart.get(Calendar.YEAR) < Integer.parseInt(this.calYearSelectedNow))
          this.findElement(By.className("ui-datepicker-prev")).click();

    Listing 3.4: A piece of code to select the date on the App Annie website
  • 18. 3.7.6 Time and optimisations
    The total time the program needs to complete the working flow over one year of app charts is long: the tool spends approximately 20 minutes extracting the app lists and 30 seconds analysing each application present in the charts. The 30 seconds are the result of an effort to reduce the time as much as possible, trying different combinations of Selenium API functions to look for the required element in the HTML code. This pace also does not worry the website's server, because the automated browser behaves similarly to a human, so the tool's operations are not blocked. There are 26 chart pages to analyse. The USA chart pages contain 3 types of lists, the Chinese chart pages contain 5, and every list is made up of 100 app names (sometimes the new free and new paid charts have fewer than 100 apps, but never more than 100). So the number of apps is a little less than 20,800. If the tool analysed every app's page, the time spent would be 10,400 minutes, i.e. 173.3 hours, more than 7 days of uninterrupted running. Fortunately it is possible to optimise this hugely: most of the 20,800 apps are repeats. Since the longest part of the tool's process is the scraping of each app's page, this part has been optimised by skipping apps already analysed. The stored lists still contain all the apps, in order to keep a clear picture of every chart, but when the tool reads these lists it skips apps it has already seen. In this way the number of app pages analysed, and the time spent by the tool, decreases considerably: the number of analysed apps drops to about 2,100, reducing the time spent by roughly 90%, i.e. to around 17 hours.
    3.7.6.1 Multithreading
    The tool is partially multithreaded, and the idea is to convert every part to multithreading. Currently the first process of the tool, the extraction of the lists from the top 100 chart pages, is coded with 2 threads, one per country. In this way the time to extract the lists has been reduced by 33% (from 15 minutes per country, i.e. 30 minutes in total, to 20 minutes with the multithreaded algorithm, on the testing hardware). The second part of the tool could also improve its running time with a multithreading approach, and this will be part of future progress.

      ThreadDemo T1 = new ThreadDemo("USA",
          "https://www.appannie.com/apps/google-play/top-chart/united-states/overall/?date=2016-05-25");
      T1.start();
      ThreadDemo T2 = new ThreadDemo("CHI",
          "https://www.appannie.com/apps/google-play/top-chart/china/overall/");
      T2.start();

    Listing 3.5: Multithreading
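Since ThreadDemo's internals are not shown in the thesis, here is a minimal sketch of the same two-thread layout using plain java.lang.Thread. The per-country work is a trivial stand-in (the real class drives a browser instance), and all names below are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ChartScraperThreads {

    // Shared, thread-safe container for per-country results.
    static final Map<String, Integer> appsPerCountry = new ConcurrentHashMap<>();

    // Stand-in for the real per-country extraction: the actual ThreadDemo
    // class drives a browser instance instead of doing this arithmetic.
    static Thread scraperFor(String country, int chartTypes) {
        return new Thread(() -> appsPerCountry.put(country, chartTypes * 100));
    }

    // Runs both country scrapers in parallel and waits for both,
    // mirroring the two-thread layout of Listing 3.5.
    static Map<String, Integer> runBoth() {
        Thread usa = scraperFor("USA", 3); // free, paid, grossing
        Thread chi = scraperFor("CHI", 5); // all five chart types
        usa.start();
        chi.start();
        try {
            usa.join();
            chi.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return appsPerCountry;
    }

    public static void main(String[] args) {
        System.out.println(runBoth());
    }
}
```

The join calls are what make the pattern safe: the results are only read after both workers have finished.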
  • 19. 3.7.6.2 Measuring the time
    Opening webpage: OK. Total time: 3149.649 sec. Task time: 9.821045 sec.
    Checking if app exists yet: OK. Total time: 3160.453 sec. Task time: 10.803955 sec.
    Finding images with relative URL: OK. Total time: 3160.863 sec. Task time: 0.41015625 sec.
    Downloading webpage to data.html: OK. Total time: 3161.082 sec. Task time: 0.21899414 sec.
    Fetching product's description: OK. Total time: 3162.048 sec. Task time: 0.96606445 sec.
    Fetching product's versions: OK. Total time: 3166.224 sec. Task time: 4.1760254 sec.
    Fetching product's versions: OK. Total time: 3166.738 sec. Task time: 0.513916 sec.
    Fetching product's versions: OK. Total time: 3166.739 sec. Task time: 9.765625E-4 sec.
    Opening new browser tab to visit the next URL: OK. Total time: 3173.452 sec. Task time: 6.7128906 sec.
    App time: 33.624023 sec.
    Table 3.2: Measuring the time
    The measurement of the required time has been managed using a stopwatch class, writing the time spent in each step of the tool to a text file. Controlling the timings is very important in a web scraping tool, because the large amount of data requires long running times.
    3.7.7 The backup of the state
    Providing a backup system was needed: over such a long running period it is impossible to exclude a blackout, a server or client crash, or a connection problem. To handle this, the tool saves its state before processing every app, so that no data or time is lost.
    If the program is stopped, when the user starts it again it restarts from the last app analysed (if the previous run did not finish processing the apps).
    3.8 Good shape of information and data mining
    The last step of the tool is to create a file where all the extracted data are included and shown in an understandable and readable shape, in order to raise the interest of developers, workers in the app advertising field, managers of work groups related to the mobile app market and other professionals in the same field. This new file is structured like the figures in the conclusion section, and highlights how often the apps are present in the charts and the evolution of their positions during the evaluated period. From this file we can elaborate the data and, through graphs and diagrams, show some important information that is very useful for the stakeholders.
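The state backup described in section 3.7.7 can be sketched as a small class that persists the index of the last processed app. The file name, class name and integer-index representation are assumptions for illustration, not the tool's actual format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScraperState {

    private final Path stateFile;

    public ScraperState(Path stateFile) {
        this.stateFile = stateFile;
    }

    // Persist the index of the last processed app before each page visit,
    // so a crash or connection loss costs at most one app.
    public void save(int lastProcessed) {
        try {
            Files.write(stateFile, String.valueOf(lastProcessed).getBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // On start-up, resume from the saved index, or from 0 on a fresh run.
    public int load() {
        try {
            if (!Files.exists(stateFile)) return 0;
            String text = new String(Files.readAllBytes(stateFile)).trim();
            return text.isEmpty() ? 0 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("scraper", ".state");
        ScraperState state = new ScraperState(file);
        state.save(42);
        System.out.println(state.load()); // resumes from app index 42
    }
}
```

Saving before every app, as the thesis describes, bounds the cost of any crash to a single re-scraped page.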
CHAPTER 4
CONCLUSION

4.1 Results

The results of this work are three types of CSV files, in which all the extracted data can be found in a semi-structured shape. The CSV format was chosen because it is easy to use and easy to manage: it is supported by almost all spreadsheets and database management systems, and many programming languages have libraries for handling CSV files.

The first type shows the chart on a specific date, with the list of URLs of the dedicated pages on App Annie. This type of file is only a means to the following steps.

The second type is different: here each row is dedicated to an app, and the data of each app are laid out in different fields of the same row. So, in a single place there are all the extracted data of every app present in the Top 100 charts of the evaluated period: URL, description and version history, with one version per field of the row. This file can be considered a semi-structured data file, and the contained data are organised in a ready-to-analyse shape.

The third type of file is more structured than the other ones, but it is still considered semi-structured because there is no relational database containing the data. However, this shape highlights a lot of important information that was hidden inside the mass of data. We can see a new chart where the apps are ordered by their presences in the bi-weekly charts, with the positions taken and the related dates. The average of the positions over the evaluated period is also shown.

URL: https://www.appannie.com/apps/google-play/app/com.mojang.minecraftpe/details/
Description: Our latest free . . .
Versions: Varies with device, Jan 17, 2012 (current release)

URL: https://www.appannie.com/apps/google-play/app/com.sikebox.retrorika.material.icons/details/
Description: Welcome to your . . .
Versions: 3.9, May 27, 2016 (current release); 3.8, May 21, 2016

URL: https://www.appannie.com/apps/google-play/app/com.ninjakiwi.bloonstd5/details/
Description: Five-star tower defense . . .
Versions: 3.2, May 17, 2016 (current release); 3.1, Mar 10, 2016

URL: https://www.appannie.com/apps/google-play/app/com.robtopx.geometryjump/details/
Description: Jump and fly your way . . .
Versions: 2.011, Sep 29, 2015 (current release); 2.01, Sep 28, 2015

Table 4.1: Extracted from the second file

The third type of file is also the first step of a proper data mining approach. From it we can elaborate a lot of data, extract information easily and figure out distributions and statistics about the apps.
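As a minimal sketch of the kind of statistics the third file contains, the snippet below aggregates one app's weekly chart positions (as they would be read from the rows of the second file) into the number of weeks present, the average position, and the worst and best positions. The class name and the sample data are illustrative, not part of the actual tool.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PositionStats {
    public static void main(String[] args) {
        // Hypothetical sample of one app's positions in the Top 100 charts.
        List<Integer> positions = Arrays.asList(1, 1, 2, 1, 1, 3, 2, 1);

        int weeks = positions.size();
        int worst = Collections.max(positions); // largest rank number = worst placement
        int best  = Collections.min(positions); // smallest rank number = best placement
        double average = positions.stream()
                .mapToInt(Integer::intValue)
                .average()
                .orElse(0.0);

        System.out.printf(Locale.ROOT, "Weeks: %d  Avg: %.4f  Worst: %d  Best: %d%n",
                weeks, average, worst, best);
        // Prints: Weeks: 8  Avg: 1.5000  Worst: 3  Best: 1
    }
}
```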
App's name: Messenger — Weeks: 28 — Pos average: 1.1428571429 — Worst pos: 3 — Best pos: 1

Type and date of chart     Position
2016_05_11_USA_Free        1
2016_04_27_USA_Free        1
2016_04_13_USA_Free        2
2016_03_30_USA_Free        1
2016_03_16_USA_Free        1
2016_03_02_USA_Free        1
2016_02_17_USA_Free        1
2016_02_03_USA_Free        1
2016_01_20_USA_Free        1
2016_01_06_USA_Free        1
2015_12_23_USA_Free        1
2015_12_09_USA_Free        1
2015_11_25_USA_Free        1
2015_11_11_USA_Free        1
2015_10_28_USA_Free        1
2015_10_14_USA_Free        1
2015_09_30_USA_Free        1
2015_09_16_USA_Free        1
2015_09_02_USA_Free        1
2015_08_19_USA_Free        1
2015_08_05_USA_Free        3
2015_07_22_USA_Free        2
2015_07_08_USA_Free        1
2015_06_24_USA_Free        1
2015_06_10_USA_Free        1
2015_05_27_USA_Free        1
2015_05_13_USA_Free        1
2015_04_29_USA_Free        1

Table 4.2: The third file

4.2 Future Work

The HTML code of every analysed app page has been stored in order to allow future elaborations and research, even if the website should become inaccessible. The next step is to examine the data of the files in depth, especially the third file, in order to improve their shape and to find other connections within the available set of data.

Extracting more data from the site is also possible: using the developed tool as a starting point, it can be improved by adding more features and by modifying which data has to be extracted. This is feasible thanks to an accurate modular development, with various classes handling parts of the code that are independent from each other.

The tool does not have a graphical interface, because it was not a priority for this project, but it could be an essential component to make the tool more usable and to enable non-IT professionals to use it.
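The modular structure described above could be sketched as follows, with each pipeline stage (fetching a chart, scraping an app page, writing CSV) behind its own small interface, so one stage can be changed without touching the others. All interface and class names here are hypothetical illustrations, not the tool's actual API; the lambdas are stubs standing in for the real Selenium-backed implementations.

```java
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    // One hypothetical interface per independent stage of the tool.
    interface ChartFetcher { List<String> fetchAppUrls(String chartId); }
    interface AppScraper   { String extract(String appUrl); }
    interface CsvWriter    { String toRow(String appUrl, String extracted); }

    public static void main(String[] args) {
        // Stub implementations; swapping one (e.g. to extract different
        // data per app) leaves the other stages untouched.
        ChartFetcher fetcher = chartId -> Arrays.asList("https://example.org/app1");
        AppScraper scraper   = url -> "description of " + url;
        CsvWriter writer     = (url, data) -> url + ";" + data;

        for (String url : fetcher.fetchAppUrls("2016_05_11_USA_Free")) {
            System.out.println(writer.toRow(url, scraper.extract(url)));
        }
    }
}
```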
CHAPTER 5
ACKNOWLEDGEMENT

I wish to express my sincere thanks to Professor Riccardo Scateni (University of Cagliari) for his guidance and his valuable advice, to Professor Gianni Fenu for his great availability, and to Professor Xiaosong Yang and Professor Zhidong Xiao for their continuous support and teachings during the whole training period at Bournemouth University. I am also thankful to Nicola, Edoardo and Simone for this traineeship experience, to the whole Chwazi team for its support over these last three years, and to my friends, especially Valentina, Riccardo, Mario, Mattia and Luigi, for their encouragement, constant help and important advice. I take this opportunity to express gratitude to all of the Department's faculty members for their help and support. I also greatly thank my family for their support and attention.
CHAPTER 6
ATTACHMENTS

6.1 The development environment

6.1.1 Hardware
CPU: Intel Core 2 Duo SU9400, 1.4 GHz
GPU: Intel GMA X4500
RAM: 4.0 GB

6.1.2 Software
OS: Windows 10 Home x86 and Ubuntu 15.10 x64
IDE: Eclipse Mars
Editor: GitHub Atom 1.8.0

6.1.3 Repository
Mattia Palla's Bitbucket

6.1.4 Thesis
LaTeX: Texmaker 4.5
LISTINGS

3.1 Login's automation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Extracted from the method that extracts the description of each app . . . . 9
3.3 Extraction of the list from the chart . . . . . . . . . . . . . . . . . . . . . . 11
3.4 A piece of code to select the date on the App Annie website . . . . . . . . . 12
3.5 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
LIST OF FIGURES

3.1 www.appannie.com, Top 100 apps charts . . . . . . . . . . . . . . . . . . . 6
3.2 www.appannie.com, the page dedicated to each app . . . . . . . . . . . . . 6
3.3 The tool's flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8