This document describes the development of a web scraping tool to extract useful mobile app market data from App Annie's website. The tool automates browsing to App Annie pages using Selenium, scrapes the app name, description and version history from individual app pages, and saves the data to CSV files. It iterates through App Annie's top charts from the past year for the US and Chinese markets to build a structured dataset for analysis and to help app developers. The project uses an agile development approach with weekly iterations to expand the tool's functionality and optimise performance over time.
UNIVERSITÀ DEGLI STUDI DI CAGLIARI
FACULTY OF SCIENCES
Degree course in Computer Science
DATA EXTRACTION FROM APPANNIE'S WEBSITE USING
WEB SCRAPING
Supervisor:
Prof. Riccardo Scateni
Candidate:
Mattia Palla
(matr. 48971)
Academic year 2015-2016
Abstract
A web scraping tool is software used to extract target data from a web application, usually a website. It is normally the last resort for obtaining data from a source that provides no more convenient alternative, such as downloadable structured data files or an API for communicating with the database containing the target data.
Nowadays, data and information about the mobile application market and its trends are very important for developers and software houses that want to launch a new application or improve existing ones.
The role of a scraping tool is to extract unstructured data and transform it into semi-structured or structured data; the role of data mining methods is to discover patterns in large data sets and put them in a readable and understandable shape.
The purpose of this project is the development of an effective and efficient tool which, using scraping and data mining methods, extracts useful data from a source providing data about the mobile app market, and shows it in an understandable shape, highlighting links and correlations among the extracted data.
CHAPTER 1
INTRODUCTION
1.1 Motivation
There are data sources where it is possible only to consult data through a website, and impossible to save a file with selected, important information. Web scraping and data mining methods can be used to prepare a structured or semi-structured file with filtered, useful data, transforming a barely understandable set of data contained in a website into an organised and readable report on the mobile app market, stored in a file [5] [2].
1.2 Purpose
The result of this project aims to help developers and software houses make decisions about the development, presentation, and advertising of a new mobile app, and also to adjust apps that have already been released.
1.3 Organization
This thesis is the result of an Erasmus traineeship at Bournemouth University. During the traineeship there was a first phase of study to understand how the required results could be achieved, followed by a second phase of implementation of the idea.
In the first stage of this work, most of the code consisted of a series of attempts to try different approaches to the problem. The focus was on understanding how a tool to extract the needed data could be developed. The code of a simple web scraping tool written in Python was used to understand how such software could work; after that, the focus shifted to an API useful for the project's purpose, learning how to use the Selenium API.
The second step was to code and test every piece of the code repeatedly, finding new issues each time and fixing them. The development proceeded through repeated small goals, increasing the features and the quality of the software step by step.
The last stage was to find correlations among the extracted data and show them in an agreeable shape to raise the interest of developers.
CHAPTER 2
STATE OF THE ART
This chapter explains the main problem and the methods and tools used as a term of comparison and as a reference for this work.
2.1 App Annie: a mobile analytics provider
With the mobile app market growing at such a rate, numerous app marketing agencies and mobile analytics providers have emerged, offering support in app promotion, analysing the market and producing statistics. One of them is App Annie.
App Annie is a business intelligence company and analyst firm headquartered in San Francisco, California. It produces business intelligence tools and market reports for the apps and digital goods industry, and provides successful companies with some of these data products for free. Business media outlets use App Annie as a complete solution for tracking downloads, revenues, rankings and reviews. The App Annie suite consists of three main products: Store Stats, Analytics, and Intelligence, which provides the most accurate market estimates available.
2.1.1 App Annie's APIs
App Annie provides APIs to obtain data by communicating directly with its database. The problem is the limitation on their free use (1,000 calls per day). App Annie also allows downloading a CSV file with some data, but this feature is not free and only a small part of the needed data would be available.
2.2 Data scraping
The only way to obtain all the needed data is to use web scraping methods. Other web scraping tools exist, but a general-purpose scraper cannot produce the same results as a tailored tool. Tools such as UiPath or OutWit Hub have free basic versions that are not powerful enough to perform all the needed tasks.
CHAPTER 3
DISCUSSION
3.1 The starting point
Nowadays we have a lot of data about every subject. This can be a precious resource if used in the correct way. The problem faced in this thesis is how to extract data automatically from the App Annie website in order to produce an organised, easily understandable and readable data set that can help app developers make better decisions. From this data set it will be easier to understand some market trends and the reasons that increase or reduce the popularity of each app. The work focuses on the US market and the Chinese market.
3.2 The Approach
The software has been developed using an iterative and incremental approach inspired by agile programming practices [1]. The development was conducted through weekly deliveries of working code, increasing the features and the properties of the software every week. Testing continued throughout the development period, focusing each week on the newly coded parts and newly implemented features.
The most important data for this work are the name, the description and the versions of each app shown in the top 100 charts of the App Annie website over the last year. To avoid too much repetition of data and to reduce the computation time, it was decided to extract the required data every two weeks, going as far back as one year into the past. In this way the sample size is big enough and the day of the week does not affect the measurement.
3.3 App Annie's structure
Two pages of the website are of interest:
1. Top 100 apps charts:
Figure 3.1: www.appannie.com, Top 100 apps charts
Figure 3.2: www.appannie.com, the page dedicated to each app
This page shows the 100 most popular apps, ranked by downloads. There are 5 types of charts: free, paid, grossing, new free and new paid. Which of these lists exist depends on the country's market considered: of the markets considered in this work, the US market has the free, paid and grossing charts, while the Chinese market has all five types.
2. The page dedicated to each app:
This page contains details about each app, such as the name, software house, description, version history, images, the distribution of reviews, and other information not useful for this work.
3.4 Selenium API
Selenium is a set of different software tools for browser automation and web scraping. It is one of the most famous and widely used APIs for these purposes, and it can be used with different languages and browsers. It is a very flexible, versatile and easy-to-use set of tools. Using it, it is possible to automate the browser to navigate the website, to look for specific elements in the HTML code, and to extract the pieces or patterns found [3].
3.5 Regular expressions
Regular expressions are used throughout web scraping to search for text [4]. They are very useful for extracting text from elements that contain a lot of text, or different types of text (the versions element, for example), in order to better classify the extracted data.
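As an illustration of this idea, the following Java sketch splits a blob of version-history text into individual entries. The sample text and the pattern are assumptions for illustration, not the thesis's actual expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionSplitter {
    // Matches entries like "3.9 May 27, 2016" inside a single blob of text.
    private static final Pattern VERSION_ENTRY =
        Pattern.compile("(\\d+(?:\\.\\d+)+)\\s+(\\w+ \\d{1,2}, \\d{4})");

    public static List<String> split(String blob) {
        List<String> entries = new ArrayList<>();
        Matcher m = VERSION_ENTRY.matcher(blob);
        while (m.find()) {
            entries.add(m.group(1) + " " + m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        String blob = "3.9 May 27, 2016 Current release 3.8 May 21, 2016";
        System.out.println(split(blob));
    }
}
```

Each match then becomes one field in the resulting CSV row, which is the classification step described above.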
3.6 Browsers
The choice of Mozilla Firefox was made for the following reasons:
1. It is one of the most used and reliable browsers
2. It is free and open source
3. A wide selection of extensions and add-ons is available to customise the browser's behaviour
3.7 The tool's development
3.7.1 The project
After a feasibility study phase, some goals were fixed and it was decided what the tool had to do and how. It extracts from the top 100 charts of the US and Chinese markets the lists of charted apps, grouped by chart type and country, and stores every list in a separate CSV file. This process is iterated in order to extract charts every 2 weeks, beginning from the current day (or a previously chosen date) and going backwards as far as one year into the past, or to a specific date. After that, the tool reads the recently created CSV files and, for each URL, opens the browser at that URL, extracts all the HTML code, including pictures, and saves the description and version history in a new, structured CSV file. At this point all the required data are available and the tool can process them to produce a new file where all the obtained information is readable and understandable.
3.7.2 Opening the browser and Logging in
The first step of the development was to automate the opening of the browser and the display of an app's page. The first problem was how to automate authentication on the website, which shows data only to registered users. To solve this, the Selenium function findElement has been used to locate the username and password fields on the welcome page, and sendKeys to fill them in.
this.findElement(By.id("email")).sendKeys(new String[] { user }); // Enter user
this.findElement(By.id("password")).sendKeys(new String[] { pass }); // Enter password
this.findElement(By.id("submit")).click(); // Click on 'Sign In' button
Listing 3.1: Login's automation
3.7.3 The scraping from app's page
In the first weeks of the traineeship the attention was focused on coding the core of the tool, that is, on writing the methods that extract the target data from the selected web pages. For this stage, a random app page was used to test every little improvement, showing the extracted data in the IDE's terminal. The Selenium API provides a driver class, which has been extended with new methods in order to keep all the features of the original class while customising them to better manage the navigation and the scraping. Filters of the findElement method that search by class name inside the HTML code have been used. In this way it is possible to select only the useful elements of the page, in this case the description and the version history.
List<WebElement> elements; // list of WebElements to store each found element (paragraph)
WebElement elem;
By bySearch = By.className("app_content_section");
elements = this.findElements(bySearch);
Listing 3.2: Extract from the method that extracts the description of each app
However, on most app pages it was necessary to look for and click on a "more" button to show the element's complete text instead of just a short initial part of it. The problem was that not every page has a "more" button, because some have a short description that does not need to be expanded to be read. Generalising, this is a frequent problem for web scraping tools: even on pages with the same general HTML structure, the target element can be hidden on some pages and visible on others. To solve this, 2 different solutions with the Selenium API can be used:
1. findElement with try/catch:
This solution uses the exceptions raised by the Selenium findElement method. findElement can find an element, but if the element is not found an exception is raised. It is therefore possible to handle the exception and continue running the code when the miss is not a problem.
2. findElements with a check on the resulting list:
A different solution is to use the findElements method. findElements can find an undefined number of matches, but if it finds no element, no exception is raised and the resulting list will simply be empty, so it is easy to check the size of the list to know whether at least one element was found.
There are pros and cons to each solution. The first has the advantage of being faster when the probability of finding the target element is high: as soon as it finds the element it stops and the next line of code is processed. The second is better when we have to look for an element but the first match may not be the one searched for, so the search must continue. The first solution has been used in order to get faster operations.
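The trade-off between the two strategies can be illustrated outside Selenium with a stand-in page object. The Page class below is invented for illustration; Selenium's real findElement/findElements behave analogously (throw on a miss versus return an empty list).

```java
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

public class FindStrategies {
    // A stand-in for a scraped page: it either has a "more" button or not.
    static class Page {
        final boolean hasMore;
        Page(boolean hasMore) { this.hasMore = hasMore; }

        // Analogue of strategy 1: throws when the element is missing.
        String findElement(String name) {
            if (name.equals("more") && !hasMore)
                throw new NoSuchElementException(name);
            return name;
        }

        // Analogue of strategy 2: returns an empty list when nothing matches.
        List<String> findElements(String name) {
            if (name.equals("more") && !hasMore)
                return Collections.emptyList();
            return List.of(name);
        }
    }

    // Strategy 1: try/catch around findElement.
    static boolean hasMoreViaTryCatch(Page p) {
        try {
            p.findElement("more");
            return true;
        } catch (NoSuchElementException e) {
            return false; // the miss is not a problem: just continue
        }
    }

    // Strategy 2: findElements plus a size check on the resulting list.
    static boolean hasMoreViaListCheck(Page p) {
        return !p.findElements("more").isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasMoreViaTryCatch(new Page(true)));   // true
        System.out.println(hasMoreViaListCheck(new Page(false))); // false
    }
}
```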
3.7.4 How to save data to a CSV file
To save data to a CSV file, the standard Java libraries for writing files have been used. The resulting file is made up of three columns: URL, description and versions. For the versions column, each version has been separated so that there is exactly one version per field in the CSV file. To create this split, a regular expression was needed to match all the different versions of each app (in the HTML code all versions appear together in the same element) and to put each one in a different field of the CSV file.
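A minimal sketch of this writing step could look like the class below; the escaping policy and field layout are assumptions for illustration, not the thesis's exact code.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvWriterSketch {
    // Quote a field and double any embedded quotes, so commas inside
    // descriptions do not break the column layout.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Write one row per app: URL, description, then one column per version.
    static void writeRow(PrintWriter out, String url,
                         String description, List<String> versions) {
        StringBuilder row = new StringBuilder();
        row.append(quote(url)).append(',').append(quote(description));
        for (String v : versions) {
            row.append(',').append(quote(v));
        }
        out.println(row);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("apps", ".csv");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            out.println("URL,DESCRIPTION,VERSIONS");
            writeRow(out, "https://www.appannie.com/apps/example",
                     "Welcome to your app",
                     List.of("3.9 May 27, 2016", "3.8 May 21, 2016"));
        }
        System.out.println(Files.readAllLines(file));
    }
}
```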
3.7.5 Web scraping on the chart webpages
This step represented the most difficult part of the work [5]. The HTML structure of this page is complex and different for each country, so the code has to generalise the problems and find consistent solutions. Moreover, this page uses AJAX to manage the date drop-down menu, a typical problem when designing a scraper tool [5].
WebElement elementHead = this.findElementByCssSelector(
    "div.region-main-inner:nth-child(3) table:nth-child(1) thead:nth-child(1)");
this.typesList = elementHead.getText().split("\n");
this.numList = typesList.length;
WebElement elementStoreTable = this.findElementById("storestats-top-table");
List<WebElement> elementsRows = elementStoreTable.findElements(By.cssSelector("tr"));
Listing 3.3: Extraction of the list from the chart
URL | DESCRIPTION | VERSIONS
https://www.appannie.com/apps/. . . | Description Welcome to your . . . | 3.9 May 27, 2016 Current release; 3.8 May 21, 2016
Table 3.1: Required file structure
The purpose of this step was to build a second CSV file where the tool stores the charts of the top 100 apps with their URLs. Each page has at least 3 and no more than 5 different chart types; this structure depends on the selected country. On the same page it is possible to choose the date of the chart. The tool has to take all these features into account. To make the processing and the reading of the files easier, it was preferred to create one file per chart type per country.
To make this function suitable for the different structures of this page, it was chosen to detect the number and types of lists on the page before beginning the data extraction, in order to organise the structure of the CSV file and hold the data in a clear way. The findElement function has been used repeatedly to select the correct element, reducing the time spent.
3.7.5.1 The selection of the date
So far the date factor had been ignored during the development. This was a deliberate choice to make the development easier, following an incremental approach. The goal is to retrieve charts every two weeks; to achieve it, the tool has to operate the date drop-down menu to select the required date. This is a real and complex browser automation: the tool has to click on the date menu, read the currently selected date and select the correct one. The Calendar class of java.util has been used to count days in order to select dates spaced exactly 14 days apart.
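The 14-day stepping itself can be sketched independently of the browser automation; the dates, the date format and the method name below are illustrative assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class DateStepper {
    // Produce dates spaced exactly 14 days apart, going backwards
    // from `start` until reaching `earliest`.
    static List<String> biweeklyDatesBack(Calendar start, Calendar earliest) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        List<String> dates = new ArrayList<>();
        Calendar cursor = (Calendar) start.clone();
        while (!cursor.before(earliest)) {
            dates.add(fmt.format(cursor.getTime()));
            cursor.add(Calendar.DAY_OF_MONTH, -14);
        }
        return dates;
    }

    public static void main(String[] args) {
        Calendar start = Calendar.getInstance();
        start.set(2016, Calendar.MAY, 25);
        Calendar earliest = (Calendar) start.clone();
        earliest.add(Calendar.YEAR, -1);
        List<String> dates = biweeklyDatesBack(start, earliest);
        System.out.println(dates.size() + " chart dates, first: " + dates.get(0));
    }
}
```

Going one year back in 14-day steps yields on the order of 26-27 chart dates per country, which is consistent with the number of chart pages discussed later in this chapter.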
if(this.monthList.indexOf(this.calMonthSelectedNow) == calExpToStart.get(Calendar.MONTH)
        && (calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow)))
{
    String monthWithZero;
    // Month and year matched. Now call the selectDate function with the
    // date to select and set the dateNotFound flag to false.
    int month = this.monthList.indexOf(this.calMonthSelectedNow) + 1;
    if(month < 10)
        monthWithZero = "0" + month;
    else
        monthWithZero = month + "";
    stringFPG.monthYearForNameList = this.calYearSelectedNow + "_" + monthWithZero + "_";
    this.selectDate(String.valueOf(calExpToStart.get(Calendar.DAY_OF_MONTH)));
    dateNotFound = false;
}
// If the currently selected month and year are less than the expected
// start month and year, go inside this condition.
else if((this.monthList.indexOf(this.calMonthSelectedNow) < calExpToStart.get(Calendar.MONTH)
        && (calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow)))
        || calExpToStart.get(Calendar.YEAR) > Integer.parseInt(this.calYearSelectedNow))
    // Click on the next button of the date picker.
    this.findElement(By.className("ui-datepicker-next")).click();
// If the currently selected month and year are more than the expected
// start month and year, go inside this condition.
else if((this.monthList.indexOf(this.calMonthSelectedNow) > calExpToStart.get(Calendar.MONTH)
        && (calExpToStart.get(Calendar.YEAR) == Integer.parseInt(this.calYearSelectedNow)))
        || calExpToStart.get(Calendar.YEAR) < Integer.parseInt(this.calYearSelectedNow))
    // Click on the prev button of the date picker.
    this.findElement(By.className("ui-datepicker-prev")).click();
Listing 3.4: A piece of code to select the date on the App Annie website
3.7.6 Time and optimisations
The total time spent by the program to complete the workflow over one year of app charts is long: approximately 20 minutes to extract the app lists, plus about 30 seconds to analyse each application present in the charts. The figure of 30 seconds is the result of work aimed at reducing the time as much as possible, trying different combinations of Selenium API functions to look for the required elements in the HTML code. Moreover, this pace does not worry the website's server, because the automated browser behaves similarly to a human, so the tool's operations are not blocked.
There are 26 chart pages to analyse. The US chart pages contain 3 types of lists and the Chinese chart pages contain 5. Every list is made up of 100 app names (sometimes the new free and new paid charts have fewer than 100 apps, but obviously never more than 100). So the number of apps is a little less than 20,800. If the tool analysed every app page, the time spent would be 10,400 minutes, that is 173.3 hours, more than 7 days of uninterrupted running. Fortunately it is possible to optimise this considerably: the majority of the 20,800 apps are repeats. Since the longest part of the tool's process is the scraping of every app page, this part has been optimised by skipping apps already analysed. The stored lists still contain all the apps, in order to keep a clear picture of every chart, but when the tool reads these lists it skips the apps it has already seen. In this way the number of app pages analysed, and the time spent by the tool, decrease considerably: the number of analysed apps drops to about 2,100, reducing the time spent by 90%, to 16.6 hours.
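The skip-already-seen optimisation amounts to keeping a set of processed app URLs; a minimal sketch (the URL values are placeholders):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    private final Set<String> seen = new HashSet<>();
    private int scraped = 0;

    // Scrape each URL at most once, however many charts it appears in.
    void process(List<String> chartUrls) {
        for (String url : chartUrls) {
            if (seen.add(url)) {   // add() returns false if already present
                scraped++;         // stand-in for the expensive page scrape
            }
        }
    }

    int scrapedCount() { return scraped; }

    public static void main(String[] args) {
        DedupSketch tool = new DedupSketch();
        tool.process(List.of("appA", "appB", "appC"));
        tool.process(List.of("appB", "appC", "appD")); // two repeats skipped
        System.out.println(tool.scrapedCount()); // 4 unique apps scraped
    }
}
```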
3.7.6.1 Multithreading
The tool is partially multithreaded, and the idea is to transform every part to use multithreading. Currently the first process of the tool, the extraction of the lists from the top 100 chart pages, is coded with 2 threads, one for each country. In this way the time to extract the lists has been reduced by 33% (from 15 minutes per country, that is 30 minutes in total, to 20 minutes with the multithreaded algorithm, on the testing hardware). The second part of the tool could also improve its required time with a multithreading approach, and this will be part of future work.
ThreadDemo T1 = new ThreadDemo("USA",
    "https://www.appannie.com/apps/google-play/top-chart/united-states/overall/?date=2016-05-25");
T1.start();
ThreadDemo T2 = new ThreadDemo("CHI",
    "https://www.appannie.com/apps/google-play/top-chart/china/overall/");
T2.start();
Listing 3.5: Multithreading
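Reduced to something runnable without Selenium, the same two-thread pattern might look like this; the ThreadDemo class itself is not shown in the thesis, so the Worker below is an assumption.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CountryThreads {
    // One worker per country, mimicking the two chart-extraction threads.
    static class Worker extends Thread {
        private final String country;
        private final Queue<String> results;
        Worker(String country, Queue<String> results) {
            this.country = country;
            this.results = results;
        }
        @Override public void run() {
            // stand-in for "extract the top-100 lists for this country"
            results.add(country + ": lists extracted");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        Worker usa = new Worker("USA", results);
        Worker chi = new Worker("CHI", results);
        usa.start();
        chi.start();
        usa.join();   // wait for both countries before reading results
        chi.join();
        System.out.println(results.size() + " countries done");
    }
}
```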
3.7.6.2 Measuring the time
Opening webpage: OK Total time: 3149.649 sec. Task time: 9.821045 sec.
Checking if app already exists... OK Total time: 3160.453 sec. Task time: 10.803955 sec.
Finding images with relative url... OK Total time: 3160.863 sec. Task time: 0.41015625 sec.
Downloading webpage to data.html... OK Total time: 3161.082 sec. Task time: 0.21899414 sec.
Fetching product's description... OK Total time: 3162.048 sec. Task time: 0.96606445 sec.
Fetching product's versions... OK Total time: 3166.224 sec. Task time: 4.1760254 sec.
Fetching product's versions... OK Total time: 3166.738 sec. Task time: 0.513916 sec.
Fetching product's versions... OK Total time: 3166.739 sec. Task time: 9.765625E-4 sec.
Opening new browser tab to visit the next url... OK Total time: 3173.452 sec. Task time: 6.7128906 sec.
App time: 33.624023 sec.
Table 3.2: Measurement of the time
The measurement of the required time has been managed using a stopwatch class, writing the time spent on each step of the tool to a text file. Controlling the timings is very important in a web scraping tool, because the high volume of data requires long runs.
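A stopwatch of this kind can be little more than a wrapper around System.nanoTime(); the class below is a sketch, not the thesis's actual stopwatch class.

```java
public class Stopwatch {
    private final long start = System.nanoTime();
    private long lastLap = start;

    // Seconds since the stopwatch was created (the "Total time" column).
    double totalSeconds() {
        return (System.nanoTime() - start) / 1e9;
    }

    // Seconds since the previous lap() call (the "Task time" column).
    double lap() {
        long now = System.nanoTime();
        double elapsed = (now - lastLap) / 1e9;
        lastLap = now;
        return elapsed;
    }

    public static void main(String[] args) throws InterruptedException {
        Stopwatch sw = new Stopwatch();
        Thread.sleep(50); // stand-in for "opening webpage"
        System.out.printf("Opening webpage: OK Task time: %.3f sec.%n", sw.lap());
        System.out.printf("Total time: %.3f sec.%n", sw.totalSeconds());
    }
}
```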
3.7.7 The backup of the state
Providing a backup system was necessary: over such a long running period, it is
impossible to exclude a blackout, a server or client crash, or a connection problem. To
handle this, the tool saves its state before processing every app, so that no data or time
is lost. If the program is stopped, when the user starts it again it restarts from the last
app analysed (if the previous run did not finish processing the apps).
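A minimal sketch of how such a state backup could look (the `ScraperState` class and its file format are assumptions, not the original code): the index of the app being processed is written to a small file before every app, and read back at startup to resume.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ScraperState {
    private final Path stateFile;

    public ScraperState(String fileName) {
        this.stateFile = Paths.get(fileName);
    }

    // Called before processing app number `index`, overwriting the file.
    public void save(int index) throws IOException {
        Files.write(stateFile, String.valueOf(index).getBytes());
    }

    // Returns the index to resume from, or 0 if no previous run exists.
    public int resumeIndex() throws IOException {
        if (!Files.exists(stateFile)) return 0;
        return Integer.parseInt(new String(Files.readAllBytes(stateFile)).trim());
    }
}
```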
3.8 Presentation of the information and data mining
The last step of the tool is to create a file that collects all the extracted data and presents
it in an understandable, readable shape, in order to raise the interest of developers,
workers in the app advertising field, managers of work groups related to the mobile app
market and other professionals in the same field. This new file is structured like the figures
in the conclusion section, and highlights how often the apps appear in the charts and how
their positions evolve during the evaluated period. From this file we can elaborate the data
and, through graphs and diagrams, show important information that is very useful for
the stakeholders.
CHAPTER 4
CONCLUSION
4.1 Results
The results of this work are three types of CSV les where you can nd all the extracted
data in a semi-structured shape. The choice of the CSV le has been made because it is
easy to use and easy to manage. It is supported by almost all spreadsheets, database man-
agement systems and many programming languages have available libraries that support
CSV les.
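As an illustration of how simple the format is to produce even without a library, a row of such a CSV file can be written with basic quoting (the `CsvRow` helper below is a sketch for illustration, not part of the tool):

```java
public class CsvRow {
    // Quote a field so commas and quotes inside it do not break the row:
    // double quotes are escaped by doubling them, per the usual CSV rules.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the fields of one app (e.g. URL, description, versions)
    // into a single CSV line.
    static String row(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(quote(fields[i]));
        }
        return sb.toString();
    }
}
```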
The first type shows the chart on a specific date, with the list of URLs of the dedicated
pages on Appannie. This type of file is only a means to the following steps. The second
type is different: here each row is dedicated to one app, and the data of each app are shown
in different fields of the same row. So, in one place there is all the extracted data
of every app present in the top 100 charts of the evaluated period: URL, description and
version history, with one version in each field of the row. This file can be considered a
semi-structured data file, and the contained data are organised in a ready-to-analyse shape.
The third type of file is more structured than the other ones, but it is still considered semi-
structured because there is no relational database containing the data. However, this
shape highlights a lot of important information that was hidden inside the mass of
data. We can see a new chart where the apps are ordered by the number of appearances
in the biweekly charts, together with the positions taken and the related dates. The
average position during the evaluated period is also shown.
URL DESCRIPTION VERSIONS
https://www.appannie.com/apps/google-play/app/com.mojang.minecraftpe/details/ Description Our latest free . . . Varies with device Jan 17, 2012 Current release
https://www.appannie.com/apps/google-play/app/com.sikebox.retrorika.material.icons/details/ Description Welcome to your . . . 3.9 May 27, 2016 Current release 3.8 May 21, 2016
https://www.appannie.com/apps/google-play/app/com.ninjakiwi.bloonstd5/details/ Description Five-star tower defense . . . 3.2 May 17, 2016 Current release 3.1 Mar 10, 2016
https://www.appannie.com/apps/google-play/app/com.robtopx.geometryjump/details/ Description Jump and y your way . . . 2.011 Sep 29, 2015 Current release 2.01 Sep 28, 2015
Table 4.1: Extract from the second file
The third type of file is the first step of a proper data mining approach. From it we
can elaborate a lot of data, extract information easily and figure out distributions and
statistics about the apps.
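As a first example of such statistics, the summary values in Table 4.2 can be recomputed directly from the list of an app's chart positions. A minimal sketch (the `ChartStats` class is an illustration, not part of the tool), using Messenger's 28 positions from the table:

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;

public class ChartStats {
    // Summarises an app's chart positions: count = weeks present,
    // average / max / min = average, worst and best position.
    public static IntSummaryStatistics stats(int[] positions) {
        return Arrays.stream(positions).summaryStatistics();
    }

    public static void main(String[] args) {
        // Messenger's 28 biweekly positions, as listed in Table 4.2.
        int[] pos = {1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                     3, 2, 1, 1, 1, 1, 1, 1};
        IntSummaryStatistics s = stats(pos);
        System.out.println("Weeks: " + s.getCount());      // 28
        System.out.println("Average: " + s.getAverage());  // about 1.1429, as in Table 4.2
        System.out.println("Worst: " + s.getMax());        // 3
        System.out.println("Best: " + s.getMin());         // 1
    }
}
```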
App's name: Messenger    Weeks: 28    Pos average: 1.1428571429    Worst pos: 3    Best pos: 1
Type and dates of chart    Position
2016_05_11_USA_Free 1
2016_04_27_USA_Free 1
2016_04_13_USA_Free 2
2016_03_30_USA_Free 1
2016_03_16_USA_Free 1
2016_03_02_USA_Free 1
2016_02_17_USA_Free 1
2016_02_03_USA_Free 1
2016_01_20_USA_Free 1
2016_01_06_USA_Free 1
2015_12_23_USA_Free 1
2015_12_09_USA_Free 1
2015_11_25_USA_Free 1
2015_11_11_USA_Free 1
2015_10_28_USA_Free 1
2015_10_14_USA_Free 1
2015_09_30_USA_Free 1
2015_09_16_USA_Free 1
2015_09_02_USA_Free 1
2015_08_19_USA_Free 1
2015_08_05_USA_Free 3
2015_07_22_USA_Free 2
2015_07_08_USA_Free 1
2015_06_24_USA_Free 1
2015_06_10_USA_Free 1
2015_05_27_USA_Free 1
2015_05_13_USA_Free 1
2015_04_29_USA_Free 1
Table 4.2: The third file
4.2 Future Work
The HTML code files of every analysed app have been stored in order to allow future
elaborations and research, even if the website should become inaccessible.
The next step is to examine in depth the data in the files, and especially in the third
file, in order to improve their shape and to find other connections in the available data
set.
Extracting more data from the site is possible and, using the developed tool as a starting
point, it is also possible to improve it by adding more features and by changing which
data is extracted. This is feasible thanks to an accurate modular design, with various
classes that handle parts of the code independently from each other.
The tool does not have a graphical interface, because it was not a priority for this project,
but a GUI could be an essential component to make the tool more usable and to enable
non-IT professionals to use it.
CHAPTER 5
ACKNOWLEDGEMENT
I wish to express my sincere thanks to professor Riccardo Scateni (University of Cagliari)
for his guidance and his precious advice, to professor Gianni Fenu for his great avail-
ability, and to professor Xiaosong Yang and professor Zhidong Xiao for their continuous
support and teaching during the whole training period at Bournemouth University.
I am also thankful to Nicola, Edoardo and Simone for this traineeship experience, to
the whole Chwazi team for its support over these last three years, and to my friends,
especially Valentina, Riccardo, Mario, Mattia and Luigi, for their encouragement, constant
help and important advice.
I take this opportunity to express gratitude to all of the Department's faculty members
for their help and support. I also greatly thank my family for their support and attention.