Graduation Project Report
Faculty of Computers and Information
Information Systems Dept.
Cairo University
2014/2015
Datapedia
Project Team:
Abanoub A. Amin. (20110002)
Ahmed A. Ahmed. (20110070)
Ahmed M. Osman (Team Leader). (20110086)
Ahmed N. Muhammad. (20110099)
Ahmed R. Negm. (20110053)
Under the supervision of:
Dr. Hoda M. O. Mokhtar.
Eng. Mohamed Hafez.
Table of Contents
Chapter 1: System Proposal ……………………………………………………….……………. 4
1.1 Introduction …………………………………………………………………………………… 4
1.2 Problem Statement ………………………………………………………………………….. 5
1.3 Project Justification ………………………………………………………………………….. 5
1.4 Project Scope ………………………………………………………………………………… 5
1.5 Technologies & Tools …………….………………………………………………………….. 6
1.6 Limitations & Exclusions ………………………………………………………..…………... 23
Chapter 2: System Analysis ……………………………………………..…………………….. 25
2.1 Project Stakeholders ……………………………………………………….…….………… 25
2.2 Non-Functional Requirements ………………………………………….……….……….. 25
2.3 Functional Requirements ………………………………………………….…….………... 26
2.4 User Profiles ………………………………………………………………….…….…….…… 27
2.5 Task Profiles ………………………………………………………………….…………..…… 27
2.6 Environmental Profiles …………………………………………………….……………..… 28
2.7 Sample Personas …………………………………………………………………….……... 28
2.8 Competitive Analysis ………………………………………………………………….…… 30
Chapter 3: System Design ……………………………………………………….…………..… 31
3.1 System Development Methodology ……………………………………………………. 31
3.2 Main Objectives ……………………………………………………………….……………. 31
3.3 Secondary objectives ………………………………………………………….………….. 32
3.4 Use Case Diagram …………………………………………………………………………. 33
3.5 Backend Design ………………………….…………………………………………………. 34
3.6 Block Diagram ………………………………………………………………….…………… 39
3.7 Sample Mock-ups ……………………………………………………………..……………. 40
Chapter 4: System Implementation & Management …………....…………..………..….. 42
4.1 Phase 1: Data Collection and Preprocessing …………………………………………. 42
4.2 Phase 2: Sentiment Analysis ………………………………………………………………. 43
4.3 Phase 3: Data Analytics …………………………………………………………………… 44
4.4 Phase 4: Data presentation ………………………………………………………………. 50
4.5 Algorithm Flowchart ………………………………………………………………….…….. 51
Chapter 5: System Evaluation …………………………………………………..…………….. 52
5.1 Data Evaluation ……………………………………………………………..……………… 52
5.2 Algorithm Evaluation…………………………………………………………………..….... 54
5.3 System Evaluation ……………………………………………………………..………….... 55
References ……………………………………….………………………………………….....… 58
Chapter 1: System Proposal.
1.1 Introduction.
Nowadays, our personal lives are highly dependent on the technology that
people have developed. We are currently witnessing an era where new
competitive products are produced almost every day. A key impact of this
rapid technological advancement is that it has affected several aspects of
our lifestyle, including the way we purchase products, the way we
communicate, the way we travel and, not least, the way we learn.
Despite the importance of these advancements, today, buying a certain
product is no longer a trivial task that can be completed in minutes. Almost
everything we use has been refurbished to better standards. New features,
new specifications, price variations, new applications, and many other
dimensions affect our purchasing decision, making it more sophisticated and
requiring better decision-making approaches. Although this impact is clear
on the personal level, it is also obvious in larger organizations, where
purchasing machines, computers, and equipment is also a crucial decision
that costs money. Thus, traditional purchasing approaches, where a customer
selects by shape, size, or just price, are no longer efficient.
Today, we all compare between products, ask for recommendations, and
navigate through the web for product reviews, aiming to find clues that help
us in our decision-making. However, going through different sites to collect
such comparative data, or chatting on social networks to find
recommendations, is not an appealing task for many people.
Inspired by the importance of this decision-making problem, and the role
social networks and online reviews play in it, we aim to integrate different
product-related data sources to provide the customer with an integrated
view of the product he or she wants to purchase.
1.2 Problem Statement.
Buying a product whether it is a computer, a mobile phone, a tablet, or
even a car is no longer a simple task. With the increase in prices of new
products and the wide range of competing products that offer similar
functionalities at a lower price, choosing between alternatives turns out to
be difficult.
1.3 Project Justification.
It is now common that, before buying any product, people at least ask their
friends or relatives for recommendations; others search the internet for
customers' reviews, and others tweet or chat over social networks to gather
information. Going through all those paths is not an appealing approach for
many people, especially those with minimal computer education. Thus, it
would be more appealing if customers could enter a single website, or run a
mobile application, that simply does the work and presents the final
comparison in a user-friendly, analytical approach.
1.4 Project Scope.
In this project, we aim to design a web application that analyzes big volumes
of product reviews, social network posts, and tweets related to the product,
and then presents the results of this big data analytics job in a user-friendly,
understandable, and easily interpreted manner that can be readily used by
different customers for different purposes.
1.5 Technologies & Tools.
1- R Language 3.1.2 & RStudio 0.98.1091.
R is an open source software package for performing statistical analysis on
data. It is a programming language used by data scientists, statisticians, and
others who need to perform statistical analysis of data and glean key insights
from it using mechanisms such as regression, clustering, classification, and
text analysis. R is released under the GNU General Public License. It was
developed by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently maintained by the R Development
Core Team. It can be considered a different implementation of S, developed
by John Chambers at Bell Labs. There are some important differences, but a
lot of the code written in S runs unaltered under the R interpreter.
R provides a wide variety of statistical, machine learning (linear and
nonlinear modeling, classic statistical tests, time-series analysis,
classification, clustering) and graphical techniques, and is highly
extensible. R has various built-in as well as extended functions for
statistical, machine learning, and visualization tasks such as:
• Data extraction
• Data cleaning
• Data loading
• Data transformation
• Statistical analysis
• Predictive modeling
• Data visualization
It is one of the most popular open source statistical analysis packages
available on the market today. It is cross platform, has a very wide
community support, and a large and ever-growing user community who
are adding new packages every day.
With its growing list of packages, R can now connect with other data
stores, such as MySQL, SQLite, MongoDB, and Hadoop for data storage
activities.
Let's see different useful features of R:
• Effective programming language
• Relational database support
• Data analytics
• Data visualization
• Extension through the vast library of R packages
According to a KDnuggets poll, R is the most popular language for data
analysis and mining. The total number of R packages released by R users also
grew steadily from 2005 to 2013: the growth was exponential in 2012, and
2013 was on track to beat that. R allows performing data analytics through
various statistical and machine learning operations such as the following
(two one-line illustrations are given after the list):
• Regression
• Classification
• Clustering
• Recommendations
• Text mining
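For illustration, here are two one-liners showing the kind of built-in modeling R offers out of the box (run on R's bundled example datasets, not on our project data):
fit      <- lm(mpg ~ wt + hp, data = mtcars)   # regression on the built-in mtcars dataset
clusters <- kmeans(iris[, 1:4], centers = 3)   # clustering on the built-in iris measurements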
Using R enhanced our project, giving us the ability to run accurate analytics
on tweets to figure out each user's opinion.
Installation:
First we install R for Windows, which is required before installing RStudio.
Then we can download and install RStudio.
RStudio interface:
Twitter Authentication with R:
After installing our packages and libraries we need to create a new
Twitter application to be used in the data collection phase where we use
this application to interact with Twitter search API.
This app is needed for the authentication process with Twitter. Creating
the Twitter application and doing the handshake is a must as you have to
do it every time you want to get data from Twitter with R.
First we go to https://dev.twitter.com/ and log in with a Twitter Account.
After creating the application, we can get our api_key and our
api_secret as well as our access_token and access_token_secret from our
app settings on Twitter.
And that’s it. Now we can search Twitter anytime we want and get the
data we need.
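The handshake can be scripted with the twitteR package; a minimal sketch is shown below (the key values are placeholders for the ones taken from the app settings page):
library(twitteR)
api_key             <- "YOUR_API_KEY"
api_secret          <- "YOUR_API_SECRET"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"
# Perform the OAuth handshake once per session before calling the Search API.
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)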
2- Apache Hadoop & Hadoop Streaming v1.2.1.
As mentioned earlier, we are about to use the R language to perform
some analytical tasks on massive amount of data.
Big Data has to deal with large and complex datasets that can be
structured, semi-structured, or unstructured and will typically not fit into
memory to be processed. They have to be processed in place, which
means that computation has to be done where the data resides for
processing. When we talk to developers, the people actually building Big
Data systems and applications, we get a better idea of what they mean by
the 3Vs: they typically mention the 3Vs model of Big Data, which are velocity,
volume, and variety.
Velocity refers to the low latency, real-time speed at which the analytics
need to be applied. A typical example of this would be to perform
analytics on a continuous stream of data originating from a social
networking site or aggregation of disparate sources of data.
Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB
based on the type of the application that generates or receives the data.
Variety refers to the various types of the data that can exist, for example,
text, audio, video, and photos.
Big Data usually includes datasets with sizes beyond the ability of
conventional systems to process within the time frame mandated by the
business. Big Data volumes are a constantly moving target: as of 2012, they
ranged from a few dozen terabytes to many petabytes of data in a single
dataset. Faced with this seemingly insurmountable challenge, entirely new
platforms have emerged, called Big Data platforms.
Some of the popular organizations that hold Big Data are as follows:
• Facebook: It has 40 PB of data and captures 100 TB/day.
• Yahoo!: It has 60 PB of data.
• Twitter: It captures 8 TB/day.
• EBay: It has 40 PB of data and captures 50 TB/day.
How much data is considered Big Data differs from company to company.
Though it is true that one company's Big Data is another's small data, there is
something in common: the data does not fit in memory or on a single disk, it
arrives rapidly and needs to be processed, and it would benefit from
distributed software stacks. For some companies, 10 TB of data would be
considered Big Data, and for others 1 PB would be Big Data.
So only you can determine whether your data is really Big Data; it is sufficient
to say that it would start in the low terabyte range.
We will use Apache Hadoop as the tool to handle this big amount of data
for the sake of our project.
Apache Hadoop is an open source Java framework for processing and
querying vast amounts of data on large clusters of commodity hardware.
Hadoop is a top level Apache project, initiated and led by Yahoo! and
Doug Cutting. It relies on an active community of contributors from all
over the world for its success.
With a significant technology investment by Yahoo!, Apache Hadoop has
become an enterprise-ready cloud computing technology. It is
becoming the industry de facto framework for Big Data processing.
Hadoop changes the economics and the dynamics of large-scale
computing. Its impact can be boiled down to four salient characteristics.
Hadoop enables scalable, cost-effective, flexible, fault-tolerant solutions.
Apache Hadoop has two main features:
• HDFS (Hadoop Distributed File System)
• MapReduce
HDFS is Hadoop's own rack-aware file system, which is a UNIX-based data
storage layer of Hadoop. HDFS is derived from concepts of Google file
system. An important characteristic of Hadoop is the partitioning of data
and computation across many (thousands of) hosts, and the execution of
application computations in parallel, close to their data. On HDFS, data
files are replicated as sequences of blocks in the cluster.
A Hadoop cluster scales computation capacity, storage capacity, and
I/O bandwidth by simply adding commodity servers. HDFS can be
accessed from applications in many different ways. Natively, HDFS
provides a Java API for applications to use.
The Hadoop clusters at Yahoo! span 40,000 servers and store 40
petabytes of application data, with the largest Hadoop cluster being
4,000 servers. Also, one hundred other organizations worldwide are known
to use Hadoop.
Characteristics of HDFS:
• Fault tolerant
• Runs with commodity hardware
• Able to handle large datasets
• Master slave paradigm
• Write-once, read-many file access
MapReduce is a programming model for processing large datasets
distributed on a large cluster. MapReduce is the heart of Hadoop. Its
programming paradigm allows performing massive data processing
across thousands of servers configured with Hadoop clusters. This is
derived from Google MapReduce.
Hadoop MapReduce is a software framework for writing applications
easily, which process large amounts of data (multi-terabyte datasets) in
parallel on large clusters (thousands of nodes) of commodity hardware in
a reliable, fault-tolerant manner.
This MapReduce paradigm is divided into two phases, Map and Reduce, that
mainly deal with key/value pairs of data. The Map and Reduce tasks run
sequentially in a cluster; the output of the Map phase becomes the input of
the Reduce phase. These phases are explained as follows:
• Map phase: Once divided, datasets are assigned to the task tracker to
perform the Map phase. The data functional operation will be performed
over the data, emitting the mapped key and value pairs as the output of
the map phase.
• Reduce phase: The master node then collects the answers to all the
sub-problems and combines them in some way to form the output; the
answer to the problem it was originally trying to solve.
The five common steps of parallel computing are as follows:
1. Preparing the Map() input: This takes the input data row-wise and emits a
key/value pair per row, or we can explicitly change this as per the requirement.
° Map input: list (k1, v1)
2. Run the user-provided Map() code.
° Map output: list (k2, v2)
3. Shuffle the Map output to the Reduce processors: shuffle the similar keys
(grouping them) and input them to the same reducer.
4. Run the user-provided Reduce() code: This phase runs the custom reducer
code designed by the developer on the shuffled data and emits key and value.
° Reduce input: (k2, list (v2))
° Reduce output: (k3, v3)
5. Produce the final output: Finally, the master node collects all reducer
output, combines it, and writes it to a text file. (A toy, in-memory illustration
of these steps follows.)
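The same five steps can be mimicked in plain R on a toy word-count example. This is only an in-memory illustration of the key/value flow, not Hadoop itself:
rows <- c("hadoop splits data", "hadoop processes data")
# 1-2. Map: emit a (word, 1) pair for every word in every row.
pairs <- do.call(c, lapply(rows, function(v) {
  words <- strsplit(v, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
# 3. Shuffle: group the emitted values by key (the word).
groups <- split(unname(pairs), names(pairs))
# 4. Reduce: sum the values of each key.
counts <- sapply(groups, sum)
# 5. Final output.
print(counts)   # data = 2, hadoop = 2, processes = 1, splits = 1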
In our system, we used Hadoop 1.2.1 as it is a stable version of Hadoop, and it
can be downloaded from the official Apache Hadoop website.
Hadoop 1.2.1 needs OpenJDK 7, and JAVA_HOME must be added to the
Ubuntu environment; for that purpose you need to edit /etc/environment.
A dedicated user must be added in order to launch Hadoop, and it is also
recommended to create a hadoop group, because during the installation
and configuration of Hadoop we need separation and privacy from other
users.
Hadoop requires SSH access to manage its nodes, i.e. the remote machines
plus your local machine if you want to use Hadoop on it, so an SSH server
must be configured. Configuring it amounts to generating a key pair for the
dedicated Hadoop user and adding the public key to that user's authorized
keys, so that Hadoop can log in to each node without a password.
3- RHadoop.
RHadoop is a collection of five R packages that allow users to manage and
analyze data with Hadoop. The packages are regularly tested (and always
before a release) on recent releases of the Cloudera and Hortonworks
Hadoop distributions and should have broad compatibility with open source
Hadoop and MapR's distribution. They are normally tested on recent
Revolution R and CentOS releases, but all the RHadoop packages are
expected to work on a recent release of open source R and Linux.
RHadoop consists of the following packages:
ravro: read and write files in avro format.
plyrmr: higher level plyr-like data processing for structured data, powered
by rmr.
rmr: functions providing Hadoop MapReduce functionality in R.
rhdfs: functions providing file management of the HDFS from within R.
rhbase: functions providing database management for the HBase
distributed database from within R.
4- PHP.
PHP (recursive acronym for PHP: Hypertext Preprocessor) is a widely-used
open source general-purpose scripting language that is especially suited
for web development and can be embedded into HTML.
What distinguishes PHP from something like client-side JavaScript is that the
code is executed on the server, generating HTML which is then sent to the
client. This completely suits our target, as the server has to have a specific
environment (R, Hadoop, and so on) that will not be available on every
regular user's PC. The client receives the results of running that script, but
does not know what the underlying code was.
PHP is mainly focused on server-side scripting, so it can do anything any
other CGI program can do, such as collect form data, generate dynamic
page content, or send and receive cookies and much more.
There are three main areas where PHP scripts are used.
• Server-side scripting. This is the most traditional and main target field for
PHP. You need three things to make this work. The PHP parser (CGI or
server module), a web server and a web browser.
• Command line scripting. You can make a PHP script run without any server
or browser; you only need the PHP parser to use it this way. This type of
usage is ideal for scripts regularly executed using cron (on *nix or Linux) or
Task Scheduler (on Windows). These scripts can also be used for simple text
processing tasks. And that is what we are looking for: we will actually use
PHP to execute the R script that does the sentiment analysis.
• Writing desktop applications. PHP is probably not the very best
language to create a desktop application with a graphical user interface,
but if you know PHP very well, and would like to use some advanced PHP
features in your client-side applications you can also use PHP-GTK to write
such programs. You also have the ability to write cross-platform
applications this way.
5- Google Charts API.
The Google Chart API is a tool that lets people easily create a chart from
some data and embed it in a web page. Google creates a PNG image
of a chart from data and formatting parameters in an HTTP request. Many
types of charts are supported, and by making the request into an image
tag, people can simply include the chart in a web page.
Charts are exposed as JavaScript classes, and Google Charts provides many
chart types for you to use. The default appearance will usually be all we
need, and we can always customize a chart to fit the look and feel of the
website.
Charts are highly interactive and expose events that let us connect them
to create complex dashboards or other experiences integrated with our
webpage. Charts are rendered using HTML5/SVG technology to provide
cross-browser compatibility (including VML for older IE versions) and cross
platform portability to iPhones, iPads and Android. Our users will never
have to mess with plugins or any software. If they have a web browser,
they can see the charts.
All chart types are populated with data using the DataTable class,
making it easy to switch between chart types as you experiment to find
the ideal appearance. The DataTable provides methods for sorting,
modifying, and filtering data, and can be populated directly from your
web page, a database, or any data provider supporting the Chart Tools
Datasource protocol.
The Google Charts API has many useful characteristics that go perfectly with
our project goal of having visual aids for the resulting analytics, such as:
• Free: we use the same chart tools Google uses, completely free and with
three years' backward compatibility guaranteed.
• Customizable: it allows us to configure an extensive set of options to
perfectly match the look and feel of our website.
• Controls and Dashboards: it is easily connected to an interactive dashboard.
6- Google Maps.
As said before, Google Maps will be used within Datapedia to provide
graphical analytics aids.
On today's Web, mapping solutions are a natural ingredient. We use them to
show the locations of the tweets and visualize their content as colored pins,
where the color indicates whether a tweet is positive, negative, or neutral.
Many tweets have a location, and if a location exists, it can be displayed
on a map.
There are several mapping solutions including Yahoo! Maps and Bing
Maps, but the most popular one is Google Maps. In fact, according to
Programmableweb.com, it’s the most popular API on the
Internet.
The Google Maps API lets you harness the power of Google Maps to use
in Datapedia to display the located tweets in an efficient and usable
manner.
First of all, we need to know the term "coordinates". Coordinates are used to
express locations in the world. There are several different coordinate systems;
the one used in Google Maps is the World Geodetic System 84 (WGS 84),
which is the same system the Global Positioning System (GPS) uses. The
coordinates are expressed using latitude and longitude; you can think of
these as the y and x values in a grid.
To create a Google map, we must know that it resides in a web page. So,
the first thing you need to do is to set up that page. This includes creating
an HTML page and a style sheet for it. Once you have everything set up,
you can insert the actual map.
Now that you have a web page set up, you’re ready to load the Google
Maps API. The API is a JavaScript file that is hosted on Google’s servers. It’s
loaded with a <script> element in the <head> section of the page. The
<script> element can be used to add remote scripts into a web page,
and that’s exactly what you want to do here.
The most common use of maps on the Internet is to visualize the geographic
position of something. The Google Maps marker is the perfect tool for doing
this. A marker is basically a small image that is positioned at a specific place
on a map. In Datapedia, it indicates that a tweet exists at that location and,
when clicked, it shows the content of the tweet.
When marking places on a map, we want to show additional information
related to that place, like the text content of the located tweet. The Google
Maps API offers a perfect tool for this: the InfoWindow. It looks like a speech
bubble and typically appears over a marker when you click it; in Datapedia,
clicking a marker opens the tweet's text inside an InfoWindow.
7- Opinion Lexicon.
The opinion lexicon is used in the sentiment analysis process: we search for a
match between the words of the tweet and the lexicon and, if a match is
found, give it the corresponding points from the lexicon.
It is like a big dictionary that contains English words and classifies them as
negative or positive based on a score or rating, where more than zero is
positive and less than zero is negative.
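For illustration only (the words and scores below are made up, not actual lexicon entries), the lexicon can be thought of as a table of words with signed scores, and scoring a text amounts to summing the scores of the words that match:
lexicon <- data.frame(word  = c("excellent", "good", "poor", "disappointment"),
                      score = c(3, 2, -2, -3))
# Sum the scores of the words of a text that appear in the lexicon.
sum(lexicon$score[lexicon$word %in% c("excellent", "camera", "good")])   # 3 + 2 = 5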
8- Bootstrap v3.3.5.
Bootstrap is an open source CSS and JavaScript framework that was
originally developed at Twitter by its team of designers and developers.
It is a combination of HTML, CSS, and JavaScript code designed to help
build user interface components. Bootstrap was also programmed to
support both HTML5 and CSS3, and it is often called a front-end framework.
Bootstrap is a free collection of tools for creating websites and web
applications. It contains HTML- and CSS-based design templates for
typography, forms, buttons, navigation, and other interface components, as
well as optional JavaScript extensions.
Some reasons we preferred the Bootstrap framework:
•Easy to get started.
•Great grid system.
•Base styling for most HTML elements (Typography, Code, Tables, Forms,
Buttons, Images, and Icons).
•Extensive list of components.
•Bundled JavaScript plugins.
1.6 Limitations & Exclusions
Language support:
Our system supports and analyzes only tweets written in English, as the
opinion lexicon contains English words only.
Twitter Search API:
Access to the Search API is rate-limited based on the IP address the request is
coming from, and the rate limiting is strict:
- Downloaded tweets are at most one week old, as you must be a Twitter
search partner to be able to request older tweets from the Search API.
twitteR package in R:
- Based on the limits of the searchTwitter() function, we can only download
1,500 vectors representing 1,500 tweets per request.
Chapter 2: System Analysis
2.1 Project Stakeholders
Beneficiaries:
- Users: Regular users can benefit from Datapedia by using it to know
people’s opinions of a certain product or a service to help them decide
what to buy.
- Marketing Agencies: Digital marketers and social media analysts can use
Datapedia to track the volume of activity around campaigns in real time,
including the number of users, find influencers talking about them, and track
negative feedback.
- Decision Makers: Datapedia helps decision makers make more efficient
decisions based on user feedback.
Owners:
The project team and supervisors who will deliver the project. Without them
the project would not happen.
Suppliers:
Twitter is a major supplier as we download the tweets from its API.
Project Sponsor:
Our project was accepted into the ITAC graduation project funding program
and is sponsored by ITIDA.
2.2 Non-Functional Requirements
Usability: Our system is built in a manner that ensures it will be easy and
simple to use, and the majority of users will find it user-friendly and easy to
deal with.
Portability: Our system is built to work on any operating system and on
any web browser.
Accuracy:
- Data accuracy is 100% guaranteed, as the system deals with the Twitter
Search API directly.
- Sentiment scoring accuracy is our main concern in this project, as we spent
a lot of time handling special cases (negation, sarcasm, etc.) to boost our
algorithm's accuracy.
Extensibility:
Our goal is to add more features to our system in the future like time
framing and predicting future trends.
2.3 Functional Requirements
1- Easily understandable analytics with dynamically-changed graphical
aids.
2- High sentiment analysis accuracy, accompanied by calculations and
samples.
3- User Recommendations based on true statistics about related products.
4- Fully interactive dashboard graphs and charts to help users make
better decisions.
2.4 User Profiles
Characteristics across the four user groups (Regular Users / Marketers / Project Managers & Managers / Administrator):
- Age: 10-60 / 23-35 / 35-50 / 25-35
- Gender: 50% males, 50% females / 50% males / 80% males / 100% males
- Education: Basic education / Marketing degrees / MBAs, PhDs / CS degree
- Language: English, Arabic / English, Arabic / English, Arabic / English
- Computer experience: Low / Mid / Mid / High
- Domain experience (social analytics): Low / High / Low-Mid / High
- Expectations: Ease of access / Speed of task / Ease of access and speed of task / Comprehensive functionality
2.5 Task Profiles
Tasks and the user groups that can perform them (X marks, across Regular Users / Marketers / Project Managers & Managers / Administrator):
1. Sign up: X X X
2. Submit new tracking request: X X
3. View tracking reports: X X X
4. Export tracking reports: X X
5. View tracking history: X X X
6. View user data: X
2.6 Environmental Profiles
Characteristics across the four user groups (Regular Users / Marketers / Project Managers & Managers / Administrator):
- Location: Indoor or mobile / Indoor or mobile / Indoor or mobile / Indoor
- Workspace: Office / Office / Cubicle
- Software: Any browser / Any browser / Any browser / Any browser
- Lighting: Bright / Good / Average
- Noise: Quiet / Quiet / Quiet / Quiet
- Hardware: PC / PC / PC / PC
- Internet connection: Normal connection / Normal connection / Fast connection / Fast connection
2.7 Sample Personas
Persona #1
Name: Mr. Ahmed Roshdy.
Age: 20.
Position: Student.
Education: Pursuing a degree in Computer Science.
Things he wants to know:
• People's opinions on the product he wants to buy, to decide whether he will buy it or not.
Things he wants to do:
• Know which product he wants to buy.
Persona #2
Name: Mr. John Doe.
Age: 45.
Position: Managing Director.
Education: Masters of Business Administration.
Company: XYZ for Real estate development.
Things he wants to know:
• How many units should he sell in his new project?
• What is the feedback of his current clients?
• Where to build new projects and housing complexes?
Things he wants to do:
• Attract more clients to his business.
• Make more profit.
• Build strong relationships with his clients.
Persona #3
Name: Ms. Karin Kudrow.
Age: 25.
Position: Social Media Strategist.
Education: Bachelor degree in Marketing.
Company: XYZ for Business Solutions.
Things she wants to know:
• How many tweets were written with a certain hashtag?
• Who are the best contributors on the hashtag?
• How do people interact with the hashtag, and are their tweets negative or positive?
Things she wants to do:
• Launch a new marketing campaign on Twitter.
• Share a new hashtag with the company's followers so they can share their thoughts on it.
2.8 Competitive Analysis
Subscription:
- Keyhole: 30-day trial, then $599 per month for a full subscription.
- Hashtagify.me: 14-day trial, then $299 per month for a full subscription.
- Hashtracking: 30-day trial, then $399 per month for a full subscription.
- Hashtags: free trial, with an upgradable account at $349 per month for a full subscription.
- Talkwalker: 7-day trial, then $1400 per month for a full subscription.
- Datapedia: full free subscription.
Tracking method:
- Keyhole: real time + historical data ($50 per report).
- Hashtagify.me: real time + historical data.
- Hashtracking: real time + historical data (up to 30 days).
- Hashtags: 24-hour trend graph based on a 1% data sample.
- Talkwalker: real time + historical data.
- Datapedia: real time + historical data.
Sentiment analysis:
- Keyhole: no. Hashtagify.me: yes. Hashtracking: no. Hashtags: no. Talkwalker: yes. Datapedia: yes.
Social media coverage:
- Keyhole: Twitter, Facebook, and Instagram.
- Hashtagify.me: Twitter (Facebook and Instagram planned for the future).
- Hashtracking: Twitter.
- Hashtags: Twitter.
- Talkwalker: Twitter, Facebook, and Instagram.
- Datapedia: Twitter.
Technology:
- Keyhole, Hashtagify.me, Hashtracking, Hashtags, Talkwalker: web application.
- Datapedia: web application, with plans to release an Android application.
From the comparison above, it is clear that Datapedia is designed to support
all of the common features existing in the corresponding and similar
web-based applications. However, Datapedia has some edge over the
others, such as the free full subscription and the ability to be extended to a
mobile application. In addition, building Datapedia as a web-based
application gives it an advantage through features such as the dynamically
changing graphical analytics aids and the interactive dashboard graphs
and charts that support the user in making better decisions.
Chapter 3: System Design
3.1 System Development Methodology
We used agile software development because:
1- We were changing requirements frequently, even late in development.
2- We delivered working software frequently, from every couple of weeks to
every couple of months, with a preference for the shorter timescale.
3- Working software was the primary measure of progress.
4- Our method of conveying information to and within the development
team was face-to-face conversation.
5- There was continuous attention to technical excellence and good design,
which enhances agility.
6- To implement a new feature, the developers needed to lose only the work
of a few days, or even hours, to roll back and implement it.
7- Unlike the waterfall model, the agile model requires very limited planning
to get started with the project. Agile assumes that the end users' needs are
ever-changing in a dynamic business and IT world. Changes can be
discussed, and features can be added or removed based on feedback. This
effectively gives the customer the finished system they want or need.
3.2 Main Objectives
1- To help the user make better decisions concerning buying a product, in
an easy and user-friendly way, based on real reviews written by actual users
across different social media sites.
2- To help marketers make better marketing plans based on what people
like and dislike (user opinions and reviews), by taking feedback into account,
and by studying the trend of opinions along different dimensions (time,
location, source, ...).
3- To help managers make more accurate forecasts based on real data
collected from real users.
3.3 Secondary objectives
1- Implementing the “power of the user” strategy.
2- Giving recommendations based on real data.
3- Pointing out how the market responds to a new product, service, or
initiative.
3.4 Use Case Diagram
3.5 Backend Design
To understand how the process works when we talk about big data, we must
first know about Hadoop itself. As open source, Java-based software, it costs
us nothing to apply parallel processing techniques to our collected data and
run our algorithm on it. Hadoop mainly relies on dividing tasks into smaller
subtasks via a programming paradigm called MapReduce, in which an
algorithm is mapped over the different parts of the data and a reducer finally
combines the results, as shown in the figure:
The MapReduce structure mainly consists of two parts:
• JobTracker: This is the master node of the MapReduce system, which
manages the jobs and resources in the cluster (TaskTrackers). The JobTracker
tries to schedule each map as close to the actual data being processed on
the TaskTracker, which is running on the same DataNode as the underlying
block.
• TaskTracker: These are the slaves that are deployed on each machine.
They are responsible for running the map and reduce tasks as instructed by
the JobTracker. They work in a master-slave fashion, in which the JobTracker
assigns the different tasks to the TaskTrackers, as in the following figure:
As we can see, Hadoop is mainly based on a divide-and-conquer process,
and that forms its parallel processing core. The other core is the file system
that Hadoop relies on, HDFS (Hadoop Distributed File System), which is the
centerpiece of Hadoop's fault handling. Data in a Hadoop cluster is broken
down into smaller pieces (called blocks) and distributed throughout the
cluster. In this way, the map and reduce functions can be executed on
smaller subsets of your larger datasets, and this provides the scalability that is
needed for big data processing.
HDFS is based on three nodes:
• NameNode: This is the master of the HDFS system. It maintains the
directories, files, and manages the blocks that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and
provide actual storage. They are responsible for serving read-and-write data
requests for the clients.
• Secondary NameNode: This is responsible for performing periodic
checkpoints. So, if the NameNode fails at any time, it can be replaced with a
snapshot image stored by the secondary NameNode checkpoints.
Sometimes the data resides on HDFS (in various formats). Since a lot of data
analysts are very productive in R, it is natural to use R to compute with the
data stored through Hadoop-related tools.
As mentioned earlier, the strength of R lies in its ability to analyze data using a
rich library of packages, but it falls short when it comes to working on very
large datasets. The strength of Hadoop, on the other hand, is to store and
process very large amounts of data, in the TB and even PB range. Such vast
datasets cannot be processed in memory, as the RAM of each machine
cannot hold such large datasets. The options are either to run the analysis on
limited chunks, also known as sampling, or to combine the analytical power
of R with the storage and processing power of Hadoop, which yields an ideal
solution. Such solutions can also be achieved in the cloud using platforms
such as Amazon EMR.
R will not load all the data (Big Data) into machine memory, so Hadoop can
be chosen to store the data. Not all algorithms work across Hadoop, and the
algorithms are, in general, not R algorithms. Analytics with R has several issues
related to large data: in order to analyze a dataset, R loads it into memory,
and if the dataset is large, it fails with exceptions such as "cannot allocate
vector of size x". Hence, in order to process large datasets, the processing
power of R can be vastly magnified by combining it with the power of a
Hadoop cluster. Hadoop is a very popular framework that provides such
parallel processing capabilities. So, we can run R algorithms or analysis
processing over Hadoop clusters to get the work done.
If we think about a combined RHadoop system, R will take care of data
analysis operations with the preliminary functions, such as data loading,
exploration, analysis, and visualization, and Hadoop will take care of parallel
data storage as well as computation power against distributed data. Prior to
the advent of affordable Big Data technologies, analysis used to be run on
limited datasets on a single machine. Advanced machine learning
algorithms are very effective when applied to large datasets, and this is
possible only with large clusters where data can be stored and processed
with distributed data storage systems.
3.6 Block Diagram
3.7 Sample Mock-ups
We have chosen Windows 10 as the product to be evaluated in our
prototype. We take the word or hashtag that the user entered and use it to
search for tweets that contain this specific word.
Then we classify these tweets as positive, negative, or neutral. This
classification is based on a word classification lexicon, in which each word
has a specific positive or negative score on a scale from -5 to +5.
The results of the data analytics phase are then presented on a dashboard
to help the user make an informed decision.
Figure 1: Home Page
Figure 2: Sentiment Scores
Figure 3: Map Dimension (Demographic)
Chapter 4: System Implementation and Management
The proposed project consists of 4 main phases as follows:
4.1 Phase 1: Data Collection and Preprocessing
In this phase, we collected relevant data from social networks as well as
product review sites. We first used different APIs to grab the data; then, we
transformed the data into R data frames so they could be easily parsed,
accessed, and queried. Our goal was to collect millions of product-related
records. (A sketch of this step is given after the attribute list below.)
The collected data includes key attributes like:
- "text": The text that will be analyzed later in the sentiment analysis phase.
- "favorited": A Boolean attribute to show if the tweet is favorited by any
Twitter user or not.
- "favoriteCount": The count of the how many favorites the tweet got.
- “created": The time attribute when the tweet was posted.
- "id": The tweet id.
- "statusSource": The source of the tweet where if it’s posted from the website
or from a mobile application.
- "screenName": The handle (user name) of the user who wrote the tweet.
- "retweetCount": The count of the how many retweets the tweet got.
- "isRetweet": A Boolean attribute to show if the tweet is a retweeted tweet
from another user.
- "retweeted": A Boolean attribute to show if the tweet is retweeted by any
Twitter user or not.
- "Longitude": Coordinates on map to get the location of the user who
posted the tweet.
- "Latitude": Coordinates on map to get the location of the user who posted
the tweet.
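A minimal sketch of this collection step with the twitteR package (the search term and the file name are examples, not our exact scripts): the returned status objects are converted into an R data frame exposing the attributes listed above.
tweets_raw <- searchTwitter("#Windows10", n = 1500, lang = "en")
tweets     <- twListToDF(tweets_raw)    # one row per tweet, columns as listed above
names(tweets)                           # text, favoriteCount, created, screenName, retweetCount, ...
write.csv(tweets, "tweets_today.csv", row.names = FALSE)   # saved regularly to build our own history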
We started this phase in October, and it was an ongoing phase, which means
we kept downloading tweets from October until May 2015.
One of the problems we faced was the limitation of the Twitter Search API:
we could only download tweets that were at most one week old, as you must
be a Twitter search partner to be able to request older tweets from the
Search API.
To overcome this, we downloaded tweets on a daily basis to build our own
historical data.
By the end of May 2015, we had successfully downloaded 300,000 tweets.
Another problem was that the data contained a lot of commercial tweets
and ads, like promotions.
To overcome this, we decided to collect pure datasets from different sources
that provide user reviews on specific products. We collected about 500 user
reviews manually from websites that provide real reviews from real users, like
GSMArena and Reevoo.
We then realized that it would take us a lot of time to gather a decent
number of reviews to be analyzed, so we built our own web crawler using
Python to automatically get the reviews from those sites.
4.2 Phase 2: Sentiment Analysis.
In this stage, the data collected in phase 1 is classified as positive, negative
and neutral. This classification is based on word classification lexicon. Natural
language processing techniques are used in this stage to parse the sentence,
understand it, and finally classify it into the predefined classes.
This step was crucial, as its output is what will eventually drive the customers'
decision about the product.
Our algorithm was enhanced more than once to get better accuracy
percentages.
Our first algorithm was very basic and relied on matching the words between
the tweet and the lexicon. This algorithm led to poor results.
We then identified the reasons behind these poor results: special cases that
need to be handled, such as negation and sarcasm.
Those cases were handled in the algorithm by searching for negation in the
text during the sentiment scoring process; if negation is found, the score is
automatically multiplied by -1 until the end of the statement is reached.
This way we achieved a much better accuracy score: using the pure dataset
collected in the data collection phase, we analyzed it manually to establish
its overall sentiment, and then analyzed it again using our algorithm and
compared the results.
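A minimal sketch of this lexicon-based scoring with simple negation handling (the word lists and negation cues below are illustrative; the production algorithm differs in detail):
score_sentence <- function(sentence, pos.words, neg.words,
                           negations = c("not", "no", "never")) {
  words  <- strsplit(tolower(gsub("[[:punct:]]", " ", sentence)), "\\s+")[[1]]
  score  <- 0
  negate <- FALSE
  for (w in words) {
    if (w %in% negations) { negate <- TRUE; next }   # flip polarity for the rest of the statement
    s <- (w %in% pos.words) - (w %in% neg.words)     # +1 positive, -1 negative, 0 otherwise
    if (negate) s <- -s
    score <- score + s
  }
  score
}
score_sentence("the phone is not a disappointment",
               pos.words = c("excellent", "great"),
               neg.words = c("disappointment", "poor"))   # returns +1 (positive)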
4.3 Phase 3: Data Analytics.
In this phase, R-language will be used to handle the enormous data volumes
and help in achieving the data analytics task. Classified data will be input to
this module and different techniques will be used (and/or compared) to
obtain the best analytical results.
For big data analytics (which is a whole new process), we cannot rely on the
usual way of executing the algorithm, as it would take too much time; it
takes even more time just to fetch the data into memory, since R must
operate on data already in memory, plus the time taken by the analytics
process itself, which is huge given what is really happening: cleaning the
tweets, dividing each single tweet into words, looking the words up in the
lexicon, and summing the scores.
This is where Hadoop's turn comes. We actually did not use plain Hadoop
jobs; we used something called Hadoop streaming, a utility that gives
developers the ability to write Hadoop tasks in languages other than Java,
such as C++, C, Python, and R. It perfectly suits what we are looking for:
combining the great analytical power of R with the stable, fault-tolerant
processing of Hadoop.
Hadoop streaming is a Hadoop utility for running the Hadoop MapReduce
job with executable scripts such as Mapper and Reducer. This is similar to the
pipe operation in Linux. With this, the text input file is printed on stream (stdin),
which is provided as an input to Mapper and the output (stdout) of Mapper
is provided as an input to Reducer; finally, Reducer writes the output to the
HDFS directory.
The main advantage of the Hadoop streaming utility is that it allows Java as
well as non-Java programmed MapReduce jobs to be executed over
Hadoop clusters. Also, it takes care of the progress of running MapReduce
jobs. The Hadoop streaming supports the Perl, Python, PHP, R, and C++
programming languages. To run an application written in other programming
languages, the developer just needs to translate the application logic into
the Mapper and Reducer sections with the key and value output elements.
The six main components of a typical Hadoop streaming command are
listed and explained as follows (a sketch of such an invocation is given after
the list):
• jar: This option is used to run a jar with coded classes that are designed for
serving the streaming functionality with Java as well as other programmed
Mappers and Reducers. It's called the Hadoop streaming jar.
• Input: This option is used for specifying the location of input dataset (stored
on HDFS) to Hadoop streaming MapReduce job.
• Output: This option is used for telling the HDFS output directory (where the
output of the MapReduce job will be written) to Hadoop streaming
MapReduce job.
• File: This option is used for copying the MapReduce resources such as
Mapper, Reducer, and Combiner to computer nodes (Tasktrackers) to make
it local.
• Mapper: This option is used for identification of the executable Mapper file.
• Reducer: This option is used for identification of the executable Reducer file.
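A hedged sketch of such an invocation, launched from R via system(); the jar location, HDFS paths, and script names are assumptions for illustration only:
cmd <- paste(
  "hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar",
  "-input /user/hduser/tweets/tweets.txt",       # input dataset on HDFS
  "-output /user/hduser/tweets_scored",          # HDFS output directory
  "-file mapper.R -mapper mapper.R",             # ship and run the R mapper
  "-file reducer.R -reducer reducer.R"           # ship and run the R reducer
)
system(cmd)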
RHadoop is a collection of three R packages for providing large data
operations with an R environment. It was developed by Revolution Analytics,
which is the leading commercial provider of software based on R. RHadoop
is available with three main R packages: rhdfs, rmr, and rhbase. Each of them
offers different Hadoop features.
• rhdfs is an R interface for providing the HDFS usability from the R console. As
Hadoop MapReduce programs write their output on HDFS, it is very easy to
access them by calling the rhdfs methods. The R programmer can easily
perform read and write operations on distributed data files. Basically, rhdfs
package calls the HDFS API in backend to operate data sources stored on
HDFS.
• rmr is an R interface for providing Hadoop MapReduce facility inside the R
environment. So, the R programmer needs to just divide their application
logic into the map and reduce phases and submit it with the rmr methods.
After that, rmr calls the Hadoop streaming MapReduce API with several job
parameters as input directory, output directory, mapper, reducer, and so on,
to perform the R MapReduce job over Hadoop cluster.
• rhbase is an R interface for operating the Hadoop HBase data source
stored at the distributed network via a Thrift server. The rhbase package is
designed with several methods for initialization and read/write and table
manipulation operations.
Installing the R packages: Several R packages need to be installed to help
connect R with Hadoop. The list of packages can be installed by executing
the following command in the R console:
install.packages(c("rJava", "RJSONIO", "itertools", "digest", "Rcpp",
                   "httr", "functional", "devtools", "plyr", "reshape2"))
• Setting environment variables: We can set these via the R console using the
following code:
## Setting HADOOP_CMD
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
## Setting up HADOOP_STREAMING
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar")
Or, we can set them from the command line as follows:
export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar
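With the packages installed and the environment variables set, a MapReduce job can be submitted from R through rmr (published as rmr2). The sketch below is only a word count over the tweets file, to show the shape of such a job; the HDFS path is an assumption:
library(rmr2)
job <- mapreduce(
  input        = "/user/hduser/tweets/tweets.txt",   # assumed HDFS path of the tweets file
  input.format = "text",
  map    = function(k, lines) {
    words <- unlist(strsplit(tolower(lines), "\\s+"))
    keyval(words, 1)                                  # emit (word, 1) pairs
  },
  reduce = function(word, counts) keyval(word, sum(counts))
)
out <- from.dfs(job)   # pull the (word, count) pairs back into the R session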
In our project, as we implied earlier, we are dealing with a text file that
contains more than 300,000 tweets. In the data analytics process, we put that
file into an HDFS directory using the HDFS commands in Linux, and the file
itself is split into 20 different parts, with one replication and one cluster. (The
same copy can also be done from R via rhdfs, as sketched below.)
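A minimal rhdfs sketch of that copy step (local and HDFS paths are assumptions):
library(rhdfs)
hdfs.init()
# Equivalent to `hadoop fs -put` on the command line.
hdfs.put("/home/hduser/tweets/tweets.txt", "/user/hduser/tweets")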
We used 2 mappers in our project, which is enough for our Core i7 CPU; this
means 2 tasks run simultaneously on each core, as we enabled
hyper-threading.
Using R alone, it took more than 5 minutes to load the data into memory and
more than 20 minutes to work on the data and produce results; with Hadoop,
it took only 15 minutes to get all the results, in a much easier way.
4.4 Phase 4: Data presentation.
This is the final step where results obtained will be presented to the user in a
user friendly manner. Dashboards will be used to present the analytics results.
Maps will be used to represent geographically-related information. Pie-charts,
bar charts and others will be used for analytical data. The user will be given
the freedom to drill-down & roll-up with the analytics granularity so that (s)he
can obtain a global view before making decisions.
For business analysis reports, we also provide some useful analytics, such as:
1- Most Retweeted tweets.
2- Most Favorite tweets.
3- Most accompanied hashtags.
These can come in handy for finding trends or topics related to the user's
search term.
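A minimal sketch of how such summaries can be computed from the tweets data frame built in the collection phase (column names as listed in Phase 1; this is illustrative, not our production code):
top_retweeted <- head(tweets[order(-tweets$retweetCount), c("screenName", "text", "retweetCount")], 10)
top_favorited <- head(tweets[order(-tweets$favoriteCount), c("screenName", "text", "favoriteCount")], 10)
hashtags      <- unlist(regmatches(tweets$text, gregexpr("#\\w+", tweets$text)))   # extract hashtags
top_hashtags  <- head(sort(table(tolower(hashtags)), decreasing = TRUE), 10)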
We used PHP, Bootstrap, and JavaScript to implement the user interface,
giving the user the option to enter his preferred search term in the real-time
section and the option to upload a file of tweets in the historical section, as
we had to ensure that our application stays generic and error-free.
Chapter 5: System Evaluation
5.1 Data Evaluation
As mentioned earlier, our data is fetched directly from the Twitter Search API,
but we noticed a lot of problems in this data that affected the performance
of our algorithm and our system overall. One of the problems we faced is
that the data contains a lot of ads and commercial tweets, as shown in
Figure 4.
Figure 4: Commercial tweets.
To overcome this problem, we decided to collect pure datasets from
different sources that provide user reviews on specific products. We
collected about 500 user reviews manually from websites that provide real
reviews from real users, like GSMArena and Reevoo.
Screenshot of the pure dataset:
5.2 Algorithm Evaluation
Our main objective in this project was the accuracy of the sentiment analysis
scoring algorithm, and our algorithm was enhanced more than once to get
better accuracy percentages.
Our first algorithm was very basic and relied on matching the words between
the tweet and the lexicon. This algorithm led to poor results.
We then identified the reasons behind these poor results: the special cases
that need to be handled, such as negation and sarcasm.
After handling them, we achieved a much better accuracy score, using the
pure dataset collected in the data evaluation phase, which we had
analyzed manually to establish its overall sentiment before analyzing it again
with our algorithm.
We then compared the accuracy of every algorithm version, as shown in
Figure 5 (Algorithms Comparison).
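For illustration, accuracy here simply means the fraction of reviews on which the algorithm's label matches the manual label (the labels below are made up, not our dataset):
manual    <- c("positive", "negative", "neutral", "positive", "negative")
predicted <- c("positive", "negative", "positive", "positive", "neutral")
accuracy  <- mean(predicted == manual)   # 0.6, i.e. 60%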
5.3 System Evaluation
Finally, we wanted to benchmark our algorithm's performance against other
algorithms.
We chose Stanford's recursive deep model, as it is very popular and used by
many people interested in sentiment analysis.
The two algorithms were tested on the same data collected and manually
analyzed by us in the data evaluation phase.
As shown in Figure 6, our algorithm scored 61.7% accuracy and Stanford's
recursive deep model scored 55.8%.
Here are some examples from the two algorithms to support this benchmark:
Example 1:
The text says: “The camera quality and options got better, overall, the phone
is not a disappointment”.
The review is positive:
Stanford’s recursive deep model scored 0 for this sentence which means it’s
neutral.
Our algorithm scored +4 which means it’s positive.
Example 2:
The text says: “Excellent camera with video facility”.
The review is positive.
Stanford’s recursive deep model scored 0 for this sentence which means it’s
neutral.
Our algorithm scored +3, which means it is positive.
References
- Book 1: Big Data Analytics with R and Hadoop
First published: November 2013
Production Reference: 1181113
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-328-2
- The Comprehensive R Archive Network (CRAN)
Link to the CRAN: http://cran.rstudio.com/
- Course (1): Swirl. Learn R, in R.
Link to the course: http://swirlstats.com/
- Course (2): R Programming.
By: Johns Hopkins University.
Link to the course: https://www.coursera.org/course/rprog
- Course (3): Getting and Cleaning Data.
By: Johns Hopkins University
Link to the course: https://www.coursera.org/course/getdata
- Course (4): Exploratory Data Analysis.
By: Johns Hopkins University.
Link to the course: https://www.coursera.org/course/exdata
- Discussion article (1): What are the best supervised learning algorithms for
sentiment analysis in text?
Link to the article: http://www.quora.com/What-are-the-best-supervised-learning-algorithms-for-sentiment-analysis-in-text
- Article (1): Why Sentiment Analysis Is Essential for Small Businesses.
By: Tara Hornor.
Link to the article: http://graziadiovoice.pepperdine.edu/why-sentiment-analysis-is-essential-for-small-businesses/
- Article (2): The 37 best tools for data visualization.
By: Brian Suda and Sam Hampton-Smith.
Link to the article: http://www.creativebloq.com/design-tools/data-visualization-712402

  • 1. Datapedia 1 GraduationProjectReport Faculty of Computers and Information Information Systems Dept. Cairo University 2014/2015 Datapedia Project Team: Abanoub A. Amin. (20110002) Ahmed A. Ahmed. (20110070) Ahmed M. Osman (Team Leader). (20110086) Ahmed N. Muhammad. (20110099) Ahmed R. Negm. (20110053) Under the supervision of: Dr. Hoda M. O. Mokhtar. Eng. Mohamed Hafez.
  • 2. Datapedia 2 Table of Contents Chapter 1: System Proposal ……………………………………………………….……………. 4 1.1 Introduction …………………………………………………………………………………… 4 1.2 Problem Statement ………………………………………………………………………….. 5 1.3 Project Justification ………………………………………………………………………….. 5 1.4 Project Scope ………………………………………………………………………………… 5 1.5 Technologies & Tools …………….………………………………………………………….. 6 1.6 Limitations & Exclusions ………………………………………………………..…………... 23 Chapter 2: System Analysis ……………………………………………..…………………….. 25 2.1 Project Stakeholders ……………………………………………………….…….………… 52 2.2 Non-Functional Requirements ………………………………………….……….……….. 52 2.3 Functional Requirements ………………………………………………….…….………... 62 2.4 User Profiles ………………………………………………………………….…….…….…… 72 2.5 Task Profiles ………………………………………………………………….…………..…… 72 2.6 Environmental Profiles …………………………………………………….……………..… 82 2.7 Sample Personas …………………………………………………………………….……... 82 2.8 Competitive Analysis ………………………………………………………………….…… 03 Chapter 3: System Design ……………………………………………………….…………..… 13 3.1 System Development Methodology ……………………………………………………. 13 3.2 Main Objectives ……………………………………………………………….……………. 13 3.3 Secondary objectives ………………………………………………………….………….. 23 3.4 Use Case Diagram …………………………………………………………………………. 33 3.5 Backend Design ………………………….…………………………………………………. 34 3.6 Block Diagram ………………………………………………………………….…………… 39 3.7 Sample Mock-ups ……………………………………………………………..……………. 40 Chapter 4: System Implementation& Management …………....…………..………..….. 42 4.1 Phase 1: Data Collection and Preprocessing …………………………………………. 42 4.2 Phase 2: Sentiment Analysis ………………………………………………………………. 43 4.3 Phase 3: Data Analytics …………………………………………………………………… 44
  • 3. Datapedia 3 4.4 Phase 4: Data presentation ………………………………………………………………. 50 4.5 Algorithm Flowchart ………………………………………………………………….…….. 51 Chapter 5: System Evaluation …………………………………………………..…………….. 52 5.1 Data Evaluation ……………………………………………………………..……………… 52 5.2 Algorithm Evaluation…………………………………………………………………..….... 54 5.3 System Evaluation ……………………………………………………………..………….... 55 References ……………………………………….………………………………………….....… 58
  • 4. Datapedia 4 Chapter 1: System Proposal. 1.1 Introduction. Nowadays, it is true that our personal life is highly dependent on the technology that people have developed. We are currently witnessing an era where new competitive products are produced almost every day. A key impact of this rapid technological advancement is that it affected several aspects in our life style along with the way we purchase products, the way we communicate, the way we travel, and not to mention the way we learn. Despite the importance of these advancements, today, buying a certain product is no longer a trivial task that can be taken in minutes. Almost everything we use has been refurbished to better standards. New features, new specifications, price variations, new applications, and many other dimensions affect our purchasing decision, making it more sophisticated and require better decision-making approaches.Although this impact is clear on the personal level, it is also obvious in larger organizations where purchasing machines, computers and equipment is also a crucial decision that costs money.Thus, traditional purchasing approaches where a customer selects by shape, or size or just price is no longerefficient. Today, we all compare between products, ask for recommendations, and navigate through the web for product reviews aiming to find clues that help us in our decision-making. However, going through different sites for collecting such comparative data, or chatting in social networks to find recommendations is not an appealing task for many people. Inspired by the importance of this decision-making problem, and the role social networks and online reviews, we aim to integrate different product related data sources to provide the customer with an integrated view for
  • 5. Datapedia 5 the product he wants to purchase. 1.2 Problem Statement. Buying a product whether it is a computer, a mobile phone, a tablet, or even a car is no longer a simple task. With the increase in prices of new products and the wide range of competing products that offer similar functionalities at lower price, choosing between alternatives turns to be difficult. 1.3 Project Justification. It is now common that before buying any product people at least ask their friends or relatives for recommendations, others search over the internet for customers’ reviews, others tweet or chat over social networks to gather information, going through all those paths is not an appealing approach for many people especially those with minimal computer education. Thus, it would have been more appealing if customers can enter a “single” website or run a mobile application that simply does the work and presents the final comparison in a user-friendly analytical approach. 1.4 Project Scope. In this project, we aim to design a web application that analyzesbig volumes of product reviews, social networks posts and tweets related to the product. Then, presents the results of this big data analytics job in a user friendly, understandable, and easily interpreted manner that can be easily used by different customers for different purposes.
  • 6. Datapedia 6 1.5 Technologies & Tools. R Language 3.1.2, RStudio 0.98.1091. R is an open source software package for performing statistical analysis on data. R is a programming language used by data scientists, statisticians, and others who need to perform statistical analysis of data and glean key insights from it using techniques such as regression, clustering, classification, and text analysis. R is released under the GNU General Public License. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently maintained by the R Development Core Team. It can be considered a different implementation of S, developed by John Chambers at Bell Labs. There are some important differences, but a lot of code written in S runs unaltered under the R interpreter. R provides a wide variety of statistical, machine learning (linear and nonlinear modeling, classic statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. R has various built-in as well as extended functions for statistical, machine learning, and visualization tasks such as: • Data extraction • Data cleaning • Data loading • Data transformation • Statistical analysis • Predictive modeling • Data visualization It is one of the most popular open source statistical analysis packages available today. It is cross-platform, has very wide community support, and a large and ever-growing user community that adds new packages every day. With its growing list of packages, R can now connect to other data stores, such as MySQL, SQLite, MongoDB, and Hadoop, for data storage activities. Let's see different useful features of R:
  • 7. Datapedia 7 • Effective programming language • Relational database support • Data analytics • Data visualization • Extension through the vast library of R packages The graph provided by KD suggests that R is the most popular language for data analysis and mining. The following graph provides details about the total number of R packages released by R users from 2005 to 2013, which shows how the R user base has grown: the growth was exponential in 2012, and 2013 seems to be on track to beat that. R allows performing data analytics through various statistical and machine learning operations, as follows: • Regression • Classification • Clustering • Recommendations • Text mining
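As a small, generic illustration of the kinds of operations listed above (this sketch uses R's built-in iris dataset rather than the project's tweet data):

```r
# Generic illustrations of the listed operations, using R's built-in iris data.
data(iris)

# Regression: model petal length from petal width.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(fit)

# Clustering: group the flowers into three clusters on the numeric columns.
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)

# Visualization: a quick scatter plot of the fitted relationship.
plot(iris$Petal.Width, iris$Petal.Length, pch = 19,
     xlab = "Petal width", ylab = "Petal length")
abline(fit, col = "red")
```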
  • 8. Datapedia 8 Using R enhanced our project, giving us the ability to run accurate analytics on tweets and figure out each user's opinion. Installation: First we install R for Windows, which is required before installing RStudio. Then we can download and install RStudio.
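With R and RStudio in place, the packages used in the rest of this chapter can be installed from CRAN. The exact package list below is an assumption based on the tools described in this report:

```r
# One-time setup after installing R and RStudio: install the packages used later
# in this report (the list is an assumption based on the tools described here).
install.packages(c("twitteR",   # Twitter Search API client
                   "ROAuth",    # OAuth handshake with Twitter
                   "plyr",      # data manipulation
                   "stringr"))  # text cleaning for tweets
library(twitteR)
library(ROAuth)
```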
  • 10. Datapedia 10 Twitter Authentication with R: After installing our packages and libraries we need to create a new Twitter application to be used in the data collection phase where we use this application to interact with Twitter search API. This app is needed for the authentication process with Twitter. Creating the Twitter application and doing the handshake is a must as you have to do it every time you want to get data from Twitter with R. First we go to https://dev.twitter.com/ and log in with a Twitter Account. After creating the application, we can get our api_key and our api_secret as well as our access_token and access_token_secret from our app settings on Twitter. And that’s it. Now we can search Twitter anytime we want and get the data we need.
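A minimal sketch of that handshake and a first search from R, assuming a recent twitteR release (setup_twitter_oauth() was introduced around version 1.1.8; older versions perform the handshake explicitly through ROAuth). The four key strings are placeholders for the values copied from the app settings page:

```r
library(twitteR)

# Placeholder credentials copied from the Twitter app settings page.
api_key             <- "YOUR_API_KEY"
api_secret          <- "YOUR_API_SECRET"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate this R session against the Twitter API.
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

# First search: up to 1500 English tweets mentioning the tracked product.
tweets <- searchTwitter("#Windows10", n = 1500, lang = "en")
length(tweets)
```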
  • 11. Datapedia 11 2- Apache Hadoop (Hadoop Streaming) v1.2.1. As mentioned earlier, we are about to use the R language to perform analytical tasks on a massive amount of data. Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured and will typically not fit into memory to be processed. They have to be processed in place, which means that computation has to be done where the data resides. When we talk to developers, the people actually building Big Data systems and applications, we get a better idea of what they mean by the 3Vs. They typically mention the 3Vs model of Big Data: velocity, volume, and variety. Velocity refers to the low-latency, real-time speed at which analytics need to be applied. A typical example would be performing analytics on a continuous stream of data originating from a social networking site, or on an aggregation of disparate data sources. Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB based on the type of the application that generates or receives the data. Variety refers to the various types of data that can exist, for example, text, audio, video, and photos. Big Data usually involves datasets so large that it is not possible for traditional systems to process them within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset. Faced with this seemingly insurmountable challenge, entirely new platforms have emerged, called Big Data platforms.
  • 12. Datapedia 12 Some of the popular organizations that hold Big Data are as follows: • Facebook: It has 40 PB of data and captures 100 TB/day. • Yahoo!: It has 60 PB of data. • Twitter: It captures 8 TB/day. • EBay: It has 40 PB of data and captures 50 TB/day. How much data is considered as Big Data differs from company to company. Though true that one company's Big Data is another's small, there is something common: doesn't fit in memory, nor disk, has rapid influx of data that needs to be processed and would benefit from distributed software stacks. For some companies, 10 TB of data would be considered Big Data and for others 1 PB would be Big Data.
  • 13. Datapedia 13 So only you can determine whether the data is really Big Data. It is sufficient to say that it would start in the low terabyte range. We will use Apache Hadoop as a tool to handle these big amount of data for the sake of our project. Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top level Apache project, initiated and led by Yahoo! and Doug Cutting. It relies on an active community of contributors from all over the world for its success. With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology. It is becoming the industry de facto framework for Big Data processing. Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics. Hadoop enables scalable, cost-effective, flexible, fault-tolerant solutions. Apache Hadoop has two main features: • HDFS (Hadoop Distributed File System) • MapReduce HDFS is Hadoop's own rack-aware file system, which is a UNIX-based data storage layer of Hadoop. HDFS is derived from concepts of Google file system. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. On HDFS, data files are replicated as sequences of blocks in the cluster. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by simply adding commodity servers. HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred other organizations worldwide are known to use Hadoop.
  • 14. Datapedia 14 Characteristics of HDFS: • Fault tolerant • Runs with commodity hardware • Able to handle large datasets • Master slave paradigm • Write once file access only MapReduce is a programming model for processing large datasets distributed on a large cluster. MapReduce is the heart of Hadoop. Its programming paradigm allows performing massive data processing across thousands of servers configured with Hadoop clusters. This is derived from Google MapReduce. Hadoop MapReduce is a software framework for writing applications easily, which process large amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. This MapReduce paradigm is divided into two phases, Map and Reduce that mainly deal with key and value pairs of data. The Map and Reduce task run sequentially in a cluster; the output of the Map phase becomes the input for the Reduce phase. These phases are explained as follows: • Map phase: Once divided, datasets are assigned to the task tracker to perform the Map phase. The data functional operation will be performed over the data, emitting the mapped key and value pairs as the output of the map phase. • Reduce phase: The master node then collects the answers to all the sub-problems and combines them in some way to form the output; the answer to the problem it was originally trying to solve. The five common steps of parallel computing are as follows: 1. Preparing the Map () input: This will take the input data row wise and emit key value pairs per rows, or we can explicitly change as per the requirement. ° Map input: list (k1, v1)
  • 15. Datapedia 15 2. Run the user-provided Map() code ° Map output: list (k2, v2) 3. Shuffle the Map output to the Reduce processors. Also, shuffle the similar keys (grouping them) and input them to the same reducer. 4. Run the user-provided Reduce() code: This phase will run the custom reducer code designed by the developer to run on the shuffled data and emit key and value pairs. ° Reduce input: (k2, list (v2)) ° Reduce output: (k3, v3) 5. Produce the final output: Finally, the master node collects all reducer output, combines it, and writes it to a text file. In our system, we used Hadoop 1.2.1 as it's a stable version of Hadoop, and it can be downloaded from the official Apache Hadoop website. Hadoop 1.2.1 needs OpenJDK 7, and JAVA_HOME must be added to the Ubuntu environment; for that purpose you need to edit /etc/environment.
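To make the five Map and Reduce steps listed above concrete, here is a sketch of a streaming word-count job written as two small R scripts; the file names and paths are illustrative and this is not the project's actual code. Hadoop Streaming pipes each input split line by line through the mapper, sorts and groups the emitted keys, and pipes the grouped stream through the reducer.

```r
#!/usr/bin/env Rscript
# mapper.R (steps 1-2): read input lines from stdin and emit (word, 1) pairs.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)
```

```r
#!/usr/bin/env Rscript
# reducer.R (step 4): after the shuffle, lines arrive sorted by key; sum per key.
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  parts <- strsplit(line, "\t")[[1]]            # parts[1] = word, parts[2] = count
  if (!is.null(current) && parts[1] != current) {
    cat(current, "\t", total, "\n", sep = "")   # step 5: emit the combined result
    total <- 0
  }
  current <- parts[1]
  total   <- total + as.numeric(parts[2])
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)
```

The two scripts are then passed to the hadoop-streaming JAR shipped with Hadoop 1.2.1 as the -mapper and -reducer options, with the tweet text files given as -input.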
  • 16. Datapedia 16 A dedicated user must be added in order to launch Hadoop, and it is also recommended to create a hadoop group, because during the installation and configuration of Hadoop we need separation and privacy from other users. Hadoop requires SSH access to manage its nodes, i.e. the remote machines plus your local machine if you want to run Hadoop on it, so an SSH server must be installed and configured for the dedicated Hadoop user.
  • 17. Datapedia 17 3- RHadoop. RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are regularly tested (and always before a release) on recent releases of the Cloudera and Hortonworks Hadoop distributions and should have broad compatibility with open source Hadoop and mapR's distribution. We normally test on recent Revolution R and CentOS releases, but we expect all the RHadoop packages to work on a recent release of open source R and Linux. RHadoop consists of the following packages: ravro: read and write files in avro format. plyrmr: higher level plyr-like data processing for structured data, powered by rmr. rmr: functions providing Hadoop MapReduce functionality in R. rhdfs: functions providing file management of the HDFS from within R. rhbase: functions providing database management for the HBase distributed database from within R. 4- PHP. PHP (recursive acronym for PHP: Hypertext Preprocessor) is a widely-used open source general-purpose scripting language that is especially suited for web development and can be embedded into HTML. What distinguishes PHP from something like client-side JavaScript is that the code is executed on the server, and that completely suits our target as the server has to have a specific environment that won’t be available in every regular user’s pc, generating HTML which is then sent to the client. The client would receive the results of running that script, but would not know what the underlying code was.
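Of these packages, rmr is the one this project touches most directly, since it lets a MapReduce job be written entirely in R. The following is a minimal sketch using the rmr2 API (to.dfs(), mapreduce(), keyval(), from.dfs()); it assumes the HADOOP_CMD and HADOOP_STREAMING environment variables point at the Hadoop 1.2.1 installation, and the toy input merely stands in for the scored tweets:

```r
# A toy rmr job (rmr2 API): count records per sentiment class on the cluster.
library(rmr2)

# Toy input written to HDFS; in Datapedia this would be the scored tweets.
labels <- data.frame(label = sample(c("positive", "negative", "neutral"),
                                    1000, replace = TRUE))
input <- to.dfs(labels)

counts <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(v$label, rep(1, nrow(v))),  # emit (label, 1) pairs
  reduce = function(k, vv) keyval(k, sum(vv))                # sum the 1s per label
)

from.dfs(counts)  # pull the small result back into the R session
```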
  • 18. Datapedia 18 PHP is mainly focused on server-side scripting, so it can do anything any other CGI program can do, such as collect form data, generate dynamic page content, or send and receive cookies and much more. There are three main areas where PHP scripts are used. • Server-side scripting. This is the most traditional and main target field for PHP. You need three things to make this work. The PHP parser (CGI or server module), a web server and a web browser. • Command line scripting. You can make a PHP script to run it without any server or browser. You only need the PHP parser to use it this way. This type of usage is ideal for scripts regularly executed using cron (on *nix or Linux) or Task Scheduler (on Windows). These scripts can also be used for simple text processing tasks. And that’s what we are looking for, we will actually use PHP to execute the R script that does the sentimental analytics. • Writing desktop applications. PHP is probably not the very best language to create a desktop application with a graphical user interface, but if you know PHP very well, and would like to use some advanced PHP features in your client-side applications you can also use PHP-GTK to write such programs. You also have the ability to write cross-platform applications this way. 5- Google Charts API. The Google Chart API is a tool that lets people easily create a chart from some data and embed it in a web page. Google creates a PNG image of a chart from data and formatting parameters in an HTTP request. Many types of charts are supported, and by making the request into an image tag, people can simply include the chart in a web page.
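In Datapedia, this command-line path is what connects the PHP front end to the analytics: the PHP page can call something like exec("Rscript analyze.R '#Windows10'"). A sketch of what the R side of that hand-off can look like follows; the script name, argument handling, and output file are illustrative, not the project's actual files:

```r
#!/usr/bin/env Rscript
# analyze.R - the kind of entry point a PHP page can invoke via Rscript.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("usage: Rscript analyze.R <search term or hashtag>")
query <- args[1]

cat("Running the Datapedia pipeline for:", query, "\n")
# source("collect.R")  # Phase 1: download tweets matching `query`
# source("score.R")    # Phase 2: lexicon-based sentiment scoring
# source("report.R")   # Phases 3-4: analytics and files read by the dashboard

# Write a small summary the PHP page can read and feed to the charts.
summary_df <- data.frame(sentiment = c("positive", "negative", "neutral"),
                         count     = c(0, 0, 0))  # placeholder values
write.csv(summary_df, "sentiment_summary.csv", row.names = FALSE)
```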
  • 19. Datapedia 19 Charts are exposed as JavaScript classes, and Google Charts provides many chart types for you to use. The default appearance will usually be all what we need, and we can always customize a chart to fit the look and feel of the website. Charts are highly interactive and expose events that let us connect them to create complex dashboards or other experiences integrated with our webpage. Charts are rendered using HTML5/SVG technology to provide cross-browser compatibility (including VML for older IE versions) and cross platform portability to iPhones, iPads and Android. Our users will never have to mess with plugins or any software. If they have a web browser, they can see the charts. All chart types are populated with data using the DataTable class, making it easy to switch between chart types as you experiment to find the ideal appearance. The DataTable provides methods for sorting, modifying, and filtering data, and can be populated directly from your web page, a database, or any data provider supporting the Chart Tools Datasource protocol. Google Charts API has many useful characteristics that perfectly goes along with our project goal to have visual ads for the resulted analytics like: •Free. Using the same chart tools Google uses, completely free and with three years' backward compatibility guaranteed. •Customizable. Allows us to configure an extensive set of options to perfectly match the look and feel of our website. •Controls and Dashboards. Easily connected to an interactive dashboard. 1.6 Google Maps.
  • 20. Datapedia 20 As said before, Google Maps will be used within Datapedia to provide the graphical analytics aids. On today's Web, mapping solutions are a natural ingredient. So we use them to see the location of the tweets and visualize their content as colored pins whose color indicates whether a tweet is positive, negative, or neutral. Most tweets have a location, and if a location exists, it can be displayed on a map. There are several mapping solutions, including Yahoo! Maps and Bing Maps, but the most popular one is Google Maps. In fact, according to Programmableweb.com, it's the most popular API on the Internet. The Google Maps API lets us harness the power of Google Maps within Datapedia to display the located tweets in an efficient and usable manner. First of all, we need to know the term "coordinates". Coordinates are used to express locations in the world. There are several different coordinate systems. The one used in Google Maps is the World Geodetic System 84 (WGS 84), which is the same system the Global Positioning System (GPS) uses. Coordinates are expressed using latitude and longitude. You can think of these as the y and x values in a grid. To create a Google map, we must know that it resides in a web page. So, the first thing to do is to set up that page. This includes creating an HTML page and a style sheet for it. Once everything is set up, you can insert the actual map.
  • 21. Datapedia 21 Now that you have a web page set up, you're ready to load the Google Maps API. The API is a JavaScript file that is hosted on Google's servers. It's loaded with a <script> element in the <head> section of the page. The <script> element can be used to add remote scripts into a web page, and that's exactly what you want to do here. The most common use of maps on the Internet is to visualize the geographic position of something. The Google Maps marker is the perfect tool for doing this. A marker is basically a small image that is positioned at a specific place on a map. In Datapedia, it indicates that this is a tweet, and clicking it shows the content of the tweet. The content of the tweet opens, when the marker is clicked, in a window called an InfoWindow. When marking places on a map, we will often want to show additional information related to that place, like the text content of the located tweet. The Google Maps API offers a perfect tool for this, and that's the InfoWindow. It looks like a speech bubble and typically appears over a marker when you click it.
  • 22. Datapedia 22 7- Opinion Lexicon. Used in the Sentiment analysis process as we search for a match between the tweet and the lexicon and -if found- give that match the corresponding points in the lexicon. It’s like a big dictionary that contains English words and classifies them to negative and positive based on score or a rating where more than a zero is positive, lower than zero is negative. 8- Bootstrap v3.3.5. Bootstrap is an open-source CSS, JavaScript framework that was originally developed for twitter application by twitter's team of designers and
  • 23. Datapedia 23 developers. It is a combination of HTML, CSS, and JavaScript code designed to help build user interface components. Bootstrap was also programmed to support both HTML5 and CSS3. Also it is called Front-end-framework. Bootstrap is a free collection of tools for creating a websites and web applications. It contains HTML and CSS-based design templates for typography, forms, buttons, navigation and other interface components, as well as optional JavaScript extensions. Some Reasons we preferred Bootstrap Framework: •Easy to get started. •Great grid system. •Base styling for most HTML elements (Typography, Code, Tables, Forms, Buttons, Images, and Icons). •Extensive list of components. •Bundled JavaScript plugins. 1.6 Limitations & Exclusions Language Support: Our System support and analyze tweets written only in English as the opinion lexicon contains English words only. Twitter Search API: Access to the Search API is measured by the IP address the request is coming from. So the rate limiting is too strict like: - The downloaded tweets are only 1 week old as you should be a Twitter search partner to be able to request older tweets from the search API. twitteR data package in R Statistics:
  • 24. Datapedia 24 - Based on the limits of the searchTwitter () function we can only download 1500 vectors representing 1500 tweets per request.
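One way to live with the 1500-tweets-per-request limit is to page backwards through the available search window by re-issuing the request with the maxID parameter of searchTwitter(). The sketch below assumes results arrive newest-first and ignores the rate-limit pauses needed between requests:

```r
library(twitteR)

# Collect several consecutive batches, each continuing below the oldest tweet
# id returned so far. maxID is inclusive, so duplicates are dropped afterwards.
collect_batches <- function(query, batches = 5, per_batch = 1500) {
  collected <- list()
  max_id <- NULL
  for (i in seq_len(batches)) {
    batch <- searchTwitter(query, n = per_batch, lang = "en", maxID = max_id)
    if (length(batch) == 0) break
    collected <- c(collected, batch)
    max_id <- batch[[length(batch)]]$id   # oldest id in this (newest-first) batch
  }
  df <- twListToDF(collected)
  df[!duplicated(df$id), ]
}

product_tweets <- collect_batches("#Windows10", batches = 3)
nrow(product_tweets)
```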
  • 25. Datapedia 25 Chapter 2: System Analysis 2.1 Project Stakeholders Beneficiaries: - Users: Regular users can benefit from Datapedia by using it to know people’s opinions of a certain product or a service to help them decide what to buy. - Marketing Agencies: Digital Marketers and social media analysts can use Datapedia to track the volume of activity around campaigns in real- time, including number of users, find influencers talking about it and track negative feedbacks. - Decision Makers: Datapedia helps decision makers to make more efficient decisions based on user feedbacks. Owners: The project team and supervisors who will deliver the project. Without them project won’t happen. Suppliers: Twitter is a major supplier as we download the tweets from its API. Project Sponsor: Our project is accepted in ITAC graduation project funding program and sponsored by ITIDA. 2.2 Non-Functional Requirements Usability: Our system is built to in a manner that insures that it will be easy and simple and the majority of users will find it user-friendly and easy to deal with.
  • 26. Datapedia 26 Portability: Our system is built to work on any operating system and on any web browser. Accuracy: - Data accuracy is 100% guaranteed as the system deals with Twitter search API directly. - Sentiment scoring accuracy is our main debate in this project as we spent a lot of time handling special cases like (negation, sarcasm, etc...) to boost our algorithm accuracy. Extensibility: Our goal is to add more features to our system in the future like time framing and predicting future trends. 2.3 Functional Requirements 1- Easily understandable analytics with dynamically-changed graphical aids. 2- Great sentiment analysis accuracy accompanied with calculations and samples. 3- User Recommendations based on true statistics about related products. 4- Fully interactive dashboard graphs and charts to help users make better decisions.
  • 27. Datapedia 27 2.4User Profiles Characteristics Regular Users Marketers Project Managers / Managers Administrator Age 10 : 60 23 : 35 35 : 50 25 : 35 Gender 50% males 50% females 50% males 80% males 100% males Education Basic Education Marketing Degrees MBAs / PHDs CS Degree Language English / Arabic English / Arabic English / Arabic English Computer Experience Low mid Mid High Domain Experience (Social Analytics) Low High Low : Mid High Expectations Ease of access Speed of task Ease of access and speed of task Comprehensive functionality 2.5 Task Profiles Task Name Regular Users Marketers Project Managers / Managers Administrator 1 Sign up X X X 2 Submit new tracking request X X 3 View Tracking reports X X X 4 Export tracking reports X X 5 View tracking history X X X 6 View user data X
  • 28. Datapedia 28 2.6 Environmental Profiles Characteristics Regular Users Marketers Project Managers / Managers Administrator Location Indoor or mobile Indoor or mobile Indoor or mobile Indoor Workspace Office Office Cubicle Software Any browser Any browser Any browser Any browser Lighting Bright Good Average Noise Quiet quiet quiet quiet Hardware PC PC PC PC Internet Connection Normal connection Normal connection Fast connection Fast connection 2.7Sample Personas Persona #1 Name: Mr. Ahmed Roshdy. Age: 20. Position: Student. Education: Pursuing a degree in Computer Science. Things he wants to know Things he wants to do  People’s opinions on the product he wants to buy to decide whether he will buy it or not.  Knowing the product that he wants to buy.
  • 29. Datapedia 29 Persona #2 Name: Mr. John Doe. Age: 45. Position: Managing Director. Education: Masters of Business Administration. Company: XYZ for Real estate development. Things he wants to know Things he wants to do  How much units should he sell in his new project?  What is the feedback of his current clients?  Where to build new projects and housing complexes?  Attract more clients to his business.  Make more profit.  Build strong relationships with his clients. Persona #3 Name: Ms. Karin Kudrow. Age: 25. Position: Social Media Strategist. Education: Bachelor degree in Marketing. Company: XYZ for Business Solutions. Things she wants to know Things she wants to do  How many tweets written with a certain hashtag?  Who are the best contributors on the hashtag?  How people interact with the hashtag and are their tweets negative or positive?  Launch a new marketing campaign on twitter.  Share with the followers of the company a new hashtag to share their thoughts on it.
  • 30. Datapedia 30 2.8 Competitive Analysis Factor keyhole Hashtagify.me hashtracking hashtags talkwalker Datapedia Subscription 30 days trial then 599$ per month for full subscription. 14 days trial then 299$ per month for full subscription. 30 days trial then 399$ per month for full subscription. Free trial with upgradable account for 349$ per month for full subscription. 7 days trial then 1400$ per month for full subscription. Full free subscription. Tracking method Real time + historical data (50$ per report) Real time + historical data. Real time + historical data (up to 30 days) 24-Hour Trend Graph based on 1% data sample. Real time + historical data. Real time + historical data. Sentiment analysis X  X X   Social Media Coverage Twitter, Facebook and Instagram. Twitter and they will add Facebook and Instagram in the future. Twitter. Twitter. Twitter, Facebook and Instagram. Twitter Technology Web application. Web application. Web application. Web application. Web application. Web application, planning to release an android application. From the table above, it is clear that Datapedia is deliberated to support all of the existing common feature in the corresponding and similar web-based applications. However, Datapedia will have some edge over the others such as the free full subscription and the ability to be extended to a mobile application. In addition, building Datapedia as a web-based application will give it an advantage over the others having features such as the dynamically-changed graphical analytics aids and the interactive dashboard graphs and charts to support the user in making better decisions.
  • 31. Datapedia 31 Chapter 3: System Design 3.1 System Development Methodology We used the agile software development as: 1- We were changing requirements frequently, even late in development. 2- We delivered working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale. 3- The Working software was the primary measure of progress. 4- Our method of conveying information to and within a development team is face-to-face conversation. 5- There was a continuous attention to technical excellence and good design enhances agility. 6- To implement a new feature the developers need to lose only the work of a few days, or even only hours, to roll back and implement it. 7- Unlike the waterfall model in agile model very limited planning is required to get started with the project. Agile assumes that the end users’ needs are ever changing in a dynamic business and IT world. Changes can be discussed and features can be newly effected or removed based on feedback. This effectively gives the customer the finished system they want or need. 3.2 Main Objectives 1- To help the user make better decisions concerning buying a product in an easy & user-friendly way based on real reviews written by actual users among different social media sites. 2- To help marketers make better marketing plans based on what people like and dislike (user opinion and review) and by taking feedbacks in
  • 32. Datapedia 32 addition to studying the trend of opinions based on different dimensions (time, location, source, …). 3- To help mangers make more accurate forecasts based on real data collected from real users. 3.3 Secondary objectives 1- Implementing the “power of the user” strategy. 2- Giving recommendations based on real data. 3- Pointing out how market responds to a new product, service and initiative.
  • 33. Datapedia 33 3.4 Use Case Diagram
  • 34. Datapedia 34 3.5 Backend Design To know how the process will go when we speak about big data, we must first know about Hadoop itself. As open source Java-based software, it costs us nothing to apply parallel processing techniques to our collected data and run our algorithm on it. Hadoop mainly relies on dividing tasks into smaller subtasks via a programming paradigm called MapReduce, in which an algorithm is mapped onto the different parts of the data and finally a reducer combines the results, as shown in the figure. The MapReduce structure mainly consists of two parts: • JobTracker: This is the master node of the MapReduce system, which manages the jobs and resources in the cluster (TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the actual data being processed, i.e., on the TaskTracker that is running on the same DataNode as the underlying block. • TaskTracker: These are the slaves that are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker. They work in a master-slave arrangement in which the JobTracker assigns tasks to the TaskTrackers, as shown in the following figure:
  • 35. Datapedia 35 As we can see, Hadoop is mainly based on the divide-and-conquer process, and that forms its parallel processing core. The other core is the file system that Hadoop relies on, HDFS (Hadoop Distributed File System), which is the centerpiece of fault handling in Hadoop. It is actually a normal file system; data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
  • 36. Datapedia 36 HDFS is based on three nodes: • NameNode: This is the master of the HDFS system. It maintains the directories, files, and manages the blocks that are present on the DataNodes. • DataNode: These are slaves that are deployed on each machine and provide actual storage. They are responsible for serving read-and-write data requests for the clients. • Secondary NameNode: This is responsible for performing periodic checkpoints. So, if the NameNode fails at any time, it can be replaced with a snapshot image stored by the secondary NameNode checkpoints. Sometimes the data resides on the HDFS (in various formats).since a lot of data analysts are very productive in R, it is natural to use R to compute with the data stored through Hadoop-related tools. As mentioned earlier, the strengths of R lie in its ability to analyze data using a rich library of packages but fall short when it comes to working on very large datasets. The strength of Hadoop on the other hand is to store and process
  • 37. Datapedia 37 very large amounts of data in the TB and even PB range. Such vast datasets cannot be processed in memory as the RAM of each machine cannot hold such large datasets. The options would be to run analysis on limited chunks also known as sampling or to correspond the analytical power of R with the storage and processing power of Hadoop and you arrive at an ideal solution. Such solutions can also be achieved in the cloud using platforms such as Amazon EMR. R will not load all data (Big Data) into machine memory. So, Hadoop can be chosen to load the data as Big Data. Not all algorithms work across Hadoop, and the algorithms are, in general, not R algorithms. Despite this, analytics with R have several issues related to large data. In order to analyze the dataset, R loads it into the memory, and if the dataset is large, it will fail with exceptions such as "cannot allocate vector of size x". Hence, in order to process large datasets, the processing power of R can be vastly magnified by combining it with the power of a Hadoop cluster. Hadoop is very a popular framework that provides such parallel processing capabilities. So, we can use R algorithms or analysis processing over Hadoop clusters to get the work done.
• 38. Datapedia 38 If we think of a combined RHadoop system, R takes care of the data analysis operations and the preliminary functions, such as data loading, exploration, analysis, and visualization, while Hadoop takes care of parallel data storage as well as computation power over the distributed data. Prior to the advent of affordable big data technologies, analysis used to be run on limited datasets on a single machine. Advanced machine learning algorithms are most effective when applied to large datasets, and this is possible only with large clusters where the data can be stored and processed using distributed storage systems.
• 40. Datapedia 40 3.7 Sample Mock-ups We have chosen Windows 10 as the product to be evaluated in our prototype. We take the word or hashtag the user entered and use it to search for tweets that contain that specific term. We then classify these tweets as positive, negative, or neutral; this classification is based on a word-level sentiment lexicon in which each word has a score on a scale from -5 (very negative) to +5 (very positive). Finally, the results of the data analytics phase are presented to the user on a dashboard to help them make an informed decision. Figure 1: Home Page
  • 41. Datapedia 41 Figure 2: Sentiment Scores Figure 3: Map Dimension (Demographic)
• 42. Datapedia 42 Chapter 4: System Implementation and Management The proposed project consists of 4 main phases, as follows: 4.1 Phase 1: Data Collection and Preprocessing In this phase, we collected relevant data from social networks as well as product review sites. We first used different APIs to grab the data, then transformed it into R data frames so it could be easily parsed, accessed, and queried. Our goal was to collect millions of product-related records. The collected data includes key attributes such as: - "text": The text that will be analyzed later in the sentiment analysis phase. - "favorited": A Boolean attribute showing whether the tweet was favorited by any Twitter user. - "favoriteCount": How many times the tweet was favorited. - "created": The time at which the tweet was posted. - "id": The tweet ID. - "statusSource": The source of the tweet, i.e., whether it was posted from the website or from a mobile application. - "screenName": The handle (user name) of the user who wrote the tweet. - "retweetCount": How many times the tweet was retweeted. - "isRetweet": A Boolean attribute showing whether the tweet is itself a retweet of another user's tweet. - "retweeted": A Boolean attribute showing whether the tweet was retweeted by any Twitter user. - "longitude": Map coordinate giving the location of the user who posted the tweet. - "latitude": Map coordinate giving the location of the user who posted the tweet.
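A hedged sketch of this collection step using the twitteR package is shown below; the OAuth credential strings are placeholders and the search term is only an example. twListToDF() flattens the returned status objects into a data frame whose columns correspond to the attributes listed above.

# Sketch of the Twitter collection step with the twitteR package.
# All credential strings below are placeholders, not real keys.
library(twitteR)
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# The search API only reaches back about one week, hence the daily
# downloads described in the next section.
raw_tweets <- searchTwitter("windows10", n = 2000, lang = "en")

# Flatten the status objects into a data frame with columns such as text,
# favorited, favoriteCount, created, id, statusSource, screenName,
# retweetCount, isRetweet, retweeted, longitude and latitude.
tweets_df <- twListToDF(raw_tweets)

# Append today's batch to the growing historical dataset.
write.csv(tweets_df, file = paste0("tweets_", Sys.Date(), ".csv"), row.names = FALSE)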
• 43. Datapedia 43 We started this phase in October, and it was an ongoing phase: we kept downloading tweets from October until May 2015. One of the problems we faced was the limitation of the Twitter search API, which only returns tweets up to about one week old; you must be a Twitter search partner to request older tweets. To overcome this, we downloaded tweets on a daily basis to build our own historical dataset. By the end of May 2015, we had successfully downloaded 300,000 tweets. Another problem was that the data contained many commercial tweets and ads, such as promotions. To overcome this, we decided to collect pure datasets from sources that provide user reviews on specific products. We collected about 500 user reviews manually from websites that host real reviews from real users, such as GSMArena and Reevoo. We then realized it would take a long time to gather a decent number of reviews this way, so we built our own web crawler in Python to fetch the reviews from those sites automatically. 4.2 Phase 2: Sentiment Analysis. In this stage, the data collected in Phase 1 is classified as positive, negative, or neutral. This classification is based on a word-level sentiment lexicon. Natural language processing techniques are used in this stage to parse each sentence, understand it, and finally classify it into one of the predefined classes.
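A minimal sketch of this lexicon-based classification is shown below, including the simplified negation flip described in the next section. The lexicon entries and negation words are illustrative only; the real lexicon assigns each word a score between -5 and +5.

# Simplified sketch of lexicon-based tweet scoring with a negation flip.
# Lexicon entries and negation words below are illustrative examples.
lexicon   <- c(great = 3, excellent = 4, good = 2, bad = -2, disappointment = -3)
negations <- c("not", "no", "never", "dont", "cant")

score_tweet <- function(text) {
  words  <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
  score  <- 0
  negate <- FALSE
  for (w in words) {
    if (w %in% negations) { negate <- TRUE; next }   # flip later sentiment words
    s <- lexicon[w]
    if (!is.na(s)) score <- score + ifelse(negate, -1, 1) * unname(s)
  }
  score   # > 0 positive, < 0 negative, 0 neutral
}

score_tweet("The phone is not a disappointment")      # returns +3 (positive)
score_tweet("Excellent camera with video facility")   # returns +4 (positive)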
• 44. Datapedia 44 This was a crucial step, as its output is what ultimately drives the customer's decision about the product. Our algorithm was enhanced more than once to reach better accuracy. Our first algorithm was very basic and relied on simply matching the words of the tweet against the lexicon, and it led to poor results. We then identified the reasons behind these poor results, namely special cases that need to be handled, such as negation and sarcasm. Negation was handled in the algorithm by searching for negation terms during the sentiment scoring process; if one is found, the scores of the following words are automatically multiplied by -1 until the end of the statement is reached. In this way we achieved a much better accuracy score: we used the pure dataset collected in the data collection phase, labeled it manually to establish its overall sentiment, and then analyzed it with our algorithm. 4.3 Phase 3: Data Analytics. In this phase, the R language is used to handle the large data volumes and carry out the data analytics task. The classified data is the input to this module, and different techniques are used (and/or compared) to obtain the best analytical results. For big data analytics (which is a whole new process), we cannot rely on the usual way of executing the algorithm, as it would take too much time; even more time is needed just to fetch the data into memory, since R must operate on data held in memory, plus the time taken by the analytic process itself, which is substantial given everything it involves:
• 45. Datapedia 45 cleaning the tweets, splitting each tweet into words and looking those words up in the lexicon, and summing the resulting scores. This is where Hadoop comes in. We did not use Hadoop's Java API directly; we used Hadoop Streaming, a framework that gives developers the ability to execute Hadoop tasks in languages other than Java, such as C, C++, Python, and R. It therefore perfectly suits what we are looking for: combining the analytical power of R with the fault-handling capabilities of Hadoop. Hadoop Streaming is a Hadoop utility for running a MapReduce job with executable scripts as the Mapper and Reducer. It is similar to the pipe operation in Linux: the text input file is streamed to standard input (stdin), which is provided as input to the Mapper; the Mapper's standard output (stdout) is provided as input to the Reducer; finally, the Reducer writes the output to an HDFS directory. The main advantage of the Hadoop Streaming utility is that it allows both Java and non-Java MapReduce jobs to be executed over Hadoop clusters, and it also tracks the progress of running MapReduce jobs. Hadoop Streaming supports, among others, the Perl, Python, PHP, R, and C++ programming languages. To run an application written in another programming language, the developer just needs to translate the application logic into the Mapper and Reducer sections with key and value output elements.
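To make the stdin/stdout contract concrete, here is a hedged sketch of what a minimal pair of streaming scripts could look like in R (a plain word count rather than our actual sentiment job); the file names mapper.R and reducer.R are example names of our own.

# ---- mapper.R : Hadoop Streaming feeds each input split on stdin; the
# ---- mapper must write tab-separated "key<TAB>value" lines to stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (w in unlist(strsplit(tolower(line), "\\s+")))
    if (nzchar(w)) cat(w, "\t", "1", "\n", sep = "")
}
close(con)

# ---- reducer.R : streaming sorts the mapper output by key, so all values
# ---- for one key arrive on consecutive lines and can simply be summed.
con <- file("stdin", open = "r")
current_key <- NULL; total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  parts <- strsplit(line, "\t")[[1]]
  if (!is.null(current_key) && parts[1] != current_key) {
    cat(current_key, "\t", total, "\n", sep = "")   # emit the finished key
    total <- 0
  }
  current_key <- parts[1]
  total <- total + as.numeric(parts[2])
}
if (!is.null(current_key)) cat(current_key, "\t", total, "\n", sep = "")
close(con)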
• 46. Datapedia 46 The six main components of a Hadoop Streaming command are listed and explained as follows: • jar: This option runs the jar containing the classes that provide the streaming functionality for Java as well as non-Java Mappers and Reducers. It is called the Hadoop streaming jar. • input: This option specifies the location of the input dataset (stored on HDFS) for the Hadoop Streaming MapReduce job. • output: This option tells the Hadoop Streaming MapReduce job which HDFS output directory the results should be written to. • file: This option copies MapReduce resources such as the Mapper, Reducer, and Combiner to the compute nodes (TaskTrackers) so they are available locally. • mapper: This option identifies the executable Mapper file. • reducer: This option identifies the executable Reducer file. RHadoop is a collection of three R packages for providing large-data operations within an R environment. It was developed by Revolution Analytics, the leading commercial provider of software based on R. RHadoop ships as three main R packages: rhdfs, rmr, and rhbase, each offering different Hadoop features. • rhdfs is an R interface that exposes HDFS from the R console. As Hadoop MapReduce programs write their output to HDFS, it is very easy to access it by calling the rhdfs methods. The R programmer can easily perform read and write operations on distributed data files; under the hood, the rhdfs package calls the HDFS API to operate on data stored on HDFS. • rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment. The R programmer only needs to divide the application logic into map and reduce phases and submit it with the rmr methods. After that, rmr calls the Hadoop Streaming MapReduce API with several job
• 47. Datapedia 47 parameters such as the input directory, output directory, mapper, reducer, and so on, to run the R MapReduce job over the Hadoop cluster. • rhbase is an R interface for operating on the Hadoop HBase data source, which is stored on the distributed network and accessed via a Thrift server. The rhbase package provides several methods for initialization, read/write, and table manipulation operations. Installing the R packages: Several R packages need to be installed to help connect R with Hadoop. They can be installed by executing the following command in the R console: install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr', 'functional', 'devtools', 'plyr', 'reshape2')) • Setting environment variables: We can set these from the R console using the following code: ## Setting HADOOP_CMD Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop") ## Setting HADOOP_STREAMING Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar") Alternatively, we can set them from the command line before starting R, as follows: export HADOOP_CMD=/usr/local/hadoop export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-1.2.1.jar
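With the packages installed and the environment variables set, a job can be expressed with the rmr package as sketched below. This is a hedged illustration only: the function names (mapreduce, keyval, to.dfs, from.dfs) follow the rmr2 release of RHadoop, and the word-count logic is an example rather than our actual sentiment job.

# Minimal rmr2 sketch: a word count expressed as map and reduce functions.
# Illustrative only; our real job scores tweets rather than counting words.
library(rmr2)

# to.dfs() writes an R object to HDFS so the job can read it as its input.
tweets <- to.dfs(c("windows10 is great", "great update for windows10"))

wordcount <- mapreduce(
  input = tweets,
  map = function(k, v) {
    words <- unlist(strsplit(tolower(v), "\\s+"))
    keyval(words, rep(1, length(words)))     # emit (word, 1) pairs
  },
  reduce = function(word, counts) {
    keyval(word, sum(counts))                # sum the counts per word
  }
)

from.dfs(wordcount)   # pull the result back from HDFS into the R session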
• 48. Datapedia 48 In our project, as implied earlier, we are dealing with a text file that contains more than 300,000 tweets. During the data analytics process we put that file into the HDFS directory using the HDFS commands in Linux; the file itself is split into 20 parts, with a replication factor of one on a single-node cluster. We used 2 mappers in our project, which is enough for our Intel Core i7 CPU, meaning that two tasks can run simultaneously on a core since hyper-threading is enabled.
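We performed the upload step with the HDFS shell; for completeness, a hedged sketch of the equivalent operation from the R console via the rhdfs package is shown below. The paths are illustrative, and HADOOP_CMD must be set before the package is loaded.

# Equivalent of the HDFS shell step from inside R, via the rhdfs package.
# Paths below are illustrative only.
library(rhdfs)
hdfs.init()

hdfs.mkdir("/user/datapedia/input")                          # create the input directory
hdfs.put("tweets.txt", "/user/datapedia/input/tweets.txt")   # copy the local file to HDFS

hdfs.ls("/user/datapedia/input")   # confirm the file is now managed by HDFS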
• 49. Datapedia 49 Using plain R, it took more than 5 minutes just to load the data into memory and more than 20 minutes to process it and produce the results; with Hadoop, it took only 15 minutes to get all the results, in a much simpler way.
• 50. Datapedia 50 4.4 Phase 4: Data Presentation. This is the final step, where the results obtained are presented to the user in a user-friendly manner. Dashboards are used to present the analytics results, maps are used to represent geographically-related information, and pie charts, bar charts, and similar visualizations are used for the analytical data. The user is given the freedom to drill down and roll up through the analytics granularity so that (s)he can obtain a global view before making decisions. For business analysis reports, we also provide some useful analytics, such as: 1- Most retweeted tweets. 2- Most favorited tweets. 3- Most frequently accompanying hashtags. These come in handy for finding trends or topics related to the user's search text. We used PHP, Bootstrap, and JavaScript to implement the user interface, giving the user the option to enter a preferred search term in the real-time section and the option to upload a file of tweets in the historical section, as we had to ensure that our application stays generic and error free.
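The three report items above can be computed directly from the collected data frame before being handed to the dashboard. The sketch below assumes a tweets_df data frame with the columns listed in Phase 1 (text, retweetCount, favoriteCount, screenName); the object name is an assumption for the example.

# Sketch of the business-report summaries, assuming tweets_df holds the
# columns collected in Phase 1 (text, retweetCount, favoriteCount, ...).

# 1) Most retweeted tweets.
top_retweeted <- head(tweets_df[order(-tweets_df$retweetCount),
                                c("screenName", "text", "retweetCount")], 10)

# 2) Most favorited tweets.
top_favorited <- head(tweets_df[order(-tweets_df$favoriteCount),
                                c("screenName", "text", "favoriteCount")], 10)

# 3) Most frequently accompanying hashtags, extracted from the tweet text.
hashtags     <- unlist(regmatches(tweets_df$text, gregexpr("#\\w+", tweets_df$text)))
top_hashtags <- head(sort(table(tolower(hashtags)), decreasing = TRUE), 10)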
• 52. Datapedia 52 Figure 4: Commercial tweets. Chapter 5: System Evaluation 5.1 Data Evaluation As mentioned earlier, our data is fetched directly from the Twitter search API, but we noticed several problems in this data that affected the performance of our algorithm and of the system overall. One of the problems we faced is that the data contains a lot of ads and commercial tweets, as shown in Figure 4. To overcome this problem we decided to collect pure datasets from sources that provide user reviews on specific products: we collected about 500 user reviews manually from websites that host real reviews from real users, such as GSMArena and Reevoo.
• 53. Datapedia 53 Screenshot of the pure dataset:
• 54. Datapedia 54 Figure 5: Algorithms Comparison 5.2 Algorithm Evaluation Our main objective in this project was the accuracy of the sentiment analysis scoring algorithm, and the algorithm was enhanced more than once to reach better accuracy. Our first algorithm was very basic and relied on simply matching the words of the tweet against the lexicon, which led to poor results. We then identified the reasons behind these poor results, namely special cases that need to be handled, such as negation and sarcasm. After handling them, we achieved a much better accuracy score: using the pure dataset described in the data evaluation section, we labeled it manually to establish its overall sentiment, analyzed it with each version of our algorithm, and compared the resulting scores, as shown in Figure 5.
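The accuracy figures reported in this chapter come from comparing the algorithm's labels against our manual labels on the pure dataset. A minimal sketch of that comparison is shown below; the column names manual_label and predicted_label, and the toy values, are assumptions made for the example.

# Sketch of the accuracy comparison against the manually labelled reviews.
# Column names and values are toy examples, not our real evaluation data.
reviews <- data.frame(
  manual_label    = c("positive", "negative", "neutral", "positive"),
  predicted_label = c("positive", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

accuracy <- mean(reviews$predicted_label == reviews$manual_label)
cat(sprintf("Accuracy: %.1f%%\n", 100 * accuracy))   # 75.0% on this toy sample

# A confusion table also makes the per-class behaviour visible.
table(manual = reviews$manual_label, predicted = reviews$predicted_label)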
• 55. Datapedia 55 5.3 System Evaluation Finally, we wanted to benchmark our algorithm's performance against other algorithms. We chose Stanford's recursive deep model, as it is very popular and widely used for sentiment analysis. The two algorithms were tested on the same data that we collected and manually labeled in the data evaluation phase. As shown in Figure 6, our algorithm scored 61.7% accuracy, while Stanford's recursive deep model scored 55.8%. Here are some examples from the two algorithms to illustrate this benchmark: Example 1:
• 56. Datapedia 56 The text says: "The camera quality and options got better, overall, the phone is not a disappointment". The review is positive. Stanford's recursive deep model scored 0 for this sentence, classifying it as neutral, while our algorithm scored +4, classifying it as positive. Example 2: The text says: "Excellent camera with video facility". The review is positive. Stanford's recursive deep model scored 0 for this sentence, classifying it as neutral.
• 57. Datapedia 57 Our algorithm scored +3, classifying it as positive.
• 58. Datapedia 58 References - Book 1: Big Data Analytics with R and Hadoop. Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK. First published: November 2013. Production reference: 1181113. ISBN 978-1-78216-328-2. - The Comprehensive R Archive Network (CRAN): http://cran.rstudio.com/ - Course (1): Swirl: Learn R, in R. http://swirlstats.com/ - Course (2): R Programming. By: Johns Hopkins University. https://www.coursera.org/course/rprog - Course (3): Getting and Cleaning Data. By: Johns Hopkins University. https://www.coursera.org/course/getdata - Course (4): Exploratory Data Analysis. By: Johns Hopkins University. https://www.coursera.org/course/exdata
• 59. Datapedia 59 - Discussion article (1): What are the best supervised learning algorithms for sentiment analysis in text? http://www.quora.com/What-are-the-best-supervised-learning-algorithms-for-sentiment-analysis-in-text - Article (1): Why Sentiment Analysis Is Essential for Small Businesses. By: Tara Hornor. http://graziadiovoice.pepperdine.edu/why-sentiment-analysis-is-essential-for-small-businesses/ - Article (2): The 37 best tools for data visualization. By: Brian Suda and Sam Hampton-Smith. http://www.creativebloq.com/design-tools/data-visualization-712402