Graduation Project Report
Faculty of Computers and Information
Information Systems Dept.
Cairo University
2014/2015
Datapedia
Project Team:
Abanoub A. Amin. (20110002)
Ahmed A. Ahmed. (20110070)
Ahmed M. Osman (Team Leader). (20110086)
Ahmed N. Muhammad. (20110099)
Ahmed R. Negm. (20110053)
Under the supervision of:
Dr. Hoda M. O. Mokhtar.
Eng. Mohamed Hafez.
Table of Contents
Chapter 1: System Proposal ……………………………………………………….……………. 4
1.1 Introduction …………………………………………………………………………………… 4
1.2 Problem Statement ………………………………………………………………………….. 5
1.3 Project Justification ………………………………………………………………………….. 5
1.4 Project Scope ………………………………………………………………………………… 5
1.5 Technologies & Tools …………….………………………………………………………….. 6
1.6 Limitations & Exclusions ………………………………………………………..…………... 23
Chapter 2: System Analysis ……………………………………………..…………………….. 25
2.1 Project Stakeholders ……………………………………………………….…….………… 25
2.2 Non-Functional Requirements ………………………………………….……….……….. 25
2.3 Functional Requirements ………………………………………………….…….………... 26
2.4 User Profiles ………………………………………………………………….…….…….…… 27
2.5 Task Profiles ………………………………………………………………….…………..…… 27
2.6 Environmental Profiles …………………………………………………….……………..… 28
2.7 Sample Personas …………………………………………………………………….……... 28
2.8 Competitive Analysis ………………………………………………………………….…… 30
Chapter 3: System Design ……………………………………………………….…………..… 31
3.1 System Development Methodology ……………………………………………………. 31
3.2 Main Objectives ……………………………………………………………….……………. 31
3.3 Secondary objectives ………………………………………………………….………….. 32
3.4 Use Case Diagram …………………………………………………………………………. 33
3.5 Backend Design ………………………….…………………………………………………. 34
3.6 Block Diagram ………………………………………………………………….…………… 39
3.7 Sample Mock-ups ……………………………………………………………..……………. 40
Chapter 4: System Implementation & Management …………....…………..………..….. 42
4.1 Phase 1: Data Collection and Preprocessing …………………………………………. 42
4.2 Phase 2: Sentiment Analysis ………………………………………………………………. 43
4.3 Phase 3: Data Analytics …………………………………………………………………… 44
4.4 Phase 4: Data presentation ………………………………………………………………. 50
4.5 Algorithm Flowchart ………………………………………………………………….…….. 51
Chapter 5: System Evaluation …………………………………………………..…………….. 52
5.1 Data Evaluation ……………………………………………………………..……………… 52
5.2 Algorithm Evaluation…………………………………………………………………..….... 54
5.3 System Evaluation ……………………………………………………………..………….... 55
References ……………………………………….………………………………………….....… 58
Chapter 1: System Proposal.
1.1 Introduction.
Nowadays, our personal lives are highly dependent on the technology that
people have developed. We are currently witnessing an era where new
competitive products are produced almost every day. A key impact of this
rapid technological advancement is that it has affected several aspects of
our lifestyle, including the way we purchase products, the way we
communicate, the way we travel and, not least, the way we learn.
Despite the importance of these advancements, today, buying a certain
product is no longer a trivial task that can be completed in minutes. Almost
everything we use has been refurbished to better standards. New features,
new specifications, price variations, new applications, and many other
dimensions affect our purchasing decision, making it more sophisticated and
requiring better decision-making approaches. Although this impact is clear
on the personal level, it is also obvious in larger organizations, where
purchasing machines, computers, and equipment is also a crucial decision
that costs money. Thus, traditional purchasing approaches, where a customer
selects by shape, size, or just price, are no longer efficient.
Today, we all compare between products, ask for recommendations, and
navigate through the web for product reviews, aiming to find clues that help
us in our decision-making. However, going through different sites to collect
such comparative data, or chatting on social networks to find
recommendations, is not an appealing task for many people.
Inspired by the importance of this decision-making problem, and the role
social networks and online reviews play in it, we aim to integrate different
product-related data sources to provide the customer with an integrated
view of the product he or she wants to purchase.
1.2 Problem Statement.
Buying a product whether it is a computer, a mobile phone, a tablet, or
even a car is no longer a simple task. With the increase in prices of new
products and the wide range of competing products that offer similar
functionalities at a lower price, choosing between alternatives turns out to
be difficult.
1.3 Project Justification.
It is now common that, before buying any product, people at least ask their
friends or relatives for recommendations; others search the internet for
customers' reviews, and others tweet or chat over social networks to gather
information. Going through all those paths is not an appealing approach for
many people, especially those with minimal computer education. Thus, it
would be more appealing if customers could enter a single website, or run a
mobile application, that simply does the work and presents the final
comparison in a user-friendly, analytical approach.
1.4 Project Scope.
In this project, we aim to design a web application that analyzes big volumes
of product reviews, social network posts, and tweets related to the product,
and then presents the results of this big data analytics job in a user-friendly,
understandable, and easily interpreted manner that can be readily used by
different customers for different purposes.
1.5 Technologies & Tools.
1- R Language 3.1.2 & RStudio 0.98.1091.
R is an open source software package for performing statistical analysis on
data. It is a programming language used by data scientists, statisticians, and
others who need to perform statistical analysis of data and glean key insights
from it using mechanisms such as regression, clustering, classification, and
text analysis. R is released under the GNU General Public License. It was
developed by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently maintained by the R Development
Core Team. It can be considered a different implementation of S, developed
by John Chambers at Bell Labs. There are some important differences, but a
lot of the code written in S runs unaltered under the R interpreter.
R provides a wide variety of statistical, machine learning (linear and
nonlinear modeling, classic statistical tests, time-series analysis,
classification, clustering) and graphical techniques, and is highly
extensible. R has various built-in as well as extended functions for
statistical, machine learning, and visualization tasks such as:
• Data extraction
• Data cleaning
• Data loading
• Data transformation
• Statistical analysis
• Predictive modeling
• Data visualization
It is one of the most popular open source statistical analysis packages
available on the market today. It is cross platform, has a very wide
community support, and a large and ever-growing user community who
are adding new packages every day.
With its growing list of packages, R can now connect with other data
stores, such as MySQL, SQLite, MongoDB, and Hadoop for data storage
activities.
Let's see different useful features of R:
• Effective programming language
• Relational database support
• Data analytics
• Data visualization
• Extension through the vast library of R packages
According to a KDnuggets poll, R is the most popular language for data
analysis and mining. The total number of R packages released by R users also
grew steadily from 2005 to 2013: the growth was exponential in 2012, and
2013 was on track to beat that. R allows performing data analytics through
various statistical and machine learning operations such as the following
(two one-line illustrations are given after the list):
• Regression
• Classification
• Clustering
• Recommendations
• Text mining
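For illustration, here are two one-liners showing the kind of built-in modeling R offers out of the box (run on R's bundled example datasets, not on our project data):
fit      <- lm(mpg ~ wt + hp, data = mtcars)   # regression on the built-in mtcars dataset
clusters <- kmeans(iris[, 1:4], centers = 3)   # clustering on the built-in iris measurements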
Using R enhanced our project, giving us the ability to run accurate analytics
on tweets to figure out each user's opinion.
Installation:
First we install R for Windows, which is required before installing RStudio.
Then we can download and install RStudio.
RStudio interface:
Twitter Authentication with R:
After installing our packages and libraries we need to create a new
Twitter application to be used in the data collection phase where we use
this application to interact with Twitter search API.
This app is needed for the authentication process with Twitter. Creating
the Twitter application and doing the handshake is a must as you have to
do it every time you want to get data from Twitter with R.
First we go to https://dev.twitter.com/ and log in with a Twitter Account.
After creating the application, we can get our api_key and our
api_secret as well as our access_token and access_token_secret from our
app settings on Twitter.
And that’s it. Now we can search Twitter anytime we want and get the
data we need.
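The handshake can be scripted with the twitteR package; a minimal sketch is shown below (the key values are placeholders for the ones taken from the app settings page):
library(twitteR)
api_key             <- "YOUR_API_KEY"
api_secret          <- "YOUR_API_SECRET"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"
# Perform the OAuth handshake once per session before calling the Search API.
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)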
2- Apache Hadoop & Hadoop Streaming v1.2.1.
As mentioned earlier, we are about to use the R language to perform
some analytical tasks on massive amount of data.
Big Data has to deal with large and complex datasets that can be
structured, semi-structured, or unstructured and will typically not fit into
memory to be processed. They have to be processed in place, which
means that computation has to be done where the data resides for
processing. When we talk to developers, the people actually building Big
Data systems and applications, we get a better idea of what they mean by
the 3Vs: they typically mention the 3Vs model of Big Data, which are velocity,
volume, and variety.
Velocity refers to the low latency, real-time speed at which the analytics
need to be applied. A typical example of this would be to perform
analytics on a continuous stream of data originating from a social
networking site or aggregation of disparate sources of data.
Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB
based on the type of the application that generates or receives the data.
Variety refers to the various types of the data that can exist, for example,
text, audio, video, and photos.
Big Data usually includes datasets with sizes beyond the ability of
conventional systems to process within the time frame mandated by the
business. Big Data volumes are a constantly moving target: as of 2012, they
ranged from a few dozen terabytes to many petabytes of data in a single
dataset. Faced with this seemingly insurmountable challenge, entirely new
platforms have emerged, called Big Data platforms.
Some of the popular organizations that hold Big Data are as follows:
• Facebook: It has 40 PB of data and captures 100 TB/day.
• Yahoo!: It has 60 PB of data.
• Twitter: It captures 8 TB/day.
• EBay: It has 40 PB of data and captures 50 TB/day.
How much data is considered Big Data differs from company to company.
Though it is true that one company's Big Data is another's small data, there is
something in common: the data does not fit in memory or on a single disk, it
arrives rapidly and needs to be processed, and it would benefit from
distributed software stacks. For some companies, 10 TB of data would be
considered Big Data, and for others 1 PB would be Big Data.
So only you can determine whether your data is really Big Data; it is sufficient
to say that it would start in the low terabyte range.
We will use Apache Hadoop as the tool to handle this big amount of data
for the sake of our project.
Apache Hadoop is an open source Java framework for processing and
querying vast amounts of data on large clusters of commodity hardware.
Hadoop is a top level Apache project, initiated and led by Yahoo! and
Doug Cutting. It relies on an active community of contributors from all
over the world for its success.
With a significant technology investment by Yahoo!, Apache Hadoop has
become an enterprise-ready cloud computing technology. It is
becoming the industry de facto framework for Big Data processing.
Hadoop changes the economics and the dynamics of large-scale
computing. Its impact can be boiled down to four salient characteristics.
Hadoop enables scalable, cost-effective, flexible, fault-tolerant solutions.
Apache Hadoop has two main features:
• HDFS (Hadoop Distributed File System)
• MapReduce
HDFS is Hadoop's own rack-aware file system, which is a UNIX-based data
storage layer of Hadoop. HDFS is derived from concepts of Google file
system. An important characteristic of Hadoop is the partitioning of data
and computation across many (thousands of) hosts, and the execution of
application computations in parallel, close to their data. On HDFS, data
files are replicated as sequences of blocks in the cluster.
A Hadoop cluster scales computation capacity, storage capacity, and
I/O bandwidth by simply adding commodity servers. HDFS can be
accessed from applications in many different ways. Natively, HDFS
provides a Java API for applications to use.
The Hadoop clusters at Yahoo! span 40,000 servers and store 40
petabytes of application data, with the largest Hadoop cluster being
4,000 servers. Also, one hundred other organizations worldwide are known
to use Hadoop.
Characteristics of HDFS:
• Fault tolerant
• Runs with commodity hardware
• Able to handle large datasets
• Master slave paradigm
• Write-once, read-many file access
MapReduce is a programming model for processing large datasets
distributed on a large cluster. MapReduce is the heart of Hadoop. Its
programming paradigm allows performing massive data processing
across thousands of servers configured with Hadoop clusters. This is
derived from Google MapReduce.
Hadoop MapReduce is a software framework for writing applications
easily, which process large amounts of data (multi-terabyte datasets) in
parallel on large clusters (thousands of nodes) of commodity hardware in
a reliable, fault-tolerant manner.
This MapReduce paradigm is divided into two phases, Map and Reduce, that
mainly deal with key/value pairs of data. The Map and Reduce tasks run
sequentially in a cluster; the output of the Map phase becomes the input of
the Reduce phase. These phases are explained as follows:
• Map phase: Once divided, datasets are assigned to the task tracker to
perform the Map phase. The data functional operation will be performed
over the data, emitting the mapped key and value pairs as the output of
the map phase.
• Reduce phase: The master node then collects the answers to all the
sub-problems and combines them in some way to form the output; the
answer to the problem it was originally trying to solve.
The five common steps of parallel computing are as follows:
1. Preparing the Map() input: This takes the input data row-wise and emits a
key/value pair per row, or we can explicitly change this as per the requirement.
° Map input: list (k1, v1)
2. Run the user-provided Map() code.
° Map output: list (k2, v2)
3. Shuffle the Map output to the Reduce processors: shuffle the similar keys
(grouping them) and input them to the same reducer.
4. Run the user-provided Reduce() code: This phase runs the custom reducer
code designed by the developer on the shuffled data and emits key and value.
° Reduce input: (k2, list (v2))
° Reduce output: (k3, v3)
5. Produce the final output: Finally, the master node collects all reducer
output, combines it, and writes it to a text file. (A toy, in-memory illustration
of these steps follows.)
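The same five steps can be mimicked in plain R on a toy word-count example. This is only an in-memory illustration of the key/value flow, not Hadoop itself:
rows <- c("hadoop splits data", "hadoop processes data")
# 1-2. Map: emit a (word, 1) pair for every word in every row.
pairs <- do.call(c, lapply(rows, function(v) {
  words <- strsplit(v, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
# 3. Shuffle: group the emitted values by key (the word).
groups <- split(unname(pairs), names(pairs))
# 4. Reduce: sum the values of each key.
counts <- sapply(groups, sum)
# 5. Final output.
print(counts)   # data = 2, hadoop = 2, processes = 1, splits = 1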
In our system, we used Hadoop 1.2.1 as it is a stable version of Hadoop, and it
can be downloaded from the official Apache Hadoop website.
Hadoop 1.2.1 needs OpenJDK 7, and JAVA_HOME must be added to the
Ubuntu environment; for that purpose you need to edit /etc/environment.
A dedicated user must be added in order to launch Hadoop, and it is also
recommended to create a hadoop group, because during the installation
and configuration of Hadoop we need separation and privacy from other
users.
Hadoop requires SSH access to manage its nodes, i.e. the remote machines
plus your local machine if you want to use Hadoop on it, so an SSH server
must be configured. Configuring it amounts to generating a key pair for the
dedicated Hadoop user and adding the public key to that user's authorized
keys, so that Hadoop can log in to each node without a password.
3- RHadoop.
RHadoop is a collection of five R packages that allow users to manage and
analyze data with Hadoop. The packages are regularly tested (and always
before a release) on recent releases of the Cloudera and Hortonworks
Hadoop distributions and should have broad compatibility with open source
Hadoop and MapR's distribution. They are normally tested on recent
Revolution R and CentOS releases, but all the RHadoop packages are
expected to work on a recent release of open source R and Linux.
RHadoop consists of the following packages:
ravro: read and write files in avro format.
plyrmr: higher level plyr-like data processing for structured data, powered
by rmr.
rmr: functions providing Hadoop MapReduce functionality in R.
rhdfs: functions providing file management of the HDFS from within R.
rhbase: functions providing database management for the HBase
distributed database from within R.
4- PHP.
PHP (recursive acronym for PHP: Hypertext Preprocessor) is a widely-used
open source general-purpose scripting language that is especially suited
for web development and can be embedded into HTML.
What distinguishes PHP from something like client-side JavaScript is that the
code is executed on the server, generating HTML which is then sent to the
client. This completely suits our target, as the server has to have a specific
environment (R, Hadoop, and so on) that will not be available on every
regular user's PC. The client receives the results of running that script, but
does not know what the underlying code was.
PHP is mainly focused on server-side scripting, so it can do anything any
other CGI program can do, such as collect form data, generate dynamic
page content, or send and receive cookies and much more.
There are three main areas where PHP scripts are used.
• Server-side scripting. This is the most traditional and main target field for
PHP. You need three things to make this work. The PHP parser (CGI or
server module), a web server and a web browser.
• Command line scripting. You can make a PHP script run without any server
or browser; you only need the PHP parser to use it this way. This type of
usage is ideal for scripts regularly executed using cron (on *nix or Linux) or
Task Scheduler (on Windows). These scripts can also be used for simple text
processing tasks. And that is what we are looking for: we will actually use
PHP to execute the R script that does the sentiment analysis.
• Writing desktop applications. PHP is probably not the very best
language to create a desktop application with a graphical user interface,
but if you know PHP very well, and would like to use some advanced PHP
features in your client-side applications you can also use PHP-GTK to write
such programs. You also have the ability to write cross-platform
applications this way.
5- Google Charts API.
The Google Chart API is a tool that lets people easily create a chart from
some data and embed it in a web page. Google creates a PNG image
of a chart from data and formatting parameters in an HTTP request. Many
types of charts are supported, and by making the request into an image
tag, people can simply include the chart in a web page.
Charts are exposed as JavaScript classes, and Google Charts provides many
chart types for you to use. The default appearance will usually be all we
need, and we can always customize a chart to fit the look and feel of the
website.
Charts are highly interactive and expose events that let us connect them
to create complex dashboards or other experiences integrated with our
webpage. Charts are rendered using HTML5/SVG technology to provide
cross-browser compatibility (including VML for older IE versions) and cross
platform portability to iPhones, iPads and Android. Our users will never
have to mess with plugins or any software. If they have a web browser,
they can see the charts.
All chart types are populated with data using the DataTable class,
making it easy to switch between chart types as you experiment to find
the ideal appearance. The DataTable provides methods for sorting,
modifying, and filtering data, and can be populated directly from your
web page, a database, or any data provider supporting the Chart Tools
Datasource protocol.
The Google Charts API has many useful characteristics that go perfectly with
our project goal of having visual aids for the resulting analytics, such as:
• Free: we use the same chart tools Google uses, completely free and with
three years' backward compatibility guaranteed.
• Customizable: it allows us to configure an extensive set of options to
perfectly match the look and feel of our website.
• Controls and Dashboards: it is easily connected to an interactive dashboard.
6- Google Maps.
As said before, Google Maps will be used within Datapedia to provide
graphical analytics aids.
On today's Web, mapping solutions are a natural ingredient. We use them to
show the locations of the tweets and visualize their content as colored pins,
where the color indicates whether a tweet is positive, negative, or neutral.
Many tweets have a location, and if a location exists, it can be displayed
on a map.
There are several mapping solutions including Yahoo! Maps and Bing
Maps, but the most popular one is Google Maps. In fact, according to
Programmableweb.com, it’s the most popular API on the
Internet.
The Google Maps API lets you harness the power of Google Maps to use
in Datapedia to display the located tweets in an efficient and usable
manner.
First of all, we need to know the term "coordinates". Coordinates are used to
express locations in the world. There are several different coordinate systems;
the one used in Google Maps is the World Geodetic System 84 (WGS 84),
which is the same system the Global Positioning System (GPS) uses. The
coordinates are expressed using latitude and longitude; you can think of
these as the y and x values in a grid.
To create a Google map, we must know that it resides in a web page. So,
the first thing you need to do is to set up that page. This includes creating
an HTML page and a style sheet for it. Once you have everything set up,
you can insert the actual map.
Now that you have a web page set up, you’re ready to load the Google
Maps API. The API is a JavaScript file that is hosted on Google’s servers. It’s
loaded with a <script> element in the <head> section of the page. The
<script> element can be used to add remote scripts into a web page,
and that’s exactly what you want to do here.
The most common use of maps on the Internet is to visualize the geographic
position of something. The Google Maps marker is the perfect tool for doing
this. A marker is basically a small image that is positioned at a specific place
on a map. In Datapedia, it indicates that a tweet exists at that location and,
when clicked, it shows the content of the tweet.
When marking places on a map, we want to show additional information
related to that place, like the text content of the located tweet. The Google
Maps API offers a perfect tool for this: the InfoWindow. It looks like a speech
bubble and typically appears over a marker when you click it; in Datapedia,
clicking a marker opens the tweet's text inside an InfoWindow.
7- Opinion Lexicon.
The opinion lexicon is used in the sentiment analysis process: we search for a
match between the words of the tweet and the lexicon and, if a match is
found, give it the corresponding points from the lexicon.
It is like a big dictionary that contains English words and classifies them as
negative or positive based on a score or rating, where more than zero is
positive and less than zero is negative.
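For illustration only (the words and scores below are made up, not actual lexicon entries), the lexicon can be thought of as a table of words with signed scores, and scoring a text amounts to summing the scores of the words that match:
lexicon <- data.frame(word  = c("excellent", "good", "poor", "disappointment"),
                      score = c(3, 2, -2, -3))
# Sum the scores of the words of a text that appear in the lexicon.
sum(lexicon$score[lexicon$word %in% c("excellent", "camera", "good")])   # 3 + 2 = 5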
8- Bootstrap v3.3.5.
Bootstrap is an open source CSS and JavaScript framework that was
originally developed at Twitter by its team of designers and developers.
It is a combination of HTML, CSS, and JavaScript code designed to help
build user interface components. Bootstrap was also programmed to
support both HTML5 and CSS3, and it is often called a front-end framework.
Bootstrap is a free collection of tools for creating websites and web
applications. It contains HTML- and CSS-based design templates for
typography, forms, buttons, navigation, and other interface components, as
well as optional JavaScript extensions.
Some reasons we preferred the Bootstrap framework:
•Easy to get started.
•Great grid system.
•Base styling for most HTML elements (Typography, Code, Tables, Forms,
Buttons, Images, and Icons).
•Extensive list of components.
•Bundled JavaScript plugins.
1.6 Limitations & Exclusions
Language support:
Our system supports and analyzes only tweets written in English, as the
opinion lexicon contains English words only.
Twitter Search API:
Access to the Search API is rate-limited based on the IP address the request is
coming from, and the rate limiting is strict:
- Downloaded tweets are at most one week old, as you must be a Twitter
search partner to be able to request older tweets from the Search API.
twitteR package in R:
- Based on the limits of the searchTwitter() function, we can only download
1,500 vectors representing 1,500 tweets per request.
Chapter 2: System Analysis
2.1 Project Stakeholders
Beneficiaries:
- Users: Regular users can benefit from Datapedia by using it to know
people’s opinions of a certain product or a service to help them decide
what to buy.
- Marketing Agencies: Digital marketers and social media analysts can use
Datapedia to track the volume of activity around campaigns in real time,
including the number of users, find influencers talking about them, and track
negative feedback.
- Decision Makers: Datapedia helps decision makers make more efficient
decisions based on user feedback.
Owners:
The project team and supervisors who will deliver the project. Without them
the project would not happen.
Suppliers:
Twitter is a major supplier as we download the tweets from its API.
Project Sponsor:
Our project was accepted into the ITAC graduation project funding program
and is sponsored by ITIDA.
2.2 Non-Functional Requirements
Usability: Our system is built in a manner that ensures it will be easy and
simple to use, and the majority of users will find it user-friendly and easy to
deal with.
Portability: Our system is built to work on any operating system and on
any web browser.
Accuracy:
- Data accuracy is 100% guaranteed, as the system deals with the Twitter
Search API directly.
- Sentiment scoring accuracy is our main concern in this project, as we spent
a lot of time handling special cases (negation, sarcasm, etc.) to boost our
algorithm's accuracy.
Extensibility:
Our goal is to add more features to our system in the future like time
framing and predicting future trends.
2.3 Functional Requirements
1- Easily understandable analytics with dynamically-changed graphical
aids.
2- High sentiment analysis accuracy, accompanied by calculations and
samples.
3- User Recommendations based on true statistics about related products.
4- Fully interactive dashboard graphs and charts to help users make
better decisions.
2.4 User Profiles
Characteristics across the four user groups (Regular Users / Marketers / Project Managers & Managers / Administrator):
- Age: 10-60 / 23-35 / 35-50 / 25-35
- Gender: 50% males, 50% females / 50% males / 80% males / 100% males
- Education: Basic education / Marketing degrees / MBAs, PhDs / CS degree
- Language: English, Arabic / English, Arabic / English, Arabic / English
- Computer experience: Low / Mid / Mid / High
- Domain experience (social analytics): Low / High / Low-Mid / High
- Expectations: Ease of access / Speed of task / Ease of access and speed of task / Comprehensive functionality
2.5 Task Profiles
Tasks and the user groups that can perform them (X marks, across Regular Users / Marketers / Project Managers & Managers / Administrator):
1. Sign up: X X X
2. Submit new tracking request: X X
3. View tracking reports: X X X
4. Export tracking reports: X X
5. View tracking history: X X X
6. View user data: X
2.6 Environmental Profiles
Characteristics across the four user groups (Regular Users / Marketers / Project Managers & Managers / Administrator):
- Location: Indoor or mobile / Indoor or mobile / Indoor or mobile / Indoor
- Workspace: Office / Office / Cubicle
- Software: Any browser / Any browser / Any browser / Any browser
- Lighting: Bright / Good / Average
- Noise: Quiet / Quiet / Quiet / Quiet
- Hardware: PC / PC / PC / PC
- Internet connection: Normal connection / Normal connection / Fast connection / Fast connection
2.7 Sample Personas
Persona #1
Name: Mr. Ahmed Roshdy.
Age: 20.
Position: Student.
Education: Pursuing a degree in Computer Science.
Things he wants to know:
• People's opinions on the product he wants to buy, to decide whether he will buy it or not.
Things he wants to do:
• Know which product he wants to buy.
Persona #2
Name: Mr. John Doe.
Age: 45.
Position: Managing Director.
Education: Masters of Business Administration.
Company: XYZ for Real estate development.
Things he wants to know:
• How many units should he sell in his new project?
• What is the feedback of his current clients?
• Where to build new projects and housing complexes?
Things he wants to do:
• Attract more clients to his business.
• Make more profit.
• Build strong relationships with his clients.
Persona #3
Name: Ms. Karin Kudrow.
Age: 25.
Position: Social Media Strategist.
Education: Bachelor degree in Marketing.
Company: XYZ for Business Solutions.
Things she wants to know:
• How many tweets were written with a certain hashtag?
• Who are the best contributors on the hashtag?
• How do people interact with the hashtag, and are their tweets negative or positive?
Things she wants to do:
• Launch a new marketing campaign on Twitter.
• Share a new hashtag with the company's followers so they can share their thoughts on it.
2.8 Competitive Analysis
Subscription:
- Keyhole: 30-day trial, then $599 per month for a full subscription.
- Hashtagify.me: 14-day trial, then $299 per month for a full subscription.
- Hashtracking: 30-day trial, then $399 per month for a full subscription.
- Hashtags: free trial, with an upgradable account at $349 per month for a full subscription.
- Talkwalker: 7-day trial, then $1400 per month for a full subscription.
- Datapedia: full free subscription.
Tracking method:
- Keyhole: real time + historical data ($50 per report).
- Hashtagify.me: real time + historical data.
- Hashtracking: real time + historical data (up to 30 days).
- Hashtags: 24-hour trend graph based on a 1% data sample.
- Talkwalker: real time + historical data.
- Datapedia: real time + historical data.
Sentiment analysis:
- Keyhole: no. Hashtagify.me: yes. Hashtracking: no. Hashtags: no. Talkwalker: yes. Datapedia: yes.
Social media coverage:
- Keyhole: Twitter, Facebook, and Instagram.
- Hashtagify.me: Twitter (Facebook and Instagram planned for the future).
- Hashtracking: Twitter.
- Hashtags: Twitter.
- Talkwalker: Twitter, Facebook, and Instagram.
- Datapedia: Twitter.
Technology:
- Keyhole, Hashtagify.me, Hashtracking, Hashtags, Talkwalker: web application.
- Datapedia: web application, with plans to release an Android application.
From the comparison above, it is clear that Datapedia is designed to support
all of the common features existing in the corresponding and similar
web-based applications. However, Datapedia has some edge over the
others, such as the free full subscription and the ability to be extended to a
mobile application. In addition, building Datapedia as a web-based
application gives it an advantage through features such as the dynamically
changing graphical analytics aids and the interactive dashboard graphs
and charts that support the user in making better decisions.
Chapter 3: System Design
3.1 System Development Methodology
We used agile software development because:
1- We were changing requirements frequently, even late in development.
2- We delivered working software frequently, from every couple of weeks to
every couple of months, with a preference for the shorter timescale.
3- Working software was the primary measure of progress.
4- Our method of conveying information to and within the development
team was face-to-face conversation.
5- There was continuous attention to technical excellence and good design,
which enhances agility.
6- To implement a new feature, the developers needed to lose only the work
of a few days, or even hours, to roll back and implement it.
7- Unlike the waterfall model, the agile model requires very limited planning
to get started with the project. Agile assumes that the end users' needs are
ever-changing in a dynamic business and IT world. Changes can be
discussed, and features can be added or removed based on feedback. This
effectively gives the customer the finished system they want or need.
3.2 Main Objectives
1- To help the user make better decisions concerning buying a product, in
an easy and user-friendly way, based on real reviews written by actual users
across different social media sites.
2- To help marketers make better marketing plans based on what people
like and dislike (user opinions and reviews), by taking feedback into account,
and by studying the trend of opinions along different dimensions (time,
location, source, ...).
3- To help managers make more accurate forecasts based on real data
collected from real users.
3.3 Secondary objectives
1- Implementing the “power of the user” strategy.
2- Giving recommendations based on real data.
3- Pointing out how the market responds to a new product, service, or
initiative.
3.4 Use Case Diagram
3.5 Backend Design
To understand how the process works when we talk about big data, we must
first know about Hadoop itself. As open source, Java-based software, it costs
us nothing to apply parallel processing techniques to our collected data and
run our algorithm on it. Hadoop mainly relies on dividing tasks into smaller
subtasks via a programming paradigm called MapReduce, in which an
algorithm is mapped over the different parts of the data and a reducer finally
combines the results, as shown in the figure:
The MapReduce structure mainly consists of two parts:
• JobTracker: This is the master node of the MapReduce system, which
manages the jobs and resources in the cluster (TaskTrackers). The JobTracker
tries to schedule each map as close to the actual data being processed on
the TaskTracker, which is running on the same DataNode as the underlying
block.
• TaskTracker: These are the slaves that are deployed on each machine.
They are responsible for running the map and reduce tasks as instructed by
the JobTracker. They work in a master-slave fashion, in which the JobTracker
assigns the different tasks to the TaskTrackers, as in the following figure:
As we can see, Hadoop is mainly based on a divide-and-conquer process,
and that forms its parallel processing core. The other core is the file system
that Hadoop relies on, HDFS (Hadoop Distributed File System), which is the
centerpiece of Hadoop's fault handling. Data in a Hadoop cluster is broken
down into smaller pieces (called blocks) and distributed throughout the
cluster. In this way, the map and reduce functions can be executed on
smaller subsets of your larger datasets, and this provides the scalability that is
needed for big data processing.
HDFS is based on three nodes:
• NameNode: This is the master of the HDFS system. It maintains the
directories, files, and manages the blocks that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and
provide actual storage. They are responsible for serving read-and-write data
requests for the clients.
• Secondary NameNode: This is responsible for performing periodic
checkpoints. So, if the NameNode fails at any time, it can be replaced with a
snapshot image stored by the secondary NameNode checkpoints.
Sometimes the data resides on HDFS (in various formats). Since a lot of data
analysts are very productive in R, it is natural to use R to compute with the
data stored through Hadoop-related tools.
As mentioned earlier, the strength of R lies in its ability to analyze data using a
rich library of packages, but it falls short when it comes to working on very
large datasets. The strength of Hadoop, on the other hand, is to store and
process very large amounts of data, in the TB and even PB range. Such vast
datasets cannot be processed in memory, as the RAM of each machine
cannot hold such large datasets. The options are either to run the analysis on
limited chunks, also known as sampling, or to combine the analytical power
of R with the storage and processing power of Hadoop, which yields an ideal
solution. Such solutions can also be achieved in the cloud using platforms
such as Amazon EMR.
R will not load all the data (Big Data) into machine memory, so Hadoop can
be chosen to store the data. Not all algorithms work across Hadoop, and the
algorithms are, in general, not R algorithms. Analytics with R has several issues
related to large data: in order to analyze a dataset, R loads it into memory,
and if the dataset is large, it fails with exceptions such as "cannot allocate
vector of size x". Hence, in order to process large datasets, the processing
power of R can be vastly magnified by combining it with the power of a
Hadoop cluster. Hadoop is a very popular framework that provides such
parallel processing capabilities. So, we can run R algorithms or analysis
processing over Hadoop clusters to get the work done.
If we think about a combined RHadoop system, R will take care of data
analysis operations with the preliminary functions, such as data loading,
exploration, analysis, and visualization, and Hadoop will take care of parallel
data storage as well as computation power against distributed data. Prior to
the advent of affordable Big Data technologies, analysis used to be run on
limited datasets on a single machine. Advanced machine learning
algorithms are very effective when applied to large datasets, and this is
possible only with large clusters where data can be stored and processed
with distributed data storage systems.
3.6 Block Diagram
3.7 Sample Mock-ups
We have chosen Windows 10 as the product to be evaluated in our
prototype. We take the word or hashtag that the user entered and use it to
search for tweets that contain this specific word.
Then we classify these tweets as positive, negative, or neutral. This
classification is based on a word classification lexicon, in which each word
has a specific positive or negative score on a scale from -5 to +5.
The results of the data analytics phase are then presented on a dashboard
to help the user make an informed decision.
Figure 1: Home Page
Figure 2: Sentiment Scores
Figure 3: Map Dimension (Demographic)
Chapter 4: System Implementation and Management
The proposed project consists of 4 main phases as follows:
4.1 Phase 1: Data Collection and Preprocessing
In this phase, we collected relevant data from social networks as well as
product review sites. We first used different APIs to grab the data; then, we
transformed the data into R data frames so they could be easily parsed,
accessed, and queried. Our goal was to collect millions of product-related
records. (A sketch of this step is given after the attribute list below.)
The collected data includes key attributes like:
- "text": The text that will be analyzed later in the sentiment analysis phase.
- "favorited": A Boolean attribute to show if the tweet is favorited by any
Twitter user or not.
- "favoriteCount": The count of the how many favorites the tweet got.
- “created": The time attribute when the tweet was posted.
- "id": The tweet id.
- "statusSource": The source of the tweet where if it’s posted from the website
or from a mobile application.
- "screenName": The handle (user name) of the user who wrote the tweet.
- "retweetCount": The count of the how many retweets the tweet got.
- "isRetweet": A Boolean attribute to show if the tweet is a retweeted tweet
from another user.
- "retweeted": A Boolean attribute to show if the tweet is retweeted by any
Twitter user or not.
- "Longitude": Coordinates on map to get the location of the user who
posted the tweet.
- "Latitude": Coordinates on map to get the location of the user who posted
the tweet.
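A minimal sketch of this collection step with the twitteR package (the search term and the file name are examples, not our exact scripts): the returned status objects are converted into an R data frame exposing the attributes listed above.
tweets_raw <- searchTwitter("#Windows10", n = 1500, lang = "en")
tweets     <- twListToDF(tweets_raw)    # one row per tweet, columns as listed above
names(tweets)                           # text, favoriteCount, created, screenName, retweetCount, ...
write.csv(tweets, "tweets_today.csv", row.names = FALSE)   # saved regularly to build our own history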
We started this phase in October, and it was an ongoing phase, which means
we kept downloading tweets from October until May 2015.
One of the problems we faced was the limitation of the Twitter Search API:
we could only download tweets that were at most one week old, as you must
be a Twitter search partner to be able to request older tweets from the
Search API.
To overcome this, we downloaded tweets on a daily basis to build our own
historical data.
By the end of May 2015, we had successfully downloaded 300,000 tweets.
Another problem was that the data contained a lot of commercial tweets
and ads, like promotions.
To overcome this, we decided to collect pure datasets from different sources
that provide user reviews on specific products. We collected about 500 user
reviews manually from websites that provide real reviews from real users, like
GSMArena and Reevoo.
We then realized that it would take us a lot of time to gather a decent
number of reviews to be analyzed, so we built our own web crawler using
Python to automatically get the reviews from those sites.
4.2 Phase 2: Sentiment Analysis.
In this stage, the data collected in phase 1 is classified as positive, negative
and neutral. This classification is based on word classification lexicon. Natural
language processing techniques are used in this stage to parse the sentence,
understand it, and finally classify it into the predefined classes.
This step was crucial, as its output is what will eventually drive the customers'
decision about the product.
Our algorithm was enhanced more than once to get better accuracy
percentages.
Our first algorithm was very basic and relied on matching the words between
the tweet and the lexicon. This algorithm led to poor results.
We then identified the reasons behind these poor results: special cases that
need to be handled, such as negation and sarcasm.
Those cases were handled in the algorithm by searching for negation in the
text during the sentiment scoring process; if negation is found, the score is
automatically multiplied by -1 until the end of the statement is reached.
This way we achieved a much better accuracy score: using the pure dataset
collected in the data collection phase, we analyzed it manually to establish
its overall sentiment, and then analyzed it again using our algorithm and
compared the results.
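A minimal sketch of this lexicon-based scoring with simple negation handling (the word lists and negation cues below are illustrative; the production algorithm differs in detail):
score_sentence <- function(sentence, pos.words, neg.words,
                           negations = c("not", "no", "never")) {
  words  <- strsplit(tolower(gsub("[[:punct:]]", " ", sentence)), "\\s+")[[1]]
  score  <- 0
  negate <- FALSE
  for (w in words) {
    if (w %in% negations) { negate <- TRUE; next }   # flip polarity for the rest of the statement
    s <- (w %in% pos.words) - (w %in% neg.words)     # +1 positive, -1 negative, 0 otherwise
    if (negate) s <- -s
    score <- score + s
  }
  score
}
score_sentence("the phone is not a disappointment",
               pos.words = c("excellent", "great"),
               neg.words = c("disappointment", "poor"))   # returns +1 (positive)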
4.3 Phase 3: Data Analytics.
In this phase, R-language will be used to handle the enormous data volumes
and help in achieving the data analytics task. Classified data will be input to
this module and different techniques will be used (and/or compared) to
obtain the best analytical results.
For big data analytics (which is a whole new process), we cannot rely on the
usual way of executing the algorithm, as it would take too much time; it
takes even more time just to fetch the data into memory, since R must
operate on data already in memory, plus the time taken by the analytics
process itself, which is huge given what is really happening: cleaning the
tweets, dividing each single tweet into words, looking the words up in the
lexicon, and summing the scores.
This is where Hadoop's turn comes. We actually did not use plain Hadoop
jobs; we used something called Hadoop streaming, a utility that gives
developers the ability to write Hadoop tasks in languages other than Java,
such as C++, C, Python, and R. It perfectly suits what we are looking for:
combining the great analytical power of R with the stable, fault-tolerant
processing of Hadoop.
Hadoop streaming is a Hadoop utility for running the Hadoop MapReduce
job with executable scripts such as Mapper and Reducer. This is similar to the
pipe operation in Linux. With this, the text input file is printed on stream (stdin),
which is provided as an input to Mapper and the output (stdout) of Mapper
is provided as an input to Reducer; finally, Reducer writes the output to the
HDFS directory.
The main advantage of the Hadoop streaming utility is that it allows Java as
well as non-Java programmed MapReduce jobs to be executed over
Hadoop clusters. Also, it takes care of the progress of running MapReduce
jobs. The Hadoop streaming supports the Perl, Python, PHP, R, and C++
programming languages. To run an application written in other programming
languages, the developer just needs to translate the application logic into
the Mapper and Reducer sections with the key and value output elements.
The six main components of a typical Hadoop streaming command are
listed and explained as follows (a sketch of such an invocation is given after
the list):
• jar: This option is used to run a jar with coded classes that are designed for
serving the streaming functionality with Java as well as other programmed
Mappers and Reducers. It's called the Hadoop streaming jar.
• Input: This option is used for specifying the location of input dataset (stored
on HDFS) to Hadoop streaming MapReduce job.
• Output: This option is used for telling the HDFS output directory (where the
output of the MapReduce job will be written) to Hadoop streaming
MapReduce job.
• File: This option is used for copying the MapReduce resources such as
Mapper, Reducer, and Combiner to computer nodes (Tasktrackers) to make
it local.
• Mapper: This option is used for identification of the executable Mapper file.
• Reducer: This option is used for identification of the executable Reducer file.
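A hedged sketch of such an invocation, launched from R via system(); the jar location, HDFS paths, and script names are assumptions for illustration only:
cmd <- paste(
  "hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar",
  "-input /user/hduser/tweets/tweets.txt",       # input dataset on HDFS
  "-output /user/hduser/tweets_scored",          # HDFS output directory
  "-file mapper.R -mapper mapper.R",             # ship and run the R mapper
  "-file reducer.R -reducer reducer.R"           # ship and run the R reducer
)
system(cmd)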
RHadoop is a collection of three R packages for providing large data
operations with an R environment. It was developed by Revolution Analytics,
which is the leading commercial provider of software based on R. RHadoop
is available with three main R packages: rhdfs, rmr, and rhbase. Each of them
offers different Hadoop features.
• rhdfs is an R interface for providing the HDFS usability from the R console. As
Hadoop MapReduce programs write their output on HDFS, it is very easy to
access them by calling the rhdfs methods. The R programmer can easily
perform read and write operations on distributed data files. Basically, rhdfs
package calls the HDFS API in backend to operate data sources stored on
HDFS.
• rmr is an R interface for providing Hadoop MapReduce facility inside the R
environment. So, the R programmer needs to just divide their application
logic into the map and reduce phases and submit it with the rmr methods.
After that, rmr calls the Hadoop streaming MapReduce API with several job
parameters as input directory, output directory, mapper, reducer, and so on,
to perform the R MapReduce job over Hadoop cluster.
• rhbase is an R interface for operating the Hadoop HBase data source
stored at the distributed network via a Thrift server. The rhbase package is
designed with several methods for initialization and read/write and table
manipulation operations.
Installing the R packages: Several R packages need to be installed to help
connect R with Hadoop. The list of packages can be installed by executing
the following command in the R console:
install.packages(c("rJava", "RJSONIO", "itertools", "digest", "Rcpp",
                   "httr", "functional", "devtools", "plyr", "reshape2"))
• Setting environment variables: We can set these via the R console using the
following code:
## Setting HADOOP_CMD
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
## Setting up HADOOP_STREAMING
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar")
Or, we can set them from the command line as follows:
export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar
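With the packages installed and the environment variables set, a MapReduce job can be submitted from R through rmr (published as rmr2). The sketch below is only a word count over the tweets file, to show the shape of such a job; the HDFS path is an assumption:
library(rmr2)
job <- mapreduce(
  input        = "/user/hduser/tweets/tweets.txt",   # assumed HDFS path of the tweets file
  input.format = "text",
  map    = function(k, lines) {
    words <- unlist(strsplit(tolower(lines), "\\s+"))
    keyval(words, 1)                                  # emit (word, 1) pairs
  },
  reduce = function(word, counts) keyval(word, sum(counts))
)
out <- from.dfs(job)   # pull the (word, count) pairs back into the R session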
In our project, as we implied earlier, we are dealing with a text file that
contains more than 300,000 tweets. In the data analytics process, we put that
file into an HDFS directory using the HDFS commands in Linux, and the file
itself is split into 20 different parts, with one replication and one cluster. (The
same copy can also be done from R via rhdfs, as sketched below.)
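A minimal rhdfs sketch of that copy step (local and HDFS paths are assumptions):
library(rhdfs)
hdfs.init()
# Equivalent to `hadoop fs -put` on the command line.
hdfs.put("/home/hduser/tweets/tweets.txt", "/user/hduser/tweets")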
We used 2 mappers in our project, which is enough for our Core i7 CPU; this
means 2 tasks run simultaneously on each core, as we enabled
hyper-threading.
Using R alone, it took more than 5 minutes to load the data into memory and
more than 20 minutes to work on the data and produce results; with Hadoop,
it took only 15 minutes to get all the results, in a much easier way.
4.4 Phase 4: Data presentation.
This is the final step where results obtained will be presented to the user in a
user friendly manner. Dashboards will be used to present the analytics results.
Maps will be used to represent geographically-related information. Pie-charts,
bar charts and others will be used for analytical data. The user will be given
the freedom to drill-down & roll-up with the analytics granularity so that (s)he
can obtain a global view before making decisions.
For business analysis reports, we also provide some useful analytics, such as:
1- Most Retweeted tweets.
2- Most Favorite tweets.
3- Most accompanied hashtags.
These can come in handy for finding trends or topics related to the user's
search term.
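A minimal sketch of how such summaries can be computed from the tweets data frame built in the collection phase (column names as listed in Phase 1; this is illustrative, not our production code):
top_retweeted <- head(tweets[order(-tweets$retweetCount), c("screenName", "text", "retweetCount")], 10)
top_favorited <- head(tweets[order(-tweets$favoriteCount), c("screenName", "text", "favoriteCount")], 10)
hashtags      <- unlist(regmatches(tweets$text, gregexpr("#\\w+", tweets$text)))   # extract hashtags
top_hashtags  <- head(sort(table(tolower(hashtags)), decreasing = TRUE), 10)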
We used PHP, Bootstrap, and JavaScript to implement the user interface,
giving the user the option to enter his preferred search term in the real-time
section and the option to upload a file of tweets in the historical section, as
we had to ensure that our application stays generic and error-free.
Chapter 5: System Evaluation
5.1 Data Evaluation
As mentioned earlier, our data is fetched directly from the Twitter Search API,
but we noticed a lot of problems in this data that affected the performance
of our algorithm and our system overall. One of the problems we faced is
that the data contains a lot of ads and commercial tweets, as shown in
Figure 4.
Figure 4: Commercial tweets.
To overcome this problem, we decided to collect pure datasets from
different sources that provide user reviews on specific products. We
collected about 500 user reviews manually from websites that provide real
reviews from real users, like GSMArena and Reevoo.
Screenshot of the pure dataset:
5.2 Algorithm Evaluation
Our main objective in this project was the accuracy of the sentiment analysis
scoring algorithm, and our algorithm was enhanced more than once to get
better accuracy percentages.
Our first algorithm was very basic and relied on matching the words between
the tweet and the lexicon. This algorithm led to poor results.
We then identified the reasons behind these poor results: the special cases
that need to be handled, such as negation and sarcasm.
After handling them, we achieved a much better accuracy score, using the
pure dataset collected in the data evaluation phase, which we had
analyzed manually to establish its overall sentiment before analyzing it again
with our algorithm.
We then compared the accuracy of every algorithm version, as shown in
Figure 5 (Algorithms Comparison).
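For illustration, accuracy here simply means the fraction of reviews on which the algorithm's label matches the manual label (the labels below are made up, not our dataset):
manual    <- c("positive", "negative", "neutral", "positive", "negative")
predicted <- c("positive", "negative", "positive", "positive", "neutral")
accuracy  <- mean(predicted == manual)   # 0.6, i.e. 60%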
5.3 System Evaluation
Finally, we wanted to benchmark our algorithm's performance against other
algorithms.
We chose Stanford's recursive deep model, as it is very popular and used by
many people interested in sentiment analysis.
The two algorithms were tested on the same data collected and manually
analyzed by us in the data evaluation phase.
As shown in Figure 6, our algorithm scored 61.7% accuracy and Stanford's
recursive deep model scored 55.8%.
Here are some examples from the two algorithms to support this benchmark:
Example 1:
The text says: “The camera quality and options got better, overall, the phone
is not a disappointment”.
The review is positive:
Stanford’s recursive deep model scored 0 for this sentence which means it’s
neutral.
Our algorithm scored +4 which means it’s positive.
Example 2:
The text says: “Excellent camera with video facility”.
The review is positive.
Stanford’s recursive deep model scored 0 for this sentence which means it’s
neutral.
Our algorithm scored +3, which means it is positive.
References
- Book 1: Big Data Analytics with R and Hadoop
First published: November 2013
Production Reference: 1181113
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-328-2
- The Comprehensive R Archive Network (CRAN)
Link to the CRAN: http://cran.rstudio.com/
- Course (1): Swirl. Learn R, in R.
Link to the course: http://swirlstats.com/
- Course (2): R Programming.
By: Johns Hopkins University.
Link to the course: https://www.coursera.org/course/rprog
- Course (3): Getting and Cleaning Data.
By: Johns Hopkins University
Link to the course: https://www.coursera.org/course/getdata
- Course (4): Exploratory Data Analysis.
By: Johns Hopkins University.
Link to the course: https://www.coursera.org/course/exdata
- Discussion article (1): What are the best supervised learning algorithms for
sentiment analysis in text?
Link to the article: http://www.quora.com/What-are-the-best-supervised-learning-algorithms-for-sentiment-analysis-in-text
- Article (1): Why Sentiment Analysis Is Essential for Small Businesses.
By: Tara Hornor.
Link to the article: http://graziadiovoice.pepperdine.edu/why-sentiment-analysis-is-essential-for-small-businesses/
- Article (2): The 37 best tools for data visualization.
By: Brian Suda and Sam Hampton-Smith.
Link to the article: http://www.creativebloq.com/design-tools/data-visualization-712402

  • 1. Datapedia 1 GraduationProjectReport Faculty of Computers and Information Information Systems Dept. Cairo University 2014/2015 Datapedia Project Team: Abanoub A. Amin. (20110002) Ahmed A. Ahmed. (20110070) Ahmed M. Osman (Team Leader). (20110086) Ahmed N. Muhammad. (20110099) Ahmed R. Negm. (20110053) Under the supervision of: Dr. Hoda M. O. Mokhtar. Eng. Mohamed Hafez.
  • 2. Datapedia 2 Table of Contents Chapter 1: System Proposal ……………………………………………………….……………. 4 1.1 Introduction …………………………………………………………………………………… 4 1.2 Problem Statement ………………………………………………………………………….. 5 1.3 Project Justification ………………………………………………………………………….. 5 1.4 Project Scope ………………………………………………………………………………… 5 1.5 Technologies & Tools …………….………………………………………………………….. 6 1.6 Limitations & Exclusions ………………………………………………………..…………... 23 Chapter 2: System Analysis ……………………………………………..…………………….. 25 2.1 Project Stakeholders ……………………………………………………….…….………… 52 2.2 Non-Functional Requirements ………………………………………….……….……….. 52 2.3 Functional Requirements ………………………………………………….…….………... 62 2.4 User Profiles ………………………………………………………………….…….…….…… 72 2.5 Task Profiles ………………………………………………………………….…………..…… 72 2.6 Environmental Profiles …………………………………………………….……………..… 82 2.7 Sample Personas …………………………………………………………………….……... 82 2.8 Competitive Analysis ………………………………………………………………….…… 03 Chapter 3: System Design ……………………………………………………….…………..… 13 3.1 System Development Methodology ……………………………………………………. 13 3.2 Main Objectives ……………………………………………………………….……………. 13 3.3 Secondary objectives ………………………………………………………….………….. 23 3.4 Use Case Diagram …………………………………………………………………………. 33 3.5 Backend Design ………………………….…………………………………………………. 34 3.6 Block Diagram ………………………………………………………………….…………… 39 3.7 Sample Mock-ups ……………………………………………………………..……………. 40 Chapter 4: System Implementation& Management …………....…………..………..….. 42 4.1 Phase 1: Data Collection and Preprocessing …………………………………………. 42 4.2 Phase 2: Sentiment Analysis ………………………………………………………………. 43 4.3 Phase 3: Data Analytics …………………………………………………………………… 44
  • 3. Datapedia 3 4.4 Phase 4: Data presentation ………………………………………………………………. 50 4.5 Algorithm Flowchart ………………………………………………………………….…….. 51 Chapter 5: System Evaluation …………………………………………………..…………….. 52 5.1 Data Evaluation ……………………………………………………………..……………… 52 5.2 Algorithm Evaluation…………………………………………………………………..….... 54 5.3 System Evaluation ……………………………………………………………..………….... 55 References ……………………………………….………………………………………….....… 58
  • 4. Datapedia 4 Chapter 1: System Proposal. 1.1 Introduction. Nowadays, it is true that our personal life is highly dependent on the technology that people have developed. We are currently witnessing an era where new competitive products are produced almost every day. A key impact of this rapid technological advancement is that it affected several aspects in our life style along with the way we purchase products, the way we communicate, the way we travel, and not to mention the way we learn. Despite the importance of these advancements, today, buying a certain product is no longer a trivial task that can be taken in minutes. Almost everything we use has been refurbished to better standards. New features, new specifications, price variations, new applications, and many other dimensions affect our purchasing decision, making it more sophisticated and require better decision-making approaches.Although this impact is clear on the personal level, it is also obvious in larger organizations where purchasing machines, computers and equipment is also a crucial decision that costs money.Thus, traditional purchasing approaches where a customer selects by shape, or size or just price is no longerefficient. Today, we all compare between products, ask for recommendations, and navigate through the web for product reviews aiming to find clues that help us in our decision-making. However, going through different sites for collecting such comparative data, or chatting in social networks to find recommendations is not an appealing task for many people. Inspired by the importance of this decision-making problem, and the role social networks and online reviews, we aim to integrate different product related data sources to provide the customer with an integrated view for
  • 5. Datapedia 5 the product he wants to purchase. 1.2 Problem Statement. Buying a product whether it is a computer, a mobile phone, a tablet, or even a car is no longer a simple task. With the increase in prices of new products and the wide range of competing products that offer similar functionalities at lower price, choosing between alternatives turns to be difficult. 1.3 Project Justification. It is now common that before buying any product people at least ask their friends or relatives for recommendations, others search over the internet for customers’ reviews, others tweet or chat over social networks to gather information, going through all those paths is not an appealing approach for many people especially those with minimal computer education. Thus, it would have been more appealing if customers can enter a “single” website or run a mobile application that simply does the work and presents the final comparison in a user-friendly analytical approach. 1.4 Project Scope. In this project, we aim to design a web application that analyzesbig volumes of product reviews, social networks posts and tweets related to the product. Then, presents the results of this big data analytics job in a user friendly, understandable, and easily interpreted manner that can be easily used by different customers for different purposes.
  • 6. Datapedia 6 1.5 Technologies & Tools. R Language 3.1.2, RStudio 0.98.1091. R is an open source software package for performing statistical analysis on data. R is a programming language used by data scientists, statisticians, and others who need to perform statistical analysis of data and glean key insights from it using techniques such as regression, clustering, classification, and text analysis. R is released under the GNU General Public License. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently maintained by the R Development Core Team. It can be considered a different implementation of S, developed by John Chambers at Bell Labs. There are some important differences, but a lot of code written in S runs unaltered under the R interpreter. R provides a wide variety of statistical, machine learning (linear and nonlinear modeling, classic statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. R has various built-in as well as extended functions for statistical, machine learning, and visualization tasks such as: • Data extraction • Data cleaning • Data loading • Data transformation • Statistical analysis • Predictive modeling • Data visualization It is one of the most popular open source statistical analysis packages available today. It is cross-platform, has very wide community support, and a large and ever-growing user community that adds new packages every day. With its growing list of packages, R can now connect to other data stores, such as MySQL, SQLite, MongoDB, and Hadoop, for data storage activities. Let's see different useful features of R:
  • 7. Datapedia 7 • Effective programming language • Relational database support • Data analytics • Data visualization • Extension through the vast library of R packages The graph provided by KD suggests that R is the most popular language for data analysis and mining. The following graph provides details about the total number of R packages released by R users from 2005 to 2013, which shows how the R user base has grown: the growth was exponential in 2012, and 2013 seems to be on track to beat that. R allows performing data analytics through various statistical and machine learning operations, as follows: • Regression • Classification • Clustering • Recommendations • Text mining
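As a small, generic illustration of the kinds of operations listed above (this sketch uses R's built-in iris dataset rather than the project's tweet data):

```r
# Generic illustrations of the listed operations, using R's built-in iris data.
data(iris)

# Regression: model petal length from petal width.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(fit)

# Clustering: group the flowers into three clusters on the numeric columns.
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)

# Visualization: a quick scatter plot of the fitted relationship.
plot(iris$Petal.Width, iris$Petal.Length, pch = 19,
     xlab = "Petal width", ylab = "Petal length")
abline(fit, col = "red")
```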
  • 8. Datapedia 8 Using R enhanced our project, giving us the ability to run accurate analytics on tweets and figure out each user's opinion. Installation: First we install R for Windows, which is required before installing RStudio. Then we can download and install RStudio.
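With R and RStudio in place, the packages used in the rest of this chapter can be installed from CRAN. The exact package list below is an assumption based on the tools described in this report:

```r
# One-time setup after installing R and RStudio: install the packages used later
# in this report (the list is an assumption based on the tools described here).
install.packages(c("twitteR",   # Twitter Search API client
                   "ROAuth",    # OAuth handshake with Twitter
                   "plyr",      # data manipulation
                   "stringr"))  # text cleaning for tweets
library(twitteR)
library(ROAuth)
```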
  • 10. Datapedia 10 Twitter Authentication with R: After installing our packages and libraries we need to create a new Twitter application to be used in the data collection phase where we use this application to interact with Twitter search API. This app is needed for the authentication process with Twitter. Creating the Twitter application and doing the handshake is a must as you have to do it every time you want to get data from Twitter with R. First we go to https://dev.twitter.com/ and log in with a Twitter Account. After creating the application, we can get our api_key and our api_secret as well as our access_token and access_token_secret from our app settings on Twitter. And that’s it. Now we can search Twitter anytime we want and get the data we need.
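A minimal sketch of that handshake and a first search from R, assuming a recent twitteR release (setup_twitter_oauth() was introduced around version 1.1.8; older versions perform the handshake explicitly through ROAuth). The four key strings are placeholders for the values copied from the app settings page:

```r
library(twitteR)

# Placeholder credentials copied from the Twitter app settings page.
api_key             <- "YOUR_API_KEY"
api_secret          <- "YOUR_API_SECRET"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate this R session against the Twitter API.
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

# First search: up to 1500 English tweets mentioning the tracked product.
tweets <- searchTwitter("#Windows10", n = 1500, lang = "en")
length(tweets)
```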
  • 11. Datapedia 11 2- Apache Hadoop (Hadoop Streaming) v1.2.1. As mentioned earlier, we are about to use the R language to perform analytical tasks on a massive amount of data. Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured and will typically not fit into memory to be processed. They have to be processed in place, which means that computation has to be done where the data resides. When we talk to developers, the people actually building Big Data systems and applications, we get a better idea of what they mean by the 3Vs. They typically mention the 3Vs model of Big Data: velocity, volume, and variety. Velocity refers to the low-latency, real-time speed at which analytics need to be applied. A typical example would be performing analytics on a continuous stream of data originating from a social networking site, or on an aggregation of disparate data sources. Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB based on the type of the application that generates or receives the data. Variety refers to the various types of data that can exist, for example, text, audio, video, and photos. Big Data usually involves datasets so large that it is not possible for traditional systems to process them within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset. Faced with this seemingly insurmountable challenge, entirely new platforms have emerged, called Big Data platforms.
  • 12. Datapedia 12 Some of the popular organizations that hold Big Data are as follows: • Facebook: It has 40 PB of data and captures 100 TB/day. • Yahoo!: It has 60 PB of data. • Twitter: It captures 8 TB/day. • EBay: It has 40 PB of data and captures 50 TB/day. How much data is considered as Big Data differs from company to company. Though true that one company's Big Data is another's small, there is something common: doesn't fit in memory, nor disk, has rapid influx of data that needs to be processed and would benefit from distributed software stacks. For some companies, 10 TB of data would be considered Big Data and for others 1 PB would be Big Data.
  • 13. Datapedia 13 So only you can determine whether the data is really Big Data. It is sufficient to say that it would start in the low terabyte range. We will use Apache Hadoop as a tool to handle these big amount of data for the sake of our project. Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top level Apache project, initiated and led by Yahoo! and Doug Cutting. It relies on an active community of contributors from all over the world for its success. With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology. It is becoming the industry de facto framework for Big Data processing. Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics. Hadoop enables scalable, cost-effective, flexible, fault-tolerant solutions. Apache Hadoop has two main features: • HDFS (Hadoop Distributed File System) • MapReduce HDFS is Hadoop's own rack-aware file system, which is a UNIX-based data storage layer of Hadoop. HDFS is derived from concepts of Google file system. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. On HDFS, data files are replicated as sequences of blocks in the cluster. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by simply adding commodity servers. HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred other organizations worldwide are known to use Hadoop.
  • 14. Datapedia 14 Characteristics of HDFS: • Fault tolerant • Runs with commodity hardware • Able to handle large datasets • Master slave paradigm • Write once file access only MapReduce is a programming model for processing large datasets distributed on a large cluster. MapReduce is the heart of Hadoop. Its programming paradigm allows performing massive data processing across thousands of servers configured with Hadoop clusters. This is derived from Google MapReduce. Hadoop MapReduce is a software framework for writing applications easily, which process large amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. This MapReduce paradigm is divided into two phases, Map and Reduce that mainly deal with key and value pairs of data. The Map and Reduce task run sequentially in a cluster; the output of the Map phase becomes the input for the Reduce phase. These phases are explained as follows: • Map phase: Once divided, datasets are assigned to the task tracker to perform the Map phase. The data functional operation will be performed over the data, emitting the mapped key and value pairs as the output of the map phase. • Reduce phase: The master node then collects the answers to all the sub-problems and combines them in some way to form the output; the answer to the problem it was originally trying to solve. The five common steps of parallel computing are as follows: 1. Preparing the Map () input: This will take the input data row wise and emit key value pairs per rows, or we can explicitly change as per the requirement. ° Map input: list (k1, v1)
  • 15. Datapedia 15 2. Run the user-provided Map() code ° Map output: list (k2, v2) 3. Shuffle the Map output to the Reduce processors. Also, shuffle the similar keys (grouping them) and input them to the same reducer. 4. Run the user-provided Reduce() code: This phase will run the custom reducer code designed by the developer to run on the shuffled data and emit key and value pairs. ° Reduce input: (k2, list (v2)) ° Reduce output: (k3, v3) 5. Produce the final output: Finally, the master node collects all reducer output, combines it, and writes it to a text file. In our system, we used Hadoop 1.2.1 as it's a stable version of Hadoop, and it can be downloaded from the official Apache Hadoop website. Hadoop 1.2.1 needs OpenJDK 7, and JAVA_HOME must be added to the Ubuntu environment; for that purpose you need to edit /etc/environment.
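To make the five Map and Reduce steps listed above concrete, here is a sketch of a streaming word-count job written as two small R scripts; the file names and paths are illustrative and this is not the project's actual code. Hadoop Streaming pipes each input split line by line through the mapper, sorts and groups the emitted keys, and pipes the grouped stream through the reducer.

```r
#!/usr/bin/env Rscript
# mapper.R (steps 1-2): read input lines from stdin and emit (word, 1) pairs.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)
```

```r
#!/usr/bin/env Rscript
# reducer.R (step 4): after the shuffle, lines arrive sorted by key; sum per key.
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  parts <- strsplit(line, "\t")[[1]]            # parts[1] = word, parts[2] = count
  if (!is.null(current) && parts[1] != current) {
    cat(current, "\t", total, "\n", sep = "")   # step 5: emit the combined result
    total <- 0
  }
  current <- parts[1]
  total   <- total + as.numeric(parts[2])
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)
```

The two scripts are then passed to the hadoop-streaming JAR shipped with Hadoop 1.2.1 as the -mapper and -reducer options, with the tweet text files given as -input.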
  • 16. Datapedia 16 A dedicated user must be added in order to launch Hadoop, and it is also recommended to create a hadoop group, because during the installation and configuration of Hadoop we need separation and privacy from other users. Hadoop requires SSH access to manage its nodes, i.e. the remote machines plus your local machine if you want to run Hadoop on it, so an SSH server must be installed and configured for the dedicated Hadoop user.
  • 17. Datapedia 17 3- RHadoop. RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are regularly tested (and always before a release) on recent releases of the Cloudera and Hortonworks Hadoop distributions and should have broad compatibility with open source Hadoop and mapR's distribution. We normally test on recent Revolution R and CentOS releases, but we expect all the RHadoop packages to work on a recent release of open source R and Linux. RHadoop consists of the following packages: ravro: read and write files in avro format. plyrmr: higher level plyr-like data processing for structured data, powered by rmr. rmr: functions providing Hadoop MapReduce functionality in R. rhdfs: functions providing file management of the HDFS from within R. rhbase: functions providing database management for the HBase distributed database from within R. 4- PHP. PHP (recursive acronym for PHP: Hypertext Preprocessor) is a widely-used open source general-purpose scripting language that is especially suited for web development and can be embedded into HTML. What distinguishes PHP from something like client-side JavaScript is that the code is executed on the server, and that completely suits our target as the server has to have a specific environment that won’t be available in every regular user’s pc, generating HTML which is then sent to the client. The client would receive the results of running that script, but would not know what the underlying code was.
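Of these packages, rmr is the one this project touches most directly, since it lets a MapReduce job be written entirely in R. The following is a minimal sketch using the rmr2 API (to.dfs(), mapreduce(), keyval(), from.dfs()); it assumes the HADOOP_CMD and HADOOP_STREAMING environment variables point at the Hadoop 1.2.1 installation, and the toy input merely stands in for the scored tweets:

```r
# A toy rmr job (rmr2 API): count records per sentiment class on the cluster.
library(rmr2)

# Toy input written to HDFS; in Datapedia this would be the scored tweets.
labels <- data.frame(label = sample(c("positive", "negative", "neutral"),
                                    1000, replace = TRUE))
input <- to.dfs(labels)

counts <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(v$label, rep(1, nrow(v))),  # emit (label, 1) pairs
  reduce = function(k, vv) keyval(k, sum(vv))                # sum the 1s per label
)

from.dfs(counts)  # pull the small result back into the R session
```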
  • 18. Datapedia 18 PHP is mainly focused on server-side scripting, so it can do anything any other CGI program can do, such as collect form data, generate dynamic page content, or send and receive cookies and much more. There are three main areas where PHP scripts are used. • Server-side scripting. This is the most traditional and main target field for PHP. You need three things to make this work. The PHP parser (CGI or server module), a web server and a web browser. • Command line scripting. You can make a PHP script to run it without any server or browser. You only need the PHP parser to use it this way. This type of usage is ideal for scripts regularly executed using cron (on *nix or Linux) or Task Scheduler (on Windows). These scripts can also be used for simple text processing tasks. And that’s what we are looking for, we will actually use PHP to execute the R script that does the sentimental analytics. • Writing desktop applications. PHP is probably not the very best language to create a desktop application with a graphical user interface, but if you know PHP very well, and would like to use some advanced PHP features in your client-side applications you can also use PHP-GTK to write such programs. You also have the ability to write cross-platform applications this way. 5- Google Charts API. The Google Chart API is a tool that lets people easily create a chart from some data and embed it in a web page. Google creates a PNG image of a chart from data and formatting parameters in an HTTP request. Many types of charts are supported, and by making the request into an image tag, people can simply include the chart in a web page.
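In Datapedia, this command-line path is what connects the PHP front end to the analytics: the PHP page can call something like exec("Rscript analyze.R '#Windows10'"). A sketch of what the R side of that hand-off can look like follows; the script name, argument handling, and output file are illustrative, not the project's actual files:

```r
#!/usr/bin/env Rscript
# analyze.R - the kind of entry point a PHP page can invoke via Rscript.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("usage: Rscript analyze.R <search term or hashtag>")
query <- args[1]

cat("Running the Datapedia pipeline for:", query, "\n")
# source("collect.R")  # Phase 1: download tweets matching `query`
# source("score.R")    # Phase 2: lexicon-based sentiment scoring
# source("report.R")   # Phases 3-4: analytics and files read by the dashboard

# Write a small summary the PHP page can read and feed to the charts.
summary_df <- data.frame(sentiment = c("positive", "negative", "neutral"),
                         count     = c(0, 0, 0))  # placeholder values
write.csv(summary_df, "sentiment_summary.csv", row.names = FALSE)
```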
  • 19. Datapedia 19 Charts are exposed as JavaScript classes, and Google Charts provides many chart types for you to use. The default appearance will usually be all what we need, and we can always customize a chart to fit the look and feel of the website. Charts are highly interactive and expose events that let us connect them to create complex dashboards or other experiences integrated with our webpage. Charts are rendered using HTML5/SVG technology to provide cross-browser compatibility (including VML for older IE versions) and cross platform portability to iPhones, iPads and Android. Our users will never have to mess with plugins or any software. If they have a web browser, they can see the charts. All chart types are populated with data using the DataTable class, making it easy to switch between chart types as you experiment to find the ideal appearance. The DataTable provides methods for sorting, modifying, and filtering data, and can be populated directly from your web page, a database, or any data provider supporting the Chart Tools Datasource protocol. Google Charts API has many useful characteristics that perfectly goes along with our project goal to have visual ads for the resulted analytics like: •Free. Using the same chart tools Google uses, completely free and with three years' backward compatibility guaranteed. •Customizable. Allows us to configure an extensive set of options to perfectly match the look and feel of our website. •Controls and Dashboards. Easily connected to an interactive dashboard. 1.6 Google Maps.
  • 20. Datapedia 20 As said before, Google Maps will be used within Datapedia to provide the graphical analytics aids. On today's Web, mapping solutions are a natural ingredient. So we use them to see the location of the tweets and visualize their content as colored pins whose color indicates whether a tweet is positive, negative, or neutral. Most tweets have a location, and if a location exists, it can be displayed on a map. There are several mapping solutions, including Yahoo! Maps and Bing Maps, but the most popular one is Google Maps. In fact, according to Programmableweb.com, it's the most popular API on the Internet. The Google Maps API lets us harness the power of Google Maps within Datapedia to display the located tweets in an efficient and usable manner. First of all, we need to know the term "coordinates". Coordinates are used to express locations in the world. There are several different coordinate systems. The one used in Google Maps is the World Geodetic System 84 (WGS 84), which is the same system the Global Positioning System (GPS) uses. Coordinates are expressed using latitude and longitude. You can think of these as the y and x values in a grid. To create a Google map, we must know that it resides in a web page. So, the first thing to do is to set up that page. This includes creating an HTML page and a style sheet for it. Once everything is set up, you can insert the actual map.
  • 21. Datapedia 21 Now that you have a web page set up, you're ready to load the Google Maps API. The API is a JavaScript file that is hosted on Google's servers. It's loaded with a <script> element in the <head> section of the page. The <script> element can be used to add remote scripts into a web page, and that's exactly what you want to do here. The most common use of maps on the Internet is to visualize the geographic position of something. The Google Maps marker is the perfect tool for doing this. A marker is basically a small image that is positioned at a specific place on a map. In Datapedia, it indicates that this is a tweet, and clicking it shows the content of the tweet. The content of the tweet opens, when the marker is clicked, in a window called an InfoWindow. When marking places on a map, we will often want to show additional information related to that place, like the text content of the located tweet. The Google Maps API offers a perfect tool for this, and that's the InfoWindow. It looks like a speech bubble and typically appears over a marker when you click it.
  • 22. Datapedia 22 7- Opinion Lexicon. Used in the Sentiment analysis process as we search for a match between the tweet and the lexicon and -if found- give that match the corresponding points in the lexicon. It’s like a big dictionary that contains English words and classifies them to negative and positive based on score or a rating where more than a zero is positive, lower than zero is negative. 8- Bootstrap v3.3.5. Bootstrap is an open-source CSS, JavaScript framework that was originally developed for twitter application by twitter's team of designers and
  • 23. Datapedia 23 developers. It is a combination of HTML, CSS, and JavaScript code designed to help build user interface components. Bootstrap was also programmed to support both HTML5 and CSS3. Also it is called Front-end-framework. Bootstrap is a free collection of tools for creating a websites and web applications. It contains HTML and CSS-based design templates for typography, forms, buttons, navigation and other interface components, as well as optional JavaScript extensions. Some Reasons we preferred Bootstrap Framework: •Easy to get started. •Great grid system. •Base styling for most HTML elements (Typography, Code, Tables, Forms, Buttons, Images, and Icons). •Extensive list of components. •Bundled JavaScript plugins. 1.6 Limitations & Exclusions Language Support: Our System support and analyze tweets written only in English as the opinion lexicon contains English words only. Twitter Search API: Access to the Search API is measured by the IP address the request is coming from. So the rate limiting is too strict like: - The downloaded tweets are only 1 week old as you should be a Twitter search partner to be able to request older tweets from the search API. twitteR data package in R Statistics:
  • 24. Datapedia 24 - Based on the limits of the searchTwitter () function we can only download 1500 vectors representing 1500 tweets per request.
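One way to live with the 1500-tweets-per-request limit is to page backwards through the available search window by re-issuing the request with the maxID parameter of searchTwitter(). The sketch below assumes results arrive newest-first and ignores the rate-limit pauses needed between requests:

```r
library(twitteR)

# Collect several consecutive batches, each continuing below the oldest tweet
# id returned so far. maxID is inclusive, so duplicates are dropped afterwards.
collect_batches <- function(query, batches = 5, per_batch = 1500) {
  collected <- list()
  max_id <- NULL
  for (i in seq_len(batches)) {
    batch <- searchTwitter(query, n = per_batch, lang = "en", maxID = max_id)
    if (length(batch) == 0) break
    collected <- c(collected, batch)
    max_id <- batch[[length(batch)]]$id   # oldest id in this (newest-first) batch
  }
  df <- twListToDF(collected)
  df[!duplicated(df$id), ]
}

product_tweets <- collect_batches("#Windows10", batches = 3)
nrow(product_tweets)
```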
  • 25. Datapedia 25 Chapter 2: System Analysis 2.1 Project Stakeholders Beneficiaries: - Users: Regular users can benefit from Datapedia by using it to know people’s opinions of a certain product or a service to help them decide what to buy. - Marketing Agencies: Digital Marketers and social media analysts can use Datapedia to track the volume of activity around campaigns in real- time, including number of users, find influencers talking about it and track negative feedbacks. - Decision Makers: Datapedia helps decision makers to make more efficient decisions based on user feedbacks. Owners: The project team and supervisors who will deliver the project. Without them project won’t happen. Suppliers: Twitter is a major supplier as we download the tweets from its API. Project Sponsor: Our project is accepted in ITAC graduation project funding program and sponsored by ITIDA. 2.2 Non-Functional Requirements Usability: Our system is built to in a manner that insures that it will be easy and simple and the majority of users will find it user-friendly and easy to deal with.
  • 26. Datapedia 26 Portability: Our system is built to work on any operating system and on any web browser. Accuracy: - Data accuracy is 100% guaranteed as the system deals with Twitter search API directly. - Sentiment scoring accuracy is our main debate in this project as we spent a lot of time handling special cases like (negation, sarcasm, etc...) to boost our algorithm accuracy. Extensibility: Our goal is to add more features to our system in the future like time framing and predicting future trends. 2.3 Functional Requirements 1- Easily understandable analytics with dynamically-changed graphical aids. 2- Great sentiment analysis accuracy accompanied with calculations and samples. 3- User Recommendations based on true statistics about related products. 4- Fully interactive dashboard graphs and charts to help users make better decisions.
  • 27. Datapedia 27 2.4User Profiles Characteristics Regular Users Marketers Project Managers / Managers Administrator Age 10 : 60 23 : 35 35 : 50 25 : 35 Gender 50% males 50% females 50% males 80% males 100% males Education Basic Education Marketing Degrees MBAs / PHDs CS Degree Language English / Arabic English / Arabic English / Arabic English Computer Experience Low mid Mid High Domain Experience (Social Analytics) Low High Low : Mid High Expectations Ease of access Speed of task Ease of access and speed of task Comprehensive functionality 2.5 Task Profiles Task Name Regular Users Marketers Project Managers / Managers Administrator 1 Sign up X X X 2 Submit new tracking request X X 3 View Tracking reports X X X 4 Export tracking reports X X 5 View tracking history X X X 6 View user data X
  • 28. Datapedia 28 2.6 Environmental Profiles Characteristics Regular Users Marketers Project Managers / Managers Administrator Location Indoor or mobile Indoor or mobile Indoor or mobile Indoor Workspace Office Office Cubicle Software Any browser Any browser Any browser Any browser Lighting Bright Good Average Noise Quiet quiet quiet quiet Hardware PC PC PC PC Internet Connection Normal connection Normal connection Fast connection Fast connection 2.7Sample Personas Persona #1 Name: Mr. Ahmed Roshdy. Age: 20. Position: Student. Education: Pursuing a degree in Computer Science. Things he wants to know Things he wants to do  People’s opinions on the product he wants to buy to decide whether he will buy it or not.  Knowing the product that he wants to buy.
  • 29. Datapedia 29 Persona #2 Name: Mr. John Doe. Age: 45. Position: Managing Director. Education: Masters of Business Administration. Company: XYZ for Real estate development. Things he wants to know Things he wants to do  How much units should he sell in his new project?  What is the feedback of his current clients?  Where to build new projects and housing complexes?  Attract more clients to his business.  Make more profit.  Build strong relationships with his clients. Persona #3 Name: Ms. Karin Kudrow. Age: 25. Position: Social Media Strategist. Education: Bachelor degree in Marketing. Company: XYZ for Business Solutions. Things she wants to know Things she wants to do  How many tweets written with a certain hashtag?  Who are the best contributors on the hashtag?  How people interact with the hashtag and are their tweets negative or positive?  Launch a new marketing campaign on twitter.  Share with the followers of the company a new hashtag to share their thoughts on it.
  • 30. Datapedia 30 2.8 Competitive Analysis Factor keyhole Hashtagify.me hashtracking hashtags talkwalker Datapedia Subscription 30 days trial then 599$ per month for full subscription. 14 days trial then 299$ per month for full subscription. 30 days trial then 399$ per month for full subscription. Free trial with upgradable account for 349$ per month for full subscription. 7 days trial then 1400$ per month for full subscription. Full free subscription. Tracking method Real time + historical data (50$ per report) Real time + historical data. Real time + historical data (up to 30 days) 24-Hour Trend Graph based on 1% data sample. Real time + historical data. Real time + historical data. Sentiment analysis X  X X   Social Media Coverage Twitter, Facebook and Instagram. Twitter and they will add Facebook and Instagram in the future. Twitter. Twitter. Twitter, Facebook and Instagram. Twitter Technology Web application. Web application. Web application. Web application. Web application. Web application, planning to release an android application. From the table above, it is clear that Datapedia is deliberated to support all of the existing common feature in the corresponding and similar web-based applications. However, Datapedia will have some edge over the others such as the free full subscription and the ability to be extended to a mobile application. In addition, building Datapedia as a web-based application will give it an advantage over the others having features such as the dynamically-changed graphical analytics aids and the interactive dashboard graphs and charts to support the user in making better decisions.
  • 31. Datapedia 31 Chapter 3: System Design 3.1 System Development Methodology We used the agile software development as: 1- We were changing requirements frequently, even late in development. 2- We delivered working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale. 3- The Working software was the primary measure of progress. 4- Our method of conveying information to and within a development team is face-to-face conversation. 5- There was a continuous attention to technical excellence and good design enhances agility. 6- To implement a new feature the developers need to lose only the work of a few days, or even only hours, to roll back and implement it. 7- Unlike the waterfall model in agile model very limited planning is required to get started with the project. Agile assumes that the end users’ needs are ever changing in a dynamic business and IT world. Changes can be discussed and features can be newly effected or removed based on feedback. This effectively gives the customer the finished system they want or need. 3.2 Main Objectives 1- To help the user make better decisions concerning buying a product in an easy & user-friendly way based on real reviews written by actual users among different social media sites. 2- To help marketers make better marketing plans based on what people like and dislike (user opinion and review) and by taking feedbacks in
  • 32. Datapedia 32 addition to studying the trend of opinions based on different dimensions (time, location, source, …). 3- To help mangers make more accurate forecasts based on real data collected from real users. 3.3 Secondary objectives 1- Implementing the “power of the user” strategy. 2- Giving recommendations based on real data. 3- Pointing out how market responds to a new product, service and initiative.
  • 33. Datapedia 33 3.4 Use Case Diagram
  • 34. Datapedia 34 3.5 Backend Design To know how the process will go when we speak about big data, we must first know about Hadoop itself. As open source Java-based software, it costs us nothing to apply parallel processing techniques to our collected data and run our algorithm on it. Hadoop mainly relies on dividing tasks into smaller subtasks via a programming paradigm called MapReduce, in which an algorithm is mapped onto the different parts of the data and finally a reducer combines the results, as shown in the figure. The MapReduce structure mainly consists of two parts: • JobTracker: This is the master node of the MapReduce system, which manages the jobs and resources in the cluster (TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the actual data being processed, i.e., on the TaskTracker that is running on the same DataNode as the underlying block. • TaskTracker: These are the slaves that are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker. They work in a master-slave arrangement in which the JobTracker assigns tasks to the TaskTrackers, as shown in the following figure:
  • 35. Datapedia 35 As we can see, Hadoop is mainly based on the divide-and-conquer process, and that forms its parallel processing core. The other core is the file system that Hadoop relies on, HDFS (Hadoop Distributed File System), which is the centerpiece of fault handling in Hadoop. It is actually a normal file system; data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
  • 36. Datapedia 36 HDFS is based on three nodes: • NameNode: This is the master of the HDFS system. It maintains the directories, files, and manages the blocks that are present on the DataNodes. • DataNode: These are slaves that are deployed on each machine and provide actual storage. They are responsible for serving read-and-write data requests for the clients. • Secondary NameNode: This is responsible for performing periodic checkpoints. So, if the NameNode fails at any time, it can be replaced with a snapshot image stored by the secondary NameNode checkpoints. Sometimes the data resides on the HDFS (in various formats).since a lot of data analysts are very productive in R, it is natural to use R to compute with the data stored through Hadoop-related tools. As mentioned earlier, the strengths of R lie in its ability to analyze data using a rich library of packages but fall short when it comes to working on very large datasets. The strength of Hadoop on the other hand is to store and process
  • 37. Datapedia 37 very large amounts of data in the TB and even PB range. Such vast datasets cannot be processed in memory as the RAM of each machine cannot hold such large datasets. The options would be to run analysis on limited chunks also known as sampling or to correspond the analytical power of R with the storage and processing power of Hadoop and you arrive at an ideal solution. Such solutions can also be achieved in the cloud using platforms such as Amazon EMR. R will not load all data (Big Data) into machine memory. So, Hadoop can be chosen to load the data as Big Data. Not all algorithms work across Hadoop, and the algorithms are, in general, not R algorithms. Despite this, analytics with R have several issues related to large data. In order to analyze the dataset, R loads it into the memory, and if the dataset is large, it will fail with exceptions such as "cannot allocate vector of size x". Hence, in order to process large datasets, the processing power of R can be vastly magnified by combining it with the power of a Hadoop cluster. Hadoop is very a popular framework that provides such parallel processing capabilities. So, we can use R algorithms or analysis processing over Hadoop clusters to get the work done.
• 38. Datapedia 38 If we think of a combined RHadoop system, R takes care of the data analysis operations and the preliminary functions, such as data loading, exploration, analysis, and visualization, while Hadoop takes care of parallel data storage as well as computation power over the distributed data. Prior to the advent of affordable big data technologies, analysis used to be run on limited datasets on a single machine. Advanced machine learning algorithms are most effective when applied to large datasets, and this is possible only with large clusters where the data can be stored and processed using distributed storage systems.
• 40. Datapedia 40 3.7 Sample Mock-ups We have chosen Windows 10 as the product to be evaluated in our prototype. We take the word or hashtag the user entered and use it to search for tweets that contain that specific term. We then classify these tweets as positive, negative, or neutral; this classification is based on a word-level sentiment lexicon in which each word has a score on a scale from -5 (very negative) to +5 (very positive). Finally, the results of the data analytics phase are presented to the user on a dashboard to help them make an informed decision. Figure 1: Home Page
  • 41. Datapedia 41 Figure 2: Sentiment Scores Figure 3: Map Dimension (Demographic)
• 42. Datapedia 42 Chapter 4: System Implementation and Management The proposed project consists of 4 main phases, as follows: 4.1 Phase 1: Data Collection and Preprocessing In this phase, we collected relevant data from social networks as well as product review sites. We first used different APIs to grab the data, then transformed it into R data frames so it could be easily parsed, accessed, and queried. Our goal was to collect millions of product-related records. The collected data includes key attributes such as: - "text": The text that will be analyzed later in the sentiment analysis phase. - "favorited": A Boolean attribute showing whether the tweet was favorited by any Twitter user. - "favoriteCount": How many times the tweet was favorited. - "created": The time at which the tweet was posted. - "id": The tweet ID. - "statusSource": The source of the tweet, i.e., whether it was posted from the website or from a mobile application. - "screenName": The handle (user name) of the user who wrote the tweet. - "retweetCount": How many times the tweet was retweeted. - "isRetweet": A Boolean attribute showing whether the tweet is itself a retweet of another user's tweet. - "retweeted": A Boolean attribute showing whether the tweet was retweeted by any Twitter user. - "longitude": Map coordinate giving the location of the user who posted the tweet. - "latitude": Map coordinate giving the location of the user who posted the tweet.
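A hedged sketch of this collection step using the twitteR package is shown below; the OAuth credential strings are placeholders and the search term is only an example. twListToDF() flattens the returned status objects into a data frame whose columns correspond to the attributes listed above.

# Sketch of the Twitter collection step with the twitteR package.
# All credential strings below are placeholders, not real keys.
library(twitteR)
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# The search API only reaches back about one week, hence the daily
# downloads described in the next section.
raw_tweets <- searchTwitter("windows10", n = 2000, lang = "en")

# Flatten the status objects into a data frame with columns such as text,
# favorited, favoriteCount, created, id, statusSource, screenName,
# retweetCount, isRetweet, retweeted, longitude and latitude.
tweets_df <- twListToDF(raw_tweets)

# Append today's batch to the growing historical dataset.
write.csv(tweets_df, file = paste0("tweets_", Sys.Date(), ".csv"), row.names = FALSE)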
• 43. Datapedia 43 We started this phase in October, and it was an ongoing phase: we kept downloading tweets from October until May 2015. One of the problems we faced was the limitation of the Twitter search API, which only returns tweets up to about one week old; you must be a Twitter search partner to request older tweets. To overcome this, we downloaded tweets on a daily basis to build our own historical dataset. By the end of May 2015, we had successfully downloaded 300,000 tweets. Another problem was that the data contained many commercial tweets and ads, such as promotions. To overcome this, we decided to collect pure datasets from sources that provide user reviews on specific products. We collected about 500 user reviews manually from websites that host real reviews from real users, such as GSMArena and Reevoo. We then realized it would take a long time to gather a decent number of reviews this way, so we built our own web crawler in Python to fetch the reviews from those sites automatically. 4.2 Phase 2: Sentiment Analysis. In this stage, the data collected in Phase 1 is classified as positive, negative, or neutral. This classification is based on a word-level sentiment lexicon. Natural language processing techniques are used in this stage to parse each sentence, understand it, and finally classify it into one of the predefined classes.
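A minimal sketch of this lexicon-based classification is shown below, including the simplified negation flip described in the next section. The lexicon entries and negation words are illustrative only; the real lexicon assigns each word a score between -5 and +5.

# Simplified sketch of lexicon-based tweet scoring with a negation flip.
# Lexicon entries and negation words below are illustrative examples.
lexicon   <- c(great = 3, excellent = 4, good = 2, bad = -2, disappointment = -3)
negations <- c("not", "no", "never", "dont", "cant")

score_tweet <- function(text) {
  words  <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
  score  <- 0
  negate <- FALSE
  for (w in words) {
    if (w %in% negations) { negate <- TRUE; next }   # flip later sentiment words
    s <- lexicon[w]
    if (!is.na(s)) score <- score + ifelse(negate, -1, 1) * unname(s)
  }
  score   # > 0 positive, < 0 negative, 0 neutral
}

score_tweet("The phone is not a disappointment")      # returns +3 (positive)
score_tweet("Excellent camera with video facility")   # returns +4 (positive)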
• 44. Datapedia 44 This was a crucial step, as its output is what ultimately drives the customer's decision about the product. Our algorithm was enhanced more than once to reach better accuracy. Our first algorithm was very basic and relied on simply matching the words of the tweet against the lexicon, and it led to poor results. We then identified the reasons behind these poor results, namely special cases that need to be handled, such as negation and sarcasm. Negation was handled in the algorithm by searching for negation terms during the sentiment scoring process; if one is found, the scores of the following words are automatically multiplied by -1 until the end of the statement is reached. In this way we achieved a much better accuracy score: we used the pure dataset collected in the data collection phase, labeled it manually to establish its overall sentiment, and then analyzed it with our algorithm. 4.3 Phase 3: Data Analytics. In this phase, the R language is used to handle the large data volumes and carry out the data analytics task. The classified data is the input to this module, and different techniques are used (and/or compared) to obtain the best analytical results. For big data analytics (which is a whole new process), we cannot rely on the usual way of executing the algorithm, as it would take too much time; even more time is needed just to fetch the data into memory, since R must operate on data held in memory, plus the time taken by the analytic process itself, which is substantial given everything it involves:
• 45. Datapedia 45 cleaning the tweets, splitting each tweet into words and looking those words up in the lexicon, and summing the resulting scores. This is where Hadoop comes in. We did not use Hadoop's Java API directly; we used Hadoop Streaming, a framework that gives developers the ability to execute Hadoop tasks in languages other than Java, such as C, C++, Python, and R. It therefore perfectly suits what we are looking for: combining the analytical power of R with the fault-handling capabilities of Hadoop. Hadoop Streaming is a Hadoop utility for running a MapReduce job with executable scripts as the Mapper and Reducer. It is similar to the pipe operation in Linux: the text input file is streamed to standard input (stdin), which is provided as input to the Mapper; the Mapper's standard output (stdout) is provided as input to the Reducer; finally, the Reducer writes the output to an HDFS directory. The main advantage of the Hadoop Streaming utility is that it allows both Java and non-Java MapReduce jobs to be executed over Hadoop clusters, and it also tracks the progress of running MapReduce jobs. Hadoop Streaming supports, among others, the Perl, Python, PHP, R, and C++ programming languages. To run an application written in another programming language, the developer just needs to translate the application logic into the Mapper and Reducer sections with key and value output elements.
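To make the stdin/stdout contract concrete, here is a hedged sketch of what a minimal pair of streaming scripts could look like in R (a plain word count rather than our actual sentiment job); the file names mapper.R and reducer.R are example names of our own.

# ---- mapper.R : Hadoop Streaming feeds each input split on stdin; the
# ---- mapper must write tab-separated "key<TAB>value" lines to stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (w in unlist(strsplit(tolower(line), "\\s+")))
    if (nzchar(w)) cat(w, "\t", "1", "\n", sep = "")
}
close(con)

# ---- reducer.R : streaming sorts the mapper output by key, so all values
# ---- for one key arrive on consecutive lines and can simply be summed.
con <- file("stdin", open = "r")
current_key <- NULL; total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  parts <- strsplit(line, "\t")[[1]]
  if (!is.null(current_key) && parts[1] != current_key) {
    cat(current_key, "\t", total, "\n", sep = "")   # emit the finished key
    total <- 0
  }
  current_key <- parts[1]
  total <- total + as.numeric(parts[2])
}
if (!is.null(current_key)) cat(current_key, "\t", total, "\n", sep = "")
close(con)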
• 46. Datapedia 46 The six main components of a Hadoop Streaming command are listed and explained as follows: • jar: This option runs the jar containing the classes that provide the streaming functionality for Java as well as non-Java Mappers and Reducers. It is called the Hadoop streaming jar. • input: This option specifies the location of the input dataset (stored on HDFS) for the Hadoop Streaming MapReduce job. • output: This option tells the Hadoop Streaming MapReduce job which HDFS output directory the results should be written to. • file: This option copies MapReduce resources such as the Mapper, Reducer, and Combiner to the compute nodes (TaskTrackers) so they are available locally. • mapper: This option identifies the executable Mapper file. • reducer: This option identifies the executable Reducer file. RHadoop is a collection of three R packages for providing large-data operations within an R environment. It was developed by Revolution Analytics, the leading commercial provider of software based on R. RHadoop ships as three main R packages: rhdfs, rmr, and rhbase, each offering different Hadoop features. • rhdfs is an R interface that exposes HDFS from the R console. As Hadoop MapReduce programs write their output to HDFS, it is very easy to access it by calling the rhdfs methods. The R programmer can easily perform read and write operations on distributed data files; under the hood, the rhdfs package calls the HDFS API to operate on data stored on HDFS. • rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment. The R programmer only needs to divide the application logic into map and reduce phases and submit it with the rmr methods. After that, rmr calls the Hadoop Streaming MapReduce API with several job
• 47. Datapedia 47 parameters such as the input directory, output directory, mapper, reducer, and so on, to run the R MapReduce job over the Hadoop cluster. • rhbase is an R interface for operating on the Hadoop HBase data source, which is stored on the distributed network and accessed via a Thrift server. The rhbase package provides several methods for initialization, read/write, and table manipulation operations. Installing the R packages: Several R packages need to be installed to help connect R with Hadoop. They can be installed by executing the following command in the R console: install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr', 'functional', 'devtools', 'plyr', 'reshape2')) • Setting environment variables: We can set these from the R console using the following code: ## Setting HADOOP_CMD Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop") ## Setting HADOOP_STREAMING Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar") Alternatively, we can set them from the command line before starting R, as follows: export HADOOP_CMD=/usr/local/hadoop export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-1.2.1.jar
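With the packages installed and the environment variables set, a job can be expressed with the rmr package as sketched below. This is a hedged illustration only: the function names (mapreduce, keyval, to.dfs, from.dfs) follow the rmr2 release of RHadoop, and the word-count logic is an example rather than our actual sentiment job.

# Minimal rmr2 sketch: a word count expressed as map and reduce functions.
# Illustrative only; our real job scores tweets rather than counting words.
library(rmr2)

# to.dfs() writes an R object to HDFS so the job can read it as its input.
tweets <- to.dfs(c("windows10 is great", "great update for windows10"))

wordcount <- mapreduce(
  input = tweets,
  map = function(k, v) {
    words <- unlist(strsplit(tolower(v), "\\s+"))
    keyval(words, rep(1, length(words)))     # emit (word, 1) pairs
  },
  reduce = function(word, counts) {
    keyval(word, sum(counts))                # sum the counts per word
  }
)

from.dfs(wordcount)   # pull the result back from HDFS into the R session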
• 48. Datapedia 48 In our project, as implied earlier, we are dealing with a text file that contains more than 300,000 tweets. During the data analytics process we put that file into the HDFS directory using the HDFS commands in Linux; the file itself is split into 20 parts, with a replication factor of one on a single-node cluster. We used 2 mappers in our project, which is enough for our Intel Core i7 CPU, meaning that two tasks can run simultaneously on a core since hyper-threading is enabled.
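We performed the upload step with the HDFS shell; for completeness, a hedged sketch of the equivalent operation from the R console via the rhdfs package is shown below. The paths are illustrative, and HADOOP_CMD must be set before the package is loaded.

# Equivalent of the HDFS shell step from inside R, via the rhdfs package.
# Paths below are illustrative only.
library(rhdfs)
hdfs.init()

hdfs.mkdir("/user/datapedia/input")                          # create the input directory
hdfs.put("tweets.txt", "/user/datapedia/input/tweets.txt")   # copy the local file to HDFS

hdfs.ls("/user/datapedia/input")   # confirm the file is now managed by HDFS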
• 49. Datapedia 49 Using plain R, it took more than 5 minutes just to load the data into memory and more than 20 minutes to process it and produce the results; with Hadoop, it took only 15 minutes to get all the results, in a much simpler way.
• 50. Datapedia 50 4.4 Phase 4: Data Presentation. This is the final step, where the results obtained are presented to the user in a user-friendly manner. Dashboards are used to present the analytics results, maps are used to represent geographically-related information, and pie charts, bar charts, and similar visualizations are used for the analytical data. The user is given the freedom to drill down and roll up through the analytics granularity so that (s)he can obtain a global view before making decisions. For business analysis reports, we also provide some useful analytics, such as: 1- Most retweeted tweets. 2- Most favorited tweets. 3- Most frequently accompanying hashtags. These come in handy for finding trends or topics related to the user's search text. We used PHP, Bootstrap, and JavaScript to implement the user interface, giving the user the option to enter a preferred search term in the real-time section and the option to upload a file of tweets in the historical section, as we had to ensure that our application stays generic and error free.
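The three report items above can be computed directly from the collected data frame before being handed to the dashboard. The sketch below assumes a tweets_df data frame with the columns listed in Phase 1 (text, retweetCount, favoriteCount, screenName); the object name is an assumption for the example.

# Sketch of the business-report summaries, assuming tweets_df holds the
# columns collected in Phase 1 (text, retweetCount, favoriteCount, ...).

# 1) Most retweeted tweets.
top_retweeted <- head(tweets_df[order(-tweets_df$retweetCount),
                                c("screenName", "text", "retweetCount")], 10)

# 2) Most favorited tweets.
top_favorited <- head(tweets_df[order(-tweets_df$favoriteCount),
                                c("screenName", "text", "favoriteCount")], 10)

# 3) Most frequently accompanying hashtags, extracted from the tweet text.
hashtags     <- unlist(regmatches(tweets_df$text, gregexpr("#\\w+", tweets_df$text)))
top_hashtags <- head(sort(table(tolower(hashtags)), decreasing = TRUE), 10)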
• 52. Datapedia 52 Figure 4: Commercial tweets. Chapter 5: System Evaluation 5.1 Data Evaluation As mentioned earlier, our data is fetched directly from the Twitter search API, but we noticed several problems in this data that affected the performance of our algorithm and of the system overall. One of the problems we faced is that the data contains a lot of ads and commercial tweets, as shown in Figure 4. To overcome this problem we decided to collect pure datasets from sources that provide user reviews on specific products: we collected about 500 user reviews manually from websites that host real reviews from real users, such as GSMArena and Reevoo.
• 53. Datapedia 53 Screenshot of the pure dataset:
• 54. Datapedia 54 Figure 5: Algorithms Comparison 5.2 Algorithm Evaluation Our main objective in this project was the accuracy of the sentiment analysis scoring algorithm, and the algorithm was enhanced more than once to reach better accuracy. Our first algorithm was very basic and relied on simply matching the words of the tweet against the lexicon, which led to poor results. We then identified the reasons behind these poor results, namely special cases that need to be handled, such as negation and sarcasm. After handling them, we achieved a much better accuracy score: using the pure dataset described in the data evaluation section, we labeled it manually to establish its overall sentiment, analyzed it with each version of our algorithm, and compared the resulting scores, as shown in Figure 5.
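The accuracy figures reported in this chapter come from comparing the algorithm's labels against our manual labels on the pure dataset. A minimal sketch of that comparison is shown below; the column names manual_label and predicted_label, and the toy values, are assumptions made for the example.

# Sketch of the accuracy comparison against the manually labelled reviews.
# Column names and values are toy examples, not our real evaluation data.
reviews <- data.frame(
  manual_label    = c("positive", "negative", "neutral", "positive"),
  predicted_label = c("positive", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

accuracy <- mean(reviews$predicted_label == reviews$manual_label)
cat(sprintf("Accuracy: %.1f%%\n", 100 * accuracy))   # 75.0% on this toy sample

# A confusion table also makes the per-class behaviour visible.
table(manual = reviews$manual_label, predicted = reviews$predicted_label)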
• 55. Datapedia 55 5.3 System Evaluation Finally, we wanted to benchmark our algorithm's performance against other algorithms. We chose Stanford's recursive deep model, as it is very popular and widely used for sentiment analysis. The two algorithms were tested on the same data that we collected and manually labeled in the data evaluation phase. As shown in Figure 6, our algorithm scored 61.7% accuracy, while Stanford's recursive deep model scored 55.8%. Here are some examples from the two algorithms to illustrate this benchmark: Example 1:
• 56. Datapedia 56 The text says: "The camera quality and options got better, overall, the phone is not a disappointment". The review is positive. Stanford's recursive deep model scored 0 for this sentence, classifying it as neutral, while our algorithm scored +4, classifying it as positive. Example 2: The text says: "Excellent camera with video facility". The review is positive. Stanford's recursive deep model scored 0 for this sentence, classifying it as neutral.
• 57. Datapedia 57 Our algorithm scored +3, classifying it as positive.
• 58. Datapedia 58 References - Book 1: Big Data Analytics with R and Hadoop. Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK. First published: November 2013. Production reference: 1181113. ISBN 978-1-78216-328-2. - The Comprehensive R Archive Network (CRAN): http://cran.rstudio.com/ - Course (1): Swirl: Learn R, in R. http://swirlstats.com/ - Course (2): R Programming. By: Johns Hopkins University. https://www.coursera.org/course/rprog - Course (3): Getting and Cleaning Data. By: Johns Hopkins University. https://www.coursera.org/course/getdata - Course (4): Exploratory Data Analysis. By: Johns Hopkins University. https://www.coursera.org/course/exdata
• 59. Datapedia 59 - Discussion article (1): What are the best supervised learning algorithms for sentiment analysis in text? http://www.quora.com/What-are-the-best-supervised-learning-algorithms-for-sentiment-analysis-in-text - Article (1): Why Sentiment Analysis Is Essential for Small Businesses. By: Tara Hornor. http://graziadiovoice.pepperdine.edu/why-sentiment-analysis-is-essential-for-small-businesses/ - Article (2): The 37 best tools for data visualization. By: Brian Suda and Sam Hampton-Smith. http://www.creativebloq.com/design-tools/data-visualization-712402