2. Index
Why Data Mining?
What Is Data Mining?
Data Mining: On What Kind of Data?
Data Classification
What is Sentiment Classification?
Importance of Sentiment classification
Twitter for Sentiment Classification
Problem Statement
Goal of this Classifications
Method to be used
Conclusion
3. Why Data Mining?
Data explosion problem
Automated data collection tools and mature database technology lead to
tremendous amounts of data stored in databases, data warehouses and other
information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
4. What Is Data Mining?
Data mining (knowledge discovery in databases)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
5. Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
7. Data Classification
Classification consists of assigning a class label to a set of
unclassified cases.
Supervised Classification
The set of possible classes is known in advance.
Unsupervised Classification
The set of possible classes is not known in advance. After classification we can
try to assign a name to each class. Unsupervised classification is also called
clustering.
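To make the distinction concrete, here is a minimal sketch (not from the slides; the data and class names are illustrative): a supervised nearest-centroid classifier, where labels are known in advance, next to a crude two-means clustering routine, where groups are discovered from the data alone.

```python
# Illustrative sketch: supervised vs. unsupervised classification on tiny 1-D data.

def nearest_centroid_classify(train, point):
    """Supervised: the set of possible classes is known in advance."""
    # Compute the mean (centroid) of each labelled class.
    centroids = {}
    for label in {l for _, l in train}:
        values = [x for x, l in train if l == label]
        centroids[label] = sum(values) / len(values)
    # Assign the new point to the class with the closest centroid.
    return min(centroids, key=lambda l: abs(centroids[l] - point))

def two_means_cluster(points, iters=10):
    """Unsupervised (clustering): groups are discovered, then named afterwards."""
    c1, c2 = min(points), max(points)          # crude initial centroids
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

train = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
print(nearest_centroid_classify(train, 8.5))            # -> high
print(two_means_cluster([1.0, 2.0, 9.0, 10.0]))         # -> ([1.0, 2.0], [9.0, 10.0])
```

Real systems would use many features and far more data, but the contrast is the same: the first function needs labelled examples, the second does not.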
8. What is Sentiment Classification?
The process of computationally identifying and categorizing
opinions expressed in a piece of text.
The goal is to determine whether the writer's attitude towards a particular topic,
product, etc., is positive, negative, or neutral.
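A toy sketch of this idea (purely illustrative; the word lists below are hypothetical assumptions, whereas real systems learn them from data): count positive and negative words and map the score to one of the three labels.

```python
# Toy lexicon-based sentiment classifier; the lexicons are illustrative only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def classify_sentiment(text):
    words = text.lower().split()
    # Net score: positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this great product"))      # -> positive
print(classify_sentiment("terrible battery and bad screen"))  # -> negative
```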
9. Importance of Sentiment classification
Adjust marketing strategy
Measure ROI of your marketing campaign
Enhance product quality
Improve customer service
Crisis management
Lead generation
Sales Revenue
10. Using Twitter for Sentiment Classification
Most Popular microblogging site
Short text messages of up to 140 characters
328 million active users
500 million tweets are generated every day
Twitter's audience varies from the common man to celebrities
Users often discuss current affairs and share personal views on various subjects
Tweets are short in length and hence relatively unambiguous
Last updated: 8/12/17 Source: https://www.omnicoreagency.com/twitter-statistics
11. Problem Statement
The problem at hand consists of two subtasks
– Emoticon-Hashtag Level Sentiment Analysis
Given a message containing hashtags and emoticons as instances of a word
or a phrase, determine whether each such instance is positive, negative or
neutral in that context.
– Sentence Level Sentiment Analysis
Given a message containing a sentence, a word or a phrase,
determine whether that instance is positive, negative or neutral in that
context.
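For the emoticon-hashtag level subtask, hashtags and emoticons inside a tweet can be read as noisy sentiment labels. A hedged sketch of that idea (the marker sets below are illustrative assumptions, not the paper's actual label sets):

```python
# Treat hashtags and emoticons in a tweet as noisy sentiment labels.
# The two marker sets are hypothetical examples.
POS_MARKERS = {"#happy", "#love", ":)", ":-)"}
NEG_MARKERS = {"#sad", "#angry", ":(", ":-("}

def label_from_markers(tweet):
    tokens = tweet.lower().split()
    pos = sum(t in POS_MARKERS for t in tokens)
    neg = sum(t in NEG_MARKERS for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(label_from_markers("Exams are over #happy :)"))   # -> positive
print(label_from_markers("stuck in traffic :("))        # -> negative
```

Labels obtained this way are typically used as cheap training data for a classifier that then handles tweets without any markers.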
12. Goals of this Classification
There are two goals to be achieved
Large Scale Implementations for Sentiment Classification
Time efficiency for Sentiment Classification
13. Method to be used
We develop two systems
MapReduce
Apache Spark Framework
The task is inspired by the MDPI work of Andreas Kanavos (2016), Task: Twitter Sentiment Classification
14. Method to be used
MapReduce
A programming model for processing large datasets with a distributed, parallel algorithm
It consists of two main procedures
- Map and Reduce
15. Method to be used
Apache Spark Framework
Apache Spark is an open source big data processing framework built around
speed and ease of use
Comprehensive, unified framework
Up to 100 times faster in memory and 10 times faster even when running on disk
It lets you quickly write applications in Java, Scala, or Python
16. Conclusion
Data mining is an effective way to discover necessary information, and data
classification makes it more valuable. Hopefully, for huge amounts of data, the
MapReduce model and the Spark framework will help expand the scalability of data
processing and reduce execution time.
17. References
Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen, “Scalable
Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier”,
2013 IEEE International Conference on Big Data
Roseline Antai, “Sentiment Classification Using Summaries: A Comparative
Investigation of Lexical and Statistical Approaches”, 2014 6th Computer
Science and Electronic Engineering Conference (CEEC)
R. Suresh Ramanujam, J. Nivedha and J. Kokila, “Sentiment Analysis
Using Big Data”, 2015 International Conference on Computation of Power,
Energy, Information and Communication
Divya Sehgal and Ambuj Kumar Agarwal, “Sentiment Analysis of Big Data Applications
Using Twitter Data with the Help of HADOOP Framework”
Ravi Vatrapu, Raghava Rao Mukkamala, Abid Hussain and Benjamin Flesch, “Social Set
Analysis: A Set Theoretical Approach to Big Data Analysis”, IEEE, April 28, 2015
18. References
Pragya Tripathi, Santosh Kr Vishwakarma and Ajay Lala, “Sentiment Analysis of English
Tweets Using RapidMiner”, 2015 International Conference on Computational Intelligence
and Communication Networks
Lukas Povoda, Radim Burget and Malay Kishore Dutta, “Sentiment Analysis Based on
Support Vector Machine and Big Data”
Beiming Sun and Vincent TY Ng, “Analyzing Sentimental Influence of Posts on Social
Networks”, 2014 IEEE
Li Bing and Keith C.C. Chan, “A Fuzzy Logic Approach for Opinion Mining on Large
Scale Twitter Data”, 2014 IEEE/ACM
Andreas Kanavos, Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis,
Dimitrios Tsolis and Giannis Tzimas, “Large Scale Implementations for Twitter
Sentiment Classification”, MDPI, 4 March 2017
Database
Used for Online Transactional Processing (OLTP) but can be used for other purposes such as Data Warehousing. This records the data from the user for history.
The tables and joins are complex since they are normalized (for an RDBMS). This is done to reduce redundant data and to save storage space.
Entity-Relationship modeling techniques are used for RDBMS database design.
Optimized for write operations.
Performance is low for analysis queries.
Data Warehouse
Used for Online Analytical Processing (OLAP). This reads the historical data for the Users for business decisions.
The Tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
Data – Modeling techniques are used for the Data Warehouse design.
Optimized for read operations.
High performance for analytical queries.
Is usually a Database.
It's important to note as well that Data Warehouses could be sourced from zero to many databases.
non-trivial = not insignificant; implicit = inherent, embedded
Association: Association data mining detects recurring themes in databases, identifies relationships between them and develops a pattern of these relationships. It will then use these patterns as a reference to predict future behavior. Most notably, very complex versions of association data mining are used by Netflix to develop their entertainment recommendations and by Amazon to develop product recommendations during purchases.
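At its simplest, association mining starts by counting which items co-occur in the same transactions. A toy sketch of that first step (the transaction data is illustrative; real systems like Apriori then derive rules with confidence scores):

```python
from itertools import combinations
from collections import Counter

# Count item pairs that co-occur in transactions; keep pairs above a
# minimum support threshold. The shopping-basket data is made up.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def frequent_pairs(transactions, min_support=2):
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(transactions))
# every pair here co-occurs in 2 of the 4 baskets
```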
Clustering: Cluster data mining is essentially the stepping stone towards being able to use classification data mining. This technique classifies previously unorganized data into categories that it creates. This can be extremely useful because the software has the capability of detecting very minute similarities or differences that a human analyst would likely not notice and therefore create more accurate/useful categories.
Classification / Categorization: Classification data mining is used to categorize new data into preexisting categories. It does this by examining the data that has previously been classified, learning the rules of classification and applying those rules to new data.
A transactional database is a DBMS where write transactions on the database are able to be rolled back if they are not completed properly (e.g. due to power or connectivity loss). Most modern relational database management systems fall into the category of databases that support transactions.
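The rollback behaviour described above can be demonstrated with SQLite, which supports transactions (a minimal sketch; the table and values are illustrative):

```python
import sqlite3

# In-memory database with one committed row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

try:
    with conn:  # the 'with' block is a single transaction
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass  # the incomplete transaction is rolled back automatically

balance = conn.execute("SELECT balance FROM accounts").fetchone()[0]
print(balance)  # -> 100: the partial update did not persist
```

Because the transaction did not complete, the database returns to its last committed state, exactly the guarantee described above.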
RIO= Return on Investment
The algorithm exploits all texts, hashtags and emoticons inside a tweet, as sentiment labels, and proceeds to a classification method of diverse sentiment types in a parallel and distributed manner
The sentiment analysis tool is based on Machine Learning methodologies alongside Natural Language Processing techniques and utilizes Apache Spark's machine learning library, MLlib
MapReduce is a programming model that enables the processing of large datasets using a distributed and parallel algorithm. A MapReduce program consists of two main procedures, Map() and Reduce() respectively, and is executed in three steps: Map, Shuffle and Reduce
In the Map phase, input data is partitioned and each partition is given as an input to a worker that executes the map function. Each worker processes the data and outputs key-value pairs. In the Shuffle phase, key-value pairs are grouped by key and each group is sent to the corresponding Reducer
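The three phases above can be simulated on a single machine. A sketch using the classic word-count example (the input partitions are illustrative; a real cluster runs each phase on many workers in parallel):

```python
from collections import defaultdict

def map_phase(partition):
    # Map: emit a (word, 1) key-value pair for every word in the partition.
    return [(word, 1) for line in partition for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key so each key goes to one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

partitions = [["big data big"], ["data mining"]]
pairs = [p for part in partitions for p in map_phase(part)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # -> {'big': 2, 'data': 2, 'mining': 1}
```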
Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical MapReduce program cannot provide, Spark is the way to go.
What is Spark
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project.
Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm.
First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data etc) as well as the source of data (batch v. real-time streaming data).
Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.
Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators. And you can use it interactively to query data within the shell.
In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.