5/25/2020 Rubric Detail – 31228.202030
https://ucumberlands.blackboard.com/webapps/rubric/do/course/gradeRubric?mode=grid&isPopup=true&rubricCount=1&prefix=_843783_1&course_i… 1/4
Rubric Detail
A rubric lists grading criteria that instructors use to evaluate student work. Your instructor linked a rubric to this item
and made it available to you. Select Grid View or List View to change the rubric's layout.
Name: ITS836 (8 Week) Research Paper Rubric
Description: Please use this rubric for grading research papers
Requirements
No Evidence, 0 (0.00%) points: No requirements are met.
Limited Evidence, 3 (3.00%) points: Includes a few of the required components as specified in the assignment.
Below Expectations, 7 (7.00%) points: Includes some of the required components as specified in the assignment.
Approaches Expectations, 11 (11.00%) points: Includes most of the required components as specified in the assignment.
Meets Expectations, 15 (15.00%) points: Includes all of the required components as specified in the assignment.
Content
No Evidence, 0 (0.00%) points: Fails to provide enough content to show a demonstration of knowledge.
Limited Evidence, 3 (3.00%) points: Major errors or omissions in demonstration of knowledge.
Below Expectations, 7 (7.00%) points: Some significant but not major errors or omissions in demonstration of knowledge.
Approaches Expectations, 11 (11.00%) points: A few errors or omissions in demonstration of knowledge.
Meets Expectations, 15 (15.00%) points: Demonstrates strong or adequate knowledge of the materials; correctly represents knowledge from the readings and sources.
Critical Analysis
No Evidence, 0 (0.00%) points: Fails to provide a critical thinking analysis and interpretation.
Limited Evidence, 5 (5.00%) points: Major errors or omissions in analysis and interpretation.
Below Expectations, 10 (10.00%) points: Some significant but not major errors or omissions in analysis and interpretation.
Approaches Expectations, 15 (15.00%) points: A few errors or omissions in analysis and interpretation.
Meets Expectations, 20 (20.00%) points: Provides a strong critical analysis and interpretation of the information given.
Problem Solving
No Evidence, 0 (0.00%) points: Fails to demonstrate problem solving.
Limited Evidence, 5 (5.00%) points: Major errors or omissions in problem solving.
Below Expectations, 10 (10.00%) points: Some significant but not major errors or omissions in problem solving.
Approaches Expectations, 15 (15.00%) points: A few errors or omissions in problem solving.
Meets Expectations, 20 (20.00%) points: Demonstrates strong or adequate thought and insight in problem solving.
Sources/Examples
No Evidence, 0 (0.00%) points: Source or example selection and integration of knowledge from the course is clearly deficient.
Limited Evidence, 2 (2.00%) points: Sources or examples meet required criteria and are poorly chosen to provide substance and perspectives on the issue under examination.
Below Expectations, 4 (4.00%) points: Sources or examples meet required criteria but are less than adequately chosen to provide substance and perspectives on the issue under examination.
Approaches Expectations, 7 (7.00%) points: Sources or examples meet required criteria but are less than adequately chosen to provide substance and perspectives on the issue under examination.
Meets Expectations, 10 (10.00%) points: Sources or examples meet required criteria and are well chosen to provide substance and perspectives on the issue under examination.
Organization, Grammar, Style
No Evidence, 0 (0.00%) points: Project is not organized or well written, and is not in proper paper format. Poor-quality work; unacceptable in terms of grammar and spelling.
Limited Evidence, 2 (2.00%) points: Project is poorly organized; does not follow proper paper format. Inconsistent to inadequate sentence and paragraph development; numerous errors in grammar and spelling.
Below Expectations, 4 (4.00%) points: Project is adequately organized and written, and is in proper format as outlined in the assignment. Reasonably good sentence and paragraph structure; significant number of errors in grammar and spelling.
Approaches Expectations, 7 (7.00%) points: Project is fairly well organized and written, and is in proper format as outlined in the assignment. Reasonably good sentence and paragraph structure; significant number of errors in grammar and spelling.
Meets Expectations, 10 (10.00%) points: Demonstrates strong or adequate thought and insight in problem solving.
Proper use of APA formatting
No Evidence, 0 (0.00%) points: Numerous errors in APA formatting, with more than eight significant errors.
Limited Evidence, 2 (2.00%) points: Numerous errors in APA formatting, with more than five significant errors.
Below Expectations, 4 (4.00%) points: Significant errors in APA formatting, with four to five significant errors.
Approaches Expectations, 7 (7.00%) points: Sources or examples meet required criteria but are less than adequately chosen to provide substance and perspectives on the issue under examination.
Meets Expectations, 10 (10.00%) points: Sources or examples meet required criteria and are well chosen to provide substance and perspectives on the issue under examination.
Learning Analytics or Educational Data Mining? This is the Question...
Daniela Marcu
Ștefan cel Mare University of Suceava
Str. Universității 13, Suceava 720229
Phone: 0230 216 147
[email protected]
Mirela Danubianu
Ștefan cel Mare University of Suceava
Str. Universității 13, Suceava 720229
Phone: 0230 216 147
[email protected]
Abstract
In full expansion, a vital area such as education could not remain indifferent to the use of
information and communication technology. Over the past two decades we have witnessed the
emergence and development of e-learning systems, the proliferation of MOOCs, and generally the
rise of Technology Enhanced Education. All of these contributed to the generation and storage of
unprecedented volumes of data concerning all areas of learning.
At the same time, domains such as data mining and big data
analytics have emerged and
developed. Their applications in education have spawned new
areas of research such as educational
data mining or learning analytics.
As an interdisciplinary research area, Educational Data Mining (EDM) aims to explore data
from educational environments in order to build models that lead to a better understanding of
students' behavior and results. In fact, EDM is a complex process that consists of a few steps
grouped in three stages: data preprocessing, modelling and postprocessing. It transforms raw data
from educational environments into useful information that could positively influence the
educational process.
According to the Society for Learning Analytics Research (SoLAR), which took over the
wording of the first International Conference on Learning Analytics and Knowledge, learning
analytics is "the measurement, collection, analysis and reporting of data about learners and their
contexts for purposes of understanding and optimizing learning and the environments in which it
occurs" (Siemens, 2011).
This paper proposes a comparative study of the two concepts: EDM and learning analytics.
Because some voices in the scientific community claim that the two terms refer to the same
thing, we want to emphasize the similarities and differences between them, and how each one
can serve to raise the quality of educational processes.
Keywords: EDM; LA; Data Mining; Education.
1. Introduction
The educational community has an interest in the great potential of education. Why are
researchers so enthusiastic about this? The answer is simple. Seeing the impact of applying data
mining to large data volumes from areas such as the business environment, social media, and
other scientific areas, we can imagine the benefits for the education system. If we could adapt to
the educational environment the pattern-finding methods used to analyze the online activity of
clients and social media users, we could obtain evidence that reflects more closely the reality of
the activities in the education system.
The widespread use of computer-based learning in pre-university education and the
development of Web-based courses are additional reasons for EDM and LA research.
Designing educational policies based on practical evidence
provided by researchers can
bring benefits to the educational system.
BRAIN – Broad Research in Artificial Intelligence and Neuroscience
Volume 10, Special Issue 2 (October, 2019), ISSN 2067-3957
The exploitation of large volumes of data from different
domains is done using specific
techniques and methods. It helps to develop tools to facilitate
progress in these areas.
The science of extracting useful information from large volumes
of data is called Data
Mining (DM) (Hand, Mannila & Smyth, 2001).
The concept is based on three key areas: statistics, artificial
intelligence and machine
learning (Figure 1).
Figure 1. Data Mining
Initially, DM used statistical algorithms. Specific techniques
such as decision trees,
association rules, clustering, artificial neural networks, and
others have been developed (Șușnea,
2012).
Applying data mining methods to educational system data in order to build models that help
us better understand students' behavior and outcomes is called Educational Data Mining (EDM).
Since data and issues in education differ from those in other areas, classical DM methods have
been improved and supplemented with EDM-specific methods (Romero & Ventura, 2007).
According to some authors, there are four areas of application of EDM aimed at: improving
student modeling and domain modeling, e-learning and scientific research (Baker, 2012).
In order to better understand learning, data about pupils and the educational environment
is measured, collected and analyzed. This is learning analytics, a field related to EDM. Among
the Learning Analytics (LA) methods we can
list:
Buckingham Shum, 2012).
In the following sections we propose to detail relevant aspects
about EDM and LA in order
to provide viable arguments in a comparative study of the two
concepts.
2. Educational Data Mining
Over the past 10 years, the field of research that aims to exploit the unique types of data from
education has developed considerably at the international level. In 2011, in Massachusetts, USA,
the International EDM Working Group (established in 2007) created the International Society for
EDM (online: http://educationaldatamining.org/about/). Romania is, however, at a pioneering
stage in EDM. There is currently a growing interest in using computers in learning and in
Web-based training. With the rapid increase in the volume of learning software resources, the
Romanian educational system also accumulates huge amounts of data from students, teachers,
parents, libraries, secretariats, etc. Extracting the information needed to build models that
improve the quality of managerial decisions has become one of the greatest challenges of the
present.
Traditional research in the field of education is time-consuming and often wasteful of
material resources. Developing an experimental study, such as one on combating school
absenteeism, first involves the selection of schools, teachers and pupils. Next comes the
definition of strategies that lead to identifying the sources of school stress, increasing the
motivation of students to attend classes, trust in school and family, and so on. However, such
studies depend on context, class, geography, economic development and teacher-student
relationships. Changing any parameter can lead to very different conclusions. New factors that
demotivate students towards school, which could not be taken into consideration earlier, may
soon appear. Conducting new traditional studies on this topic requires considerable time.
By comparison, EDM proves to be more efficient. Analyzing the data that already exists in
the educational system with specific EDM methods allows new models to be identified for new
contexts. An enormous advantage is that the same methods can be applied to different data,
generating specific results without the need for new analysis strategies.
More specifically, let's take the example of a course designed for web-based training
(Romero, Ventura & De Bra, 2004). Traditionally, the effectiveness of a course is evaluated by
analyzing the results obtained by students upon completion of the course, which does not
necessarily lead to the improvement of the material or of the teaching methods and tools used for
future versions of the course. In fact, in the Romanian pre-university system, educational
programs and educational resources are not updated as regularly as society expects.
What could knowledge obtained through EDM data exploitation offer? EDM methods aim at
discovering correlation rules between course components (content, questions, various activities)
and student activities. In "Knowledge Discovery with Genetic Programming for providing
feedback to the courseware author", C. Romero, S. Ventura and P. De Bra describe the four main
steps in building software based on EDM (Romero, Ventura & De Bra, 2004): development, use,
knowledge discovery, and improvement.
Another classification has three stages: preprocessing, data exploitation and postprocessing
[3]. The cycle of these steps is illustrated in Figure 2.
Figure 2. Stages of the process of converting data into information
If we refer again to the analysis of the efficiency of a course, in the first stage,
preprocessing, various operations are performed, such as:
formation on
pedagogical and methodological
aspects
time spent in the course, the
sections visited, the scores obtained and other interactions
appropriate for processing.
In the next step, EDM-specific algorithms are applied to obtain
different correlation rules.
The models will provide information in different formats for
analysis: numerical results of the
coefficients, tables, diagrams, correlation matrices (an example
is illustrated in Appendix 1 -
Correlation matrix obtained with the DataLab application based
on the results of the Olympiad of
computer science).
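As a minimal sketch of how such a correlation matrix could be produced, the Python example below computes Pearson correlation coefficients between a few course indicators. The indicator names and values are invented for illustration and are not taken from the paper or from the DataLab application.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical preprocessed course log: one value per student.
course_log = {
    "minutes_in_course": [30, 45, 60, 90, 120],
    "sections_visited": [2, 3, 3, 6, 7],
    "final_score": [4, 6, 5, 8, 10],
}

matrix = correlation_matrix(course_log)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

A coefficient close to 1 or -1 only signals a linear relationship between two indicators; interpreting it remains the task of the teacher or course author.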
One of the most important rules for discovering knowledge is if-else. Several such rules can
be defined in EDM: Association, Classification and Prediction (Klosgen & Zytkow, 2002).
The teacher will examine the results of the analyses and assess the degree to which the
initial goals have been achieved.
Depending on the conclusions, the teacher may decide to improve the course and resume its
evaluation process. This may prove to be a difficult process because opinions can differ
significantly from one teacher to another in relation to the material and to the way the course
interacts with the student.
3. Methods of data exploitation
There is currently a wide variety of methods for exploiting data in the education system.
These can be grouped into two broad categories according to the way they achieve their
objectives:
ification, Regression, Outlier
Detecting
Discovery of data for human
judgment (Sasu, 2014).
Many of these are general DM methods: prediction, classification, clustering, text mining
and others. But there are also specific EDM methods such as nonnegative matrix factorization
and Knowledge tracing (KT) (Romero & Ventura, 2012). Here are some of these:
Prediction
The method can be used in education to predict students' behavior and outcomes. It is based
on the creation of predictive models. In the training phase, the models learn to predict one
variable (the predicted variable) from a combination of other variables, called predictors. Once
the training phase is completed, the models can be applied to the data sets for which predictions
are needed. A well-known study is that of Baker, Gowda and Corbett, "Automatically detecting
the student's preparation for future learning: help use is key" (Baker, Gowda & Corbett, 2011).
The authors create a tool for automatically predicting a student's future performance by
establishing positive or negative correlations between various features such as: student test
results, time spent responding, time elapsed between receiving a hint and typing the answer, and
others. The tool is tested on one group of students and then applied to another group. The results
are then compared to those obtained using the Bayesian Knowledge Tracing (BKT) model.
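The train-then-apply workflow described above can be sketched with a one-predictor least-squares model. This is only an illustration under invented data (hints requested versus post-test score); it is not the detector built by Baker, Gowda and Corbett, and all names in it are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new predictor value."""
    slope, intercept = model
    return slope * x + intercept

# Training phase: hypothetical first group of students.
train_hints = [0, 1, 2, 3, 4]             # predictor: hints requested
train_scores = [9.0, 8.0, 7.0, 6.0, 5.0]  # predicted variable: post-test score
model = fit_line(train_hints, train_scores)

# Application phase: a second group whose scores are not yet known.
new_group_hints = [1, 5]
predictions = [predict(model, h) for h in new_group_hints]
print(predictions)
```

In practice the predictions for the second group would be compared with the scores those students actually obtain, which is how the quality of the model is judged.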
Classification
The method involves building a predictive model. The data in the training set is
characterized by certain attributes. The model must assign each record to a class based on its set
of attributes. Suppose we built educational software in the form of an interactive game on a
given theme. Based on user attributes such as age, gender, geographic area, time taken to
complete the game, and number of attempts, we can build a classifier and determine which class
a user belongs to. The model will learn to identify students. The analyses can provide
information on whether this educational method should be used for certain age groups, interests
and levels of education.
Methods that use classification include decision trees, neural networks, Bayesian
classifiers, and others.
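To make the idea concrete, here is a small sketch of a nearest-centroid classifier for the interactive-game scenario. The technique is chosen for brevity and is not one of the methods the paper lists; the attributes (seconds to finish, number of attempts) and the class labels are assumptions made for this example.

```python
from math import dist  # available from Python 3.8

def train_centroids(samples):
    """samples: list of (features, label). Returns label -> mean feature vector."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def classify(centroids, features):
    """Assign the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical training set: (seconds to finish the game, attempts) -> class.
training = [
    ((120, 1), "fast"),
    ((150, 2), "fast"),
    ((400, 5), "needs_support"),
    ((450, 6), "needs_support"),
]

centroids = train_centroids(training)
label = classify(centroids, (140, 2))
print(label)
```

With real data the attributes would normally be scaled first, since a feature measured in seconds would otherwise dominate the distance.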
Clustering
The method involves building models that group data according to certain similarities.
For the model to provide quality results, similarities within a class must be maximized and
similarities between classes minimized.
The use of this method in Romanian high school education could aim at grouping pupils
according to their learning style (auditory, visual, practical or kinesthetic) based on the analysis
of their behavior in relation to certain educational products and of pupils' characteristics. The
prediction of such a model could lead to an effective recommendation of how to learn
educational content. Thus, the instructional process could be carried out efficiently in relation to
the learning particularities of each student. At present, there is an attempt to conduct lessons in a
way appropriate to the
students' learning styles, but in reality the identification of learning styles is superficial. The
results of the questionnaires are attached to the class catalog, but in most cases this does not lead
to improvements in the teaching methods and techniques used in the lesson. In the absence of
clear alternatives, the teacher has to improvise.
The method is successfully used in the detection of plagiarism
(Text Mining) and is also
applied in the educational sphere.
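A minimal sketch of clustering is given below, using k-means, one common algorithm that the paper does not name explicitly. The data points (minutes spent on video material, minutes spent on hands-on exercises) and the choice of two starting centroids are assumptions made for illustration only.

```python
from math import dist  # available from Python 3.8

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-center."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical usage log per pupil: (minutes on videos, minutes on exercises).
pupils = [(50, 5), (55, 10), (60, 8), (5, 50), (10, 55), (8, 60)]

# Starting centroids are picked by hand here to keep the example deterministic.
centroids, clusters = k_means(pupils, centroids=[pupils[0], pupils[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The two resulting groups could then be read as pupils who prefer visual material and pupils who prefer practical work, although such a reading would need to be validated by a teacher.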
Outlier Detection
The method involves creating models that detect data whose features differ from those of
the rest. In Romanian education, this method could be used to detect students who have problems
assimilating content, or those with aberrant behavior.
In general, case studies use more than one EDM method. Outlier Detection methods can
be used, for example, together with data clustering techniques and decision tree classification, as
presented in the study "Evaluation of student performance: an outlier detection perspective"
(Ajith, Sai & Tejaswi, 2013). The study aims to identify learners with special learning needs in
order to reduce the school failure rate. Input data are collected from students' participation in
lessons, tests, and grades on initial tests. In order to achieve the proposed objective, the authors
try to find models for classifying students that will be helpful in setting up study groups.
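One simple way to flag such cases is a z-score test, sketched below. This is an illustrative assumption and not the procedure used by Ajith, Sai and Tejaswi; the grades and the threshold of two standard deviations are invented.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the indices of values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical initial-test grades for one class (scale 1 to 10).
grades = [7.5, 8.0, 6.5, 7.0, 8.5, 7.5, 2.0, 8.0]

outliers = find_outliers(grades)
print(outliers)  # indices of pupils whose grade deviates strongly from the class
```

A flagged pupil is only a candidate for closer attention; the grade alone does not say whether the cause is a learning difficulty, an absence, or a data entry error.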
At present, in Romania, students in state high schools do not have the opportunity to study
the course material in groups other than the classes they belong to. Moreover, pupils diagnosed
as having special educational needs attend classes together with their other colleagues. The
teachers create special programs for them. The courses are then held under the guidance of a
single teacher who does not have any pedagogical and methodical experience related to this
learning situation! There are special requirements for conducting the educational process in this
way. It is based on grouping students within the same educational space and the same timeframe
to go through different course materials. In the absence of a proper classification, of alternative
methods and means, and of teachers with such experience, things happen in a manner that only
more or less leads to the best results.
Discovery with Models
Discovery with Models is the fifth category presented in Baker's Taxonomy (Baker, 2012).
It is also one of the most widely used methods of data exploitation in the field of education. It is
based on the use of a previously validated model as a component in analyses that use prediction
or relationship mining in new contexts (Baker & Yacef, 2009). In this way, information can be
obtained on the educational materials that contribute most to educational progress. A study
carried out by Beck and Mostow in 2008, "How who should practice: Using learning
decomposition to evaluate the efficacy of different types of practice for different types of
students" (Beck & Mostow, 2008), which analyzes different types of learners, demonstrates that
the method supports identifying relationships between student behavior and the characteristics
of the variables used.
Nonnegative Matrix Factorization (or Decomposition)
There are several algorithms for nonnegative matrix factorization. The method transforms
(decomposes, factorizes) a matrix V into two matrices, W and H, with the property that all three
have non-negative elements. This is very useful in applications such as determining the
effectiveness of an evaluation system in which the matrices contain elements related to exams,
abilities, and items. Matrix V is obtained, approximately, from the product of the two smaller
matrices, as can be seen in Figure 3 ("Non-negative matrix factorization", 2019).
Figure 3. Illustration of approximate non-negative matrix factorization. Source: wikipedia.org
We propose to study the evaluation of two specific abilities, defined on the columns of
matrix W, for 4 work requirements (items), defined on the four rows of matrix W.
Matrix H will contain two rows representing the two abilities and 6 columns representing the
assessed students.
The result will be recorded in matrix V, which has 4 rows, one for each of the 4 items, and 6
columns, one for each of the 6 students.
A value of 1 in matrix W indicates the need for a certain skill (Figure 4) (Desmarais,
2012).
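$$
\underbrace{\begin{pmatrix}
0 & 1 \\
1 & 0 \\
1 & 0 \\
1 & 1
\end{pmatrix}}_{W\ (\text{items} \times \text{skills})}
\times
\underbrace{\begin{pmatrix}
1 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 0
\end{pmatrix}}_{H\ (\text{skills} \times \text{students})}
\approx
\underbrace{\begin{pmatrix}
0 & 0 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 2 & 1 & 1 & 1
\end{pmatrix}}_{V\ (\text{items} \times \text{students})}
$$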
Figure 4. Non-negative matrix factorization - example
The first item requires ability 2, W[1][2] = 1. Only students 3 and 4 have ability 2, so
item 1 will not be passed by students 1, 2, 5 and 6.
To pass item 4, both skills are required. Only one of the candidates, student 3, will pass this
item with the maximum score.
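The example in Figure 4 can be checked with a few lines of Python. The sketch below only multiplies the given W and H to recover V; it does not perform the factorization itself, which would require an iterative algorithm. The helper `students_mastering` is a name introduced here for illustration.

```python
# Matrices from Figure 4 (rows and columns are 0-indexed in code, 1-indexed in the text).
W = [[0, 1],   # item 1 needs skill 2
     [1, 0],   # item 2 needs skill 1
     [1, 0],   # item 3 needs skill 1
     [1, 1]]   # item 4 needs both skills
H = [[1, 1, 1, 0, 1, 1],   # which students have skill 1
     [0, 0, 1, 1, 0, 0]]   # which students have skill 2

def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def students_mastering(item):
    """1-based numbers of the students who have every skill the item requires."""
    required = sum(W[item])
    return [s + 1 for s, score in enumerate(V[item]) if score == required]

V = mat_mul(W, H)
for row in V:
    print(row)

print(students_mastering(0))  # item 1 -> [3, 4]
print(students_mastering(3))  # item 4 -> [3]
```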
Using computerized analysis methods, interpretations can be
obtained in a much shorter
time and with great accuracy because machines are faster and
more accurate than humans.
4. Learning Analytics (LA)
Learning is the product of an interaction between learners and the learning environment,
and among students, educators, teachers and others (Elias & Lias, 2011).
The evaluation of learning, in the traditional sense, is based on the evaluation of student or
pupil outcomes. This involves assessing knowledge but also trying to answer questions such as:
what does this student need, how can the course be improved, how should the course interface
be changed to make it more accessible. At present, especially in the pre-university system,
learning evaluation is based on questionnaires. Obtaining feedback is slow because
non-automatic data processing takes time and the analysis possibilities are quite limited.
The desire to improve the quality of learning and assessment in
the educational system is
increasing at the international level, but also in our country.
Traditional systems are confronted by
huge amounts of data and their diversity. Learning Analytics
(LA) attempts to answer questions
about how this data can be used and how it can be transformed
and analyzed to provide useful
information that can give value to the learning process (Liu &
Fan, 2014).
In 2011, at the first International Conference on Learning Analytics and Knowledge (LAK
2011), the definition of the new research area, LA, was adopted as: "learning analytics is the
measurement, collection, analysis and reporting of data about learners and their contexts, for
purposes of understanding and optimizing learning and the environments in which it occurs"
(Siemens, 2011).
Data analytics was first used in sales, where it is also called Business Intelligence. This branch
of research uses computer techniques to synthesize huge amounts of data and turn them into
powerful tools for making the best marketing decisions.
With the development of Web technologies, a branch of data analysis research, Web Analytics,
has been developed. Web Analytics tools collect data about the users of a site and report on their
behavior. This leads to a better understanding of customers and to making the best decisions to
improve their browsing experience and to keep visitors on the site.
Learning Analytics borrows tools and methods used in Business
Intelligence and Web Analytics
to analyze educational data.
At present, many universities, companies, and organizations are developing learning platforms
both for students and for lifelong learning. An enormous advantage of these is the possibility to
personalize the learning experience and adapt it to the physical disabilities of the learners.
In a research conducted by the New Media Consortium and the
EDUCAUSE Learning Initiative
in 2016, areas that will have a particular impact on university
education globally by 2020 are identified.
One of these is Learning Analytics. In the research report LA is
defined as an application in the
educational field of Web Analytics. It focuses on the collection
and detailed analysis of student
interactions with online learning platforms (Johnson, Adams
Becker & Cummins, 2016).
A free example of a Web Analytics tool is provided by Google and is called Google Analytics. It
provides sophisticated analysis of user behavior on a website and provides its administrators with
reports about:
many of them
are new customers;
With these reports, administrators can create additional features, add more interesting content,
enhance interactivity, and customize the interface of the application based on the devices used
for viewing.
Figures 5, 6 and 7 illustrate sections of various reports provided by this tool for the site
https://www.modinfo.ro, a site dedicated to preparing Romanian high school students for the
computer science course.
Figure 5 provides a diagram representation of the number of
visitors per page of the site. We
note that students are looking for baccalaureate content
(bac.php), admission to faculty
(admission.php) and additional training for performance
(cex.php).
Figure 5. User preferred content
Figure 6 represents the percentage of visitors to the site over a fixed period, by age category.
It can be seen that most users are aged between 25 and 34 years. For administrators, given the
period under review, this reveals the students' concern with preparing for the Computer
Programming Exam.
Figure 6. Demographics and interest categories - Age of users
Figure 7 provides information on analyzing the active presence
of a specific user on a site
within a selected time interval.
Figure 7. Behavior of a user on the site within a selected time
range
Choosing how to use and construct analytics tools starts from the choice of quantifiable
indicators, which have to be defined according to the proposed objectives. Examples of such
indicators for the educational environment:
tool within the course and
others.
4.1. Learning Analytics methods
Methods used for learning analysis include:
quality of the expression is analyzed.
ty in relation to learning: Students
interested in the topic will ask questions,
access links to supplementary resources
motivational learning.
LA uses some of the same data mining methods as EDM. They can be classified into: Prediction, Clustering, Relationship mining, Discovery with models, and Distillation of Data for Human Judgment (Nunn, Avella, Kanai, & Kebritchi, 2016).
We will briefly describe the methods that have not already been
presented in the previous
section.
D. Marcu, M. Danubianu - Learning Analytics or Educational
Data Mining? This is the Question...
Relationship mining
This method uses algorithms that find association rules to detect, for example, mistakes made by students when solving a set of exercises. Based on the associations found, one can predict a certain behavior of the student depending on the hypothesis from which he or she starts when solving the problem. Thus, the teacher or course manager can intervene so that the pupil or student does not make the mistake. Relationships can also be found, for example, between other activities of the student (playing on the computer, talking to a colleague in a chat room) while solving his or her work tasks and erroneous answers (Baker, Corbett, Koedinger & Wagner, 2004).
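As an illustration of the association-rule idea described above, the following minimal Python sketch computes support and confidence for pairs of observed events. The function name, the thresholds, and the sample attempt log are hypothetical and are not taken from the cited work; real systems typically use a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def association_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of attempts containing both items
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical log: what was observed during each exercise attempt.
attempts = [
    {"chatting", "wrong_answer"},
    {"chatting", "wrong_answer"},
    {"chatting", "wrong_answer", "gaming"},
    {"focused"},
    {"focused", "wrong_answer"},
]
rules = association_rules(attempts)
```

With this sample log, the rule "chatting implies wrong_answer" is found in every attempt that involved chatting, which is the kind of relationship a teacher could act on.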
Distillation of Data for Human Judgment
This method includes statistics and visualization techniques that help people understand data analytics. The method is the basis for many useful tools that provide clear analyses that can be quickly understood by non-specialist users.
An example is a map that groups learners by the amount of heat emanating from their bodies while they study the instructional material. This can be done with sensors mounted on the body. The analysis provides real-time information about learning performance indicators (Merceron, 2015).
5. Learning Analytics or Educational Data Mining?
Educational Data Mining is a new field of research. It is based
on the models, methods and
algorithms built for DM. However, there are also specific
methods of applying DM in education.
The main purpose of EDM is to explore large sets of data from
the educational system to create
knowledge-extraction models from the data. The main objective
is to provide useful information to
education decision makers about existing correlations between
sets of data that provide a deeper
understanding of the educational needs of students and the
system as a whole (de Almeida Neto &
Castro, 2017).
Learning Analytics is a newer field of research. It is based on
data analysis techniques in
Business Intelligence. LA uses highly sophisticated analysis
tools and predictive models to improve
learning. Most applications using LA have been created for the
university system and are dedicated
to early detection of concrete problems such as the risk of
abandoning a course by certain students.
LA also uses the expertise of other research areas, such as EDM
and Web Analytics, with the same
objectives of predicting learning outcomes and providing useful
information for improving the
quality of the learning process (Elias & Lias, 2011).
EDM is at the intersection of areas such as artificial
intelligence, machine learning,
education, and statistics.
Figure 8 shows LA as an interdisciplinary subdomain of Business Intelligence, Statistics and Education.
Figure 8. Educational Data Mining and Learning Analytics
The two new areas of research are quite similar in terms of the aims pursued and the methods used, but there are also some significant differences between them. Some of the most important resemblances and differences between EDM and LA are shown in Tables 1 and 2.
Table 1. Similarities between EDM and LA
EDM and LA: Both areas contribute to improving the quality of education and education policies in schools and universities, but in alternative education systems as well.
EDM: It is a new field of research. In 2011, in Massachusetts, USA, the International Working Group on EDM (established in 2007) created the International Society for EDM.
LA: The definition of this new field of research was adopted in 2011 at the first International Conference on Learning Analytics (LAK 2011).
EDM: It is based on the exploitation of large data collections.
LA: It is based on analysis of large data collections.
It is based on the formulation of specific research …
Uncertainty in big data analytics: survey, opportunities, and challenges
Reihaneh H. Hariri*, Erik M. Fredericks and Kate M. Bowers
Introduction
According to the National Security Agency, the Internet processes 1826 petabytes (PB) of data per day [1]. In 2018, the amount of data produced every day was 2.5 quintillion bytes [2]. Previously, the International Data Corporation (IDC) estimated that the amount of generated data will double every 2 years [3]; however, 90% of all data in the world was generated over the last 2 years. Moreover, Google now processes more than 40,000 searches every second, or 3.5 billion searches per day [2]. Facebook users upload 300 million photos, 510,000 comments, and 293,000 status updates per day [2, 4]. Needless to say, the amount of data generated on a daily basis is staggering. As a result, techniques are required to analyze and understand this massive amount of data, as it is a great source from which to derive useful information.
Abstract
Big data analytics has gained wide attention from both academia and industry as the demand for understanding trends in massive datasets increases. Recent developments in sensor networks, cyber-physical systems, and the ubiquity of the Internet of Things (IoT) have increased the collection of data (including health care, social media, smart cities, agriculture, finance, education, and more) to an enormous scale. However, the data collected from sensors, social media, financial records, etc. is inherently uncertain due to noise, incompleteness, and inconsistency. The analysis of such massive amounts of data requires advanced analytical techniques for efficiently reviewing and/or predicting future courses of action with high precision and advanced decision-making strategies. As the amount, variety, and speed of data increases, so too does the uncertainty inherent within, leading to a lack of confidence in the resulting analytics process and decisions made thereof. In comparison to traditional data techniques and platforms, artificial intelligence techniques (including machine learning, natural language processing, and computational intelligence) provide more accurate, faster, and scalable results in big data analytics. Previous research and surveys conducted on big data analytics tend to focus on one or two techniques or specific application domains. However, little work has been done in the field of uncertainty when applied to big data analytics as well as in the artificial intelligence techniques applied to the datasets. This article reviews previous work in big data analytics and presents a discussion of open
Hariri et al. J Big Data (2019) 6:44
Advanced data analysis techniques can be used to transform big data into smart data for the purposes of obtaining critical information regarding large datasets [5, 6]. As such, smart data provides actionable information and improves decision-making capabilities for organizations and companies. For example, in the field of health care, analytics performed upon big datasets (provided by applications such as Electronic Health Records and Clinical Decision Systems) may enable health care practitioners to deliver effective and affordable solutions for patients by examining trends in the overall history of the patient, in comparison to relying on evidence provided with strictly localized or current data. Big data analysis is difficult to perform using traditional data analytics [7] as they can lose effectiveness due to the five V’s characteristics of big data: high volume, low veracity, high velocity, high variety, and high value [7–9]. Moreover, many other characteristics exist for big data, such as variability, viscosity, validity, and viability [10]. Several artificial intelligence (AI) techniques, such as machine learning (ML), natural language processing (NLP), computational intelligence (CI), and data mining were designed to provide big data analytic solutions as they can be faster, more accurate, and more precise for massive volumes of data [8]. The aim of these advanced analytic techniques is to discover information, hidden patterns, and unknown correlations in massive datasets [7]. For instance, a detailed analysis of historical patient data could lead to the detection of destructive disease at an early stage, thereby enabling either a cure or more optimal treatment plan [11, 12]. Additionally, risky business decisions (e.g., entering a new market or launching a new product) can profit from simulations that have better decision-making skills [13].
While big data analytics using AI holds a lot of promise, a wide range of challenges are introduced when such techniques are subjected to uncertainty. For instance, each of the V characteristics introduces numerous sources of uncertainty, such as unstructured, incomplete, or noisy data. Furthermore, uncertainty can be embedded in the entire analytics process (e.g., collecting, organizing, and analyzing big data). For example, dealing with incomplete and imprecise information is a critical challenge for most data mining and ML techniques. In addition, an ML algorithm may not obtain the optimal result if the training data is biased in any way [14, 15]. Wang et al. [16] introduced six main challenges in big data analytics, including uncertainty. They focus mainly on how uncertainty impacts the performance of learning from big data, whereas a separate concern lies in mitigating uncertainty inherent within a massive dataset. These challenges normally present in data mining and ML techniques. Scaling these concerns up to the big data level will effectively compound any errors or shortcomings of the entire analytics process. Therefore, mitigating uncertainty in big data analytics must be at the forefront of any automated technique, as uncertainty can have a significant influence on the accuracy of its results.
Based on our examination of existing research, little work has been done in terms of how uncertainty significantly impacts the confluence of big data and the analytics techniques in use. To address this shortcoming, this article presents an overview of the existing AI techniques for big data analytics, including ML, NLP, and CI from the perspective of uncertainty challenges, as well as suitable directions for future research in these domains. The contributions of this work are as follows. First, we consider uncertainty challenges in each of the 5 V’s big data characteristics. Second, we review several techniques on big data analytics with impact of uncertainty for each technique, and also review the impact of uncertainty on several big data analytic techniques. Third, we discuss available strategies to handle each challenge presented by uncertainty.
To the best of our knowledge, this is the first article surveying uncertainty in big data analytics. The remainder of the paper is organized as follows. “Background” section presents background information on big data, uncertainty, and big data analytics. “Uncertainty perspective of big data analytics” section considers challenges and opportunities regarding uncertainty in different AI techniques for big data analytics. “Summary of mitigation strategies” section correlates the surveyed works with their respective uncertainties. Lastly, “Discussion” section summarizes this paper and presents future directions of research.
Background
This section reviews background information on the main
characteristics of big data,
uncertainty, and the analytics processes that address the
uncertainty inherent in big data.
Big data
In May 2011, big data was announced as the next frontier for productivity, innovation, and competition [11]. In 2018, the number of Internet users grew 7.5% from 2016 to over 3.7 billion people [2]. In 2010, over 1 zettabyte (ZB) of data was generated worldwide and rose to 7 ZB by 2014 [17]. In 2001, the emerging characteristics of big data were defined with three V’s (Volume, Velocity, and Variety) [18]. Similarly, IDC defined big data using four V’s (Volume, Variety, Velocity, and Value) in 2011 [19]. In 2012, Veracity was introduced as a fifth characteristic of big data [20–22]. While many other V’s exist [10], we focus on the five most common characteristics of big data, as next illustrated in Fig. 1.
Volume refers to the massive amount of data generated every second and applies to the size and scale of a dataset. It is impractical to define a universal threshold for big data volume (i.e., what constitutes a ‘big dataset’) because the time and type of data can influence its definition [23]. Currently, datasets that reside in the exabyte (EB) or ZB ranges are generally considered as big data [8, 24]; however, challenges still exist for datasets in smaller size ranges. For example, Walmart collects 2.5 PB from over a million customers every hour [25]. Such huge volumes of data can introduce scalability and uncertainty problems (e.g., a database tool may not be able to accommodate infinitely large datasets). Many existing data analysis techniques are not designed for large-scale databases and can fall short when trying to scan and understand the data at scale [8, 15].
Variety refers to the different forms of data in a dataset including structured data, semi-structured data, and unstructured data. Structured data (e.g., stored in a relational database) is mostly well-organized and easily sorted, but unstructured data (e.g., text and multimedia content) is random and difficult to analyze. Semi-structured data (e.g., NoSQL databases) contains tags to separate data elements [23, 26], but enforcing this structure is left to the database user. Uncertainty can manifest when converting between different data types (e.g., from unstructured to structured data), in representing data of mixed data types, and in changes to the underlying structure of the dataset at run time. From the point of view of variety, traditional big data analytics algorithms face challenges for handling multi-modal, incomplete and noisy data. Because such techniques (e.g., data mining algorithms) are designed to consider well-formatted input data, they may not be able to deal with incomplete and/or different formats of input data [7]. This paper focuses on uncertainty with regard to big data analytics; however, uncertainty can impact the dataset itself as well.
Efficiently analysing unstructured and semi-structured data can be challenging, as the data under observation comes from heterogeneous sources with a variety of data types and representations. For example, real-world databases are negatively influenced by inconsistent, incomplete, and noisy data. Therefore, a number of data preprocessing techniques, including data cleaning, data integrating, and data transforming, are used to remove noise from data [27]. Data cleaning techniques address data quality and uncertainty problems resulting from variety in big data (e.g., noise and inconsistent data). Such techniques for removing noisy objects during the analysis process can significantly enhance the performance of data analysis. For example, data cleaning for error detection and correction is facilitated by identifying and eliminating mislabeled training samples, ideally resulting in an improvement in classification accuracy in ML [28].
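To make the data cleaning step above concrete, here is a minimal Python sketch that drops missing values and then removes readings that lie far from the median. The function name, the threshold, and the sample sensor readings are illustrative assumptions; the median-based rule is one simple noise filter chosen for this sketch, not the specific method of the cited works.

```python
from statistics import median


def clean_readings(readings, max_deviation=3.0):
    """Drop missing values, then drop values far from the median.

    Distance is measured in units of the median absolute deviation (MAD).
    """
    present = [r for r in readings if r is not None]
    if not present:
        return []
    med = median(present)
    mad = median(abs(r - med) for r in present)
    if mad == 0:
        # All remaining values are (nearly) identical; nothing to filter.
        return present
    return [r for r in present if abs(r - med) / mad <= max_deviation]


# Hypothetical temperature readings with two gaps and one noisy spike.
raw = [20.1, 19.8, None, 20.3, 95.0, 20.0, None, 19.9]
cleaned = clean_readings(raw)
```

In this example the two missing entries and the 95.0 spike are removed, while the plausible readings are kept in their original order.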
Velocity comprises the speed (represented in terms of batch, near-real time, real time, and streaming) of data processing, emphasizing that the speed with which the data is processed must meet the speed with which the data is produced [8]. For example, Internet of Things (IoT) devices continuously produce large amounts of sensor data. If the device monitors medical information, any delays in processing the data and sending the results to clinicians may result in patient injury or death (e.g., a pacemaker that reports emergencies to a doctor or facility) [20]. Similarly, devices in the cyber-physical domain often rely on real-time operating systems enforcing strict timing standards on execution, and as such, may encounter problems when data provided from a big data application fails to be delivered on time.
Fig. 1 Common big data characteristics
Veracity represents the quality of the data (e.g., uncertain or imprecise data). For example, IBM estimates that poor data quality costs the US economy $3.1 trillion per year [21]. Because data can be inconsistent, noisy, ambiguous, or incomplete, data veracity is categorized as good, bad, and undefined. Due to the increasingly diverse sources and variety of data, accuracy and trust become more difficult to establish in big data analytics. For example, an employee may use Twitter to share official corporate information but at other times use the same account to express personal opinions, causing problems with any techniques designed to work on the Twitter dataset. As another example, when analyzing millions of health care records to determine or detect disease trends, for instance to mitigate an outbreak that could impact many people, any ambiguities or inconsistencies in the dataset can interfere with or decrease the precision of the analytics process [21].
Value represents the context and usefulness of data for decision making, whereas the prior V’s focus more on representing challenges in big data. For example, Facebook, Google, and Amazon have leveraged the value of big data via analytics in their respective products. Amazon analyzes large datasets of users and their purchases to provide product recommendations, thereby increasing sales and user participation. Google collects location data from Android users to improve location services in Google Maps. Facebook monitors users’ activities to provide targeted advertising and friend recommendations. These three companies have each become massive by examining large sets of raw data and drawing and retrieving useful insight to make better business decisions [29].
Uncertainty
Generally, “uncertainty is a situation which involves unknown or imperfect information” [30]. Uncertainty exists in every phase of big data learning [7] and comes from many different sources, such as data collection (e.g., variance in environmental conditions and issues related to sampling), concept variance (e.g., the aims of analytics do not present similarly) and multimodality (e.g., the complexity and noise introduced with patient health records from multiple sensors include numerical, textual, and image data). For instance, most of the attribute values relating to the timing of big data (e.g., when events occur/have occurred) are missing due to noise and incompleteness. Furthermore, the number of missing links between data points in social networks is approximately 80% to 90%, and the number of missing attribute values within patient reports transcribed from doctor diagnoses is more than 90% [31]. Based on IBM research in 2014, industry analysts believe that, by 2015, 80% of the world’s data will be uncertain [32].
Various forms of uncertainty exist in big data and big data analytics that may negatively impact the effectiveness and accuracy of the results. For example, if training data is biased in any way, incomplete, or obtained through inaccurate sampling, the learning algorithm using corrupted training data will likely output inaccurate results. Therefore, it is critical to augment big data analytic techniques to handle uncertainty. Recently, meta-analysis studies that integrate uncertainty and learning from data have seen a sharp increase [33–35]. The handling of the uncertainty embedded in the entire process of data analytics has a significant effect on the performance of learning from big data [16]. Other research also indicates that two more features of big data, multimodality (very complex types of data) and changed-uncertainty (the modeling and measure of uncertainty for big data), are remarkably different from those of small-size data. There is also a positive correlation between the size of a dataset and the uncertainty of the data itself and of data processing [34]. For example, fuzzy sets may be applied to model uncertainty in big data to combat vague or incorrect information [36]. Moreover, because the data may contain hidden relationships, the uncertainty is further increased.
Therefore, it is not an easy task to evaluate uncertainty in big data, especially when the data may have been collected in a manner that creates bias. To combat the many types of uncertainty that exist, many theories and techniques have been developed to model its various forms. We next describe several common techniques.
Bayesian theory assumes a subjective interpretation of the probability based on past event/prior knowledge. In this interpretation the probability is defined as an expression of a rational agent’s degrees of belief about uncertain propositions [37]. Belief function theory is a framework for aggregating imperfect data through an information fusion process when under uncertainty [38]. Probability theory incorporates randomness and generally deals with the statistical characteristics of the input data [34]. Classification entropy measures ambiguity between classes to provide an index of confidence when classifying. Entropy varies on a scale from zero to one, where values closer to zero indicate more complete classification in a single class, while values closer to one indicate membership among several different classes [39]. Fuzziness is used to measure uncertainty in classes, notably in human language (e.g., good and bad) [16, 33, 40]. Fuzzy logic then handles the uncertainty associated with human perception by creating an approximate reasoning mechanism [41, 42]. The methodology was intended to imitate human reasoning to better handle uncertainty in the real world [43]. Shannon’s entropy quantifies the amount of information in a variable to determine the amount of missing information on average in a random source [44, 45]. The concept of entropy in statistics was introduced into the theory of communication and transmission of information by Shannon [46]. Shannon entropy provides a method of information quantification when it is not possible to measure criteria weights using a decision-maker. Rough set theory provides a mathematical tool for reasoning on vague, uncertain or incomplete information. With the rough set approach, concepts are described by two approximations (upper and lower) instead of one precise concept [47], making such methods invaluable to dealing with uncertain information systems [48]. Probabilistic theory and Shannon’s entropy are often used to model imprecise, incomplete, and inaccurate data. Moreover, fuzzy set and rough set theory are used for modeling vague or ambiguous data [49], as shown in Fig. 2.
Evaluating the level of uncertainty is a critical step in big data analytics. Although a variety of techniques exist to analyze big data, the accuracy of the analysis may be negatively affected if uncertainty in the data or the technique itself is ignored. Uncertainty models such as probability theory, fuzziness, rough set theory, etc. can be used to augment big data analytic techniques to provide more accurate and more meaningful results. Based on the previous research, Bayesian model and fuzzy set theory are common for modeling uncertainty and decision-making. Table 1 compares and summarizes the techniques we have identified as relevant, including a comparison between different uncertainty strategies, focusing on probabilistic theory, Shannon’s entropy, fuzzy set theory, and rough set theory.
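The lower and upper approximations that rough set theory uses, as described earlier in this section, can be illustrated with a small Python sketch. The function name and the patient records are hypothetical; objects with identical attribute values are treated as indistinguishable, which is the standard indiscernibility assumption.

```python
def approximations(attributes, target):
    """Return the (lower, upper) rough-set approximations of `target`.

    attributes: dict mapping object id -> tuple of attribute values
    target:     set of object ids forming the concept to approximate
    """
    # Group objects that cannot be told apart by their attributes.
    classes = {}
    for obj, values in attributes.items():
        classes.setdefault(values, set()).add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # every member is in the concept
            lower |= block
        if block & target:    # at least one member is in the concept
            upper |= block
    return lower, upper


# Hypothetical records: p1 and p2 share symptoms but differ in diagnosis.
patients = {
    "p1": ("fever", "cough"),
    "p2": ("fever", "cough"),
    "p3": ("fever", "no_cough"),
    "p4": ("no_fever", "no_cough"),
}
flu = {"p1", "p3"}
lower, upper = approximations(patients, flu)
```

Here only p3 certainly has the condition (lower approximation), while p1, p2 and p3 possibly have it (upper approximation), because p1 and p2 are indistinguishable from the recorded symptoms alone.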
Big data analytics
Big data analytics describe the process of analyzing massive datasets to discover patterns, unknown correlations, market trends, user preferences, and other valuable information that previously could not be analyzed with traditional tools [52]. With the formalization of the big data’s five V characteristics, analysis techniques needed to be reevaluated to overcome their limitations on processing in terms of time and space [29]. Opportunities for utilizing big data are growing in the modern world of digital data. The global annual growth rate of big data technologies and services is predicted to increase about 36% between 2014 and 2019, with the global income for big data and business analytics anticipated to increase more than 60% [53].
Fig. 2 Measuring uncertainty in big data (imprecise, inaccurate, and incomplete data: probability theory, Shannon's entropy; vague or ambiguous data: fuzzy set theory, rough set theory)
Table 1 Comparison of uncertainty strategies
Uncertainty models: Features
Probability theory, Bayesian theory, Shannon’s entropy: Powerful for handling randomness and subjective uncertainty where precision is required. Capable of handling complex data [50].
Fuzziness: Handles vague and imprecise information in systems that are difficult to model. Precision not guaranteed. Easy to implement and interpret [50].
Belief function: Handles situations with some degree of ignorance. Combines distinct evidence from several sources to compute the probability of specific hypotheses. Considers all evidence available for the hypothesis. Ideal for incomplete and highly complex data. Mathematically complex but improves uncertainty reduction [50].
Rough set theory: Provides an objective form of analysis [47]. Deals with vagueness in data. Minimal information necessary to determine set membership. Only uses the information presented within the given data [51].
Classification entropy: Handles ambiguity between the classes [39].
Several advanced data analysis techniques (i.e., ML, data mining, NLP, and CI) and potential strategies such as parallelization, divide-and-conquer, incremental learning, sampling, granular computing, feature selection [16], and instance selection [34] can convert big problems to small problems and can be used to make better decisions, reduce costs, and enable more efficient processing.
With respect to big data analytics, parallelization reduces computation time by splitting large problems into smaller instances of themselves and performing the smaller tasks simultaneously (e.g., distributing the smaller tasks across multiple threads, cores, or processors). Parallelization does not decrease the amount of work performed but rather reduces computation time, as the small tasks are completed at the same point in time instead of one after another sequentially [16].
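A minimal sketch of the parallelization strategy described above, using only the Python standard library. The function names and the word-counting task are illustrative assumptions. A thread pool is used to keep the example simple; for CPU-bound work in CPython a process pool would normally be substituted to obtain a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """The small task: count the words in one chunk of lines."""
    return sum(len(line.split()) for line in lines)


def parallel_word_count(lines, workers=4):
    """Split `lines` into chunks and count the chunks concurrently."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The total work is unchanged; the chunks are simply processed
        # side by side instead of one after another.
        return sum(pool.map(count_words, chunks))


document = ["a b c", "d e", "f"] * 10
total = parallel_word_count(document)
```

The result is identical to a sequential count, which reflects the point made above: parallelization changes when the work is done, not how much work there is.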
The divide-and-conquer strategy plays an important role in processing big data. Divide-and-conquer consists of three phases: (1) reduce one large problem into several smaller problems, (2) complete the smaller problems, where the solving of each small problem contributes to the solving of the large problem, and (3) incorporate the solutions of the smaller problems into one large solution such that the large problem is considered solved. For many years the divide-and-conquer strategy has been used in very massive databases to manipulate records in groups rather than all the data at once [54].
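The three phases listed above can be sketched as three small Python functions that compute a mean over records processed in groups. The function names and the group size are hypothetical choices for this illustration.

```python
def split(records, group_size):
    """Phase 1: reduce the large problem into smaller groups of records."""
    for start in range(0, len(records), group_size):
        yield records[start:start + group_size]


def solve(group):
    """Phase 2: solve one small problem (partial sum and count)."""
    return sum(group), len(group)


def combine(partials):
    """Phase 3: merge the partial results into the overall answer."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


def chunked_mean(records, group_size=1000):
    """Mean of `records`, computed group by group rather than all at once."""
    return combine([solve(group) for group in split(records, group_size)])


mean = chunked_mean(list(range(1, 11)), group_size=3)
```

Because each group yields only a sum and a count, the groups never need to be held in memory together, which is what makes the approach attractive for very large tables.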
Incremental learning is a learning approach, popularly used with streaming data, in which the model is trained on new data as it arrives rather than retrained on existing data. Incremental learning adjusts the parameters in the learning algorithm over time according to each new input, and each input is used for training only once [16].
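As a minimal sketch of incremental learning as described above, the following Python class updates a single model parameter once per incoming sample and never revisits earlier data. The class name, the learning rate, and the synthetic stream (where y equals 2x) are assumptions made for illustration; the update rule is plain stochastic gradient descent on squared error.

```python
class OnlineLinearModel:
    """One-parameter model y = weight * x, trained one sample at a time."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x

    def learn_one(self, x, y):
        """Adjust the weight using a single (x, y) pair, then discard it."""
        error = y - self.predict(x)
        self.weight += self.learning_rate * error * x


model = OnlineLinearModel()
# Hypothetical data stream; each sample is seen exactly once.
stream = ((x, 2.0 * x) for x in [1.0, 2.0, 3.0] * 10)
for x, y in stream:
    model.learn_one(x, y)
```

After the stream has been consumed the weight is very close to 2.0, even though no sample was stored or reused.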
Sampling can be used as a data reduction method for big data analytics for deriving patterns in large data sets by choosing, manipulating, and analyzing a subset of the data [16, 55]. Some research indicates that obtaining effective results using sampling depends on the data sampling criteria used [56].
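One way to choose a subset from data that is too large to hold in memory is reservoir sampling, sketched below in Python. This particular algorithm is not named in the text; it is offered here as one possible sampling criterion (uniform random selection), and the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable.

    The stream is read once and only k items are kept in memory.
    """
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, index)    # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k/(index+1)
    return sample


subset = reservoir_sample(range(10_000), k=100, seed=0)
```

The sample can then be analyzed in place of the full data set; whether the results are representative depends, as noted above, on the sampling criteria chosen.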
Granular computing groups elements from a large space to simplify the elements into subsets, or granules [57, 58]. Granular computing is an effective approach to define uncertainty of objects in the search space as it reduces large objects to a smaller search space [59].
Feature selection is a conventional approach to handle big data with the purpose of choosing a subset of relevant features for an aggregate but more precise data representation [60, 61]. Feature selection is a very useful strategy in data mining for preparing high-scale data [60].
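A simple form of feature selection can be sketched as follows. The sketch keeps only the columns whose variance exceeds a threshold, on the assumption that a nearly constant feature carries little information. This variance filter is one basic technique chosen for illustration, not the method of the cited works, and the function and feature names are hypothetical.

```python
from statistics import pvariance


def select_by_variance(rows, feature_names, threshold=0.0):
    """Keep the features whose population variance exceeds `threshold`.

    rows:          list of equal-length numeric records
    feature_names: one name per column
    Returns (kept_names, reduced_rows).
    """
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    names = [feature_names[i] for i in kept]
    reduced = [[row[i] for i in kept] for row in rows]
    return names, reduced


data = [
    [1, 0, 5],
    [2, 0, 5],
    [3, 0, 6],
]
names, reduced = select_by_variance(data, ["a", "b", "c"])
```

Feature "b" is constant across all records and is therefore dropped, leaving a smaller representation with two columns.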
Instance selection is practical in many ML or data mining tasks as a major feature in data pre-processing. By utilizing instance selection, it is possible to reduce training sets and runtime in the classification or training phases [62].
The costs of uncertainty (both monetarily and computationally) and challenges in generating effective models for uncertainties in big data analytics have become key to obtaining robust and performant systems. As such, we examine several open issues of the impacts of uncertainty on big data analytics in the next section.
Uncertainty perspective of big data analytics
This section examines the impact of uncertainty on three AI techniques for big data analytics. Specifically, we focus on ML, NLP, and CI, although many other analytics techniques exist. For each presented technique, we examine the inherent uncertainties and discuss methods and strategies for their mitigation.
Machine learning and big data
When dealing with data analytics, ML is generally used to
create models for predic-
tion and knowledge discovery to enable data-driven decision-
making. Traditional ML
methods are not computationally efficient or scalable enough to
handle both the char-
acteristics of big data (e.g., large volumes, high speeds, varying
types, low value density,
incompleteness) and uncertainty (e.g., biased training data,
unexpected data types, etc.).
Several commonly used advanced ML techniques proposed for
big data analysis include
feature learning, deep learning, transfer learning, distributed
learning, and active learn-
ing. Feature learning includes a set of techniques that enables a
system to automatically
54. discover the representations needed for feature detection or
classification from raw data.
The performance of ML algorithms is strongly influenced by the
choice of data representation. Deep learning algorithms are
designed for analyzing and extracting valuable knowledge from
massive amounts of data and from data collected from various
sources (e.g., separate variations within an image, such as
light, various materials, and shapes) [56]; however, current deep
learning models incur a high computational cost. Distributed
learning can be used to mitigate the scalability problem of
traditional ML by carrying out calculations on data sets
distributed among several workstations to scale up the learning
process [63]. Transfer learning is the ability to apply knowledge
learned in one context to new contexts, effectively improving a
learner from one domain by transferring information from a
related domain [64]. Active learning refers to algorithms that
employ adaptive data collection [65] (i.e., processes that
automatically adjust parameters to collect the most useful data
as quickly as possible) in order to accelerate ML activities and
overcome labeling problems. The uncertainty challenges of ML
techniques can be mainly attributed to learning from data with
low veracity (i.e., uncertain and incomplete data) and data with
low value (i.e., unrelated to the current problem). We found
that, among the ML techniques, active learning, deep learning,
and fuzzy logic theory are uniquely suited to support the
challenge of reducing uncertainty, as shown in Fig. 3.
Uncertainty can impact ML in terms of incomplete or imprecise
training samples, unclear classification boundaries, and rough
knowledge of the target data. In some cases, the data is
represented without labels, which can become a challenge.
Manually labeling large data collections can be an expensive and
strenuous task, yet learning from unlabeled data is very …
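The labeling cost described above is what active learning tries to reduce. The sketch below shows pool-based uncertainty sampling, one common form of adaptive data collection, under strong simplifying assumptions: the data is one-dimensional, the model is a single threshold, and oracle is a hypothetical annotator with a hidden boundary at 0.6. None of these names or values come from the source.

```python
import random

def oracle(x):
    """Hypothetical human annotator: the hidden true boundary is 0.6."""
    return int(x > 0.6)

def fit_threshold(labeled):
    """Place the boundary midway between the closest opposite-class points."""
    highest_neg = max(x for x, y in labeled if y == 0)
    lowest_pos = min(x for x, y in labeled if y == 1)
    return (highest_neg + lowest_pos) / 2

def active_learn(pool, budget):
    """Pool-based uncertainty sampling with a fixed labeling budget."""
    pool = sorted(pool)
    # Seed with the two extremes so that both classes are represented
    # (assumes the extremes fall on opposite sides of the boundary).
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = pool[1:-1]
    for _ in range(budget):
        threshold = fit_threshold(labeled)
        # The least certain point is the one closest to the current boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)

rng = random.Random(42)
pool = [rng.random() for _ in range(1000)]
threshold, labels_used = active_learn(pool, budget=12)
```

In this toy setting the learner asks for 14 labels out of a pool of 1,000 points, because each query is spent on the point the current model is least sure about. Real data is noisier and higher-dimensional, so the savings shown here should not be read as typical.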
Research Paper – Data Science & Big Data Analytics
While this week’s topic highlighted the uncertainty of big data,
the authors identified the following as areas for future research.
Pick one of the following for your research paper.
· Additional study must be performed on the interactions
between each big data characteristic, as they do not exist
separately but naturally interact in the real world.
· The scalability and efficacy of existing analytics techniques
being applied to big data must be empirically examined.
· New techniques and algorithms must be developed in ML and
NLP to handle the real-time needs for decisions made based on
enormous amounts of data.
· More work is necessary on how to efficiently model
uncertainty in ML and NLP, as well as how to represent
uncertainty resulting from big data analytics.
· Since the CI algorithms are able to find an approximate
solution within a reasonable time, they have been used to tackle
ML problems and uncertainty challenges in data analytics and
process in recent years.
Your paper should meet the following requirements:
• Be approximately 3-5 pages in length, not including the
required cover page and reference page.
• Follow APA guidelines. Your paper should include an
introduction, a body with fully developed content, and a
conclusion.
• Support your response with the readings from the course and
at least five peer-reviewed articles or scholarly journals to
support your positions, claims, and observations. The UC
Library is a great place to find resources.
• Be clear, well-written, and concise, using excellent grammar
and style techniques. You are being graded in part on the
quality of your writing.
References:
Marcu, D., & Danubianu, M. (2019). Learning Analytics or
Educational Data Mining? This is the Question. BRAIN: Broad
Research in Artificial Intelligence & Neuroscience, 10, 1–14.
Retrieved from
http://search.ebscohost.com/login.aspx?direct=true&AuthType=shib&db=a9h&AN=139367236&site=eds-live
Hariri, R. H., Fredericks, E. M., & Bowers, K. M. (2019).
Uncertainty in big data analytics: Survey, opportunities, and
challenges. Journal of Big Data, 6, 44.
https://doi.org/10.1186/s40537-019-0206-3