Discussion Board 1 – 2
Within the Discussion Board area, write 400-600 words that
respond to the following questions with your thoughts, ideas,
and comments. This will be the foundation for future
discussions by your classmates. Be substantive and clear, and
use examples to reinforce your ideas.
The architecture of Web 1.0 consists of the following three
components (Jacobs & Walsh, 2004):
· Web resources identification: Uniform Resource Identifier
(URI)
· Interaction protocol: HyperText Transfer Protocol (HTTP)
· Data formats: HyperText Markup Language (HTML)
Over the last 25 years, the Web has experienced several
evolutions, which have been called Web 1.0, Web 2.0, Web 3.0,
Web 4.0, and Web 5.0. Each of the evolutions has brought in
more types of data sources, along with more advanced
functional capability to the Internet infrastructure to make the
Web the central place to see the convergence of many existing
and new technologies. These new capabilities, in turn, support
many new innovative business processes and practices through
the Web. Therefore, it is important to know the basic concepts
and applications of the Web, starting from its first generation.
Knowing the root of the Web technology will help you to
understand the reasons and consequences of the current and
future changes to the Web technology, as well as the challenges
of accessing the ever-growing Web data.
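The three components above can be sketched in a few lines of Python. This is illustrative only: the URI and HTML snippet below are made-up examples on the reserved example.com domain, and the HTTP fetch is noted in a comment rather than performed.

```python
from urllib.parse import urlparse
from html.parser import HTMLParser

# URI: a global identifier for a Web resource (a made-up example URL).
uri = "http://example.com/articles/web-history.html?lang=en"
parts = urlparse(uri)  # parts.scheme == "http", parts.netloc == "example.com"

# HTTP would be the interaction protocol used to fetch the resource,
# e.g. urllib.request.urlopen(uri); no network call is made here.

# HTML: the data format. Its hyperlinks let every page point at more
# URIs, which is one reason crawlable Web data keeps growing.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = '<a href="http://example.com/a">A</a> <a href="/b">B</a>'
extractor = LinkExtractor()
extractor.feed(page)  # extractor.links == ["http://example.com/a", "/b"]
```

Each link discovered this way is itself a URI that can be fetched over HTTP, which is how the three components together bootstrap an ever-growing body of Web data.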
Complete the reading assignment, and search the Library and
Internet to find and study at least 2 more references that discuss
the concepts and applications of the Web. Based on the results
of your research, discuss the following questions:
· What role has each of the 3 components of the architecture of
Web 1.0 (URI, HTTP, and HTML) played in making the Web
one of the main sources of ever-growing big data?
· What will be the trend in terms of "performance bottleneck" when
accessing large-scale Web data as the Web technology evolves?
· Justify your point of view, and provide examples as necessary.
Unit 2 - 1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
As the core component of Web 4.0, the Internet of Things (IoT)
has become a reality after many years of development. Distinct
from all previous generations of the Web where all the data are
generated by people, the Web 4.0 data are generated by both
human and embedded computing devices (Atzori, 2010). The
number of sources of Web data has greatly increased because
many billions of uniquely identifiable embedded computing
devices are connected through the Internet infrastructure and
various types of wireless networks. Because most IoT devices
have only limited computing resources,
they play the role of raw data collector and initial data
preprocessor. These devices have to send the lower-level data to
various data processing centers, where computers with greater
computing resources perform the heavier-duty tasks. The
IoT-based Web 4.0 has not only increased the data growth rate
but has also shifted the performance bottlenecks of accessing
Web data to many new places in the Internet infrastructure. It is
very important to fully understand where these new performance
bottlenecks are and the root causes of their existence so that you
can more effectively manage your computing resources when
accessing various types of Web data for your large-scale Web
data-based applications.
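As a toy illustration of this division of labor (all readings, thresholds, and field names below are hypothetical), a constrained device might clean and summarize its raw samples and transmit only a compact payload for a data center to process further:

```python
import json
import statistics

# Hypothetical raw sensor samples collected on a constrained IoT device.
raw_samples = [21.4, 21.6, 21.5, 35.0, 21.5]  # one obvious spike

def preprocess(samples, spike_threshold=5.0):
    """Initial preprocessing on the device: drop spikes, then summarize."""
    median = statistics.median(samples)
    cleaned = [s for s in samples if abs(s - median) <= spike_threshold]
    return {
        "count": len(cleaned),
        "mean": round(statistics.mean(cleaned), 2),
        "min": min(cleaned),
        "max": max(cleaned),
    }

# The compact summary, not the raw sample stream, is what gets sent
# upstream to the data processing center.
payload = json.dumps(preprocess(raw_samples))
```

The upstream bottleneck then shifts from the device itself to the networks and data centers that must ingest billions of such payloads.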
Complete the reading assignment, and search the Library and
Internet to find and study at least 2 references that discuss the
concepts and applications of the IoT. Based on the results of
your research, discuss the following questions:
· Where will the new performance bottlenecks be when
accessing large-scale Web data generated by IoT?
· What is the new challenge for developing an indexing scheme
used to assist accessing large-scale Web data generated by IoT?
· Justify your point of view and provide examples as necessary.
Unit 3 – 1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
MapReduce was originally developed for cost-efficient use of
large clusters of commodity computers to achieve scalable and
reliable data processing. It consistently applies two simple but
powerful functions—Map and Reduce—in parallel. Along
with Hadoop, which is an open-source implementation of
MapReduce, MapReduce has become one of the most popular
and practical technical solutions to deal with big data analytic
tasks. However, like any technical solution, the initial
MapReduce and Hadoop also have quite a few weaknesses when
applied to handle certain types of data processing applications.
Therefore, there is a need to thoroughly study the basic
concepts of MapReduce and its Hadoop implementation to fully
understand their pros and cons so that when applying them in
big data analytic tasks, you will be able to make the right
decisions and achieve the desired results.
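The core idea can be sketched in plain Python using the classic word-count example. This is a single-process stand-in for a real cluster: in Hadoop, the map, shuffle, and reduce phases below run in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map: turn one input record into intermediate (key, value) pairs.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: combine all values that were grouped under one key.
    return (word, sum(counts))

def run_mapreduce(records):
    # Shuffle: group the intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

word_counts = run_mapreduce(["big data big ideas", "data pipelines"])
```

Because each map call sees one record and each reduce call sees one key, the framework can split the work arbitrarily and rerun failed pieces, which is the source of both the scalability and the reliability mentioned above.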
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss the
concepts and applications of MapReduce and Hadoop as needed.
Based on the results of your research, discuss the following
questions:
· What are the basic concepts of MapReduce?
· What are the top 3 features of Hadoop?
· What are the pros and cons of MapReduce?
Justify your point of view and provide examples, as necessary.
===============================================
=====================================
Unit 3 – 2
Within the Discussion Board area, write 400-600 words that
respond to the following questions with your thoughts, ideas,
and comments. This will be the foundation for future
discussions by your classmates. Be substantive and clear, and
use examples to reinforce your ideas.
Many data analytic tasks in commonly used Web applications,
such as page ranking and social network analysis, are processed
iteratively until the computation meets the given condition.
However, the original MapReduce framework does not support
iterative computation directly. The iterative tasks have to be
manually developed through separate software, using multiple
MapReduce jobs to emulate the iteration process. The
unchanged data from the previous iteration are reloaded and
reprocessed in the next iteration. This approach incurs a
performance penalty on computing resources because it fails to
exploit the fact that most of the data is unchanged across
iterations and therefore has no need to be reloaded and
reprocessed in subsequent iterations. Another
problem with the manual approach is that it depends on
detecting the termination condition at each iteration. This
requires an extra MapReduce job, which causes extra scheduling
overhead and I/O and increases network traffic. Obviously, a
better solution is required to address these performance
penalties.
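A minimal sketch of this emulation follows. It is a hypothetical driver for a simplified rank-propagation computation, not Hadoop API code: every pass launches a full job over the invariant link structure, and a second job is launched just to test termination.

```python
def run_rank_job(ranks, links):
    """One emulated 'job': redistribute each page's rank across its
    out-links. The links structure is invariant, yet this emulation
    would reload it from storage on every single pass."""
    new_ranks = {page: 0.15 for page in ranks}
    for page, rank in ranks.items():
        out = links[page]
        for target in out:
            new_ranks[target] += 0.85 * rank / len(out)
    return new_ranks

def run_termination_job(old, new, eps=1e-6):
    """The extra 'job' needed just to detect convergence."""
    return max(abs(new[p] - old[p]) for p in old) < eps

links = {"a": ["b"], "b": ["a"]}   # tiny two-page Web graph
ranks = {"a": 1.0, "b": 1.0}
jobs_launched = 0
for _ in range(50):
    new_ranks = run_rank_job(ranks, links)
    jobs_launched += 1
    converged = run_termination_job(ranks, new_ranks)
    jobs_launched += 1              # the termination check is its own job
    ranks = new_ranks
    if converged:
        break
```

Even this trivially convergent example launches two jobs per pass; at cluster scale, each of those launches carries its own scheduling, I/O, and network cost.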
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss how to
address the weakness of MapReduce and Hadoop on iterative
computation. Based on the results of your research, discuss the
following questions:
· What are the weaknesses of the initial MapReduce framework
in iterative computation?
· What are the root causes of these weaknesses?
· What are the key technical steps to address these weaknesses?
Justify your point of view and provide examples, as necessary.
===============================================
=====================================
Unit 4 – 1
Within the Discussion Board area, write 400-600 words that
respond to the following questions with your thoughts, ideas,
and comments. This will be the foundation for future
discussions by your classmates. Be substantive and clear, and
use examples to reinforce your ideas.
Most data analytic tasks in commonly used large-scale Web
data processing applications, such as Web crawls and Web page
indexing, are not iterative but incremental. The application
usually runs one time as needed. However, there is a common
characteristic of data shown in the incremental computation on
most large-scale Web data. Most of the data do not change
between two different runs. This obviously is an opportunity to
improve the data processing performance for MapReduce and its
Hadoop implementation because they did not consider this data
characteristic in their design and development. For example, if
99% of a large-scale data set is unchanged and if there is a
method to allow the MapReduce-based Web application to reuse
that data directly in the next run without reprocessing, the data
processing performance on this data set will increase greatly.
Therefore, it is very important to acquire the knowledge and
skills on how to achieve this process with existing MapReduce
and Hadoop.
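One way to sketch the idea (hypothetical and single-process; real systems apply it at the framework level) is to memoize map outputs keyed by a content hash of each input split, so a second run recomputes only the splits that changed:

```python
import hashlib

# Hypothetical memoized map stage: map outputs are cached under a hash
# of each split's content, so an unchanged split is never remapped.
cache = {}
map_calls = 0

def map_split(split):
    global map_calls
    key = hashlib.sha256(split.encode()).hexdigest()
    if key not in cache:
        map_calls += 1  # recompute only for new or changed content
        counts = {}
        for word in split.split():
            counts[word] = counts.get(word, 0) + 1
        cache[key] = counts
    return cache[key]

run1 = [map_split(s) for s in ["big data", "web crawl"]]  # 2 computations
run2 = [map_split(s) for s in ["big data", "web index"]]  # only 1 more:
# "big data" is unchanged, so its cached map output is reused directly.
```

In the 99%-unchanged scenario described above, almost every split would hit the cache, so the second run pays only for the changed fraction of the data.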
Complete the reading assignment and search the Library and
Internet to find and study more references that discuss how to
implement incremental computation with existing MapReduce
and Hadoop. Based on the results of your research, discuss the
following questions:
· What are the principles of using the initial MapReduce
framework and Hadoop to improve performance of incremental
computation?
· How can these principles be designed and implemented
without the need for any major change to the initial MapReduce
framework and Hadoop?
Justify your point of view and provide examples, as necessary.
===============================================
===============================
Unit 5 – 1
Within the Discussion Board area, write 400-600 words that
respond to the following questions with your thoughts, ideas,
and comments. This will be the foundation for future
discussions by your classmates. Be substantive and clear, and
use examples to reinforce your ideas.
Distributed programming is a computation method in which
software will run on separate cores in multiple networked
computers. It is a true parallel computation model because it
can provide fully supported computing resources for
multitasking. Based on different criteria, distributed
programming models can be classified differently. If the
term distributed system is defined as “a system consisting of
networked computers and communicating through either
messaging passing or shared distributed memory to coordinate
the software functions to solve a problem or provide a service,”
then you can divide that distributed programming into the
following two models:
· Shared memory distributed programming
· Message-passing distributed programming
According to the definition of a distributed system, a cloud
computing environment is considered a distributed system.
Therefore, both the shared memory distributed programming
and the message-passing distributed programming can be
applied in the cloud. However, if you use cloud computing to
conduct large-scale data processing, the existing capabilities of
both the shared memory distributed programming or the
message-passing distributed programming are not sufficient
(Sakr & Gaber, 2014).
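The two models can be contrasted in a small thread-based Python sketch. This runs on one machine for simplicity; a real distributed version would replace threads with networked processes, but the two coordination styles are the same.

```python
import threading
import queue

# Shared-memory style: workers coordinate by updating one shared
# variable, guarded by a lock.
total = 0
lock = threading.Lock()

def shared_worker(chunk):
    global total
    partial = sum(chunk)
    with lock:  # the coordination point is shared state
        total += partial

# Message-passing style: workers never share state; they communicate
# only by sending messages over a channel.
results = queue.Queue()

def message_worker(chunk):
    results.put(sum(chunk))  # the coordination point is the channel

data = [list(range(100)), list(range(100, 200))]

for target in (shared_worker, message_worker):
    threads = [threading.Thread(target=target, args=(c,)) for c in data]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

message_total = sum(results.get() for _ in data)
```

Both styles compute the same result here; the question the prompt asks is why neither, as-is, scales well to cloud-sized data sets.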
Complete the reading assignment, and search the Library and
Internet to find and study additional references that discuss the
7. concepts and applications of the distributed programming
models. Based on the results of your research, discuss the
following question:
· Why are both the shared memory distributed programming and
the message-passing distributed programming insufficient when
processing large-scale data in a cloud computing
environment?
Please justify your point of view and provide examples, as
necessary.
===============================================
==================================
Unit 5 – 2
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
The CAP theorem was originally proposed by Dr. E. Brewer at a
symposium on distributed computing, and he stated that “in any
highly distributed data system, there are three commonly
desirable properties: consistency, availability, and partition
tolerance. However, it is impossible for a system to provide all
three properties at the same time” (2000). This theorem was
later proven by S. Gilbert and N. Lynch. The CAP theorem has
had great impact on the design of distributed systems and
services, including distributed database management systems
(DDBMS). Web-based applications have posed new
requirements that traditional database systems such as SQL-
based relational database systems (RDBs) cannot fully satisfy.
This has triggered a new type of data storage system,
called NoSQL systems, to emerge and gradually become a
dominant alternative solution for data storage and management.
One popular practice among NoSQL data storage systems is to
use the CAP theorem to make trade-offs among the three
properties. Because of the high performance cost of maintaining
strong consistency based on the atomicity, consistency,
isolation, and durability (ACID) semantics upheld by RDBs,
NoSQL systems often apply a weak consistency model in
exchange for a great reduction of the performance overhead
involved in enforcing strong consistency (Sakr & Gaber, 2014).
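A toy sketch of the weak-consistency trade-off follows: a hypothetical two-replica, last-writer-wins key-value store in which writes land on one replica and a background anti-entropy pass propagates them later. Until that pass runs, a read on the other replica returns stale data.

```python
# A toy two-replica, last-writer-wins key-value store.
class Replica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

def replicate(src, dst):
    # Anti-entropy pass: merge src into dst; the newest timestamp wins.
    for key, (ts, value) in src.store.items():
        if key not in dst.store or dst.store[key][0] < ts:
            dst.store[key] = (ts, value)

a, b = Replica(), Replica()
a.write("user:1", "alice", ts=1)
stale = b.read("user:1")   # None: replica b has not converged yet
replicate(a, b)
fresh = b.read("user:1")   # "alice" once the background sync has run
```

The savings come from what this sketch omits: no cross-replica locking or coordination happens on the write path, which is exactly the overhead that strong consistency would impose.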
Complete the reading assignment, and search the Library and
Internet to find and study additional references that discuss the
concepts of strong and weak consistency. Based on the results
of your research, discuss the following questions:
· Why must a NoSQL data storage system based on the cloud
computing environment make trade-offs between consistency
and availability?
· Where do the savings in consistency-handling overhead
come from in a NoSQL data storage system that applies weak
consistency?
Please justify your point of view and provide examples, as
necessary.
===============================================
==================================
Unit 6 -1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
Any application that is data- and computing-intensive will be a
good candidate for services based on cloud computing.
Visualizing large-scale data sets is one such application.
Visualizing large-scale data sets involves two aspects—large-
scale data processing and a visualization interface. To use the
cloud computing environment for large-scale data processing,
you need to consider the network performance issues such as
“the unevenness of bandwidth of computer pairs” that will be
discussed in another assignment (Sakr & Gaber, 2014). Now
you are focusing on the visualization interface aspect. Authors
have proposed a prototype framework for the design of a
visualization service for the big data coming from the cloud
computing environment (Tanahashi et al., 2010). The authors
discuss the end-user functionality supported by the framework
and their technical decisions on how to implement the
framework.
Complete the reading assignment, and search the Library and
Internet to find and study additional references that discuss how
to visualize large-scale data sets in a cloud computing
environment. Based on the results of your research, discuss the
following questions:
· What is the end-user functionality of the framework reported
in Tanahashi et al.?
· What are the technical design decisions that have an impact on
the performance of the framework?
Justify your point of view and provide examples, as necessary.
===============================================
==============================
Unit 7 – 1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
Big data analytics can help solve some very hard problems. One
example is to detect network traffic anomalies caused by
diverse machine-generated traffic attacks (known as hit
inflation attacks, which “refer to the fraudulent activities of
generating charges for online advertisers without a real interest
in the product advertised”) by detecting the anomalous
deviation from the expected Internet Protocol (IP) size
distribution, where the term of IP size is defined as “the number
of users sharing the same source IP” (Sakr & Gaber, 2014). The
ability to detect hit inflation attacks is critical to the well-being
of online advertisement because it will ensure the healthy
operations of many daily used popular public Web-based
services, such as search engines, e-mail, maps, and other
Web-based applications. However, the network traffic data itself is
also a type of large-scale data set. To process such a data set
efficiently to discover the corresponding IP size distributions
for all publishers’ Web sites for detecting network traffic
anomalies is a very challenging task.
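One simple way to sketch the idea (illustrative only: the baseline distribution and traffic below are made up, and real detectors use more robust statistics over far larger data) is to build the IP-size histogram and measure its deviation from an expected baseline:

```python
def ip_size_histogram(hits):
    """hits: list of (source_ip, user_id) pairs.
    IP size = number of distinct users sharing one source IP."""
    users_per_ip = {}
    for ip, user in hits:
        users_per_ip.setdefault(ip, set()).add(user)
    hist = {}
    for users in users_per_ip.values():
        size = len(users)
        hist[size] = hist.get(size, 0) + 1
    total = sum(hist.values())
    return {size: count / total for size, count in hist.items()}

def deviation(observed, expected):
    """Total variation distance between two IP-size distributions."""
    sizes = set(observed) | set(expected)
    return 0.5 * sum(abs(observed.get(s, 0) - expected.get(s, 0))
                     for s in sizes)

# Assumed baseline for a publisher: mostly small IPs.
expected = {1: 0.7, 2: 0.2, 3: 0.1}
# Hypothetical hit-inflation traffic: 50 'users' behind a single IP.
bot_hits = [("10.0.0.1", f"u{i}") for i in range(50)]
observed = ip_size_histogram(bot_hits)
suspicious = deviation(observed, expected) > 0.5
```

The hard part at scale is not this comparison but computing the observed distributions over all publishers' traffic, which is itself a large-scale data processing problem.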
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss detecting
network traffic anomalies based on the IP size distribution.
Based on the results of your research, discuss the following
concepts:
· Identify 1 method of detecting network traffic anomalies based
on the IP size distribution.
· What are the design principles of the method?
· How does the method address the performance issue of
processing such large-scale network traffic data?
Justify your point of view and provide examples, as necessary.
===============================================
============================
Unit 7 – 2
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
Different from conventional distributed systems, such as
supercomputer-based client-server systems or small-scale
cluster systems, the network performance of the cloud
computing system has a unique characteristic. The network
bandwidth among different pairs of computers in the cloud can
vary significantly, and it is called “the bandwidth unevenness
among different machine pairs” (Sakr & Gaber, 2014). When a
very large-scale data set, such as a social network, Web
graph, or information network (known as a large-scale graph
data set), needs to be partitioned across many machines in a
cloud computing system before it can be processed, the network
performance problem caused by the bandwidth unevenness
among different machine pairs needs to be seriously considered
and addressed. It will impact the entire data processing
performance because the partitioning of the very large-scale
data set will generate a very large amount of network traffic and
will impact a very large number of machines. Network
performance is a critical parameter for the design in any cloud
computing-based large-scale graph data set partitioning and
processing method.
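A minimal sketch of one partitioning heuristic follows (hypothetical; production partitioners are far more sophisticated and can explicitly model per-link bandwidth): greedily co-locate neighboring vertices under a balance constraint, since the cross-machine edges are what generate the network traffic most exposed to bandwidth unevenness.

```python
# Two triangles of vertices joined by one bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]

def greedy_partition(edges, vertices, num_machines):
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    capacity = -(-len(vertices) // num_machines)  # ceiling: keep loads balanced
    assignment, load = {}, [0] * num_machines
    for v in vertices:
        open_machines = [m for m in range(num_machines) if load[m] < capacity]
        scores = {m: 0 for m in open_machines}
        for n in adjacency[v]:
            if n in assignment and assignment[n] in scores:
                scores[assignment[n]] += 1  # reward co-locating neighbors
        best = max(open_machines, key=lambda m: (scores[m], -load[m]))
        assignment[v] = best
        load[best] += 1
    return assignment

def cut_edges(edges, assignment):
    # Cross-machine edges are the ones that generate network traffic.
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

assignment = greedy_partition(edges, list("abcdef"), num_machines=2)
```

Here the heuristic places each triangle on its own machine, leaving only the single bridge edge to cross the network. A bandwidth-aware variant would additionally prefer cutting edges between machine pairs with high measured bandwidth.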
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss the cloud
computing-based large-scale graph dataset partitioning and
processing. Based on the results of your research, discuss the
following topics:
· Identify 2 large-scale graph data set partitioning methods used
in a cloud computing system.
· What are the design principles of each method?
· How does each method address the network performance issue
caused by the bandwidth unevenness among different machine
pairs?
Justify your point of view and provide examples, as necessary.
===============================================
=======================
Unit 8 – 1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
One of the main purposes of processing big data is to extract
knowledge (so-called big knowledge) from the big data
set. “Knowledge is the meaningfulness about the data” (Sakr &
Gaber, 2014). Knowledge representation is usually associated
with the problem-solving task’s specific requirements. For
example, if a problem-solving task involves time order, then a
list may be a suitable data structure for the knowledge
representation; if a problem-solving task involves no time order,
then a set may be a suitable data structure for the knowledge
representation. However, there is a universal standard for
knowledge representation proposed by the World Wide Web
Consortium (W3C) called the Resource Description
Framework (RDF; 2014). It is the standard model for
machine-readable data representation, which has now been commonly
used to hold the knowledge representation in an application of
processing big data set. Resource Description Framework is
very helpful when you need to integrate the results of several
big data set processing applications. It can facilitate knowledge
integration even when the underlying data schemas differ in the
original data storage systems.
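A tiny sketch of the triple model follows. The CURIE-style identifiers below are hypothetical, and a real system would use a proper RDF library and SPARQL rather than this toy matcher; the point is only that schema-free (subject, predicate, object) triples merge by simple union.

```python
# Hypothetical triples produced by two separate big-data applications.
triples_app1 = [
    ("ex:gene42", "ex:associatedWith", "ex:disease7"),
    ("ex:gene42", "rdfs:label", "BRCA-like gene"),
]
triples_app2 = [
    ("ex:disease7", "rdfs:label", "condition X"),
    ("ex:gene42", "ex:expressedIn", "ex:tissue3"),
]

# Integration is a simple set union: no schema reconciliation needed,
# because every fact already has the same (s, p, o) shape.
graph = set(triples_app1) | set(triples_app2)

def query(graph, s=None, p=None, o=None):
    """Match triples against an (s, p, o) pattern; None is a wildcard."""
    return [(ts, tp, to) for ts, tp, to in graph
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

facts_about_gene = query(graph, s="ex:gene42")
```

Because both applications' outputs land in one graph, a single pattern query can now span knowledge that originated in storage systems with entirely different schemas.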
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss how to
extract knowledge from a large-scale data set by applying
machine learning. Based on the results of your research, discuss
the following:
· Identify 1 research work that extracts knowledge
from a large-scale data set with specific real-world
semantics (e.g., an informatics system for biomedical research)
by applying machine learning.
· How is the machine learning applied in this research work?
· How is the extracted knowledge represented?
· How does the research work address the performance issue of
processing such large-scale data sets?
Justify your point of view and provide examples, as necessary.
===============================================
==============================
Unit 9 – 1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
One way to research the security issues associated with big data
is to look into every stage of the life cycle of big data. The
entire data life cycle consists of the following 8 stages (Khan et
al., 2014):
· Stage 1: Raw data
· Stage 2: Collection
· Stage 3: Filtering and classification
· Stage 4: Data analysis
· Stage 5: Storing
· Stage 6: Sharing and publishing
· Stage 7: Security
· Stage 8: Retrieval, reuse, and discovery
There are two items to point out, as follows:
· Stage 7 is an abstract stage.
· Only three stages (5, 6, and 8) are involved with security.
In Stage 5, the security issues associated with this stage are
mainly caused by two aspects—the size of data and the place to
store the data. Because the size of the data is too big, many
companies have to store their data in the cloud. However,
because the data are so big, it is really hard to verify whether
cloud vendors have indeed stored all the data. Because the cloud
operates in a black-box mode, customers have no way to
know where the data are stored, how they are stored, and
whether the integrity of the data is preserved. Because of the
cost of local storage and network bandwidth, customers cannot
even afford to use any simple approach, such as downloading
the entire data set, to verify if the data have been stored
properly in the cloud.
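A toy sketch of a sampling-based integrity check follows. It is illustrative only: real proof-of-retrievability and provable-data-possession schemes use cryptographic tags and unpredictable challenges rather than pre-stored hashes. The idea is that the customer keeps digests of a few chosen blocks and later challenges the cloud to return just those blocks, instead of downloading the entire data set.

```python
import hashlib

def block_hash(block):
    return hashlib.sha256(block).hexdigest()

# Before upload: the customer samples a few blocks and keeps only
# their digests locally (cheap compared to keeping the data).
data_blocks = [f"block-{i}".encode() for i in range(1000)]
challenge_indices = [3, 541, 977]
expected = {i: block_hash(data_blocks[i]) for i in challenge_indices}

# Later: the cloud is challenged to produce exactly those blocks.
def verify(cloud_blocks, expected):
    return all(block_hash(cloud_blocks[i]) == h for i, h in expected.items())

honest_cloud = {i: data_blocks[i] for i in challenge_indices}
ok = verify(honest_cloud, expected)

tampered = dict(honest_cloud)
tampered[541] = b"corrupted"
bad = verify(tampered, expected)
```

The bandwidth cost is proportional to the handful of challenged blocks, not to the full data set, which is what makes spot-checking affordable where a full download is not.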
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss the
security issues associated with big data and how to solve them.
Based on the results of your research, discuss the following
tasks:
· Identify 2 security issues associated with big data.
· What are the root causes of these 2 security issues?
· How can each of these 2 security issues be solved?
Justify your point of view and provide examples, as necessary.
===============================================
=================================
Unit 10 -1
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
The cloud computing system is subject to the general security
attacks common to all types of distributed computing systems,
such as the following (Prakash & Darbari,
2012):
· Eavesdropping (gaining secret information)
· Masquerading (assuming the identity of other users)
· Message tampering (changing the content of the message)
· Replaying the message
· Denial of service
However, because of several special system features of the
cloud computing systems, such as virtual machines (VM), trust
asymmetry, semitransparent system architecture, and so forth,
the cloud computing system has a few special security issues.
These are summarized into the following 10 technical aspects
(Sakr & Gaber, 2014):
· Exploitation of co-tenancy
· Secure architecture for the cloud
· Accountability for outsourced data
· Confidentiality of data and computation
· Privacy
· Verifying outsourced computation
· Verifying capability
· Cloud forensics
· Misuse detection
· Resource accounting and economic attacks
Additionally, even some non-technical areas (e.g., regulatory
compliance and legal jurisdiction) and many security research
assumptions pose security challenges for which no meaningful
solutions have yet been found (Prakash & Darbari, 2012). All of
these have kept security in cloud computing systems a current
and hot research subject.
Complete the reading assignment, and search the Library and
Internet to find and study more references that discuss the
security issues of cloud computing systems and how to solve
them. Based on the results of your research, discuss the
following tasks:
· Identify 2 security issues associated with a cloud computing
system.
· What are the root causes of these 2 security issues?
· How can these 2 security issues be solved?
Justify your point of view and provide examples, as necessary.
===============================================
==============================
Unit 10 -2
Primary Task Response: Within the Discussion Board area,
write 400-600 words that respond to the following questions
with your thoughts, ideas, and comments. This will be the
foundation for future discussions by your classmates. Be
substantive and clear, and use examples to reinforce your ideas.
Throughout this class, you have touched on many current hot
research subjects in big data analytics. You are familiar with
some of the problems that are still waiting for a better solution
in big data analytics. Assume that you will write a research
paper on a big data analytics-related subject. Discuss the following:
· Present your paper’s title, the motivation in at most 3
sentences, the problem statement in 1 sentence, and the
hypothesis statement in 1 sentence.
· The hypothesis statement will include a proposed solution to
address the root cause of the problem presented in your problem
statement.
· Present 2 research questions, and discuss your thought process
on how you came up with your research questions based on the
motivations, the problem statement, and the hypothesis
statement.
· Use 1 sentence to specify the new contribution made to the
body of knowledge by your proposed solution.
===============================================
=================================