Project Report (Summer 2016)
The Hong Kong University of Science and Technology
Department of Civil Engineering
UROP 1000: Summer 2016
Advisor: Prof WANG, Yu-Hsing
Metadata Event Log: A Data Management System to Store Inconsistent Data Entries
Topic 1: A Data-Driven Approach for Real-Time Landslide Monitoring and Early
Warning System
LAI, Yong Xin (20306541)
Topic 2: Big Data Landslide Early Warning System with Apache Spark and Scala
CHANG, Bing An Andrew (20307648)
NG, Zhi Yong Ignavier (20311194)
THAM, Brendan Guang Yao (20307935)
WONG, Wen Yan (20318893)
Abstract: In recent years, big data analytics has become the approach with which researchers and
companies conduct experiments and analysis. This is particularly true as the field of seismic
monitoring takes its first steps in landslide monitoring through large-scale deployment of
sensors (Ooi et al. 2016). A new data management system is needed for this project because it
deals with inconsistent, unstructured metadata. This report provides an overview of popular
data management conventions and explains why they do not suit our needs. Subsequently, we
introduce a new data management format, called Metadata Event Log (MEL), complemented
with functions written in Apache Spark and Scala to access and return the correct set of data.
Finally, we discuss lessons learned along the way and potential future developments.
1. Introduction
Traditionally, seismic monitoring relies on a permanent network of state-of-the-art
seismological and geophysical sensors connected by a telecommunications network for
monitoring and research purposes (GSN, 2016). However, this setup has its limitations.
Firstly, the sensors are expensive to install, as they require boreholes drilled deep into the
earth's crust for accurate readings, and the permanent installation prevents researchers from
focusing on one particular region, since the setup typically studies geography on a large scale.
Secondly, the hardware and sensors are inaccessible, which limits hardware upgrades,
software updates, recalibration, and replacement of faulty hardware. In contrast, our project
takes a different approach to seismic field monitoring by installing sensors on the surface of
the earth's crust, paired with large-scale deployment of consumer off-the-shelf electronics for
field monitoring (Ooi et al. 2016).
With this increased accessibility, researchers are able to frequently perform software updates,
hardware upgrades, recalibration, and replacement of faulty hardware. However, frequent
maintenance and upgrade activities create the problem of tracking the changes made to any
one particular node (changes and logs are recorded as metadata), and the sheer number of
nodes deployed in the field amplifies this problem. The dynamic nature of these updates
requires a Data Management System (DMS) flexible enough to record the changes as
metadata.
2. Comparison of existing DMSs and why they do not fulfil our needs
Over the years, several popular data management formats have been designed to
accommodate data exchange, notably the Standard for the Exchange of Earthquake Data
(SEED) and the Center for Seismic Studies Database Management System (CSSDBMS).
However, these databases are not built to accommodate dynamic metadata management.
In terms of SEED (FDSN, 2014), the data model utilizes sequences of format objects. Its
main flaw is the assumption that data transmitted back to the server already forms a lineage
(figure 1.1) and is readily usable for analysis, which is not the case for data fetched back
from the field. Moreover, the blockette system (figure 1.2) assumes a hierarchical
relationship between the data collected. A hierarchical data model can only be applied if we
are certain which dependent variables could be modified by a change to any one independent
variable. This is not the case for the metadata collected in our project, because a change to
one metric may or may not affect other metrics, and it may affect more than one. The
transient and inconsistent nature of our node system therefore makes the SEED format
unviable for our purposes.
In terms of CSSDBMS (Anderson et al. 1990), the database imposes a highly complicated
and rigid relationship between data (figure 1.3). This paradigm of data management utilizes a
relational database management system (RDBMS) because of its consistent, hierarchical data
model. An RDBMS offers the advantage of optimized access and queries, as all superclasses
and subclasses are well defined. The data collected in our project, however, is only partially
hierarchical (figure 1.4), and the pre-determined nature of an RDBMS is not efficient for
storing transient, unstructured, and inconsistent relational data of this kind. If we insisted on
an RDBMS, we would need to update the entire database schema every time we append a
new type of metadata. More importantly, the complexity of the redesigned database grows
with the variety of metadata collected, as the number of layers between superclasses and
subclasses increases.
3. A new Data Management System: Metadata Event Log (MEL)
3.1 What MEL is and what it records
The MEL is a collection of immutable time series where each row (or entry) represents
an event that occurred at some point in the node's lifetime (figure 2.1). Events fall into
two types: a) recalibration activities, and b) hardware maintenance and replacement
activities. Each event must be associated with a particular node, and information about
each node (metadata) is recorded under the respective fields of the column header. Each
entry also stores metadata about the raw data collected by the node; this special metadata
is called metricName.
3.2 Convention to record data: Identity and events
Firstly, the user should record the non-nullable fields; that is, the starting time of the event
in UTC date format (e.g. "2016-08-10T15:22:57+08:00") and at least one of "network",
"station", "location", and "node". Upon completion of this record, any further data recorded
are classified as event data.
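The recording convention above can be sketched in code. The real system stores the log as a Spark DataFrame written in Scala; here, as an illustrative simplification, the log is a plain Python list and the function names are our own, not part of the project's codebase.

```python
from datetime import datetime

# Non-nullable fields per the MEL recording convention described above.
REQUIRED_TIME_FIELD = "time"
IDENTITY_FIELDS = ("network", "station", "location", "node")

def append_event(mel, entry):
    """Validate the non-nullable fields, then append the entry to the log.

    An entry must carry a starting time (ISO 8601 date string with offset)
    and at least one identity field; everything else is event data.
    """
    if REQUIRED_TIME_FIELD not in entry:
        raise ValueError("entry must record the starting time of the event")
    # The time string must parse as an ISO 8601 timestamp.
    datetime.fromisoformat(entry[REQUIRED_TIME_FIELD])
    if not any(f in entry for f in IDENTITY_FIELDS):
        raise ValueError("entry must name at least one of "
                         "network/station/location/node")
    mel.append(dict(entry))  # entries are immutable once appended
    return mel

mel = []
append_event(mel, {"time": "2016-08-10T15:22:57+08:00",
                   "network": "LuShan", "node": "E",
                   "fieldCalibration": 1})
```

An entry missing both the time and all identity fields is rejected before it can corrupt the log.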
3.2.1 Recording hardware replacement activities
If this activity does not occur, the entries for all hardware metadata entries are left null.
Else, the user records the new hardware name as the entry. (figure 2.1)
3.2.2 Recording maintenance activities: A binary method
The user should input the time at which maintenance begins and enter a "1" under the
"maintenanceActivity" column. When the maintenance activity is completed, the user should
enter a "0" under the same column (figure 2.2).
3.3 Benefits of MEL
In short, the MEL format suits our project (Ooi et al. 2016) better than traditional
DMSs because of its robustness in storing new metadata types and events as they occur
(figure 2.1). Fundamentally, MEL was created to remove limitations on querying: for
example, a data analyst may search for all SD_card changes within a node or, conversely,
for all nodes a particular SD_card has been in. This flexibility is enabled by the flat,
tabular organization of the MEL; that is, the hierarchical characteristics of the metadata
are not reflected in this convention for recording events. Such querying flexibility cannot
be achieved with the RDBMS, SEED, or CSSDBMS data management systems because of
the pre-determined nature of their data storage.
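The bidirectional querying described above falls out naturally from a flat event log. A minimal sketch (the row values are hypothetical; the real queries run against a Spark DataFrame):

```python
# A flat MEL answers queries in either direction without a fixed hierarchy.
mel = [
    {"time": "2015-03-01", "node": "A", "SD_card": "M7"},
    {"time": "2015-06-01", "node": "A", "SD_card": "C6"},
    {"time": "2015-09-01", "node": "B", "SD_card": "M7"},
]

def sd_cards_of_node(log, node):
    """All SD_card changes recorded for one node, in time order."""
    return [e["SD_card"] for e in log
            if e.get("node") == node and "SD_card" in e]

def nodes_of_sd_card(log, card):
    """All nodes a particular SD_card has been installed in."""
    return [e["node"] for e in log if e.get("SD_card") == card]

print(sd_cards_of_node(mel, "A"))   # ['M7', 'C6']
print(nodes_of_sd_card(mel, "M7"))  # ['A', 'B']
```

The same rows serve both questions; a hierarchical schema would privilege one direction over the other.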
4. Accessing MEL
4.1 Access functions and inference feature: query() and filterDF() functions
MEL by itself is meaningless; we need functions to extract the correct set of data within
the lifetime of a node. We have therefore written two functions: filterDF() and query().
filterDF() is a fetch-and-return function used as an intermediary; it returns a dataframe
containing the events associated with user-specified metadata, subject to availability,
restricted to entries within the specified time period. query() is a function that calls
filterDF() multiple times to return inferred results as a dataframe.
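The behaviour of filterDF() can be sketched as follows. This is an illustrative simplification, not the project's Scala implementation: plain strings stand in for timestamps, a list of dicts stands in for the Spark DataFrame, and missing metadata is represented as None.

```python
def filterDF(mel, isNotNullConds, containConds, startTime, endTime):
    """Fetch-and-return intermediary: keep entries whose time falls in
    [startTime, endTime], whose listed metadata fields are non-null, and
    whose identity value appears in the allowed list."""
    out = []
    for e in mel:
        if not (startTime <= e["time"] <= endTime):
            continue
        if any(e.get(field) is None for field in isNotNullConds):
            continue
        # Each condition is e.g. ["node", "A", "B", "C", "E"]:
        # type first, allowed names after.
        if all(e.get(cond[0]) in cond[1:] for cond in containConds):
            out.append(e)
    return out

mel = [
    {"time": "2015", "node": "B", "simCode": None, "fieldCalibration": "f5"},
    {"time": "2018", "node": "D", "simCode": None, "fieldCalibration": "f8"},
]
rawDF = filterDF(mel, ["fieldCalibration"], [["node", "A", "B", "C", "E"]],
                 "2014.5", "2019")
```

Here node "D" is dropped because it is not in the allowed node list, while node "B" survives all three filters.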
4.2 Purpose of inference feature
Inference is a feature that searches for time mismatches or empty entries and returns
the "nearest-previous" entry. Each event is stored as an entry, and it is assumed that no
changes occur until the next entry. A data analyst may query for an event that last occurred
before the start of the specified time period, in which case a plain filter returns insufficient
data for analysis. The query() function therefore uses inference to return the most recent
previous entry for the specified event. To illustrate the purpose of inferencing, consider the
following example:
Suppose the user queries for field calibration metadata from 2013-01-01T00:00:00+08:00 to
2015-01-01T00:00:00+08:00 for a particular node, and field calibration events occurred
during 2012, 2014, and 2016. In this case, the returned dataframe is a single entry of field
calibration metadata from 2014. Querying without inference returns ambiguous results,
because the analyst does not know what calibration values to use from the start time (2013)
up to the returned calibration record (2014). The missing calibration values therefore need
to be inferred from the "nearest-previous" entry. Inference is only required to return the
"nearest-previous" entry, not the "nearest-future" one: since we assume that no changes
occur until the next entry in the log, all relevant metadata from the first returned entry
onwards is complete within the queried time period.
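The "nearest-previous" lookup at the heart of this feature can be sketched in a few lines (an illustrative simplification with years as strings; the function name is ours, not the project's):

```python
def nearest_previous(mel, field, node, startTime):
    """Return the latest entry before startTime in which `field` is
    recorded for the given node -- the "nearest-previous" fallback."""
    candidates = [e for e in mel
                  if e.get("node") == node
                  and e.get(field) is not None
                  and e["time"] < startTime]
    return max(candidates, key=lambda e: e["time"], default=None)

# Field calibrations occurred in 2012, 2014 and 2016; a query starting
# in 2013 falls back on the 2012 record:
mel = [{"time": "2012", "node": "E", "fieldCalibration": "c1"},
       {"time": "2014", "node": "E", "fieldCalibration": "c2"},
       {"time": "2016", "node": "E", "fieldCalibration": "c3"}]
print(nearest_previous(mel, "fieldCalibration", "E", "2013")["fieldCalibration"])  # c1
```

No "nearest-future" counterpart is needed, since each entry is assumed to hold until the next one.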
4.3 How inferencing is implemented
This section is best explained with the following example (user input conventions are
explained in section 4.4). Suppose the user queries for simCode and fieldCalibration in
Table 1 using the conditions (denoted inputCondition1): isNotNullConds =
Seq("fieldCalibration"), containsConds = Seq(Seq("node", "A", "B", "C", "E")), startTime =
"2014.5", endTime = "2019". The dataframes before and after startTime are called tempDF
and rawDF respectively.
Firstly, filterDF(inputCondition1) is called to return rawDF (shown in Table 2). The
remaining nodes, ListOfDistinctNodes = Seq("node", "A", "B", "C"), can be found from
rawDF. To obtain tempDF, filterDF is called again with different conditions (denoted
inputCondition2): isNotNullConds = Seq("fieldCalibration"), containsConds =
Seq(ListOfDistinctNodes), startTime = "2001", endTime = "2014.5". Here startTime is the
absolute starting time of the MEL, while endTime is the startTime supplied by the user in
inputCondition1. The resulting tempDF is shown in Table 3.
tempDF is further split by node, resulting in Tables 4a and 4b, which are stored as a
sequence named SequenceOfDataFrames[]. To make the dataframes easier to access,
SequenceOfDataFrames[] is converted to a 3D array format. From each 3D array, one
inferred row is produced, giving Tables 5a and 5b. Note that from Table 4a to Table 5a,
the time is changed to the user's startTime (inputCondition1) because all required
conditions are non-null. From Table 4b to Table 5b, the value of simCode is inferred by
searching vertically upward for the most recent non-null value, which is s3, while the
value of fieldCalibration is used directly. Lastly, rawDF (Table 2) and the inferred rows
(Tables 5a and 5b) are combined to return the desired result (Table 6) to the user.
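The walkthrough above can be reproduced end-to-end on the Table 1 data. This sketch simplifies the real implementation: plain lists stand in for Spark DataFrames, the two filterDF calls are inlined, years are strings as in the tables, and the requested columns are fixed to simCode and fieldCalibration.

```python
# Table 1: the entire history of record to be queried.
MEL = [
    {"time": "2011", "node": "C", "simCode": "s1", "fieldCalibration": "f1"},
    {"time": "2012", "node": "E", "simCode": "s2", "fieldCalibration": "f2"},
    {"time": "2013", "node": "A", "simCode": "s3", "fieldCalibration": "f3"},
    {"time": "2014", "node": "A", "simCode": None, "fieldCalibration": "f4"},
    {"time": "2015", "node": "B", "simCode": None, "fieldCalibration": "f5"},
    {"time": "2016", "node": "C", "simCode": None, "fieldCalibration": "f6"},
    {"time": "2017", "node": "A", "simCode": "s4", "fieldCalibration": "f7"},
    {"time": "2018", "node": "D", "simCode": None, "fieldCalibration": "f8"},
]

def query(mel, nodes, startTime, endTime):
    # rawDF: in-range entries for the requested nodes (Table 2).
    rawDF = [e for e in mel
             if startTime <= e["time"] <= endTime and e["node"] in nodes]
    # tempDF: pre-startTime history of the nodes seen in rawDF (Table 3).
    seen = {e["node"] for e in rawDF}
    tempDF = [e for e in mel if e["time"] < startTime and e["node"] in seen]
    # One inferred row per node: for each column, take the most recent
    # non-null value before startTime (Tables 4a/4b -> 5a/5b).
    inferred = []
    for node in sorted(seen):
        history = [e for e in tempDF if e["node"] == node]
        if not history:
            continue
        row = {"time": startTime, "node": node}
        for col in ("simCode", "fieldCalibration"):
            vals = [e[col] for e in history if e[col] is not None]
            row[col] = vals[-1] if vals else None
        inferred.append(row)
    return inferred + rawDF  # Table 6

result = query(MEL, {"A", "B", "C", "E"}, "2014.5", "2019")
```

Running this yields the five rows of Table 6: the inferred rows ("2014.5", "A", s3, f4) and ("2014.5", "C", s1, f1) plus the three in-range entries; node "B" gets no inferred row because it has no pre-startTime history.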
4.4 User documentation: How to use query()
The query function takes multiple input arguments: containsConds, isNotNullConds,
startTime, and endTime. The end goal of the function is to return a DataFrame containing
all entries of the specified metadata (listed under isNotNullConds) from the correct node,
station, or network in the specified list, with inference performed on the containsConds
arguments. The user first inputs the desired start and end times under startTime and
endTime in UTC date format, e.g. 2015-01-23T15:22:30+08:00. Next, the desired metadata,
such as fieldCalibration or SD_Card, is input as a 1D sequence in the isNotNullConds
argument, while the specific node, station, or network is input as a 2D sequence in the
containsConds argument. This is best illustrated with the following example.
Suppose a data analyst wants to query fieldCalibration and SD_Card data covering all
nodes in the "LuShan" network plus the specific individual nodes "1792" and "E" (all
within the same specified timeframe). The identifiers of the specific node(s), station(s),
or network(s) belong under the containsConds argument, with the input syntax
Seq(Seq("Network", "LuShan"), Seq("Node", "1792", "E")). The convention is that the
type (i.e. node, station, or network) is the first element of each inner sequence, and the
actual names ("LuShan", "1792", "E") are its subsequent elements; each type should be
placed in a separate inner sequence. Next, the corresponding set of metadata is given to
the isNotNullConds argument as follows: Seq("fieldCalibration", "SD_Card"). Should the
analyst leave isNotNullConds blank, the function returns all metadata that exists for the
corresponding node, station, or network.
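The input conventions above can be mirrored as plain nested lists. The matching rule below is our reading of the example (an entry is selected if it satisfies any one inner sequence, so the query covers the whole "LuShan" network as well as the named nodes); the report does not spell out the combination semantics, so treat this as an assumption.

```python
# containsConds: 2D sequence -- type first, then the allowed names.
containsConds = [["Network", "LuShan"], ["Node", "1792", "E"]]
# isNotNullConds: 1D sequence of required metadata columns.
isNotNullConds = ["fieldCalibration", "SD_Card"]

def matches(entry, conds):
    """Union semantics (assumed): an entry is selected if it satisfies
    any one of the inner sequences."""
    return any(entry.get(cond[0].lower()) in cond[1:] for cond in conds)

in_network  = {"network": "LuShan", "node": "X", "fieldCalibration": 1}
named_node  = {"network": "Other", "node": "1792", "SD_Card": "M7"}
unrelated   = {"network": "Other", "node": "X"}

print(matches(in_network, containsConds))   # True
print(matches(named_node, containsConds))   # True
print(matches(unrelated, containsConds))    # False
```

Each inner sequence keeps one type in one place, which is why mixing, say, a station name into the Node sequence would silently match nothing.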
4.5 User Documentation: How the user should interpret results
The returned dataframe is shown in Table 6. One special kind of metadata that the user
needs to infer manually is the calibration data. Calibration can be performed at different
levels of the project (network, station, location, node, etc.). To determine at which level a
calibration was done, the user checks which identifier columns of the entry are filled: the
most specific filled column gives the level. This is best explained by referring to figure 2.2:
At time t1, field calibration is performed on all nodes in the "LuShan" network. At time t3,
field calibration is performed on the specific node "E" in station "001_COTS", which is in
location "001" of the "LuShan" network.
5. Conclusion and further developments
In conclusion, we have achieved our goal of creating a new DMS for efficient metadata
management. We have learned that traditional DMSs have their pros and cons, but for the
purposes of our project, the demand for a scalable, robust DMS exceeds what existing
systems provide, owing to the messy and transient nature of the metadata generated. In
the near future, we see two frontiers for further development of this database project: a
ranking system for results, and over-the-air (OTA) calibration metadata transfer.
It is foreseeable that, given the scale of this project, as the number of nodes deployed
increases over time, it will become more challenging for users of the query() function to
mentally keep track of the names and corresponding "active time periods" of each node.
We therefore plan to develop a ranking system that statistically records the frequency with
which different startTime, endTime, containsConds, and isNotNullConds values are called
by each user. This allows the system to predict the user's most favoured set of input
parameters and offer recommendations while the user types. The ranking system needs to
be user-specific, as different users tend to focus on different columns of the logbook for
analysis.
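One way the planned ranking system could work is sketched below. This is purely a design sketch of future work under our own assumptions; nothing here exists in the current codebase, and the class and method names are hypothetical.

```python
from collections import Counter

class QueryRanker:
    """Per-user frequency counts of past query parameters, used to
    suggest the most frequently used value for each parameter."""

    def __init__(self):
        self.counts = {}  # user -> parameter name -> Counter of values

    def record(self, user, params):
        """Log one query's parameter values for this user."""
        per_user = self.counts.setdefault(user, {})
        for name, value in params.items():
            per_user.setdefault(name, Counter())[value] += 1

    def suggest(self, user, name):
        """Return this user's most frequent value for a parameter,
        or None if nothing has been recorded."""
        counter = self.counts.get(user, {}).get(name)
        return counter.most_common(1)[0][0] if counter else None

r = QueryRanker()
r.record("alice", {"startTime": "2014.5", "endTime": "2019"})
r.record("alice", {"startTime": "2014.5", "endTime": "2020"})
print(r.suggest("alice", "startTime"))  # 2014.5
```

Keeping one counter table per user makes the recommendations user-specific, matching the observation that different users focus on different columns of the logbook.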
In terms of OTA calibration metadata transfer, we intend to build a mobile app to aid
contractors in node deployment. In essence, contractors will input the calibration metadata
of the associated node during installation while on-site; this data is then sent from the app
to the server for pre-processing before being appended to the MEL.
Appendix
Figure 1.1: Standard for the
international Exchange of Earthquake
Data (SEED)
Figure 1.2: SEED Blockette Convention
Figure 1.3: Centre of Seismic Studies Database Management System (CSSDBMS)
Structured Relationship Model
Figure 1.4: Traditional Hierarchical Structure
Time | Node Name | SD Card | 3G Modem Sim Code | APP    | RPi
t1   | "U"       | "M7"    |                   |        |
t2   | "U"       |         | "61E13CU957744"   |        |
t3   | "U"       | "C6"    |                   |        |
t4   | "U"       | "M9"    |                   |        |
t5   | "U"       |         |                   | "2P"   | "A+"
t6   | "U"       | "JA18"  | "61Y15BU133338"   | "4K"   |
Figure 2.1: Example of Metadata Event Log
Time | Network  | Location | Station    | Node | Lab Calibration | Field Calibration | Orientation | Field Maintenance
t1   | "Lushan" |          |            |      |                 | 1                 |             |
t2   | "LuShan" | "001"    |            |      |                 |                   |             | 1
t3   | "LuShan" | "001"    | "001_COTS" | "E"  |                 | 1                 |             |
t4   | "LuShan" | "001"    |            |      |                 |                   |             | 0
t5   | "LuShan" | "001"    |            |      | --              | --                | --          | --
t6   | "LuShan" | "001"    | "001_COTS" | "E"  | --              | --                | --          | --
Figure 2.2: Recording maintenance activities - A binary method
[Example flow chart with keys: hierarchy levels Network "Lushan" -> Location "001" -> Station "001_COTS" -> Node "E" -> Sensor (Gyrometer)]
Implementation of Inferencing
Table 1: Raw data from MEL containing entire history of record to be queried
time node simCode fieldCalibration
"2011" "C" s1 f1
"2012" "E" s2 f2
"2013" "A" s3 f3
"2014" "A" f4
"2015" "B" f5
"2016" "C" f6
"2017" "A" s4 f7
"2018" "D" f8
Table 2: rawDF
time node simCode fieldCalibration
"2015" "B" f5
"2016" "C" f6
"2017" "A" s4 f7
Table 3: tempDF
time node simCode fieldCalibration
"2011" "C" s1 f1
"2013" "A" s3 f3
"2014" "A" f4
Table 4a: tempDF, split according to node
time node simCode fieldCalibration
"2011" "C" s1 f1
Table 4b: tempDF, split according to node
time node simCode fieldCalibration
"2013" "A" s3 f3
"2014" "A" f4
Table 5a: Inferencing 1
time node simCode fieldCalibration
"2014.5" "C" s1 f1
Table 5b: Inferencing 2
time node simCode fieldCalibration
"2014.5" "A" s3 f4
Table 6: Final Result
time node simCode fieldCalibration
"2014.5" "C" s1 f1
"2014.5" "A" s3 f4
"2015" "B" f5
"2016" "C" f6
"2017" "A" s4 f7
References:
1. Ghee Leng Ooi, Pin Siang Tan, Mei Ling Leung, Hoi Lun Lui, Yee Shien Yeo,
Jimmy Wu, Jun-Ting Lin, Yu-Hsing Wang, Kuo-Lung Wang, Meei-Ling Lin, Qian
Zhang (2016). The DESIGnSLM Architecture: A Data-Enabled Scalable
Instrumentation for Geotechnical Engineering, Seismic and Landslide Monitoring
2. GSN: Global Seismographic Network. (n.d.). Retrieved August 15, 2016, from
http://earthquake.usgs.gov/monitoring/gsn/
3. FDSN: International Federation of Digital Seismograph Networks Incorporated
Institutions for Seismology (2014). Standard for the Exchange of Earthquake Data
(SEED) Format Version 2.12
4. J. Anderson, W.E Farrell, K. Garcia, J. Given, H. Swanger (1990). Center For Seismic
Studies Version 3 Database: Schema Reference Manual
5. J. Anderson, H. Swanger (1990). Center For Seismic Studies Version 3 Database:
SQL Tutorial