The Hong Kong University of Science and Technology
Department of Civil Engineering
UROP 1000: Summer 2016
Advisor: Prof WANG, Yu-Hsing
Metadata Event Log: A Data Management System to Store Inconsistent Data Entries
Topic 1: A Data-Driven Approach for Real-Time Landslide Monitoring and Early
Warning System
LAI, Yong Xin (20306541)
Topic 2: Big Data Landslide Early Warning System with Apache Spark and Scala
CHANG, Bing An Andrew (20307648)
NG, Zhi Yong Ignavier (20311194)
THAM, Brendan Guang Yao (20307935)
WONG, Wen Yan (20318893)
Abstract: In recent years, big data analytics has become the approach with which researchers and
companies conduct experiments and analyses. This is particularly true as the field of seismic
monitoring takes its first steps in landslide monitoring through large-scale deployment of
sensors (Ooi et al. 2016). A new data management system oriented to this project is needed
because the project deals with inconsistent, unstructured metadata. This report provides an
overview of popular data management conventions and why they do not suit our needs.
Subsequently, we introduce a new data management format, called Metadata Event Log
(MEL), complemented by functions written in Apache Spark and Scala to access and return
the correct set of data. Finally, we discuss lessons we learned along the way and potential
future developments.
1. Introduction
Traditionally, seismic monitoring relies on a permanent network of state-of-the-art seismological
and geophysical sensors connected by a telecommunications network for monitoring and
research purposes (GSN, 2016). However, this setup has its limitations. Firstly, it is
expensive to install, as accurate readings require sensors placed in boreholes drilled deep into
the earth's crust. Permanent installation also prevents researchers from focusing on one
particular region, as the setup typically studies geography on a large scale. Secondly, the
hardware and sensors are inaccessible, which limits hardware upgrades, software updates,
recalibration, and replacement of faulty hardware. In contrast, our project takes a different
approach to seismic field monitoring by installing sensors on the surface of the earth's crust,
paired with large-scale deployment of consumer off-the-shelf electronics for field monitoring
(Ooi et al. 2016).
With this increased accessibility, researchers can frequently perform software updates,
recalibration, and replacement of faulty hardware. However, such frequent maintenance and
upgrade activities create the problem of tracking the changes made to any one particular node
(changes/logs are recorded as metadata), and the sheer number of nodes deployed in the field
further amplifies this problem. The dynamic nature of updates requires a Data Management
System (DMS) flexible enough to record these changes as metadata.
2. Comparison of existing DMSs and why they do not fulfil our needs
Over the years, several popular data management formats have been designed to
accommodate data exchange, notably the Standard for the Exchange of Earthquake Data
(SEED) and the Center for Seismic Studies Database Management System (CSSDBMS). However,
these databases are not built to accommodate dynamic metadata management.
In the case of SEED (FDSN, 2014), the data model utilizes sequences of format objects. Its
main flaw is that the model assumes the data transmitted back to the server already forms a
lineage (figure 1.1) and is readily usable for analysis, which is not the case for data
fetched back from the field. Besides that, the blockette system (figure 1.2) assumes a
hierarchical relationship between the data collected. A hierarchical data model can only be
applied if we are certain which dependent variables could be modified by the change of any
one independent variable. This is not the case for the metadata collected in our project: a
change in one metric may or may not affect other metrics, and more than one metric may be
affected. The transient and inconsistent nature of our node system therefore makes the SEED
system unviable for our purposes.
In the case of CSSDBMS (Anderson et al. 1990), the database imposes a highly complicated and
rigid relationship between data (figure 1.3). This paradigm of data management
utilizes a relational database management system (RDBMS) because of its consistent,
hierarchical data model. An RDBMS offers the advantage of optimized access and querying, as all
superclasses and subclasses are well defined. In our project, however, the data collected is only
partially hierarchical (figure 1.4), and the pre-determined nature of an RDBMS makes it
inefficient for storing transient, unstructured, and inconsistent relational data. If we insisted
on an RDBMS, we would need to update the entire database schema every time we append a
new metadata type. More importantly, the complexity of the redesigned database grows with
the variety of metadata collected, as the number of layers between superclasses and
subclasses increases.
3. A new Data Management System: Metadata Event Log (MEL)
3.1 What MEL is and what it records
The MEL is a collection of immutable time series in which each row (or entry) represents
an event that occurred at some point in the node's lifetime (figure 2.1). Events fall into
two types: a) recalibration activities, and b) hardware maintenance and
replacement activities. Each event must be associated with a particular node, and
information about each node (metadata) is recorded under the respective fields of the
column header. Each entry also stores metadata about the raw data collected by the node;
this special metadata is called metricName.
3.2 Convention to record data: Identity and events
Firstly, the user should record the non-nullable fields: the starting time of the event in
UTC date format, e.g. "2016-08-10T15:22:57+08:00", and at least one of "network",
"station", "location", and "node". Any further data recorded after these fields are
classified as event data.
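The recording convention above (a non-nullable start time plus at least one identity field, with everything else treated as nullable event data) can be sketched as a record type. This is a minimal illustration in plain Scala rather than the project's Spark code, and field names beyond those listed in the text (e.g. sdCard) are hypothetical:

```scala
// Minimal sketch of an MEL entry: `time` plus at least one identity
// field is mandatory; everything else is nullable event data,
// modelled here with Option[String].
case class MelEntry(
  time: String,                        // non-nullable, UTC, e.g. "2016-08-10T15:22:57+08:00"
  network: Option[String] = None,      // at least one of these four
  station: Option[String] = None,      //   identity fields must be set
  location: Option[String] = None,
  node: Option[String] = None,
  sdCard: Option[String] = None,           // hardware replacement events (hypothetical name)
  maintenanceActivity: Option[Int] = None  // 1 = start, 0 = end (section 3.2.2)
)

// A well-formed entry carries a time plus at least one identity field.
def isValid(e: MelEntry): Boolean =
  e.time.nonEmpty &&
    Seq(e.network, e.station, e.location, e.node).exists(_.isDefined)
```

Representing nullable columns as Option rather than empty strings keeps "no event occurred" distinct from "event with an empty value", which matters for the inference described in section 4.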
3.2.1 Recording hardware replacement activities
If no replacement occurs, all hardware metadata entries are left null. Otherwise, the user
records the new hardware name as the entry (figure 2.1).
3.2.2 Recording maintenance activities: A binary method
The user should input the time at which maintenance begins and enter a "1" under the
"maintenanceActivity" column. When the maintenance activity is completed, the user should
enter a "0" under the same column (figure 2.2).
3.3 Benefits of MEL
In short, the MEL format suits our project (Ooi et al. 2016) better than traditional
DMSs because of its robustness in storing new metadata types and events as they occur
(figure 2.1). Fundamentally, MEL was created to remove limitations in querying: for
example, a data analyst may search for all SD_card changes within a node or, conversely,
search for all nodes a particular SD_card has been in. This flexibility is enabled by the flat,
tabular organization of the MEL; that is, the hierarchical characteristics of the metadata are
not reflected in this convention for recording events. In contrast, such querying flexibility
cannot be implemented in RDBMS, SEED, or CSSDBMS data management systems because of
the pre-determined nature of their data storage.
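To illustrate the bidirectional querying the flat layout permits, here is a minimal plain-Scala sketch (not the project's Spark code; the sample values are hypothetical):

```scala
// A flat log: each row is (time, node, sdCard). Because nothing is
// nested under a fixed hierarchy, both query directions are one filter.
val log = Seq(
  ("t1", "U", "M7"),
  ("t4", "U", "M9"),
  ("t5", "V", "M7")
)

// All SD card changes within node n.
def cardsOfNode(n: String): Seq[String] =
  log.collect { case (_, node, card) if node == n => card }

// All nodes a particular SD card c has been in.
def nodesOfCard(c: String): Seq[String] =
  log.collect { case (_, node, card) if card == c => node }
```

In a hierarchical model the second query would require walking every branch of the tree; in the flat table both are symmetric scans.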
4. Accessing MEL
4.1 Access functions and inference feature: query() and filterDF() functions
MEL by itself is meaningless; we need functions to extract the correct set of data within
the lifetime of a node. Therefore, we have written two functions: filterDF() and
query(). filterDF() is a fetch-and-return function used as an intermediary;
it returns a dataframe containing events associated with user-specified metadata, subject
to availability, restricted to entries within the specified time period. query() is a
function that calls filterDF() multiple times to return inferred results as a dataframe.
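As a rough sketch of filterDF()'s semantics, using plain Scala collections standing in for Spark DataFrames (the simplified string times such as "2014.5" follow the worked example in section 4.3; this is not the project's actual implementation):

```scala
// A row is a map from column name to an optional value; simplified
// times like "2014.5" sort correctly as strings in this example.
type Row = Map[String, Option[String]]

// Sketch of filterDF(): keep rows inside [startTime, endTime] whose
// isNotNullConds columns are filled and whose containConds match.
def filterDF(rows: Seq[Row],
             isNotNullConds: Seq[String],
             containConds: Seq[Seq[String]],  // e.g. Seq(Seq("node", "A", "B"))
             startTime: String,
             endTime: String): Seq[Row] =
  rows.filter { r =>
    val t = r("time").getOrElse("")
    val inWindow = t >= startTime && t <= endTime
    val nonNull  = isNotNullConds.forall(c => r.getOrElse(c, None).isDefined)
    val contains = containConds.forall { cond =>
      val (col, values) = (cond.head, cond.tail)
      values.contains(r.getOrElse(col, None).getOrElse(""))
    }
    inWindow && nonNull && contains
  }
```

The real function operates on Spark DataFrames, but the three predicates (time window, non-null columns, membership lists) are the same.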
4.2 Purpose of inference feature
Inference is a feature that searches for time mismatches or empty entries and returns
the "nearest-previous" entry. Each event is stored as an entry, and no changes are assumed
to occur until the next entry. It is possible that a data analyst queries for an event that
occurred before the start time of the specified period, in which case the query would return
insufficient data for analysis. Thus, the query() function uses inference to return the most
recent previous entry for the specified event. To illustrate the purpose of inferencing,
consider the following example:
Suppose a user queries for field calibration metadata from 2013-01-01T00:00:00+08:00 to
2015-01-01T00:00:00+08:00 for a particular node, and field calibration events occurred
during 2012, 2014, and 2016. In this case, the returned dataframe is a single entry of field
calibration metadata from 2014. Querying without inference returns ambiguous results,
because the analyst does not know which calibration values to use from the start time (2013)
up to the returned calibration record (2014). Hence, the missing calibration values need to be
inferred from the "nearest-previous" entry. Inference is only required to return the "nearest-
previous" entry, not the "nearest-future" one: since we assume no changes occur until the
next entry in the log, all relevant metadata from the first returned entry onwards is complete
within the queried time period.
4.3 How inferencing is implemented
This section is best explained with the following example (user input conventions
are explained in section 4.4). Suppose a user queries for simCode and fieldCalibration in Table
1 using the conditions (denoted inputCondition1): isNotNullConds = Seq("fieldCalibration"),
containConds = Seq(Seq("node", "A", "B", "C", "E")), startTime = "2014.5", endTime =
"2019". Note that the dataframes before and after startTime are defined as tempDF and
rawDF respectively.
Firstly, filterDF is called with inputCondition1 to return rawDF (as
shown in Table 2). The remaining nodes, ListOfDistinctNodes = Seq("node", "A", "B",
"C"), can be found from rawDF. To obtain tempDF, filterDF is called again,
but with different conditions (denoted inputCondition2): isNotNullConds =
Seq("fieldCalibration"), containConds = Seq(ListOfDistinctNodes), startTime = "2001",
endTime = "2014.5". Note that this startTime is the absolute starting time of the MEL,
while this endTime is the startTime supplied by the user in inputCondition1. The resulting
tempDF is shown in Table 3.
tempDF is then split by node, yielding Table 4a and Table 4b,
which are stored as a sequence named SequenceOfDataFrames[]. To make the DataFrames
easier to access, SequenceOfDataFrames[] is converted to a 3D array format. From
each 3D array, one inferred row is produced, yielding Table 5a and Table 5b. Note that
from Table 4a to Table 5a, the time is changed to the user's startTime (inputCondition1)
because all required conditions are non-null. From Table 4b to Table 5b, the value of simCode is
inferred by scanning vertically upward for the most recent non-null value, which is s3, while
the value of fieldCalibration is used directly. Lastly, rawDF (Table 2) and the inferred
rows (Table 5a and Table 5b) are combined to return the desired result (Table 6) to the user.
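The "nearest-previous" fill just described can be sketched as follows, again with plain Scala collections standing in for Spark DataFrames; inferRow is a hypothetical helper shown only to illustrate the step from Table 4b to Table 5b, not the project's actual implementation:

```scala
type Row = Map[String, Option[String]]

// Sketch of the inference step: given one node's pre-startTime history
// (oldest first), build a single row stamped with startTime where each
// requested column takes its most recent non-null value.
def inferRow(history: Seq[Row],     // entries for one node, time-ordered
             columns: Seq[String],  // e.g. Seq("simCode", "fieldCalibration")
             startTime: String): Row = {
  val filled = columns.map { col =>
    // scan vertically upward (latest entry first) for a non-null value
    col -> history.reverse.flatMap(_.getOrElse(col, None)).headOption
  }.toMap
  filled ++ Map(
    "time" -> Some(startTime),
    "node" -> history.last.getOrElse("node", None))
}
```

Run on node A's history from Table 4b with startTime "2014.5", this yields the Table 5b row: simCode inferred upward to s3, fieldCalibration taken directly as f4.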
4.4 User documentation: How to use query()
The query function takes multiple input arguments: containsConds,
isNotNullConds, startTime, and endTime. The end goal of the function is to return a
DataFrame containing all entries of the specified metadata (listed under isNotNullConds)
from the correct nodes, stations, or networks in the specified list, with inference performed
on the containsConds arguments. The user first inputs the desired start and end times under
startTime and endTime in UTC date format, e.g. 2015-01-23T15:22:30+08:00. Next, the
desired metadata such as fieldCalibration or SD_Card is inputted as a 1D sequence in the
isNotNullConds argument, while the specific nodes, stations, or networks are inputted as a
2D sequence in the containsConds argument. This is best illustrated with the following
example.
Suppose a data analyst wants to query fieldCalibration and SD_Card data
for all nodes in the "LuShan" network, plus the specific individual nodes "1792" and
"E" (all within the same specified timeframe). The identifiers of the specific node(s),
station(s), or network(s) belong under the containsConds argument, with the input
syntax Seq(Seq("Network", "Lushan"), Seq("Node", "1792", "E")). The convention is that
the type (i.e. node, station, or network) is the first element of each inner sequence, and the
actual names ("Lushan", "1792", "E") are the subsequent elements; each type should be
placed in its own inner sequence. Next, the corresponding set of metadata is passed to
isNotNullConds as follows: Seq("fieldCalibration", "SD_Card"). Should the analyst leave
isNotNullConds blank, the function returns all metadata that exists for the corresponding
nodes, stations, or networks.
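The argument convention in this example can be written down directly. In the sketch below, expand is a hypothetical helper shown only to illustrate how the 2D containsConds sequence is interpreted:

```scala
// containsConds convention: first element of each inner sequence is
// the identifier type; the remaining elements are the actual names.
val containsConds = Seq(
  Seq("Network", "Lushan"),
  Seq("Node", "1792", "E")
)
val isNotNullConds = Seq("fieldCalibration", "SD_Card")

// Expand the 2D sequence into (type, name) pairs, e.g. ("Node", "1792").
def expand(conds: Seq[Seq[String]]): Seq[(String, String)] =
  conds.flatMap(c => c.tail.map(name => (c.head, name)))
```

Expanding the conditions into pairs makes the convention explicit: one inner sequence per identifier type, names following the type.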
4.5 User Documentation: How the user should interpret results
The returned dataframe is shown in Table 6. One special type of metadata the user needs
to interpret manually is the calibration data. Calibration can be done at different levels of the
project, i.e. network, station, location, node, etc. To determine at which level a
calibration was done, the user should look at the leftmost filled column. This is
best explained by referring to figure 2.2: at time t1, field calibration is performed on all
nodes in the "LuShan" network; at time t3, field calibration is performed on the specific node
"E" in station "001_COTS", which is in location "001" of the "LuShan" network.
5. Conclusion and further developments
In conclusion, we have achieved our goal of creating a new DMS for efficient
metadata management. We have learned that traditional DMSs have their pros and cons, but
for the purposes of our project, the demand for a scalable, robust DMS exceeds what
existing ones provide, owing to the messy and transient nature of the metadata generated.
In the near future, we see two frontiers for further development of this database project:
a ranking system for results, and over-the-air (OTA) calibration metadata transfer.
It is foreseeable that, given the scale of this project, as the number of nodes deployed
increases over time it will become more challenging for users of the query() function to
mentally keep track of the names and corresponding "active time periods" of each node. Thus,
we plan to develop a ranking system that statistically records the frequency of the different
startTime, endTime, containConds, and notNullConds values called by each user. This
allows the system to predict the user's most likely set of input parameters and offer
recommendations while typing. The ranking system needs to be user-specific, as different
users tend to focus on different columns of the logbook for analysis.
As for OTA calibration metadata transfer, we intend to build a mobile app to aid
contractors in node deployment. In essence, contractors will input the calibration metadata
of the associated node on-site during installation; this data will then be sent from the app
to the server for pre-processing before being appended to the MEL.
Appendix
Figure 1.1: Standard for the Exchange of Earthquake Data (SEED)
Figure 1.2: SEED Blockette Convention
Figure 1.3: Center for Seismic Studies Database Management System (CSSDBMS)
Structured Relationship Model
Figure 1.4: Traditional Hierarchical Structure
Time  Node Name  SD Card  3G Modem  Sim Code          APP   RPi
t1    "U"        "M7"
t2    "U"                           "61E13CU957744"
t3    "U"                 "C6"
t4    "U"        "M9"
t5    "U"                                             "2P"  "A+"
t6    "U"        "JA18"             "61Y15BU133338"         "4K"
Figure 2.1: Example of Metadata Event Log
Time  Network   Location  Station     Node  Lab Calibration  Field Calibration  Orientation  Field Maintenance
t1    "Lushan"                                               1
t2    "LuShan"  "001"                                                                        1
t3    "LuShan"  "001"     "001_COTS"  "E"                    1
t4    "LuShan"  "001"                                                                        0
t5    "LuShan"  "001"                       --               --                 --           --
t6    "LuShan"  "001"     "001_COTS"  "E"   --               --                 --           --
Figure 2.2: Recording maintenance activities - A binary method
[Figure: example flow chart (with keys) of the implementation of inferencing, showing the hierarchy levels Network "Lushan" → Location "001" → Station "001_COTS" → Node "E" → Sensor (Gyrometer)]
Table 1: Raw data from MEL containing entire history of record to be queried
time node simCode fieldCalibration
"2011" "C" s1 f1
"2012" "E" s2 f2
"2013" "A" s3 f3
"2014" "A" f4
"2015" "B" f5
"2016" "C" f6
"2017" "A" s4 f7
"2018" "D" f8
Table 2: rawDF
time node simCode fieldCalibration
"2015" "B" f6
"2016" "C" f7
"2017" "A" s4 f8
Table 3: tempDF
time node simCode fieldCalibration
"2011" "C" s1 f1
"2013" "A" s3 f3
"2014" "A" f4
Table 4a: tempDF, split according to node
time node simCode fieldCalibration
"2011" "C" s1 f1
Table 4b: tempDF, split according to node
time node simCode fieldCalibration
"2013" "A" s3 f3
"2014" "A" f4
Table 5a: Inferencing 1
time node simCode fieldCalibration
"2014.5" "C" s1 f1
Table 5b: Inferencing 2
time node simCode fieldCalibration
"2014.5" "A" s3 f4
Table 6: Final Result
time node simCode fieldCalibration
"2014.5" "C" s1 f1
"2014.5" "A" s3 f4
"2015" "B" f5
"2016" "C" f6
"2017" "A" s4 f7
References:
1. Ooi, G. L., Tan, P. S., Leung, M. L., Lui, H. L., Yeo, Y. S., Wu, J., Lin, J.-T., Wang, Y.-H.,
Wang, K.-L., Lin, M.-L., & Zhang, Q. (2016). The DESIGnSLM Architecture: A Data-Enabled
Scalable Instrumentation for Geotechnical Engineering, Seismic and Landslide Monitoring.
2. GSN: Global Seismographic Network. (n.d.). Retrieved August 15, 2016, from
http://earthquake.usgs.gov/monitoring/gsn/
3. FDSN: International Federation of Digital Seismograph Networks, Incorporated Research
Institutions for Seismology (2014). Standard for the Exchange of Earthquake Data
(SEED) Format Version 2.12.
4. Anderson, J., Farrell, W. E., Garcia, K., Given, J., & Swanger, H. (1990). Center for Seismic
Studies Version 3 Database: Schema Reference Manual.
5. Anderson, J., & Swanger, H. (1990). Center for Seismic Studies Version 3 Database:
SQL Tutorial.

TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES...
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
Sdd 4
Sdd 4Sdd 4
Sdd 4
 
Paper Final Taube Bienert GridInterop 2012
Paper Final Taube Bienert GridInterop 2012Paper Final Taube Bienert GridInterop 2012
Paper Final Taube Bienert GridInterop 2012
 
Data performance characterization of frequent pattern mining algorithms
Data performance characterization of frequent pattern mining algorithmsData performance characterization of frequent pattern mining algorithms
Data performance characterization of frequent pattern mining algorithms
 
Data repository for sensor network a data mining approach
Data repository for sensor network  a data mining approachData repository for sensor network  a data mining approach
Data repository for sensor network a data mining approach
 
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCEAPPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
 
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCEAPPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
 
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCEAPPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
 
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCEAPPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
 
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCEAPPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
 
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCEAPPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
 
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
 
Target Response Electrical usage Profile Clustering using Big Data
Target Response Electrical usage Profile Clustering using Big DataTarget Response Electrical usage Profile Clustering using Big Data
Target Response Electrical usage Profile Clustering using Big Data
 
Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design
 
Development of a Suitable Load Balancing Strategy In Case Of a Cloud Computi...
Development of a Suitable Load Balancing Strategy In Case Of a  Cloud Computi...Development of a Suitable Load Balancing Strategy In Case Of a  Cloud Computi...
Development of a Suitable Load Balancing Strategy In Case Of a Cloud Computi...
 
Data characterization towards modeling frequent pattern mining algorithms
Data characterization towards modeling frequent pattern mining algorithmsData characterization towards modeling frequent pattern mining algorithms
Data characterization towards modeling frequent pattern mining algorithms
 
Iwsm2014 performance measurement for cloud computing applications using iso...
Iwsm2014   performance measurement for cloud computing applications using iso...Iwsm2014   performance measurement for cloud computing applications using iso...
Iwsm2014 performance measurement for cloud computing applications using iso...
 

Project Report (Summer 2016)

The Hong Kong University of Science and Technology
Department of Civil Engineering
UROP 1000: Summer 2016
Advisor: Prof WANG, Yu-Hsing

Metadata Event Log: A Data Management System to Store Inconsistent Data Entries

Topic 1: A Data-Driven Approach for Real-Time Landslide Monitoring and Early Warning System
LAI, Yong Xin (20306541)

Topic 2: Big Data Landslide Early Warning System with Apache Spark and Scala
CHANG, Bing An Andrew (20307648)
NG, Zhi Yong Ignavier (20311194)
THAM, Brendan Guang Yao (20307935)
WONG, Wen Yan (20318893)

Abstract: In recent years, big data analytics has become the approach with which researchers and companies conduct experiments and analysis. This is particularly true as the field of seismic monitoring takes its first steps in landslide monitoring through large-scale deployment of sensors (Ooi et al. 2016). Because this project deals with inconsistent, unstructured metadata, a new data management system oriented to its needs is required. This report provides an overview of popular data management conventions and why they do not suit our needs. Subsequently, we introduce a new data management format, called the Metadata Event Log (MEL), complemented by functions written in Apache Spark and Scala to access and return the correct set of data. Finally, we discuss lessons learned along the way and potential future developments.
1. Introduction

Traditionally, seismic monitoring relies on a permanent network of state-of-the-art seismological and geophysical sensors connected by a telecommunications network for monitoring and research purposes (GSN, 2016). However, this setup has its limitations. Firstly, such networks are expensive to install, as they require boreholes drilled deep into the earth's crust for accurate readings. Permanent installation also prevents researchers from focusing on one particular region, as the setup typically studies geography on a large scale. Secondly, the hardware and sensors are inaccessible, which limits hardware upgrades, software updates, recalibration, and replacement of faulty components.

In contrast, our project takes a different approach to seismic field monitoring by installing sensors on the surface of the earth's crust, paired with large-scale deployment of consumer off-the-shelf electronics for field monitoring (Ooi et al. 2016). With increased accessibility, researchers are able to frequently perform software updates, recalibration, and replacement of faulty hardware. However, frequent maintenance and upgrade activities create the problem of tracking the changes made to any particular node (changes/logs are recorded as metadata), and the sheer scale of nodes deployed in the field further amplifies this problem. The dynamic nature of updates requires a Data Management System (DMS) flexible enough to record these changes as metadata.

2. Comparison of existing DMSs and why they do not fulfil our needs

Over the years, some popular data management formats have been designed to accommodate data exchange: the Standard for the Exchange of Earthquake Data
(SEED) and the Centre of Seismic Studies Database Management System (CSSDBMS). However, these databases are not built to accommodate dynamic metadata management.

In terms of SEED (FDSN, 2014), the data model utilizes sequences of format objects. Its main flaw is that the model assumes that data transmitted back to the server already forms a lineage (figure 1.1) and is readily usable for analysis, which is not the case for data fetched back from the field. Besides that, the blockette system (figure 1.2) assumes a hierarchical relationship between the data collected. A hierarchical data model can only be applied if we are certain which dependent variables could be modified by a change in any one independent variable. This is not the case for the metadata collected in our project, because a change in one metric may or may not affect other metrics, and the affected metrics may number more than one. Therefore, the transient and inconsistent nature of our node system makes the SEED format unviable for our purposes.

In terms of CSSDBMS (Anderson et al. 1990), the database has a highly complicated and rigid relationship between data (figure 1.3). This paradigm of data management utilizes a relational database management system (RDBMS) because of its consistent, hierarchical data model. An RDBMS offers the advantage of optimized access and query, as all superclasses and subclasses are well defined. In our project, however, the data collected is only partially hierarchical (figure 1.4), and the pre-determined nature of an RDBMS is not efficient for storing the transient, unstructured, and inconsistent relational data found in our project. If we insisted on implementing an RDBMS, we would need to update the entire database every time we append new metadata. More importantly, the complexity of
the redesigned database increases with the variety of metadata collected, as the number of layers between superclasses and subclasses increases.

3. A new Data Management System: Metadata Event Log (MEL)

3.1 What MEL is and what it records

The MEL is a collection of immutable time series where each row (or entry) represents an event that occurred at some point in the node's lifetime (figure 2.1). Events can be categorized into two types: a) recalibration activities, and b) hardware maintenance and replacement activities. Each event must be associated with a particular node, and information about each node (metadata) is recorded under the respective fields of the column header. In addition, each entry stores metadata about the raw data collected by each node; this special metadata is called metricName.

3.2 Convention to record data: Identity and events

Firstly, the user should record the non-nullable fields: the starting time of the event as an ISO 8601 timestamp, e.g. "2016-08-10T15:22:57+08:00", and at least one of "network", "station", "location", and "node". Upon completion of this record, any further data recorded is classified as event data.

3.2.1 Recording hardware replacement activities

If no such activity occurs, all hardware metadata entries are left null. Otherwise, the user records the new hardware name as the entry (figure 2.1).

3.2.2 Recording maintenance activities: A binary method
The user should input the time at which maintenance occurs and enter a "1" under the "maintanenceActivity" column. When the maintenance activity is completed, the user should enter a "0" under the same column (figure 2.2).

3.3 Benefits of MEL

In short, the MEL format is better suited to our project (Ooi et al. 2016) than traditional DMSs because of its robustness in storing new metadata types and events as they occur (figure 2.1). Fundamentally, the MEL is designed to remove limitations in querying: for example, a data analyst may search for all SD_card changes within a node or, conversely, search for all nodes a particular SD_card has been in. This flexibility is enabled by the flat data organization of the MEL's tabular form; that is, the hierarchical characteristics of the metadata are not reflected in this convention of recording events. In contrast, such querying flexibility cannot be achieved with the RDBMS, SEED, or CSSDBMS data management systems because of the pre-determined nature of their data storage.

4. Accessing MEL

4.1 Access functions and inference feature: query() and filterDF() functions

On its own, the MEL is of little use; we need functions to extract the correct set of data within the lifetime of a node. Therefore, we have written two functions: filterDF() and query(). filterDF() is a fetch-and-return function used as an intermediary; it returns a dataframe containing events associated with user-specified metadata, subject to availability. The purpose of the filter is to return only entries within the specified time period. query() is a function that calls filterDF() multiple times to return inferred results as a dataframe.

4.2 Purpose of inference feature
Inference is a feature that searches for time mismatches or empty entries and returns the "nearest-previous" entry. Each event is stored as an entry, and it is assumed that no changes occur until the next entry. It is possible that a data analyst queries for an event that occurred before the start time of the specified time period, in which case the query returns insufficient data for analysis. Thus, the query() function uses inference to return the most recent previous entry for the specified event. To illustrate, consider the following example: suppose a user queries for field calibration metadata from 2013-01-01T00:00:00+08:00 to 2015-01-01T00:00:00+08:00 for a particular node, and field calibration events occurred during 2012, 2014, and 2016. In this case, the returned dataframe is a single entry of field calibration metadata from 2014. Querying without inference returns ambiguous results because the analyst does not know what calibration values to use from the start time (2013) to the returned calibration record (2014). Hence, the missing calibration values need to be inferred from the "nearest-previous" entry. Inference is only required to return the "nearest-previous" entry, not the "nearest-future" one, because we assume that no changes occur until the next entry in the log; therefore, all relevant metadata from the first entry onwards is complete within the queried time period.

4.3 How inferencing is implemented

This section is best explained with the following example (user input conventions are explained in section 4.4). Suppose a user queries for simCode and fieldCalibration in Table 1 using the conditions (denoted inputCondition1): isNotNullConds = Seq("fieldCalibration"), containConds = Seq(Seq("node", "A", "B", "C", "E")), startTime = "2014.5", endTime = "2019". Note that the dataframes before and after startTime are defined as tempDF and rawDF respectively.
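For concreteness, the two condition sets used in this walkthrough can be written out as plain Scala values. The argument names (isNotNullConds, containConds, startTime, endTime) follow the report's convention, and the simplified year strings are illustrative only:

```scala
// inputCondition1: the user's original query (section 4.3).
val isNotNullConds1 = Seq("fieldCalibration")
val containConds1   = Seq(Seq("node", "A", "B", "C", "E"))
val startTime1      = "2014.5"
val endTime1        = "2019"

// inputCondition2: the internal re-query that builds tempDF.
// Its startTime is the absolute start of the MEL, and its endTime
// is the startTime the user supplied in inputCondition1.
val listOfDistinctNodes = Seq("node", "A", "B", "C") // nodes found in rawDF
val isNotNullConds2 = Seq("fieldCalibration")
val containConds2   = Seq(listOfDistinctNodes)
val startTime2      = "2001"
val endTime2        = startTime1
```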
Firstly, filterDF is called with inputCondition1 to return rawDF (shown in Table 2). All the remaining nodes, ListOfDistinctNodes = Seq("node", "A", "B", "C"), can be found from rawDF. To obtain tempDF, filterDF is called again with different conditions (denoted inputCondition2): isNotNullConds = Seq("fieldCalibration"), containConds = Seq(ListOfDistinctNodes), startTime = "2001", endTime = "2014.5". Note that here startTime is the absolute starting time of the MEL, while endTime is the startTime supplied by the user in inputCondition1. The resulting tempDF is shown in Table 3.

tempDF is then split by node, resulting in Tables 4a and 4b, which are stored in a sequence named SequenceOfDataFrames[]. To make the dataframes easier to access, SequenceOfDataFrames[] is converted to a 3D array format. From each 3D array, one inferred row is produced, resulting in Tables 5a and 5b. Note that from Table 4a to Table 5a, the time is changed to the user's startTime (inputCondition1) because all required conditions are non-null. From Table 4b to Table 5b, the value of simCode is inferred vertically upward to find the most recent non-null value, which is s3, while the value of fieldCalibration is used directly. Lastly, rawDF (Table 2) and the inferred rows (Tables 5a and 5b) are combined to return the desired result (Table 6) to the user.

4.4 User documentation: How to use query()

The query function takes multiple input arguments that the user needs to provide: containConds, isNotNullConds, startTime, and endTime. The end goal of the function is to return a DataFrame containing all entries of the specified metadata (listed under isNotNullConds), all of which exist for the correct node, station, or network in
the specified list, with inference performed on the containConds arguments. The user first inputs the desired start and end times under startTime and endTime in ISO 8601 format, e.g. 2015-01-23T15:22:30+08:00. Next, the desired metadata, such as fieldCalibration or SD_Card, is input as a 1D sequence in the isNotNullConds argument, while the specific nodes, stations, or networks are input as a 2D sequence in the containConds argument.

This is best illustrated with the following example. Suppose a data analyst wants to query a set of fieldCalibration and SD_Card data covering all nodes in the "LuShan" network, plus the specific individual nodes "1792" and "E" (all within the same specified timeframe). The identifiers of the specific node(s), station(s), or network(s) belong under the containConds argument, with input syntax of the form Seq(Seq("Network", "Lushan"), Seq("Node", "1792", "E")). The convention is that the type (i.e. node, station, or network) is the first element of each inner sequence, and the actual names ("Lushan", "1792", "E") are the subsequent elements for each corresponding type. It is important to note that each type should be input as a separate inner sequence. Next, the corresponding set of metadata is input as the isNotNullConds argument, with syntax as follows: Seq("fieldCalibration", "SD_Card"). Should the analyst leave isNotNullConds blank, the function will return all metadata that exists for the corresponding nodes, stations, or networks.

4.5 User documentation: How the user should interpret results

The returned dataframe is shown in Table 6. One special type of metadata that the user needs to infer manually is the calibration data. Calibration can be performed on different levels of the project, i.e. network, station, location, node, etc. To determine on which level the
calibration is done, the user should refer to the leftmost filled identity column. This is best explained by referring to figure 2.2: at time t1, field calibration is performed on all the nodes in the "LuShan" network; at time t3, field calibration is performed on the specific node "E" in station "001_COTS", which is in location "001" of the "LuShan" network.

5. Conclusion and further developments

In conclusion, we have achieved our goal of creating a new DMS for efficient metadata management. We have learned that traditional DMSs have their pros and cons, but for the purposes of our project, the demand for a scalable, robust DMS exceeds what existing systems provide, owing to the messiness and transient nature of the metadata generated.

In the near future, there are two frontiers of further development for this database project: a ranking system for results, and over-the-air (OTA) calibration metadata transfer. It is foreseeable that, given the scale of this project, as the number of nodes deployed increases over time, it will become more challenging for query() users to mentally keep track of the names and corresponding "active time periods" of each node. Thus, we plan to develop a ranking system that statistically records the frequency of the different startTime, endTime, containConds, and isNotNullConds values called by users. This allows the system to predict a user's most likely set of input parameters and offer recommendations while the user types. Moreover, the ranking system needs to be user-specific, as different users tend to focus on different columns of the logbook for analysis.
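As a rough sketch of this planned ranking system (everything below, including the Query record and its fields, is hypothetical and not part of the current implementation), per-user frequency counts could look like:

```scala
// Hypothetical record of one query() invocation by one user.
case class Query(user: String, containConds: String, isNotNullConds: String)

// Rank a user's past isNotNullConds values by frequency, most frequent
// first, so the system can suggest them while the user types.
def rankParams(history: Seq[Query], user: String): Seq[(String, Int)] =
  history.filter(_.user == user)
    .groupBy(_.isNotNullConds)
    .map { case (conds, qs) => (conds, qs.size) }
    .toSeq
    .sortBy(-_._2)

val history = Seq(
  Query("alice", "node A", "fieldCalibration"),
  Query("alice", "node B", "fieldCalibration"),
  Query("alice", "node A", "SD_Card"),
  Query("bob",   "node C", "orientation"))

// alice's most frequently requested metadata comes first.
val suggestions = rankParams(history, "alice")
```

Keeping the history keyed by user is what makes the recommendations user-specific, as required above.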
In terms of OTA calibration metadata transfer, we intend to build a mobile app to aid contractors in node deployment. In essence, contractors will input the calibration metadata of the associated node on-site during installation. This data will then be sent from the app to the server for pre-processing before being appended to the MEL.
Appendix

Figure 1.1: Standard for the Exchange of Earthquake Data (SEED)

Figure 1.2: SEED Blockette Convention

Figure 1.3: Centre of Seismic Studies Database Management System (CSSDBMS) Structured Relationship Model
Figure 1.4: Traditional Hierarchical Structure

Figure 2.1: Example of Metadata Event Log

Time | Node Name | SD Card | 3Gmodem | Sim Code        | APP  | RPi
t1   | "U"       | "M7"    |         |                 |      |
t2   | "U"       |         |         | "61E13CU957744" |      |
t3   | "U"       | "C6"    |         |                 |      |
t4   | "U"       |         | "M9"    |                 |      |
t5   | "U"       |         |         |                 | "2P" | "A+"
t6   | "U"       | "JA18"  |         | "61Y15BU133338" |      | "4K"

Figure 2.2: Recording maintenance activities - A binary method

Time | Network  | Location | Station    | Node | Lab Calibration | Field Calibration | Orientation | Field Maintenance
t1   | "Lushan" |          |            |      |                 | 1                 |             |
t2   | "LuShan" | "001"    |            |      |                 | 1                 |             |
t3   | "LuShan" | "001"    | "001_COTS" | "E"  |                 | 1                 |             |
t4   | "LuShan" | "001"    |            |      |                 |                   |             | 0
t5   | "LuShan" | "001"    |            |      | --              | --                | --          | --
t6   | "LuShan" | "001"    | "001_COTS" | "E"  | --              | --                | --          | --

[Flow chart omitted: example hierarchy Network "Lushan" → Location "001" → Station "001_COTS" → Node "E" → Sensors (Gyrometer, Level), with flow chart keys]
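Since most metadata fields are empty for any given event, the rows of Figure 2.1 map naturally onto optional fields. The following sketch uses a schema inferred from the figure (not the project's actual one) and models the log as an immutable sequence:

```scala
// One MEL entry modelled after Figure 2.1; absent metadata stays None.
case class MelEntry(time: String, nodeName: String,
                    sdCard:  Option[String] = None,
                    modem3G: Option[String] = None,
                    simCode: Option[String] = None,
                    app:     Option[String] = None,
                    rpi:     Option[String] = None)

val log = Seq(
  MelEntry("t1", "U", sdCard  = Some("M7")),
  MelEntry("t2", "U", simCode = Some("61E13CU957744")),
  MelEntry("t6", "U", sdCard  = Some("JA18"),
           simCode = Some("61Y15BU133338"), rpi = Some("4K")))

// Appending an event yields a new log; earlier entries are never mutated.
val updated = log :+ MelEntry("t7", "U", modem3G = Some("M9"))
```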
Implementation of Inferencing

Table 1: Raw data from MEL containing the entire history of records to be queried

time     | node | simCode | fieldCalibration
"2011"   | "C"  | s1      | f1
"2012"   | "E"  | s2      | f2
"2013"   | "A"  | s3      | f3
"2014"   | "A"  |         | f4
"2015"   | "B"  |         | f5
"2016"   | "C"  |         | f6
"2017"   | "A"  | s4      | f7
"2018"   | "D"  |         | f8

Table 2: rawDF

time     | node | simCode | fieldCalibration
"2015"   | "B"  |         | f5
"2016"   | "C"  |         | f6
"2017"   | "A"  | s4      | f7

Table 3: tempDF

time     | node | simCode | fieldCalibration
"2011"   | "C"  | s1      | f1
"2013"   | "A"  | s3      | f3
"2014"   | "A"  |         | f4

Table 4a: tempDF, split according to node

time     | node | simCode | fieldCalibration
"2011"   | "C"  | s1      | f1

Table 4b: tempDF, split according to node

time     | node | simCode | fieldCalibration
"2013"   | "A"  | s3      | f3
"2014"   | "A"  |         | f4
Table 5a: Inferencing 1

time     | node | simCode | fieldCalibration
"2014.5" | "C"  | s1      | f1

Table 5b: Inferencing 2

time     | node | simCode | fieldCalibration
"2014.5" | "A"  | s3      | f4

Table 6: Final Result

time     | node | simCode | fieldCalibration
"2014.5" | "C"  | s1      | f1
"2014.5" | "A"  | s3      | f4
"2015"   | "B"  |         | f5
"2016"   | "C"  |         | f6
"2017"   | "A"  | s4      | f7
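To make the inference walkthrough of section 4.3 concrete, the logic behind Tables 1-6 can be reproduced over plain Scala collections. The project's real implementation operates on Spark DataFrames; the Row class, numeric times, and field names below are simplifications for illustration:

```scala
// Simplified MEL row: plain doubles stand in for timestamps.
case class Row(time: Double, node: String,
               simCode: Option[String], fieldCalibration: Option[String])

// Table 1: the full history to be queried.
val mel = Seq(
  Row(2011, "C", Some("s1"), Some("f1")),
  Row(2012, "E", Some("s2"), Some("f2")),
  Row(2013, "A", Some("s3"), Some("f3")),
  Row(2014, "A", None,       Some("f4")),
  Row(2015, "B", None,       Some("f5")),
  Row(2016, "C", None,       Some("f6")),
  Row(2017, "A", Some("s4"), Some("f7")),
  Row(2018, "D", None,       Some("f8")))

val queriedNodes    = Set("A", "B", "C", "E")
val (start, finish) = (2014.5, 2019.0)

// rawDF (Table 2): in-window events with non-null fieldCalibration.
val rawDF = mel.filter(r =>
  r.time >= start && r.time <= finish &&
  queriedNodes(r.node) && r.fieldCalibration.isDefined)

// tempDF (Table 3): earlier events for the nodes present in rawDF.
val distinctNodes = rawDF.map(_.node).distinct
val tempDF = mel.filter(r =>
  r.time < start && distinctNodes.contains(r.node) &&
  r.fieldCalibration.isDefined)

// One inferred "nearest-previous" row per node (Tables 5a and 5b):
// fieldCalibration comes from the latest entry; simCode is searched
// vertically upward for the most recent non-null value.
val inferred = tempDF.groupBy(_.node).map { case (node, rows) =>
  val sorted = rows.sortBy(_.time)
  Row(start, node,
      sorted.reverse.flatMap(_.simCode).headOption,
      sorted.last.fieldCalibration)
}.toSeq

// Table 6: inferred rows combined with the in-window rows.
val result = inferred ++ rawDF
```

Note that node "E" produces no inferred row, since it never appears in rawDF, and node "B" produces none because it has no entry before the queried start time.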
References:

1. Ooi, G. L., Tan, P. S., Leung, M. L., Lui, H. L., Yeo, Y. S., Wu, J., Lin, J.-T., Wang, Y.-H., Wang, K.-L., Lin, M.-L., & Zhang, Q. (2016). The DESIGnSLM Architecture: A Data-Enabled Scalable Instrumentation for Geotechnical Engineering, Seismic and Landslide Monitoring.
2. GSN: Global Seismographic Network. (n.d.). Retrieved August 15, 2016, from http://earthquake.usgs.gov/monitoring/gsn/
3. FDSN: International Federation of Digital Seismograph Networks, Incorporated Institutions for Seismology (2014). Standard for the Exchange of Earthquake Data (SEED) Format Version 2.12.
4. Anderson, J., Farrell, W. E., Garcia, K., Given, J., & Swanger, H. (1990). Center for Seismic Studies Version 3 Database: Schema Reference Manual.
5. Anderson, J., & Swanger, H. (1990). Center for Seismic Studies Version 3 Database: SQL Tutorial.