2. High-Level Design
[Architecture diagram: the S&P GCC Editorial Platform feeds XML data into the XML Extraction Engine (PL/SQL), which routes articles into XML Repository 1 and XML Repository 2.]
When an article is published from the editorial platform, its XML is parsed on the fly to extract the key
components of the article, which are then stored in the repository. The application also uses the
repository content to build new collections of articles and to update existing ones.
3. Detailed Design
Key components of the design
• A hybrid data model consisting of Relational and Binary XML columns
• Composite-partitioned table:
• Range partitioned based on Article Publish Date
• List sub-partitioned based on Article Type
• Dynamic query generation for real-time determination of target repositories
In the next few slides, we will dive deeper into the details of each of the above components…
4. Detailed Design cont’d…
Hybrid Data Model
This model combines the usual data-type columns, such as VARCHAR2, DATE and TIMESTAMP WITH
TIME ZONE, with XMLType columns backed by SecureFile binary XML storage, for efficient use of
storage and memory.
Because the frequently queried attributes are promoted into relational columns, data retrieval is much
faster, which opens up considerable scope for near real-time analytics across the source XML.
For instance, in our implementation we created the table with key attributes such as Article-ID,
Article-Type, Publish-Date, Publish-Date with Time Zone and Article-Headline. This enables the front-end
application to run ordering-intensive queries across multiple ranges of dates/timestamps. In our earlier
implementation, the main drawback was that such queries had to run against the entire XML, which
generated heavy I/O and unnecessary redo/undo logging.
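As a sketch, a hybrid table of this kind might be declared as follows. The column names and sizes here are illustrative assumptions, not the exact production DDL:

```sql
-- Illustrative hybrid table: relational key attributes alongside an
-- XMLType column stored as SecureFile binary XML.
-- Names and sizes are assumptions, not the actual production DDL.
CREATE TABLE remus.gcc_ums_xml_tab (
  article_id        VARCHAR2(50)   NOT NULL,
  article_type      VARCHAR2(30)   NOT NULL,
  publish_date      DATE           NOT NULL,
  publish_dt_tz     TIMESTAMP WITH TIME ZONE,
  article_headline  VARCHAR2(500),
  article_xml       XMLTYPE
)
XMLTYPE COLUMN article_xml STORE AS SECUREFILE BINARY XML;
```

The SecureFile binary XML storage keeps the full document available for XQuery/XMLTABLE access, while the promoted relational columns serve the ordering and range queries directly.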
5. Detailed Design cont’d…
Composite Partitioning
This feature keeps the design scalable as volumes grow. Our implementation uses a combination of
Range and List partitioning, as follows:
Range partitioning (with row movement) – partitions are created with 'Article Publish-Date' as the
partition key, and each partition is designed to hold one quarter (three months) of data. If the partition
key is later updated to a date that falls into another partition, the row-movement feature allows the row
to migrate to the correct partition.
List partitioning – sub-partitions are created on the basis of 'Article-Type', so every single partition
consists of multiple sub-partitions.
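A simplified sketch of such a range-list composite scheme (quarter boundaries and sub-partition values are illustrative, and only a few columns are shown for brevity):

```sql
-- Illustrative range-list composite partitioning: range by publish
-- quarter, list sub-partitions by article type, row movement enabled.
-- Partition boundaries and article-type values are assumptions.
CREATE TABLE remus.gcc_ums_xml_tab_demo (
  article_id    VARCHAR2(50),
  article_type  VARCHAR2(30),
  publish_date  DATE
)
PARTITION BY RANGE (publish_date)
SUBPARTITION BY LIST (article_type)
SUBPARTITION TEMPLATE (
  SUBPARTITION sp_useipc VALUES ('USEIPC'),
  SUBPARTITION sp_other  VALUES (DEFAULT)
)
( PARTITION part_2010_q1 VALUES LESS THAN (TO_DATE('2010-04-01', 'YYYY-MM-DD')),
  PARTITION part_2010_q2 VALUES LESS THAN (TO_DATE('2010-07-01', 'YYYY-MM-DD'))
)
ENABLE ROW MOVEMENT;
```

The SUBPARTITION TEMPLATE clause stamps the same list sub-partitions onto every range partition, which keeps the quarterly maintenance DDL short.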
6. Detailed Design cont’d…
Dynamic Query Generation
The primary reason for using dynamic SQL was to keep the application scalable. Our implementation
includes a simple mapping table, aptly named ARTICLE_LEGEND, whose purpose is to map each article
type to its target repository.
Since the target repository is unknown until run time, dynamic SQL provides the functionality needed
to perform the binding in real time.
Some code snippets from our implementation will allow us to understand this better:
• Selecting the target repository based on article-type
SELECT xml_table INTO vc_xmltabname
FROM remus.article_legend
WHERE article_type = vc_article_type;
• Inserting into the repository with bound variables at run-time
l_sqlstr := 'INSERT INTO remus.'||vc_xmltabname||' VALUES (:1,:2,:3,:4,:5,:6)';
EXECUTE IMMEDIATE l_sqlstr
  USING vc_article_id, vc_article_type, l_pub_date,
        l_pub_dt_time, l_article_headline, xmlFile_tgt;
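Putting the two snippets together, the routing logic might look roughly like this inside one PL/SQL block. The declarations and the NO_DATA_FOUND handler are illustrative additions; in practice the article variables would be populated by the caller:

```sql
-- Sketch combining the lookup and the dynamic insert.
-- Declarations and exception handling are illustrative assumptions.
DECLARE
  vc_article_id       VARCHAR2(50);
  vc_article_type     VARCHAR2(30);
  l_pub_date          DATE;
  l_pub_dt_time       TIMESTAMP WITH TIME ZONE;
  l_article_headline  VARCHAR2(500);
  xmlFile_tgt         XMLTYPE;
  vc_xmltabname       VARCHAR2(30);
  l_sqlstr            VARCHAR2(4000);
BEGIN
  -- 1. Resolve the target repository for this article type
  SELECT xml_table
  INTO   vc_xmltabname
  FROM   remus.article_legend
  WHERE  article_type = vc_article_type;

  -- 2. Build and execute the insert with run-time bind variables
  l_sqlstr := 'INSERT INTO remus.' || vc_xmltabname
           || ' VALUES (:1, :2, :3, :4, :5, :6)';
  EXECUTE IMMEDIATE l_sqlstr
    USING vc_article_id, vc_article_type, l_pub_date,
          l_pub_dt_time, l_article_headline, xmlFile_tgt;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    RAISE_APPLICATION_ERROR(-20001,
      'No repository mapped for article type ' || vc_article_type);
END;
/
```

Only the table name is concatenated into the statement; the data values all go through bind variables, which avoids hard parses and SQL injection via the article content.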
7. Data Migration in Production
Challenges:
It is worth mentioning the issues encountered while migrating large volumes of existing production XML
content into the newly designed tables:
• Transferring a mere 2,000 articles using a regular 'INSERT INTO…SELECT' took about 30 minutes and
consumed large amounts of undo segment space. Clearly, for nearly 50K articles this method was
impractical.
• Another concern was extracting the various nodes of the XML to fit the data into the new data model.
One alternative was a PL/SQL cursor with the usual XML EXTRACT functions, which would have cost a
lot of development time. Instead, we tried a more elegant approach.
A breakthrough…
We used the XMLTABLE function to extract the various nodes of the XML on the fly and insert them into
the new tables via a PL/SQL cursor.
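A hedged sketch of the XMLTABLE approach is shown below as a single set-based INSERT…SELECT; our actual migration drove a similar query from a PL/SQL cursor to commit in batches. The '/article' XPath expressions and the source table/column names are assumptions, since the real article schema will differ:

```sql
-- Sketch: shred each source XML into relational columns via XMLTABLE
-- and load the new hybrid table. Paths and source names are
-- illustrative assumptions, not the production schema.
INSERT INTO remus.gcc_ums_xml_tab
  (article_id, article_type, publish_date, publish_dt_tz,
   article_headline, article_xml)
SELECT x.article_id,
       x.article_type,
       x.publish_date,
       x.publish_dt_tz,
       x.headline,
       s.xml_doc
FROM   remus.old_xml_store s,
       XMLTABLE('/article' PASSING s.xml_doc
         COLUMNS
           article_id    VARCHAR2(50)  PATH '@id',
           article_type  VARCHAR2(30)  PATH '@type',
           publish_date  DATE          PATH 'publishDate',
           publish_dt_tz TIMESTAMP WITH TIME ZONE PATH 'publishDateTime',
           headline      VARCHAR2(500) PATH 'headline') x;
```

Because XMLTABLE projects the nodes as ordinary columns, the extraction and the load happen in one statement instead of a row-by-row EXTRACT loop.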
8. Maintenance
The only overhead of the design is that it requires frequent DDL interaction with the database objects:
new partitions must be added as time passes, and new sub-partitions must be added as the application
scales to new article types.
Example:
Adding a partition:
alter table GCC_UMS_XML_TAB add partition part_2012_q1 values less than (TO_DATE('2012-04-01', 'YYYY-MM-DD'));
Adding a sub-partition:
alter table GCC_UMS_XML_TAB
modify partition part_2010_q1 add subpartition part_2010_q1_sp_USEIPC values ('USEIPC');
9. Benefits and Possibilities
The content, and especially the design, lends itself readily to Business Intelligence and Data Mining
activities. The possibilities are broad:
Business Intelligence (BI):
On any given day, S&P analysts and reporters may publish various articles on a specific company. All of
this can be drilled across the article spectrum in ERL to generate reports for internal use, such as trends
showing which industry vertical received more focus on a given day. This gives the Editorial team more
visibility and can aid their productivity.
Bottom line: a report may be a quicker and more effective way of delivering key information to end users.
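Because the key attributes now live in relational columns, such a trend report reduces to a plain aggregation query. For example (a hypothetical query for illustration, not taken from the implementation):

```sql
-- Hypothetical BI query: articles published per type per day,
-- answerable without parsing any XML at query time.
SELECT article_type,
       TRUNC(publish_date) AS publish_day,
       COUNT(*)            AS articles_published
FROM   remus.gcc_ums_xml_tab
WHERE  publish_date >= TO_DATE('2011-01-01', 'YYYY-MM-DD')
GROUP  BY article_type, TRUNC(publish_date)
ORDER  BY publish_day, article_type;
```

The WHERE clause on publish_date also benefits from partition pruning, since it matches the range partition key.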
Data Mining:
• The data from various articles can be mined to derive company-specific research in real time, which can
empower potential investors with key market information at critical times during the day (e.g. while
carrying out trades on the company). An example could be the STARS change on a company.
10. Comparison of various designs
Feature
REMUS V1
Binary XML storage with
functional indexes
REMUS V2
Binary XML storage with XML-
based indexes
REMUS V3
Hybrid data model with
composite partitioning
XML parsing
Involves framing queries with
various XML functions to extract
data on-the-fly and hence prone
to error.
Involves framing queries with
various XML functions to
extract data on-the-fly and
hence prone to error.
Simple relational queries
since there is no need for
parsing XML on-the-fly
Performance
Degrades considerably with
volume (since the XML functions
act on each XML in the result-
set)
Degrades considerably with
volume (since the XML
functions act on each XML in
the result-set)
Rapid responses (since the
necessary data attributes
are present in relational
columns)
Indexing
Needs additional indices if the
application queries hit different
parts of the XML other than the
ones covered in the existing
indices
Indexed on an X-Path and
hence scalable for application
query changes
Regular table indexes
Maintenance
Easily maintained Easily maintained Needs frequent interactions
at the DDL level (since new
partitions and sub-partitions
need to be added regularly)