REMUS V3.0 – Design Overview
- Prashasth Patil
High-Level Design
[Diagram: S&P GCC Editorial Platform -> XML Data -> XML Extraction Engine (PL/SQL) -> XML Repository 1, XML Repository 2]
When an article is published from the editorial platform, the XML is parsed on the fly to extract key
components of an article and is then stored into the repository. Further, the application will also utilize
the repository content to build collections of articles and/or update the existing ones.
Detailed Design
Key components of the design
• A hybrid data model consisting of Relational and Binary XML columns
• Composite partitioned table:
    • Range partitioned on Article Publish Date
    • List sub-partitioned on Article Type
• Dynamic query generation for real-time determination of target repositories
In the next few slides, we will dive deeper into the details of each of the above components…
Hybrid Data Model
This model combines conventional scalar columns (VARCHAR2, DATE and TIMESTAMP WITH TIME ZONE) with XMLTYPE columns backed by SecureFile binary XML storage, for efficient use of storage.
Because the attributes that queries filter and sort on live in relational columns, data retrieval is faster, which opens up scope for near real-time analytics on the source XML.
In our implementation, the table carries these key attributes as relational columns: Article-ID, Article-Type, Publish-Date, Publish-Date with time zone, and Article-Headline. The front-end application can therefore run sort-intensive queries across multiple ranges of dates or timestamps. In our earlier implementation, the main drawback was that such queries had to be evaluated against the entire XML document, which generated heavy I/O and unnecessary logging.
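As a rough illustration of the kind of query this enables, the sketch below filters and sorts on relational columns only. The table name GCC_UMS_XML_TAB and the article type 'USEIPC' are taken from the Maintenance slide; the column names are assumptions made for this example, not the actual schema.

```sql
-- Hypothetical column names; only the table name and 'USEIPC' appear in this deck.
SELECT article_id,
       article_headline,
       publish_dt_time
FROM   gcc_ums_xml_tab
WHERE  publish_date >= DATE '2012-01-01'
AND    publish_date <  DATE '2012-04-01'   -- one quarter of publish dates
AND    article_type = 'USEIPC'
ORDER  BY publish_dt_time DESC;
```

No XML function appears in the WHERE or ORDER BY clause, so the stored XML document does not have to be parsed to answer the query.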
Composite Partitioning
Composite partitioning lets the design scale to higher volumes. Our implementation combines range and list partitioning as follows:
Range Partitioning (with row movement): partitions are created with the article publish date as the partition key, and each partition holds one quarter (three months) of data. Row movement is enabled, so a row whose publish date is updated to a value belonging to a different partition is moved to that partition.
List Partitioning: sub-partitions are created on the article type, so every partition consists of multiple sub-partitions.
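The table definition itself is not shown in this deck. A minimal sketch of what a range-list partitioned hybrid table could look like is given below. The table name and the partition naming pattern follow the Maintenance slide; the column names, column sizes and the article type 'NEWS' are assumptions for illustration only.

```sql
-- Sketch only: column names, sizes and the 'NEWS' article type are hypothetical.
CREATE TABLE gcc_ums_xml_tab (
  article_id        VARCHAR2(50)  NOT NULL,
  article_type      VARCHAR2(50)  NOT NULL,
  publish_date      DATE          NOT NULL,   -- range partition key
  publish_dt_time   TIMESTAMP WITH TIME ZONE,
  article_headline  VARCHAR2(200),
  doc               XMLTYPE                   -- full source document
)
XMLTYPE COLUMN doc STORE AS SECUREFILE BINARY XML
PARTITION BY RANGE (publish_date)
SUBPARTITION BY LIST (article_type)
(
  -- one partition per quarter, one sub-partition per article type
  PARTITION part_2010_q1 VALUES LESS THAN (TO_DATE('2010-04-01', 'YYYY-MM-DD'))
  (
    SUBPARTITION part_2010_q1_sp_news   VALUES ('NEWS'),
    SUBPARTITION part_2010_q1_sp_useipc VALUES ('USEIPC')
  ),
  PARTITION part_2010_q2 VALUES LESS THAN (TO_DATE('2010-07-01', 'YYYY-MM-DD'))
  (
    SUBPARTITION part_2010_q2_sp_news   VALUES ('NEWS'),
    SUBPARTITION part_2010_q2_sp_useipc VALUES ('USEIPC')
  )
)
ENABLE ROW MOVEMENT;
```

The sketch deliberately omits a MAXVALUE partition and DEFAULT sub-partitions, since either would prevent later quarters and article types from being appended with plain ADD PARTITION and ADD SUBPARTITION statements.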
Dynamic Query Generation
Dynamic SQL was chosen primarily to keep the application scalable. Our implementation relies on a simple mapping table, aptly named ARTICLE_LEGEND, which maps each article type to its destination repository.
Because the target repository is not known until run time, dynamic SQL lets the table name be resolved and the values bound at that point.
The following snippets from our implementation illustrate this:
• Selecting the target repository based on article-type
SELECT xml_table INTO vc_xmltabname
FROM remus.article_legend
WHERE article_type = vc_article_type;
• Inserting into the repository with bound variables at run-time
l_sqlstr := 'INSERT INTO remus.'||vc_xmltabname||' X VALUES (:1,:2,:3,:4,:5,:6)';
EXECUTE IMMEDIATE l_sqlstr USING vc_acticle_id,vc_article_type,l_pub_date,l_pub_dt_time,l_article_headline, xmlFile_tgt;
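The two snippets above could be combined into a single routine along the following lines. This is a sketch under stated assumptions, not the actual implementation: the procedure name, parameter names and error number are hypothetical, and the DBMS_ASSERT check on the looked-up table name is an added safeguard that the original snippets do not show.

```sql
-- Sketch only: procedure name, parameters and error code are hypothetical.
CREATE OR REPLACE PROCEDURE store_article (
  p_article_id    IN VARCHAR2,
  p_article_type  IN VARCHAR2,
  p_pub_dt_time   IN TIMESTAMP WITH TIME ZONE,
  p_headline      IN VARCHAR2,
  p_doc           IN XMLTYPE
) AS
  l_tabname   VARCHAR2(128);
  l_pub_date  DATE;
  l_sqlstr    VARCHAR2(400);
BEGIN
  -- Look up the target repository for this article type.
  SELECT xml_table
  INTO   l_tabname
  FROM   remus.article_legend
  WHERE  article_type = p_article_type;

  -- Date portion only, used as the range partition key.
  l_pub_date := TRUNC(CAST(p_pub_dt_time AS DATE));

  -- The table name cannot be a bind variable, so it is concatenated;
  -- SIMPLE_SQL_NAME raises an error unless it is a plain identifier.
  l_sqlstr := 'INSERT INTO remus.'
           || DBMS_ASSERT.SIMPLE_SQL_NAME(l_tabname)
           || ' VALUES (:1, :2, :3, :4, :5, :6)';

  EXECUTE IMMEDIATE l_sqlstr
    USING p_article_id, p_article_type, l_pub_date,
          p_pub_dt_time, p_headline, p_doc;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    RAISE_APPLICATION_ERROR(-20001,
      'No repository mapped for article type ' || p_article_type);
END store_article;
/
```

Only the table name is spliced into the statement text; every data value travels as a bind variable, so the statement text is identical for all articles that target the same repository.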
Data Migration in Production
Challenges:
Migrating large volumes of existing production XML content into the newly designed tables raised two issues:
• Transferring a mere 2,000 articles with a regular 'INSERT INTO…SELECT' statement took about 30 minutes and consumed large amounts of undo segment space. This was clearly impractical for the nearly 50K articles that had to be moved.
• The various nodes of the XML had to be extracted to fit the data into the new data model. One option was a PL/SQL cursor combined with the usual XML EXTRACT functions, but that would have required a lot of development time, so we tried a more elegant approach.
A breakthrough…
We used the XMLTABLE function to extract the various nodes of the XML on the fly and inserted the results into the new tables using a PL/SQL cursor. The query appears under 'XMLTABLE function usage' at the end of this deck.
Maintenance
The only overhead of the design is that it requires frequent DDL on the database objects: a new partition must be added as time passes (one per quarter), and a new sub-partition whenever the application takes on a new article type.
Example:
Adding a partition:
alter table GCC_UMS_XML_TAB
add partition part_2012_q1 values less than (TO_DATE('2012-04-01', 'YYYY-MM-DD'));
Adding a sub-partition:
alter table GCC_UMS_XML_TAB
modify partition part_2010_q1 add subpartition part_2010_q1_sp_USEIPC values ('USEIPC');
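Before running DDL like the above, it can help to check which partitions and sub-partitions already exist. A minimal sketch using Oracle's data dictionary views follows; it assumes the table is owned by the connected user (otherwise the ALL_ or DBA_ variants with an owner filter would be needed).

```sql
-- Existing range partitions and their upper bounds, oldest first.
SELECT partition_name,
       high_value
FROM   user_tab_partitions
WHERE  table_name = 'GCC_UMS_XML_TAB'
ORDER  BY partition_position;

-- Existing list sub-partitions and the article types each one accepts.
SELECT partition_name,
       subpartition_name,
       high_value
FROM   user_tab_subpartitions
WHERE  table_name = 'GCC_UMS_XML_TAB'
ORDER  BY partition_name, subpartition_name;
```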
Benefits and Possibilities
The content, and above all the design, lends itself readily to Business Intelligence and Data Mining activities. Two possibilities:
Business Intelligence (BI):
On any given day, S&P analysts and reporters may publish several articles on a specific company. These can be drilled across the article spectrum in ERL to generate reports for internal use, for example trends showing which industry vertical received the most focus on a given day. This would give the Editorial team more visibility and could aid their productivity.
Bottom line: a report may be a quicker and/or more effective way of delivering key information to end users.
Data Mining:
Data from various articles can be mined to derive company-specific research in real time, which can give potential investors key market information at critical times during the day (for example, while carrying out trades on the company). An example could be a STARS change on a company.
Comparison of various designs
Designs compared:
• REMUS V1: Binary XML storage with functional indexes
• REMUS V2: Binary XML storage with XML-based indexes
• REMUS V3: Hybrid data model with composite partitioning
XML parsing
• V1: Involves framing queries with various XML functions to extract data on the fly, and is hence prone to error.
• V2: Involves framing queries with various XML functions to extract data on the fly, and is hence prone to error.
• V3: Simple relational queries, since there is no need to parse XML on the fly.
Performance
• V1: Degrades considerably with volume (since the XML functions act on each XML in the result set).
• V2: Degrades considerably with volume (since the XML functions act on each XML in the result set).
• V3: Rapid responses (since the necessary data attributes are present in relational columns).
Indexing
• V1: Needs additional indexes if the application queries hit parts of the XML other than the ones covered by the existing indexes.
• V2: Indexed on an XPath and hence scalable for application query changes.
• V3: Regular table indexes.
Maintenance
• V1: Easily maintained.
• V2: Easily maintained.
• V3: Needs frequent interactions at the DDL level (since new partitions and sub-partitions need to be added regularly).
Performance Metrics
The performance gains when compared to Remus V2.0 are clearly significant…
Data Migration in Prod
XMLTABLE function usage
SELECT
  r2."Article_ID",
  r2."Article_Type",
  -- date portion only
  TRUNC(TO_TIMESTAMP_TZ(REPLACE(r2."Publish_Date", 'T', ' '), 'YYYY-MM-DD HH24:MI:SSTZH:TZM')) AS "Publish_Date",
  -- full timestamp with time zone
  TO_TIMESTAMP_TZ(REPLACE(r2."Publish_Date", 'T', ' '), 'YYYY-MM-DD HH24:MI:SSTZH:TZM') AS "Publish_Dt_Time",
  r2."Article_Headline",
  r1.object_value AS "Doc"
FROM
  remusxmltab r1,
  XMLTABLE('/Editorial'
    PASSING r1.object_value
    COLUMNS
      -- column paths are written relative to the /Editorial row node
      "Article_ID"       VARCHAR2(50)  PATH '@id',
      "Article_Type"     VARCHAR2(50)  PATH '@type',
      "Publish_Date"     VARCHAR2(50)  PATH 'MetaData/Property[@name="PublishDate"]',
      "Article_Headline" VARCHAR2(200) PATH 'Headline'
  ) r2;
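The deck does not show how this query was wired into the PL/SQL cursor. One way it might drive the migration is sketched below, using a variant of the query with unquoted column names and paths written relative to the /Editorial row node. The single target table, its column order and the batch size are assumptions; committing in batches is a common way to bound undo usage, but the deck does not say whether the original migration did so.

```sql
-- Sketch only: target table, column order and batch size are assumptions.
DECLARE
  c_commit_every CONSTANT PLS_INTEGER := 500;  -- hypothetical batch size
  l_rows         PLS_INTEGER := 0;
BEGIN
  FOR rec IN (
    SELECT r2.article_id,
           r2.article_type,
           TO_TIMESTAMP_TZ(REPLACE(r2.publish_date, 'T', ' '),
                           'YYYY-MM-DD HH24:MI:SSTZH:TZM') AS publish_dt_time,
           r2.article_headline,
           r1.object_value AS doc
    FROM   remusxmltab r1,
           XMLTABLE('/Editorial'
             PASSING r1.object_value
             COLUMNS
               article_id       VARCHAR2(50)  PATH '@id',
               article_type     VARCHAR2(50)  PATH '@type',
               publish_date     VARCHAR2(50)  PATH 'MetaData/Property[@name="PublishDate"]',
               article_headline VARCHAR2(200) PATH 'Headline'
           ) r2
  ) LOOP
    -- Same six-value order as the dynamic INSERT shown earlier in the deck.
    INSERT INTO gcc_ums_xml_tab
    VALUES (rec.article_id,
            rec.article_type,
            TRUNC(CAST(rec.publish_dt_time AS DATE)),
            rec.publish_dt_time,
            rec.article_headline,
            rec.doc);

    l_rows := l_rows + 1;
    IF MOD(l_rows, c_commit_every) = 0 THEN
      COMMIT;  -- release undo after each batch
    END IF;
  END LOOP;

  COMMIT;  -- final partial batch
END;
/
```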