This document proposes HadoopXML, a system for efficiently processing massive XML data and multiple twig pattern queries in parallel using Hadoop. Key features of HadoopXML include:
1) It partitions large XML files and processes them in parallel across nodes while preserving structural information.
2) It simultaneously processes multiple twig pattern queries with a shared input scan, without needing separate MapReduce jobs for each query.
3) It enables query processing tasks to share input scans and intermediate results like path solutions, reducing redundant processing and I/O.
4) It provides load balancing to fairly distribute twig join operations across nodes.
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries
Hyebong Choi‡, Kyong-Ha Lee‡, Soo-Hyong Kim‡, Yoon-Joon Lee‡, Bongki Moon§
hbchoi@dbserver.kaist.ac.kr, bart7449@gmail.com, kimsh@dbserver.kaist.ac.kr, yoonjoon.lee@kaist.ac.kr, bkmoon@cs.arizona.edu
‡ Computer Science Dept., KAIST, Daejeon, 301-781, Korea
§ Dept. of Computer Science, University of Arizona, Tucson, Arizona, 85721, USA
ABSTRACT
The volume of XML data is tremendous in many areas, but especially in data logging and scientific areas. XML data in these areas are accumulated over time as new data are continuously collected, and it is a challenge to process massive XML data with multiple twig pattern queries, given by multiple users, in a timely manner. We showcase HadoopXML, a system that simultaneously processes many twig pattern queries over a massive volume of XML data with Hadoop. Specifically, HadoopXML provides an efficient way to process a single large XML file in parallel. It processes multiple twig pattern queries simultaneously with a shared input scan, so users do not need to iterate M/R jobs for each query. HadoopXML also saves many I/Os by enabling twig pattern queries to share their path solutions with each other. Moreover, HadoopXML provides a sophisticated runtime load balancing scheme for fairly assigning multiple twig pattern joins across nodes. With synthetic and real-world XML datasets, we demonstrate how efficiently HadoopXML processes many twig pattern queries in a shared and balanced way.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—query processing; D.1.3 [Software]: Programming Techniques—concurrent programming

General Terms
Algorithms, Experimentation, Performance

Keywords
XML, parallel processing, query optimization, MapReduce

Copyright is held by the author/owner(s). CIKM'12, October 29–November 2, 2012, Maui, HI, USA. ACM 978-1-4503-1156-4/12/10.

1. INTRODUCTION
XML is one of the most prominent data formats, and a great deal of data has been produced in or transformed into the format. In particular, scientific data and log messages are often kept in the form of XML. Such XML data are large and also growing very quickly. For example, UniProtKB, which provides the collection of functional information on proteins, now exceeds 108GB in a single file [2]. Moreover, new elements and attributes are continuously appended to existing XML files as they are generated over time. In a typical scenario, users prepare their queries in advance, even when the XML data are not yet completely produced. This is akin to the setting of XML pub/sub systems, but differs in that the data are sometimes stored in one huge XML file on disk whose volume grows continuously. This makes it difficult to process the data within XML pub/sub systems or single-site XML databases: conventional XML pub/sub systems are mainly devised for a series of small XML documents, and XML databases are not optimized for a big XML file that must also be appended to, or even substituted by a new XML file, frequently. Thus, it is prudent to process user queries over such XML data with MapReduce [4].

To address this issue, we devise HadoopXML, which provides facilities to efficiently process a massive volume of XML data in parallel. HadoopXML is a set of applications developed on the popular MapReduce framework, Hadoop [1]. The main features of HadoopXML are as follows. First, it provides an efficient means to process a massive volume of XML data in parallel: it partitions XML data into blocks with no loss of structural information. Second, HadoopXML processes multiple twig pattern queries simultaneously; there is no need to iterate M/R jobs for each query in a query set. Third, HadoopXML enables query processing tasks to share input scans and their intermediate results with each other. A path solution is shared by all the twig pattern queries that contain the common path pattern, and many I/Os are saved by removing redundant intermediate results, as path patterns that include // and * are substituted by distinct root-to-leaf paths. Lastly, HadoopXML provides a sophisticated runtime load balancing scheme for evenly distributing twig joins across nodes. The rest of this proposal is organized as follows: Section 2 describes our system architecture, Section 3 explains the features of HadoopXML, Section 4 presents implementation details, and Section 5 describes our demonstration scenarios.

2. SYSTEM ARCHITECTURE
HadoopXML processes XML data in three steps: a preprocessing step followed by two consecutive M/R jobs. In the preprocessing step (shown in Fig. 1), XML data are partitioned into equal-sized blocks and loaded into HDFS. Elements are also labeled for later use in the twig pattern joins, and the labels are written into label blocks kept separate from the XML blocks. In this stage, HadoopXML additionally decomposes the given set of queries into linear path patterns, then builds an NFA-style query index and a table that holds the mapping information between the given queries and the decomposed path patterns.

In the 1st M/R job, the query index is first loaded into each mapper via the distributedCache mechanism in Hadoop. Mappers then read XML blocks as SAX streams and filter out only the labels of elements matched with the decomposed path patterns. Reducers group the labels by PathId and count the number of labels for each PathId. The path solutions and their size information are stored in HDFS.
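To make the data flow of the 1st M/R job concrete, the following is a minimal sketch against Hadoop's Java MapReduce API. The QueryIndex interface, the loader, and the record format are stand-ins for the NFA-style index described above; they are our assumptions, not code from the paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PathFilteringJob {

  /** Stand-in for the NFA-style query index; the real one is YFilter-like. */
  public interface QueryIndex {
    final class Match {
      public final int pathId;     // id of the matched linear path pattern
      public final String label;   // label of the matched element
      public Match(int pathId, String label) {
        this.pathId = pathId;
        this.label = label;
      }
    }
    Iterable<Match> matchBlock(String xmlBlockAsSaxStream);
  }

  /** Hypothetical helper: rebuilds the index shipped via distributedCache. */
  static QueryIndex loadIndexFromCache(Configuration conf) {
    throw new UnsupportedOperationException("sketch only");
  }

  // Mapper: runs the query index over one XML block and emits
  // <PathId, label> for every element that matches a path pattern.
  public static class PathFilterMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private QueryIndex index;

    @Override
    protected void setup(Context ctx) {
      index = loadIndexFromCache(ctx.getConfiguration());
    }

    @Override
    protected void map(LongWritable offset, Text xmlBlock, Context ctx)
        throws IOException, InterruptedException {
      for (QueryIndex.Match m : index.matchBlock(xmlBlock.toString())) {
        ctx.write(new IntWritable(m.pathId), new Text(m.label));
      }
    }
  }

  // Reducer: groups labels by PathId, materializes the path solution and
  // counts its size; the sizes feed the multi-query optimizer.
  public static class PathSolutionReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable pathId, Iterable<Text> labels,
        Context ctx) throws IOException, InterruptedException {
      long size = 0;
      StringBuilder solution = new StringBuilder();
      for (Text label : labels) {
        solution.append(label).append(' ');
        size++;
      }
      ctx.write(pathId, new Text(size + "\t" + solution.toString().trim()));
    }
  }
}
```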
[Figure 1: Preprocessing step in HadoopXML: XPath queries are decomposed into path patterns and compiled into a query index together with the relationships between paths and twigs, while a large XML file is partitioned and labeled into collocated XML blocks and label blocks in HDFS.]

After that, our multi-query optimizer decides which reducer in the next M/R job will perform which twig pattern join, balancing the workload across nodes based on the size information for the path solutions and the mapping table.

In the 2nd M/R job, mappers read the grouped path solutions and tag reducer ids onto them as keys. Since map outputs are shuffled by their intermediate keys, path solutions tagged with the same reducer id arrive at the same reducer together. Finally, the reducers perform the twig pattern joins and write the final answers to HDFS. Fig. 2 illustrates the data flows of the two M/R jobs in HadoopXML.
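The routing step of the 2nd job can be pictured as a one-line custom partitioner. This is a hedged sketch that assumes the optimizer's reducer id is carried directly as the map output key; HadoopXML's actual key layout is not specified in the paper.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each tagged path solution to the reducer chosen by the
// multi-query optimizer: partitioning by the key is an identity mapping.
public class ReducerIdPartitioner extends Partitioner<IntWritable, Text> {
  @Override
  public int getPartition(IntWritable reducerId, Text pathSolution,
      int numPartitions) {
    return reducerId.get() % numPartitions;
  }
}
```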
in practice. For example, assume that path patterns /a//c, //c
formation are stored in HDFS. After that, our multi-query optimizer and /a/*/c are matched with a root-to-leaf path /a/b/c in an
decides which reducer in the next M/R job will perform which twig XML file. If the paths are treated as different each other, three path
pattern join for balancing workloads across nodes, based on size in- solutions are redundantly produced for a single distinct path during
formation for the path solutions and the mapping table. query processing. By converting redundant path patterns to root-
In the 2nd M/R job, mappers read grouped path solutions and to-leaf paths which are distinct in an input XML, we nicely reduce
tag reduce ids to the grouped path solutions as keys. Since mapped the sizes of path solutions and save many I/Os. In order to support
outputs are shuffled by intermediate keys, path solutions tagged by this feature, HadoopXML extracts distinct root-to-leaf paths during
the same reducer id go to the same reducer together. Finally, re- data loading in preprocessing step.
ducers perform twig pattern joins and output final results to HDFS.
Fig. 2 illustrates data flows in two M/R jobs in HadoopXML. Runtime load balancing and multi query optimization
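As an illustration of this restart trick, a minimal sketch follows, assuming the query index consumes ordinary SAX events; the helper and path format are ours, not the authors' code.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Replays a stored root-to-leaf path such as "/a/b/d" as synthetic SAX
// startElement events, so the query index reaches the internal state it
// would have had after parsing everything before the block boundary.
public final class PrefixReplayer {
  public static void replay(String rootToLeafPath, ContentHandler index)
      throws SAXException {
    for (String tag : rootToLeafPath.substring(1).split("/")) {
      index.startElement("", tag, tag, new AttributesImpl());
    }
  }
}
```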
Collocating XML blocks and label blocks
HadoopXML reads both XML blocks and their corresponding label blocks during query processing. If the two blocks are stored on different nodes, additional network I/Os occur as the system fetches blocks over the network, delaying the map stage. To increase spatial locality, we extend the block placement policy of HDFS so as to put an XML block and its corresponding label block together on the same node.

Multiple twig pattern matchings in parallel
In HadoopXML, multiple join operations are distributed across nodes and executed in parallel, as many at a time as there are reducers. We also implement each join operation with an I/O-optimal holistic twig pattern join algorithm to improve I/O efficiency in HadoopXML [3].
Sharing input scan and path solutions
MapReduce's batch nature makes it difficult to support ad-hoc queries the way a DBMS does, and iterating the same M/R job from input scan to reduce stage for every query is wasteful in many cases. Moreover, many twig pattern queries share linear path patterns with each other in practice. Sharing path solutions reduces redundant processing of path patterns and saves many I/Os [8]; in this respect, we borrow the concept of path sharing from YFilter [5]. Path solutions are also shared by multiple twig pattern joins in HadoopXML: while joining path solutions to process twig patterns, the join operations assigned to the same reducer share path solutions with each other whenever the underlying path patterns are shared by their twig patterns. This helps reduce the overall I/O cost of the join operations.
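As a toy illustration of the decomposition that enables this sharing, the sketch below splits a twig pattern with single-level predicates into linear root-to-leaf paths; the real decomposer must handle nested predicates and all supported axes, so treat the names and behavior as assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public final class TwigDecomposer {
  // "/a/b[c]/d" -> ["/a/b/c", "/a/b/d"]. A path shared by several twigs
  // (the trunk or a predicate branch) is then scanned and materialized
  // only once by the 1st M/R job.
  public static List<String> decompose(String twig) {
    List<String> paths = new ArrayList<>();
    StringBuilder trunk = new StringBuilder();
    for (int i = 0; i < twig.length(); i++) {
      char ch = twig.charAt(i);
      if (ch == '[') {
        int close = twig.indexOf(']', i);
        paths.add(trunk + "/" + twig.substring(i + 1, close));
        i = close;  // skip past the predicate branch
      } else {
        trunk.append(ch);
      }
    }
    paths.add(trunk.toString());  // the trunk itself is a linear path
    return paths;
  }
}
```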
Converting to distinct path patterns
Many path patterns may match a single root-to-leaf path in practice. For example, assume that the path patterns /a//c, //c and /a/*/c all match a root-to-leaf path /a/b/c in an XML file. If these patterns are treated as different from each other, three path solutions are redundantly produced for a single distinct path during query processing. By converting redundant path patterns to the root-to-leaf paths that are distinct in the input XML, we substantially reduce the sizes of the path solutions and save many I/Os. To support this feature, HadoopXML extracts the distinct root-to-leaf paths during data loading in the preprocessing step.
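One simple way to realize this conversion, sketched under the assumption that the distinct root-to-leaf paths are available as plain strings, is to compile each path pattern into a regular expression and match it against the distinct paths. This is our illustration, not the paper's algorithm:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class DistinctPathResolver {
  // "/a//c" matches "/a/b/c": '*' is exactly one step, '//' any descent.
  static Pattern toRegex(String pathPattern) {
    String re = pathPattern
        .replace("*", "[^/]+")            // one wildcard step first...
        .replace("//", "/(?:[^/]+/)*");   // ...then any chain of steps
    return Pattern.compile(re);
  }

  // Returns the concrete paths a pattern collapses to; queries whose
  // patterns resolve to the same path share one path solution.
  public static List<String> resolve(String pathPattern,
      List<String> distinctRootToLeafPaths) {
    Pattern p = toRegex(pathPattern);
    return distinctRootToLeafPaths.stream()
        .filter(path -> p.matcher(path).matches())
        .collect(Collectors.toList());
  }
}
```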
Runtime load balancing and multi query optimization
A straggling task lags overall job execution in MapReduce, and the problem becomes more severe when it happens in the reduce stage. MapReduce's native runtime scheduling does not work well, especially for reducers. HadoopXML instead uses a dynamic shuffling scheme that balances workloads across reducers at runtime. To achieve this, HadoopXML estimates the cost of each twig join operation before the actual join: since the worst-case I/O and CPU time complexities of the TwigStack algorithm are linear in the sum of the sizes of the input path solutions, the cost of each join can be computed from the path solution sizes counted in the 1st reduce stage. The cost estimation also accounts for path solutions shared by multiple twig pattern queries. Join operations are then assigned to reducers at runtime such that every reducer carries approximately the same overall join cost.
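The balancing idea can be sketched as a longest-processing-time style greedy assignment over the estimated join costs. The paper does not spell out the exact assignment algorithm, so the following is only an assumed illustration of "equal overall cost per reducer":

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class JoinAssigner {
  // joinCosts[i] = estimated cost of twig join i (sum of its input path
  // solution sizes). Returns reducerOf[i] = reducer assigned to join i.
  public static int[] assign(long[] joinCosts, int numReducers) {
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < joinCosts.length; i++) order.add(i);
    order.sort(Comparator.comparingLong(i -> -joinCosts[i]));  // big first

    long[] load = new long[numReducers];
    int[] reducerOf = new int[joinCosts.length];
    for (int join : order) {
      int best = 0;  // give the next-largest join to the least-loaded reducer
      for (int r = 1; r < numReducers; r++) {
        if (load[r] < load[best]) best = r;
      }
      reducerOf[join] = best;
      load[best] += joinCosts[join];
    }
    return reducerOf;
  }
}
```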
[Figure 2: (a) path query processing in the 1st M/R job: mappers filter path solutions from the XML and label blocks using the query index shipped via the distributed cache, and reducers count and store the path solutions; (b) twig pattern joins in the 2nd M/R job: path solutions are tagged with reducer ids by the multi query optimizer, shuffled by ReducerId, and joined holistically to produce the final answers.]

4. IMPLEMENTATION
We implemented HadoopXML with Hadoop version 0.21.0. Our cluster consists of 9 nodes running CentOS 6.2. The master is equipped with an AMD Athlon II X4 620 processor, 8GB of memory and a 7200RPM HDD. The other nodes are designated as slaves, each of which has an Intel i5-2500K processor, 8GB of memory and a 7200RPM HDD. All nodes are connected via a Gigabit switching hub. We use default settings for our Hadoop cluster for fair comparison. The region numbering scheme [7] is used for labeling XML, but modified to support big XML files. Since the end values in the numbering scheme are generated in postorder, labels would otherwise be kept in memory until endElement() is met, causing a memory space problem for such big XML files. Our scheme instead reads an XML block and promptly appends its labels to the corresponding label block; after data loading, HadoopXML sorts the labels by their start values in preorder. For path filtering, we use the NFA-style query index of YFilter [5]. We use the TwigStack algorithm [3] to implement holistic twig pattern joins in the 2nd M/R job, but other holistic join techniques can be used in HadoopXML with no loss of generality. Finally, we extend the DataPlacementPolicy class in HDFS in order to collocate XML blocks and their corresponding label blocks.
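For illustration, a minimal sketch of region numbering over a SAX stream follows; emit() stands in for appending a label to the current label block, and the class is our assumption rather than HadoopXML's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Assigns each element a (start, end, level) region label: start values
// are generated in preorder, end values in postorder, and an ancestor's
// region strictly contains every descendant's region.
public class RegionLabeler extends DefaultHandler {
  private long counter = 0;   // shared counter for start and end values
  private int level = 0;
  private final Deque<Long> openStarts = new ArrayDeque<>();

  @Override
  public void startElement(String uri, String local, String qName,
      Attributes atts) {
    level++;
    openStarts.push(++counter);
  }

  @Override
  public void endElement(String uri, String local, String qName) {
    long start = openStarts.pop();
    long end = ++counter;
    // In HadoopXML this label is appended straight to the label block of
    // the current XML block instead of being buffered in memory.
    emit(qName, start, end, level);
    level--;
  }

  protected void emit(String tag, long start, long end, int depth) {
    System.out.printf("%s (%d,%d,%d)%n", tag, start, end, depth);
  }
}
```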
[Figure 3: Experimental results. (a) data loading time (copy to HDFS and labeling) per dataset; (b) execution time of the two M/R jobs on the synthetic datasets XMark10/100/1000 with 1k, 2k, 4k, 8k, and 16k queries; (c) execution time on the real-world datasets UniRef100, UniParc, and UniProtKB with 1k, 2k, 4k, 8k, and 16k queries; (d) effect of block collocation vs. non-collocation on the 1st M/R job; (e) effect of converting normal paths to distinct paths on path solution size and execution time (XMark100, UniRef100); (f) effect of multi query optimization: balanced vs. random assignment of twig joins (XMark100, UniProtKB).]
5. DEMONSTRATION SCENARIO
Table 1 presents statistics for the XML datasets used in our experiments. The demonstration will use only a small fraction of one synthetic and one real-world dataset, due to the limited demonstration time and the nature of MPP (massively parallel processing) applications. However, we still present our experimental results over all the datasets in Fig. 3. Currently, HadoopXML supports a subset of the XPath 1.0 language, i.e. {/, //, *, @, []}.

In our demonstration, users will be given a list of sample XPath queries generated from the DTDs of the datasets in Table 1. Users can also edit the queries to their tastes. Users are then allowed to load sample XML files into HadoopXML and run their queries themselves. During the processing, users are shown step by step, via the Hadoop GUI, how the system processes a massive volume of XML data. Users can also check how the features of HadoopXML affect the overall performance of the system by turning the features on and off, e.g. block collocation, sharing of input scans and path solutions, load balancing, and so on.

Table 1: Statistics of the XML datasets
Filename           UniRef100      UniParc        UniProtKB      XMark1000
File size (KB)     25,088,663     38,334,953     108,283,066    117,159,962
# of elements      335,153,446    360,376,852    2,110,330,358  1,670,594,672
# of attributes    589,568,839    1,215,063,103  383,127,024    2,783,354,175
Avg. depth         4.5649         3.7753         4.3326         4.7375
Max depth          6              5              7              12
# distinct paths   30             24             149            548

Acknowledgments
We thank Jiaheng Lu for providing us with a Java version of the twig join algorithms. This work was partly supported by an NRF grant funded by the Korea government (MEST) (No. 2011-0016282).

6. REFERENCES
[1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
[2] A. Bairoch et al. The universal protein resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
[3] N. Bruno et al. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002.
[4] J. Dean et al. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] Y. Diao et al. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003.
[6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.
[7] Q. Li et al. Indexing and querying XML data for regular path expressions. In Proceedings of VLDB, pages 361–370, 2001.
[8] T. Nykiel et al. MRShare: sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1-2):494–505, 2010.