Patient-Like-Mine
A Real Time, Visual Analytics Tool for Clinical Decision Support
Peter Li¹, Ph.D., Simon N. Yates², Jenna K. Lovely³, Pharm.D. R.Ph. BCPS, David W. Larson⁴, M.D. MBA
Department of Surgery, Mayo Clinic, Rochester, MN USA
¹jenli.peter@gmail.com, ²yates.simon@mayo.edu, ³lovely.jenna@mayo.edu, ⁴larson.david2@mayo.edu
Abstract— We developed a real-time, visual analytics tool for
clinical decision support. The system expands the “recall of
past experience” approach that a provider (physician) uses to
formulate a course of action for a given patient. By utilizing
Big-Data techniques, we enable the provider to recall all
similar patients from an institution’s electronic medical record
(EMR) repository, to explore “what-if” scenarios, and to
collect these evidence-based cohorts for future statistical
validation and pattern mining.
Keywords- electronic medical record, clinical decision
support, real-time analytics, visual analytics, data mining.
I. INTRODUCTION
When determining the course of action for a given patient,
the physician has to integrate clinical knowledge, state of the
patient, and his/her own personal experience. Clinical
knowledge is composed of models of diseases and their
progression and interventions. However, many patients have
comorbidities and medical histories that cause the disease
progression and the optimal treatment plan to deviate from
the known clinical models. As a result, the physician (or, in
general, the provider team) has to recall from their own
experience of similar past cases to develop the care plan.
Unfortunately, this approach is often subjective and limited
to the team’s ability to recall all relevant cases. We
developed an interactive, visual analytics tool, “Patient-Like-
Mine”, in the Mayo Clinic’s Enhanced Analytics to Surgical
Excellence (EASE) program that focuses on improving
Clinical Pathway compliance [1] as well as recognizing
patterns early to assist this clinical decision making process.
We remove the limits of subjectivity by accessing the
institutional Electronic Medical Record (EMR) database so
that the collective experience from ALL the past and present
patients with a similar background could be utilized in real-
time care planning.
Operational requirements of “Patient-Like-Mine” include:
• Perform searches in real time on a large (>1 billion facts), complex (>1 thousand properties), and continuously updated (>1 thousand data points/second) dataset.
• Align the resulting patients’ (i.e. the cohort’s) medical
histories to provider-chosen “landmark” events, e.g.
time of surgery or date of admission.
• Restrict the value and the relative time of ANY clinically relevant parameter based on the provider’s judgment, e.g. “systolic blood pressure between 50 and 90, 3 days post-operation” (a sketch of such a constraint follows this list).
• Provide an interactive, graphical user interface to
compare the given patient to “similar patients” and to
explore “what-if” scenarios in a transparent fashion.
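As an illustration of the value-and-relative-time requirement above, such a constraint might be expressed roughly as follows. This is a minimal Javascript sketch: the field names follow the event schema of Figure 3, but the units of RelTime, the helper toEsFilter(), and the flat bool/must query shape are assumptions for illustration; the production system generates the full parent-child form described in Section C.

// Minimal sketch (illustrative only): one provider-chosen constraint,
// "systolic blood pressure between 50 and 90, 3 days post-operation".
const constraint = {
  parameter: "Systolic Blood Pressure",   // matches Event.Display
  valueRange: { gte: 50, lte: 90 },       // matches Event.Observation.value
  refEvent: "Surgery",                    // the provider-chosen landmark event
  relTime: { gte: 3, lte: 3 }             // assumed to be days relative to the landmark
};

// One way such a constraint could be lowered to ElasticSearch filter clauses.
function toEsFilter(c) {
  return {
    bool: {
      must: [
        { term:  { "Display": c.parameter } },
        { range: { "Observation.value": c.valueRange } },
        { term:  { "RefEvent": c.refEvent } },
        { range: { "RelTime": c.relTime } }
      ]
    }
  };
}

console.log(JSON.stringify(toEsFilter(constraint), null, 2));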
II. APPROACH
The architecture for this project is based on the following
four major components:
• Automated deployment for Azure public (and private)
cloud nodes with Health Insurance Portability and
Accountability Act (HIPAA)-compliant disk encryption,
strict firewall configurations, and secure login. The
cloud is necessary for scalable performance, but we also
need to secure the protected health information (PHI)
that will now be placed onto the cloud nodes.
• A scalable search engine that can expand and shrink
based on existing load. For this project, we chose
ElasticSearch (http://elastic.co) for model simplicity
(JSON-based), query expressiveness (structured and
text), inherent performance (subsecond response), and
secure transport protocol (SSL/TLS). We developed
data transform and importing tools for Fast Healthcare
Interoperability Resources (FHIR, http://hl7.org/fhir),
Reference Information Model (RIM, http://hl7.org/
implement/standards/rim.cfm), and SQL-based sources
that captured EMR at Mayo Clinic.
• A schema-driven abstraction layer for ElasticSearch
query building and data export. This generates
consistent queries from declarative API calls and
converts the return results from nested JSON to a tabular
format for UI ingestion and downstream statistical
analysis.
• An intuitive, graphical UI for exploring “Patient-Like-
Mine”. A single patient’s structured clinical parameters
can be plotted via a “strip-chart” graphic with a
zoomable and scrollable interface. By extension, a set of
“similar patients” in a cohort would paint a distribution
contour over the same chart. By changing the constraints
over clinical outcome and intervention parameters, we
can analyze “what-if” scenarios to facilitate clinical
decision making.
The project is divided into three phases. The first phase is architectural and implementation feasibility; the second is clinical practice validation; the third is distribution and expansion into additional clinical practices.
III. RESULTS
We present the results from the recently completed first
phase, demonstrating architectural and implementation
feasibility.
A. Cloud Deployment
We developed an Azure-cloud deployment tool based on
Linux shell scripts for ease of portability and extensibility. A
given cloud deployment is defined by a set of configuration
files that describe the system (IT) and application parameters
(Figure 1). System configuration includes disk encryption
via LUKS (for encryption-at-rest), firewall settings, and
logins via public-key based ssh only. Passwords, certificates,
and keys are generated on-the-fly so that each deployment
will have its own set of security tokens.
Figure 1. System architecture for cloud deployment.
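For illustration, the kind of information carried by such a deployment definition might look like the following. This is a hypothetical Javascript sketch; the parameter names are invented, and the actual EASE deployment scripts use their own shell-based configuration format.

// Hypothetical deployment configuration (illustrative only; not the actual
// EASE configuration format, which is consumed by Linux shell scripts).
const deploymentConfig = {
  system: {
    diskEncryption: "LUKS",                                   // encryption-at-rest
    firewall: { allowPorts: [22, 443], allowFrom: ["corporate VPN"] },
    login: { method: "ssh public key", passwordLogin: false }
  },
  application: {
    elasticsearch: { masterNodes: 1, dataNodes: 4, shield: true },  // encryption-in-flight
    nodejs: { location: "on-premise", auth: "LDAP" }
  },
  secrets: "generated at deployment time"                     // passwords, certificates, keys
};

console.log(JSON.stringify(deploymentConfig, null, 2));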
Currently supported applications are ElasticSearch (ES) and nodejs (Figure 2). ES can be deployed with the SHIELD module for secure internode and client communication (encryption-in-flight). Other ES modules, such as Kibana, are also deployed, but they are not considered secure at this time and are limited to non-production environments. Nodejs is deployed on an intranet/on-premise VM that has credentials to connect to the local LDAP server for authentication and authorization. ES will only allow connections from this nodejs server via https.
Figure 2. Application architecture for cloud deployment.
B. ElasticSearch Engine
ES is an open source, commercial product that serves as
the search engine for many well-known internet sites (see
http://elastic.co). It is built on top of Apache Lucene
indexing engine (http://lucene.apache.org) processing JSON-
based documents, but with distributed, multi-node scalability
as a core feature. It has a very large set of querying
commands, flexible aggregations, and parent-child
relationship. In addition, it has a large ecosystem of tools and
language support.
A typical EMR captures clinical data from a messaging format (HL7 v2 messages). However, with the industry acceptance of FHIR and RIM messaging objects, we are seeing more complex and polymorphic schemas. These complex datasets require transformations to an analytics-friendly fact or event schema. We also took advantage of ES-supported parent-child relationships and JSON-based nested array elements to provide additional properties for landmark event alignment and relative times (Figure 3).
Figure 3. Mapping of HL7 schema to Event-based schema.
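A minimal sketch of what such a mapping could look like is shown below, written as a Javascript object in the style of the ElasticSearch versions current at the time, which declared parent-child relationships with a _parent entry on the child type. The type and field names follow Figure 3; the exact production mapping, analyzers, and routing settings are not reproduced here and should be treated as assumptions.

// Hypothetical mapping sketch (pre-5.x ElasticSearch style). Type and field
// names follow the Patient-Like-Mine schema of Figure 3; illustrative only.
const mappings = {
  patient: {
    properties: {
      Name: { properties: { first: { type: "string" }, last: { type: "string" } } },
      BirthDate: { type: "date" },
      Gender: { type: "string" },
      Race: { type: "string" }
    }
  },
  event: {
    _parent: { type: "patient" },            // Event documents are children of Patient
    properties: {
      Type: { type: "string" },
      ClinDate: { type: "date" },
      Display: { type: "string" },
      Observation: {                          // nested array elements, per the paper
        type: "nested",
        properties: { value: { type: "double" } }
      }
    }
  },
  relativetime: {
    _parent: { type: "event" },               // relative times hang off landmark-aligned events
    properties: {
      RefEvent: { type: "string" },
      RelTime: { type: "double" }
    }
  }
};

console.log(JSON.stringify({ mappings }, null, 2));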
C. Schema-based Abstraction Module
While the ElasticSearch query language is very expressive
and powerful, it is also very easy to create queries that return
non-intuitive answers because of the different contexts
involving nested elements, parent-child relationships, and
aggregates. Constraint clauses associated with different
parent-child/nested elements need to be repeated at different
places of the query construct to ensure proper object
selection and projection (Figure 4). We built an abstraction
module that takes a set of schema-based constraints and
generates a consistent query for execution.
To simplify downstream processing of returned results
from an ElasticSearch query, this abstraction module also
includes a schema-based specification for transforming
JSON-based data into a flatter table structure. This also simplifies exports to any statistical analysis that may need to be performed. The module is implemented in Javascript for use in both the client browser and the nodejs web server.
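The following Javascript sketch conveys the general idea; the function names buildQuery and toRows and their signatures are invented for illustration and are not the module's actual API. Each constraint is tagged with the schema level it applies to, the builder repeats each clause wherever that level appears in the query (projection/inner-hits clauses would be generated the same way), and a second helper flattens a nested hit into table rows.

// Illustrative sketch of the abstraction idea (names are hypothetical).
// Each constraint names the schema level it applies to; the builder repeats
// the clause wherever that level appears so selection stays consistent.
function buildQuery(constraints) {
  const byLevel = { patient: [], event: [], relativetime: [] };
  for (const c of constraints) byLevel[c.level].push(c.clause);

  const rtClause = { bool: { must: byLevel.relativetime } };
  const eventClause = {
    bool: { must: [...byLevel.event, { has_child: { type: "relativetime", query: rtClause } }] }
  };
  return {
    query: {
      bool: { must: [...byLevel.patient, { has_child: { type: "event", query: eventClause } }] }
    }
  };
}

// Flatten one patient hit plus its matched events into rows for UI or export.
function toRows(patientHit, eventHits) {
  return eventHits.map(e => ({
    patientId: patientHit._id,
    gender: patientHit._source.Gender,
    eventType: e._source.Type,
    value: (e._source.Observation || {}).value,
    clinDate: e._source.ClinDate
  }));
}

const q = buildQuery([
  { level: "event", clause: { term: { Display: "Systolic Blood Pressure" } } },
  { level: "relativetime", clause: { range: { RelTime: { gte: 3, lte: 3 } } } }
]);
console.log(JSON.stringify(q, null, 2));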
[Diagram labels from Figures 1–3. Figure 1 (system architecture): Azure cloud environment with LUKS-encrypted data volumes, Azure and corporate firewalls, VPN gateways and tunnel, SSL-encrypted traffic, deployment nodes (Linux/shell or Windows/powershell) issuing secure RESTful Azure commands, SSH multi-factor authentication, restricted-permission config files, and per-deployment data, keys, and passwords. Figure 2 (application architecture): ES master and slave VMs with encrypted data volumes and encrypted certificate files, a nodejs VM with LDAP authentication, user access via https, and administration via ssh/scp. Figure 3 (schema mapping): FHIR resources (MongoDB), SQL-based RDBMS, and RIM-based repositories are mapped into the Patient-Like-Mine schema of Patient, Event, RelativeTime, RefEvent, Document, and RefDef objects linked by parent/child/routing relationships, with attributes such as Id, source, value, Name, BirthDate, Gender, Race, Type, ClinDate, Display, Observation value, Order, RefEvent, RelTime, EventType, EventTime, and PatientQual.]
Figure 4. Constraint duplication for consistent results under ES.
D. Intuitive UI
Data overload is often a complaint about today’s EMR.
In addition to presenting the patient data, we will also be
presenting data from tens to thousands of similar patients. The
cohort data must be summarized so that a visual comparison
can be made readily between the patient of interest and the
cohort. We extend the box-and-whisker statistical plots into
continuous contours to support continuous variables (Figure
5). This graphic allows the provider to quickly determine if
the trajectory of a patient’s clinical parameter is within
nominal bounds, based on a cohort of similar patients.
Figure 5. Construction of cohort contour graphs.
To build a specific real-time cohort, we allow the user to
create, move, and size constraint “boxes” over the graph of
the patient data. Each constraint box acts as a filter for
matching patients whose data intersect the “bounding box”
(Figure 6). A physician can decide when and where to place
such constraints to create a specific cohort based on his/her
clinical assessment, i.e. a customized, ad-hoc “patient-like-
mine” cohort to predict patient outcome and to select the best treatment options.
Figure 6. Impact of constraint creation on contours.
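As a rough illustration of the contour construction sketched in Figure 5, the percentile bands for a cohort could be computed along the following lines. This is a hypothetical Javascript sketch: the nearest-rank percentile, the data layout, and the helper names (contour, intersectsBox) are assumptions made for illustration, not the production implementation.

// Hypothetical sketch of cohort contour construction (Figure 5): align each
// similar patient's series to the landmark event, then take percentiles at
// each relative time point.
function percentile(sortedValues, p) {
  // nearest-rank percentile on an already-sorted array (illustrative choice)
  const idx = Math.min(sortedValues.length - 1, Math.ceil((p / 100) * sortedValues.length) - 1);
  return sortedValues[Math.max(0, idx)];
}

// cohortSeries: one array of { relTime, value } points per similar patient
function contour(cohortSeries, percentiles = [5, 10, 25, 75, 90, 95]) {
  const byTime = new Map();
  for (const series of cohortSeries)
    for (const pt of series) {
      if (!byTime.has(pt.relTime)) byTime.set(pt.relTime, []);
      byTime.get(pt.relTime).push(pt.value);
    }
  const bands = [];
  for (const [relTime, values] of [...byTime.entries()].sort((a, b) => a[0] - b[0])) {
    values.sort((a, b) => a - b);
    const band = { relTime };
    for (const p of percentiles) band["p" + p] = percentile(values, p);
    bands.push(band);
  }
  return bands;   // one row of percentile bounds per relative time point
}

// A constraint "box" keeps only patients whose data intersect it (Figure 6).
function intersectsBox(series, box) {
  return series.some(pt =>
    pt.relTime >= box.t0 && pt.relTime <= box.t1 &&
    pt.value >= box.lo && pt.value <= box.hi);
}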
IV. CONCLUSION
The objective of the EASE program is to improve patient
care and outcome. Our big-data approach creates a
transparent, interactive environment that enables a provider
to formulate a more specific plan for a given patient using real-
world evidence found in an institutional EMR. By using a
very flexible UI, the provider or team can also explore
“what-if” scenarios that would have previously taken a
statistical/database team and considerable time to develop.
However, the number of “similar” patients in any given
repository can be limited, reducing the robustness of this
recall-based paradigm. Because the system is built on a cloud-based architecture, this problem can be addressed simply by including patients from other institutions, thereby increasing the chance of finding patients that match a set of complex or rare clinical features.
Our next phase is to demonstrate clinical utility and
relevance through deployment in a controlled practice setting
that will provide systematic feedback for improvements and
define novel use cases.
ACKNOWLEDGMENT
This project is funded by Mayo Clinic Clinical Practice
Committee and the Office of Information and Knowledge
Management. We also thank the efforts of the many people
on the core EASE team and Mayo Clinic IT support staff.
REFERENCES
[1] D. W. Larson, J. K. Lovely, R. R. Cima, E. J. Dozois, H. Chua, B. G. Wolff, J. H. Pemberton, R. R. Devine, and M. Huebner, “Outcomes after implementation of a multimodal standard care pathway for laparoscopic colorectal surgery,” Br J Surg, vol. 101, no. 8, pp. 1023–1030, Jul. 2014. doi: 10.1002/bjs.9534.
[Figure 4 diagram labels: the Patient, Event, and RelativeTime objects of the schema (with attributes such as Id, source, value, Name, BirthDate, Gender, Race, Type, ClinDate, Display, Observation value, RefEvent, and RelTime) are marked with three constraint attachment points A, B, and C, whose clauses must be repeated in the pseudoquery below.]
ElasticSearch Pseudoquery:
{
  query: {
    <clause A>,
    child: { Event : {
      <clause B>,
      child: { RelativeTime : {
        <clause C>
      } }
    } }
  },
  inner : { Event : {
    query : {
      <clause B>,
      child : {
        <clause C>
      }
    },
    inner : { RelativeTime : {
      query : {
        <clause C>
      }
    } }
  } }
}
[Figure 5 annotations: our patient in May 2015 with a “landmark” event is compared against Patient A (2015-04), Patient B (2015-01), Patient C (2014-12), and Patient D (2015-03): 1. find others with the same event; 2. align to our time frame; 3. overlay; 4. calculate 5%, 10%, 25%, 75%, 90%, and 95%ile contours; 5. compare. Figure 6 annotation: “Add Constraint”.]