This document discusses dimensionality reduction techniques for genomic data. It begins with an introduction to genomics data science and the "curse of dimensionality" for high-dimensional data. Popular dimensionality reduction methods like PCA, ISOMap and t-SNE are then described. Specific use cases are discussed, such as using PCA to analyze population variations, ISOMap to infer cell populations from single-cell RNA-seq data, and t-SNE to visualize tissue expression profiles from the GTEx dataset. The presentation encourages exploring different dimensionality reduction methods and visualizing the results.
Dimensionality reduction and visualization techniques for high-dimensional genomic data - Data Science Conference DSC3.0 talk - Dusan Randjelovic
Biotechnology and genomics deal with sensitive information and intellectual property. Seven Bridges Genomics will protect the confidentiality of your data and proprietary approaches. Similarly, we look to you to protect our interests in our intellectual property. Seven Bridges Genomics does not accept any liability for information contained in this document. All information provided in this document is subject to change without notice. sevenbridges.com
Dimensionality reduction and visualization techniques for high-dimensional genomic data
Dusan Ranđelović
Bioinformatics Analyst, Seven Bridges
DATA SCIENCE CONFERENCE 3.0
Good morning!
Thank you all for coming to this talk.
My name is Dusan, and for the past 2 years I have been occasionally playing with some interesting genomics datasets.
I work as a Bioinformatics Analyst at a company called “Seven Bridges Genomics”, so by now you have probably guessed that this talk will be more on the science side of data science. I will, however, focus on some particular methods for the analysis of biological data which are quite interesting and widely applicable in general data science - primarily methods of dimensionality reduction.
I hope you'll find this overview helpful, and if you come to like computational biology and genomics after this talk, that's just a bonus :).
First, I'll talk about some specifics of genomics and introduce just enough cell biology for you to understand the use cases. Then I will talk about dimensionality reduction and showcase some high-dimensional genomics datasets.
There is a saying that a data scientist is a “person who is better at programming than any… ”, but in the case of genomics data science I would argue that those two disciplines are not enough. A genomics data scientist depends heavily on domain knowledge.
In order to understand the results of an analysis, to pose the right questions, or even to recognise features in a dataset, you need some basic knowledge of the processes in the cell and of how the data are generated. So domain knowledge is actually crucial here.
Apart from this, there is usually not just one approach to each study but several experiments on different levels (DNA, RNA, biomedical measurements, phenotype quantification, etc.), and this so-called multi-omics approach, so powerful for clinicians, gives headaches to genomics data scientists.
Another difficulty is that population-scale studies and per-sample studies deal with equally unknown phenomena: there are so many associations and correlations, but the discipline is young and there are not many theoretical models to help guide new analyses.
As I said, we will need some basic cell biology here, but I'll try to be brief :) - you probably remember most of this from high school anyway, right?
Cells of complex organisms have a nucleus containing long molecules called DNA, packed in chromosomes. DNA molecules serve as a blueprint for making other molecules that are involved in every function of a cell, like RNA molecules or, indirectly, proteins. The whole DNA material is called the genome, and some small regions of it are the famous genes.
What is important here is that DNA is structured and is the same in every cell. It is composed of billions of smaller molecules (adenine, thymine, cytosine and guanine) and can be represented as a long sequence of the letters A, C, T, G -> unique for each individual.
Overall, complex interactions of DNA, RNA, proteins and the environment make up what we call a phenotype -> physical characteristics, like eye color or a disease.
If we were somehow able to digitize these molecules, we would get a picture of the processes in a cell that cause diseases or are responsible for some phenotype. That digitization is possible and is called sequencing.
From a data science perspective: when you sequence a genome, the end result is a dataset that says at which positions among the 3 billion letters of your genome there is a variation or mutation - something different from a reference genome. You can see these millions of differences as features of a dataset to explore! Another common digitization is to count the RNA molecules transcribed from genes, which usually gives you datasets with tens of thousands of features.
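To make the shape of these datasets concrete, here is a minimal sketch with toy numbers (not from any real study) of how the two digitizations are usually laid out as feature matrices, samples as rows and variants or genes as columns:

```python
import numpy as np

# Hypothetical toy matrices illustrating the two common "digitizations":
# rows = samples, columns = features.

# Genotype matrix: each entry is the count of non-reference alleles (0, 1 or 2)
# at one of millions of variant positions.
genotypes = np.array([
    [0, 1, 2, 0, 1],   # sample 1
    [0, 0, 2, 1, 1],   # sample 2
    [2, 1, 0, 0, 0],   # sample 3
])

# Expression matrix: each entry is the number of RNA molecules (reads)
# counted for one of tens of thousands of genes.
expression = np.array([
    [  0, 153,  12,  870],
    [  3,  98,   0, 1024],
    [  1, 210,  45,  640],
])

print(genotypes.shape, expression.shape)  # (samples, variants), (samples, genes)
```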
Sequencing is what gave birth to genomics, the study of the whole genome, and since 2003 this technology has been outpacing Moore's law; it is currently one of the greatest sources of big data.
In order to profile complex biological processes we measure as much as we can. It's fortunate that we can measure all these features of a process at once - but with complexity and lots of features comes a Curse.
And it is called the Curse of dimensionality.
What is meant by this are some geometric and probabilistic consequences of dealing with a high-dimensional feature space.
For example: if we have some number of samples and we measure 1, 2, or 3 features of those same samples, we notice that by taking more features into account we are sampling from an increasingly large feature space - and our samples are less and less representative of that feature space. The main statement of the curse of dimensionality is that the number of samples needed grows exponentially with the number of features.
On the other hand - even if we have enough samples - there are some geometrical consequences of going higher in dimensionality.
In machine learning, whether unsupervised or supervised, we are usually interested in finding distances between data points in order to establish some similarity metric. And distances and neighbourhoods in a high-dimensional feature space are problematic. If you look at the right image you'll see that the largest circle inside a square covers about 78% of it, which can be seen as the largest neighbourhood inside a 2-dimensional feature space. If we increase the dimensionality, a sphere covers about 52% of a cube, and going further, a 10-dimensional hypersphere covers only about 0.25% of the 10-dimensional hypercube. This really changes the meaning of near and far, since most data points end up far away, in the corners.
So if we go to 100 or 1000 dimensions we have sparsity introduced merely by the geometry of such a space. Locality is also broken, and the number of samples needed grows rapidly. Additionally, most algorithms have some optimal number of features to work with, as seen in the classifier performance curve on the right.
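The shrinking hypersphere-to-hypercube ratio quoted above is easy to check; a short sketch of the computation, using the closed-form volume ratio pi^(d/2) / (Gamma(d/2 + 1) * 2^d) for the unit ball inscribed in [-1, 1]^d:

```python
import math

def ball_to_cube_ratio(d):
    """Fraction of the cube [-1, 1]^d covered by the inscribed unit ball:
    pi^(d/2) / (Gamma(d/2 + 1) * 2^d)."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in (2, 3, 10, 100):
    print(f"{d:>3} dims: {ball_to_cube_ratio(d):.3g}")
# 2 dims: 0.785   3 dims: 0.524   10 dims: 0.00249   100 dims: ~1.9e-70
```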
So what do we do to avoid the Curse? - We try to reduce the dimensions.
We can and should always try to reduce the dimensionality of a dataset if we suspect that the intrinsic dimensionality - the one that completely describes the effect we are measuring - is lower than the number of features. For example, in the image here you see small images of the letter A; since images are usually described by pixel values, we can imagine this dataset having, say, 64x64 features. But if we are interested in the transformations of A present in these pictures, we see that only 2 transformations are applied, scaling and rotation - so instead of 64x64 features we could have only 2 to represent the whole dataset.
Dimensionality reduction happens when we do feature selection - filtering only the interesting features - or feature engineering - constructing new, better features from the ones we measure.
It is sometimes imperative to reduce dimensions merely because of computational complexity or to compress the dataset.
But the most interesting purpose of dimensionality reduction is visualization and exploratory analysis.
Dimensionality reduction techniques are unsupervised machine learning techniques that learn an embedding of a high-dimensional dataset in lower dimensions, usually 2 or 3 if the aim is to explore and visualize the dataset.
The last note from the more formal explanation on the slide is about keeping the geometry intact as much as possible - that part is the hardest, since datasets can have weird topological or metric properties.
The number of dimensionality reduction methods and techniques is rising rapidly, especially the non-linear ones. Methods can be divided into linear and non-linear by the nature of the relationships among features, and into convex and non-convex depending on whether the objective function being optimized is convex. Another important distinction is between global and local methods: those that preserve global dissimilarities in a dataset and those that better preserve local similarity.
No matter the method - just as with clustering or classification - what we need among data points is some established similarity measure.
When we speak about similarity, two other terms usually come up: neighborhood and distance.
The neighborhood of a data point can be defined as all the points that fall within some radius (left image on the slide) or as a fixed number of nearest points. To measure the distances needed to compute neighborhoods, or to define a similarity measure, we have many choices. Euclidean distance in higher dimensions is the most common, but in some cases others are used, like geodesic distance or even non-metric similarities, as we will see in the examples.
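As an illustration of the two neighborhood definitions, here is a small sketch using scikit-learn on random data; the radius and k values are arbitrary, chosen only for demonstration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 hypothetical samples, 10 features

# Radius-based neighborhood: all points within distance r of a query point.
radius_nn = NearestNeighbors(radius=4.0, metric="euclidean").fit(X)
dist_r, idx_r = radius_nn.radius_neighbors(X[:1])

# k-nearest neighborhood: a fixed number of closest points.
knn = NearestNeighbors(n_neighbors=10, metric="euclidean").fit(X)
dist_k, idx_k = knn.kneighbors(X[:1])

print(len(idx_r[0]), idx_k.shape)   # points within the radius, (1, 10)
```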
Most non-linear techniques can be seen as finding a lower-dimensional manifold on which the data points lie. Similarity between data points is then measured along this manifold. Manifold learning is sometimes used as a synonym for dimensionality reduction.
As you saw, there are different criteria for dividing the techniques, and only one of the taxonomies is given here on the slide. Going further, I'll talk in more detail about PCA, ISOMap and t-SNE.
And as promised, I'll showcase these methods on some real genomics datasets.
The first one is principal component analysis done on the Simons Diversity dataset, as part of investigating the stratification of human populations by genomic variants or mutations.
The Simons Diversity dataset contains 300 genomes from 142 populations - 35 TB of raw and processed sequencing data. On Seven Bridges we host this complete dataset, and we have also reproduced published studies.
We know that different human populations share common phenotypes, which should also be visible at the genome level. If we take all the mutations/variations from the Simons dataset and reduce them to 2 dimensions based on how variable they are across samples, we should get similar samples clustered together. We use global dissimilarity between data points here.
What we actually did: we took only non-African samples, only one type of mutation, and only one chromosome - and we still got some nice separation in 2 dimensions! On the plot you can see different colors for different populations, but that is only for plotting; the analysis itself was unsupervised PCA. It was done on the Seven Bridges platform with the R/Bioconductor tool SNPRelate.
PCA is a linear technique that finds the directions along which the variance of the data is maximized (the so-called eigenvectors). The eigenvectors form the basis of the matrix M on the slide; they are all mutually orthogonal and can be found as solutions to the second equation. By decomposing the initial data matrix X this way we get ordered principal components: independent features in decreasing order of their contribution to the overall variance. How many principal components to keep for further analysis can be determined from the second plot here: the proportion of variance explained.
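A minimal sketch of this idea, assuming a plain NumPy implementation rather than the SNPRelate tool used in the actual analysis: center the data, eigendecompose the covariance matrix, project onto the leading eigenvectors, and report the proportion of variance explained.

```python
import numpy as np

def pca(X, n_components=2):
    """Toy PCA: project centered data onto the top eigenvectors of its
    covariance matrix and report the proportion of variance explained."""
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric -> real eigenpairs
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()          # proportion of variance explained
    return Xc @ eigvecs[:, :n_components], explained

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))                   # hypothetical samples x features
scores, explained = pca(X, n_components=2)
print(scores.shape, explained[:2])
```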
The second case is a bit more complex, since it deals with highly non-linear data. The goal is to infer cell populations from single-cell RNA-seq data.
In a single-cell RNA-seq experiment we look at RNA molecules and count them in each cell.
What you get after processing the raw sequencing data is a matrix like the one in the lower right corner: cells as columns and the different molecules that correspond to genes as rows.
If you remember me saying it: cells have the same DNA, but depending on cell type and function different cells have different RNAs, so we should be able to cluster cells by their expression profiles. Clustering is usually easier when done in a reduced space.
Multiple-hypothesis testing -> if you torture your data enough, it will confess
I used this scikit-learn-based framework to look at different projections of one particular single-cell study. What was challenging was to confirm the cell types after reduction and clustering; for that I looked at how well each cell's expression profile correlated with some known molecular pathway (blue-to-red scale on the right).
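The framework and study referred to on the slide are not reproduced here; the following is just a generic sketch, on a simulated count matrix, of how such projections (PCA, ISOMap, t-SNE) can be computed with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE

rng = np.random.default_rng(2)
# Hypothetical single-cell expression matrix: 300 cells x 100 genes,
# log-transformed counts (real studies have thousands of genes).
counts = rng.poisson(lam=2.0, size=(300, 100)).astype(float)
log_counts = np.log1p(counts)

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(log_counts),
    "Isomap": Isomap(n_neighbors=15, n_components=2).fit_transform(log_counts),
    "t-SNE": TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(log_counts),
}
for name, emb in embeddings.items():
    print(name, emb.shape)   # each cell embedded in 2 dimensions
```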
One non-linear technique capable of dealing with this kind of data is ISOMap, which uses geodesic distances along the manifold to model similarity between points. A vivid example of why this distance is more useful can be seen in the artificial dataset in the left image: x and y are near in the Euclidean sense, but far apart if you measure along the manifold.
And the last one is t-SNE, used to visualize tissue-specific expression profiles.
The dataset is somewhat similar to the single-cell RNA-seq one, but in this case we have RNAs from different tissues of hundreds of people. We expect to see tissues separated in some lower-dimensional space.
The original study used PCA and k-means in the low-dimensional space and hierarchical clustering in the high-dimensional space, but a t-SNE reanalysis showed better separation of tissues in t-SNE space.
t-SNE is a non-convex method, which means it will give slightly different results on every run. The similarity between data points is not even a metric; it is the conditional probability that one point would pick another nearby point as its neighbour. In the lower-dimensional space we try to preserve those probabilities, but with a Student's t-distribution instead of a Gaussian. t-SNE preserves local similarity between data points and is a very effective and popular technique, but it is difficult to interpret.
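For reference, the similarities described here are the standard ones from the original t-SNE paper (van der Maaten & Hinton, 2008): a Gaussian-based conditional probability in the high-dimensional space and a Student-t based probability in the embedding, matched by minimizing a KL divergence.

```latex
% High-dimensional similarities (Gaussian kernel with per-point bandwidth \sigma_i):
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Low-dimensional similarities (Student-t, one degree of freedom) and the cost:
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```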