This document describes a Spark solution for calculating the rank product of gene expression data across multiple studies. It contains sections on the author's biography, definitions of key concepts such as rank and rank product, a description of the input data format and storage, and an overview of the Spark-based algorithm, which computes the rank product in two steps: finding the mean per gene per study and then computing the overall rank product. The solution uses either groupByKey() or combineByKey() to aggregate the rank values by gene.
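As a rough illustration of that two-step flow, here is a minimal PySpark sketch (not the document's actual code; the record layout, toy values, and descending-rank convention are assumptions): per-study gene means are computed with combineByKey(), genes are ranked within each study after a groupByKey(), and each gene's per-study ranks are multiplied into a rank product.

    # Hypothetical sketch, not the author's implementation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rank-product-sketch").getOrCreate()
    sc = spark.sparkContext

    # Assumed record layout: (gene, (study, expression_value)); toy values only.
    records = sc.parallelize([
        ("g1", ("s1", 5.0)), ("g1", ("s1", 7.0)), ("g2", ("s1", 1.0)),
        ("g1", ("s2", 4.0)), ("g2", ("s2", 9.0)),
    ])

    # Step 1: mean expression per (study, gene) via combineByKey().
    means = (records
             .map(lambda kv: ((kv[1][0], kv[0]), kv[1][1]))        # ((study, gene), value)
             .combineByKey(lambda v: (v, 1),
                           lambda acc, v: (acc[0] + v, acc[1] + 1),
                           lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .mapValues(lambda s: s[0] / s[1]))

    # Step 2: rank genes within each study (rank 1 = highest mean), then
    # multiply each gene's per-study ranks to obtain its rank product.
    def rank_within_study(study_and_genes):
        study, genes = study_and_genes
        ordered = sorted(genes, key=lambda gv: -gv[1])
        return [(gene, position + 1) for position, (gene, _) in enumerate(ordered)]

    rank_products = (means
                     .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # (study, (gene, mean))
                     .groupByKey()
                     .flatMap(rank_within_study)                     # (gene, rank)
                     .reduceByKey(lambda a, b: a * b))               # product of ranks
    print(rank_products.collect())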
Hybrid Algorithm for Clustering Mixed Data Sets (IOSR Journals)
This document summarizes a hybrid algorithm for clustering mixed data sets that was proposed in reference [1]. The algorithm uses a genetic k-means approach to cluster both numeric and categorical data, overcoming limitations of other algorithms that can only handle one data type. It aims to minimize the total within-cluster variation to group similar objects. The selection operator uses proportional selection to determine the population for the next generation based on each solution's probability and fitness. The algorithm was reviewed, implemented in a prototype application, and found to improve performance compared to other related clustering algorithms like GKMODE and IGKA that also handle mixed data types.
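The proportional selection step mentioned above can be pictured with a short, generic roulette-wheel sketch in Python (a textbook illustration of the operator, not the paper's code; the candidate names and fitness values are made up):

    import random

    def proportional_selection(population, fitness, k):
        # Each candidate is chosen with probability proportional to its fitness.
        total = sum(fitness)
        weights = [f / total for f in fitness]
        return random.choices(population, weights=weights, k=k)

    # Toy example: four candidate clusterings with made-up fitness scores.
    print(proportional_selection(["c1", "c2", "c3", "c4"], [0.9, 0.5, 0.3, 0.1], k=4))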
This document summarizes a machine learning project to predict insurance claim severities for the Kaggle "Allstate Claims Severity" competition. It describes the dataset and preprocessing steps, including one-hot encoding and outlier removal. A deep neural network model was trained using H2O, and hyperparameters were optimized in 4 phases: activation function, network architecture, outliers/epochs, and learning rate. 21 model submissions achieved test MAEs from 1,169 to 1,292, outperforming random forest benchmarks. Rectifier activation, 3 hidden layers of 1,000 neurons total, removing outliers, 100 epochs, and a learning rate of 0.0001 produced strong results.
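For orientation only, a hedged sketch of how such a configuration might look with H2O's Python API (the file name, target column, and exact layer split are assumptions; this is not the project's code):

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()
    # Hypothetical training frame with a numeric target column named "loss".
    train = h2o.import_file("allstate_train.csv")
    features = [c for c in train.columns if c != "loss"]

    model = H2ODeepLearningEstimator(
        activation="Rectifier",       # activation chosen in phase 1
        hidden=[334, 333, 333],       # 3 layers, ~1,000 neurons total (assumed split)
        epochs=100,
        adaptive_rate=False,          # use a fixed learning rate instead of ADADELTA
        rate=0.0001,                  # learning rate from phase 4
        seed=42,
    )
    model.train(x=features, y="loss", training_frame=train)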
This document discusses enabling interoperability between finite element tools using the HDF5 data format. It describes extracting data like element connectivity, node coordinates, and material properties from an Abaqus output database and exporting it to an HDF5 file with metadata. This allows other tools to access and analyze the data in a standardized way. Future work includes fully transferring problem definitions and exporting images to HDF5 to enable greater integration between simulation programs.
Automated Software Requirements Labeling (Data Works MD)
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Machine Learning for Requirements Engineering
Speaker: Jon Patton
This project applies a number of machine learning, deep learning, and NLP techniques to solve challenging problems in requirements engineering.
Link to code and webpage:
http://shashankg7.github.io/word2graph2vec/
Link to slides:
http://www.slideshare.net/nprateek/predictive-text-embedding-using-line
Link to report:
https://www.overleaf.com/read/sqhkzfvjhfkp
A tutorial on how to create mappings using ontop, how inference (OWL 2 QL and RDFS) plays a role in answering SPARQL queries in ontop, and how ontop's support for on-the-fly SQL query translation enables scenarios of semantic data access and data integration.
Often information is spread among several data sources, such as hospital databases, lab databases, spreadsheets, etc. Moreover, the complexity of each of these data sources might make it difficult for end-users to access them, and even more so to query all of them at the same time.
A solution that has been proposed to this problem is ontology-based data access (OBDA). OBDA is a popular paradigm, developed since the mid-2000s, to query various types of data sources using a common vocabulary familiar to the end-users. In a nutshell, OBDA separates the user from the data sources (relational databases, CSV files, etc.) by means of an ontology, which is a common terminology that provides the user with a convenient query vocabulary, hides the structure of the data sources, and can enrich incomplete data with background knowledge. About a dozen OBDA systems have been implemented in both academia and industry.
In this tutorial we will give an overview of OBDA and our system, -ontop-, which is currently being used in the context of the European project Optique. We will discuss how to use -ontop- for data integration, in particular concentrating on:
– How to create an ontology (common vocabulary) for a life science domain.
– How to map available data sources to this ontology.
– How to query the database using the terms in the ontology.
– How to check consistency of the data sources w.r.t. the ontology.
The document describes a technique called STRICT that uses TextRank and POSRank algorithms to identify important terms from a software change task description to generate an effective initial search query. An experiment on 1,939 change tasks from 8 open source projects found that STRICT improved the query effectiveness in 57.84% of cases compared to baseline queries like title alone. STRICT also showed better retrieval performance based on metrics like mean average precision and mean recall compared to state-of-the-art techniques. The approach validates the use of graph-based ranking algorithms to address the challenge of generating relevant initial search queries from natural language change task descriptions.
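To make the graph-based ranking idea concrete, here is a generic TextRank-style sketch in Python using networkx (an illustration of the general technique only, not the STRICT implementation; the window size and sample sentence are made up):

    import itertools
    import networkx as nx

    def rank_terms(text, window=2, top_k=5):
        # Build a co-occurrence graph over the words of a change task description.
        words = [w.lower() for w in text.split() if w.isalpha()]
        graph = nx.Graph()
        for i in range(len(words) - window + 1):
            for u, v in itertools.combinations(words[i:i + window], 2):
                if u != v:
                    graph.add_edge(u, v)
        # Score terms with PageRank and keep the highest-ranked ones as the query.
        scores = nx.pagerank(graph)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    print(rank_terms("application crashes when saving large project files to disk"))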
Ontology-based data access: why it is so cool! (Josef Hardi)
A brief introduction to ontology-based data access (OBDA for short) and its core implementation. I also presented a recent simple benchmark between -ontop- and Semantika, the two most readily available OBDA frameworks, in terms of query performance (with details in the appendix section). The slides were presented at the Friday Research Meeting of the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons by Attribution 3.0
DCG and NDCG are evaluation measures used to evaluate the performance of ranking models. DCG measures the goodness of a ranking list based on the labels of each document, where documents with higher labels contribute more to the DCG score. NDCG normalizes the DCG score to range from 0 to 1. MAP measures the average precision of a ranking list where labels are binary as relevant or not. Kendall's Tau evaluates how closely the order of items in two lists match on a scale from -1 to 1.
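The measures above can be summarised with a small self-contained Python sketch (one common formulation of DCG is shown; other gain and discount variants exist, and the sample labels are made up):

    import math
    from itertools import combinations

    def dcg(labels):
        # Discounted cumulative gain of a ranked list of relevance labels.
        return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(labels))

    def ndcg(labels):
        # DCG normalised by the ideal (sorted) ordering, so the score lies in [0, 1].
        ideal = dcg(sorted(labels, reverse=True))
        return dcg(labels) / ideal if ideal > 0 else 0.0

    def kendall_tau(ranking_a, ranking_b):
        # Fraction of concordant minus discordant item pairs, in [-1, 1].
        pos_a = {item: i for i, item in enumerate(ranking_a)}
        pos_b = {item: i for i, item in enumerate(ranking_b)}
        concordant = discordant = 0
        for x, y in combinations(ranking_a, 2):
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
        pairs = len(ranking_a) * (len(ranking_a) - 1) / 2
        return (concordant - discordant) / pairs

    print(ndcg([3, 2, 3, 0, 1]), kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))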
Ontop: Answering SPARQL Queries over Relational Databases (Guohui Xiao)
We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance to all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.
This document discusses preparing data for analysis. It covers the need for data exploration including validation, sanitization, and treatment of missing values and outliers. The main steps in statistical data analysis are also presented. Specific techniques discussed include calculating frequency counts and descriptive statistics to understand the distribution and characteristics of variables in a loan data set with 250,000 observations. SAS procedures like Proc Freq, Proc Univariate, and Proc Means are demonstrated for exploring the data.
Using Interactive GA for Requirements Prioritization (Francis Palma)
The document discusses using an interactive genetic algorithm (IGA) approach for requirements prioritization. The IGA aims to minimize disagreement between a total order of prioritized requirements and various constraints, such as those encoded with the requirements or expressed by users. It considers user knowledge, minimizes user requests, and ensures robustness to errors. The IGA process involves acquiring requirements and domain knowledge, running an interactive genetic algorithm to compute solutions, and outputting the ranked requirements. It is demonstrated on a real-world case study where IGA produced improved prioritizations with reduced user effort compared to other approaches.
The document describes several alternative models for information retrieval, including fuzzy set models, extended Boolean models, generalized vector space models, latent semantic indexing models, neural network models, and Bayesian network models. It provides details on fuzzy set models that allow gradual membership in sets, extended Boolean models that combine Boolean queries with vector space characteristics, and Bayesian networks that use directed acyclic graphs and conditional probabilities.
1. The document summarizes a presentation about Apache Mahout, an open source machine learning library. It discusses algorithms like clustering, classification, topic modeling and recommendations.
2. It provides an overview of clustering Reuters documents using K-means in Mahout and demonstrates how to generate vectors, run clustering and inspect clusters.
3. It also discusses classification techniques in Mahout like Naive Bayes, logistic regression and support vector machines and shows code examples for generating feature vectors from data.
Enhanced Clustering Algorithm for Processing Online Data (IOSR Journals)
This document proposes an enhanced incremental clustering algorithm for processing online data. It discusses existing clustering algorithms like leader clustering and hierarchical clustering, which have limitations in handling dynamic data. The proposed algorithm aims to dynamically create initial clusters and rearrange clusters based on data characteristics, allowing for more accurate clustering of online data over time. It also introduces a new frequency-based method for searching and retrieving specific data from fixed clusters.
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski... (Sease)
Modern search engines have to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problems to deal with is finding the best k relevant documents for a given query. This operation has to be fast, and this is possible only by using specialised technologies.
BlockMax WAND is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I'm going to show a strategy introduced in "Faster BlockMax WAND with Variable-sized Blocks" (SIGIR 2017) which, applied to BlockMax WAND data, has made it possible to speed up the algorithm's execution by almost 2x.
Then another optimisation of the BlockMax WAND algorithm ("Faster BlockMax WAND with Longer Skipping", ECIR 2019) will be presented for reducing the execution time of short queries.
Reconstruction of a Complete Dataset from an Incomplete Dataset by ARA (Attri... (Waqas Tariq)
Preprocessing is a crucial step in a variety of data warehousing and mining tasks. Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. The accuracy of any mining algorithm greatly depends on the input data sets. Incomplete data sets have become almost ubiquitous in a wide variety of application domains. Common examples can be found in climate and image data sets, sensor data sets and medical data sets. The incompleteness in these data sets may arise from a number of factors: in some cases it may simply be a reflection of certain measurements not being available at the time; in others the information may be lost due to partial system failure; or it may simply be a result of users being unwilling to specify attributes due to privacy concerns. When a significant fraction of the entries are missing in all of the attributes, it becomes very difficult to perform any kind of reasonable extrapolation on the original data. For such cases, we introduce the novel idea of attribute weightage, in which we give a weight to every attribute for prediction of the complete data set from incomplete data sets, on which the data mining algorithms can be directly applied. The idea rests on assigning weights to attributes and finally averaging them. We demonstrate the effectiveness of the approach on a variety of real data sets. This paper describes a theory and implementation of a new filter, ARA (Attribute Relation Analysis), for the WEKA workbench, for finding the complete dataset from an incomplete dataset.
This document describes an approach for supporting software change tasks using automated query reformulations. It begins with an example of a software change request between Alex and Bob. It then discusses using techniques like TextRank and POSRank, adapted from PageRank, to identify important terms from change requests for querying a codebase. The approach was evaluated on a dataset of over 1,900 change tasks from eight open source projects. Experimental results found the proposed approach improved 57.84% of queries and outperformed existing state-of-the-art methods based on measures like query effectiveness and retrieval performance.
This document discusses developing classifiers for a census income dataset using the WEKA data mining tool. It summarizes preprocessing steps like handling missing values, removing outliers, and balancing class distributions. It then evaluates various classifiers like J48 decision trees and k-nearest neighbors (IBk) on the preprocessed data. The best performing model was an ensemble "vote" classifier that combined the predictions of J48, IBk, and logistic regression models, achieving 87.3% accuracy and an ROC area of 0.947.
Intelligent query converter: a domain independent interface for conversion (IAEME Publication)
This document describes an "Intelligent Query Converter" (IQC) system that converts natural language queries in English to SQL queries. IQC uses semantic matching techniques and WordNet to match queries to database schema without requiring domain-specific configuration. The system classifies queries based on presence of value keywords and conjunctive clauses. Experiments show IQC correctly answered 64.5-87.9% of queries for different databases, outperforming Microsoft's English Query system which answered correctly 29.0-51.8% of time. IQC provides a domain-independent natural language interface to databases using semantic analysis techniques.
The document presents a new technique called ACER for automated query reformulation to improve concept location in source code. ACER uses a novel term weighting method called CodeRank that considers source code semantics and structure. It selects the best reformulation from candidates generated by CodeRank using machine learning on resampled data. An experimental evaluation on 1,675 change requests from 8 projects found that ACER improved 71% of queries and decreased mean rank, outperforming previous methods. The results validate that exploiting source code structure and semantics through CodeRank and data resampling in ACER enhances query reformulation over traditional techniques.
This document discusses Docker, a tool that allows users to run applications securely isolated in a container that runs on the host operating system. It explores key Docker concepts like images, containers, repositories and how they work. It also provides examples of common Docker commands to pull, run, stop and manage images and containers.
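The document itself shows CLI commands; as a complementary, hedged illustration, the same pull/run/stop lifecycle can be driven from Python through the Docker SDK (docker-py), assuming a local Docker daemon and the public alpine image:

    import docker

    client = docker.from_env()                      # connect to the local Docker daemon

    client.images.pull("alpine")                    # pull an image (like `docker pull`)
    container = client.containers.run(              # run a container (like `docker run -d`)
        "alpine", "sleep 30", detach=True)

    print([c.name for c in client.containers.list()])  # list running containers

    container.stop()                                # stop and clean up
    container.remove()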
Srikanth Kodeboyina completed the DEV 360 - Apache Spark Essentials course offered by MapR Academy on December 12, 2015. A certificate of completion was issued with the certificate number 56vho9jk85o3, which can be verified online at http://verify.skilljar.com/c/56vho9jk85o3.
This is the first part of a Spark 2 new features overview.
This part covers API changes, Structured Streaming, Encoders, memory management in Spark, Tungsten issues, and Catalyst features.
Spark 2.0 will be with us within the next few months; our beloved parallel-processing framework is going to undergo a big change, and we must be ready to face it.
For months the community and Databricks have been working on Spark to address all user requests; they have worked hard to develop powerful utilities and to keep Spark the number one Big Data tool.
The document discusses using the OpenStack command line interface (CLI) to manage an OpenStack platform. It provides examples of commands for various OpenStack services like Keystone, Glance, Neutron, Nova, Cinder, Swift, Heat, and Ceilometer. The commands can be used to create, delete, modify, query and display resources and include examples like keystone user create, neutron network create, nova boot instance, and ceilometer meter list.
Use of Spark for proteomic scoring, Seattle presentation (lordjoe)
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
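Two of the practical points mentioned above, accumulators for debugging and explicit partitioning before an expensive step, can be sketched generically in PySpark (this is not the author's Comet port; the spectrum data, scoring function, and partition counts are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="scoring-sketch")
    scored_count = sc.accumulator(0)        # driver-visible counter for debugging

    def score(spectrum):
        scored_count.add(1)                 # incremented on the workers
        return (spectrum, len(spectrum) * 0.1)   # placeholder "score", not Comet

    spectra = sc.parallelize(["spectrum_%d" % i for i in range(1000)], 8)
    results = spectra.repartition(64).map(score).collect()   # spread work before scoring
    print("spectra scored:", scored_count.value)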
Java BigData Full Stack Development (version 2.0), by Alexey Zinoviev
This document is a presentation by Alexey Zinovyev about Java Big Data full stack development. It discusses Alexey's background and contacts, required skills for Java Big Data development like SQL, Linux, Java and backend skills. It then covers topics like NoSQL databases, Hadoop, Spark, machine learning with MLlib and deep learning. It provides different ways to learn these topics including books, online courses, conferences and mentoring. It encourages learning through hands-on projects and recommends starting with tools like Weka, MongoDB, Hadoop and AWS.
London Spark Meetup: Project Tungsten, Oct 12 2015 (Chris Fregly)
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
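A minimal PySpark sketch of the RDD model described above, with transformations building a lazy lineage and an action triggering execution (the sample lines are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-intro")
    lines = sc.parallelize(["spark makes iterative jobs fast",
                            "rdds can be cached in memory"])

    word_counts = (lines.flatMap(lambda line: line.split())   # transformation
                        .map(lambda word: (word, 1))          # transformation
                        .reduceByKey(lambda a, b: a + b)       # transformation
                        .cache())                              # keep results in memory

    print(word_counts.collect())                               # action: runs the job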
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop (Databricks)
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bringing execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation/vectorization that have been instrumental in improving CPU efficiency and gaining performance.
This document summarizes a presentation about unit testing Spark applications. The presentation discusses why it is important to run Spark locally and as unit tests instead of just on a cluster for faster feedback and easier debugging. It provides examples of how to run Spark locally in an IDE and as ScalaTest unit tests, including how to create test RDDs and DataFrames and supply test data. It also discusses testing concepts for streaming applications, MLlib, GraphX, and integration testing with technologies like HBase and Kafka.
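As a small illustration of the local-testing approach described above, here is a pytest-style sketch that spins up a local SparkSession and tests one DataFrame transformation (it does not use the spark-testing-base package; the function and column names are made up):

    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import concat_ws

    @pytest.fixture(scope="session")
    def spark():
        # A small local session is enough for fast unit-test feedback.
        return (SparkSession.builder.master("local[2]")
                .appName("unit-tests").getOrCreate())

    def add_full_name(df):
        return df.withColumn("full_name", concat_ws(" ", "first", "last"))

    def test_add_full_name(spark):
        df = spark.createDataFrame([("Ada", "Lovelace")], ["first", "last"])
        row = add_full_name(df).collect()[0]
        assert row["full_name"] == "Ada Lovelace"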
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... (Chris Fregly)
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
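A short sketch of the unified load/save functions built on the Data Source API (the file paths and options here are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

    # Load CSV data through the Data Source API...
    df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/tmp/flights.csv"))

    # ...and save the same DataFrame in another format with the unified writer.
    df.write.format("parquet").mode("overwrite").save("/tmp/flights.parquet")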
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer (Databricks)
Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames, Datasets, to streaming. At its core, Catalyst is a general library for manipulating trees. Based on this library, we have built a modular compiler frontend for Spark, including a query analyzer, optimizer, and an execution planner. In this talk, I will first introduce the concepts of Catalyst trees, followed by major features that were added in order to support Spark’s powerful API abstractions. Audience will walk away with a deeper understanding of how Spark 2.0 works under the hood.
Spark 2.0 is a major release of Apache Spark that brings many changes to Spark's APIs and libraries. In this KnolX we will look at some of the improvements made in Spark 2.0. These slides also give an introduction to new features in Spark 2.0 such as the SparkSession API and Structured Streaming.
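For context, a minimal sketch of the Spark 2.0 entry point and a Structured Streaming query (the socket source, host, and port are placeholders for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark2-intro").getOrCreate()   # unified entry point

    lines = (spark.readStream.format("socket")        # streaming DataFrame
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    counts = lines.groupBy("value").count()            # same DataFrame API as batch

    query = (counts.writeStream.outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()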
NOTE: This was converted to Powerpoint from Keynote. Slideshare does not play the embedded videos. You can download the powerpoint from slideshare and import it into keynote. The videos should work in the keynote.
Abstract:
In this presentation, we will describe the "Spark Kernel" which enables applications, such as end-user facing and interactive applications, to interface with Spark clusters. It provides a gateway to define and run Spark tasks and to collect results from a cluster without the friction associated with shipping jars and reading results from peripheral systems. Using the Spark Kernel as a proxy, applications can be hosted remotely from Spark.
Robust and Scalable ETL over Cloud Storage with Apache Spark (Databricks)
The majority of reported Spark deployments are now in the cloud. In such an environment, it is preferable for Spark to access data directly from services such as Amazon S3, thereby decoupling storage and compute. However, there are limitations to object stores such as S3. Chained or concurrent ETL jobs often run into issues on S3 due to inconsistent file listings and the lack of atomic rename support. Metadata performance also becomes an issue when running jobs over many thousands to millions of files.
Speaker: Eric Liang
This talk was originally presented at Spark Summit East 2017.
Keeping Spark on Track: Productionizing Spark for ETL (Databricks)
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
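To ground the ETL pattern described in that abstract, here is a hedged PySpark sketch of reading semi-structured input, dropping obviously dirty rows, and persisting the result in a compact binary format (the paths, the event_id column, and the JSON source are assumptions, not the talk's code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    raw = spark.read.json("/tmp/events.json")          # hypothetical semi-structured input

    clean = (raw.dropDuplicates()                      # simple dirty-data handling
                .na.drop(subset=["event_id"]))         # drop rows missing a required field

    # Persist as Parquet so downstream Spark jobs can read it efficiently.
    clean.write.mode("append").parquet("/tmp/events_parquet")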
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but an example of one meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Incubating Airflow, MongoDB, ElasticSearch, Apache Parquet, Python/Flask, JQuery. This talk will cover the full lifecycle of large data application development and will show how to use lessons from agile software engineering to apply data science using this full-stack to build better analytics applications. The entire lifecycle of big data application development is discussed. The system starts with plumbing, moving on to data tables, charts and search, through interactive reports, and building towards predictions in both batch and realtime (and defining the role for both), the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
Product recommendation system for an e-commerce site, by Koby KARP, Data Scientist (Equancy) & Hervé MIGNOT, Partner at Equancy
Recommendation remains a key tool for personalizing e-commerce sites, and the subject is far from exhausted. Taking the specifics of a market into account may require adapting the processing and the algorithms used. After a review of recommendation techniques, we present the specific approach we adopted. The system was developed on Spark for data preparation and for computing the recommendation models. A simple API and its service were developed to deliver the recommendations to client applications.
The document describes a dataset containing on-time performance records for 95% of commercial flights in the United States. It includes over 30 fields of information for each flight such as airline, departure/arrival times, delays, distances, and causes of delays. An example record from the dataset is shown containing values for many of the fields.
Agile Data Science 2.0 covers the theory and practice of applying agile methods to the practice of applied analytics research called data science. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" to enable agility in building applications describing the applied research process itself. Then we show how to use 'big data' tools to iteratively build, deploy and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to build applications from conception through development through deployment and then through iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, we show you how to effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, ElasticSearch and Apache Airflow.
Fast Data Intelligence in the IoT - real-time data analytics with Spark (Bas Geerdink)
This document discusses fast data intelligence in the Internet of Things using real-time data analytics with Apache Spark Streaming and MLlib. It describes how the IoT is generating more streaming data from multiple sources, requiring new techniques for fast and scalable processing. Common patterns like the Lambda architecture and reactive principles are outlined. Tools like Apache Kafka, Apache Cassandra, and Apache Spark are presented for building architectures to handle streaming data flows and perform tasks like machine learning. Key takeaways are that reactive and scalable systems are needed to process IoT data streams, and open source tools can help achieve this.
Building Learning to Rank (LTR) search reranking models using Large Language ... (Sujit Pal)
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close – however, it is important to note that the heuristics were built up over months of trial and error and required familiarity of the search domain, whereas the LTR models were built in days and required much less familiarity.
Tactical Data Science Tips: Python and Spark Together (Databricks)
This document summarizes a talk given by Bill Chambers on processing data with Spark and Python. It discusses 5 ways to process data: RDDs, DataFrames, Koalas, UDFs, and pandasUDFs. It then covers two data science use cases - growth forecasting and churn prediction - and how they were implemented using these different processing methods based on characteristics like the number of input rows, features, and required models. The talk emphasizes using DataFrames and pandasUDFs for optimal performance and flexibility. It also highlights tracking models with MLFlow for consistency in production.
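One of the processing options mentioned above, pandas UDFs, can be illustrated with a short hedged sketch (the column names and the tax computation are made up; this is not the talk's code):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
    df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "amount"])

    @pandas_udf(DoubleType())
    def add_tax(amount: pd.Series) -> pd.Series:
        # Runs on whole pandas batches rather than row by row, unlike a plain UDF.
        return amount * 1.2

    df.withColumn("amount_with_tax", add_tax("amount")).show()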
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings (Spark Summit)
The document describes eBay's use of Spark and MLlib to build a scalable machine learning pipeline for metadata discovery from product listings. eBay developed supervised machine learning models using Spark MLlib to classify product listings and discover new metadata brands with high accuracy. The models were trained on historical metadata and product data, and can process large amounts of data quickly to discover new metadata on a daily basis, speeding up metadata discovery from months to days. Spark MLlib allows for efficient model development locally and simple deployment to production at scale.
Investigating the Quality Aspects of Crowd-Sourced Developer Forum: A Case St... (University of Saskatchewan)
- The document discusses investigating quality aspects of crowd-sourced developer forums using Stack Overflow as a case study. It aims to analyze the usability, parsability, and reproducibility of code snippets on Stack Overflow.
- An exploratory study found that around 67% of issues reported in Stack Overflow questions could be reproduced, but many faced challenges like missing external libraries or important code parts. Questions with reproducible issues were more likely to receive accepted answers.
- The document proposes guidelines for Stack Overflow users to improve question quality and issue reproducibility by avoiding redundancy, managing dependencies, including executable code for debugging, and focusing each question on a single issue.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It carries the slides for the talk I gave on distributed deep learning over Spark.
This session talks about how unit testing of Spark applications is done, as well as tells the best way to do it. This includes writing unit tests with and without Spark Testing Base package, which is a spark package containing base classes to use when writing tests with Spark.
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ... (Spark Summit)
This document discusses how Spark can serve as a gateway to typed functional programming by introducing developers to Scala and functional programming concepts. It covers how Spark leverages concepts like lazy evaluation, immutable data structures, higher-order functions, and types to provide a powerful and scalable framework. While complex, these functional techniques help tackle challenges of scale and complexity that arise in real-world AI systems.
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt... (Databricks)
Deep Reinforcement Learning (DRL) is a thriving area in the current AI battlefield. AlphaGO by DeepMind is a very successful application of DRL which has drawn the attention of the entire world. Besides playing games, DRL also has many practical uses in industry, e.g. autonomous driving, chatbots, financial investment, inventory management, and even recommendation systems. Although DRL applications have something in common with supervised Computer Vision or Natural Language Processing tasks, they are unique in many ways.
For example, they have to interact with (explore) the environment to obtain training samples along the optimization, and the method to improve the model is usually different from common supervised applications. In this talk we will share our experience of building Deep Reinforcement Learning applications on BigDL/Spark. BigDL is a well-developed deep learning library on Spark which is handy for Big Data users, but it has been mostly used for supervised and unsupervised machine learning. We have made extensions particularly for DRL algorithms (e.g. DQN, PG, TRPO and PPO), implemented classical DRL algorithms, built applications with them and did performance tuning. We are happy to share what we have learnt during this process.
We hope our experience will help our audience learn how to build an RL application on their own for their production business.
Privacy-Preserving Multi-Keyword Fuzzy Search over Encrypted Data in the Cloud (Mateus S. H. Cruz)
The document proposes a method for privacy-preserving multi-keyword fuzzy search over encrypted data. It uses Bloom filters to represent encrypted indexes and queries, and locality sensitive hashing functions to allow fuzzy matching of keywords. An inner product calculation is used to determine similarity between encrypted indexes and queries. The proposal includes an enhanced scheme that adds a pseudorandom function for additional security against background knowledge attacks. Experiments demonstrate the performance and accuracy of the approach.
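A toy Python sketch of the two building blocks mentioned above, a Bloom-filter index per document and bigram features as a simple stand-in for locality sensitive hashing, with similarity estimated via the filters' inner product (this illustrates the idea only and is not the paper's scheme; the filter size, hash count, and keywords are made up):

    import hashlib

    FILTER_BITS = 256       # Bloom filter size (made up)
    NUM_HASHES = 3          # number of hash functions (made up)

    def bigrams(word):
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def bloom(features):
        bits = [0] * FILTER_BITS
        for feature in features:
            for k in range(NUM_HASHES):
                digest = hashlib.sha1(f"{k}:{feature}".encode()).hexdigest()
                bits[int(digest, 16) % FILTER_BITS] = 1
        return bits

    def inner_product(a, b):
        # Higher values mean more shared bits, i.e. more likely keyword overlap.
        return sum(x * y for x, y in zip(a, b))

    index = bloom(bigrams("network") | bigrams("security"))
    query = bloom(bigrams("netwrok"))        # misspelled query still shares most bigrams
    print(inner_product(index, query))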
1) The document discusses concepts related to computer programming including classes, objects, and Visual Studio projects and solutions. It provides examples of classes like Car and Cat and demonstrates how to declare classes, add methods, and instantiate objects in C#.
2) It also summarizes key differences between classes and objects, and describes how to add a class to a C# project. Basic data types and built-in classes like Console and String are also covered.
3) The document concludes with explanations of what a Visual Studio project and solution are, noting that a project contains source code and settings while a solution combines related projects.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the past year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Spark Solution for Rank Product
1. Spark Solution
for
Rank Product
Mahmoud Parsian
Ph.D in Computer Science
Senior Architect @ Illumina (www.illumina.com)
August 25, 2015
2. Table of Contents
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
3. Biography
Outline
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
4. Biography
Who am I?
Name: Mahmoud Parsian
Education: Ph.D in Computer Science
Work: Senior Architect @Illumina, Inc
Lead Big Data Team @Illumina
Develop scalable regression algorithms
Develop DNA-Seq and RNA-Seq workflows
Use Java/MapReduce/Hadoop/Spark/HBase
Author of 3 books
Data Algorithms (O’Reilly:
http://shop.oreilly.com/product/0636920033950.do/)
JDBC Recipes (Apress: http://apress.com/)
JDBC MetaData Recipes (Apress: http://apress.com/)
5. What is Rank Product?
Outline
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
6. What is Rank Product?
Problem Statement: Data Algorithms Book
Bonus Chapter: http://shop.oreilly.com/product/0636920033950.do
Source code: https://github.com/mahmoudparsian/data-algorithms-book
7. What is Rank Product?
Problem Statement
Given a large set of studies, where each study is a set of (Key, Value)
pairs
The goal is to find the Rank Product of all given Keys for all studies.
Magnitude of this data is challenging to store and analyze:
Several hundreds of studies
Each study has billions of (Key, Value) pairs
Find ”rank product” for all given Keys
9. What is Rank Product?
Magnitude of Data per Analysis
100’s of studies
Each study may have billions of (K, V) pairs
Example: 300 studies
Each study: 2,000,000,000 (K, V) pairs
Analyze: 600,000,000,000 (K, V) pairs
11. Basic Definitions
Outline
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
12. Basic Definitions
Some Basic Definitions
Study is a set of (K, V) pairs
Example-1: K=geneID, V=geneValue
Example-2: K=userID, V=user’s followers in social networks
Example-3: K=bookID, V=book rating by users
14. Basic Definitions
What is a Ranking?
Let S = {(K1, 40), (K2, 70), (K3, 90), (K4, 80)}
Then Rank(S) = {(K1, 4), (K2, 3), (K3, 1), (K4, 2)}
Since 90 > 80 > 70 > 40
Ranks are assigned as: 1, 2, 3, 4, ..., N
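To make the rank assignment concrete, here is a minimal plain-Java sketch (not part of the deck; the class and variable names are only illustrative) that reproduces the example above:

import java.util.*;

public class RankDemo {
    public static void main(String[] args) {
        Map<String, Double> s = new HashMap<>();
        s.put("K1", 40.0); s.put("K2", 70.0); s.put("K3", 90.0); s.put("K4", 80.0);

        // sort the (key, value) pairs by value, descending
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(s.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        // assign ranks 1, 2, ..., N (1 goes to the largest value)
        Map<String, Integer> rank = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            rank.put(sorted.get(i).getKey(), i + 1);
        }
        System.out.println(rank); // {K3=1, K4=2, K2=3, K1=4}
    }
}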
15. Basic Definitions
What is Rank Product?
Let {A1, ..., Ak} be a set of datasets of (key, value) pairs, where keys are
unique within each dataset.
Example of (key-value) pairs:
(K,V) = (item, number of items sold)
(K,V) = (user, number of followers for the user)
(K,V) = (gene, test expression)
Then the rank product of {A1, ..., Ak} is computed from the ranks ri of each
key i across all k datasets. Typically, ranks are assigned based on the
sorted values within each dataset.
16. Basic Definitions
What is Rank Product?
Let A1 = {(K1, 30), (K2, 60), (K3, 10), (K4, 80)},
then Rank(A1) = {(K1, 3), (K2, 2), (K3, 4), (K4, 1)}
since 80 > 60 > 30 > 10
Note that 1 is the highest rank (assigned to the largest value).
Let A2 = {(K1, 90), (K2, 70), (K3, 40), (K4, 50)},
Rank(A2) = {(K1, 1), (K2, 2), (K3, 4), (K4, 3)}
since 90 > 70 > 50 > 40
Let A3 = {(K1, 4), (K2, 8)}
Rank(A3) = {(K1, 2), (K2, 1)}
since 8 > 4
The rank product of {A1, A2, A3} is expressed as:
{(K1, \sqrt[3]{3 \times 1 \times 2}), (K2, \sqrt[3]{2 \times 2 \times 1}), (K3, \sqrt{4 \times 4}), (K4, \sqrt{1 \times 3})}
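Evaluating these expressions gives RP(K1) = \sqrt[3]{6} ≈ 1.817, RP(K2) = \sqrt[3]{4} ≈ 1.587, RP(K3) = \sqrt{16} = 4.0, and RP(K4) = \sqrt{3} ≈ 1.732.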
17. Basic Definitions
Calculation of the Rank Product
Given n genes and k replicates, let e_{g,i} be the fold change and r_{g,i} the
rank of gene g in the i-th replicate.
Compute the rank product (RP) via the geometric mean:
RP(g) = \left( \prod_{i=1}^{k} r_{g,i} \right)^{1/k} = \sqrt[k]{\prod_{i=1}^{k} r_{g,i}}
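For illustration only (this helper is not part of the deck), the same geometric mean can be computed in log space, which also avoids overflowing a long when many large ranks are multiplied together:

// Hypothetical helper: rank product of one gene's ranks across k replicates.
// Summing logs and exponentiating is equivalent to the k-th root of the product.
static double rankProduct(long[] ranks) {
    double logSum = 0.0;
    for (long r : ranks) {
        logSum += Math.log((double) r);
    }
    return Math.exp(logSum / ranks.length);
    // e.g. rankProduct(new long[]{3, 1, 2}) ≈ 1.817
}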
18. Input Data
Outline
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
19. Input Data
Input Data Format
Set of k studies {S1, S2, ..., Sk}
Each study has billions of (Key, Value) pairs
Sample Record:
<key-as-string><,><value-as-double-data-type>
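For illustration, a few hypothetical records in this format (the gene IDs and values below are made up):

g1,1.85
g2,-0.42
g3,0.07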
20. Input Data
Input Data Persistence
Data persists in HDFS
Directory structure:
/input/study-001/file-001-1.txt
/input/study-001/file-001-2.txt
/input/study-001/file-001-3.txt
...
/input/study-235/file-235-1.txt
/input/study-235/file-235-2.txt
/input/study-235/file-235-3.txt
...
21. Input Data
Formalizing Rank Product
Let S = {S1, S2, ..., Sk} be a set of k studies, where k > 0 and each
study represents a microarray experiment
Let Si (i = 1, 2, ..., k) be a study, which has an arbitrary number of
assays identified by {Ai1, Ai2, ...}
Let each assay (which can be represented as a text file) be a set of an
arbitrary number of records in the following format:
<gene_id><,><gene_value_as_double_data-type>
Let gene_id be in {g1, g2, ..., gn} (we have n genes).
22. Input Data
Rank Product: in 2 Steps
Let S = {S1, S2, ..., Sk} be a set of k studies:
STEP-1: find the mean of values per study per gene
you may replace the ”mean” function by your desired function
finding mean involves groupByKey() or combineByKey()
STEP-2: perform the ”Rank Product” per gene across all studies
finding ”Rank Product” involves groupByKey() or combineByKey()
23. Input Data
Formalizing Rank Product
The last step is to find the rank product of each gene across all studies:
S1 = {(g1, r_{1,1}), (g2, r_{1,2}), ...}
S2 = {(g1, r_{2,1}), (g2, r_{2,2}), ...}
...
Sk = {(g1, r_{k,1}), (g2, r_{k,2}), ...}
then the Rank Product of gene gj is
RP(g_j) = \left( \prod_{i=1}^{k} r_{i,j} \right)^{1/k} = \sqrt[k]{\prod_{i=1}^{k} r_{i,j}}
24. Input Data
Spark Solution for Rank Product
1 Read k input paths (each path represents a study, which may have
any number of assay text files).
2 Find the mean per gene per study
3 Sort the genes by value per study and then assign rank values; To sort
the dataset by value, we will swap the key with value and then
perform the sort.
4 Assign ranks from 1, 2, ..., N (1 is assigned to the highest value)
we use JavaPairRDD.zipWithIndex(), which zips the RDD with its
element indices (these indices will be the ranks). Spark indices start
from 0, so we add 1 when computing the rank product.
5 Finally compute the Rank Product per gene for all studies. This can
be accomplished by grouping all ranks by the key (we may use
JavaPairRDD.groupByKey() or JavaPairRDD.combineByKey() –
note that, in general, JavaPairRDD.combineByKey() is more
efficient than JavaPairRDD.groupByKey()).
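The sketch below (assembled from the listings that appear later in this deck) shows how the five steps wire the helper methods together:

// Sketch only: steps 1-5 using the helpers defined in the later listings.
int K = inputPathMultipleStudies.size();
JavaPairRDD<String, Double>[] means = new JavaPairRDD[K];
for (int i = 0; i < K; i++) {
    // steps 1-2: read one study and compute the mean per gene
    means[i] = computeMeanByGroupByKey(context, inputPathMultipleStudies.get(i));
}
JavaPairRDD<String, Long>[] ranks = new JavaPairRDD[K];
for (int i = 0; i < K; i++) {
    // steps 3-4: sort by value and assign ranks 1, 2, ..., N
    ranks[i] = assignRank(means[i]);
}
// step 5: rank product per gene across all studies
JavaPairRDD<String, Tuple2<Double, Integer>> rankProducts =
    computeRankedProducts(context, ranks);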
29. Input Data
Two Spark Solutions: groupByKey() and combineByKey()
Two solutions are provided using Spark-1.4.0:
SparkRankProductUsingGroupByKey
uses groupByKey()
SparkRankProductUsingCombineByKey
uses combineByKey()
30. Input Data
How does groupByKey() work
31. Input Data
How does reduceByKey() work
32. Rank Product Algorithm
Outline
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
33. Rank Product Algorithm
Rank Product Algorithm in Spark
Algorithm: High-Level Steps
Step Description
STEP-1 import required interfaces and classes
STEP-2 handle input parameters
STEP-3 create a Spark context object
STEP-4 create list of studies (1, 2, ..., K)
STEP-5 compute mean per gene per study
STEP-6 sort by values
STEP-7 assign rank
STEP-8 compute rank product
STEP-9 save the result in HDFS
34. Rank Product Algorithm
Main Driver
Listing 1: performRrankProduct()
1 public static void main(String[] args) throws Exception {
2 // args[0] = output path
3 // args[1] = number of studies (K)
4 // args[2] = input path for study 1
5 // args[3] = input path for study 2
6 // ...
7 // args[K+1] = input path for study K
8 final String outputPath = args[0];
9 final String numOfStudiesAsString = args[1];
10 final int K = Integer.parseInt(numOfStudiesAsString);
11 List<String> inputPathMultipleStudies = new ArrayList<String>();
12 for (int i=1; i <= K; i++) {
13 String singleStudyInputPath = args[1+i];
14 inputPathMultipleStudies.add(singleStudyInputPath);
15 }
16 performRrankProduct(inputPathMultipleStudies, outputPath);
17 System.exit(0);
18 }
35. Rank Product Algorithm
groupByKey() vs. combineByKey()
Which one should we use? combineByKey() or groupByKey()?
According to their semantics, both will give you the same answer.
But combineByKey() is more efficient.
In some situations, groupByKey() can even cause out-of-disk
problems. In general, reduceByKey() and combineByKey() are
preferred over groupByKey().
Spark shuffling is more efficient for reduceByKey() than for
groupByKey(), and the reason is this: in the shuffle step for
reduceByKey(), data is combined so that each partition outputs at most
one value for each key to send over the network, while in the shuffle step
for groupByKey(), all the data is wastefully sent over the network
and collected on the reduce workers.
To understand the difference, the following figures show how the
shuffle is done for reduceByKey() and groupByKey()
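As a small illustration (not from the deck, and assuming Java 8 lambdas), here is the same per-key sum written both ways; with reduceByKey() the values are combined within each partition before anything is shuffled:

// counts is assumed to be a JavaPairRDD<String, Integer>
// groupByKey(): every value is shipped across the network, then summed
JavaPairRDD<String, Integer> sums1 = counts.groupByKey().mapValues(values -> {
    int sum = 0;
    for (int v : values) { sum += v; }
    return sum;
});

// reduceByKey(): partial sums are computed per partition, so far less data is shuffled
JavaPairRDD<String, Integer> sums2 = counts.reduceByKey((a, b) -> a + b);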
37. Rank Product Algorithm
Understanding reduceByKey() or combineByKey()
38. Rank Product Algorithm
Listing 2: performRrankProduct()
1 public static void performRrankProduct(
2 final List<String> inputPathMultipleStudies,
3 final String outputPath) throws Exception {
4 // create a context object, which is used
5 // as a factory for creating new RDDs
6 JavaSparkContext context = Util.createJavaSparkContext(useYARN);
7
8 // Spark 1.4.0 requires an array for creating union of many RDDs
9 int index = 0;
10 JavaPairRDD<String, Double>[] means =
11 new JavaPairRDD[inputPathMultipleStudies.size()];
12 for (String inputPathSingleStudy : inputPathMultipleStudies) {
13 means[index] = computeMeanByGroupByKey(context, inputPathSingleStudy);
14 index++;
15 }
16
17 // next compute rank
18 ...
19 }
39. Rank Product Algorithm
Listing 3: performRrankProduct()
1 public static void performRrankProduct(
2 ...
3 // next compute rank
4 // 1. sort values based on absolute value of mean value
5 // 2. assign rank from 1 to N
6 // 3. calculate rank product for each gene
7 JavaPairRDD<String,Long>[] ranks = new JavaPairRDD[means.length];
8 for (int i=0; i < means.length; i++) {
9 ranks[i] = assignRank(means[i]);
10 }
11 // calculate ranked products
12 // <gene, T2<rankedProduct, N>>
13 JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts =
14 computeRankedProducts(context, ranks);
15
16 // save the result, shuffle=true
17 rankedProducts.coalesce(1,true).saveAsTextFile(outputPath);
18
19 // close the context and we are done
20 context.close();
21 }
40. Rank Product Algorithm
STEP-3: create a Spark context object
1 public static JavaSparkContext createJavaSparkContext(boolean useYARN) {
2 JavaSparkContext context;
3 if (useYARN) {
4 context = new JavaSparkContext("yarn-cluster", "MyAnalysis"); // YARN
5 }
6 else {
7 context = new JavaSparkContext(); // Spark cluster
8 }
9 // inject efficiency
10 SparkConf sparkConf = context.getConf();
11 sparkConf.set("spark.kryoserializer.buffer.mb","32");
12 sparkConf.set("spark.shuffle.file.buffer.kb","64");
13 // set a fast serializer
14 sparkConf.set("spark.serializer",
15 "org.apache.spark.serializer.KryoSerializer");
16 sparkConf.set("spark.kryo.registrator",
17 "org.apache.spark.serializer.KryoRegistrator");
18 return context;
19 }
41. Rank Product Algorithm
Listing 4: STEP-4: compute mean
1 static JavaPairRDD<String, Double> computeMeanByGroupByKey(
2 JavaSparkContext context,
3 final String inputPath) throws Exception {
4 JavaPairRDD<String, Double> genes =
5 getGenesUsingTextFile(context, inputPath, 30);
6
7 // group values by gene
8 JavaPairRDD<String, Iterable<Double>> groupedByGene = genes.groupByKey();
9
10 // calculate mean per gene
11 JavaPairRDD<String, Double> meanRDD = groupedByGene.mapValues(
12 new Function<
13 Iterable<Double>, // input
14 Double // output: mean
15 >() {
16 @Override
17 public Double call(Iterable<Double> values) {
18 double sum = 0.0;
19 int count = 0;
20 for (Double v : values) {
21 sum += v;
22 count++;
23 }
24 // calculate mean of samples
25 double mean = sum / ((double) count);
26 return mean;
27 }
28 });
29 return meanRDD;
30 }
42. Rank Product Algorithm
Listing 5: getGenesUsingTextFile()
1 static JavaPairRDD<String, Double> getGenesUsingTextFile(
2 JavaSparkContext context,
3 final String inputPath,
4 final int numberOfPartitions) throws Exception {
5
6 // read input and create the first RDD
7 // JavaRDD<String>: where String = "gene,test_expression"
8 JavaRDD<String> records = context.textFile(inputPath, numberOfPartitions);
9
10 // for each record, we emit (K=gene, V=test_expression)
11 JavaPairRDD<String, Double> genes
12 = records.mapToPair(new PairFunction<String, String, Double>() {
13 @Override
14 public Tuple2<String, Double> call(String rec) {
15 // rec = "gene,test_expression"
16 String[] tokens = StringUtils.split(rec, ",");
17 // tokens[0] = gene
18 // tokens[1] = test_expression
19 return new Tuple2<String, Double>(
20 tokens[0], Double.parseDouble(tokens[1]));
21 }
22 });
23 return genes;
24 }
43. Rank Product Algorithm
Listing 6: Assign Rank
1 // result is JavaPairRDD<String, Long> = (gene, rank)
2 static JavaPairRDD<String,Long> assignRank(JavaPairRDD<String,Double> rdd){
3 // swap key and value (will be used for sorting by key); convert value to abs(value)
4 JavaPairRDD<Double,String> swappedRDD = rdd.mapToPair(
5 new PairFunction<Tuple2<String, Double>, // T: input
6 Double, // K
7 String>() { // V
8 public Tuple2<Double,String> call(Tuple2<String, Double> s) {
9 return new Tuple2<Double,String>(Math.abs(s._2), s._1);
10 }
11 });
12 // we need 1 partition so that we can zip numbers into this RDD by zipWithIndex()
13 JavaPairRDD<Double,String> sorted = swappedRDD.sortByKey(false, 1); // false = descending order
14 // JavaPairRDD<T,Long> zipWithIndex()
15 // Long values will be 0, 1, 2, ...; for ranking, we need 1, 2, 3, ..., therefore, we will add 1
16 JavaPairRDD<Tuple2<Double,String>,Long> indexed = sorted.zipWithIndex();
17 // next convert JavaPairRDD<Tuple2<Double,String>,Long> into JavaPairRDD<String,Long>
18 // JavaPairRDD<Tuple2<value,gene>,rank> into JavaPairRDD<gene,rank>
19 JavaPairRDD<String, Long> ranked = indexed.mapToPair(
20 new PairFunction<Tuple2<Tuple2<Double,String>,Long>, // T: input
21 String, // K: mapped_id
22 Long>() { // V: rank
23 public Tuple2<String, Long> call(Tuple2<Tuple2<Double,String>,Long> s) {
24 return new Tuple2<String,Long>(s._1._2, s._2 + 1); // ranks are 1, 2, ..., n
25 }
26 });
27 return ranked;
28 }
44. Rank Product Algorithm
Listing 7: Compute Rank Product using groupByKey()
1 static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProducts(
2 JavaSparkContext context,
3 JavaPairRDD<String, Long>[] ranks) {
4 JavaPairRDD<String, Long> unionRDD = context.union(ranks);
5
6 // next find unique keys, with their associated values
7 JavaPairRDD<String, Iterable<Long>> groupedByGeneRDD = unionRDD.groupByKey();
8
9 // next calculate ranked products and the number of elements
10 JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts = groupedByGeneRDD.mapValues(
11 new Function<
12 Iterable<Long>, // input: ranks from all studies
13 Tuple2<Double, Integer> // output: (rankedProduct, N)
14 >() {
15 @Override
16 public Tuple2<Double, Integer> call(Iterable<Long> values) {
17 int N = 0;
18 long products = 1;
19 for (Long v : values) {
20 products *= v;
21 N++;
22 }
23 double rankedProduct = Math.pow( (double) products, 1.0/((double) N));
24 return new Tuple2<Double, Integer>(rankedProduct, N);
25 }
26 });
27 return rankedProducts;
28 }
45. Rank Product Algorithm
Next FOCUS on combineByKey()
We do need to develop 2 functions:
computeMeanByCombineByKey()
computeRankedProductsUsingCombineByKey()
46. Rank Product Algorithm
combineByKey(): how does it work?
combineByKey() is the most general of the per-key aggregation
functions. Most of the other per-key combiners are implemented
using it.
Like aggregate(), combineByKey() allows the user to return values
that are not the same type as our input data.
To understand combineByKey(), it is useful to think of how it
handles each element it processes.
As combineByKey() goes through the elements in a partition, each
element either has a key it has not seen before or has the same key as
a previous element.
47. Rank Product Algorithm
combineByKey() definition
1 public <C> JavaPairRDD<K,C> combineByKey(
2 Function<V,C> createCombiner, // V -> C
3 Function2<C,V,C> mergeValue, // C+V -> C
4 Function2<C,C,C> mergeCombiners // C+C -> C
5 )
6 Description: Generic function to combine the elements for each key
7 using a custom set of aggregation functions. Turns a JavaPairRDD[(K, V)]
8 into a result of type JavaPairRDD[(K, C)], for a "combined type" C. Note
9 that V and C can be different -- for example, one might group an RDD of
10 type (Int, Int) into an RDD of type (Int, List[Int]).
11 Users provide three functions:
12 - createCombiner, which turns a V into a C (e.g., creates a one-element list)
13 - mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
14 - mergeCombiners, to combine two Cs into a single one.
48. Rank Product Algorithm
computeMeanByCombineByKey():
Define C data structure
1 /**
2 * AverageCount is used by combineByKey() to hold
3 * the total values and their count.
4 */
5 static class AverageCount implements Serializable {
6 double total;
7 int count;
8
9 public AverageCount(double total, int count) {
10 this.total = total;
11 this.count = count;
12 }
13
14 public double average() {
15 return total / (double) count;
16 }
17 }
49. Rank Product Algorithm
computeMeanByCombineByKey()
1 static JavaPairRDD<String, Double> computeMeanByCombineByKey(
2 JavaSparkContext context,
3 final String inputPath) throws Exception {
4
5 JavaPairRDD<String, Double> genes = getGenesUsingTextFile(context, inputPath, 30);
6
7 // we need 3 functions to be able to use combineByKey()
8 Function<Double, AverageCount> createCombiner = ...
9 Function2<AverageCount, Double, AverageCount> addAndCount = ...
10 Function2<AverageCount, AverageCount, AverageCount> mergeCombiners = ...
11
12 JavaPairRDD<String, AverageCount> averageCounts =
13 genes.combineByKey(createCombiner, addAndCount, mergeCombiners);
14
15 // now compute the mean/average per gene
16 JavaPairRDD<String,Double> meanRDD = averageCounts.mapToPair(
17 new PairFunction<
18 Tuple2<String, AverageCount>, // T: input
19 String, // K
20 Double // V
21 >() {
22 @Override
23 public Tuple2<String,Double> call(Tuple2<String, AverageCount> s) {
24 return new Tuple2<String,Double>(s._1, s._2.average());
25 }
26 });
27 return meanRDD;
28 }
50. Rank Product Algorithm
3 functions for combineByKey()
1 Function<Double, AverageCount> createCombiner = new Function<Double, AverageCount>() {
2 @Override
3 public AverageCount call(Double x) {
4 return new AverageCount(x, 1);
5 }
6 };
7
8 Function2<AverageCount, Double, AverageCount> addAndCount =
9 new Function2<AverageCount, Double, AverageCount>() {
10 @Override
11 public AverageCount call(AverageCount a, Double x) {
12 a.total += x;
13 a.count += 1;
14 return a;
15 }
16 };
17
18 Function2<AverageCount, AverageCount, AverageCount> mergeCombiners =
19 new Function2<AverageCount, AverageCount, AverageCount>() {
20 @Override
21 public AverageCount call(AverageCount a, AverageCount b) {
22 a.total += b.total;
23 a.count += b.count;
24 return a;
25 }
26 };
51. Rank Product Algorithm
Per-key average using combineByKey() in Python
1 sumCount = nums.combineByKey(
2 (lambda x: (x, 1)),
3 (lambda acc, x: (acc[0] + x, acc[1] + 1)),
4 (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
5 )
6
7 mean = sumCount.mapValues(lambda sumCnt: sumCnt[0] / sumCnt[1]).collectAsMap()
8 print(mean)
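Here, nums is assumed to be a pair RDD of (key, numeric value) pairs, the Python analogue of the genes RDD in the Java listings; collectAsMap() returns the per-key means as an ordinary Python dictionary.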
52. Rank Product Algorithm
Sample Run using groupByKey(): script
1 # define input/output for Hadoop/HDFS
2 #// args[0] = output path
3 #// args[1] = number of studies (K)
4 #// args[2] = input path for study 1
5 #// args[3] = input path for study 2
6 #// ...
7 #// args[K+1] = input path for study K
8 OUTPUT=/rankproduct/output
9 NUM_OF_INPUT=3
10 INPUT1=/rankproduct/input1
11 INPUT2=/rankproduct/input2
12 INPUT3=/rankproduct/input3
13
14 # remove the output directory if it already exists
15 $HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT
16 #
17 # define the driver class and submit the Spark job
18 driver=org.dataalgorithms.bonus.rankproduct.spark.SparkRankProductUsingGroupByKey
19 $SPARK_HOME/bin/spark-submit --class $driver \
20 --master yarn-cluster \
21 --jars $OTHER_JARS \
22 --conf "spark.yarn.jar=$SPARK_JAR" \
23 $APP_JAR $OUTPUT $NUM_OF_INPUT $INPUT1 $INPUT2 $INPUT3
54. Rank Product Algorithm
Sample Run using groupByKey(): output
1 # hadoop fs -cat /rankproduct/output/part*
2 (K_2,(1.5874010519681994,3))
3 (K_3,(4.0,2))
4 (K_1,(1.8171205928321397,3))
5 (K_4,(1.7320508075688772,2))
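These values correspond to the small worked example from the ”What is Rank Product?” section: K_1 gets \sqrt[3]{3 × 1 × 2} ≈ 1.8171 across 3 studies, K_2 gets \sqrt[3]{4} ≈ 1.5874, K_3 gets \sqrt{16} = 4.0 across 2 studies, and K_4 gets \sqrt{3} ≈ 1.7321.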
55. Rank Product Algorithm
computeRankedProductsUsingCombineByKey():
Define C data structure
1 /**
2 * RankProduct is used by combineByKey() to hold
3 * the total product values and their count.
4 */
5 static class RankProduct implements Serializable {
6 long product;
7 int count;
8
9 public RankProduct(long product, int count) {
10 this.product = product;
11 this.count = count;
12 }
13
14 public double rank() {
15 return Math.pow((double) product, 1.0/ (double) count);
16 }
17 }
56. Rank Product Algorithm
computeRankedProductsUsingCombineByKey()
1 // JavaPairRDD<String, Tuple2<Double, Integer>> = <gene, T2(rankedProduct, N>>
2 // where N is the number of elements for computing the rankedProduct
3 static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProductsUsingCombineByKey(
4 JavaSparkContext context,
5 JavaPairRDD<String, Long>[] ranks) {
6 JavaPairRDD<String, Long> unionRDD = context.union(ranks);
7
8 // we need 3 functions to be able to use combineByKey()
9 Function<Long, RankProduct> createCombiner = ...
10 Function2<RankProduct, Long, RankProduct> addAndCount = ...
11 Function2<RankProduct, RankProduct, RankProduct> mergeCombiners = ...
12
13 // next find unique keys, with their associated ranks
14 JavaPairRDD<String, RankProduct> combinedByGeneRDD =
15 unionRDD.combineByKey(createCombiner, addAndCount, mergeCombiners);
16
17 // next calculate ranked products and the number of elements
18 JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts = combinedByGeneRDD.mapValues(
19 new Function<
20 RankProduct, // input: RankProduct
21 Tuple2<Double, Integer> // output: (rankedProduct, N)
22 >() {
23 @Override
24 public Tuple2<Double, Integer> call(RankProduct value) {
25 double theRankedProduct = value.rank();
26 return new Tuple2<Double, Integer>(theRankedProduct, value.count);
27 }
28 });
29 return rankedProducts;
30 }
57. Rank Product Algorithm
3 functions for
computeRankedProductsUsingCombineByKey()
1 Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
2 @Override
3 public RankProduct call(Long x) {
4 return new RankProduct(x, 1);
5 }
6 };
7
8 Function2<RankProduct, Long, RankProduct> addAndCount =
9 new Function2<RankProduct, Long, RankProduct>() {
10 @Override
11 public RankProduct call(RankProduct a, Long x) {
12 a.product *= x;
13 a.count += 1;
14 return a;
15 }
16 };
17
18 Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
19 new Function2<RankProduct, RankProduct, RankProduct>() {
20 @Override
21 public RankProduct call(RankProduct a, RankProduct b) {
22 a.product *= b.product;
23 a.count += b.count;
24 return a;
25 }
26 };
58. Rank Product Algorithm
REVISED computeRankedProductsUsingCombineByKey():
Define C data structure
1 static class RankProduct implements Serializable {
2 long product;
3 int count;
4
5 public RankProduct(long product, int count) {
6 this.product = product;
7 this.count = count;
8 }
9
10 public void product(long value) {
11 this.product *= value;
12 this.count++;
13 }
14
15 public void product(RankProduct pr) {
16 this.product *= pr.product;
17 this.count += pr.count;
18 }
19
20 public double rank() {
21 return Math.pow((double) product, 1.0/ (double) count);
22 }
23 }
59. Rank Product Algorithm
REVISED 3 functions for
computeRankedProductsUsingCombineByKey()
1 Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
2 public RankProduct call(Long x) {
3 return new RankProduct(x, 1);
4 }
5 };
6
7 Function2<RankProduct, Long, RankProduct> addAndCount =
8 new Function2<RankProduct, Long, RankProduct>() {
9 public RankProduct call(RankProduct a, Long x) {
10 a.product(x);
11 return a;
12 }
13 };
14
15 Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
16 new Function2<RankProduct, RankProduct, RankProduct>() {
17 public RankProduct call(RankProduct a, RankProduct b) {
18 a.product(b);
19 return a;
20 }
21 };
60. Moral of Story
Outline
1 Biography
2 What is Rank Product?
3 Basic Definitions
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
61. Moral of Story
Moral of Story...
Understand data requirements
Select proper platform for implementation: Spark
Partition your RDDs properly
number of cluster nodes
number of cores per node
amount of RAM
Avoid unnecessary computations: for example, do not compute both (g1, g2) and (g2, g1)
Use filter() early and often to remove unneeded RDD elements
Avoid unnecessary RDD.saveAsTextFile(path) calls
Run on both YARN and a standalone Spark cluster and compare performance
use groupByKey() cautiously
prefer combineByKey() over groupByKey()
Verify your test results against an R implementation
62. Moral of Story
Thank you!
Questions?