TalkingData is the largest independent big data service platform in China. Its network covers 70% of active mobile devices nationwide and handles 3 billion ad clicks per day, of which 90% are potentially fraudulent. Click fraud at this volume distorts data and wastes advertising spend. To help address the problem, TalkingData partnered with Kaggle, a U.S.-based platform for predictive modeling and analytics competitions.
This paper builds predictive models, using both traditional and Big Data methods, to determine whether a smartphone app will be downloaded after a user clicks an advertisement. We use the 7 GB "TalkingData AdTracking Fraud Detection Challenge" data set provided through a Kaggle competition. Four classification models are implemented on this massive data set to predict fraud with both approaches; we define a click as fraudulent when the user clicks an advertisement without downloading the app. Because the traditional platform lacks the resources to build models on data sets larger than a gigabyte, we draw a sample of the data for the traditional models and use the full data set for the models built with Spark ML on the Big Data system. We also report the accuracy and performance of the models implemented on both platforms.
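The sampling step described for the traditional platform can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's actual pipeline: a stratified sample keeps the rare "download" class represented when shrinking the click log to a size a single machine can handle.

```python
import random

# Hypothetical stand-in for the 7 GB click log: each record is
# (ip, app, channel, is_attributed), where is_attributed == 1 means
# the click led to an app download (i.e., not fraud). Proportions
# here are synthetic.
random.seed(0)
clicks = [(random.randrange(1000), random.randrange(50),
           random.randrange(20), 1 if random.random() < 0.0025 else 0)
          for _ in range(100_000)]

def stratified_sample(records, frac, label_idx=3):
    """Sample the same fraction of each class, so the rare download
    class is not lost when the training set is shrunk."""
    by_label = {}
    for r in records:
        by_label.setdefault(r[label_idx], []).append(r)
    sample = []
    for rows in by_label.values():
        k = max(1, int(len(rows) * frac))  # keep at least one per class
        sample.extend(random.sample(rows, k))
    return sample

# Shrink the log to ~10% for the resource-limited traditional platform.
small = stratified_sample(clicks, frac=0.1)
```

The sampled set can then be fed to any single-machine classifier, while the full log goes to the Spark ML models.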
20170402 Crop Innovation and Business - Amsterdam - Allen Day, PhD
This document discusses applying machine learning and artificial intelligence techniques like deep neural networks to problems in genomics and agriculture. It provides examples of using Google Cloud platforms and services for storing and analyzing large genomic datasets, as well as developing models for tasks like variant calling from sequencing data and marker-assisted breeding. The document advocates that Google is well-positioned to handle massive volumes of genomic and agricultural data and help advance the application of AI in these domains.
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ... - Allen Day, PhD
This document discusses Google's capabilities for handling large genomic and biomedical data sets. It describes how Google uses technologies like Google Cloud, BigQuery, Dataflow and TensorFlow to process, store and analyze massive volumes of genomic and medical data. Google's systems can handle hundreds of terabytes to petabytes of data and enable fast querying and machine learning on these data sets. The document also provides examples of how Google is applying these capabilities to challenges in genomics, healthcare and precision medicine.
This document discusses big data and analytics, outlining five trends and five research challenges. It begins by defining big data in terms of volume, velocity, variety, veracity and value. It then discusses the origins and evolution of big data, from early statistics to modern data science. Analytics is defined as using data to make empirically-derived, statistically valid decisions. The document outlines how hardware choices led to scaling out data processing across clusters rather than scaling up on single machines. It also provides examples of fields that generate huge volumes of data from billion dollar instruments like CERN's Large Hadron Collider and genomic sequencing facilities.
Machine learning in the life sciences with KNIME - Greg Landrum
This document discusses using machine learning and the KNIME platform to build predictive models for problems in the life sciences using molecular data. It provides an example of building a random forest model to predict biological activity of molecules using molecular fingerprints as features. The model achieves high accuracy but predicts inactivity for almost all molecules due to class imbalance in the data. To address this, the document suggests adjusting the decision boundary of the model by setting it at the point on the ROC curve that retrieves most actives without including too many inactives. In summary, it presents an example of applying machine learning to predict biological activity from molecular data and discusses techniques for handling class imbalance.
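The threshold-moving idea in this summary can be illustrated with a small sketch. This is not the KNIME workflow itself; it simply picks the score cut-off at the ROC point that maximizes Youden's J (true positive rate minus false positive rate), one common way to retrieve most actives without admitting too many inactives.

```python
def best_threshold(scores, labels):
    """Scan candidate thresholds and return the one at the ROC point
    maximizing Youden's J = TPR - FPR: most actives recovered for the
    fewest inactives admitted."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Imbalanced toy data: two actives (label 1) among many inactives.
scores = [0.9, 0.8, 0.75, 0.3, 0.2, 0.2, 0.1, 0.05]
labels = [1,   1,   0,    0,   0,   0,   0,   0]
threshold = best_threshold(scores, labels)
```

With a default 0.5 cut-off the rare actives would dominate neither class; moving the boundary to the selected point recovers both actives while rejecting all inactives in this toy set.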
Deep learning in medicine: An introduction and applications to next-generatio... - Allen Day, PhD
Deep learning has enabled dramatic advances in image recognition performance. In this talk I will discuss using a deep convolutional neural network to detect genetic variation in aligned next-generation sequencing human read data. Our method, called DeepVariant, both outperforms existing genotyping tools and generalizes across genome builds and even to other species. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.
Using the Open Science Data Cloud for Data Science Research - Robert Grossman
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Towards Automatic Composition of Multicomponent Predictive Systems - Manuel Martín
Automatic composition and parametrisation of multicomponent predictive systems (MCPSs) consisting of chains of data transformation steps is a challenging task. In this paper we propose and describe an extension to the Auto-WEKA software which now allows such flexible MCPSs to be composed and optimised from a sequence of WEKA methods. In the experimental analysis we focus on examining how significantly extending the search space, by incorporating additional hyperparameters of the models, affects the quality of the found solutions. In a range of extensive experiments, three different optimisation strategies are used to automatically compose MCPSs on 21 publicly available datasets. A comparison with previous work indicates that extending the search space improves classification accuracy in the majority of cases. The diversity of the found MCPSs is also an indication that fully and automatically exploiting different combinations of data cleaning and preprocessing techniques is possible and highly beneficial for different predictive models. This can have a big impact on the development, maintenance and scalability of high-quality predictive models in modern application and deployment scenarios.
(2016) Application of parallel glowworm swarm optimization algorithm for data ... - Akram Pasha
The document describes a research article that proposes applying a parallel glowworm swarm optimization algorithm for clustering large unstructured data sets. The algorithm uses optimized glowworm swarms to evaluate the clustering problem and find multiple cluster centroids. It employs the MapReduce framework for parallelization, which balances the load, localizes the data, and provides fault tolerance. Experiments showed the algorithm scales well with increasing data set sizes and achieves near-linear speedup while maintaining high-quality clustering, demonstrating it is more efficient than traditional algorithms for clustering large unstructured data.
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud - Databricks
Efficient recommender systems are critical to the success of many industries, such as job recommendation, news recommendation and e-commerce. This talk illustrates how to build an efficient document recommender system by leveraging Natural Language Processing (NLP) and Deep Neural Networks (DNNs). The end-to-end flow of the document recommender system is built on AWS at scale, using Analytics Zoo for Spark and BigDL. The system first turns text-rich documents into embeddings by incorporating Global Vectors (GloVe), then trains a K-means model using native Spark APIs to cluster users into several groups. It further trains a recommender model for each group and gives an ensemble prediction for each test record. By adopting the end-to-end Analytics Zoo pipeline, we saw about a 10% improvement in mean reciprocal rank and a 6% improvement in precision compared to the search recommendations in a job recommendation study.
Speaker: Guoqiong Song
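The embedding step described in this summary can be sketched minimally. The Analytics Zoo / BigDL pipeline itself is not reproduced here, and the two-dimensional vectors below are toy stand-ins for real GloVe embeddings (which are typically 50-300 dimensions).

```python
# Toy GloVe-style lookup table; real GloVe vectors would be loaded
# from the published pretrained files.
glove = {"data": [1.0, 0.0], "science": [0.8, 0.2],
         "cooking": [0.0, 1.0], "recipes": [0.1, 0.9]}

def doc_embedding(text, table, dim=2):
    """Average the word vectors of known words: the simplest way to
    turn a text-rich document into a fixed-length embedding that a
    K-means clustering step can consume."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    if not vecs:
        return [0.0] * dim  # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

e1 = doc_embedding("Data science", glove)
e2 = doc_embedding("Cooking recipes", glove)
```

Documents with similar vocabulary land near each other in embedding space, which is what lets the subsequent K-means step group similar users and documents.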
Adversarial Analytics - 2013 Strata & Hadoop World Talk - Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
IRJET - Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op... - IRJET Journal
The document proposes an improved model for big data analytics using dynamic multi-swarm optimization and unsupervised learning algorithms. It develops an algorithm called Dynamic K-reference Clustering that combines dynamic multi-swarm optimization with a k-reference clustering algorithm. The k-reference clustering algorithm uses reference distance weighting, Euclidean distance, and chi-square relative frequency to cluster mixed datasets. It was tested on several datasets from a machine learning repository and was shown to cluster large, mixed datasets more efficiently than other clustering algorithms such as k-means and particle swarm optimization. The dynamic multi-swarm optimization guides the clustering algorithm to more accurate cluster formations by providing the best initial value of k clusters.
Quick presentation for the OpenML workshop in Eindhoven 2014 - Manuel Martín
This document summarizes Manuel Martín Salvador's background and research interests in automated and adaptive data pre-processing for building predictive models. It discusses how data pre-processing makes up a large portion of the data mining process but is labor intensive. The document also outlines OpenML, a scientific workflow platform and repository for machine learning experiments, and highlights opportunities to increase the number and types of pre-processing methods available on the platform as well as improve flow representation and recommendation.
Concept Drift Identification using Classifier Ensemble Approach - IJECEIAES
Abstract: In internetworking systems, huge amounts of data are scattered, generated and processed over the network, and data mining techniques are used to discover unknown patterns in the underlying data. A traditional classification model classifies data based on past labelled data. In many current applications, however, data grows in size with fluctuating patterns, and new features may appear over time. This occurs in applications such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and pricing driven by demand and supply. Such changes in the data distribution reduce classification accuracy: some patterns may appear frequent while others fade away and are wrongly classified. Traditional classification techniques may therefore be unsuitable, since the distribution generating the items can change over time and past data may become irrelevant or even misleading for current predictions. To handle such shifting patterns, concept drift mining is used to improve the accuracy of classification techniques. In this paper we propose an ensemble approach for improving classifier accuracy. The ensemble classifier is applied to three different data sets; we investigate different features for each chunk of data, which are then fed to the ensemble classifier, and we observe that the proposed approach improves classifier accuracy across the different chunks.
A cyber physical stream algorithm for intelligent software defined storage - Made Artha
The document presents a new Cyber Physical Stream (CPS) algorithm for selecting predominant items from large data streams. The algorithm works well for item frequencies starting from 2%. It is designed for use in intelligent Software-Defined Storage systems combined with fuzzy indexing. Experiments show CPS improves accuracy and efficiency over previous algorithms. CPS is inspired by a brain model and works by incrementing a "voltage" value when items match and decrementing it otherwise, selecting the item with highest voltage. It performs well on both uniform random and Zipf's law distributed streams, with optimal parameter values depending on the distribution.
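The voltage mechanism described here resembles counter-based heavy-hitter schemes. A minimal sketch under that reading (not the paper's actual CPS algorithm or its parameter values) is:

```python
def predominant(stream, candidates):
    """Per-candidate 'voltage' counter in the spirit of the CPS
    description: +1 when the stream item matches the candidate,
    -1 otherwise; the candidate with the highest voltage wins."""
    voltage = {c: 0 for c in candidates}
    for item in stream:
        for c in voltage:
            voltage[c] += 1 if item == c else -1
    return max(voltage, key=voltage.get)

# Toy stream where "a" is the predominant item.
stream = ["a", "b", "a", "c", "a", "a", "b"]
winner = predominant(stream, {"a", "b", "c"})
```

Matching items push a candidate's voltage up faster than mismatches pull it down only when that candidate dominates the stream, which is why the scheme selects the predominant item.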
This paper aims to discover frequent patterns using data grids in the WEKA 3.8 environment. Because workload imbalance arises from the dynamic nature of grid computing, data grids are used for the creation and validation of data. Association rules are used to extract useful information from large databases. The researchers use WEKA 3.8 to generate the best-performing rules and to implement various algorithms.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... - IJDKP
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, shopping and individual records are generated regularly, and sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision making through analysis; on the other, privacy concerns may prevent data owners from sharing information for analysis. To share data while preserving privacy, the data owner needs a solution that achieves the dual goal of privacy preservation and accuracy on the data mining tasks of clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while obtaining data clusterings with minimum information loss.
[215] Streetwise machine learning for painless parking - NAVER D2
The document summarizes research on using machine learning techniques for optimizing parking policies. It discusses using parking data from various sources like sensors and payments to set pricing, guide enforcement, and help drivers find spaces. Pricing models are developed to maximize the overall value people get from the parking system. A voting rule is proposed as a simple way to adjust prices based on occupancy levels over time. Spatial and temporal sampling techniques are explored to reduce sensor costs while still obtaining high quality data, such as prioritizing observations of locations with higher predictive uncertainty.
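A minimal sketch of such an occupancy-driven price update follows; the target occupancy, dead band and step size here are hypothetical illustrations, not values from the research.

```python
def adjust_price(price, occupancy, target=0.8, step=0.25):
    """Simple occupancy-based voting rule: raise the hourly price when
    a block is fuller than the target occupancy, lower it when it is
    emptier, and leave it unchanged inside a small dead band."""
    if occupancy > target + 0.05:
        return round(price + step, 2)
    if occupancy < target - 0.05:
        return round(max(0.0, price - step), 2)
    return price

p1 = adjust_price(2.00, 0.95)  # overcrowded block: price goes up
p2 = adjust_price(2.00, 0.50)  # underused block: price goes down
```

Iterating this rule over successive observation periods nudges each block toward the target occupancy, which is the intuition behind demand-responsive parking pricing.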
This document summarizes a paper presented at the 2011 International Conference on Recent Trends in Information Systems. The paper proposes a new algorithm for online mining of association rules in large databases. It introduces the concept of an adjacency lattice to store pre-processed itemsets in a way that reduces disk I/O during online queries. The proposed algorithm generates rules by constructing a weighted directed graph and performing depth-first search. It generates all essential rules while having fewer edges than the lattice used in existing algorithms, allowing more efficient online rule generation.
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014 - Robert Grossman
This document discusses how biomedical discovery is being disrupted by big data. Large genomic, phenotype, and environmental datasets are needed to understand complex diseases that result from combinations of many rare variants. However, analyzing large biomedical data is costly and difficult given the standard model of local computing. The document proposes creating large "commons" of community data and computing as an instrument for big data discovery. Examples are given of the Cancer Genome Atlas project, which has petabytes of research data on thousands of cancer patients, and how tumors evolve over time. Overall, the document argues that new models of shared biomedical clouds and commons are needed to enable cost-effective analysis of big biomedical data.
Presented at OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France
Introduction to Biological Network Analysis and Visualization with Cytoscape ... - Keiichiro Ono
Introduction to biological network analysis and visualization with Cytoscape (using the latest version 3.4).
This is the first half of the Applied Bioinformatics lecture at TSRI.
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu... - Allen Day, PhD
The document discusses applications of deep learning concepts and techniques to problems in genomics and precision agriculture. It describes how deep neural networks can be used for tasks like calling genetic variants from DNA sequencing data more accurately, enabling marker-assisted breeding in crops by identifying desirable genetic variants, and integrating diverse data sources like images and sensor data for optimization in precision agriculture. The document also discusses opportunities and challenges for applying these approaches to problems in the cannabis industry.
This thesis focuses on performance management techniques for cloud services. It presents work in three key areas: 1) Developing a scalable and generic resource allocation protocol for large cloud environments. 2) Building performance models to predict response times and capacity for a distributed key-value store. 3) Enabling real-time prediction of service metrics using analytics on low-level system statistics. The thesis contributes solutions for these challenging problems and identifies open questions around decentralized resource allocation, online performance management, and analytics-based forecasting at large scales.
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” - Lviv Startup Club
Machine learning enables computers to learn from data and experiences to act without being explicitly programmed. The goal of machine learning is to use example data or past experience to solve problems. There are different styles of machine learning algorithms such as supervised learning where the training data is labeled, and unsupervised learning where the training data is unlabeled. Machine learning problems can involve regression, classification, or clustering. The machine learning process involves preparing data, applying learning algorithms to create models, and deploying chosen models through applications and APIs.
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform - Savita Yadav
KMIS International Conference 2021.
This talk provides insights into the performance of predictive models for Airbnb ratings using Big Data and distributed parallel computing systems. Using two-class classification models, we predict whether a property has a high or a low rating based on the features of its listing. This helps hosts judge whether their property is suitable and how their listing compares to other similar listings. We compare the rating prediction models on accuracy and computing time metrics.
Towards Automatic Composition of Multicomponent Predictive SystemsManuel Martín
Automatic composition and parametrisation of multicomponent predictive systems (MCPSs) consisting of chains of data transformation steps is a challenging task. In this paper we propose and describe an extension to the Auto-WEKA software which now allows to compose and optimise such flexible MCPSs by using a sequence of WEKA methods. In the experimental analysis we focus on examining the impact of significantly extending the search space by incorporating additional hyperparameters of the models, on the quality of the found solutions. In a range of extensive experiments three different optimisation strategies are used to automatically compose MCPSs on 21 publicly available datasets. A comparison with previous work indicates that extending the search space improves the classification accuracy in the majority of the cases. The diversity of the found MCPSs are also an indication that fully and automatically exploiting different combinations of data cleaning and preprocessing techniques is possible and highly beneficial for different predictive models. This can have a big impact on high quality predictive models development, maintenance and scalability aspects needed in modern application and deployment scenarios.
(2016)application of parallel glowworm swarm optimization algorithm for data ...Akram Pasha
The document describes a research article that proposes applying a parallel glowworm swarm optimization algorithm for clustering large unstructured data sets. The algorithm uses optimized glowworm swarms to evaluate the clustering problem and find multiple cluster centroids. It employs the MapReduce framework for parallelization, which balances the load, localizes the data, and provides fault tolerance. Experiments showed the algorithm scales well with increasing data set sizes and achieves near-linear speed while maintaining high-quality clustering, demonstrating it is more efficient than traditional algorithms for clustering large unstructured data.
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
Efficient recommender systems are critical for the success of many industries, such as job recommendation, news recommendation, ecommerce, etc. This talk will illustrate how to build an efficient document recommender system by leveraging Natural Language Processing(NLP) and Deep Neural Networks (DNNs). The end-to-end flow of the document recommender system is build on AWS at scale, using Analytics Zoo for Spark and BigDL. The system first processes text rich documents into embeddings by incorporating Global Vectors (GloVe), then trains a K-means model using native Spark APIs to cluster users into several groups. The system further trains a recommender model for each group, and gives an ensemble prediction for each test record. By adopting the end-to-end pipeline of Analytics Zoo solution, we saw about 10% improvement of mean reciprocal ranking and 6% of precision respectively compared to the search recommendations for a job recommendation study.
Speaker: Guoqiong Song
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET Journal
The document proposes an improved model for big data analytics using dynamic multi-swarm optimization and unsupervised learning algorithms. It develops an algorithm called DynamicK-reference Clustering that combines dynamic multi-swarm optimization with a k-reference clustering algorithm. The k-reference clustering algorithm uses reference distance weighting, Euclidean distance, and chi-square relative frequency to cluster mixed datasets. It was tested on several datasets from a machine learning repository and was shown to more efficiently cluster large, mixed datasets than other clustering algorithms like k-means and particle swarm optimization. The dynamic multi-swarm optimization helps guide the clustering algorithm to obtain more accurate cluster formations by providing the best initial value of k clusters.
Quick presentation for the OpenML workshop in Eindhoven 2014Manuel Martín
This document summarizes Manuel Martín Salvador's background and research interests in automated and adaptive data pre-processing for building predictive models. It discusses how data pre-processing makes up a large portion of the data mining process but is labor intensive. The document also outlines OpenML, a scientific workflow platform and repository for machine learning experiments, and highlights opportunities to increase the number and types of pre-processing methods available on the platform as well as improve flow representation and recommendation.
Concept Drift Identification using Classifier Ensemble Approach IJECEIAES
Abstract:-In Internetworking system, the huge amount of data is scattered, generated and processed over the network. The data mining techniques are used to discover the unknown pattern from the underlying data. A traditional classification model is used to classify the data based on past labelled data. However in many current applications, data is increasing in size with fluctuating patterns. Due to this new feature may arrive in the data. It is present in many applications like sensornetwork, banking and telecommunication systems, financial domain, Electricity usage and prices based on its demand and supplyetc .Thus change in data distribution reduces the accuracy of classifying the data. It may discover some patterns as frequent while other patterns tend to disappear and wrongly classify. To mine such data distribution, traditionalclassification techniques may not be suitable as the distribution generating the items can change over time so data from the past may become irrelevant or even false for the current prediction. For handlingsuch varying pattern of data, concept drift mining approach is used to improve the accuracy of classification techniques. In this paper we have proposed ensemble approach for improving the accuracy of classifier. The ensemble classifier is applied on 3 different data sets. We investigated different features for the different chunk of data which is further given to ensemble classifier. We observed the proposed approach improves the accuracy of classifier for different chunks of data.
A cyber physical stream algorithm for intelligent software defined storageMade Artha
The document presents a new Cyber Physical Stream (CPS) algorithm for selecting predominant items from large data streams. The algorithm works well for item frequencies starting from 2%. It is designed for use in intelligent Software-Defined Storage systems combined with fuzzy indexing. Experiments show CPS improves accuracy and efficiency over previous algorithms. CPS is inspired by a brain model and works by incrementing a "voltage" value when items match and decrementing it otherwise, selecting the item with highest voltage. It performs well on both uniform random and Zipf's law distributed streams, with optimal parameter values depending on the distribution.
The premise of this paper is to discover frequent patterns by the use of data grids in WEKA 3.8 environment. Workload imbalance occurs due to the dynamic nature of the grid computing hence data grids are used for the creation and validation of data. Association rules are used to extract the useful information from the large database. In this paper the researcher generate the best rules by using WEKA 3.8 for better performance. WEKA 3.8 is used to accomplish best rules and implementation of various algorithms.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
Huge volume of data from domain specific applications such as medical, financial, library, telephone,
shopping records and individual are regularly generated. Sharing of these data is proved to be beneficial
for data mining application. On one hand such data is an important asset to business decision making by
analyzing it. On the other hand data privacy concerns may prevent data owners from sharing information
for data analysis. In order to share data while preserving privacy, data owner must come up with a solution
which achieves the dual goal of privacy preservation as well as an accuracy of data mining task –
clustering and classification. An efficient and effective approach has been proposed that aims to protect
privacy of sensitive information and obtaining data clustering with minimum information loss
streetwise machine learning for painless parkingNAVER D2
The document summarizes research on using machine learning techniques for optimizing parking policies. It discusses using parking data from various sources like sensors and payments to set pricing, guide enforcement, and help drivers find spaces. Pricing models are developed to maximize the overall value people get from the parking system. A voting rule is proposed as a simple way to adjust prices based on occupancy levels over time. Spatial and temporal sampling techniques are explored to reduce sensor costs while still obtaining high quality data, such as prioritizing observations of locations with higher predictive uncertainty.
This document summarizes a paper presented at the 2011 International Conference on Recent Trends in Information Systems. The paper proposes a new algorithm for online mining of association rules in large databases. It introduces the concept of an adjacency lattice to store pre-processed itemsets in a way that reduces disk I/O during online queries. The proposed algorithm generates rules by constructing a weighted directed graph and performing depth-first search. It generates all essential rules while having fewer edges than the lattice used in existing algorithms, allowing more efficient online rule generation.
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
This document discusses how biomedical discovery is being disrupted by big data. Large genomic, phenotype, and environmental datasets are needed to understand complex diseases that result from combinations of many rare variants. However, analyzing large biomedical data is costly and difficult given the standard model of local computing. The document proposes creating large "commons" of community data and computing as an instrument for big data discovery. Examples are given of the Cancer Genome Atlas project, which has petabytes of research data on thousands of cancer patients, and how tumors evolve over time. Overall, the document argues that new models of shared biomedical clouds and commons are needed to enable cost-effective analysis of big biomedical data.
Presented at OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Keiichiro Ono
Introduction to biological network analysis and visualization with Cytoscape (using the latest version 3.4).
This is a first half of the lecture for Applied Bioinformatics lecture at TSRI.
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...Allen Day, PhD
The document discusses applications of deep learning concepts and techniques to problems in genomics and precision agriculture. It describes how deep neural networks can be used for tasks like calling genetic variants from DNA sequencing data more accurately, enabling marker-assisted breeding in crops by identifying desirable genetic variants, and integrating diverse data sources like images and sensor data for optimization in precision agriculture. The document also discusses opportunities and challenges for applying these approaches to problems in the cannabis industry.
This thesis focuses on performance management techniques for cloud services. It presents work in three key areas: 1) Developing a scalable and generic resource allocation protocol for large cloud environments. 2) Building performance models to predict response times and capacity for a distributed key-value store. 3) Enabling real-time prediction of service metrics using analytics on low-level system statistics. The thesis contributes solutions for these challenging problems and identifies open questions around decentralized resource allocation, online performance management, and analytics-based forecasting at large scales.
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” Lviv Startup Club
Machine learning enables computers to learn from data and experiences to act without being explicitly programmed. The goal of machine learning is to use example data or past experience to solve problems. There are different styles of machine learning algorithms such as supervised learning where the training data is labeled, and unsupervised learning where the training data is unlabeled. Machine learning problems can involve regression, classification, or clustering. The machine learning process involves preparing data, applying learning algorithms to create models, and deploying chosen models through applications and APIs.
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data PlatformSavita Yadav
KMIS International Conference 2021.
This talk aims to provide insights and performance of predictive models for Airbnb Rating using Big Data and distributed parallel computing systems. We have predicted and classified using Two-Class Classification models if a property has a high or a low rating based on the features of the listing. It helps the hosts to know if their property is suitable and how their listing compares to other similar listings. We compare the results and the performance of rating prediction models with accuracy and computing time metrics.
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsJongwook Woo
This paper compares the performance of scalable predictive analysis models using XGBoost in Big Data. The performance measurement is based on the training computing time and on accuracy via the AUC and Precision of a model. We developed XGBoost classification models with the Airbnb listing dataset that predict the recommendation of the listings. The models are built in PySpark Rapids, BigDL, and H2O Sparkling with CPU and GPU on AWS EMR. We observed that BigDL with GPU has a 25 - 50% faster training time than the other platforms. H2O Sparkling has a 5 - 7% better AUC and 0.7% better Precision than the others.
Rating Prediction using Deep Learning and SparkJongwook Woo
Distributed Deep Learning to predict Amazon review data rating in Spark using Analytics Zoo on AWS, which is published at "Rating Prediction using Deep Learning and Spark" at The 11th Internation Conference on Internet (ICONI 2019), Hanoi, Vietnam, Dec 15 - 18 2019
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
This talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Introduction to Big Data and AI for Business Analytics and PredictionJongwook Woo
This document provides an introduction to big data and artificial intelligence presented by Jongwook Woo. It discusses Woo's background and experience, provides an overview of big data including issues with traditional data handling approaches and the need for scalable solutions like Hadoop. It also covers machine learning and deep learning techniques for predictive analysis using big data, and provides examples applying these techniques to COVID-19 data and financial fraud detection.
History and Trend of Big Data and Deep LearningJongwook Woo
This document contains a presentation by Jongwook Woo on the history and trends of big data and deep learning. It discusses the evolution of data storage and analysis from traditional systems to modern big data platforms like Hadoop and Spark that can handle large, complex datasets in a distributed, cost-effective manner. It also covers the rise of deep learning techniques using neural networks and how they can be applied to big data at scale, such as for predictive analytics, using distributed deep learning frameworks on existing big data clusters.
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLJongwook Woo
This talk aims at providing insights, performance, and architecture on Financial Fraud Detection on a mobile money transactional activity in Azure ML and Spark. We have predicted and classified the transaction as normal or fraud with a small sample and massive data set using Azure ML and Spark ML, which are traditional systems and Big Data respectively. I will present predictive analysis with several classification models experimenting in Azure and Spark ML. Besides, scalability of Spark ML will be presented for the models with different number of nodes for Spark clusters in Amazon AWS.
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumVMware Tanzu
Data is at the center of digital transformation; using data to drive action is how transformation happens. But data is messy, and it’s everywhere. It’s in the cloud and on-premises. It’s in different types and formats. By the time all this data is moved, consolidated, and cleansed, it can take weeks to build a predictive model.
Even with data lakes, efficiently integrating multi-structured data from different data sources and streams is a major challenge. Enterprises struggle with a stew of data integration tools, application integration middleware, and various data quality and master data management software. How can we simplify this complexity to accelerate and de-risk analytic projects?
The data warehouse—once seen as only for traditional business intelligence applications — has learned new tricks. Join James Curtis from 451 Research and Pivotal’s Bob Glithero for an interactive discussion about the modern analytic data warehouse. In this webinar, we’ll share insights such as:
- Why after much experimentation with other architectures such as data lakes, the data warehouse has reemerged as the platform for integrated operational analytics
- How consolidating structured and unstructured data in one environment—including text, graph, and geospatial data—makes in-database, highly parallel, analytics practical
- How bringing open-source machine learning, graph, and statistical methods to data accelerates analytical projects
- How open-source contributions from a vibrant community of Postgres developers reduces adoption risk and accelerates innovation
We thank you in advance for joining us.
Presenter : Bob Glithero, PMM, Pivotal and James Curtis Senior Analyst, 451 Research
Scalable Predictive Analysis and The Trend with Big Data & AIJongwook Woo
This document discusses Jongwook Woo's work with Big Data AI at CalStateLA. It introduces Woo and his background, provides an overview of big data and how distributed systems enable scalable analysis of massive datasets. It also describes predictive analytics using machine learning and deep learning on big data, and how integrating GPUs into big data clusters can improve parallel processing for tasks like traffic analysis.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2….a sea of Deep learning tools but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy to use tools are now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do
Introduction to Big Data and its TrendsJongwook Woo
Big Data has been popular for the last 10 years, using Hadoop and Spark for data analysis and prediction with large-scale data sets in distributed parallel computing systems. Its platform has expanded to NoSQL DBs and Search Engines as well, and has become more popular along with cloud computing. Then, Deep Learning became a buzzword over the past several years, using GPUs and Big Data. It enables even small companies and labs to own supercomputers on a small budget, a "dream come true" situation in IT and business. In this talk, the history and trends of Big Data and AI platforms are introduced and Big Data predictive analysis is presented.
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioAlluxio, Inc.
The document discusses using Intel Analytics Zoo and Alluxio for ultra fast deep learning in hybrid cloud environments. Analytics Zoo provides an end-to-end deep learning pipeline that can prototype on a laptop using sample data and experiment on clusters with historical data, while Alluxio enables zero-copy access to remote data for accelerated analytics. Performance tests showed Alluxio providing up to a 1.5x speedup for data loading compared to accessing data directly from cloud storage. Real-world customers are using the combined Analytics Zoo and Alluxio solution for deep learning, recommendation systems, computer vision, and time series applications.
AI for All: Biology is eating the world & AI is eating Biology Intel® Software
Advances in cell biology and creation of an immense amount of data are converging with advances in Machine learning to analyze this data. Biology is experiencing its AI moment and driving the massive computation involved in understanding biological mechanisms and driving interventions. Learn about how cutting edge technologies such as Software Guard Extensions (SGX) in the latest Intel Xeon Processors and Open Federated Learning (OpenFL), an open framework for federated learning developed by Intel, are helping advance AI in gene therapy, drug design, disease identification and more.
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
Dr. Pouria Amirian explains data science, steps in a data science workflow and show some experiments in AzureML. He also mentions about big data issues in a data science project and solutions to them.
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
Dr. Pouria Amirian from the University of Oxford explains Data Science and its relationship with Big Data and Cloud Computing. Then he illustrates using AzureML to perform a simple data science analytics.
Database@Home : The Future is Data DrivenTammy Bednar
These slides were presented during the Database@Home : Data-Driven Apps event. This session will discuss the importance of data to an organisation and the need to build applications where the value within that data can easily be exploited. To achieve that aim we need to start building applications that benefit from the flexibility of new development paradigms but don't create artificial barriers of complexity that stop us from easily responding to change within our organisations.
Similar to AdClickFraud_Bigdata-Apic-Ist-2019 (20)
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
1. Predicting Fraud of Ad Click Using Traditional and Spark ML
Neha Gupta, Hai Anh Le, Maria Boldina, Jongwook Woo
HiPIC / Big Data AI Center (BigDAI)
California State University Los Angeles (CalStateLA)
APIC-IST 2019, June 24 2019
2. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Introduction
Data Set
Data Fields Details
Experiment Environment: Traditional and Big Data Systems
Work Flow in Azure ML
Data Bricks : Data Engineering
Algorithms
Appendix
References
3. Introduction
A person, automated script, or computer program imitates a legitimate user
clicking on an ad without any actual interest in the target of the ad's link
resulting in misleading click data and wasted money
Companies suffer from huge volumes of fraudulent traffic
Especially in the mobile market worldwide
Goal
Predict who will download the apps
Using classification models
With traditional and Big Data approaches
4. Introduction (Cont’d)
TalkingData
China’s largest independent big data service platform
– covers over 70% of active mobile devices nationwide
handles 3 billion clicks per day
– 90% of which are potentially fraudulent
Goal of the Predictive Analysis
Predict whether a user will download an app
– after clicking on a mobile app advertisement
To better target the audience,
– to avoid fraudulent practices
– and save money
5. Data Set
Dataset: TalkingData AdTracking Fraud Detection
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Dataset Property:
Original dataset size: 7GB
– contains 200 million clicks over a 4-day period
Dataset format: CSV
Fields: 8
– Target Column to Predict: ‘is_attributed’
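The dataset layout above can be sketched with a few synthetic rows (the field names follow the Kaggle dataset description; the values here are made up for illustration):

```python
import csv
import io

# A few synthetic rows in the layout of the TalkingData CSV
# (field names from the Kaggle dataset description; values are made up).
sample = io.StringIO(
    "ip,app,device,os,channel,click_time,attributed_time,is_attributed\n"
    "83230,3,1,13,379,2017-11-06 14:32:21,,0\n"
    "17357,3,1,19,379,2017-11-06 14:33:34,2017-11-07 08:17:19,1\n"
    "35810,3,1,13,379,2017-11-06 14:34:12,,0\n"
)
clicks = list(csv.DictReader(sample))

# 'is_attributed' is the target column: 1 = app downloaded, 0 = click only.
downloads = sum(int(row["is_attributed"]) for row in clicks)
print(downloads, "of", len(clicks), "clicks led to a download")  # 1 of 3
```

In the full dataset only about 0.19% of rows have `is_attributed = 1`, which is the class imbalance the data engineering slides address.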
6. Data Fields Details
7. Experiment Environment: Traditional and Big Data Systems
8. Experiment Environment: Traditional
Azure ML Studio:
Traditional platform for small data sets
Free workspace
10GB storage
Single node
Implement fundamental prediction models
– Using sample data: 80MB (1.1% of the original data set)
Select the best model among a number of classification algorithms
9. Experiment Environment: Spark
Spark ML:
Data Filtering:
– 1 GB from 8 GB
• Implemented Python code to reduce the size to 1GB (~15%)
– Experimental results with the full 8GB are reserved for another publication
Databricks Subscription
– Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
• 2 Spark workers with a total of 16 GB memory and 4 cores
• Python 2.7
• File system: Databricks File System
10. Experiment Environment: Spark (Cont’d)
Oracle Big Data Spark Cluster
Oracle BDCE
Python 2.7.x, Spark 2.1.x
10 nodes,
– 20 OCPUs, 300GB Memory, 1,154GB Storage
11. Work Flow in Azure ML
Relatively Easy to build and test
Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
12. Data Engineering
Unbalanced dataset
1: 0.19% (app downloaded)
0: 99.81% (app not downloaded)
The 1GB filtered dataset is still too large
for the traditional system: Azure ML Studio
More sampling needed for Azure ML
13. Data Engineering
SMOTE: Synthetic Minority Oversampling Technique
takes a subset of data from the minority class
and creates new, synthetic, similar instances
Helps balance data & avoid overfitting
Increased the percentage of the minority class (1)
from 0.19% to 11%
Stratified Split ensures that the output dataset
contains a representative sample of the values
in the selected column
Ensures that the random sample does not contain
only rows with 0s
8% sample used = 80 MB
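The interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified illustration of the technique (generate a synthetic point between a minority sample and one of its nearest minority neighbours), not the Azure ML Studio module:

```python
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest neighbours (the core idea of SMOTE;
    a simplified sketch, not the Azure ML Studio implementation)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))  # 6 synthetic minority samples
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority already occupies, which is why SMOTE balances the data without simply duplicating rows.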
14. Algorithms in Azure ML Studio
Two-Class Classification:
classify the elements of a given set into two groups
– either downloaded, is_attributed (1)
– or not downloaded, is_attributed (0)
Decision trees
often perform well on imbalanced datasets
– as their hierarchical structure allows them to learn signals from both classes.
Tree ensembles almost always outperform single decision trees
– Algorithm #1: Two-class Decision Jungle
– Algorithm #2: Two-class Decision Forest
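The claim that ensembles outperform single trees can be illustrated with a toy majority vote over three hand-written decision stumps. All thresholds, features, and data points below are made up for illustration; each stump errs on a different point, so the 2-of-3 vote is right everywhere:

```python
# Toy illustration: three imperfect decision stumps over features
# (clicks_from_ip, hour_of_day); each stump errs on a different point,
# so a 2-of-3 majority vote classifies every point correctly.
data = [
    # ((clicks_from_ip, hour_of_day), label: 1 = download, 0 = fraud-like)
    ((3, 9), 1), ((40, 15), 0), ((15, 9), 1), ((60, 23), 0), ((5, 23), 1),
]

def stump_a(x): return 1 if x[0] < 10 else 0   # very few clicks from this IP
def stump_b(x): return 1 if x[0] < 50 else 0   # looser click-count threshold
def stump_c(x): return 1 if x[1] < 12 else 0   # morning clicks look genuine

def vote(x):
    return 1 if stump_a(x) + stump_b(x) + stump_c(x) >= 2 else 0

def accuracy(clf):
    return sum(clf(x) == y for x, y in data) / len(data)

single = max(accuracy(s) for s in (stump_a, stump_b, stump_c))
ensemble = accuracy(vote)
print(single, ensemble)  # 0.8 1.0
```

Decision Forest and Decision Jungle exploit the same effect at scale: many trees trained on different views of the data vote, so their uncorrelated errors cancel.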
15. Selecting Performance Metrics
False Positives indicate
the model predicted an app was downloaded when in fact it wasn’t
Goal: minimize the FP => To save $$$
16. AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
• 8% Sample
• SMOTE 5000%
• 70:30 Train/Test Split
• Cross-Validation
• Tune Model Hyperparameters
• Features used: all 7
17. AZURE ML MODEL #1: Tune Model Hyperparameters
With vs. without Tune Model Hyperparameters:
AUC = 0.905 vs. 0.606
With tuning: Precision = 1.0 (TP = 35, FP = 0)
18. AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
• 8% Sample
• SMOTE 5000%
• 70:30 Train/Test Split
• Cross-Validation
• Tune Model Hyperparameters
• Permutation Feature Importance
19. AZURE ML MODEL #2: Improving Precision
By increasing the classification threshold from 0.5 to 0.8:
Precision increased to 0.992
FP decreased from 1,659 to 377
FN increased from 1,834 to 5,142
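The threshold trade-off can be sketched with hypothetical scores and labels (the counts below are illustrative, not the paper's): raising the threshold trades false positives for false negatives, which raises precision.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, FN at a given probability threshold."""
    tp = fp = fn = 0
    for p, y in zip(scores, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    return tp, fp, fn

# Hypothetical model scores (predicted probability of download) and labels.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.90, 0.60]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

for t in (0.5, 0.8):
    tp, fp, fn = confusion(scores, labels, t)
    print(f"threshold={t}: TP={tp} FP={fp} FN={fn} "
          f"precision={tp / (tp + fp):.2f}")
```

On these toy numbers, moving the threshold from 0.5 to 0.8 drops FP from 3 to 0 (precision 0.57 to 1.00) at the cost of one extra FN, the same trade the slide reports on the real model.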
20. Experimental Results in Azure ML Studio
Performance:
Execution time with the sample data set: 1GB
Decision Forest
– takes 2.5 hours
Decision Jungle
– takes 3 hours 19 min
The Azure ML Studio models serve as a good guide
for adopting the 2 similar algorithms in Spark ML
– Decision Tree
– Random Forest
21. Experimental Results in AzureML
Two-class Decision Forest is the best model!
22. Experiment with Spark ML in Databricks
1. Load the data source
1.03 GB
Same filtered data set as Azure ML
2. Train and build the models
o Balanced data statistically
3. Evaluate
23. Data Engineering
Generate features
Feature 1: extract day of the week and hour of the day from the click time
Feature 2: group clicks by combination of
– (Ip, Day_of_week_number and Hour)
Feature 3: group clicks by combination of
– (Ip, App, Operating System, Day_of_week_number and Hour)
Feature 4: group clicks by combination of
– (App, Day_of_week_number and Hour)
Feature 5: group clicks by combination of
– (Ip, App, Device and Operating System)
Feature 6: group clicks by combination of
– (Ip, Device and Operating System)
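The grouped click-count features above can be sketched in plain Python with `collections.Counter` (in the experiments this was done in PySpark; the records below are toy values):

```python
from collections import Counter

# Toy click records: (ip, app, device, os, day_of_week, hour) --
# the same fields the grouped features above are built from.
clicks = [
    (1001, 3, 1, 13, 2, 14),
    (1001, 3, 1, 13, 2, 14),
    (1001, 7, 1, 19, 2, 15),
    (2002, 3, 2, 13, 3, 9),
]

# Feature 2: clicks per (ip, day_of_week, hour)
f2 = Counter((ip, day, hr) for ip, app, dev, os_, day, hr in clicks)
# Feature 5: clicks per (ip, app, device, os)
f5 = Counter((ip, app, dev, os_) for ip, app, dev, os_, day, hr in clicks)

# Attach the counts back to each click as new feature columns.
enriched = [
    row + (f2[(row[0], row[4], row[5])], f5[row[:4]])
    for row in clicks
]
print(enriched[0])  # first click now carries its two group counts
```

Each of the six features is the same pattern with a different grouping key: count clicks per key combination, then join the count back onto every click as a new column.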
24. Spark ML MODEL #1: Decision Tree Classifier
Confusion Matrix
25. Spark ML MODEL #2: Random Forest Classifier
Confusion Matrix
26. Spark ML Result Comparison
The Decision Tree Classifier is the relatively better model!
            Decision Tree   Random Forest
            Classifier      Classifier
AUC             0.815           0.746
PRECISION       0.822           0.878
RECALL          0.633           0.495
TP             86,683          67,726
FP             18,727           9,408
TN          7,112,961       7,122,280
FN             50,074          69,031
RMSE           0.0972          0.1038
27. Experiment in Oracle Cluster
Oracle Big Data Spark Cluster
10 nodes, 20 OCPUs, 300GB Memory, 1,154GB Storage
1. Load the data source
1.03 GB
2. Sample the balanced data based on Downloaded
116 MB
3. Train and build the models
o Balanced data statistically
4. Evaluate
28. Azure ML Studio and Spark ML Result Comparison
Model (Platform)                                     AUC    PRECISION  RECALL  TP       FP      TN         FN       Run Time
Two-Class Decision Jungle (AzureML)                  0.905  1.0        0.001   35       0       52,306     406,605  2 hrs
Two-Class Decision Forest (AzureML)                  0.997  0.992      0.902   47,199   377     406,228    5,142    2-3 hrs
Decision Tree Classifier (Databricks)                0.815  0.822      0.633   86,683   18,727  7,112,961  50,074   22 mins
Random Forest Classifier (Databricks)                0.746  0.878      0.495   67,726   9,408   7,122,280  69,031   50 mins
Decision Tree Classifier (Balanced Sample, Oracle)   0.896  0.935      0.807   111,187  7,712   545,302    26,604   24 sec
Random Forest Classifier (Balanced Sample, Oracle)   0.893  0.934      0.800   110,220  7,791   545,223    27,571   2 mins
29. Azure ML Studio and Spark ML Result Comparison (Cont’d)
TWO-CLASS
DECISION
JUNGLE
(AzureML)
TWO-CLASS
DECISION
FOREST
(AzureML)
DECISION
TREE
CLASSIFIER
(Databricks
)
RANDOM
FOREST
CLASSIFIER
(Databricks
)
DECISION
TREE
CLASSIFIER
(Balanced
Sample Data,
Oracle)
RANDOM
FOREST
CLASSIFIER
(Balanced
Sample Data,
Oracle)
AUC 0.905 0.997 0.815 0.746 0.896 0.893
PRECISION 1.0 0.992 0.822 0.878 0.935 0.934
RECALL 0.001 0.902 0.633 0.495 0.807 0.800
TP 35 47,199 86,683 67,726 111,187 110,220
FP 0 377 18,727 9,408 7,712 7,791
TN 52,306 406,228 7,112,961 7,122,280 545,302 545,223
FN 406,605 5,142 50,074 69,031 26,604 27,571
Run Time 2 hrs 2-3 hrs 22 mins 50 mins 24 sec 2 mins
• Azure ML Two-Class Decision Forest is the best model!
• The Spark ML code needs to be updated for better accuracy
• Balanced sampling based on the fraud label in Oracle:
• Decision Tree reaches 0.935 precision
• Execution time: 24 secs
30.
Questions?
31.
Appendix
Data Set Details (Cont'd)
32.
Precision vs Recall
Positive: event occurs (fraud); Negative: event does not occur (non-fraud)
True Positive (TP): predicted fraud, and it is fraud
False Negative (FN): predicted non-fraud, but it is fraud
False Positive (FP): predicted fraud, but it is not
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Ref: https://en.wikipedia.org/wiki/Precision_and_recall
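Plugging in the Decision Tree Classifier counts from the Spark ML comparison (TP = 86,683, FP = 18,727, FN = 50,074), these formulas reproduce the reported precision and recall:

```python
# Confusion-matrix counts for the Spark ML Decision Tree Classifier
TP = 86_683   # predicted fraud, actually fraud
FP = 18_727   # predicted fraud, actually not fraud
FN = 50_074   # predicted non-fraud, actually fraud

precision = TP / (TP + FP)  # fraction of flagged clicks that are truly fraud
recall = TP / (TP + FN)     # fraction of fraudulent clicks that were caught

print(round(precision, 3))  # 0.822, as in the comparison table
print(round(recall, 3))     # ~0.634 (reported as 0.633, likely truncated)
```

The Azure ML Decision Jungle row shows why both metrics matter: with TP = 35 and FP = 0 its precision is a perfect 1.0, yet its recall of 0.001 means it misses essentially all fraud.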
33.
References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp. 308-319
2. Jongwook Woo, DMKD-00150, "Market Basket Analysis Algorithms with MapReduce", Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
34.
References
9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
12. TensorFlow Deep Learning, openSAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
14. Sqoop: Import Data from MySQL to Hive, https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. TalkingData AdTracking Fraud Detection Challenge data set, https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. Performance measures in Azure ML: Accuracy, Precision, Recall and F1 Score, https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/