A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic... (Databricks)
The strongest indicator of a cancer patient's prognosis is the number of mitotic bodies that a pathologist manually counts in high-resolution whole-slide histopathology images. Counting mitoses manually is clearly inefficient, but automating mitosis detection remains challenging due to limited training datasets and the intensive computing involved in model training and inference. This presentation introduces a large-scale deep learning approach that trains a two-stage CNN-based model to detect mitosis locations with high accuracy directly from high-resolution whole-slide images. In detail, we first train a nuclei detection model to remove background information from the raw whole-slide histopathology images. Second, a customized ResNet-50 model is trained on the dataset cleaned in the first step. The first step saves training time while improving model performance in the second step. A false-positive oversampling approach further improves model performance. With these models, inference detects mitosis locations across the large volume of histopathology images in parallel. The whole pipeline, including data preprocessing, model training, hyperparameter tuning, and inference, is parallelized using distributed TensorFlow, Apache Spark, and HDFS. The experiences and techniques from this project can be applied to other large-scale deep learning problems as well.
Speaker: Fei Hu
Presented at OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France
Big Data Analytics for the connected home: a few use cases, some important messages, and a little example. Presentation given at CEA Cadarache - Cité des Nouvelles Energies at the strategic committee of ARCSIS (http://www.arcsis.org/missions.html)
Machine Learning and AI: Core Methods and Applications (QuantUniversity)
This session was presented at the CFA Institute on May 6, 2020.
This deep-dive session discusses core methods and applications to provide an understanding of supervised and unsupervised machine learning. Participants are introduced to advanced topics including time series analysis, reinforcement learning, anomaly detection, and natural language processing. Case studies also examine how to predict interest rates and credit risk with alternative data sets and how to analyze earnings calls from EDGAR using natural language processing techniques.
In this talk, we discuss QuTrack, a Blockchain-based approach to track experiment and model changes primarily for AI and ML models. In addition, we discuss how change analytics can be used for process improvement and to enhance the model development and deployment processes.
The problem of scene classification in surveillance footage is of great importance for ensuring security in public areas. With challenges such as low-quality feeds, occlusion, viewpoint variations, and background clutter, the task is both challenging and error-prone. Therefore, it is important to keep false positives low to maintain a high detection accuracy. In this paper, we adapt high-performing CNN architectures to identify abandoned luggage in a surveillance feed. We explore several CNN-based approaches, from transfer learning on the ImageNet dataset to one-shot detection using architectures such as YOLOv3. Using network visualization techniques, we gain insight into what the neural network sees and the basis of its classification decisions. The experiments have been conducted on real-world datasets and highlight the complexity of such classifications. The obtained results indicate that a combination of the proposed techniques outperforms the individual approaches.
Author: Utkarsh Contractor
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computation and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
1. Introduction to Data Mining (Pengantar Datamining)
Anto Satriyo Nugroho, Dr.Eng
Center for Information & Communication Technology,
Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: anto.satriyo@bppt.go.id, asnugroho@ieee.org
URL: http://asnugroho.net
2. • What is data mining?
• Data mining techniques
• Examples of data mining applications
• Tutorial on using the data mining software WEKA
• Further Readings
Agenda
3. • Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– bank/credit-card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Commercial Viewpoint
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int'l Edition
4. Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int'l Edition
5. – Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
Large Scale Data: Sky Survey Cataloging
7. • Measuring the expression of genes
• Possible to obtain the expression of thousands of genes
• Disease classification
Microarray
http://cmgm.stanford.edu/pbrown/array.html
8. • Definition: the automatic (or semiautomatic) process of discovering meaningful patterns in data
• Extracting
– implicit
– previously unknown
– potentially useful
information from data
Definition of Datamining
10. Datamining Tasks
• Prediction: use some variables to predict unknown or future values of other variables
• Description: find human-interpretable patterns that describe the data
15. Rule: find the most similar pattern in the training set, then assign the class of that nearest pattern to the test data
(Figure: test pattern X plotted among Class A and Class B points; the nearest pattern belongs to Class A, so the class of X is A)
Nearest Neighbor Classifier
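The nearest-neighbor rule on this slide can be sketched in a few lines of Python; the toy training set in the usage note is illustrative, not taken from the slides:

```python
import math

def nearest_neighbor_classify(train, test_point):
    """1-NN rule: assign to the test pattern the class of the most
    similar (closest, in Euclidean distance) training pattern.

    `train` is a list of (feature_tuple, class_label) pairs."""
    nearest = min(train, key=lambda pair: math.dist(pair[0], test_point))
    return nearest[1]
```

For example, with `train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]`, the test pattern `(1, 1)` is closest to `(0, 1)`, so it is assigned class "A".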
17. Cash register data:
"Customers who bought A and B have a high probability of also buying the expensive product C"
A, B ⇒ C
Marketing strategy:
• Sell A, B and C as one set
• Place A, B and C in one corner
• Etc.
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
18. Items (products): I = {i_1, ..., i_m}
Database D: a set of transactions
Association rule: X ⇒ Y, where X ⊆ I, Y ⊆ I, X ∩ Y = ∅
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
19. For a rule X ⇒ Y, X is the antecedent and Y the consequent
Support s%: the ratio of transactions containing X ∪ Y to the total number of transactions
Confidence c%: the ratio of transactions containing X ∪ Y to the transactions containing X
Confidence & Support
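These two measures can be computed directly over a list of transactions; a minimal Python sketch (the example transactions in the test are illustrative):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """Fraction of transactions containing X that also contain Y,
    i.e. support(X ∪ Y) / support(X) for the rule X ⇒ Y."""
    union = set(X) | set(Y)
    containing_union = sum(union <= t for t in transactions)
    containing_x = sum(set(X) <= t for t in transactions)
    return containing_union / containing_x
```

For instance, over the five transactions {A,B,C}, {A,B}, {A,C}, {B,C}, {A,B,C}, the rule {A,B} ⇒ {C} has support 2/5 (two transactions contain A, B and C) and confidence 2/3 (of the three transactions containing A and B, two also contain C).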
21. • Items: m → the number of possible association rules:
Σ_{k=2}^{m} C(m, k) (2^k − 2)
• m = 10 → about 57,000 rules; m = 100 → 5.15 x 10^47 rules
• A large number of rules is generated, but only a few of them are really useful
• Useful rules:
– high scores of both support & confidence
– low support means the rule applies to only a few cases
Confidence & Support
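The rule count above can be checked with a short Python sketch of the sum; it also agrees with the standard closed form 3^m − 2^(m+1) + 1 for the total number of rules X ⇒ Y with non-empty, disjoint X and Y over m items:

```python
from math import comb

def num_association_rules(m):
    """Number of rules X => Y over m items, with X and Y non-empty and
    disjoint: for each k-item subset (k >= 2) there are 2^k - 2 ways to
    split it into a non-empty antecedent and consequent."""
    return sum(comb(m, k) * (2 ** k - 2) for k in range(2, m + 1))
```

For m = 10 this gives 57,002 rules (about 57,000), and for m = 100 about 5.15 x 10^47, matching the figures on the slide.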
23. Two aspects:
– architecture: how the neurons are connected
– training algorithm: the algorithm that adjusts the synapses to enable the ANN to perform the desired input-output mapping
Artificial Neural Network
24. Two aspects:
– architecture: multilayer perceptron
– training algorithm: backpropagation algorithm (invented by Rumelhart, 1986)
(Figure: input layer → hidden layer → output layer, connected by weights w, mapping input information to output)
Artificial Neural Network
25. Decrement of the error during the training phase of a neural network = "knowledge" acquisition
Artificial Neural Network (training phase)
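A minimal pure-Python sketch of such a training phase: a 2-4-1 sigmoid multilayer perceptron trained with backpropagation on XOR, returning the mean squared error before and after training so the "decrement of error" is visible. The network size, learning rate, epoch count, and seed are illustrative choices, not values from the slides:

```python
import math
import random

def train_xor_mlp(epochs=5000, lr=0.5, seed=1):
    """2-4-1 sigmoid MLP trained by backpropagation (per-sample SGD)
    on XOR; returns (initial_mse, final_mse)."""
    random.seed(seed)
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    H = 4  # hidden units
    W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
    b1 = [0.0] * H
    W2 = [random.uniform(-1, 1) for _ in range(H)]
    b2 = 0.0
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))

    def forward(x):
        h = [sig(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
        o = sig(sum(W2[j] * h[j] for j in range(H)) + b2)
        return h, o

    def mse():
        return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    initial = mse()
    for _ in range(epochs):
        for x, t in data:
            h, o = forward(x)
            d_o = (o - t) * o * (1 - o)  # output delta (chain rule)
            for j in range(H):
                d_h = d_o * W2[j] * h[j] * (1 - h[j])  # hidden delta
                W2[j] -= lr * d_o * h[j]
                W1[j][0] -= lr * d_h * x[0]
                W1[j][1] -= lr * d_h * x[1]
                b1[j] -= lr * d_h
            b2 -= lr * d_o
    return initial, mse()
```

Running it shows the training error falling from its initial value as the weights are adjusted, which is exactly the "knowledge acquisition" the slide describes.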
26. • Invented by Vapnik (1992)
• SVM satisfies three conditions for an ideal pattern recognition method:
– Robustness
– Theoretical analysis
– Feasibility
• In principle, SVM works as a binary classifier
• Structural Risk Minimization
Support Vector Machines
30. • Fog forecasting
• Bioinformatics
• Sky survey cataloging (Fayyad et al.)
• Spatio-temporal analysis of disease spreading using Web mining
• Foreign exchange rate prediction
• Network intrusion detection
• Etc.
Application of Datamining
31. Sky Survey Cataloging
– Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int'l Edition
32. Classifying Galaxies
(Figure: galaxy images at Early, Intermediate, and Late stages of formation. Courtesy: http://aps.umn.edu)
Class:
• Stages of formation
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int'l Edition
33. Further Readings
• Data mining books, among others:
• Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Datamining, Addison Wesley, 2006
• Budi Santosa: Datamining: Teknik Pemanfaatan Data untuk Keperluan Bisnis (Data Mining: Techniques for Exploiting Data for Business Purposes), Graha Ilmu, 2007
• Tutorial on data mining by Dr. Iko Pramudiono (NTT, Japan): http://datamining.japati.net/
• Shinichi Morishita (winner of KDD 2001): Datamining, Knowledge Discovery and Bioinformatics, translated article at http://asnugroho.wordpress.com/2006/02/05/datamining-knowledge-discovery-bioinformatics-terjemahan-artikel-prof-shinichi-morishita/ (password: gomibako)
• A.S. Nugroho: Datamining dalam Bioinformatika: menggali informasi terpendam dalam lautan data biologi (Data Mining in Bioinformatics: digging up hidden information in the sea of biological data), SDA Asia Magazine, No. 13, pp. 64-66, March 2006, http://asnugroho.wordpress.com/2006/02/06/peran-datamining-dalam-bioinformatika/