In this presentation, I show how to properly explore, analyze and find trends in various types of healthcare data so that machine learning algorithms can predict future trends in that data. The presentation focuses on data analysis for distinguishing benign from malignant cancers, but the same techniques can be applied to many other kinds of real-world data.
For a more in-depth presentation, please watch the video presentation of this slideshow linked here: https://youtu.be/gXSl2iWcJ00
2. WHO AM I? My name is Michael Batavia and I’m a freshman at the
NYU Tandon School of Engineering.
I’ve done several research projects with machine learning
and deep learning and consider it my specialty.
Some projects:
• My Regeneron STS 2021 Winning Project
• My Winning App for the 2020 Congressional App
Challenge in District NY-14
• A Full Kaggle Data Analysis on Avocado Prices
Now, I’m here to teach you the best data analysis tools to
deal with complicated health care data.
3. WHAT ARE WE LOOKING FOR?
• Before we describe the data analysis techniques we might use to analyze healthcare data,
• let's think about the types of data we might have.
• What possible data inputs might you need to pre-process when you first obtain your healthcare data?
4. TYPES OF HEALTHCARE DATA
Different types of data that we might be given to analyze:
• ECG chart readings
• H&E-stained slide images of metastatic / benign lymph nodes
• A chart containing a variety of serum measurements from diabetes patients
• JPEG / PNG photos of cancerous tumors (already preprocessed)
• A chart containing various physical attributes of a patient (height, weight, BMI, muscle mass, body fat percentage)
• Pictures of construction equipment to detect in metropolitan cities (computer vision for the blind)
• Test measurements for prosthetic limbs for veterans
• Real-time measurements of computer vision enabled walking canes
5. WHAT DOES THE DATA LOOK LIKE?
• For this workshop, we’re going to talk about the most common type of data that you
will be working with in healthcare:
• IMAGES!
When working with images, it is critical to know how many images you have on the
topic and what file format they are in. Make sure that any data analysis you do
supports that file format.
6. ORGANIZING YOUR DATA
Once you have converted your data into an
appropriate file format (via the use of online tools or
programming APIs), you need to know how to
organize your data.
• Think! If you have a bunch of breast cancer
data containing various pictures of metastatic
and benign tissue and their corresponding
labels, how would you organize the data?
This may seem like a simple step but it makes your
life much easier if you do this before you do any
complex data maneuvers.
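One common way to organize such data is a folder per class plus an index CSV mapping each file path to its label. The sketch below assumes that layout; the folder names (benign/malignant) and the build_index helper are hypothetical names for illustration:

```python
import csv
from pathlib import Path

def build_index(data_dir, out_csv):
    """Walk data_dir/<label>/*.png and write a (filepath, label) index CSV."""
    rows = []
    for label_dir in sorted(Path(data_dir).iterdir()):
        if not label_dir.is_dir():
            continue  # skip stray files at the top level
        for img in sorted(label_dir.glob("*.png")):
            # the folder name doubles as the class label
            rows.append((str(img), label_dir.name))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label"])
        writer.writerows(rows)
    return rows
```

With a structure like data/benign/... and data/malignant/..., one call produces a CSV you can hand to the loading step later.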
8. LOADING IN YOUR DATA
If you have organized your data sufficiently, you can load it into your programming language of choice.
Here are some tips to help with the loading of your data:
• If you created a CSV file as described earlier, you can use imaging libraries like PIL to load images with
common file formats into Python from their file locations. Then, you can replace the file name in your CSV with the
actual image stored in memory.
• If loading in a large set of images, be wary of memory restrictions. Try to load data in mini-batches when
possible to prevent loading delays and out of memory crashes.
• Try shuffling the files when they are loaded into the program. This will help the eventual machine learning
portion of your program not remember sequential patterns with your files. Randomness is good!
When you load in your photos, use libraries like numpy to convert those images into multi-dimensional arrays for
future pre-processing. Look at the image on the next slide for an explanation of multi-dimensional arrays.
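The tips above can be sketched roughly as follows, assuming Pillow (PIL) and numpy are installed; load_images and batches are hypothetical helper names, not part of any library:

```python
import numpy as np
from PIL import Image

def load_images(paths, rng=None):
    """Load images as numpy arrays, then shuffle so the eventual
    model cannot memorize any sequential pattern in the files."""
    rng = rng or np.random.default_rng(0)
    arrays = [np.asarray(Image.open(p).convert("RGB")) for p in paths]
    order = rng.permutation(len(arrays))
    return [arrays[i] for i in order]

def batches(paths, batch_size):
    """Yield mini-batches of image arrays instead of loading everything
    at once, to avoid out-of-memory crashes on large datasets."""
    for i in range(0, len(paths), batch_size):
        yield [np.asarray(Image.open(p).convert("RGB"))
               for p in paths[i:i + batch_size]]
```

Each loaded array has shape (height, width, 3), one channel per RGB color, which is exactly the multi-dimensional form the later slides assume.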
10. THE REAL WORLD OF HEALTHCARE DATA
If you have a collection of images containing magnified benign and malignant tumors for throat
cancer detection, what inherent problem would you find in the data?
12. NORMALIZING DATA
Once you have loaded in your image data, it is customary to normalize it so that machine
learning algorithms can handle the wide range of pixel values across the three or more channels
in the images.
Normalizing an image is quite simple once you have converted the images in your program into multi-
dimensional arrays of pixel values. All you have to do is divide each pixel value in the array by 255 (the
maximum pixel value). Libraries like numpy can do this easily through the power of broadcasting. An
alternative technique that can also be done is standardization with nearly identical results.
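A minimal sketch of both options, assuming images are stored as uint8 numpy arrays; normalize and standardize are illustrative names:

```python
import numpy as np

def normalize(images):
    """Scale uint8 pixel values from [0, 255] down to [0.0, 1.0].
    numpy broadcasting divides every pixel in one operation."""
    return images.astype("float32") / 255.0

def standardize(images):
    """Alternative: shift to zero mean and unit variance.
    The small epsilon guards against division by zero."""
    images = images.astype("float32")
    return (images - images.mean()) / (images.std() + 1e-7)
```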
13. AUGMENTING YOUR DATA
A common technique that is done with most images that are put through neural networks is to augment them through an image data
generator.
An image data generator will take already existing images and apply small transformations to them, customized to your liking. These
transformations should reflect the kinds of variation you expect in your testing data. A full list of common transformations
is given below. All you need to do is apply the generator to the training data of the neural network before the training begins, and
both the original and augmented images will be fed into the neural network!
Common Augmentations:
• Rotating 90 Degrees either clockwise or counterclockwise
• Flipping an image vertically or horizontally
• Translating the image vertically or horizontally
• Increasing the brightness or contrast of an image
• Shearing or zooming in on an image
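Frameworks such as Keras provide image data generators that apply these automatically; as a rough illustration, a few of the augmentations above can also be written by hand in numpy (augment is a hypothetical helper, and the shift of 5 pixels is an arbitrary choice):

```python
import numpy as np

def augment(image, rng):
    """Apply one randomly chosen transformation from the common list."""
    choice = rng.integers(4)
    if choice == 0:
        return np.rot90(image)                   # rotate 90° counterclockwise
    if choice == 1:
        return np.fliplr(image)                  # flip horizontally
    if choice == 2:
        return np.flipud(image)                  # flip vertically
    return np.roll(image, shift=5, axis=0)       # translate vertically
```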
14. APPLYING A NEURAL NETWORK
For those of you who have experience with neural networks, what type of neural
network would work best when dealing with images?
What type of inner components would you put in the neural network to lead to optimal
results?
15. CHOOSING YOUR NEURAL NETWORK
The most common type of neural network to use for analyzing patterns in images is a
convolutional neural network (CNN).
A convolutional neural network is a specific type of neural network that can find both
small and large patterns in classes of images through operations called convolutions and
pooling, and use those patterns to differentiate one class from another. By picking up on the
minute differences between images, a convolutional neural network can classify far faster and
more accurately than manual classification.
With healthcare data, this is especially important! Fast classifications lead to rapid pre-
diagnosis!
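To make the two operations concrete, here is a toy, pure-numpy sketch of a single convolution and a max-pooling step; real CNNs apply many learned filters in stacked layers, and conv2d and max_pool are illustrative names:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a 2-D image ("valid" mode, no padding).
    Like deep learning libraries, this is actually cross-correlation:
    the kernel is not flipped."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep only the strongest activation in each size x size window,
    shrinking the feature map while preserving the dominant patterns."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # crop to a multiple of the window
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
```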
17. USING A PRE-TRAINED NEURAL NETWORK
Although custom-crafted CNNs may successfully classify between two or more classes, it
may sometimes be easier to use a pre-trained CNN to deal with your problem.
A pre-trained neural network is a neural network that has already been compiled and
trained on a specific training set. Common types of pre-trained neural networks include
ResNet, YOLOv5, VGG16 and EfficientNet.
Since a pre-trained neural network was built for a specific dataset, it is up to you to adapt its
inputs and outputs to work with your own dataset. It is also your responsibility to decide whether
to reuse the pre-trained weights in your own network; this often depends on the types of images
that the pre-trained network was trained on.
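As a toy illustration of the idea of freezing the pre-trained layers and attaching a new output head for your own classes (the weights here are random stand-ins, not a real pre-trained model, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained weights: these stay FROZEN during training.
W_frozen = rng.normal(size=(64, 16))
# New output head, sized for OUR problem (2 classes), trained from scratch.
W_head = np.zeros((16, 2))

def features(x):
    """Frozen pre-trained layers, reused as a generic feature extractor."""
    return np.maximum(x @ W_frozen, 0)  # ReLU activation

def predict(x):
    """Only the new head maps the generic features to our class scores."""
    return features(x) @ W_head
```

During fine-tuning, only W_head (and perhaps the last few frozen layers) would receive gradient updates.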
19. OPTIMIZING YOUR NEURAL NETWORK
There are some optimizations that you can do to your neural network to make it even faster and possibly even more efficient. It’s up to you
whether to implement these optimizations to your custom neural network or to your pre-trained neural network.
1. Fine-tune layers of a pre-trained neural network
a. Leads to the pre-trained neural network learning more specific patterns for your classification task
2. Use tuned versions of the ReLU activation layer
a. Avoids the “dying ReLU” problem, where neurons get stuck outputting zero and stop learning.
b. Instead, use LeakyReLU or ELU activation.
3. Use early stopping and model checkpoints.
a. With early stopping, you save CPU/GPU time by halting training once your model stops improving by more than a
threshold ε that you set in the code.
b. With model checkpoints, you can start/stop training in your neural network at any time. You can also save a checkpoint when your
model is finished to make it portable. This way, you can test the model anywhere on any data rather than having it restricted to
your computer.
4. Create a constant that reflects the class imbalance in the data (output bias).
a. Using this constant will speed up convergence in your neural network and eliminate training periods where the network is just
learning the class imbalance.
b. The formula to calculate this output bias is shown in the figure to the right: for binary classification it is b₀ = ln(pos/neg), where pos and neg are the counts of positive and negative examples.
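A rough sketch of points 3a and 4, assuming the binary-class bias formula b₀ = ln(pos/neg); output_bias and should_stop are hypothetical helper names:

```python
import numpy as np

def output_bias(labels):
    """Initial output bias b0 = ln(pos/neg), so the network starts at the
    base rate instead of spending epochs learning the class imbalance."""
    pos = np.sum(labels == 1)
    neg = np.sum(labels == 0)
    return float(np.log(pos / neg))

def should_stop(losses, patience=3, epsilon=1e-3):
    """Early stopping: halt when the loss has not improved by at least
    epsilon over the best earlier value for `patience` epochs."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return min(losses[-patience:]) > best_before - epsilon
```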
20. NEURAL NETWORK METRICS FOR HEALTHCARE
DATA
When our neural network finally trains, what metrics do we want to look at in order to see how well the network can differentiate between benign/malignant
cancer cells or between positive/negative diagnoses of diabetes?
• Validation Accuracy
• This metric measures the accuracy of the neural network’s classifications on a set of data that the neural network has never seen (validation
data). This metric helps to see how the neural network can generalize on new data outside of its training data.
• Precision
• This metric measures how many of the predictions made for one class are correct. For example, a model with 50% cancer diagnosis
precision is correct 50% of the time when it predicts that a cell is malignant.
• Recall
• This metric measures how much of the actual class the model identified. For example, a model with 11% cancer diagnosis recall
correctly identifies 11% of all malignant tumors in the data.
• AUROC
• The area under the receiver operating characteristic curve (AUROC) is a measure of how well your neural network can distinguish
between two classes. It is usually reported as another validation metric in addition to validation accuracy when doing binary classification
for healthcare data.
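These metrics can be computed directly from prediction counts; the sketch below is illustrative (libraries like scikit-learn provide tested versions), with a rank-based AUROC that reads as the probability a random positive outscores a random negative:

```python
def precision(tp, fp):
    """Of all positive predictions, the fraction that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, the fraction the model found."""
    return tp / (tp + fn)

def auroc(scores, labels):
    """Probability a random positive example scores higher than a
    random negative one (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```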
21. CREATING RESULT GRAPHS
Once you obtain your metrics, it is quite wise to create graphs to represent the results of your deep
learning investigation so that other scientists can easily understand your conclusions.
You can easily do this through the use of graphing libraries like matplotlib, seaborn or bokeh. All you
need to do is choose which plots to present with your research. Common choices include a plot of
your neural network’s training and validation accuracy over time, a plot of your neural
network’s training and validation loss over time, a plot of your neural network’s architecture
and possible visualizations of your data with comparisons between the ground truth label and
the network’s predicted label.
You can also create other graphs but these graphs depend on the specific experiments that you
perform in your paper. Let’s take a look at a few examples.
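For instance, the accuracy-over-time plot might be produced with matplotlib roughly like this; plot_history is a hypothetical helper, and the Agg backend is chosen so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render to files, no display window needed
import matplotlib.pyplot as plt

def plot_history(train_acc, val_acc, out_path="accuracy.png"):
    """Plot training vs. validation accuracy per epoch and save to disk."""
    epochs = range(1, len(train_acc) + 1)
    plt.figure()
    plt.plot(epochs, train_acc, label="training accuracy")
    plt.plot(epochs, val_acc, label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.savefig(out_path)
    plt.close()
    return out_path
```

A widening gap between the two curves is the classic visual sign of overfitting, which is exactly what such a plot lets readers check at a glance.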
24. Research Quality Colab Notebook
If you are interested in seeing how I was able to create a
research paper using a custom-crafted CNN to detect real-
world breast cancer tumor images in lymph nodes, I highly
suggest you check out my Colab notebook here:
https://github.com/AstroNoodles/Mini-Projects/blob/master/Parallel_Sync_CNN_Research_B.ipynb.
It makes use of more advanced techniques such as
hyperparameter tuning, TensorBoard and batch
normalization, but through more research and
experimentation you will be able to learn these ideas as
well, and submit your own research project or deep neural
network to competitions like this one!
25. Thanks for Listening To My Presentation!
I hope you enjoyed it as much as I enjoyed making it!
Are there any questions for me?
My website: https://astronoodles.github.io/