Automated Pathologists?
Breast Cancer Detection Using Machine Learning
Your doctor suspects you have breast cancer after a mammogram. They book you for a biopsy to confirm the diagnosis. You wait for the date when the doctor will remove cells from your body to be tested in a lab by a pathologist. But what do you do after the biopsy? In Canada, it typically takes 10 days for a biopsy result to come back, and even after the dreadful wait, how can you be sure the diagnosis was correct?
Misdiagnosis In Healthcare
A study published in the Journal of the American Medical Association determined that only one-half of the abnormal, precancerous breast cells were diagnosed correctly, and around one-third of the cases were misdiagnosed as normal or not worrisome.
This is a little concerning…
Moreover, many of the normal cases were diagnosed as suspicious, resulting in healthy women receiving invasive and potentially dangerous treatments. What’s worse is that around 13% of women develop breast cancer in their lifetime. While pathologists are really good at diagnosing severe cases, it becomes difficult when looking at precancerous or mild cases of breast cancer… So how can we increase the success rates for diagnoses?
Well, that’s what we will look at today! 🌝
I created a breast cancer detection model that uses Machine Learning to determine whether one has breast cancer or not.
How Breast Cancer is Currently Diagnosed
To understand how my project works, let’s look at how professional pathologists determine whether or not the cells are cancerous.
Pathologists examine the cells and tissues under a microscope and look for a few key indicators:
- ↔️ Size — Cancer cells tend to be larger than normal, healthy cells and often vary in size when compared to neighbouring cancer cells.
- 🎨 Colour — Cancer cells tend to be darker than normal, healthy cells. Furthermore, when the nucleus of cancer cells is stained with certain dyes, it looks darker. The nucleus from a cancer cell is usually larger because it often contains an excess amount of DNA.
- 🔷 Shape — Normal cells have a certain job to complete in the body, and hence have a certain shape. Cancer cells do not and thus have a distorted shape.
- 🔗 Arrangement — Cells often have specific arrangements, for example, gland tissue in the breast is organized into lobules, where the milk is produced, and ducts, where the milk is carried from the lobules to the nipple. Cancer cells do not multiply in an organized manner, but rather clump together recognizably.
Pathologists look at these key indicators, decide if the specimen is cancerous, and send the result to the doctors, who then relay it to the patient. 🔍 For my project, I took these key indicators and trained my model to recognize them so that it can accurately differentiate between normal cells and cancerous cells. 🕵🏽‍♀️
My Project
As I mentioned, I used machine learning to train a model that could determine whether or not a sample was cancerous. Machine learning is a subset of artificial intelligence; at a high level, it is a method of analyzing data that automates analytical model building. It learns from data and identifies patterns that can be used to make decisions with minimal human intervention. Machine learning can be split into four main categories: supervised, semi-supervised, unsupervised, and reinforcement.
- Supervised learning is when the model is trained by example. The algorithm is provided with a known dataset that includes both inputs and desired outputs. The algorithm then finds a method to determine how to arrive at those outputs. The algorithm is corrected by the operator and this process continues until the algorithm achieves a high level of accuracy.
- Semi-supervised learning is essentially the same as supervised learning except that the data includes both labelled and unlabelled examples. The algorithm uses the labelled data to learn how to label the unlabelled data.
- Unsupervised learning is when the algorithm examines a large dataset and identifies correlations and relationships on its own. The algorithm tries to organize the information, either by clustering or grouping it or by simply arranging it in a more organized manner.
- Reinforcement learning is when the algorithm is provided with a set of actions, parameters, and end values or goals. It learns through trial and error which actions lead to the most optimal outcome.
My Approach
Data
I went with supervised learning as my data had known inputs and outputs which I could train my algorithm based on, but before I could start the actual machine learning portion of my project I imported some libraries to make my life easier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The first step I took was inspecting the data to see if there were any empty values, and what I found was that the last column was completely empty, so I deleted it.
#Count the number of rows and columns in the data set
df.shape
#Count the number of empty (NaN, NAN, na) values in each column
df.isna().sum()
#drop the last column (it's empty)
df = df.dropna(axis=1)
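To make the cleaning step concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last column is entirely empty,
# like the one described above
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every column that contains at least one missing value
cleaned = toy.dropna(axis=1)
print(cleaned.shape)
```

Note that `dropna(axis=1)` removes any column with at least one missing value. If only completely empty columns should go, `dropna(axis=1, how="all")` is the more conservative choice.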
I also found that the “id” column in my data was irrelevant to the project. However, instead of deleting it from the dataset, I made sure not to use it while visualizing the data and training the model. Furthermore, I changed the labels that marked each tumour as either malignant or benign into binary integers so my model would produce either a 0 or a 1 as the output: 0 meaning the tumour is benign and 1 meaning the tumour is malignant.
#encode the categorical data values (M and B)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
df.iloc[:,1] = labelencoder_Y.fit_transform(df.iloc[:,1].values)
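As a small illustration of what the encoder does, the sketch below runs it on a handful of made-up labels (the list is an assumption for demonstration only).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical diagnosis labels: M = malignant, B = benign
labels = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder sorts the classes, so "B" becomes 0 and "M" becomes 1
print(list(encoder.classes_))
print(list(encoded))
```

Because the classes are sorted alphabetically, benign ends up as 0 and malignant as 1, which matches the encoding described above.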
Visualizing the Data
The next step I took was creating graphs for my data. What I wanted to do was make it easier to visualize the data and see the patterns my model was going to detect. I used the alias “sns” for the Seaborn Python library and called its pairplot function, which creates a grid of axes such that each variable in the data is shared across the y-axes of a single row and the x-axes of a single column. I also limited the data to only the first 5 rows and excluded the “id” column when making the visualizations. I also created other visualizations, including a heat map of the correlations between the features. Check out my code to see that!
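A correlation heat map is drawn from a correlation matrix. The sketch below computes one with pandas on a few made-up measurements; the values and the choice of columns are assumptions for illustration, not taken from the project.

```python
import pandas as pd

# Hypothetical measurements for four samples (values are made up)
toy = pd.DataFrame({
    "diagnosis": [1, 1, 0, 0],
    "radius_mean": [18.0, 20.5, 12.0, 11.0],
    "texture_mean": [10.4, 17.8, 14.0, 15.0],
})

# Pairwise Pearson correlation between all numeric columns;
# this is the table a heat map would colour in
corr = toy.corr()
print(corr.round(2))
```

With Seaborn installed, passing this matrix to `sns.heatmap(corr, annot=True)` would be one way to plot it.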
Machine Learning Part
Here’s where the fun stuff happens. After taking the time to clean the data and create visualizations to understand the relationships, it’s time to create the machine learning model! 🎉
Huge shoutout to randerson112358’s article for taking me through the whole process. If anyone here is a beginner and wants to dive into the world of machine learning, I would definitely recommend his tutorial.
The next step was separating my data into X (the other information that helps determine whether or not the tumour is malignant) and Y (whether or not the tumour is malignant).
#Split the data set into X (features) and Y (diagnosis)
X = df.iloc[:,2:31].values
Y = df.iloc[:,1].values
To train a machine learning model, you need a training set (data I would train my model with) and a testing set (data I would test my final model’s accuracy with). This way we can check for overfitting. Overfitting occurs when the model performs very well on the given dataset but fails when working with new data. I wanted to split my data with a 75:25 ratio, which means 75% of my data would be used to train my model and 25% would be used to test whether or not my model is accurate. To accomplish this, I imported and used the train_test_split function from the sklearn API.
#split the dataset into 75% training and 25% testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
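To check what the split produces, the sketch below applies the same call to placeholder arrays. The 100-row size and the dummy values are assumptions chosen to make the 75:25 ratio easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 2 dummy features each
X_demo = np.arange(200).reshape(100, 2)
Y_demo = np.arange(100) % 2

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=0
)

# 75 rows go to training and 25 rows go to testing
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the same split on every run.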
As I mentioned earlier, I went with supervised learning to train my model, but within supervised learning there are tons of different algorithms, such as random forests, decision trees, linear regression, and logistic regression, and the list goes on. I decided to try three different models: logistic regression, a random forest classifier, and a decision tree.
I created a function that would train all three of the models one after another. For each one I imported the model from sklearn, assigned it to a variable and trained it on the data we split earlier.
#create a function for the models
def models(X_train, Y_train):
    #Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)

    #Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, Y_train)

    #Random Forest Classifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators = 15, criterion = 'entropy', random_state = 0)
    forest.fit(X_train, Y_train)
Then to check how accurate the model was with the training data I printed the score for each of them.
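    #Print the models' accuracy on the training data (still inside models())
    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
    print('[1]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
    print('[2]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))
    return log, tree, forest

#Train all three models and keep them for evaluation
model = models(X_train, Y_train)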
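A quick way to see how training and testing accuracy can differ is to score one model on both splits. The sketch below is illustrative rather than the project code: it assumes scikit-learn's bundled breast cancer dataset as stand-in data, and the variable names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled breast cancer dataset
# (an assumption; note it encodes malignant as 0 and benign as 1)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

tree_demo = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree_demo.fit(X_tr, y_tr)

train_acc = tree_demo.score(X_tr, y_tr)  # accuracy on data the tree has seen
test_acc = tree_demo.score(X_te, y_te)   # accuracy on held-out data
print("train:", train_acc, "test:", test_acc)
```

An unrestricted decision tree can memorize its training rows, so its training score tends to look better than its score on unseen data.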
In reality, the training accuracy can’t tell us much; what really matters is the accuracy on the testing data. To measure it, I imported the classification_report and accuracy_score functions, once again from the sklearn API. I then wrote a loop that runs through the three models and prints the model number, classification report, and accuracy score. The classification report measures the quality of the predictions (precision, recall, and F1-score per class), while the accuracy score gives the fraction of test samples the model classified correctly.
#Show a way to get metrics of the models
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

for i in range(len(model)):
    print('Model ', i)
    print()
    print(classification_report(Y_test, model[i].predict(X_test)))
    print()
    print(accuracy_score(Y_test, model[i].predict(X_test)))
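To show what these metrics measure, here is a hand-rolled sketch of accuracy, precision, and recall for the malignant class. The function name and the labels are made up for illustration; in practice the sklearn functions above do this work.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 1 = malignant, 0 = benign
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

For a cancer screen, recall deserves special attention because every false negative is a malignant sample that was missed.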
I was able to train and test the models and achieved the following accuracies.
- Random Forest Classifier: 97.2% (0.972027972027972)
- Decision Tree: 93.7% (0.9370629370629371)
- Logistic Regression: 95.1% (0.951048951048951)
While the random forest’s roughly 97.2% accuracy is pretty good, my model is something that would be implemented in healthcare, so I wanted to push the accuracy higher. I tried tuning the random forest classifier and was able to reach an accuracy of 97.9%. Lastly, I tried an XGBoost gradient boosted decision tree, trained and evaluated the same way as the other models, and reached an accuracy of 99.3% after some tuning.
Here is a link to my code if you want to check it out! 🔦
How does this work?
We know that using libraries with pre-built models works, but what’s happening in the background? How are three lines of code able to detect cancer? To answer this, let’s dive into how gradient boosted decision trees work.
Step 1 — Calculate the average of the target label
For our specific example, every label is either 0 or 1, so the average falls somewhere between the two. Here 0 means the specimen is benign and 1 means the specimen is malignant.
Step 2 — Calculate the residuals
In our example, this means calculating how far the average is from each actual value.
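The first two steps can be sketched in a few lines of plain Python. The labels below are made up for illustration.

```python
# Made-up target labels: 1 = malignant, 0 = benign
y = [1, 0, 0, 1, 0]

# Step 1: the initial prediction is the average of the labels
initial_prediction = sum(y) / len(y)

# Step 2: residual = actual value - current prediction
residuals = [label - initial_prediction for label in y]

print(initial_prediction)
print(residuals)
```

Malignant samples get a positive residual and benign samples a negative one, which is the signal the first tree will try to predict.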
Step 3 — Construct a decision tree
Then, we build a tree intended to predict the residuals; each leaf will contain a prediction of the value of the residual. A leaf is an end node of a decision tree. When there are more residuals than leaves, some residuals will share the same leaf, and that leaf predicts their average.
Step 4 — Predict the target label using all of the decision trees
The data sample passes through the decision nodes until it reaches a leaf, which contributes to the prediction of whether or not the specimen is cancerous. If each tree’s output were added at full strength, the ensemble would fit the training data very quickly and risk overfitting. To prevent this, a hyper-parameter called the learning rate is introduced. When making a prediction, each tree’s predicted residual is multiplied by the learning rate, which forces the model to use more decision trees that each take a small step towards the final solution. In practice, taking small steps towards the desired solution trades a little bias for lower overall variance, which means the model can achieve better accuracy on samples outside of the training data.
Step 5 — Compute the new residuals
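New residuals are calculated by subtracting the updated predictions from the actual values. These residuals become the targets for the leaves of the next decision tree, and more decision trees are built.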
Step 6 — Repeat steps 3–5 until the # of iterations matches the number specified by the hyper-parameter
This hyper-parameter is the number of estimators. For my model, I used 10,000 estimators.
Step 7 — Use all of the trees in the ensemble and make a final prediction
The final prediction will be equal to the mean calculated in the first step plus the residuals predicted by every tree in the ensemble, each multiplied by the learning rate.
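Putting the seven steps together, here is a minimal from-scratch sketch in plain Python. It is a simplified assumption-laden illustration: it uses a single made-up feature, one-split trees ("stumps"), and squared-error residuals. Libraries such as XGBoost are considerably more involved, so treat this as a teaching aid rather than what the library does internally. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Step 3: find the single threshold split that best predicts the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, x):
    """Return the leaf value (average residual) for a sample."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                       # step 1: average label
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):                  # step 6: repeat
        # steps 2 and 5: residual = actual - current prediction
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)           # step 3: fit a tree
        stumps.append(stump)
        # step 4: take a small step scaled by the learning rate
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def predict(model, x):
    """Step 7: mean plus the scaled output of every tree."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Made-up data: one feature (say, a cell radius) and a 0/1 label
xs = [11.0, 12.5, 13.0, 17.5, 19.0, 20.5]
ys = [0, 0, 0, 1, 1, 1]

model = gradient_boost(xs, ys)
print(predict(model, 12.0), predict(model, 19.5))
```

Each tree only nudges the prediction by a tenth of the remaining error, which is why many estimators are needed before the outputs settle near 0 and 1.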
And we’re done! 🥳 🎉
Future Applications of Artificial Intelligence in Healthcare
False-positive mammograms and the over-diagnosis of breast cancer among women ages 40 to 59 alone cost the healthcare system $4 billion annually. 💸 Furthermore, there are thousands of women who are falsely assured they do not have breast cancer. There is not only a huge market for technology that can better diagnose patients, but also a huge necessity. It can save millions of lives, and machine learning can be applied to many diseases other than breast cancer, such as pneumonia, lung cancer, and strokes. Technology like artificial intelligence may have a long way to go before it can be widely implemented in the healthcare industry, but it is important to recognize that it can provide quick and accurate care management, all with a few taps.
📣 Go check out my article about the future of healthcare to learn more about how CNNs work and how we can apply technology to other areas of healthcare!
TL;DR
- Misdiagnosis is a huge modern-day problem and 1 in 3 misdiagnoses results in serious injury or death.
- I created a project using machine learning that can differentiate between benign breast cells and malignant breast cancer cells with a 99.3% accuracy rate. 💪
- False-positive mammograms and the over-diagnosis of breast cancer among women ages 40 to 59 alone cost the healthcare system $4 billion annually, creating a huge market for technology that can improve medical diagnoses. 💰
Main Take-Aways ✨
- Machine learning and deep learning have tons of applications that can find a space in healthcare. From diagnoses to prescribing medications we can drastically improve the quality, efficiency and scalability of the most important components of the healthcare system.
- We will always need doctors to give their professional opinion, but we can implement technology as a second layer, making sure it is accurate.
- Data is something we lack at the moment. While there are many privacy concerns surrounding the collection of patient information, it is going to be necessary if we wish to transform the healthcare system and implement technology widely.
- Always try different models, and make sure you don’t end up tuning a model so that it fits only the training data or only the testing data. It is a fine balance of getting both accuracies high enough.
Hey there! Thanks for reading through my article. I hope it provided you with some insights. ❤️
If you would like to read more articles like this and continue hearing my thoughts about healthcare and technology, make sure to subscribe to my Medium! 📣
If you would like to learn more about me, talk about a new discovery in healthcare, or chat over some coffee, feel free to connect with me on LinkedIn or follow me on Twitter!
Bye for now! 😉