In February we held an internal test of our prostate cancer detection product. At this stage it is very important for us to reveal every possible weak point of the product so that we can finish the final work on it before entering the market.
The test was held with the support of the Department of Pathology of Moscow Central Hospital, which has been our partner since our very start. During the test we compared the diagnostic results of three participants:
- a pathologist of Moscow Central Hospital with 1 year of medical experience;
- a pathologist of Moscow Central Hospital with 5 years of medical experience;
- the Skychain prostate cancer neural network.
All three participants had to analyze 108 digital slides belonging to 10 different patients. Each slide featured 2 to 4 biopsy samples. Each participant had to analyze every biopsy sample and fill out a form stating which nosologies were found in the sample. The list of nosologies:
- AT (atrophy);
- O (acinar carcinoma);
- PENST (foam-cell carcinoma);
- PROT (ductal carcinoma);
- VHR (chronic inflammation);
- N (normal tissue);
- PIN (prostatic intraepithelial neoplasia).
For each O (acinar carcinoma), the Gleason Score is also provided. Each Gleason grade ranges from 1 to 5 and describes how much the cancer from a biopsy looks like healthy tissue (lower grade) or abnormal tissue (higher grade). Most cancers are assigned a grade of 3 or higher.
Since prostate tumors are often made up of cancerous cells that have different grades, two grades are assigned for each patient. A primary grade describes the cells that make up the largest area of the tumor, and a secondary grade describes the cells of the next largest area. For instance, if the Gleason Score is written as 3+4=7, most of the tumor is grade 3 and the next largest section of the tumor is grade 4; together they make up the total Gleason Score of 7. If the cancer is almost entirely made up of cells with the same grade, that grade is counted twice to calculate the total Gleason Score.
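The arithmetic described above can be sketched in a few lines of Python. The function name `gleason_score` is hypothetical and used only for illustration; it adds the primary and secondary grades and, when only one grade is given, counts it twice.

```python
from typing import Optional


def gleason_score(primary: int, secondary: Optional[int] = None) -> int:
    """Return the total Gleason Score from a primary and a secondary grade.

    If the tumor consists almost entirely of one pattern, pass only the
    primary grade and it is counted twice.
    """
    if secondary is None:
        secondary = primary
    for grade in (primary, secondary):
        if not 1 <= grade <= 5:
            raise ValueError(f"Gleason grade must be between 1 and 5, got {grade}")
    return primary + secondary


print(gleason_score(3, 4))  # 3 + 4
print(gleason_score(5))     # single pattern, counted twice: 5 + 5
```

Note that 3+4 and 4+3 give the same total, which is why the two grades are normally reported together with the sum.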
Moreover, in cases of O (acinar carcinoma), each participant had to provide the estimated percentage of tumor on each slide.
All of the cases were analyzed beforehand by an expert pathologist with more than 10 years of experience, and all the outcomes were later confirmed. The expert’s opinion was used as the reference standard.
The time each specialist and the neural network took to carry out the analysis was also recorded.
We received the results of each participant for every individual slide.
This is an example of a doctor’s opinion on one of the patients:
Slide 1: O, PIN, Gleason Score 3 + 4 = 7, occupies 30% of the punctate length.
Slide 2: O, PIN, Gleason Score 4 + 3 = 7, occupies 100% of the punctate length.
Slide 3: O, Gleason Score 3 + 4 = 7, occupies 90% of the punctate length.
Slide 4: O, PIN, Gleason Score 3 + 3 = 6, occupies 90% of the punctate length.
Slide 5: O, PIN, Gleason Score 3 + 3 = 6, occupies 90% of the punctate length.
Slide 6: O, PIN, Gleason Score 3 + 3 = 6, occupies 80% of the punctate length.
Slide 7: O, Gleason Score 4 + 4 = 8, occupies 50% of the punctate length.
Slide 8: O, Gleason Score 4 + 3 = 7, occupies 40% of the punctate length.
Slide 9: O, PIN, Gleason Score 4 + 4 = 8, occupies 50% of the punctate length.
Slide 10: O, Gleason Score 5 + 5 = 10, occupies 40% of the punctate length.
Slide 11: O, Gleason Score 5 + 5 = 10, occupies 90% of the punctate length.
Slide 12: N, small columns of prostate tissue without tumor growth.
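A report like the one above is easy to turn into structured records for later comparison. The sketch below is only an illustration: `parse_slide_line` and the `SlideReport` fields are hypothetical names, and the regular expressions assume the exact wording shown in the example lines.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Labels from the list of nosologies in this post.
LABELS = {"AT", "O", "PENST", "PROT", "VHR", "N", "PIN"}


@dataclass
class SlideReport:
    slide: int
    labels: List[str]
    gleason: Optional[Tuple[int, int]]  # (primary, secondary), if reported
    tumor_percent: Optional[int]        # share of the punctate length, if reported


def parse_slide_line(line: str) -> SlideReport:
    """Parse one line such as 'Slide 1: O, PIN, Gleason Score 3 + 4 = 7, ...'."""
    head, _, body = line.partition(":")
    slide = int(re.search(r"\d+", head).group())
    # Only comma-separated tokens that exactly match a known label are kept.
    labels = [tok.strip() for tok in body.split(",") if tok.strip() in LABELS]
    g = re.search(r"Gleason Score\s*(\d)\s*\+\s*(\d)", body)
    p = re.search(r"(\d+)\s*%", body)
    return SlideReport(
        slide=slide,
        labels=labels,
        gleason=(int(g.group(1)), int(g.group(2))) if g else None,
        tumor_percent=int(p.group(1)) if p else None,
    )


report = parse_slide_line(
    "Slide 1: O, PIN, Gleason Score 3 + 4 = 7, occupies 30% of the punctate length."
)
print(report)
```

Keeping the primary and secondary grades as a pair, rather than only the total, preserves the difference between 3 + 4 and 4 + 3.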
Diagnosis labels (O, PENST, PIN, PROT) and Gleason Scores are categorical variables. You cannot apply ordinary arithmetic to categorical variables (only “equal” or “not equal” comparisons), even when their values are numbers. Cohen’s kappa coefficient, calculated from a confusion matrix, is suitable for evaluating such variables. Cohen’s kappa measures agreement between two evaluators, each classifying N items into C mutually exclusive categories.
For each patient, the estimates predicted by the neural network are compared with the corresponding estimates of the selected doctors, and Cohen’s kappa is calculated. Then the distribution of this metric is calculated for the selected doctor.
Explanation: in our case, Cohen’s kappa is calculated on the 108 test slides for each expert and for the prediction of our neural network. With N experts (here N is the number of experts, not the number of items) we get N + 1 values.
Cohen’s kappa formula:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is the relative observed agreement between the evaluators (identical to accuracy), and $p_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the likelihood of each observer randomly choosing each category.
Simply put, Cohen’s kappa shows how closely the opinion of one expert coincides with the reference standard. The closer it is to 1, the more the opinions coincide.
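To make the calculation concrete, here is a minimal sketch of Cohen’s kappa written from the standard definition, kappa = (p_o - p_e) / (1 - p_e). It is not the code used in the test; the function name `cohen_kappa` and the six example labels are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label everywhere; kappa is formally
        # undefined, and this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for six items, for illustration only (not data from the test).
reference = ["O", "O", "N", "N", "O", "N"]
candidate = ["O", "O", "N", "O", "O", "N"]
print(round(cohen_kappa(reference, candidate), 3))
```

In this made-up example the raters agree on 5 of 6 items (p_o ≈ 0.833) while chance agreement is 0.5, so kappa is about 0.667. In practice a library routine such as scikit-learn’s `cohen_kappa_score` computes the same statistic.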
As you can see, the body of the pink candlestick (VHR) is closer to 1 in Skychain’s column. That means that Skychain outperforms both experts in identifying VHR (chronic inflammation). The same can be said for the AT and PENST labels. For the PROT label, Skychain is at exactly 1, meaning that it made no mistakes in identifying ductal carcinoma. In the other classes Skychain shows results comparable to both experts.
If we average Cohen’s kappa over the classes for each patient, the 7 per-class values “collapse” into one. As a result, for 10 patients we get one list of average Cohen’s kappa values, which still consists of 10 elements. We build a similar diagram from it.
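The “collapse” is a plain mean over the seven classes, taken separately for each patient. A minimal sketch, assuming the per-class kappa values are stored in a dictionary per patient; the function name, the data layout and the numbers are placeholders, not results from the test.

```python
from statistics import mean
from typing import Dict, List

CLASSES = ["AT", "O", "PENST", "PROT", "VHR", "N", "PIN"]


def average_kappa_per_patient(kappas: Dict[str, Dict[str, float]]) -> List[float]:
    """Collapse per-class kappa values into one mean value per patient.

    `kappas` maps a patient id to a {class label: kappa} dictionary.
    """
    return [mean(per_class[c] for c in CLASSES) for per_class in kappas.values()]


# Placeholder values for two hypothetical patients (not results from the test).
example = {
    "patient_1": {"AT": 1.0, "O": 0.8, "PENST": 1.0, "PROT": 1.0, "VHR": 0.6, "N": 0.9, "PIN": 0.7},
    "patient_2": {"AT": 1.0, "O": 1.0, "PENST": 1.0, "PROT": 1.0, "VHR": 0.5, "N": 1.0, "PIN": 1.0},
}
averaged = average_kappa_per_patient(example)
print(len(averaged), [round(v, 3) for v in averaged])
```

With 10 patients the same call would return a list of 10 averages, one per patient, for each participant.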
By the position of the median value, doctors and Skychain can be compared with each other in general. The median value of both doctors correlates with their experience, and Skychain has a better median value than both experts.
By the size of the boxes, one can draw conclusions about which doctor or neural network is more stable in general. The blue box is lower than the orange box, and the green values are higher. This suggests that the predictions of Skychain by class are, on the whole, somewhat more stable and of better quality.
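The two quantities read off such a diagram, the median and the size of the box, can also be computed directly. The sketch below assumes the box spans the first to the third quartile (the interquartile range), which is the usual box plot convention but is an assumption about the diagram here. The `box_summary` name and the input values are hypothetical.

```python
from statistics import median, quantiles
from typing import Sequence, Tuple


def box_summary(values: Sequence[float]) -> Tuple[float, float]:
    """Return (median, interquartile range) for a list of kappa values."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


# Placeholder averaged kappa values (not results from the test).
med, iqr = box_summary([0.2, 0.4, 0.6, 0.8, 1.0])
print(med, round(iqr, 3))
```

A higher median means better agreement with the reference standard overall, and a smaller interquartile range means the agreement varies less from patient to patient.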
In general, we can conclude that Skychain has outperformed both experts in several classes and showed quite comparable results in others.
We were quite surprised to see the results of our work being competitive with real doctors. However, there is still much to do, since the network still makes mistakes in diagnosing several classes. We plan to use more data on the “weak” categories to show better results in our next test.
But what about the time?
Of course, as speed is one of the biggest advantages of AI, Skychain managed to do its analysis much faster than both experts did.
Time spent on the analysis:
- Pathologist with 1 year of practice: ~5.5 hours
- Pathologist with 5 years of practice: ~4.7 hours
- Skychain: ~0.75 hours
As you can see, Skychain finished the job roughly six to seven times faster than either pathologist.
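The ratio follows directly from the approximate times listed above. A small sketch (the `speedup` helper is a hypothetical name used for illustration):

```python
def speedup(human_hours: float, ai_hours: float) -> float:
    """How many times longer the human analysis took than the AI analysis."""
    return human_hours / ai_hours


# Approximate analysis times in hours, as reported above.
SKYCHAIN_HOURS = 0.75
pathologists = {"1 year of practice": 5.5, "5 years of practice": 4.7}

for name, hours in pathologists.items():
    print(f"Pathologist with {name}: {speedup(hours, SKYCHAIN_HOURS):.1f}x Skychain's time")
```

Because the reported times are approximate, the ratios (about 7.3 and 6.3) should be read as rough estimates.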
There was also one curious slide we would like to tell you more about.
For this tissue sample we received the following results:
- Pathologist with 1 year of practice
O: 30%; AT; N
- Pathologist with 5 years of practice
N
- Skychain
O: 1.29%; VHR; AT; N
- Reference standard
O: 1%; VHR; AT; N
As you can see, the pathologist with 1 year of practice found the cancer but estimated that it affected 30% of the tissue area, and did not find the signs of VHR (chronic inflammation).
The pathologist with 5 years of practice made a big mistake, labeling this tissue as completely normal.
As for Skychain, it found everything the reference standard contains. Moreover, since AI calculates area more precisely, it reported a cancer presence of 1.29% instead of the 1% estimated by the expert pathologist.
This particular case is quite demonstrative, because the cancer was missed by a specialist who has 5 years of experience (and quite a while in medical university before that). If a cancer like this is missed, within a year it can progress from Gleason 3 to Gleason 4 or even 5, making the prognosis much worse and lowering the patient’s chances of survival.
However, if this specialist had used Skychain, the cancer would not have been missed. The doctor would have received the highlighted slide, would have paid attention to the cancer presence, and would probably have saved the patient’s life.
Thanks for the support and stay tuned for future updates!
Alexander Oksanenko, Skychain Team
If you have any questions about Skychain, don’t hesitate to write to Alexander Oksanenko on Telegram or by email: firstname.lastname@example.org.